# DataLab Getting Started in Scala

The Bigstep DataLab is an open data exploration service for data science, analytics and technology experimentation. It is built on our SparkArray and DataLake services and on our flexible, high-performance bare-metal infrastructure.

This tutorial assumes some programming experience.

## Uploading Data

A private DataLake (HDFS service) stores the data that the SparkArray uses. To upload data to a Bigstep DataLake, you would typically:
1. Upload data to the home directory of the DataLake using commands like "-put".
2. Run commands like "-ls" to verify that the data was uploaded to the DataLake.

```
dl -ls /
16/09/26 17:18:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
drwxrwxrwx   - hdfs supergroup          0 2018-09-15 13:12 /data_lake/dl1234/baseball
drwxrwxrwt   - hdfs supergroup          0 2018-09-15 12:08 /data_lake/dl1234/tmp
drwxr-xr-x   - hdfs supergroup          0 2018-09-15 12:08 /data_lake/dl1234/tmp/user
```
You can also run the same commands on the master container.

Data can also be uploaded through the DataLake File Browser tab in our user interface.

In [None]:
/* Allow the use of shell operations */
import sys.process._

/* Download data locally from the internet */
"wget http://www.exploredata.net/ftp/MLB2008.csv".!

/* Copy the downloaded file to the Bigstep DataLake, using a path relative to the user's DataLake home directory */
"dl -put MLB2008.csv /".!

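The `!` method from `scala.sys.process` returns the exit code of the command, so a failed download or upload can go unnoticed in a notebook cell. Below is a minimal sketch of a wrapper that stops on failure. `runOrFail` is a hypothetical helper name used for illustration, not part of the DataLab tooling.

```scala
import scala.sys.process._

/* Hypothetical helper: run a shell command and fail loudly on a non-zero exit code */
def runOrFail(command: String): Unit = {
  /* `!` blocks until the command finishes and returns its exit code */
  val exitCode = command.!
  if (exitCode != 0) {
    throw new RuntimeException(s"Command failed with exit code $exitCode: $command")
  }
}
```

In the notebook this could be called as `runOrFail("dl -put MLB2008.csv /")`, followed by `runOrFail("dl -ls /")` to confirm that the file is listed.
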
## Initialize Spark Context

For all Spark functions to be available, a Spark context is initialized by default in the current notebook.

In [None]:
/* import a SparkSession */
import org.apache.spark.sql.SparkSession

/* The SparkSession is connected to the current Spark Master. Retrieve the current SparkSession using: */
spark

/* Retrieve the current SparkContext using: */
sc

## RDDs

A Resilient Distributed Dataset (RDD) is a collection of elements that is partitioned across multiple servers and can be processed in parallel. It lets the programmer abstract away the complexity of transforming large volumes of distributed data.
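
As a rough illustration of how this works, the sketch below builds a small RDD from an in-memory range and applies two transformations followed by two actions. The `session` and `context` names and the `local[*]` master are assumptions made so the snippet can run on its own. Inside the notebook, `getOrCreate()` should simply return the session that already exists, and the predefined `sc` can be used instead.

```scala
import org.apache.spark.sql.SparkSession

/* Assumption: a local master, which only matters when running outside the notebook */
val session = SparkSession.builder().master("local[*]").appName("rdd-sketch").getOrCreate()
val context = session.sparkContext

/* Distribute a small local collection across 2 partitions */
val numbers = context.parallelize(1 to 10, numSlices = 2)

/* Transformations are lazy: nothing is computed at this point */
val evenSquares = numbers.map(n => n * n).filter(_ % 2 == 0)

/* Actions trigger the computation and bring results back to the driver */
val total = evenSquares.sum()
val values = evenSquares.collect()
```

The same pattern (lazy transformations, then an action) is what the `filter`, `count` and `collect` calls in the cells below follow.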

In [None]:
"wget http://seanlahman.com/files/database/baseballdatabank-master_2016-03-02.zip".!

"apt-get install -y unzip".!
"unzip baseballdatabank-master_2016-03-02.zip".!
"rm -rf baseballdatabank-master_2016-03-02.zip".!

"dl -put baseballdatabank-master/core/AllstarFull.csv /".!

In [None]:
val textFile = sc.textFile("/AllstarFull.csv")

In [None]:
textFile.count()

In [None]:
textFile.first()

In [None]:
val linesWithRuth = textFile.filter( line => line.contains("ruth"))

In [None]:
linesWithRuth.count()

In [None]:
linesWithRuth.collect()

In [None]:
linesWithRuth.saveAsTextFile("/lines_with_ruth-file")

## DataFrames and SparkSQL

A DataFrame is a distributed collection of data organized into named columns. It can be registered as a temporary view in Spark SQL, which allows you to run SQL queries over its data. The sql function of the SparkSession lets applications run SQL queries programmatically and returns the result as a DataFrame.

Spark 2.3.0 has a built-in CSV reader:

In [None]:
"dl -chmod 740 /AllstarFull.csv".!

In [None]:
val allstar = spark.read.option("header", "true").csv("/AllstarFull.csv")
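
With only the `header` option set, the CSV reader treats every column as a string. The sketch below compares that with the reader's `inferSchema` option, which asks Spark to guess column types from the data. The temporary file and its two rows are hand-written stand-ins for `/AllstarFull.csv`, and the `local[*]` master is an assumption for running outside the notebook.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

/* Assumption: a local master, which only matters when running outside the notebook */
val session = SparkSession.builder().master("local[*]").appName("csv-sketch").getOrCreate()

/* Assumption: a small local file stands in for /AllstarFull.csv */
val csvPath = Files.createTempFile("allstar-sample", ".csv")
Files.write(csvPath, "playerID,yearID\nruthba01,1933\nruthba01,1934\n".getBytes(StandardCharsets.UTF_8))

/* Without inferSchema every column is read as a string */
val asStrings = session.read.option("header", "true").csv(csvPath.toUri.toString)

/* With inferSchema Spark makes an extra pass over the data to guess column types */
val typed = session.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvPath.toUri.toString)

asStrings.printSchema()
typed.printSchema()
```

Inferring the schema costs an additional read of the file, so for large inputs an explicit schema may be the better choice.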

In [None]:
allstar.take(10)

In [None]:
allstar

In [None]:
allstar.show()

In [None]:
/* Register the DataFrame as a temporary view named "allstar". */
allstar.createOrReplaceTempView("allstar")

/* SQL can be run over DataFrames that have been registered as a temporary view. */
val player = spark.sql("SELECT * FROM allstar WHERE playerID like '%ruth%' and yearID<1935")

In [None]:
player.show()
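
The same selection can also be written with the DataFrame API instead of an SQL string. The sketch below applies the predicate from the query above to a few hand-written sample rows. The rows, the `session` name and the `local[*]` master are assumptions for illustration. In the notebook, the filter could be applied to `allstar` directly after `import spark.implicits._`.

```scala
import org.apache.spark.sql.SparkSession

/* Assumption: a local master, which only matters when running outside the notebook */
val session = SparkSession.builder().master("local[*]").appName("dataframe-sketch").getOrCreate()
import session.implicits._

/* Hand-written sample rows that mimic three columns of AllstarFull.csv */
val sample = Seq(
  ("ruthba01", 1933, "NYA"),
  ("ruthba01", 1934, "NYA"),
  ("gehrilo01", 1933, "NYA"),
  ("dimagjo01", 1936, "NYA")
).toDF("playerID", "yearID", "teamID")

/* Same predicate as the SQL query: playerID like '%ruth%' and yearID < 1935 */
val ruthRows = sample.filter($"playerID".like("%ruth%") && $"yearID" < 1935)
ruthRows.show()
```

Both forms go through the same query optimizer, so the choice between them is mostly a matter of readability.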

## ParquetFiles

Parquet files are typically much faster to query and take up less space than CSV files, and the DataLake supports them as well. Because Spark is a distributed system, a Parquet dataset is made up of many fragments that the workers write independently. Spark SQL treats the collection of files as a single table.

To write the dataframe:

In [None]:
import java.util.Calendar

val now = Calendar.getInstance().getTimeInMillis()

val path = s"/allstar-${now}.parquet"
player.write.format("parquet").save(path)

"dl -ls /".!


Read the dataframe:

In [None]:
val dfParquet = spark.read.parquet(path)
dfParquet.createOrReplaceTempView("player")
spark.sql("SELECT playerID, yearID FROM player").show()
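
The fragment layout described in this section can be observed by writing a small DataFrame and listing the output directory. The sketch below assumes a standalone run in local mode, where a temporary local directory stands in for the DataLake path. The sample rows and the `session` name are likewise assumptions.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

/* Assumption: a local master, so that the driver and the workers share one filesystem */
val session = SparkSession.builder().master("local[*]").appName("parquet-sketch").getOrCreate()
import session.implicits._

/* Assumption: a local temporary directory stands in for the DataLake path */
val outputPath = Files.createTempDirectory("parquet-sketch").resolve("allstar.parquet")

/* Two partitions are requested, so up to two fragments are written */
val df = Seq(("ruthba01", 1933), ("ruthba01", 1934)).toDF("playerID", "yearID").repartition(2)
df.write.parquet(outputPath.toUri.toString)

/* The Parquet "file" is a directory that contains part-* fragments */
val fragments = outputPath.toFile.listFiles().map(_.getName).filter(_.startsWith("part-"))
fragments.foreach(println)

/* Reading the directory back yields a single table with the schema preserved */
val restored = session.read.parquet(outputPath.toUri.toString)
restored.printSchema()
```

Unlike CSV, the column names and types travel with the data, so no header or schema options are needed when reading it back.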

## Resources

[Apache Spark 2.3.0 Programming Guide](http://spark.apache.org/docs/latest/sql-programming-guide.html)