Using Scala Analyze Data In Spark

Scala is a programming language that incorporates object oriented and functional programming styles. It is one of the programming languages along Java and Python that can be used to develop Spark applications. Python and Scala shells that enable interactive analyze the data which is available. Scala is a comprehensive programming language that cannot be covered in a short tutorial like this. This tutorial will focus on explaining the basic structure of a Spark application and demonstrate to the reader how to quickly become productive using Spark. Some of concepts that will be briefly demonstrated are: declaring variables, creating resilient distributed datasets (RDD) and transforming RDDs.
The structure of a Scala program is highlighted below:

  • The first step is reading data from a Hadoop supported storage system such as local file system, Hbase, Cassandra, HDFS or S3 and creating a resilient distributed dataset (RDD).
  • After a RDD has been created you perform transformations and actions that create new datasets or return new values respectively.
Before beginning anything in Spark it is fundamental to have a good understanding of RDD. In Spark the fundamental abstraction of working with data is the RDD. An RDD is composed of elements automatically distributed through out the cluster from where parallel operations are performed. These data chunks which are stored on different nodes are called partitions. RDDs can be created by reading in data or transforming existing RDDs.
To demonstrate concepts in this course the Scala shell will be used for interactive analysis then at the end we will create a complete application and submit it to a cluster on Amazon EMR. We will install Spark on local machine. This is an introductory level tutorial so we will run examples on a single node using a small dataset. Download Scala from http://www.scala-lang.org/download/2.10.1.html. Move to the directory Spark was installed, untar, and move it to an installation directory. To create applications for Spark 1.6.1 you need to use Scala 2.10. If you intend to use a higher Scala version, download Spark source code and compile it with support for that version.
cd ~/Downloads
sudo tar xzvf scala-2.10.1.tgz
sudo mv  scala-2.10.1 usr/local/scala
Open bashrc in a text editor and add path to bin directory using this command sudo gedit ~/.bashrc
export PATH = $PATH:/usr/local/scala/scala-2.10.1 /bin
Reload bashrc by running source ~/.bashrc
Run scala -version to confirm installation was successful

Download Spark from http://www.apache.org/dyn/closer.lua/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz. Move to download directory, untar and move Spark to it’s installation folder.

 cd ~/Download
sudo tar xzvf  spark-1.6.1-bin-hadoop2.6.tgz
sudo mv   spark-1.6.1-bin-hadoop2.6 usr/local/spark
Open bashrc in a text editor and add path to Spark home and bin directory. Use this command sudo gedit ~/.bashrc
export SPARK_HOME=/usr/local/spark/spark-1.6.1-bin-hadoop2.6


Reload .bashrc by running source ~/.bashrc. We make spark-env.sh from the template provided by copying and renaming the template. This file contains settings that are referenced by spark. We set java version that will be used and the amount of memory.
cp /usr/local/spark/spark-1.6.1-bin-hadoop2.6/spark-env.sh.template /usr/local/spark/spark-1.6.1-bin-hadoop2.6/spark-env.sh
sudo gedit /usr/local/spark/spark-1.6.1-bin-hadoop2.6/spark-env.sh
Add the line below to spark-env.sh to point Spark to the correct java installation
export JAVA_HOME=/usr/lib/jvm/java-8-oracle 
We can test if our installation was sucessful by running spark-shell at command line.

If you correctly installed Spark you should get output similar to that above. If you get memory errors close some applications. To exit Scala shell use ctrl+D keys. We will use this shell to demonstrate basic concepts in Spark.
The first step when creating a spark application is to instantiate a SparkContext that instructs Spark on how to access a cluster. When using a shell this object is automatically created and availed as sc.
In Scala variables are defined using the keyword var or val, followed by variable name, full colon, data type, equals sign, finally the value. A variable declared with val keyword is immutable therefore it cannot be changed. A variable created with var is mutable so its value can be changed. Declaring data type is optional because scala can infer the type. A notable difference from other languages is that all variables must be initialized when they are created. Variables is a very broad area in Scala and to have a good understanding refer to language documentation. Examples of variable declaration are shown below.
var counter: Int = 45
var counter1 = 45
val user: String = “eduonix”
var passwd = “@Eduonix”

To read text files sparkContext object provides a textFile method that requires the full path to the file to be read. To load cars.csv file located in downloads use the command below
 val cars = sc.textFile("~/Downloads/cars.csv")

This will load the data into an object called cars from where we can perform operations.

If you get errors when accessing the cars object specify the absolute path as shown below
 val cars = sc.textFile("file:/home/eduonix/Downloads/cars.csv")
Reading data is not limited to text files. Queries written using basic SQL or HiveQL are run by Spark to read data from external sources. In Spark data processing revolves around transformations and operations. Transformations manipulate an existing dataset to return a new dataset. Actions operate on datasets to return values. For data processing Spark provides rdd and dataframe API. The dataframe is similar to tables found in relational databases and dataframe found in R and Python
We can print the first five lines of the data to inspect its structure

From the output above we observe that the data file contains header information which is not part of our data. We remove the header and reload the data again.
 cd ~/Downloads
sed -i 1d cars.csv
Alternatively you can use the substract() method in Spark to remove headers.
We can count the number of observations in our data file using the count method

To provide descriptive statistics on each column we use descriptive method.


Post a Comment