Using Spark on Amazon EMR for Processing Data

Apache Spark is a data processing framework developed to process very large amounts of data quickly. Much of the speed gain comes from Spark processing data in memory. This is a notable difference from MapReduce, which reads and writes data on disk between processing steps. Spark is not a replacement for MapReduce; rather, it provides a solution for running workloads that require low latency. Spark is a flexible framework that can run standalone, on Apache Mesos or on Hadoop YARN. You do not need any knowledge of Hadoop to use Spark. Spark can consume data from HDFS, HBase, Cassandra, Hive or any other Hadoop-compatible data source.

The Spark framework includes standard libraries for SQL, data streaming, machine learning and graph data processing. Spark SQL offers a way of querying data using SQL or HiveQL. Spark Streaming provides features for handling streaming data from sources such as log files, social media and messaging services like Kafka. Machine learning is a resource-intensive data science technique for which Spark is well suited. Techniques available in the machine learning library include classification, regression, collaborative filtering, clustering and dimensionality reduction. Applications in Spark can be written in R, Python, Scala or Java.

There are different ways of using the Amazon cloud for data processing. One way is to create EC2 instances and install the software you need yourself. This has been exhaustively discussed in the creating and installing Hadoop on four instances tutorial. Another approach is Elastic MapReduce (EMR) on Amazon. EMR provides a preconfigured Hadoop cluster that makes it easy and fast to run your data processing. In an EMR cluster, EC2 instances are used as virtual servers for the name node and data nodes, Amazon S3 is used for data storage and CloudWatch monitors cluster performance.

Each approach has advantages and disadvantages, so which one is better depends on the use case. EMR provides a very easy way to scale up and down, but it can be more expensive than running your own EC2 instances. Custom EC2 instances are more flexible when you intend to run software that is not supported on EMR, and trying to add unsupported functionality to EMR can be very frustrating. EMR clusters are better suited to short-lived data processing, while EC2 is more appropriate for long-running data processing.

To provision an EMR cluster, log in to the console at https://console.aws.amazon.com/elasticmapreduce and click Create cluster.

Give your cluster a name, set it to run in cluster mode and select the Spark application bundle, which includes both Hadoop and Spark.

Scroll down, select the instance type, the number of instances and the EC2 key pair, then click Create cluster.

Once the cluster is created you can inspect it. To increase or decrease the number of instances you just click on resize.

You can submit work to Spark using the console, the AWS CLI or the AWS SDK for Java. In this tutorial we will look at submitting work using the console. Before an application can be submitted it must first be compiled into a jar file. The jar file is then uploaded to Amazon S3, from where it can be deployed. To create a simple application we will use the Scala programming language. To develop applications for Spark 1.6.1 we need Scala 2.10. To install Scala you need at least Java 8 installed. Check this by running java -version at the terminal.

To quickly start developing Scala programs, download the Scala IDE from http://scala-ide.org/download/sdk.html
Move to the directory where it was downloaded and extract the archive. Root privileges are not needed to extract into your own Downloads directory.
cd ~/Downloads
tar xzvf scala-SDK-4.4.1-vfinal-2.11-linux.gtk.x86_64.tar.gz

The contents will be extracted into a directory named eclipse. Open the eclipse directory and start the IDE. This is what we will use to develop Scala programs. It offers support for mixed Java/Scala projects, error marking, code completion, syntax highlighting and debugging. When the IDE starts you will be prompted to specify a directory to store your projects.
To develop a Scala program you either create a Maven project or import an existing one. In this tutorial we will use the simpler approach of importing an existing project. Download the zipped Scala boilerplate project from GitHub at https://github.com/H4ml3t/spark-scala-maven-boilerplate-project. Extract it and import it into the IDE: from the Scala IDE click File, then Import, click Maven, select Existing Maven Projects and click Next.

Select the root directory and click Finish, as shown in the screenshot below.

Open pom.xml and edit the dependencies so that they reference the correct version of Spark, which is 1.6.1. Also set the Scala version, which is 2.10.

To demonstrate how a program is developed we will use the pi calculation example provided on the Spark website. Edit the MainExample.scala object found in /src/main/scala and add the lines below.
package org.apache.spark.examples

import scala.math.random
import org.apache.spark.{SparkConf, SparkContext}

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]): Unit = {
    // Spark 1.6.1 has no SparkSession (it was added in Spark 2.0),
    // so we create a SparkContext directly from a SparkConf.
    val conf = new SparkConf().setAppName("Spark Pi")
    val sc = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    // Sample random points in the square [-1, 1] x [-1, 1] and
    // count how many fall inside the unit circle.
    val count = sc.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    sc.stop()
  }
}
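The estimate in the listing above rests on a simple Monte Carlo argument: points are drawn uniformly from a square of side 2 (area 4), and the fraction that lands inside the unit circle (area pi) approaches pi / 4, so 4 * count / n approaches pi. The following is a minimal sketch of the same calculation in plain Scala with no Spark dependency, which can be handy for checking the logic locally before packaging a jar. The names LocalPi and estimatePi and the fixed seed are hypothetical choices made for this illustration and are not part of the Spark example.

```scala
import scala.util.Random

// Hypothetical local helper, not part of Spark or the boilerplate project.
object LocalPi {
  // Draw n points uniformly from the square [-1, 1] x [-1, 1] and
  // return 4 times the fraction that falls inside the unit circle.
  def estimatePi(n: Int, rng: Random): Double = {
    val inside = (1 to n).count { _ =>
      val x = rng.nextDouble() * 2 - 1
      val y = rng.nextDouble() * 2 - 1
      x * x + y * y < 1
    }
    4.0 * inside / n
  }

  def main(args: Array[String]): Unit = {
    // Number of samples: first argument if given, otherwise an assumed default.
    val n = if (args.length > 0) args(0).toInt else 200000
    // A fixed seed (an arbitrary choice) makes repeated runs reproducible.
    println("Pi is roughly " + estimatePi(n, new Random(42)))
  }
}
```

The Spark version performs the same sampling but splits the range of sample indices across slices, so each partition counts its own hits and reduce adds the partial counts together.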
Right-click the imported boilerplate project, then click Export to save the program and its dependencies as a jar file. This is the jar file that will be uploaded to our cluster for submission to Spark. Spark applications are submitted to any other Spark installation in the same way.
Scala is an object-oriented programming language that is used to develop Spark applications. The language website offers tutorials to help you start learning the language.
We can use scp to transfer our jar to the master node of our cluster. To get the master node's DNS name, log in to the console and go to the cluster details. The user name is hadoop, and we will use the .pem key we created when demonstrating how to install Hadoop on four EC2 instances.
cd ~/Downloads
scp -i eduonixhadooptutorial.pem eduonix.jar hadoop@ec2-52-40-142-107.us-west-2.compute.amazonaws.com:~/
In the commands above we specify the .pem key, which resides in the Downloads directory, and copy eduonix.jar to the hadoop user's home directory on the remote server identified by the indicated DNS name. The colon after the host name is required; without it scp makes a local copy instead of transferring the file. You need to substitute these values with the correct paths on your file system and replace the DNS name with the one that points to your master node.
Once eduonix.jar is on the master node we can submit it to our cluster. From the management console click Add step, select the step type, specify the path to the jar file and click Add.

