Setting up Hadoop on 4 Amazon Instances

In the first part of this tutorial we demonstrated how to provision a cluster of four instances on Amazon EC2 and how to connect to the instances using SSH. This second part picks up from there and explains how to set up a Hadoop cluster with two data nodes, one primary name node and one secondary name node. Log in to your Amazon management console and start your instances.


To start your cluster, select all instances, then click Actions and choose Start.

Confirm that the instance state of each instance is shown as running.

When connecting to a remote server with SSH we need to specify the path to our .pem key file, the user name and the public DNS. To avoid typing all of this every time, we can add it to the SSH client configuration file, ~/.ssh/config, and then connect with an easy-to-remember name that resolves to our instance. (Note that Host entries belong in the client-side ~/.ssh/config, not in the server's /etc/ssh/sshd_config.) Open the file in a text editor such as gedit and add the lines below.
gedit ~/.ssh/config
Host namenode1
  HostName ec2-52-40-18-74.us-west-2.compute.amazonaws.com
  User ubuntu
  IdentityFile ~/Downloads/eduonixhadooptutorial.pem
Host namenode2
  HostName ec2-52-32-85-4.us-west-2.compute.amazonaws.com
  User ubuntu
  IdentityFile ~/Downloads/eduonixhadooptutorial.pem
Host datanode1
  HostName ec2-52-37-140-215.us-west-2.compute.amazonaws.com
  User ubuntu
  IdentityFile ~/Downloads/eduonixhadooptutorial.pem
Host datanode2
  HostName ec2-52-40-57-130.us-west-2.compute.amazonaws.com
  User ubuntu
  IdentityFile ~/Downloads/eduonixhadooptutorial.pem
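The four Host blocks are repetitive, so as a convenience (this is a sketch, not part of the original tutorial) a short loop can generate them. It writes the entries to a local file named ssh_config_snippet — a name chosen here purely for illustration — so you can review them before appending to your SSH configuration. The key path and DNS names are the ones used in this tutorial; substitute your own.

```shell
# Generate the four Host entries into ./ssh_config_snippet for review.
# The key path and DNS names match the tutorial; substitute your own.
KEY="$HOME/Downloads/eduonixhadooptutorial.pem"
OUT=ssh_config_snippet
: > "$OUT"   # truncate/create the output file
while read -r host dns; do
  printf 'Host %s\n  HostName %s\n  User ubuntu\n  IdentityFile %s\n\n' \
    "$host" "$dns" "$KEY" >> "$OUT"
done <<'EOF'
namenode1 ec2-52-40-18-74.us-west-2.compute.amazonaws.com
namenode2 ec2-52-32-85-4.us-west-2.compute.amazonaws.com
datanode1 ec2-52-37-140-215.us-west-2.compute.amazonaws.com
datanode2 ec2-52-40-57-130.us-west-2.compute.amazonaws.com
EOF
```

After checking the generated file, append it to your client configuration with cat ssh_config_snippet >> ~/.ssh/config.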

Save the file. Host entries in the client configuration take effect on the next connection, so no restart is needed. If you have also changed the SSH server configuration (/etc/ssh/sshd_config), restart the service with sudo systemctl restart ssh; on older Ubuntu releases that use Upstart the equivalent is sudo restart ssh.
In a Hadoop cluster the name node needs to communicate with the data nodes. To enable this we set up passwordless SSH between the nodes. First we upload the .pem key file from the local machine to the primary name node. To copy files to remote servers we use the scp utility. Run scp at the terminal to check that it is installed; you should see its usage message.

To copy a file using scp you specify the path to the .pem key, the file to be copied, and the user and public DNS of the remote server followed by a colon and the destination path.
scp -i ~/Downloads/eduonixhadooptutorial.pem ~/Downloads/eduonixhadooptutorial.pem ubuntu@ec2-52-40-18-74.us-west-2.compute.amazonaws.com:~

After the .pem key has been uploaded, SSH into the primary name node so we can create an authorization key and add it to authorized_keys on the name nodes and the data nodes.
ssh -i ~/Downloads/eduonixhadooptutorial.pem ubuntu@ec2-52-40-18-74.us-west-2.compute.amazonaws.com
Alternatively, if you added the Host entries to your SSH configuration file, you can simply run ssh namenode1.

Files are uploaded to the working directory on the remote server. Check the working directory with pwd and list its contents with ls to confirm the .pem key file was uploaded correctly.

Move the private key to the .ssh directory and rename it id_rsa so that SSH uses it by default.
mv eduonixhadooptutorial.pem ~/.ssh/id_rsa

To derive the public key from the private key and add it to the list of authorized keys, use the commands below. Note that ssh-keygen -y prints the public half of an existing private key; running ssh-keygen -t rsa against this path instead would generate a brand new key pair and overwrite the key we just uploaded.
ssh-keygen -y -f ~/.ssh/id_rsa > ~/.ssh/id_rsa.pub
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
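If the difference between generating a key pair and deriving a public key is unfamiliar, here is a small self-contained experiment on a throwaway key (the temporary paths are illustrative): ssh-keygen -t rsa creates a new pair, while ssh-keygen -y recomputes the public half from an existing private key.

```shell
# Throwaway demo: generate a key pair, then re-derive its public key
# with `ssh-keygen -y` and confirm the two match.
TMP=$(mktemp -d)
ssh-keygen -q -t rsa -N "" -f "$TMP/id_rsa"            # new key pair
ssh-keygen -y -f "$TMP/id_rsa" > "$TMP/derived.pub"    # public key from private
# Compare the key type and base64 blob (the comment field may differ).
awk '{print $1, $2}' "$TMP/id_rsa.pub"  > "$TMP/a"
awk '{print $1, $2}' "$TMP/derived.pub" > "$TMP/b"
cmp -s "$TMP/a" "$TMP/b" && echo "derived key matches"
```

This is why -y is the right flag on the name node: it leaves the uploaded private key intact.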

The public key we created above has to be added to authorized_keys on each of the other nodes in the cluster. Because ~/.ssh/id_rsa now authenticates us to every node, we can pipe the key over SSH directly, once per remaining node:

cat ~/.ssh/id_rsa.pub | ssh ubuntu@ec2-52-32-85-4.us-west-2.compute.amazonaws.com 'cat >> ~/.ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh ubuntu@ec2-52-37-140-215.us-west-2.compute.amazonaws.com 'cat >> ~/.ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh ubuntu@ec2-52-40-57-130.us-west-2.compute.amazonaws.com 'cat >> ~/.ssh/authorized_keys'
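One copy command is needed per remaining node, and they differ only in the target DNS, so a loop can produce them. The sketch below is a dry run: it only writes each command to cmds_to_run.txt (a file name chosen here for review) rather than executing anything. Once you are happy with the output, run the commands, or pipe the file to sh.

```shell
# Dry run: write one authorized_keys copy command per remaining node.
# DNS names are the tutorial's; replace them with your own instances'.
OUT=cmds_to_run.txt
: > "$OUT"
for dns in \
  ec2-52-32-85-4.us-west-2.compute.amazonaws.com \
  ec2-52-37-140-215.us-west-2.compute.amazonaws.com \
  ec2-52-40-57-130.us-west-2.compute.amazonaws.com
do
  echo "cat ~/.ssh/id_rsa.pub | ssh ubuntu@$dns 'cat >> ~/.ssh/authorized_keys'" >> "$OUT"
done
cat "$OUT"
```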
After setting up SSH so that the nodes can communicate, we can begin installing Hadoop. Open four terminals and connect to one instance in each. Java is a prerequisite for Hadoop, so we begin by installing it on every node. Run the commands below in each terminal, then run java -version to check that Java was installed properly.
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer 

After Java has been set up, download Hadoop to the ~/Downloads directory, extract it and move it to its installation directory.

wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz -P ~/Downloads

sudo tar zxvf ~/Downloads/hadoop-* -C /usr/local

Move extracted content to installation folder using the command below.
sudo mv /usr/local/hadoop-* /usr/local/hadoop
Open .bashrc by running vi ~/.bashrc and add the lines below. Editing files with vim can be tricky at first: press i or a to insert text, move the cursor with the arrow keys, type :q! to exit without saving, and press Escape followed by ZZ to save and exit.
export JAVA_HOME=/usr
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

Reload .bashrc by running source ~/.bashrc

Type hadoop at the terminal to confirm the Hadoop commands are available; you should see the hadoop usage message.

Some configuration applies only to the name node, while other settings apply to every node in the cluster. We begin with the settings applied to all nodes. Edit hadoop-env.sh on each node and set the path to the correct Java installation.
sudo vim $HADOOP_CONF_DIR/hadoop-env.sh
export JAVA_HOME=/usr

Next we edit core-site.xml. Here we declare the default Hadoop file system, using the public DNS of the name node instead of localhost. Substitute the value with the public DNS that points to your instance (9000 is the conventional HDFS port).
sudo vim $HADOOP_CONF_DIR/core-site.xml
Add the property below between the <configuration> tags:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://ec2-52-40-18-74.us-west-2.compute.amazonaws.com:9000</value>
</property>
The other file to be edited is yarn-site.xml. Add the properties below between the <configuration> tags, replacing the value of yarn.resourcemanager.hostname with the public DNS that points to your name node. The mapreduce_shuffle auxiliary service is required for MapReduce jobs to run on YARN.
sudo vim $HADOOP_CONF_DIR/yarn-site.xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>ec2-52-40-18-74.us-west-2.compute.amazonaws.com</value>
</property>
The last file to be edited is mapred-site.xml. Only a template is provided, so we first copy the template to create the file, then set the MapReduce framework to YARN.
sudo cp $HADOOP_CONF_DIR/mapred-site.xml.template $HADOOP_CONF_DIR/mapred-site.xml
sudo vim $HADOOP_CONF_DIR/mapred-site.xml
Add the property below between the <configuration> tags:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
Once we are done with the common configuration on all nodes we can move on and add the configuration that is specific to the name node. First we add an entry for each node to /etc/hosts; each line of /etc/hosts maps an IP address to a host name, so use each node's IP address here rather than its DNS name. The host name is printed by running hostname at the terminal on each node; it is also the first part of the node's private DNS (the part composed of numbers, e.g. ip-172-31-0-1). Open the file in a text editor and add a line for each node.
sudo vim /etc/hosts
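As a sketch (not part of the original tutorial), the /etc/hosts lines can be assembled from a list into a local file for review before touching the system file. The private IPs and host names below are made-up placeholders; use your instances' actual addresses and host names.

```shell
# Build candidate /etc/hosts lines into ./hosts_snippet for review.
# The IPs and names are placeholders, not the tutorial's real instances.
OUT=hosts_snippet
: > "$OUT"
while read -r ip name; do
  printf '%s %s\n' "$ip" "$name" >> "$OUT"
done <<'EOF'
172.31.10.1 ip-172-31-10-1
172.31.10.2 ip-172-31-10-2
172.31.10.3 ip-172-31-10-3
172.31.10.4 ip-172-31-10-4
EOF
cat "$OUT"
```

After reviewing, append the lines with sudo tee -a /etc/hosts < hosts_snippet.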

Next we edit hdfs-site.xml to set the replication factor and to specify the directory where the name node stores its metadata. With two data nodes in this cluster, a replication factor of 2 is the highest that can actually be satisfied. First create the storage directory on the name node.
sudo mkdir -p $HADOOP_HOME/hadoop_data/hdfs/namenode
Then open hdfs-site.xml and add the properties below between the <configuration> tags.
sudo vim $HADOOP_CONF_DIR/hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>

We need to create a masters file in the Hadoop conf directory and add the host name of the secondary name node; this is the file start-dfs.sh reads to decide where to run the secondary name node.
sudo touch $HADOOP_CONF_DIR/masters
sudo vim $HADOOP_CONF_DIR/masters
Then edit the slaves file in the same directory and add the host names of the data nodes.
sudo vim $HADOOP_CONF_DIR/slaves
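The two files can also be written non-interactively. The sketch below uses a temporary directory as a stand-in for $HADOOP_CONF_DIR, and the host names are placeholders for whatever hostname prints on your nodes; adjust both before using it for real.

```shell
# Sketch: write masters and slaves files non-interactively. The host
# names are placeholders; use the output of `hostname` on your nodes.
CONF=$(mktemp -d)                        # stand-in for $HADOOP_CONF_DIR
echo 'ip-172-31-10-2' > "$CONF/masters"  # secondary name node
printf '%s\n' ip-172-31-10-3 ip-172-31-10-4 > "$CONF/slaves"  # data nodes
cat "$CONF/masters" "$CONF/slaves"
```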
At this point the name node configuration is complete. The last thing we need to do is change ownership of the $HADOOP_HOME directory to the ubuntu user.
sudo chown -R ubuntu $HADOOP_HOME
Before starting our cluster we need to add a directory on each data node that will be used to store blocks. Create the directory, then add its path to hdfs-site.xml on each data node, between the <configuration> tags.
sudo mkdir -p $HADOOP_HOME/hadoop_data/hdfs/datanode
sudo vim $HADOOP_CONF_DIR/hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
</property>
Finally, change the ownership of $HADOOP_HOME on each of the data nodes to the ubuntu user.
sudo chown -R ubuntu $HADOOP_HOME 
From the name node, format the HDFS file system and then start the cluster.
hdfs namenode -format
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh

