Apache Hive External Tables


External tables

  • The data already resides in HDFS; the external table is created on top of that existing HDFS data.
  • This approach is often described as "schema on data": the schema is applied to data that is already present.
  • Dropping the table removes only the schema; the data remains in HDFS exactly as before.
  • External tables make it possible to define multiple schemas over the same data stored in HDFS, instead of deleting and reloading the data every time the schema changes.

Apache Hive Complex Data Types (Collections)


Complex Data Types
  1. arrays: ARRAY (an ordered collection of elements of the same type)
  2. maps: MAP (a collection of key/value pairs)
  3. structs: STRUCT (a record with named fields, each of which may have a different type)

How to test Python MapReduce Jobs in Hadoop


Example: count the number of occurrences of each word in a text file (word count)
1) Create Python scripts mapper.py & reducer.py 
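
The contents of the two scripts are not reproduced in these notes; a minimal sketch that matches the local test output below (the standard Hadoop Streaming word-count pattern; the exact script contents are assumed) could look like this:

mapper.py (sketch):

#!/usr/bin/env python
import sys

# Emit "word<TAB>1" for every word read from standard input.
for line in sys.stdin:
    for word in line.strip().split():
        sys.stdout.write("%s\t%d\n" % (word, 1))

reducer.py (sketch):

#!/usr/bin/env python
import sys

# Input arrives sorted by word (sort -k1,1 locally, the shuffle/sort phase on the
# cluster), so counts for the same word are adjacent and can be summed in one pass.
current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            sys.stdout.write("%s\t%d\n" % (current_word, current_count))
        current_word = word
        current_count = count

# Emit the count for the last word.
if current_word is not None:
    sys.stdout.write("%s\t%d\n" % (current_word, current_count))

In the local tests that follow, sort -k1,1 stands in for the cluster's shuffle-and-sort phase, which is what guarantees the reducer sees all counts for a given word together.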

2) Test mapper.py and reducer.py scripts locally before using them in a  MapReduce job.  

Test1:
[cloudera@quickstart training]$ echo "abc xyz abc abc abc xyz pqr" | python /home/cloudera/training/wordcount-python/mapper.py |sort -k1,1 | python /home/cloudera/training/wordcount-python/reducer.py
abc      4
pqr      1
xyz      2

Test2:
[cloudera@quickstart training]$ cat wordcount.txt | python /home/cloudera/training/wordcount-python/mapper.py |sort -k1,1 | python /home/cloudera/training/wordcount-python/reducer.py
are     3
how     2
is      1
welcome 1
where   1
you     4

3) Create a 'wordcountinput' directory in HDFS, then copy wordcount.txt into it.
[cloudera@quickstart training]$ hdfs dfs -mkdir /wordcountinput
[cloudera@quickstart training]$ hdfs dfs -put wordcount.txt /wordcountinput

4) Execute the MapReduce job using the Hadoop Streaming jar file.
Location: /usr/lib/hadoop-0.20-mapreduce/contrib/streaming

[cloudera@quickstart training]$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.6.0-mr1-cdh5.12.0.jar -Dmapred.reduce.tasks=1 -file /home/cloudera/training/wordcount-python/mapper.py /home/cloudera/training/wordcount-python/reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input /wordcountinput/wordcount.txt -output /wordcountoutput

5) Check the output 
[cloudera@quickstart training]$ hdfs dfs -ls /wordcountoutput
Found 2 items
-rw-r--r--   1 cloudera supergroup          0 2018-05-24 00:40 /wordcountoutput/_SUCCESS
-rw-r--r--   1 cloudera supergroup         41 2018-05-24 00:40 /wordcountoutput/part-00000

[cloudera@quickstart training]$ hdfs dfs -cat /wordcountoutput/part-00000
are     3
how     2
is      1
welcome 1
where   1
you     4



How to test Java MapReduce Jobs in Hadoop


Example-1 (wordcount)

Developer activities:
Step1: Develop the MapReduce code
Step2: Unit test the MapReduce code using the MRUnit framework
Step3: Create a jar file for the MapReduce code

Testing activities:
Step1: Create a new directory in HDFS, then copy the data file from the local file system to that HDFS directory
[cloudera@quickstart training]$ hdfs dfs -mkdir /mapreduceinput
[cloudera@quickstart training]$ hdfs dfs -put wordcount.txt /mapreduceinput
Step2: Run the jar file, providing the data file as input
[cloudera@quickstart training]$ hadoop jar wordcount.jar WordCount /mapreduceinput/wordcount.txt /mapreduceoutput/
Step3: Check the output files created on HDFS.
[cloudera@quickstart training]$ hdfs dfs -ls /mapreduceoutput
Found 2 items
-rw-r--r--   1 cloudera supergroup          0 2018-05-23 22:00 /mapreduceoutput/_SUCCESS
-rw-r--r--   1 cloudera supergroup         41 2018-05-23 22:00 /mapreduceoutput/part-00000
[cloudera@quickstart training]$ hdfs dfs -cat /mapreduceoutput/part-00000
are     3
how     2
is      1
welcome 1
where   1
you     4


Example-2 (Find the Number of Products Sold in Each Country)

Step1: Create a new directory in HDFS, then copy the data file from the local file system to that HDFS directory
[cloudera@quickstart training]$ hdfs dfs -mkdir /productsalesinput
[cloudera@quickstart training]$ hdfs dfs -put SalesJan2009.csv /productsalesinput

Step2: Run the jar file, providing the data file as input
[cloudera@quickstart training]$ hadoop jar ProductSalesperCountry.jar SalesCountry.SalesCountryDriver /productsalesinput/SalesJan2009.csv /productsalesoutput

Step3: Check the output files created on HDFS.
[cloudera@quickstart training]$ hdfs dfs -ls /productsalesoutput
Found 2 items
-rw-r--r--   1 cloudera supergroup          0 2018-05-23 23:52 /productsalesoutput/_SUCCESS
-rw-r--r--   1 cloudera supergroup        661 2018-05-23 23:52 /productsalesoutput/part-00000
[cloudera@quickstart training]$ hdfs dfs -cat /productsalesoutput/part-00000

Example-3 (MapReduce Join – Multiple Input Files)

Step1: Create a new directory in HDFS, then copy the data files from the local file system to that HDFS directory
[cloudera@quickstart training]$ hdfs dfs -mkdir /multipleinputs
[cloudera@quickstart training]$ hdfs dfs -put customer.txt /multipleinputs
[cloudera@quickstart training]$ hdfs dfs -put delivery.txt /multipleinputs

Step2: Run the jar file, providing the data files as input
[cloudera@quickstart training]$ hadoop jar MultipleInput.jar /multipleinputs/customer.txt /multipleinputs/delivery.txt /multipleoutput
Step3: Check the output files created on HDFS.
[cloudera@quickstart training]$ hdfs dfs -ls /multipleoutput
Found 2 items
-rw-r--r--   1 cloudera supergroup          0 2018-05-26 23:14 /multipleoutput/_SUCCESS
-rw-r--r--   1 cloudera supergroup         22 2018-05-26 23:14 /multipleoutput/part-r-00000
[cloudera@quickstart training]$ hdfs dfs -cat /multipleoutput/part-r-00000
mani    0
vijay   1
ravi    1

MRUnit test case for wordcount example

Prerequisites
Download the MRUnit jar from the Apache release repository: https://repository.apache.org/content/repositories/releases/org/apache/mrunit/mrunit/
(this walkthrough uses mrunit-0.5.0-incubating.jar)

Maven pom.xml dependency (if you are using a Maven project):

<dependency>
    <groupId>org.apache.mrunit</groupId>
    <artifactId>mrunit</artifactId>
    <version>0.9.0-incubating</version>
    <classifier>hadoop1</classifier>
</dependency>


Step1: Create a new Java project in Eclipse, then add the JUnit library.

Step2: Add the external jars required to run the JUnit test case:
/usr/lib/hadoop
/usr/lib/hadoop-0.20-mapreduce
/home/cloudera/training/MRUnit/mrunit-0.5.0-incubating.jar
In addition, add wordcount.jar (its WordMapper and SumReducer classes are used in the JUnit test case).

Step3: Create the JUnit test case

Word count MRUnit test case:
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.MapDriver;
import org.apache.hadoop.mrunit.MapReduceDriver;
import org.apache.hadoop.mrunit.ReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class TestWordCount {

    // Drivers that feed test input to the mapper, the reducer, and the full map-reduce flow
    MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> mapReduceDriver;
    MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
    ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;

    @Before
    public void setUp() {
        // WordMapper and SumReducer are the classes under test, taken from wordcount.jar
        WordMapper mapper = new WordMapper();
        SumReducer reducer = new SumReducer();
        mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
        mapDriver.setMapper(mapper);
        reduceDriver = new ReduceDriver<Text, IntWritable, Text, IntWritable>();
        reduceDriver.setReducer(reducer);
        mapReduceDriver = new MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable>();
        mapReduceDriver.setMapper(mapper);
        mapReduceDriver.setReducer(reducer);
    }

    @Test
    public void testMapper() {
        // The mapper should emit (word, 1) for every word in the input line
        mapDriver.withInput(new LongWritable(1), new Text("cat cat dog"));
        mapDriver.withOutput(new Text("cat"), new IntWritable(1));
        mapDriver.withOutput(new Text("cat"), new IntWritable(1));
        mapDriver.withOutput(new Text("dog"), new IntWritable(1));
        mapDriver.runTest();
    }

    @Test
    public void testReducer() {
        // The reducer should sum the counts for a single key
        List<IntWritable> values = new ArrayList<IntWritable>();
        values.add(new IntWritable(1));
        values.add(new IntWritable(1));
        reduceDriver.withInput(new Text("cat"), values);
        reduceDriver.withOutput(new Text("cat"), new IntWritable(2));
        reduceDriver.runTest();
    }

    @Test
    public void testMapReduce() {
        // End-to-end check: map, shuffle/sort, and reduce together
        mapReduceDriver.withInput(new LongWritable(1), new Text("cat cat dog"));
        mapReduceDriver.addOutput(new Text("cat"), new IntWritable(2));
        mapReduceDriver.addOutput(new Text("dog"), new IntWritable(1));
        mapReduceDriver.runTest();
    }

}

Step4: Run the JUnit test case

Step5: Verify that all test cases pass.


Hadoop HDFS Commands


HDFS Commands
  • jps
Command to list the Hadoop daemon processes running on the node (jps is a JDK utility that lists Java processes, not an HDFS command).
[root@quickstart Desktop]# jps

  • fsck
HDFS Command to check the health of the Hadoop file system.
[cloudera@quickstart training]$ hdfs fsck /

  • ls
HDFS Command to display the list of Files and Directories in HDFS.
[cloudera@quickstart training]$ hdfs dfs -ls /

  • mkdir
HDFS Command to create the directory in HDFS.
[cloudera@quickstart training]$ hdfs dfs -mkdir /bigdatatesting
[cloudera@quickstart training]$ hdfs dfs -ls /
drwxr-xr-x   - cloudera supergroup          0 2018-05-23 00:46 /bigdatatesting
Note: Here we are trying to create a directory named "bigdatatesting" in HDFS.

  • touchz
HDFS Command to create a file in HDFS with file size 0 bytes.
[cloudera@quickstart training]$ hdfs dfs -touchz /bigdatatesting/test.dat
[cloudera@quickstart training]$ hdfs dfs -ls /bigdatatesting/
Found 1 items
-rw-r--r--   1 cloudera supergroup          0 2018-05-23 00:48 /bigdatatesting/test.dat

Note: Here we are trying to create a file named "test.dat" in the "bigdatatesting" directory of HDFS with a file size of 0 bytes.

  • du
HDFS Command to check the file size. 
[cloudera@quickstart training]$ hdfs dfs -du -s /bigdatatesting/test.dat
0  0  /bigdatatesting/test.dat

  • appendToFile
Appends content from the local file system (or from stdin, when "-" is given, as in the example below) to the given destination file on HDFS. The destination file will be created if it does not exist.
[cloudera@quickstart training]$ hdfs dfs -appendToFile - /bigdatatesting/test.dat

  • cat
HDFS Command that reads a file on HDFS and prints the content of that file to the standard output.
 [cloudera@quickstart training]$ hdfs dfs -cat /bigdatatesting/test.dat

  • copyFromLocal
HDFS Command to copy the file from a Local file system to HDFS.
Step1: Create a file in Local File System.
[cloudera@quickstart training]$ cat>> test1.dat
[cloudera@quickstart training]$ ls test1.dat
test1.dat
Step2: Copy file from Local File system to HDFS
[cloudera@quickstart training]$ hdfs dfs -copyFromLocal test1.dat /bigdatatesting/
Note: Here test1.dat is the file present in the local directory /home/cloudera/training, and after the command gets executed the test1.dat file will be copied to the /bigdatatesting directory of HDFS.

  • copyToLocal
HDFS Command to copy the file from HDFS to Local File System.
Step1: Check whether test.dat is present in the local file system.
[cloudera@quickstart training]$ ls test.dat
ls: cannot access test.dat: No such file or directory
Step2: Copy test.dat file from HDFS to local file system. 
[cloudera@quickstart training]$ hdfs dfs -copyToLocal /bigdatatesting/test.dat /home/cloudera/training
Step3: Check again whether test.dat is present in the local file system.
[cloudera@quickstart training]$ ls test.dat
test.dat
Note: Here test.dat is a file present in the bigdatatesting directory of HDFS, and after the command gets executed the test.dat file will be copied to the local directory /home/cloudera/training.

  • put
HDFS Command to copy single source or multiple sources from local file system to the destination file system.
Step1: Create a file in Local File System.
[cloudera@quickstart training]$ cat>> test2.dat
[cloudera@quickstart training]$ ls test2.dat
test2.dat
Step2: Copy file from Local File system to HDFS
[cloudera@quickstart training]$ hdfs dfs -put test2.dat /bigdatatesting/
Note: Here test2.dat is the file present in the local directory /home/cloudera/training, and after the command gets executed the test2.dat file will be copied to the /bigdatatesting directory of HDFS.
Note: The put command is similar to the copyFromLocal command.
  • get
HDFS Command to copy files from HDFS to the local file system.
Step1: Create a new file test3.dat on HDFS.
[cloudera@quickstart training]$ hdfs dfs -touchz /bigdatatesting/test3.dat

Step2: Copy test3.dat file from HDFS to local file system. 
[cloudera@quickstart training]$ hdfs dfs -get /bigdatatesting/test3.dat /home/cloudera/training

Step3: Check again whether test3.dat is present in the local file system.
[cloudera@quickstart training]$ ls test3.dat
test3.dat

Note1: Here test3.dat is a file present in the bigdatatesting directory of HDFS, and after the command gets executed the test3.dat file will be copied to the local directory /home/cloudera/training.

Note2: The get command is similar to the copyToLocal command.

  • cp
HDFS Command to copy files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.
[cloudera@quickstart training]$ hdfs dfs -mkdir /hadooptesting/
[cloudera@quickstart training]$ hdfs dfs -cp /bigdatatesting/test.dat /hadooptesting

  • mv
HDFS Command to move files from source to destination. This command allows multiple sources as well, in which case the destination needs to be a directory.
[cloudera@quickstart training]$ hdfs dfs -mv /bigdatatesting/test1.dat /hadooptesting/

  • rm
HDFS Command to remove the file from HDFS.
[cloudera@quickstart training]$ hdfs dfs -rm /bigdatatesting/test2.dat
Deleted /bigdatatesting/test2.dat

  • rm -r
HDFS Command to remove the entire directory and all of its content from HDFS.
[cloudera@quickstart training]$ hdfs dfs -rm -r /hadooptesting
Deleted /hadooptesting

  • rmdir
HDFS Command to remove the directory if it is empty.
[cloudera@quickstart training]$ hdfs dfs -rmdir /bigdatatesting

  • usage
HDFS Command that returns the help for an individual command.
[cloudera@quickstart training]$ hdfs dfs -usage mkdir
Note: The usage command gives usage information for any HDFS command.

  • help
HDFS Command that displays help for given command or all commands if none is specified.
[cloudera@quickstart training]$ hdfs dfs -help

