- Data is already available in HDFS, and the external table is created on top of that HDFS data.
- We can call this approach “schema on data”.
- When the table is dropped, only the schema is dropped; the data remains available in HDFS as before.
- External tables provide an option to create multiple schemas for the data stored in HDFS, instead of deleting the data every time the schema is updated.
Apache Hive External Tables
Apache Hive Complex Data Types (Collections)
Complex Data Types
- arrays: ARRAY
- maps: MAP
- structs: STRUCT
How to test Python MapReduce Jobs in Hadoop
Example: Count the number of words in a text file (word count)
1) Create the Python scripts mapper.py and reducer.py (a minimal sketch of both scripts is shown below).
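The post does not include the contents of mapper.py and reducer.py, so the exact scripts may differ; a minimal word-count sketch in the usual Hadoop Streaming style (the mapper emits tab-separated "word<TAB>1" pairs, and the reducer sums counts for input that is already sorted by key, which is why the local tests below pipe through sort -k1,1) could look like this:

mapper.py:
#!/usr/bin/env python
# Minimal sketch: read lines from stdin, emit one "word<TAB>1" pair per word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        sys.stdout.write('%s\t%s\n' % (word, 1))

reducer.py:
#!/usr/bin/env python
# Minimal sketch: sum the counts per word; assumes input is sorted by word
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, count = line.split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            sys.stdout.write('%s\t%d\n' % (current_word, current_count))
        current_word = word
        current_count = int(count)

if current_word is not None:
    sys.stdout.write('%s\t%d\n' % (current_word, current_count))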
2) Test mapper.py and reducer.py locally before using them in a MapReduce job.
Test1:
[cloudera@quickstart training]$ echo "abc xyz abc abc abc xyz pqr" | python /home/cloudera/training/wordcount-python/mapper.py | sort -k1,1 | python /home/cloudera/training/wordcount-python/reducer.py
abc 4
pqr 1
xyz 2
Test2:
[cloudera@quickstart training]$ cat wordcount.txt | python /home/cloudera/training/wordcount-python/mapper.py | sort -k1,1 | python /home/cloudera/training/wordcount-python/reducer.py
are 3
how 2
is 1
welcome 1
where 1
you 4
3) Create a ‘wordcountinput’ directory in HDFS, then copy wordcount.txt to HDFS.
[cloudera@quickstart training]$ hdfs dfs -mkdir /wordcountinput
[cloudera@quickstart training]$ hdfs dfs -put wordcount.txt /wordcountinput
4) Execute the MapReduce job using the Hadoop streaming jar file (location: /usr/lib/hadoop-0.20-mapreduce/contrib/streaming).
[cloudera@quickstart training]$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.6.0-mr1-cdh5.12.0.jar \
  -Dmapred.reduce.tasks=1 \
  -file /home/cloudera/training/wordcount-python/mapper.py \
  -file /home/cloudera/training/wordcount-python/reducer.py \
  -mapper "python mapper.py" -reducer "python reducer.py" \
  -input /wordcountinput/wordcount.txt -output /wordcountoutput
5) Check the output.
[cloudera@quickstart training]$ hdfs dfs -ls /wordcountoutput
Found 2 items
-rw-r--r-- 1 cloudera supergroup 0 2018-05-24 00:40 /wordcountoutput/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 41 2018-05-24 00:40 /wordcountoutput/part-00000
[cloudera@quickstart training]$ hdfs dfs -cat /wordcountoutput/part-00000
are 3
how 2
is 1
welcome 1
where 1
you 4
How to test Java MapReduce Jobs in Hadoop
Example-1 (wordcount)
Developer activities:
Step1: Develop the MapReduce code
Step2: Unit-test the MapReduce code using the MRUnit framework
Step3: Create a jar file for the MapReduce code
Testing activities:
Step1: Create a new directory in HDFS, then copy the data file from the local file system to the HDFS directory.
[cloudera@quickstart training]$ hdfs dfs -mkdir /mapreduceinput
[cloudera@quickstart training]$ hdfs dfs -put wordcount.txt /mapreduceinput
Step2: Run the jar file, providing the data file as input.
[cloudera@quickstart training]$ hadoop jar wordcount.jar WordCount /mapreduceinput/wordcount.txt /mapreduceoutput/
Step3: Check the output files created on HDFS.
[cloudera@quickstart training]$ hdfs dfs -ls /mapreduceoutput
Found 2 items
-rw-r--r-- 1 cloudera supergroup 0 2018-05-23 22:00 /mapreduceoutput/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 41 2018-05-23 22:00 /mapreduceoutput/part-00000
[cloudera@quickstart training]$ hdfs dfs -cat /mapreduceoutput/part-00000
are 3
how 2
is 1
welcome 1
where 1
you 4
Example-2 (Find the Number of Products Sold in Each Country)
Step1: Create a new directory in HDFS, then copy the data file from the local file system to the HDFS directory.
[cloudera@quickstart training]$ hdfs dfs -mkdir /productsalesinput
[cloudera@quickstart training]$ hdfs dfs -put SalesJan2009.csv /productsalesinput
Step2: Run the jar file, providing the data file as input.
[cloudera@quickstart training]$ hadoop jar ProductSalesperCountry.jar SalesCountry.SalesCountryDriver /productsalesinput/SalesJan2009.csv /productsalesoutput
Step3: Check the output files created on HDFS.
[cloudera@quickstart training]$ hdfs dfs -ls /productsalesoutput
Found 2 items
-rw-r--r-- 1 cloudera supergroup 0 2018-05-23 23:52 /productsalesoutput/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 661 2018-05-23 23:52 /productsalesoutput/part-00000
[cloudera@quickstart training]$ hdfs dfs -cat /productsalesoutput/part-00000
Example-3 (MapReduce Join – Multiple Input Files)
Step1: Create a new directory in HDFS, then copy the data files from the local file system to the HDFS directory.
[cloudera@quickstart training]$ hdfs dfs -mkdir /multipleinputs
[cloudera@quickstart training]$ hdfs dfs -put customer.txt /multipleinputs
[cloudera@quickstart training]$ hdfs dfs -put delivery.txt /multipleinputs
Step2: Run the jar file, providing the data files as input.
[cloudera@quickstart training]$ hadoop jar MultipleInput.jar /multipleinputs/customer.txt /multipleinputs/delivery.txt /multipleoutput
Step3: Check the output files created on HDFS.
[cloudera@quickstart training]$ hdfs dfs -ls /multipleoutput
Found 2 items
-rw-r--r-- 1 cloudera supergroup 0 2018-05-26 23:14 /multipleoutput/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 22 2018-05-26 23:14 /multipleoutput/part-r-00000
[cloudera@quickstart training]$ hdfs dfs -cat /multipleoutput/part-r-00000
mani 0
vijay 1
ravi 1
MRUnit test case for wordcount example
Prerequisites
Download the latest version of the MRUnit jar from the Apache repository: https://repository.apache.org/content/repositories/releases/org/apache/mrunit/mrunit/. This example uses mrunit-0.5.0-incubating.jar.
Maven pom.xml dependency (if you are using a Maven project); a sample dependency entry is shown below.
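The dependency entry itself is not shown in the post; based on the repository path and jar name above, it would look roughly like this (the version, and a classifier on newer MRUnit releases, may differ for your Hadoop setup):

<dependency>
    <groupId>org.apache.mrunit</groupId>
    <artifactId>mrunit</artifactId>
    <version>0.5.0-incubating</version>
    <scope>test</scope>
</dependency>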
Step1: Create a new Java project in Eclipse, then add the JUnit library.
Step2: Add the external jars that are required to run the JUnit test case:
/usr/lib/hadoop
/usr/lib/hadoop-0.20-mapreduce
/home/cloudera/training/MRUnit/mrunit-0.5.0-incubating.jar
In addition, we also need to add wordcount.jar (the WordMapper and SumReducer classes from wordcount.jar are used in the JUnit test case).
Step3: Create the JUnit test case.
Word count MRUnit test case:
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.MapDriver;
import org.apache.hadoop.mrunit.MapReduceDriver;
import org.apache.hadoop.mrunit.ReduceDriver;
import org.junit.Before;
import org.junit.Test;
public class TestWordCount {

    // Drivers exercise the mapper, the reducer, and the full map-shuffle-reduce pipeline
    MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> mapReduceDriver;
    MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
    ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;

    @Before
    public void setUp() {
        // WordMapper and SumReducer come from wordcount.jar
        WordMapper mapper = new WordMapper();
        SumReducer reducer = new SumReducer();

        mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
        mapDriver.setMapper(mapper);

        reduceDriver = new ReduceDriver<Text, IntWritable, Text, IntWritable>();
        reduceDriver.setReducer(reducer);

        mapReduceDriver = new MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable>();
        mapReduceDriver.setMapper(mapper);
        mapReduceDriver.setReducer(reducer);
    }

    @Test
    public void testMapper() {
        // One input line should produce one (word, 1) pair per word
        mapDriver.withInput(new LongWritable(1), new Text("cat cat dog"));
        mapDriver.withOutput(new Text("cat"), new IntWritable(1));
        mapDriver.withOutput(new Text("cat"), new IntWritable(1));
        mapDriver.withOutput(new Text("dog"), new IntWritable(1));
        mapDriver.runTest();
    }

    @Test
    public void testReducer() {
        // Two counts for "cat" should be summed to 2
        List<IntWritable> values = new ArrayList<IntWritable>();
        values.add(new IntWritable(1));
        values.add(new IntWritable(1));
        reduceDriver.withInput(new Text("cat"), values);
        reduceDriver.withOutput(new Text("cat"), new IntWritable(2));
        reduceDriver.runTest();
    }

    @Test
    public void testMapReduce() {
        // End-to-end: map, shuffle/sort, then reduce
        mapReduceDriver.withInput(new LongWritable(1), new Text("cat cat dog"));
        mapReduceDriver.addOutput(new Text("cat"), new IntWritable(2));
        mapReduceDriver.addOutput(new Text("dog"), new IntWritable(1));
        mapReduceDriver.runTest();
    }
}
Step4: Run the JUnit test case.
Step5: Verify that all the test cases pass.
Hadoop HDFS Commands
HDFS Commands
- jps
Command to print the Hadoop processes (Java daemons) that are currently running.
[root@quickstart Desktop]# jps
- fsck
HDFS Command to check the health of the Hadoop file system.
[cloudera@quickstart training]$ hdfs fsck /
- ls
HDFS Command to display the list of files and directories in HDFS.
[cloudera@quickstart training]$ hdfs dfs -ls /
- mkdir
HDFS Command to create a directory in HDFS.
[cloudera@quickstart training]$ hdfs dfs -mkdir /bigdatatesting
[cloudera@quickstart training]$ hdfs dfs -ls /
drwxr-xr-x - cloudera supergroup 0 2018-05-23 00:46 /bigdatatesting
Note: Here we are creating a directory named “bigdatatesting” in HDFS.
- touchz
HDFS Command to create a file in HDFS with a file size of 0 bytes.
[cloudera@quickstart training]$ hdfs dfs -touchz /bigdatatesting/test.dat
[cloudera@quickstart training]$ hdfs dfs -ls /bigdatatesting/
Found 1 items
-rw-r--r-- 1 cloudera supergroup 0 2018-05-23 00:48 /bigdatatesting/test.dat
Note: Here we are creating a file named “test.dat” in the “bigdatatesting” directory of HDFS, with a file size of 0 bytes.
- du
HDFS Command to check the file size.
[cloudera@quickstart training]$ hdfs dfs -du -s /bigdatatesting/test.dat
0 0 /bigdatatesting/test.dat
- appendToFile
HDFS Command that appends content to the given destination file on HDFS; the destination file is created if it does not exist. With “-” as the source, the content is read from standard input.
[cloudera@quickstart training]$ hdfs dfs -appendToFile - /bigdatatesting/test.dat
- cat
HDFS Command that reads a file on HDFS and prints the content of that file to the standard output.
[cloudera@quickstart training]$ hdfs dfs -cat /bigdatatesting/test.dat
- copyFromLocal
HDFS Command to copy a file from the local file system to HDFS.
Step1: Create a file in the local file system.
[cloudera@quickstart training]$ cat >> test1.dat
[cloudera@quickstart training]$ ls test1.dat
test1.dat
Step2: Copy the file from the local file system to HDFS.
[cloudera@quickstart training]$ hdfs dfs -copyFromLocal test1.dat /bigdatatesting/
Note: Here test1.dat is the file present in the local directory /home/cloudera/training; after the command executes, the test1.dat file is copied into the /bigdatatesting directory of HDFS.
- copyToLocal
HDFS Command to copy a file from HDFS to the local file system.
Step1: Check whether test.dat is present in the local file system.
[cloudera@quickstart training]$ ls test.dat
ls: cannot access test.dat: No such file or directory
Step2: Copy the test.dat file from HDFS to the local file system.
[cloudera@quickstart training]$ hdfs dfs -copyToLocal /bigdatatesting/test.dat /home/cloudera/training
Step3: Check again whether test.dat is present in the local file system.
[cloudera@quickstart training]$ ls test.dat
test.dat
Note: Here test.dat is a file present in the bigdatatesting directory of HDFS; after the command executes, the test.dat file is copied to the local directory /home/cloudera/training.
- put
HDFS Command to copy a single source, or multiple sources, from the local file system to the destination file system.
Step1: Create a file in the local file system.
[cloudera@quickstart training]$ cat >> test2.dat
[cloudera@quickstart training]$ ls test2.dat
test2.dat
Step2: Copy the file from the local file system to HDFS.
[cloudera@quickstart training]$ hdfs dfs -put test2.dat /bigdatatesting/
Note: Here test2.dat is the file present in the local directory /home/cloudera/training; after the command executes, the test2.dat file is copied into the /bigdatatesting directory of HDFS.
Note: The put command is similar to the copyFromLocal command.
- get
HDFS Command to copy files from HDFS to the local file system.
Step1: Create a new file test3.dat on HDFS.
[cloudera@quickstart training]$ hdfs dfs -touchz /bigdatatesting/test3.dat
Step2: Copy the test3.dat file from HDFS to the local file system.
[cloudera@quickstart training]$ hdfs dfs -get /bigdatatesting/test3.dat /home/cloudera/training
Step3: Check that test3.dat is present in the local file system.
[cloudera@quickstart training]$ ls test3.dat
test3.dat
Note1: Here test3.dat is a file present in the bigdatatesting directory of HDFS; after the command executes, the test3.dat file is copied to the local directory /home/cloudera/training.
Note2: The get command is similar to the copyToLocal command.
- cp
HDFS Command to copy files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.
[cloudera@quickstart training]$ hdfs dfs -mkdir /hadooptesting/
[cloudera@quickstart training]$ hdfs dfs -cp /bigdatatesting/test.dat /hadooptesting
- mv
HDFS Command to move files from source to destination. This command allows multiple sources as well, in which case the destination needs to be a directory.
[cloudera@quickstart training]$ hdfs dfs -mv /bigdatatesting/test1.dat /hadooptesting/
- rm
HDFS Command to remove the file from HDFS.
[cloudera@quickstart training]$ hdfs dfs -rm /bigdatatesting/test2.dat
Deleted /bigdatatesting/test2.dat
- rm -r
HDFS Command to remove the entire directory and all of its content from HDFS.
[cloudera@quickstart training]$ hdfs dfs -rm -r /hadooptesting
Deleted /hadooptesting
- rmdir
HDFS Command to remove the directory if it is empty.
[cloudera@quickstart training]$ hdfs dfs -rmdir /bigdatatesting
- usage
HDFS Command that returns the usage help for an individual command.
[cloudera@quickstart training]$ hdfs dfs -usage mkdir
Note: By using the usage command you can get usage information about any HDFS command.
- help
HDFS Command that displays help for a given command, or for all commands if none is specified.
[cloudera@quickstart training]$ hdfs dfs -help