Apache Hive External Tables


External tables

  • The data already resides in HDFS; the external table is created on top of that existing HDFS data.
  • This approach is often described as "schema on data": the schema is applied to data that is already present.
  • Dropping the table removes only the schema; the data remains in HDFS exactly as before.
  • External tables make it possible to define multiple schemas over the same data stored in HDFS, instead of deleting and reloading the data every time the schema changes.

Apache Hive Complex Data Types (Collections)


Complex Data Types
  1. arrays: ARRAY (an ordered collection of elements of the same type)
  2. maps: MAP (a collection of key/value pairs)
  3. structs: STRUCT (a record with named fields, each of which may have a different type)

How to test Python MapReduce Jobs in Hadoop


Example: count the number of occurrences of each word in a text file (word count)
1) Create Python scripts mapper.py & reducer.py 
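
The contents of the two scripts are not reproduced in these notes; a minimal sketch that matches the local test output below (the standard Hadoop Streaming word-count pattern; the exact script contents are assumed) could look like this:

mapper.py (sketch):

#!/usr/bin/env python
import sys

# Emit "word<TAB>1" for every word read from standard input.
for line in sys.stdin:
    for word in line.strip().split():
        sys.stdout.write("%s\t%d\n" % (word, 1))

reducer.py (sketch):

#!/usr/bin/env python
import sys

# Input arrives sorted by word (sort -k1,1 locally, the shuffle/sort phase on the
# cluster), so counts for the same word are adjacent and can be summed in one pass.
current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            sys.stdout.write("%s\t%d\n" % (current_word, current_count))
        current_word = word
        current_count = count

# Emit the count for the last word.
if current_word is not None:
    sys.stdout.write("%s\t%d\n" % (current_word, current_count))

In the local tests that follow, sort -k1,1 stands in for the cluster's shuffle-and-sort phase, which is what guarantees the reducer sees all counts for a given word together.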

2) Test mapper.py and reducer.py scripts locally before using them in a  MapReduce job.  

Test1:
[cloudera@quickstart training]$ echo "abc xyz abc abc abc xyz pqr" | python /home/cloudera/training/wordcount-python/mapper.py |sort -k1,1 | python /home/cloudera/training/wordcount-python/reducer.py
abc      4
pqr      1
xyz      2

Test2:
[cloudera@quickstart training]$ cat wordcount.txt | python /home/cloudera/training/wordcount-python/mapper.py |sort -k1,1 | python /home/cloudera/training/wordcount-python/reducer.py
are     3
how     2
is      1
welcome 1
where   1
you     4

3) Create a 'wordcountinput' directory in HDFS, then copy wordcount.txt into it.
[cloudera@quickstart training]$ hdfs dfs -mkdir /wordcountinput
[cloudera@quickstart training]$ hdfs dfs -put wordcount.txt /wordcountinput

4) Execute the MapReduce job using the Hadoop Streaming jar file.
Location: /usr/lib/hadoop-0.20-mapreduce/contrib/streaming

[cloudera@quickstart training]$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.6.0-mr1-cdh5.12.0.jar -Dmapred.reduce.tasks=1 -file /home/cloudera/training/wordcount-python/mapper.py /home/cloudera/training/wordcount-python/reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input /wordcountinput/wordcount.txt -output /wordcountoutput

5) Check the output 
[cloudera@quickstart training]$ hdfs dfs -ls /wordcountoutput
Found 2 items
-rw-r--r--   1 cloudera supergroup          0 2018-05-24 00:40 /wordcountoutput/_SUCCESS
-rw-r--r--   1 cloudera supergroup         41 2018-05-24 00:40 /wordcountoutput/part-00000

[cloudera@quickstart training]$ hdfs dfs -cat /wordcountoutput/part-00000
are     3
how     2
is      1
welcome 1
where   1
you     4



How to test Java MapReduce Jobs in Hadoop


Example-1 (wordcount)

Developer activities:
Step1: Develop the MapReduce code
Step2: Unit test the MapReduce code using the MRUnit framework
Step3: Create a jar file for the MapReduce code

Testing activities:
Step1: Create a new directory in HDFS, then copy the data file from the local file system to that HDFS directory
[cloudera@quickstart training]$ hdfs dfs -mkdir /mapreduceinput
[cloudera@quickstart training]$ hdfs dfs -put wordcount.txt /mapreduceinput
Step2: Run the jar file, providing the data file as input
[cloudera@quickstart training]$ hadoop jar wordcount.jar WordCount /mapreduceinput/wordcount.txt /mapreduceoutput/
Step3: Check the output files created on HDFS.
[cloudera@quickstart training]$ hdfs dfs -ls /mapreduceoutput
Found 2 items
-rw-r--r--   1 cloudera supergroup          0 2018-05-23 22:00 /mapreduceoutput/_SUCCESS
-rw-r--r--   1 cloudera supergroup         41 2018-05-23 22:00 /mapreduceoutput/part-00000
[cloudera@quickstart training]$ hdfs dfs -cat /mapreduceoutput/part-00000
are     3
how     2
is      1
welcome 1
where   1
you     4


Example-2 (Find the Number of Products Sold in Each Country)

Step1: Create a new directory in HDFS, then copy the data file from the local file system to that HDFS directory
[cloudera@quickstart training]$ hdfs dfs -mkdir /productsalesinput
[cloudera@quickstart training]$ hdfs dfs -put SalesJan2009.csv /productsalesinput

Step2: Run the jar file, providing the data file as input
[cloudera@quickstart training]$ hadoop jar ProductSalesperCountry.jar SalesCountry.SalesCountryDriver /productsalesinput/SalesJan2009.csv /productsalesoutput

Step3: Check the output files created on HDFS.
[cloudera@quickstart training]$ hdfs dfs -ls /productsalesoutput
Found 2 items
-rw-r--r--   1 cloudera supergroup          0 2018-05-23 23:52 /productsalesoutput/_SUCCESS
-rw-r--r--   1 cloudera supergroup        661 2018-05-23 23:52 /productsalesoutput/part-00000
[cloudera@quickstart training]$ hdfs dfs -cat /productsalesoutput/part-00000

Example-3 (MapReduce Join – Multiple Input Files)

Step1: Create a new directory in HDFS, then copy the data files from the local file system to that HDFS directory
[cloudera@quickstart training]$ hdfs dfs -mkdir /multipleinputs
[cloudera@quickstart training]$ hdfs dfs -put customer.txt /multipleinputs
[cloudera@quickstart training]$ hdfs dfs -put delivery.txt /multipleinputs

Step2: Run the jar file, providing the data files as input
[cloudera@quickstart training]$ hadoop jar MultipleInput.jar /multipleinputs/customer.txt /multipleinputs/delivery.txt /multipleoutput
Step3: Check the output files created on HDFS.
[cloudera@quickstart training]$ hdfs dfs -ls /multipleoutput
Found 2 items
-rw-r--r--   1 cloudera supergroup          0 2018-05-26 23:14 /multipleoutput/_SUCCESS
-rw-r--r--   1 cloudera supergroup         22 2018-05-26 23:14 /multipleoutput/part-r-00000
[cloudera@quickstart training]$ hdfs dfs -cat /multipleoutput/part-r-00000
mani    0
vijay   1
ravi    1

MRUnit test case for wordcount example

Prerequisites
Download the MRUnit jar from the Apache release repository: https://repository.apache.org/content/repositories/releases/org/apache/mrunit/mrunit/
(this walkthrough uses mrunit-0.5.0-incubating.jar)

Maven pom.xml dependency (if you are using a Maven project):

<dependency>
    <groupId>org.apache.mrunit</groupId>
    <artifactId>mrunit</artifactId>
    <version>0.9.0-incubating</version>
    <classifier>hadoop1</classifier>
</dependency>


Step1: Create a new Java project in Eclipse, then add the JUnit library.

Step2: Add the external jars required to run the JUnit test case:
/usr/lib/hadoop
/usr/lib/hadoop-0.20-mapreduce
/home/cloudera/training/MRUnit/mrunit-0.5.0-incubating.jar
In addition, add wordcount.jar (its WordMapper and SumReducer classes are used in the JUnit test case).

Step3: Create the JUnit test case

Word count MRUnit test case:
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.MapDriver;
import org.apache.hadoop.mrunit.MapReduceDriver;
import org.apache.hadoop.mrunit.ReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class TestWordCount {

    // Drivers that feed test input to the mapper, the reducer, and the full map-reduce flow
    MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> mapReduceDriver;
    MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
    ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;

    @Before
    public void setUp() {
        // WordMapper and SumReducer are the classes under test, taken from wordcount.jar
        WordMapper mapper = new WordMapper();
        SumReducer reducer = new SumReducer();
        mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
        mapDriver.setMapper(mapper);
        reduceDriver = new ReduceDriver<Text, IntWritable, Text, IntWritable>();
        reduceDriver.setReducer(reducer);
        mapReduceDriver = new MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable>();
        mapReduceDriver.setMapper(mapper);
        mapReduceDriver.setReducer(reducer);
    }

    @Test
    public void testMapper() {
        // The mapper should emit (word, 1) for every word in the input line
        mapDriver.withInput(new LongWritable(1), new Text("cat cat dog"));
        mapDriver.withOutput(new Text("cat"), new IntWritable(1));
        mapDriver.withOutput(new Text("cat"), new IntWritable(1));
        mapDriver.withOutput(new Text("dog"), new IntWritable(1));
        mapDriver.runTest();
    }

    @Test
    public void testReducer() {
        // The reducer should sum the counts for a single key
        List<IntWritable> values = new ArrayList<IntWritable>();
        values.add(new IntWritable(1));
        values.add(new IntWritable(1));
        reduceDriver.withInput(new Text("cat"), values);
        reduceDriver.withOutput(new Text("cat"), new IntWritable(2));
        reduceDriver.runTest();
    }

    @Test
    public void testMapReduce() {
        // End-to-end check: map, shuffle/sort, and reduce together
        mapReduceDriver.withInput(new LongWritable(1), new Text("cat cat dog"));
        mapReduceDriver.addOutput(new Text("cat"), new IntWritable(2));
        mapReduceDriver.addOutput(new Text("dog"), new IntWritable(1));
        mapReduceDriver.runTest();
    }

}

Step4: Run the JUnit test case

Step5: Verify that all test cases pass.


Hadoop HDFS Commands


HDFS Commands
  • jps
Command to list the Hadoop daemon processes running on the node (jps is a JDK utility that lists Java processes, not an HDFS command).
[root@quickstart Desktop]# jps

  • fsck
HDFS Command to check the health of the Hadoop file system.
[cloudera@quickstart training]$ hdfs fsck /

  • ls
HDFS Command to display the list of Files and Directories in HDFS.
[cloudera@quickstart training]$ hdfs dfs -ls /

  • mkdir
HDFS Command to create the directory in HDFS.
[cloudera@quickstart training]$ hdfs dfs -mkdir /bigdatatesting
[cloudera@quickstart training]$ hdfs dfs -ls /
drwxr-xr-x   - cloudera supergroup          0 2018-05-23 00:46 /bigdatatesting
Note: Here we are trying to create a directory named "bigdatatesting" in HDFS.

  • touchz
HDFS Command to create a file in HDFS with file size 0 bytes.
[cloudera@quickstart training]$ hdfs dfs -touchz /bigdatatesting/test.dat
[cloudera@quickstart training]$ hdfs dfs -ls /bigdatatesting/
Found 1 items
-rw-r--r--   1 cloudera supergroup          0 2018-05-23 00:48 /bigdatatesting/test.dat

Note: Here we are trying to create a file named "test.dat" in the "bigdatatesting" directory of HDFS with a file size of 0 bytes.

  • du
HDFS Command to check the file size. 
[cloudera@quickstart training]$ hdfs dfs -du -s /bigdatatesting/test.dat
0  0  /bigdatatesting/test.dat

  • appendToFile
Appends content from the local file system (or from stdin, when "-" is given, as in the example below) to the given destination file on HDFS. The destination file will be created if it does not exist.
[cloudera@quickstart training]$ hdfs dfs -appendToFile - /bigdatatesting/test.dat

  • cat
HDFS Command that reads a file on HDFS and prints the content of that file to the standard output.
 [cloudera@quickstart training]$ hdfs dfs -cat /bigdatatesting/test.dat

  • copyFromLocal
HDFS Command to copy the file from a Local file system to HDFS.
Step1: Create a file in Local File System.
[cloudera@quickstart training]$ cat>> test1.dat
[cloudera@quickstart training]$ ls test1.dat
test1.dat
Step2: Copy file from Local File system to HDFS
[cloudera@quickstart training]$ hdfs dfs -copyFromLocal test1.dat /bigdatatesting/
Note: Here test1.dat is the file present in the local directory /home/cloudera/training, and after the command gets executed the test1.dat file will be copied to the /bigdatatesting directory of HDFS.

  • copyToLocal
HDFS Command to copy the file from HDFS to Local File System.
Step1: Check whether test.dat is present in the local file system.
[cloudera@quickstart training]$ ls test.dat
ls: cannot access test.dat: No such file or directory
Step2: Copy test.dat file from HDFS to local file system. 
[cloudera@quickstart training]$ hdfs dfs -copyToLocal /bigdatatesting/test.dat /home/cloudera/training
Step3: Check again whether test.dat is present in the local file system.
[cloudera@quickstart training]$ ls test.dat
test.dat
Note: Here test.dat is a file present in the bigdatatesting directory of HDFS, and after the command gets executed the test.dat file will be copied to the local directory /home/cloudera/training.

  • put
HDFS Command to copy single source or multiple sources from local file system to the destination file system.
Step1: Create a file in Local File System.
[cloudera@quickstart training]$ cat>> test2.dat
[cloudera@quickstart training]$ ls test2.dat
test2.dat
Step2: Copy file from Local File system to HDFS
[cloudera@quickstart training]$ hdfs dfs -put test2.dat /bigdatatesting/
Note: Here test2.dat is the file present in the local directory /home/cloudera/training, and after the command gets executed the test2.dat file will be copied to the /bigdatatesting directory of HDFS.
Note: The put command is similar to the copyFromLocal command.
  • get
HDFS Command to copy files from HDFS to the local file system.
Step1: Create a new file test3.dat on HDFS.
[cloudera@quickstart training]$ hdfs dfs -touchz /bigdatatesting/test3.dat

Step2: Copy test3.dat file from HDFS to local file system. 
[cloudera@quickstart training]$ hdfs dfs -get /bigdatatesting/test3.dat /home/cloudera/training

Step3: Check again whether test3.dat is present in the local file system.
[cloudera@quickstart training]$ ls test3.dat
test3.dat

Note1: Here test3.dat is a file present in the bigdatatesting directory of HDFS, and after the command gets executed the test3.dat file will be copied to the local directory /home/cloudera/training.

Note2: The get command is similar to the copyToLocal command.

  • cp
HDFS Command to copy files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.
[cloudera@quickstart training]$ hdfs dfs -mkdir /hadooptesting/
[cloudera@quickstart training]$ hdfs dfs -cp /bigdatatesting/test.dat /hadooptesting

  • mv
HDFS Command to move files from source to destination. This command allows multiple sources as well, in which case the destination needs to be a directory.
[cloudera@quickstart training]$ hdfs dfs -mv /bigdatatesting/test1.dat /hadooptesting/

  • rm
HDFS Command to remove the file from HDFS.
[cloudera@quickstart training]$ hdfs dfs -rm /bigdatatesting/test2.dat
Deleted /bigdatatesting/test2.dat

  • rm -r
HDFS Command to remove the entire directory and all of its content from HDFS.
[cloudera@quickstart training]$ hdfs dfs -rm -r /hadooptesting
Deleted /hadooptesting

  • rmdir
HDFS Command to remove the directory if it is empty.
[cloudera@quickstart training]$ hdfs dfs -rmdir /bigdatatesting

  • usage
HDFS Command that returns the help for an individual command.
[cloudera@quickstart training]$ hdfs dfs -usage mkdir
Note: The usage command gives usage information for any HDFS command.

  • help
HDFS Command that displays help for given command or all commands if none is specified.
[cloudera@quickstart training]$ hdfs dfs -help

