How to test Python MapReduce Jobs in Hadoop


Example: Count the number of words in a text file (word count)
1) Create the Python scripts mapper.py and reducer.py (a minimal sketch of both is shown below).
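
The scripts themselves are not listed in this post, so here is a minimal sketch of what mapper.py and reducer.py could look like for word count; the exact contents of the original scripts may differ.

mapper.py:

#!/usr/bin/env python
# Read lines from standard input and emit "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%s" % (word, 1))

reducer.py:

#!/usr/bin/env python
# Streaming delivers the mapper output sorted by key, so counts can be
# accumulated per word and flushed whenever the word changes.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue  # skip malformed counts
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word = word
        current_count = count

# Flush the last word.
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

Both scripts simply read from stdin and write to stdout, which is all that Hadoop streaming requires.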

2) Test mapper.py and reducer.py locally before using them in a MapReduce job. The sort -k1,1 step in the pipelines below mimics the shuffle-and-sort phase that Hadoop performs between the map and reduce stages.

Test1:
[cloudera@quickstart training]$ echo "abc xyz abc abc abc xyz pqr" | python /home/cloudera/training/wordcount-python/mapper.py |sort -k1,1 | python /home/cloudera/training/wordcount-python/reducer.py
abc      4
pqr      1
xyz      2

Test2:
[cloudera@quickstart training]$ cat wordcount.txt | python /home/cloudera/training/wordcount-python/mapper.py |sort -k1,1 | python /home/cloudera/training/wordcount-python/reducer.py
are     3
how     2
is      1
welcome 1
where   1
you     4
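
To see what the intermediate key/value pairs look like before the sort, the mapper can also be run on its own (output shown assuming a mapper like the sketch in step 1):

[cloudera@quickstart training]$ echo "abc xyz abc" | python /home/cloudera/training/wordcount-python/mapper.py
abc     1
xyz     1
abc     1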

3) Create a 'wordcountinput' directory in HDFS, then copy wordcount.txt into it.
[cloudera@quickstart training]$ hdfs dfs -mkdir /wordcountinput
[cloudera@quickstart training]$ hdfs dfs -put wordcount.txt /wordcountinput
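
As a quick sanity check (not part of the original run), the upload can be verified by listing the directory:

[cloudera@quickstart training]$ hdfs dfs -ls /wordcountinput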

4) Execute the MapReduce job using the Hadoop streaming jar file.
Location: /usr/lib/hadoop-0.20-mapreduce/contrib/streaming

[cloudera@quickstart training]$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.6.0-mr1-cdh5.12.0.jar -Dmapred.reduce.tasks=1 -file /home/cloudera/training/wordcount-python/mapper.py -file /home/cloudera/training/wordcount-python/reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input /wordcountinput/wordcount.txt -output /wordcountoutput

The -file option is repeated once per script so that both mapper.py and reducer.py are shipped to the cluster with the job; -input and -output are HDFS paths, and the output directory must not already exist.

5) Check the output 
[cloudera@quickstart training]$ hdfs dfs -ls /wordcountoutput
Found 2 items
-rw-r--r--   1 cloudera supergroup          0 2018-05-24 00:40 /wordcountoutput/_SUCCESS
-rw-r--r--   1 cloudera supergroup         41 2018-05-24 00:40 /wordcountoutput/part-00000

[cloudera@quickstart training]$ hdfs dfs -cat /wordcountoutput/part-00000
are     3
how     2
is      1
welcome 1
where   1
you     4
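
If a local copy of the result is needed, it can be pulled out of HDFS with hdfs dfs -get (the local file name here is just a placeholder):

[cloudera@quickstart training]$ hdfs dfs -get /wordcountoutput/part-00000 wordcount_result.txt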


