How to test Python MapReduce Jobs in Hadoop


Example: Count the number of words in a text file (word count)
1) Create the Python scripts mapper.py and reducer.py (a minimal sketch of both is shown below).
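
The scripts themselves are not listed in this post, so here is a minimal sketch of what mapper.py and reducer.py could look like for word count; the exact contents of the original scripts may differ.

mapper.py:

#!/usr/bin/env python
# Read lines from standard input and emit "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%s" % (word, 1))

reducer.py:

#!/usr/bin/env python
# Streaming delivers the mapper output sorted by key, so counts can be
# accumulated per word and flushed whenever the word changes.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue  # skip malformed counts
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word = word
        current_count = count

# Flush the last word.
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

Both scripts simply read from stdin and write to stdout, which is all that Hadoop streaming requires.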

2) Test mapper.py and reducer.py locally before using them in a MapReduce job. The sort -k1,1 step in the pipelines below mimics the shuffle-and-sort phase that Hadoop performs between the map and reduce stages.

Test1:
[cloudera@quickstart training]$ echo "abc xyz abc abc abc xyz pqr" | python /home/cloudera/training/wordcount-python/mapper.py |sort -k1,1 | python /home/cloudera/training/wordcount-python/reducer.py
abc      4
pqr      1
xyz      2

Test2:
[cloudera@quickstart training]$ cat wordcount.txt | python /home/cloudera/training/wordcount-python/mapper.py |sort -k1,1 | python /home/cloudera/training/wordcount-python/reducer.py
are     3
how     2
is      1
welcome 1
where   1
you     4
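
To see what the intermediate key/value pairs look like before the sort, the mapper can also be run on its own (output shown assuming a mapper like the sketch in step 1):

[cloudera@quickstart training]$ echo "abc xyz abc" | python /home/cloudera/training/wordcount-python/mapper.py
abc     1
xyz     1
abc     1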

3) Create a 'wordcountinput' directory in HDFS, then copy wordcount.txt into it.
[cloudera@quickstart training]$ hdfs dfs -mkdir /wordcountinput
[cloudera@quickstart training]$ hdfs dfs -put wordcount.txt /wordcountinput
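
As a quick sanity check (not part of the original run), the upload can be verified by listing the directory:

[cloudera@quickstart training]$ hdfs dfs -ls /wordcountinput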

4) Execute the MapReduce job using the Hadoop streaming jar file.
Location: /usr/lib/hadoop-0.20-mapreduce/contrib/streaming

[cloudera@quickstart training]$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.6.0-mr1-cdh5.12.0.jar -Dmapred.reduce.tasks=1 -file /home/cloudera/training/wordcount-python/mapper.py -file /home/cloudera/training/wordcount-python/reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input /wordcountinput/wordcount.txt -output /wordcountoutput

The -file option is repeated once per script so that both mapper.py and reducer.py are shipped to the cluster with the job; -input and -output are HDFS paths, and the output directory must not already exist.

5) Check the output 
[cloudera@quickstart training]$ hdfs dfs -ls /wordcountoutput
Found 2 items
-rw-r--r--   1 cloudera supergroup          0 2018-05-24 00:40 /wordcountoutput/_SUCCESS
-rw-r--r--   1 cloudera supergroup         41 2018-05-24 00:40 /wordcountoutput/part-00000

[cloudera@quickstart training]$ hdfs dfs -cat /wordcountoutput/part-00000
are     3
how     2
is      1
welcome 1
where   1
you     4
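
If a local copy of the result is needed, it can be pulled out of HDFS with hdfs dfs -get (the local file name here is just a placeholder):

[cloudera@quickstart training]$ hdfs dfs -get /wordcountoutput/part-00000 wordcount_result.txt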


