Example: Count the number of words in a text file (word count)
1) Create the Python scripts mapper.py and reducer.py.
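The two scripts themselves are not listed here. The following is a minimal sketch of the logic they would need, assuming the usual Hadoop Streaming convention of tab-separated "word<TAB>count" records on stdin and stdout. The function names map_words and reduce_counts are hypothetical, and the sketch does not normalize case or strip punctuation.

```python
from itertools import groupby


def map_words(lines):
    """Mapper logic: emit one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reduce_counts(sorted_records):
    """Reducer logic: sum the counts of consecutive records that share a word.

    Assumes the records arrive sorted by word, which `sort -k1,1` provides
    locally and the shuffle/sort phase provides inside a MapReduce job.
    """
    pairs = (rec.rstrip("\n").split("\t", 1)
             for rec in sorted_records if rec.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))
```

To turn this into the two scripts, mapper.py would loop over map_words(sys.stdin) and print each record, and reducer.py would do the same with reduce_counts(sys.stdin). Both need `import sys` and should be readable by the python interpreter on the cluster nodes.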
2) Test the mapper.py and reducer.py scripts locally before using them in a MapReduce job.
Test1:
[cloudera@quickstart training]$ echo "abc xyz abc abc abc xyz pqr" \
    | python /home/cloudera/training/wordcount-python/mapper.py \
    | sort -k1,1 \
    | python /home/cloudera/training/wordcount-python/reducer.py
abc 4
pqr 1
xyz 2
Test2:
[cloudera@quickstart training]$ cat wordcount.txt \
    | python /home/cloudera/training/wordcount-python/mapper.py \
    | sort -k1,1 \
    | python /home/cloudera/training/wordcount-python/reducer.py
are 3
how 2
is 1
welcome 1
where 1
you 4
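In these local tests, the sort -k1,1 stage stands in for Hadoop's shuffle-and-sort phase, so the reducer receives all records for a given word together. The expected Test1 output can also be cross-checked without the two scripts. The sketch below counts words with a plain dictionary; the function name expected_counts is hypothetical, and the tab separator is an assumption about the reducer's output format.

```python
def expected_counts(text):
    """Return sorted 'word<TAB>count' lines for a whitespace-separated string."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return ["%s\t%d" % (word, counts[word]) for word in sorted(counts)]
```

Calling expected_counts("abc xyz abc abc abc xyz pqr") yields the same three records as Test1: abc 4, pqr 1, xyz 2.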
3) Create the ‘wordcountinput’ directory in HDFS, then copy wordcount.txt into it.
[cloudera@quickstart training]$ hdfs dfs -mkdir /wordcountinput
[cloudera@quickstart training]$ hdfs dfs -put wordcount.txt /wordcountinput
4) Execute the MapReduce job using the streaming jar file.
Location: /usr/lib/hadoop-0.20-mapreduce/contrib/streaming
[cloudera@quickstart training]$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.6.0-mr1-cdh5.12.0.jar \
    -Dmapred.reduce.tasks=1 \
    -file /home/cloudera/training/wordcount-python/mapper.py \
    -file /home/cloudera/training/wordcount-python/reducer.py \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -input /wordcountinput/wordcount.txt \
    -output /wordcountoutput
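When the job has to be rerun with different inputs, it can help to assemble the command in a script. The sketch below only builds the argument list; the helper build_streaming_command is hypothetical and not part of Hadoop. It assumes one -file option per shipped script and keeps the generic -D option ahead of the streaming options.

```python
import os


def build_streaming_command(jar, mapper_path, reducer_path,
                            input_path, output_path, reducers=1):
    """Return the argument list for a Hadoop Streaming word-count job."""
    return [
        "hadoop", "jar", jar,
        # Generic options such as -D must precede the streaming options.
        "-Dmapred.reduce.tasks=%d" % reducers,
        # Ship both scripts to the task nodes, one -file option each.
        "-file", mapper_path,
        "-file", reducer_path,
        # On the task nodes the scripts are referenced by base name.
        "-mapper", "python %s" % os.path.basename(mapper_path),
        "-reducer", "python %s" % os.path.basename(reducer_path),
        "-input", input_path,
        "-output", output_path,
    ]
```

The returned list can be passed to subprocess.check_call. Note that the job fails if the output directory already exists, so /wordcountoutput has to be removed before a rerun.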
5) Check the output.
[cloudera@quickstart training]$ hdfs dfs -ls /wordcountoutput
Found 2 items
-rw-r--r-- 1 cloudera supergroup  0 2018-05-24 00:40 /wordcountoutput/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 41 2018-05-24 00:40 /wordcountoutput/part-00000
[cloudera@quickstart training]$ hdfs dfs -cat /wordcountoutput/part-00000
are 3
how 2
is 1
welcome 1
where 1
you 4
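To compare the job output with the local Test2 result programmatically, the contents of part-00000 can be parsed into a dictionary. This is a minimal sketch; parse_counts is a hypothetical helper, and it assumes each line ends with a whitespace-separated integer count.

```python
def parse_counts(text):
    """Parse 'word<whitespace>count' lines into a {word: count} dictionary."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split once from the right so the last field is always the count.
        word, count = line.rsplit(None, 1)
        counts[word] = int(count)
    return counts
```

The text could be captured, for example, with subprocess.check_output(["hdfs", "dfs", "-cat", "/wordcountoutput/part-00000"]), decoded to a string where needed. For the output shown above, the parsed counts sum to 12 words.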