10/31/16

Hadoop PIG Notes

Basics

  • Pig is a scripting platform for processing and analysing large data sets.
  • very useful for people who do not have Java knowledge.
  • used for high-level data flow and for processing the data available on HDFS.
  • Pig is named after the animal because, like a pig, it can consume and process any type of data; it is widely used for data cleansing.
  • whatever you write in Pig is internally converted to MapReduce (MR) jobs.
  • Pig is a client-side installation; it need not sit on the hadoop cluster.
  • a Pig script executes a set of commands, which are converted to MapReduce (MR) jobs and submitted to hadoop running locally or remotely.
  • a hadoop cluster does not care whether a job was submitted from Pig or from some other environment.
  • the MapReduce jobs are executed only when the DUMP or STORE command is called (more on this later).

Pig Vs Traditional Hadoop Map Reduce(MR).

  • a lot of effort is required to write MapReduce jobs in Hadoop, but in Pig the effort required is much less.
  • in Hadoop, we have to write a ToolRunner, Mapper, and Reducer, but in Pig none of this is mandatory; we just write a small script (a set of commands).
  • Hadoop MapReduce has more functionality than Pig.
  • since in Pig we just write the script, and not separate ToolRunner, Mapper, Reducer classes etc., the development effort while using Pig is much less.
  • Pig is slightly slower than an equivalent hand-written MR job.

Components of PIG

  • pig execution environment
    • essentially the hadoop cluster where the pig script is submitted to run.
    • it can be a local or a remote hadoop cluster.
  • pig latin
    • a new language, which is compiled to MapReduce (MR) jobs.
    • increases productivity, as fewer lines of code are required.
    • good for non-Java programmers.
    • provides operations like JOIN, GROUP, FILTER, and ORDER out of the box, whereas in plain Hadoop we would have to write a lot of code for a join etc.
    • a data flow language rather than a procedural language.

Data flow in Pig

  • LOAD the data from HDFS into the Pig program.
  • the data is transformed into the appropriate format, e.g. by GROUP, JOIN, FILTER, by combining two files, or by any other built-in function.
  • DUMP the data to the screen or STORE the data somewhere.
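The three steps above can be sketched as a minimal Pig Latin script; the file name, schema, and filter condition here are illustrative assumptions, not from the notes:

```
-- LOAD: read the data from HDFS (file and schema are hypothetical)
logs = LOAD 'myserver.log' USING PigStorage('\t')
       AS (ipaddress:chararray, status:int, url:chararray);

-- TRANSFORM: keep only error responses (example condition)
errors = FILTER logs BY status >= 500;

-- DUMP to the screen, or STORE to HDFS; only now are the MR jobs launched
DUMP errors;
-- STORE errors INTO 'error_logs' USING PigStorage('\t');
```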

Pig Execution Modes

  • local mode
    • started with: pig -x local
    • runs against the local file system and drops you into the default shell, named grunt.
  • map reduce mode
    • started with: pig
    • the default mode; runs against the hadoop cluster and HDFS.
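A sketch of what each invocation looks like on the command line; the prompts and file paths are illustrative assumptions:

```
$ pig -x local        # local mode: files are read from the local file system
grunt> A = LOAD '/tmp/sample.txt' USING PigStorage(',');
grunt> DUMP A;

$ pig                 # mapreduce mode (the default): files are read from HDFS
grunt> A = LOAD '/user/hadoop/sample.txt' USING PigStorage(',');
grunt> DUMP A;
```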

Pig Latin Example

  • A = LOAD 'myserver.log' USING PigStorage() AS (ipaddress:chararray, timestamp:int, url:chararray);
  • A = LOAD 'myserver.log' USING PigStorage(); (the same load, without a schema)
  • B = GROUP A BY ipaddress;
  • C = FOREACH B GENERATE group, COUNT(A); (after a GROUP, the grouping field is referred to as "group")
  • STORE C INTO 'output.txt';
  • DUMP C;

Terminology

  • atom : any single value is called an atom, e.g. 123 or abc.
  • tuple : an ordered collection of atoms, e.g. (123, abc, xyz).
  • bag : a collection of tuples, e.g. {(123,abc,xyz), (sdksjd,122,skd)}.
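A quick illustration of how the three fit together; the relation and values are made up. After a GROUP, each output row holds the grouping atom plus a bag of the original tuples:

```
-- each input line becomes a tuple of atoms (schema is hypothetical)
A = LOAD 'data.txt' USING PigStorage(',') AS (name:chararray, age:int);

B = GROUP A BY age;
DUMP B;
-- a row of B looks like:  (25, {(abc,25), (xyz,25)})
-- 25 is an atom, (abc,25) is a tuple, and {...} is a bag of tuples
```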

Transformations in Pig

  • SAMPLE
    • to get a random sample of a dataset.
    • x = SAMPLE c 0.01; (approximately 1% of c goes into x)
  • LIMIT
    • to limit the number of records.
    • x = LIMIT c 3;
    • gets only 3 records from c and puts them in x.
    • may fetch any random 3, not necessarily the same set of records every time.
  • ORDER
    • to sort the dataset by one or more columns, in ascending or descending order.
    • x = ORDER c BY f1 ASC;
    • sorts c by the f1 column in ascending order.
  • JOIN
    • to join two or more datasets into a single dataset.
    • x = JOIN a BY fieldInA, b BY fieldInB, c BY fieldInC;
  • GROUP
    • used to group the dataset based on a field.
    • B = GROUP A BY age;
  • UNION
    • combination of two or more data sets.
    • a = LOAD 'file1.txt' USING PigStorage(',') AS (field1:int, field2:int, field3:int);
    • b = LOAD 'file2.txt' USING PigStorage(',') AS (anotherfield1:int, anotherfield2:int, anotherfield3:int);
    • c = UNION a, b; (union works only if both datasets have the same format, i.e. matching datatypes in all columns)
    • d = DISTINCT c; (removes duplicate records)
    • f = FILTER c BY field1 > 3; (keeps only the records where field1 > 3)
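The transformations above can be strung together in one sketch; the file names, schemas, and values are assumed for illustration:

```
a = LOAD 'file1.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
b = LOAD 'file2.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);

c = UNION a, b;              -- combine both datasets (same schema required)
d = DISTINCT c;              -- drop duplicate tuples
e = FILTER d BY f1 > 3;      -- keep tuples whose f1 is greater than 3
f = ORDER e BY f1 ASC;       -- sort ascending by f1
g = LIMIT f 3;               -- keep at most 3 tuples
h = SAMPLE c 0.01;           -- roughly 1% random sample of c
j = JOIN a BY f1, b BY f1;   -- join the two inputs on f1

DUMP g;                      -- nothing runs until a DUMP or STORE
```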

Pig Usage

  • processing of logs generated from servers.
  • data processing for search platforms.
  • ad hoc queries across large data sets on a cluster.