Basic
- Pig is a scripting platform for processing and analysing large data sets.
- very useful for people who do not have Java knowledge.
- used for high-level data flow and for processing data stored on HDFS.
- Pig is named after the animal because, like a pig, it can consume and process any type of data; it is heavily used in data cleansing.
- Whatever you write in Pig is internally converted to MapReduce (MR) jobs.
- Pig is a client-side installation; it does not need to sit on the Hadoop cluster.
- A Pig script executes a set of commands, which are converted to MapReduce (MR) jobs and submitted to a Hadoop cluster running locally or remotely.
- The Hadoop cluster does not care whether a job was submitted from Pig or from some other environment.
- MapReduce jobs are executed only when the DUMP or STORE command is called (more on this later); see the sketch below.
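A minimal sketch of this lazy evaluation (the file name and schema are hypothetical):

    -- each statement below only builds the logical plan; nothing executes yet
    A = LOAD 'visits.log' USING PigStorage('\t') AS (ip:chararray, url:chararray);
    B = FILTER A BY url == '/home';
    -- only DUMP (or STORE) triggers compilation to MR jobs and execution
    DUMP B;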
Pig vs Traditional Hadoop MapReduce (MR)
- A lot of effort is required to write a MapReduce job in Hadoop; in Pig the effort required is much smaller.
- In Hadoop we have to write a ToolRunner, a Mapper, and a Reducer; in Pig none of that is mandatory, as we just write a small script, i.e. a set of commands (see the sketch after this list).
- Hadoop MapReduce has more functionality (and finer control) than Pig.
- Since in Pig we only write the script, and not a separate ToolRunner, Mapper, Reducer, etc., the development effort is much lower.
- Pig is slightly slower than a hand-written MR job.
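To see the difference in effort, the group-and-count below, which would need a full driver/Mapper/Reducer in Java MapReduce, takes only a few lines of Pig Latin (the file name and schema are hypothetical):

    logs   = LOAD 'access.log' USING PigStorage(' ') AS (ip:chararray, url:chararray);
    byip   = GROUP logs BY ip;
    counts = FOREACH byip GENERATE group AS ip, COUNT(logs) AS hits;
    DUMP counts;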
Components of Pig
- Pig execution environment
  - essentially the Hadoop cluster to which the Pig script's jobs are submitted.
  - can be a local or a remote Hadoop cluster.
- Pig Latin
  - a new language, which is compiled to MapReduce (MR) jobs.
  - increases productivity, as fewer lines of code are required.
  - good for non-Java programmers.
  - provides operations like JOIN, GROUP, FILTER, and SORT; in plain Hadoop we would need to write a lot of code for a join.
  - a data flow language rather than a procedural language.
Data flow in Pig
- LOAD the data from HDFS into the Pig program.
- Transform the data into the appropriate format, e.g. with GROUP, JOIN, FILTER, by combining two files, or with any other built-in function.
- DUMP the data to the screen, or STORE it somewhere (a minimal sketch follows).
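A minimal sketch of the three steps (paths and schema are hypothetical):

    -- 1. LOAD data from HDFS
    raw = LOAD '/data/events.csv' USING PigStorage(',') AS (user:chararray, action:chararray);
    -- 2. transform it into the needed shape (here: keep only 'click' events)
    clicks = FILTER raw BY action == 'click';
    -- 3. DUMP to the screen or STORE back to HDFS
    STORE clicks INTO '/data/clicks_out';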
Pig Execution Modes
- local mode
  - pig -x local
  - enters the default shell, named grunt, with Pig running against the local file system.
- MapReduce mode
  - pig
  - enters grunt in MapReduce mode, running against the Hadoop cluster.
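Either mode can also run a script file directly instead of opening grunt (the script name is hypothetical):

    pig -x local myscript.pig    # runs the script against the local file system
    pig myscript.pig             # runs the script against the Hadoop cluster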
Pig Latin Example
- A = LOAD 'myserver.log' USING PigStorage() AS (ipaddress:chararray, timestamp:int, url:chararray);
- A = LOAD 'myserver.log' USING PigStorage(); -- the same load, without declaring a schema
- B = GROUP A BY ipaddress;
- C = FOREACH B GENERATE group, COUNT(A); -- the grouping key is referenced as "group"
- STORE C INTO 'output.txt'; -- creates 'output.txt' as a directory on HDFS
- DUMP C;
Terminology
- atom: any single value (e.g. 123 or abc) is called an atom.
- tuple: an ordered collection of atoms, e.g. (123, abc, xyz).
- bag: a collection of tuples, e.g. {(123, abc, xyz), (sdksjd, 122, skd)}.
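These types appear directly in relation schemas. For example, after the GROUP in the example above, each row of B holds an atom (the group key) plus a bag of tuples; a rough sketch of what DESCRIBE would print in grunt:

    grunt> DESCRIBE B;
    B: {group: chararray, A: {(ipaddress: chararray, timestamp: int, url: chararray)}}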
Transformations in Pig
- SAMPLE
  - to get a random sample of a dataset (a combined sketch of these transformations appears after this list).
  - x = SAMPLE c 0.01; -- puts approximately 1% of c into x
- LIMIT
  - to limit the number of records.
  - x = LIMIT c 3;
  - gets only 3 records from c and puts them in x.
  - may return any 3 records, not necessarily the same set every time.
- ORDER
  - to sort a dataset by one or more fields, in ascending or descending order.
  - x = ORDER c BY f1 ASC;
  - sorts c by the f1 column in ascending order.
- JOIN
  - to join two or more datasets into a single dataset.
  - x = JOIN a BY fieldInA, b BY fieldInB, c BY fieldInC;
- GROUP
  - used to group a dataset based on a field.
  - B = GROUP A BY age;
- UNION
  - combination of two or more datasets.
  - a = LOAD 'file1.txt' USING PigStorage(',') AS (field1:int, field2:int, field3:int);
  - b = LOAD 'file2.txt' USING PigStorage(',') AS (anotherfield1:int, anotherfield2:int, anotherfield3:int);
  - c = UNION a, b; -- UNION works when the datasets have the same number of fields with matching data types in all columns.
- DISTINCT
  - removes duplicate records.
  - d = DISTINCT c;
- FILTER
  - keeps only the records that satisfy a condition.
  - f = FILTER c BY field1 > 3;
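A combined sketch chaining several of the transformations above (file names, field names, and thresholds are hypothetical; both files share a schema so UNION can merge them):

    a = LOAD 'file1.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
    b = LOAD 'file2.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
    c = UNION a, b;          -- combine both datasets
    d = DISTINCT c;          -- drop duplicate records
    e = FILTER d BY f1 > 3;  -- keep rows where f1 > 3
    g = ORDER e BY f1 ASC;   -- sort by f1, ascending
    h = LIMIT g 3;           -- keep at most 3 records
    DUMP h;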
Pig Usage
- processing logs generated by servers.
- data processing for search platforms.
- running ad hoc queries across a large cluster.