Hadoop PIG Notes


Basic

  • Pig is a scripting platform for processing and analysing large data sets.
  • very useful for people who do not have Java knowledge.
  • used for high-level data flow and for processing the data available on HDFS.
  • Pig is named after the animal because, like a pig, it can consume and process any type of data, and it is heavily used in data cleansing.
  • whatever you write in Pig is internally converted to MapReduce (MR) jobs.
  • Pig is a client-side installation; it need not sit on the Hadoop cluster.
  • a Pig script executes a set of commands, which are converted to MapReduce (MR) jobs and submitted to Hadoop running locally or remotely.
  • the Hadoop cluster does not care whether a job was submitted from Pig or from some other environment.
  • MapReduce jobs are executed only when the DUMP or STORE command is called (more on this later; see the sketch below).
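
A minimal sketch of this lazy execution (the file name and schema below are hypothetical):

    -- nothing runs yet; Pig only builds a logical plan for these two statements
    users  = LOAD 'users.txt' USING PigStorage('\t') AS (name:chararray, age:int);
    adults = FILTER users BY age >= 18;
    -- the MR job is compiled and submitted only here
    DUMP adults;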

Pig vs. Traditional Hadoop MapReduce (MR)

  • a lot of effort is required to write MapReduce in Hadoop; in Pig the effort required is much less.
  • in Hadoop we have to write a ToolRunner, Mapper, and Reducer, but in Pig none of these is mandatory, as we just write a small script (a set of commands).
  • Hadoop Mapreduce has more functionality than Pig.
  • since in Pig we only write the script, and not a separate ToolRunner, mapper, reducer, etc., the development effort while using Pig is much less.
  • Pig is slightly slower than a hand-written MR job.

Components of Pig

  • Pig execution environment
    • it is essentially the Hadoop cluster where the Pig script is submitted to run.
    • it can be a local or a remote Hadoop cluster.
  • Pig Latin
    • a new language, which is compiled to MapReduce (MR) jobs.
    • increases productivity, as fewer lines of code are required.
    • good for non-Java programmers.
    • provides operations like JOIN, GROUP, FILTER, and ORDER out of the box, whereas in raw Hadoop we would have to write a lot of code for a join etc. (see the sketch below).
    • a data flow language rather than a procedural language.
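
A sketch of how compact these operations are in Pig Latin; the relations users and orders and their fields are hypothetical:

    -- each of these one-liners would need its own Mapper/Reducer code in raw MR
    joined   = JOIN users BY userid, orders BY userid;
    grouped  = GROUP orders BY userid;
    sorted   = ORDER orders BY amount DESC;
    filtered = FILTER orders BY amount > 100;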

Data flow in Pig

  • LOAD the data from HDFS into the Pig program.
  • transform the data into the appropriate format, perhaps with GROUP, JOIN, FILTER, a UNION of two files, or any other built-in function.
  • DUMP the data to the screen or STORE it somewhere (see the sketch below).
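
A minimal end-to-end sketch of this flow (sales.txt and its comma-delimited schema are assumptions):

    raw      = LOAD 'sales.txt' USING PigStorage(',') AS (store:chararray, amount:int);  -- 1. LOAD
    by_store = GROUP raw BY store;                                                       -- 2. transform
    totals   = FOREACH by_store GENERATE group AS store, SUM(raw.amount) AS total;
    STORE totals INTO 'sales_totals';                                                    -- 3. STORE (or: DUMP totals;)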

Pig Execution Modes

  • local mode
    • pig -x local
    • enters the default shell, named Grunt, running against the local filesystem.
  • MapReduce mode
    • pig
    • enters MapReduce mode, where jobs are submitted to the Hadoop cluster (see the sketch below).
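
A sketch of launching each mode from the shell (assuming pig is on the PATH; myscript.pig is a hypothetical script file):

    pig -x local                # local mode: Grunt shell against the local filesystem
    pig                         # MapReduce mode (the default): Grunt shell against the cluster
    pig -x local myscript.pig   # run a script file non-interactively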

Pig Latin Example

  • A = LOAD 'myserver.log' USING PigStorage() AS (ipaddress:chararray, timestamp:int, url:chararray);
  • A = LOAD 'myserver.log' USING PigStorage(); -- the same load, without a schema
  • B = GROUP A BY ipaddress;
  • C = FOREACH B GENERATE group, COUNT(A); -- after a GROUP, the grouping key is referred to as "group"
  • STORE C INTO 'output.txt';
  • DUMP C;
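
The same example as one consolidated script; the space delimiter passed to PigStorage is an assumption about the log format:

    A = LOAD 'myserver.log' USING PigStorage(' ') AS (ipaddress:chararray, timestamp:int, url:chararray);
    B = GROUP A BY ipaddress;                 -- one bag of log records per IP
    C = FOREACH B GENERATE group, COUNT(A);   -- number of records for each IP
    STORE C INTO 'output.txt';                -- note: this creates a directory named output.txt
    -- DUMP C;                                -- or print the result to the screen instead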

Terminology

  • atom : any single value, e.g. 123 or abc
  • tuple : an ordered collection of atoms, e.g. (123, abc, xyz)
  • bag : a collection of tuples, e.g. {(123,abc,xyz), (sdksjd,122,skd)}
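
How these nest in practice: grouping a relation produces, for each key, a tuple whose second field is a bag of the original tuples (relation A and its fields are hypothetical):

    -- A has fields (name:chararray, age:int)
    B = GROUP A BY age;
    -- each row of B is a tuple of the form (group, {bag of A tuples with that age})
    -- e.g. (25, {(abc,25), (xyz,25)})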

Transformations in Pig

  • SAMPLE
    • to get a random sample of a dataset.
    • x = SAMPLE c 0.01; -- approximately 1% of c goes into x
  • LIMIT
    • to limit the number of records.
    • x = LIMIT c 3;
    • gets only 3 records from c and puts them in x.
    • (can fetch any 3 records, not necessarily the same set every time)
  • ORDER
    • to sort the dataset by one or more columns, in ascending or descending order.
    • x = ORDER c BY f1 ASC;
    • sorts c by column f1 in ascending order.
  • JOIN
    • to join two or more datasets into a single dataset.
    • x = JOIN a BY fieldInA, b BY fieldInB, c BY fieldInC;
  • GROUP
    • used to group the dataset based on a field.
    • B = GROUP A BY age;
  • UNION
    • combination of two or more datasets.
    • a = LOAD 'file1.txt' USING PigStorage(',') AS (field1:int, field2:int, field3:int);
    • b = LOAD 'file2.txt' USING PigStorage(',') AS (anotherfield1:int, anotherfield2:int, anotherfield3:int);
    • c = UNION a, b; -- UNION works only if both datasets have the same number of fields, with the same datatypes in all columns.
  • DISTINCT
    • removes duplicate tuples.
    • d = DISTINCT c;
  • FILTER
    • selects the tuples that satisfy a condition.
    • f = FILTER c BY f1 > 3;
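
A sketch chaining several of these transformations together (file1.txt and its schema are assumptions):

    a      = LOAD 'file1.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
    big    = FILTER a BY f1 > 3;     -- keep rows where f1 > 3
    uniq   = DISTINCT big;           -- drop duplicate tuples
    sorted = ORDER uniq BY f1 DESC;  -- sort descending by f1
    top3   = LIMIT sorted 3;         -- keep only 3 records
    DUMP top3;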

Pig Usage

  • processing logs generated by servers.
  • data processing for search platforms.
  • ad hoc queries across large clusters.
