Overview of Apache Pig

What is Pig?

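  • Originally developed at Yahoo; now an Apache Software Foundation project.
  • Part of the Hadoop ecosystem; used for analysing data.
  • Scripts are written in the Pig Latin language.
  • Pig Latin is a data flow language: a script is a sequence of steps, each producing a new relation from the previous one.
  • Handles structured, semi-structured and unstructured data.
  • An alternative to hand-written MapReduce code (not a 100% replacement).
  • Internally, Pig translates Pig Latin scripts into MapReduce jobs.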
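
To make "data flow language" concrete, here is a minimal sketch of a Pig Latin script in which each statement feeds the next (LOAD, then FILTER, then GROUP, then FOREACH). The sample rows, the file name flow.pig, and the aliases adults, by_city and totals are made up for illustration; the column layout matches the customers schema used later in these notes. The shell wrapper only writes the files and runs the script in local mode if a pig command happens to be on the PATH.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing in the home directory is touched.
workdir=$(mktemp -d)

# Invented sample data: id,name,age,address,salary
cat > "$workdir/customers.txt" <<'EOF'
1,Asha,32,Delhi,2000
2,Ravi,25,Mumbai,1500
3,Meena,41,Delhi,3000
4,John,19,Mumbai,1000
EOF

# The Pig Latin data flow. Each alias is a relation built from the previous one.
cat > "$workdir/flow.pig" <<'EOF'
-- Step 1: read comma-separated rows and attach a schema
customers = LOAD 'customers.txt' USING PigStorage(',')
            AS (id:int, name:chararray, age:int, address:chararray, salary:int);
-- Step 2: keep customers aged 25 or older
adults = FILTER customers BY age >= 25;
-- Step 3: group the remaining rows by city
by_city = GROUP adults BY address;
-- Step 4: one output row per city with the summed salary
totals = FOREACH by_city GENERATE group AS address, SUM(adults.salary) AS total_salary;
DUMP totals;
EOF

# Run in local mode only when Pig is installed; otherwise just report the path.
if command -v pig >/dev/null 2>&1; then
    (cd "$workdir" && pig -x local flow.pig)
else
    echo "pig not found; script written to $workdir/flow.pig"
fi
```

Nothing runs until DUMP is reached; the earlier statements only describe the flow, which Pig then turns into one or more jobs.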

Components/architecture of Pig:

Pig Data Model


Pig Execution Modes (how to connect to Pig)

1) Local mode ---> Data is read from the local file system. Commands run locally on a single machine.

[cloudera@quickstart ~]$ pig -x local

2) MapReduce mode (HDFS mode) ---> Data must be in HDFS. Commands run as MapReduce jobs on Hadoop.

[cloudera@quickstart ~]$ pig -x mapreduce   (or)    [cloudera@quickstart ~]$ pig       ---> MapReduce is the default mode

Execution Mechanisms (the ways to execute Pig scripts)

1) Interactive mode (in the Grunt shell)
2) Batch mode (at the Unix/Linux prompt)

Interactive mode (in the Grunt shell)

grunt> customers = LOAD '/home/cloudera/training/pigdata/customers.txt' USING PigStorage(',');
grunt> DUMP customers;

Batch mode (at the Unix/Linux prompt)

1) Local mode

[cloudera@quickstart ~]$ cat pig_local.pig
customers= LOAD '/home/cloudera/training/pigdata/customers.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,address:chararray,salary:int);
dump customers;

[cloudera@quickstart ~]$ pig -x local pig_local.pig

2) MapReduce mode (HDFS mode)

[cloudera@quickstart ~]$ cat pig_global.pig
customers= LOAD '/user/cloudera/customers.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,address:chararray,salary:int);
dump customers;

[cloudera@quickstart ~]$ pig -x mapreduce pig_global.pig

[cloudera@quickstart ~]$ hdfs dfsadmin -safemode leave           ---> Optional: needed only if the NameNode is in safe mode
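
Because MapReduce mode reads from HDFS, the input file has to be copied there before pig_global.pig can find it. The sketch below lists the steps in order, assuming the local and HDFS paths used in these notes. The run helper, the stage_and_run function and the DRY_RUN variable are hypothetical names introduced for illustration: by default the script only prints the commands, and it executes them only when DRY_RUN=0 is set.

```shell
#!/bin/sh
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 also executes them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo a command, then run it unless in dry-run mode.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

LOCAL_FILE=/home/cloudera/training/pigdata/customers.txt   # local path from the notes
HDFS_FILE=/user/cloudera/customers.txt                     # path loaded by pig_global.pig

stage_and_run() {
    # Report whether the NameNode is in safe mode (writes fail while it is ON).
    run hdfs dfsadmin -safemode get
    # Copy the local file into HDFS, overwriting any earlier copy.
    run hdfs dfs -put -f "$LOCAL_FILE" "$HDFS_FILE"
    # Confirm the file is where the script expects it.
    run hdfs dfs -ls "$HDFS_FILE"
    # Run the script in MapReduce mode.
    run pig -x mapreduce pig_global.pig
}

stage_and_run
```

To execute the steps for real on the quickstart VM, the script could be started with DRY_RUN=0 in the environment.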

Note: By convention, Pig script files use the .pig extension.