SDET- QA Automation Techie

Software Testing Blog


Hadoop PIG Notes

 Data flow in Pig, Pig Execution Modes, Pig Usage, Pig Vs Traditional Hadoop Map Reduce(MR), Transformations in Pig   

Basics

  • Pig is a scripting platform for processing and analysing large data sets.
  • Very useful for people who do not have Java knowledge.
  • Used for high-level data flow and for processing data available on HDFS.
  • Pig is named after the animal because, like a pig, it can consume and process any type of data; it is widely used for data cleansing.
  • Whatever you write in Pig is internally converted to Map Reduce (MR) jobs.
  • Pig is a client-side installation; it need not sit on the Hadoop cluster.
  • A Pig script executes a set of commands, which are converted to Map Reduce (MR) jobs and submitted to Hadoop running locally or remotely.
  • The Hadoop cluster does not care whether a job was submitted from Pig or from some other environment.
  • Map Reduce programs are executed only when the DUMP or STORE command is called (more on this later).

Pig vs Traditional Hadoop Map Reduce (MR)

  • A lot of effort is required to write Map Reduce in Hadoop; in Pig, much less effort is required.
  • In Hadoop, we have to write a ToolRunner, Mapper, and Reducer; in Pig, none of this is mandatory, as we just write a small script (a set of commands).
  • Hadoop Map Reduce has more functionality than Pig.
  • Since in Pig we just write the script, and not separate ToolRunner, Mapper, and Reducer classes, the development effort while using Pig is much lower.
  • Pig is slightly slower than a hand-written MR job.

Components of PIG

  • Pig execution environment
    • essentially, the Hadoop cluster where the Pig script is submitted to run.
    • it can be a local or a remote Hadoop cluster.
  • Pig Latin
    • a new language, which is compiled to Map Reduce (MR) jobs.
    • increases productivity, as fewer lines of code are required.
    • good for non-Java programmers.
    • provides operations like JOIN, GROUP, FILTER, and ORDER out of the box, whereas in raw Hadoop we need to write a lot of code for a join, etc.
    • a data-flow language rather than a procedural language.

Data flow in Pig

  • LOAD the data from HDFS into the Pig program.
  • Transform the data into the appropriate format, e.g. with GROUP or JOIN, by combining two files, with FILTER, or with any other built-in function.
  • DUMP the data to the screen, or STORE it somewhere.
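
The three steps above can be sketched as one short Pig Latin script (the file name, delimiter, and schema are hypothetical):

```pig
-- 1. LOAD the data from HDFS into a relation
logs = LOAD '/data/access.log' USING PigStorage('\t')
       AS (ip:chararray, ts:int, url:chararray);

-- 2. TRANSFORM: group by IP address and count hits per IP
by_ip  = GROUP logs BY ip;
counts = FOREACH by_ip GENERATE group AS ip, COUNT(logs) AS hits;

-- 3. DUMP to the screen, or STORE the result back to HDFS
DUMP counts;
STORE counts INTO '/output/ip_counts' USING PigStorage(',');
```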

Pig Execution Modes

  • local mode
    • pig -x local
    • enters the default shell, named grunt; runs against the local file system.
  • map reduce mode
    • pig
    • enters grunt in Map Reduce mode; runs against the Hadoop cluster.

Pig Latin Example

  • A = LOAD 'myserver.log' USING PigStorage() AS (ipaddress:chararray, timestamp:int, url:chararray);
  • A = LOAD 'myserver.log' USING PigStorage(); (without a schema, fields are accessed positionally as $0, $1, ...)
  • B = GROUP A BY ipaddress;
  • C = FOREACH B GENERATE group AS ipaddress, COUNT(A);
  • STORE C INTO 'output.txt';
  • DUMP C;

Terminology

  • atom : any single value, e.g. 123 or abc
  • tuple : an ordered collection of atoms, e.g. (123, abc, xyz)
  • bag : a collection of tuples, e.g. {(123,abc,xyz), (sdksjd,122,skd)}
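
A tiny sketch of how these terms map onto a Pig relation (the file name and its contents are hypothetical):

```pig
-- Each value (1, alice) is an atom, each row is a tuple,
-- and the relation A as a whole is a bag of tuples.
A = LOAD 'people.txt' USING PigStorage(',') AS (id:int, name:chararray);
DUMP A;
-- prints one tuple per line, e.g. (1,alice)
```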

Transformations in Pig

  • SAMPLE
    • to take a random sample of a data set.
    • x = SAMPLE c 0.01; puts approximately 1% of c into x.
  • LIMIT
    • to limit the number of records.
    • x = LIMIT c 3;
    • gets only 3 records from c and puts them in x.
    • may fetch any random 3, not the exact same set of records every time.
  • ORDER
    • to sort records by a column in ascending or descending order.
    • x = ORDER c BY f1 ASC;
    • sorts c by column f1 in ascending order.
  • JOIN
    • to join two or more data sets into a single data set.
    • x = JOIN a BY fieldInA, b BY fieldInB, c BY fieldInC;
  • GROUP
    • used to group a data set on a field.
    • B = GROUP A BY age;
  • UNION
    • combines two or more data sets.
    • a = LOAD 'file1.txt' USING PigStorage(',') AS (field1:int, field2:int, field3:int);
    • b = LOAD 'file2.txt' USING PigStorage(',') AS (anotherfield1:int, anotherfield2:int, anotherfield3:int);
    • c = UNION a, b; UNION works when both relations have the same number of fields with the same data types.
    • d = DISTINCT c; removes duplicate tuples.
    • f = FILTER c BY field1 > 3; keeps only tuples whose field1 exceeds 3.
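
The transformations above can be chained into a single script (file names and fields are hypothetical, reusing the UNION example's schema):

```pig
a = LOAD 'file1.txt' USING PigStorage(',') AS (field1:int, field2:int, field3:int);
b = LOAD 'file2.txt' USING PigStorage(',') AS (field1:int, field2:int, field3:int);

c = UNION a, b;                  -- combine both data sets (same schema)
d = DISTINCT c;                  -- drop duplicate tuples
e = FILTER d BY field1 > 3;      -- keep tuples whose field1 exceeds 3
f = ORDER e BY field1 ASC;       -- sort ascending on field1
g = LIMIT f 3;                   -- keep at most 3 tuples
DUMP g;
```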

Pig Usage

  • processing logs generated from servers.
  • data processing for search platforms.
  • ad-hoc queries across large data sets on a cluster.