Published December 18, 2017 by

Apache Sqoop Overview & Import Data From MySQL to HDFS



Overview on Sqoop

Sqoop is open-source software from Apache used to transfer data between relational databases (such as Oracle, SQL Server, and MySQL) and HDFS.

MySQL Database
Connecting to the MySQL database in the Cloudera VM:

root user: root/cloudera
other user: cloudera/cloudera

[cloudera@quickstart ~]$ mysql -u root -p
Published December 18, 2017 by

Streaming data into Hadoop using Apache Flume


Flume: Flume is Hadoop ecosystem software used for streaming log files from applications into HDFS.

In this post, let's discuss the following topics.
  • Overview on Flume
  • Streaming log files data into HDFS
  • Streaming Twitter App logs into HDFS
Published December 16, 2017 by

Overview on Apache Pig



What is Pig?

  • Implemented by Yahoo.
  • Pig is Hadoop ecosystem software from the Apache foundation used for analysing data.
  • Pig uses the Pig Latin language.
  • Pig Latin is a data flow language.
  • Handles structured, semi-structured, and unstructured data.
  • A replacement for MapReduce (though not 100%).
  • Pig internally uses MapReduce.
Published December 16, 2017 by

Apache Hive UDFs (User Defined Functions)


In this post, let's discuss Hive UDFs.

  • Creating a UDF
  • Packaging the UDF (creating a jar file)
  • Adding the jar file to Hive
  • Testing the UDF


Steps to create and test UDFs

1) Implement the code for the UDF in Java
2) Package the Java class into a jar file and copy it to some location
3) Add the jar file to the Hive CLI
4) Create a temporary function in Hive
5) Use the Hive UDF in a query

Prerequisite: The table should have some data.

Problem statement-1
Find the maximum marks obtained by a student across four subjects.
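The Java code for step 1 is not shown in this excerpt, so the following is only a minimal sketch of what such a UDF might look like. It assumes the marks are integers and keeps the class free of Hive dependencies so it compiles with the JDK alone. In a real build the class would be declared public, live in the `udfs` package (to match 'udfs.GetMaxMarks'), and extend `org.apache.hadoop.hive.ql.exec.UDF` from the hive-exec jar.

```java
// Sketch only: the Hive base class and the package declaration are omitted
// (assumption) so that this compiles without hive-exec on the classpath.
class GetMaxMarks {

    // Hive resolves a simple UDF by looking for a method named evaluate.
    // Returns the largest of the four subject marks.
    public int evaluate(int m1, int m2, int m3, int m4) {
        return Math.max(Math.max(m1, m2), Math.max(m3, m4));
    }

    public static void main(String[] args) {
        // Same inputs as the query getmaxmarks(10,20,30,40).
        System.out.println(new GetMaxMarks().evaluate(10, 20, 30, 40)); // prints 40
    }
}
```

Taking four separate arguments mirrors the four-argument call used to test the function in Hive; a real table would pass its four subject columns instead.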

Package the Java class into a jar file and copy it to some location.

Select the class in Eclipse --> right-click --> Export --> Java --> JAR --> browse to the location --> provide a file name with the .jar extension.

Add the jar file to the Hive CLI

hive> add jar /home/cloudera/training/HiveUDFS/getMaxMarks.jar;

Create temporary function in hive

hive> create temporary function getmaxmarks as 'udfs.GetMaxMarks';

Use the Hive UDF in a query

hive> select getmaxmarks(10,20,30,40) from dummy;   -- sanity test

There are 2 types of UDFs

1) Regular UDF (UDF) --> applied to each row individually and returns one value per row
2) User Defined Aggregate Function (UDAF) --> applied to a group of rows and returns one aggregated value

Problem statement-2: Find the mean of marks obtained in maths by all the students.
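The UDAF source is not included in this excerpt. As a rough sketch, a mean aggregate can follow the lifecycle of Hive's classic UDAF evaluators: `init`, `iterate`, `terminatePartial`, `merge`, and `terminate`. The class below imitates that lifecycle in plain Java. The names `Partial` and `MeanEvaluator` are hypothetical, and in a real build the outer class would extend `org.apache.hadoop.hive.ql.exec.UDAF`, with the evaluator implementing `org.apache.hadoop.hive.ql.exec.UDAFEvaluator`.

```java
// Sketch only: Hive base classes are omitted (assumption) so the example
// compiles with the JDK alone.
class GetMeanMarks {

    // Partial aggregation state handed from one stage to the next.
    static class Partial {
        double sum;
        long count;
    }

    static class MeanEvaluator {
        private Partial state;

        MeanEvaluator() { init(); }

        // Reset the state before a new group is aggregated.
        public void init() { state = new Partial(); }

        // Called once per input row; null marks are skipped.
        public boolean iterate(Double mark) {
            if (mark != null) { state.sum += mark; state.count++; }
            return true;
        }

        // Hand the partial result to the next stage (null if no rows seen).
        public Partial terminatePartial() { return state.count == 0 ? null : state; }

        // Fold a partial result from another task into this one.
        public boolean merge(Partial other) {
            if (other != null) { state.sum += other.sum; state.count += other.count; }
            return true;
        }

        // Final mean, or null when no rows were aggregated.
        public Double terminate() {
            return state.count == 0 ? null : state.sum / state.count;
        }
    }

    public static void main(String[] args) {
        MeanEvaluator a = new MeanEvaluator();
        MeanEvaluator b = new MeanEvaluator();
        a.iterate(60.0); a.iterate(80.0);   // rows seen by one task
        b.iterate(100.0);                   // rows seen by another task
        a.merge(b.terminatePartial());
        System.out.println(a.terminate());  // prints 80.0
    }
}
```

Carrying both the sum and the count in the partial state is what lets partial results from different tasks be merged correctly; averaging the per-task means would give the wrong answer when tasks see different numbers of rows.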

Package the Java class into a jar file and copy it to some location

Right-click on the package --> Export --> Java --> provide the jar file name.

Add the jar file to the Hive CLI

hive> add jar /home/cloudera/training/HiveUDFS/getMeanMarks.jar;

Create temporary function in hive

hive> create temporary function getmeanmarks as 'udaf.GetMeanMarks';

Use the function in a query

hive> select getmeanmarks(social) from t_student_record;

Published December 02, 2017 by

Popular Open Source Big Data Tools

Data has become a powerful tool in today’s society, where it translates directly into knowledge and money. Companies pay heavily to get their hands on data so that they can adjust their strategies to the wants and needs of their customers. But it doesn’t stop there! Big Data is also important for governments, where it helps run countries, for example when compiling the census.