Checking Versions of Hadoop Ecosystem Tools

We need to know the versions of Hadoop technologies while troubleshooting Hadoop issues. This article explains how to check the versions of Hadoop ecosystem technologies.

1) Hadoop version

The Hadoop version can be found using the command below.

hadoop version

2) Hive version

The Hive version can be found using the command below.

hive --version

3) Pig version

The Pig version can be found using the command below.

pig -version

4) Sqoop version

The Sqoop version can be found using the command below.

sqoop version

5) Tez version 

The Tez version can be found using the rpm command below.

rpm -qa|grep -i tez

6) Zookeeper version

The Zookeeper version can be found using the rpm command below.

rpm -qa|grep -i zookeeper

7) Hortonworks Data Platform (HDP) version

The HDP version can be found using the command below.

hdp-select versions

8) Knox version

The Apache Knox version can be found using the rpm command below.

rpm -qa|grep -i knox

9) Ranger version

The Apache Ranger version can be found using the rpm command below.

rpm -qa|grep -i ranger

We can also check the version file to find the Ranger version.

10) Checking all versions from the Ambari Web UI

Go to Admin ---> Stack and Versions


Using Sqoop to Import Data From MySQL Into Hadoop

Sqoop is a tool in the Hadoop ecosystem that was designed to solve the problem of importing data from relational databases into Hadoop and exporting data from HDFS to relational databases. Sqoop is able to interact with relational databases such as Oracle, SQL Server, DB2, MySQL, Teradata and any other JDBC-compatible database. The ability to connect to relational databases is supported by connectors that work with JDBC drivers. JDBC drivers are proprietary software licensed by the respective system vendors, so Sqoop is not bundled with any JDBC drivers. These need to be downloaded and copied to $SQOOP_HOME/lib, where $SQOOP_HOME is the directory in which Sqoop is installed.
A configured Hadoop setup is required before installing Sqoop. If you have not installed Hadoop, please refer to the Ubuntu tutorial on how to set up a single node Hadoop cluster.
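
As an illustration, a typical import might look like the sketch below. The hostname, database, credentials, table and target directory are placeholders, not values from this article.

sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username dbuser -P \
  --table customers \
  --target-dir /user/hadoop/customers \
  --num-mappers 4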

Using Apache Flume to Stream Data Into Hadoop

Apache Flume is a tool in the Hadoop ecosystem that provides capabilities for efficiently collecting, aggregating and bringing large amounts of data into Hadoop. Examples of large amounts of data are log data, network traffic data, social media data, geo-location data, sensor and machine data and email message data. Flume provides several features to manage data. It lets users ingest data from multiple data sources into Hadoop. It protects systems from data spikes when the rate of data inflow exceeds the rate at which data can be written. Flume NG guarantees data delivery using channel-based transactions. Flume scales horizontally to process more data streams and larger data volumes.
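
A Flume agent is started from the command line and reads its source, channel and sink definitions from a properties file. A minimal sketch is shown below; the configuration directory, file name and agent name are placeholders.

flume-ng agent --conf /etc/flume/conf --conf-file example.conf --name agent1 -Dflume.root.logger=INFO,console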


Using Pig Latin to Write MapReduce Programs

In the Hadoop ecosystem Pig offers features for performing extraction, transformation and loading of data (ETL). In ETL the main objective is to acquire data, perform a set of transformations that add value to the data and load the cleansed data into a target system. Examples of transformations are removing duplicates, correcting spelling mistakes, calculating new variables and joining with other data sets. There are two components in Pig that work together in data processing. Pig Latin is the language used to express what needs to be done. An interpreter layer transforms Pig Latin programs into MapReduce or Tez jobs which are processed in Hadoop. Pig Latin is a fairly simple language that anyone with a basic understanding of SQL can use productively. Pig Latin has a standard set of functions that are used for data manipulation and that can be extended by writing user defined functions (UDFs) in Java or Python.
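
As a small example of the ETL style described above, the command below removes duplicate records from a comma-separated file in HDFS and stores the result. The paths, aliases and field names are placeholders used only for illustration.

pig -x mapreduce -e "logs = LOAD '/user/hadoop/weblogs' USING PigStorage(',') AS (ip:chararray, url:chararray); unique_logs = DISTINCT logs; STORE unique_logs INTO '/user/hadoop/weblogs_dedup';"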

Using Apache Hive to Query, Summarize and Analyze Data

Apache Hive is a project within the Hadoop ecosystem that provides data warehouse capabilities. It was not designed for processing OLTP workloads. It has features for manipulating large distributed data sets using a SQL-like language called HiveQL. This makes it suitable for extract/transform/load (ETL), reporting and data analysis problems. HiveQL queries are translated into Java MapReduce code which runs on Hadoop. The queries are executed by MapReduce, Apache Tez or Apache Spark. Hive queries run on MapReduce take a long time because of batch processing. Spark provides a way to run low latency queries and offers better performance than MapReduce without requiring any changes to the queries. Hive is able to access data stored in HDFS, HBase and Amazon S3.
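
A HiveQL query can be run directly from the command line, as in the hypothetical example below; the weblogs table and its columns are assumptions made for illustration.

hive -e "SELECT country, COUNT(*) AS visits FROM weblogs GROUP BY country ORDER BY visits DESC LIMIT 10;"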

Setting Up a Multi-Node Hadoop Cluster on AWS

This tutorial will be divided into two parts. In the first part we will demonstrate how to set up instances on Amazon Web Services (AWS). AWS is a cloud computing platform that enables us to quickly provision virtual servers. In the second part we will demonstrate how to install Hadoop on the four-node cluster we created.
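
For reference, instances can also be provisioned from the AWS command line instead of the web console. The sketch below launches four instances; the AMI ID, instance type, key pair and security group are placeholders.

aws ec2 run-instances --image-id ami-0abcdef1234567890 --count 4 --instance-type m4.large --key-name hadoop-key --security-group-ids sg-0123456789abcdef0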


Setting Up Hadoop on 4 Amazon Instances

In the first part of this tutorial, provisioning a cluster with four instances on Amazon EC2 was demonstrated. Connecting to the instances using SSH was also explained. The second part of this tutorial picks up from there and explains how to set up a Hadoop cluster with two data nodes, one primary name node and one secondary name node. Log in to your Amazon management console and start your instances.
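
Once the configuration files have been distributed to all four instances, the cluster is typically brought up with the standard Hadoop scripts sketched below (assuming a Hadoop 2.x installation with the sbin directory on the PATH).

# on the primary name node, format HDFS the first time only
hdfs namenode -format

# start the HDFS and YARN daemons across the cluster
start-dfs.sh
start-yarn.sh

# verify the daemons running on each node
jps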

Managing Files Within the Hadoop File System

Data in HDFS is stored in blocks that have a default size of 64 MB. Files that you store in HDFS are broken up into blocks and distributed throughout the cluster. The dfs.datanode.data.dir setting in hdfs-site.xml on the data nodes specifies where the blocks are stored. The dfs.replication value specifies the number of times blocks are replicated within the cluster. By default each block is replicated three times.
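
The commands below illustrate these settings in practice; the file name and HDFS path are placeholders.

# copy a local file into HDFS, where it is split into blocks and replicated
hdfs dfs -put bigfile.csv /user/hadoop/bigfile.csv

# show how the file was split into blocks and where the replicas are stored
hdfs fsck /user/hadoop/bigfile.csv -files -blocks -locations

# print the configured block size and replication factor
hdfs getconf -confKey dfs.blocksize
hdfs getconf -confKey dfs.replication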


Using Spark on Amazon EMR to Process Data

Apache Spark is a data processing framework that has been developed to process very large amounts of data very fast. The speed gains are achieved because Spark data processing happens in memory. This is a notable difference from MapReduce, which processes data stored on disk. Spark is not a replacement for MapReduce; it provides a solution for running workloads that require low latency. Spark is a very flexible framework that can run standalone, on Apache Mesos or on Hadoop YARN. To use Spark you don't need any knowledge of Hadoop. Spark can consume data from HDFS, HBase, Cassandra, Hive or any compatible Hadoop data source.
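
On EMR, a cluster with Spark installed can be created from the AWS command line and applications are then submitted with spark-submit. The sketch below uses placeholder values for the release label, instance type, key pair, application class and jar.

aws emr create-cluster --name "spark-cluster" --release-label emr-5.20.0 --applications Name=Spark --instance-type m4.large --instance-count 3 --use-default-roles --ec2-attributes KeyName=hadoop-key

spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myapp.jar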

Develop Data Models in Hive

Within the Hadoop ecosystem Hive is considered a data warehouse. This could be true or false depending on how you look at it. Considering the tool set available in Hive, it can be used as a data warehouse but its business intelligence (BI) capabilities are limited. To remedy this weakness, Simba Technologies provides an ODBC connector that enables BI tools like Tableau, Excel and SAP Business Objects to connect to Hive. Therefore it is good to look at Hive as one of the tools available for BI and analytics, instead of a replacement for them.
In this tutorial we will take a practical approach of getting data, designing a star schema, and implementing and querying the data. This tutorial assumes you have a basic understanding of Hadoop and Hive. It also assumes you have Hadoop and Hive installed and running correctly. If not, please refer to the learn how to set up Hadoop tutorial and learn how to process data with Hive for a review of basic concepts and installation.
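
A minimal sketch of a star schema in HiveQL is shown below; the table and column names are hypothetical and only illustrate a fact table joined to its dimension tables.

hive -e "
CREATE TABLE dim_product (product_id INT, product_name STRING, category STRING);
CREATE TABLE dim_date (date_id INT, calendar_date DATE, month_num INT, year_num INT);
CREATE TABLE fact_sales (product_id INT, date_id INT, quantity INT, amount DECIMAL(10,2));

SELECT d.year_num, p.category, SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_date d ON f.date_id = d.date_id
GROUP BY d.year_num, p.category;
"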

Develop Effective Data Models in HBase

To develop a data model in HBase that is scalable, you need a good understanding of the strengths and weaknesses of the database. The guiding principle is the pattern in which the data will be accessed. Simply put, the queries that will be issued against the data guide schema design. Using this approach, it is advisable to use a schema that stores data that is read together in close proximity. This tutorial builds on managing data using the NoSQL HBase database. Please refer to it for a review of basic concepts and how to install HBase on Hadoop.
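
The HBase shell sketch below illustrates the idea of designing the row key around the read pattern: a composite key of user id and timestamp keeps all events for one user close together. The table, column family and values are hypothetical.

hbase shell <<'EOF'
create 'user_events', 'd'
put 'user_events', 'user123|20180401T1030', 'd:action', 'login'
put 'user_events', 'user123|20180401T1045', 'd:action', 'purchase'
scan 'user_events', {STARTROW => 'user123', STOPROW => 'user123~'}
EOF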


Using Scala to Analyze Data in Spark

Scala is a programming language that incorporates object oriented and functional programming styles. It is one of the programming languages, along with Java and Python, that can be used to develop Spark applications. Interactive Python and Scala shells are available that enable interactive analysis of data. Scala is a comprehensive programming language that cannot be covered in a short tutorial like this. This tutorial will focus on explaining the basic structure of a Spark application and demonstrate to the reader how to quickly become productive using Spark. Some of the concepts that will be briefly demonstrated are: declaring variables, creating resilient distributed datasets (RDD) and transforming RDDs.
The structure of a Scala program is highlighted below:
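
A minimal sketch of a word-count application is given here for orientation; the object name, application name and HDFS paths are placeholders rather than code from the tutorial.

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // declare a value holding the Spark configuration and create the context
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    // create an RDD from a text file stored in HDFS
    val lines = sc.textFile("hdfs:///user/hadoop/input.txt")

    // transform the RDD: split lines into words, map to pairs, reduce by key
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    // action: write the results back to HDFS
    counts.saveAsTextFile("hdfs:///user/hadoop/wordcount-output")

    sc.stop()
  }
}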


Process Data by Creating Topologies in Storm

In part 1 of this tutorial, key concepts used in Storm were discussed. In that tutorial it was explained that Storm topologies are expressed as directed acyclic graphs. The nodes on the graphs are either bolts or spouts. Spouts represent the source of data streams; for example, a Twitter spout is used to acquire a stream of tweets. The bolts specify the logic that is used to process data. Data emitted by the spouts is processed by the bolts. In this tutorial the main objective is to demonstrate how to code topologies and submit them to Storm.
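
Once a topology has been packaged into a jar, it is submitted and managed from the command line as sketched below; the jar name, main class and topology name are placeholders.

# submit the packaged topology to the cluster
storm jar wordcount-topology.jar com.example.WordCountTopology production-topology

# list running topologies and kill one when it is no longer needed
storm list
storm kill production-topology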


Coordinate Hadoop Clusters Using Zookeeper

Hadoop was designed to be a distributed system that scales up to thousands of nodes. Even with a cluster of a few hundred nodes, managing all those servers is not easy. Problems that can kill your application, such as deadlocks, inconsistency and race conditions, arise. Zookeeper was developed to solve the challenges that arise when administering a large number of servers. To solve these challenges it provides a centralized way to manage objects required for a properly working cluster. Some of the things that can be managed are naming, configuration information, distributed synchronization and group services.
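
For example, shared configuration can be stored in znodes and inspected with the Zookeeper command line client, as sketched below; the host name, paths and values are placeholders.

# connect to a Zookeeper server
zkCli.sh -server zk1.example.com:2181

# inside the client: create and read znodes that hold shared configuration
create /myapp ""
create /myapp/config "max_connections=100"
get /myapp/config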

Schedule Jobs in Hadoop Using Oozie

Within the Hadoop ecosystem, Oozie provides services that enable jobs to be scheduled. With job scheduling you are able to organize multiple jobs into a single unit that is run sequentially. The types of jobs that are supported are MapReduce, Pig, Hive, Sqoop, Java programs and shell scripts. To support scheduling Oozie provides two types of jobs: workflow and coordinator jobs. Workflow jobs are specified as directed acyclic graphs (DAG) that are executed in sequence. Coordinator jobs are started by time and data availability triggers, after which they run in a recurring manner. Packaging many workflow jobs and managing their life cycle is done by Oozie Bundle.
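
A workflow is typically submitted with the Oozie command line client, as in the sketch below; the Oozie URL, properties file and job id are placeholders.

# submit and start a workflow described by job.properties and a workflow.xml in HDFS
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run

# check the status of a submitted job
oozie job -oozie http://oozie-host:11000/oozie -info 0000001-180101000000000-oozie-oozi-W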

Using the Apache Tez Framework to Process Data Interactively and in Batch

Within Hadoop, MapReduce has been the widely used approach to process data. In this approach data processing happens in batch mode, which can take minutes, hours or days to produce results. MapReduce is useful when waiting a long period for query results is not problematic. However, when you need query results in a few seconds, such a data processing model is no longer useful. Apache Tez is a project in the Hadoop ecosystem that was developed to address the need for interactive data processing. The project began incubation in February of 2013 and became a top level project in July of 2014. By using Tez as the data processing framework, performance gains of up to 3 times over MapReduce are achievable. Apache Hive and Apache Pig are two of the projects in Hadoop that have benefited greatly from the performance gains offered by Tez.
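
In practice, Hive and Pig jobs can be directed to run on Tez instead of MapReduce, as in the sketch below; the query, table and script names are placeholders.

# run a Hive query with Tez as the execution engine for this session
hive --hiveconf hive.execution.engine=tez -e "SELECT COUNT(*) FROM weblogs;"

# run a Pig script on Tez instead of MapReduce
pig -x tez myscript.pig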