Popular Hadoop Distributions


Many Hadoop distributions are currently available in the big data market, but the main free, open source distribution comes from the Apache Software Foundation. Other Hadoop vendors also offer free versions, alongside customized distributions tailored to the needs of client organizations. Using Apache Hadoop as the core framework, these companies build their own cluster setup tools and services and sell commercial support to organizations working with big data. These are known as commercial Hadoop distributions. The vendors provide services such as managing updates, support, training, and consulting, and they add innovations of their own that make Hadoop practical for an enterprise to run.


In the free open source market, Red Hat makes money by taking the Linux kernel (the core of an open source operating system), bundling it with all the required components, building a simple installer, and providing paid support to customers.
In the same way, many companies provide enterprise editions and paid support on top of the Apache Hadoop distribution.

Free Open Source Hadoop Distribution

Apache Hadoop
o   Core Hadoop distribution, used by all other distributions.
o   Complex cluster setup and no commercial support.
o   Manual installation and integration of Hadoop ecosystem components such as Hive, HBase, and Pig.
o   The right choice for free trials, tests, and demos.

Other Popular Hadoop Distributions

Cloudera Hadoop
o   Hadoop’s co-founder, Doug Cutting, is its chief architect.
o   Cloudera is widely regarded as the market leader in the Hadoop space, in part because it released the first commercial Hadoop distribution.
o   Highly active contributor of code to the Hadoop ecosystem.
o   Provides the Cloudera Distribution for Hadoop (CDH), packaged as parcels, as well as Cloudera Manager, a powerful management and monitoring tool for Hadoop administration.
o   Its approach is to take components it deems to be mature and retrofit them into the existing production-ready open source libraries that are included in its distribution.
o   Formed in 2008, with its core distribution based on 100% open source Apache Hadoop.
o   CDH can be downloaded from Cloudera’s website at no charge for clusters of up to 50 data nodes, but without technical support or Cloudera Manager.

Hortonworks
o   Fast-growing company, founded in 2011.
o   Another major player in the Hadoop market.
o   Spun out of Yahoo; has the largest number of committers and code contributors for the Hadoop ecosystem components.
o   Releases the Hortonworks Data Platform (HDP), which includes Hadoop as well as related tooling and projects.
o   Collaborates with major data management companies such as Teradata, Microsoft, Informatica, and SAS to provide Hadoop solutions integrated with their own product sets.
o   Uses Apache Ambari for management, Stinger for queries, and Solr for search.

Amazon Web Services Elastic MapReduce (AWS EMR) Hadoop
o   Hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
o   Provides management software and GUI support.
o   Provides enhanced data protection.

MapR Hadoop
o   Provides a complete distribution of Apache Hadoop and related projects that’s independent of the Apache Software Foundation.
o   MapR is promoted as the only Hadoop distribution that provides full data protection, no single points of failure, and significant ease-of-use advantages.
o   Replaces the underlying HDFS with its own proprietary file system, MapRFS, which is intended to improve data management efficiency, reliability, and ease of use.
o   Three MapR editions are available: M3, M5, and M7.
o   The M3 Edition is free and available for unlimited production use;
o   MapR M5 is an intermediate-level subscription software offering;
o   MapR M7 is a complete distribution for Apache Hadoop and HBase that includes Pig, Hive, Sqoop, and much more.

Pivotal Greenplum Hadoop
o   Integrates EMC’s massively parallel processing (MPP) database technology (formerly known as Greenplum, and now known as HAWQ) with Apache Hadoop
o   High-performance Hadoop distribution with true SQL processing for Hadoop.
o   SQL-based queries and other business intelligence tools can be used to analyze data that is stored in HDFS

Intel Hadoop
o   Provides excellent performance with optimizations for Intel Xeon processors, Intel SSD storage, and Intel 10GbE networking.
o   Provides data security via encryption and decryption in HDFS
o   Supports role-based access control with cell-level granularity in HBase.
o   Improved Hive query performance.
o   Support for statistical analysis with open source statistical package R, and analytical graphics through Intel Graph Builder.

IBM InfoSphere BigInsights
o   Focuses on adding value on top of the open source Hadoop stack.
o   BigInsights comes with a built-in, browser-based spreadsheet tool called BigSheets.
o   Strong support for adaptive real-time analytics, and good text analytics capabilities through AQL and JAQL.

Microsoft Hadoop on Windows Azure
o   Microsoft HDInsight is an integration of Apache Hadoop and the Hortonworks Data Platform on Microsoft's cloud platform, Windows Azure.
o   Currently supports Pig, Hive, and Sqoop

DataStax Hadoop
o   The DataStax Enterprise (DSE) big data platform consists of open source tools such as Apache Hadoop, Cassandra, Solr, Hive, Pig, and Mahout.
o   DSE is designed to manage real-time and enterprise search data in the same database cluster.
o   It also comes with OpsCenter Enterprise, which allows DSE clusters to be managed through a central web interface.


Apart from these, there are many other Hadoop distributions, all of them built on Apache Hadoop, which is open source under the Apache License, Version 2.0.
