Using Partitioning In Hive To Improve Query Performance

In previous Hive tutorials we looked at Hive as the Hadoop project that offers data warehousing features. We demonstrated how to install and configure Hive and discussed guidelines and best practices for creating data models. If you are new to these concepts, please refer to the tutorials on setting up Hive and creating effective data models in Hive for a review. In this tutorial the main focus is on how to use partitioning and bucketing to speed up queries.

Partitioning is a data management technique that makes queries run faster by dividing a large table into smaller parts. This reduces the amount of data to be scanned, and therefore the time needed to run a query. Without a partitioning strategy, every data file in the table's data directory has to be scanned and filtered, which can result in long-running queries, especially on large tables. With partitioning, only the data directories that are needed are scanned and the others are ignored.

To use partitioning to your advantage, identify low-cardinality columns that are frequently used to filter data; these columns are used to split the data into partitions. Each value (or range of values) of a partition column forms a partition, which is stored in its own subdirectory within the table's data directory. Hive supports one or multiple partition columns.
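As an illustration, a table partitioned by year stores each year's rows in its own subdirectory under the table's directory (the paths below assume the default warehouse location and a table named partitionedairlinedata):

```text
/user/hive/warehouse/partitionedairlinedata/year=1987/
/user/hive/warehouse/partitionedairlinedata/year=1988/
/user/hive/warehouse/partitionedairlinedata/year=1989/
```

A query with WHERE year = 1987 then scans only the first directory rather than the whole table.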
There are two partitioning approaches supported in Hive: static partitioning and dynamic partitioning. Static partitioning is preferable when you know the values of the partition columns before data is loaded into a Hive table. Dynamic partitioning is better when you only learn the partition column values during data load. The choice between the two is not always clear, but there are some key points to consider.
Dynamic partitioning is suitable in situations when:
  • You are loading data from a Hive table that is not yet partitioned. When such a table is small, partitioning is unnecessary, but as the data grows performance suffers. Using dynamic partitioning when loading the data into a new table resolves this.
  • The values of the partition columns are not known in advance. When you cannot enumerate the unique values in a column, you cannot use static partitioning. In such situations Hive identifies the unique values and creates the partitions automatically.
  • Due to data growth you decide to change the columns used to partition the data. This arises when the previous partitioning scheme cannot cope with data growth. It is resolved by creating a new table with the desired partition columns, loading the data into the new table using dynamic partitioning, and deleting the previous table.
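The dynamic load described above can be sketched in HiveQL. Dynamic partitioning is restricted by default, so two settings must be changed first; the staging table name and the abbreviated column list below are illustrative:

```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The partition column (Year) must come last in the SELECT list;
-- Hive reads its value per row and routes the row to the matching
-- partition, creating partitions that do not yet exist.
INSERT OVERWRITE TABLE partitionedairlinedata PARTITION (Year)
SELECT DayofMonth, DayOfWeek, DepTime, LateAircraftDelay, Year
FROM stagingairlinedata;
```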
Table partitioning is supported on both external and internal tables. In the next sections we will demonstrate how this is done.
To demonstrate how to partition data we will use airline delay data from the US Department of Transportation, available at http://stat-computing.org/dataexpo/2009/the-data.html. The data covers the years 1987 to 2008. In this tutorial we will use the data for 1987, but you can download and use the data for all the years.
We need to remove the header row from the data files and move them to an HDFS directory.
hadoop fs -mkdir /usr/local/airline1987
hadoop fs -put /home/eduonix/Downloads/1987.csv /usr/local/airline1987
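The header row can be stripped before the upload, for example with tail; the input path matches the download location above, and the output file name is illustrative (you would then put the stripped file into HDFS instead of the original):

```shell
# Write every line except the first (the header) to a new file
tail -n +2 /home/eduonix/Downloads/1987.csv > /home/eduonix/Downloads/1987_nohdr.csv
```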

Static partitioning is the default way of creating partitions. We will demonstrate it first and then cover dynamic partitioning in the next section.

With our data in CSV format we can either load it directly into a partitioned table or create a non-partitioned staging table from which data is queried and loaded into the partitioned table. For simplicity we will load the CSV file directly. Our interest is to partition the data by year, so the 1987 data forms one partition. In the CREATE TABLE statement, the PARTITIONED BY clause specifies the columns used to partition the data. Start the Hive shell and use the command below to create a partitioned table.
CREATE TABLE partitionedairlinedata
(DayofMonth INT,
DayOfWeek INT,
DepTime INT,
CRSDepTime INT,
ArrTime INT,
CRSArrTime INT,
UniqueCarrier STRING,
FlightNum INT,
TailNum STRING,
ActualElapsedTime INT,
CRSElapsedTime INT,
AirTime STRING,
ArrDelay INT,
DepDelay INT,
Origin STRING,
Distance INT,
TaxiOut STRING,
Cancelled INT,
CancellationCode STRING,
Diverted INT,
CarrierDelay STRING,
WeatherDelay STRING,
SecurityDelay STRING,
LateAircraftDelay STRING)
PARTITIONED BY (Year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';


From the command above it is important to note that the column used to partition the data (Year) is excluded from the list of regular columns and appears only in the PARTITIONED BY clause; it can still be referenced in queries like any other column. To create an external table you just need to add the EXTERNAL keyword between CREATE and TABLE.
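For example, an external partitioned table could be declared as follows; the LOCATION path is illustrative and the column list is shortened to the first few columns of the data set:

```sql
CREATE EXTERNAL TABLE externalairlinedata
(DayofMonth INT,
DayOfWeek INT,
DepTime INT)
PARTITIONED BY (Year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/usr/local/airlinedata';
```

Dropping an external table removes only the table metadata; the files under LOCATION are left in place.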

When using static partitioning each partition has to be loaded independently, so we need a separate statement for each load. The airline data runs from 1987 to 2008 and we chose year as our partitioning column, so each year's file is loaded separately. To load the 1987 data we use the command below, and we repeat this process for all the other years. Note that LOAD DATA INPATH moves (rather than copies) the file from its HDFS location into the partition's directory.

LOAD DATA INPATH '/usr/local/airline1987/1987.csv'
INTO TABLE partitionedairlinedata
PARTITION (Year = 1987);
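After the load you can confirm that the partition exists and query against it:

```sql
-- List the partitions created so far for the table
SHOW PARTITIONS partitionedairlinedata;

-- Because Year is a partition column, this filter restricts the
-- scan to the year=1987 subdirectory instead of the whole table
SELECT COUNT(*) FROM partitionedairlinedata WHERE Year = 1987;
```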

