Apache Hive Notes


Apache Hive
  • Hive is a data warehouse infrastructure tool to process structured data in Hadoop. 
  • Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.
  • It stores the schema in a database (the metastore) and the processed data in HDFS.
  • It is designed for OLAP, not for OLTP.
  • It provides an SQL-like language for querying called HiveQL or HQL.
  • Hive is not an RDBMS.
  • Hive is a system for managing and querying data by imposing a structured format on files stored in HDFS. It uses the MapReduce concept for execution.
Hive clients

Command Line Interface (CLI client)
Hive CLI
Beeline
Web Browser (GUI)
http://192.168.17.144:8888/

Hue : a web-based application that provides an interface to interact with the Hadoop ecosystem.
http://localhost:8888/


Hive Databases & Tables

Create database in Hive

We can connect to Hive in 3 different ways:
1) Hive CLI
2) Beeline
3) Hue ( browser application)

IP: 192.168.13.132

Browser Hue Application:  http://192.168.13.132:8888/accounts/login/?next=/
Login : cloudera/cloudera

Hive databases

Display databases:

hive> show databases;

Creating new databases:

hive> create database userdb;
hive> create schema userdb1;      (SCHEMA is a synonym for DATABASE in HiveQL)

Pointing to specific database

hive> use userdb;

Display the current database

hive> set hive.cli.print.current.db=true;
hive (userdb)>

dropping databases

hive (userdb)> drop database userdb1;
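If a database still contains tables, a plain drop fails. Adding CASCADE drops the tables along with the database (shown here on a hypothetical non-empty database):

hive (userdb)> drop database userdb1 cascade;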

Hive data types

Numeric Types:
TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL
String Types:
CHAR, VARCHAR, STRING
Date/Time Types:
TIMESTAMP, DATE
Complex Types:
ARRAY, MAP, STRUCT


Hive Tables

1) Internal Tables (Managed tables)
2) External Tables

Internal Tables (Managed Tables)

  • First we have to create the table and then load the data.
  • We can call this "data on schema".
  • Dropping this table removes both the data and the schema.
  • The table is stored under /user/hive/warehouse.
  • Use when the data is temporary.
  • Use when Hive should manage the table data completely, not allowing any external source to use the table (external sources: Sqoop, Pig, MapReduce, etc.).
  • Use when you don't need the data after dropping the table.
External Tables 
  • Data is already available in HDFS; the table is created on top of that data.
  • We can call this "schema on data".
  • Dropping the table drops only the schema; the data remains in HDFS as before.
External tables provide an option to create multiple schemas for the data stored in HDFS,
instead of deleting the data every time the schema changes.

  • Use when the data to be processed is already available in HDFS.
  • Useful when the files are being used outside of Hive.
Creating Internal tables & loading data

1) create a table and insert data
2) create a table then load the data from Local File System
3) create a table then load the data from HDFS
4) create a table based on another table
5) create a table then load the data from another table

Method-1

hive (userdb)> create table employees(empid int,ename string,salary int,job string,deptno int);
hive (userdb)> insert into employees values(101,'pavan',60000,'engineer',10);
hive (userdb)> select * from employees;

Method-2

hive (userdb)> create table IF NOT EXISTS employee(eid int,name string,salary string,designation string) 
               ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

hive (userdb)> LOAD DATA LOCAL INPATH '/home/cloudera/training/hivedata/emp.txt' OVERWRITE INTO TABLE employee;
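For reference, a minimal emp.txt matching this comma-delimited schema might look like (hypothetical sample data):

1201,gopal,45000,technical manager
1202,manisha,45000,proof reader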

Method-3

-> copy the file into HDFS (note: LOAD DATA INPATH without LOCAL moves the file from its HDFS location into the table's directory)
[cloudera@quickstart hivedata]$ hdfs dfs -put /home/cloudera/training/hivedata/emp.txt /user/hive/warehouse/

hive (userdb)> Create table IF NOT EXISTS employee ( eid int, name String,salary String, designation String) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

hive (userdb)> LOAD DATA INPATH '/user/hive/warehouse/emp.txt' OVERWRITE INTO TABLE employee;

Method-4

->Describe table
hive (userdb)> describe employee;

hive (userdb)> create table employee_new as select eid,name from employee;

Method-5

hive (userdb)> create table empnames(names string)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

hive (userdb)> insert into table empnames select name from employee;

External tables
  • External tables are created on top of existing data.
  • If the table is deleted, the data still persists in HDFS.
1) create external table and insert data.
2) create external table and load the data from Local file system.
3) create external table and load the data from HDFS.

Method-1

hive> create EXTERNAL table employee_external(eid int,name string,salary string,designation string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' LOCATION '/user/training/employee_external';

hive> insert into employee_external values(111,'pavan','50000','manager');

Method-2

hive> create EXTERNAL table employee_external(eid int,name string,salary string,designation string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' LOCATION '/user/training/employee_external';

hive> LOAD DATA LOCAL INPATH '/home/cloudera/training/hivedata/emp.txt' INTO TABLE employee_external;

hive> LOAD DATA LOCAL INPATH '/home/cloudera/training/hivedata/emp.txt' OVERWRITE INTO TABLE employee_external;

** If you use OVERWRITE in the load statement, the existing data will be overwritten.
** Otherwise, a copy of the file is added alongside the existing files.
** If you drop the table, the data files will not be removed from HDFS.

Method-3

hive> create EXTERNAL table employee_external(eid int,name string,salary string,designation string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' LOCATION '/user/training/employee_external';

hive> LOAD DATA INPATH '/user/training/emp.txt' INTO TABLE employee_external;

describe

hive> describe employee_external;
OK
eid                     int
name                    string
salary                  string
designation             string
Time taken: 0.09 seconds, Fetched: 4 row(s)

hive> describe extended employee_external;
OK
eid                     int
name                    string
salary                  string
designation             string


Complex Data Types

array  : stores multiple elements of the same data type
map    : stores key-value pairs
struct : stores a group of named fields, possibly of different types

Array

hive> create table arrayTable(id int,name string,sal bigint,sub array<string>,city string)
row format delimited fields terminated by ',' collection items terminated by '$';
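Reconstructed from the select output below, arrrayfile.txt looks like:

1,abc,40000,a$b$c,hyd
2,def,3000,d$f,bang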


hive> load data local inpath '/home/cloudera/training/hivedata/arrrayfile.txt' into table arrayTable;

hive> select * from arrayTable;
OK
1       abc     40000   ["a","b","c"]   hyd
2       def     3000    ["d","f"]       bang
Time taken: 1.029 seconds, Fetched: 2 row(s)
hive> select sub[2] from arrayTable;
OK
c
NULL
Time taken: 0.402 seconds, Fetched: 2 row(s)
hive> select sub[1] from arrayTable;
OK
b
f
Time taken: 0.244 seconds, Fetched: 2 row(s)
hive>

Map

create table maptable(id int,name string,sal bigint,sub array<string>,pfund map<string,int>,city string) row format delimited fields terminated by ',' 
collection items terminated by '$' map keys terminated by '#';
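No query output is shown for this table; a mapfile.txt consistent with the delimiters above might look like (hypothetical data, reusing the rows from the array example):

1,abc,40000,a$b$c,pf#500$epf#200,hyd
2,def,3000,d$f,pf#500,bang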

hive> load data local inpath '/home/cloudera/training/hivedata/mapfile.txt' into table maptable;


hive> select * From maptable;
hive> select pfund["pf"],pfund["epf"] from maptable;


Struct

hive> create table structtable(id int,name string,sal bigint,sub array<string>,dud map<string,int>,
address struct<city:string,state:string,pin:bigint>)
row format delimited fields terminated by ',' collection items terminated by '$' map keys terminated by '#';
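Reconstructed from the select output below, structfile.txt looks like:

1,abc,40000,a$b$c,pf#500$epf#200,hyd$ap$500001
2,def,3000,d$f,pf#500,bang$kar$600038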


hive> load data local inpath '/home/cloudera/training/hivedata/structfile.txt' into table structtable;


hive> select * From structtable;
OK
1       abc     40000   ["a","b","c"]   {"pf":500,"epf":200}    {"city":"hyd","state":"ap","pin":500001}
2       def     3000    ["d","f"]       {"pf":500}      {"city":"bang","state":"kar","pin":600038}
Time taken: 0.136 seconds, Fetched: 2 row(s)
hive> select address.city from structtable;
OK
hyd
bang
Time taken: 0.14 seconds, Fetched: 2 row(s)
hive> select address.pin from structtable;
OK
500001
600038
Time taken: 0.13 seconds, Fetched: 2 row(s)
hive>

Partitioning

1) create Hive table

hive> create table all_students(sno int,sname string, passout_year string) row format delimited fields terminated by ',';

2) Load the data

hive> load data local inpath '/home/cloudera/training/hivedata/Students.csv' into table all_students;

3) create the partitioned table

hive> create table students_part(sno int,sname string) PARTITIONED BY(passout_year string);

4) For dynamic partitioning we have to set this property:
set hive.exec.dynamic.partition.mode=nonstrict;

5) Load data into the partitioned table

hive> INSERT OVERWRITE TABLE students_part PARTITION(passout_year) SELECT sno,sname, passout_year from all_students;
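Each distinct passout_year value becomes its own subdirectory under the table's location, following Hive's column=value naming (e.g. passout_year=2011). A quick way to verify (path assumes the default warehouse location used elsewhere in these notes):

[cloudera@quickstart hivedata]$ hdfs dfs -ls /user/hive/warehouse/students_part/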


Bucketing

1) create Hive table

hive> create table all_students(sno int,sname string, passout_year string) row format delimited fields terminated by ',';

2) Load the data

hive> load data local inpath '/home/cloudera/training/hivedata/Students.csv' into table all_students;

3) creating bucket table:
create table students_buck(sno int,sname string, passout_year string) clustered by (passout_year) sorted by (sno) into 3 buckets;

4) Enable bucketing table:
set hive.enforce.bucketing=true;

5) Insert data into bucket table:
insert overwrite table students_buck select * from all_students;

** After bucketing, the splits will be stored in /user/hive/warehouse/students_buck
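Rows are assigned to buckets by hashing the clustered-by column: bucket = hash(passout_year) mod 3. Listing the table directory should show three bucket files (command assumes the same warehouse path as above):

[cloudera@quickstart hivedata]$ hdfs dfs -ls /user/hive/warehouse/students_buck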

6) Test Bucket table:
select * from students_buck tablesample(bucket 2 out of 3 on sname);

7) Test normal table:
select * from all_students limit 10;

-------------------------------------
Alter table - rename table name, rename column
Views & Indexes
Select ..where 
Select ..order by
Select..Group By
-----------------------------

Renaming table

hive> alter table all_students rename to students;

Renaming the column

hive> alter table students change passout_year pyear String;

View ---> A view is a logical object; it can be created over one or more columns of a table.

Creating view

hive> create view students_view as select sno,pyear from students where pyear='2011';

hive> select * from students_view;
OK
5       2011
6       2011
7       2011
18      2011
19      2011
20      2011

Dropping view

hive> drop view students_view;

Indexes

Index--> improves performance when fetching data from tables.

Creating index

hive> create index sno_idx on table students(sno) as 'compact' with deferred rebuild;

hive> show index on students;

Dropping Index

hive> drop index sno_idx on students;

HQL--Select with where clause

where --> used for filtering rows based on a condition.

Display the students who passed in the year 2012.
hive> select * from students where pyear='2012';

HQL-Select with order by clause

Display students in ascending order based on pyear.

hive> select * from students order by pyear;

HQL-Select with group by clause

Count the number of students who passed in each year.

hive> select pyear,count(*) from students group by pyear;

Group by with having clause

having--> applies a condition on top of the group by result.

hive> select pyear,count(*) from students group by pyear having count(*)>=5;


Hive Built-in functions

dummy --> a helper table with one row and one column.

hive> select * From dummy;
OK
xxx
Time taken: 2.575 seconds, Fetched: 1 row(s)
hive> describe dummy;
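The notes don't show how the dummy table was created; a minimal sketch that would produce the output above:

hive> create table dummy(x string);
hive> insert into dummy values('xxx');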


hive> select concat('abc','xyz') from dummy;
hive> select concat(sname,pyear) from students;

hive> select substr('welcome',2,4) from dummy;
elco
hive> select substr('welcome',3,2) from dummy;
lc
hive> select substr(sname,1,4) from students;

hive> select upper('welcome') from dummy;
WELCOME

hive> select ucase('welcome') from dummy;
WELCOME

hive> select lower('WELCOME') from dummy;
welcome

hive> select lcase('WELCOME') from dummy;
welcome

hive> select upper(sname) from students;

hive> select lower(sname) from students;

hive> select '   welcome   ' from dummy;
   welcome
hive> select trim('   welcome   ') from dummy;
welcome

hive> select rtrim('   welcome   ') from dummy;
   welcome
hive> select ltrim('   welcome   ') from dummy;
welcome

hive> select CURRENT_DATE from dummy;
2017-12-28

hive> select year(CURRENT_DATE) from dummy;
hive> select month(CURRENT_DATE) from dummy;
hive> select day(CURRENT_DATE) from dummy;


Aggregation Functions

Aggregation functions are applied over a group of rows and return a single value.

Tables: owners , cars

hive> create table owners(id int,name string,age int,car_id int) row format delimited fields terminated by ',' 
lines terminated by '\n';

hive> load data local inpath '/home/cloudera/training/hivedata/owners.txt' into table owners;

hive> create table cars(id int,make string,model string,year string)row format delimited fields terminated by ',' 
lines terminated by '\n';

hive> load data local inpath '/home/cloudera/training/hivedata/cars.txt' into table cars;


count(*)  --> returns the total number of rows, including nulls.
count(expr) --> returns the number of rows for which the supplied expression is non-null.

sum(col)--> returns the sum of the elements in the group.
sum(distinct col) --> returns the sum of the unique elements in the group (eliminates duplicates).

avg(col)---> returns the average value of the elements in the group.
min(col)--> returns the minimum value in the group.
max(col)--> returns the maximum value in the group.
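A couple of illustrative queries against the owners table (not from the original run; column names as defined above):

hive> select count(*), min(age), max(age), avg(age) from owners;
hive> select count(distinct car_id) from owners;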

Query: table : cars

I want to see the car makers that have more than 1 model in the cars table.

hive> select make,count(distinct model) from cars group by make having count(distinct model)>1;


Hive User Defined Functions

select count(*) from customers;

Steps to create and test UDF's

1) Implement the code for the UDF in Java.
2) Package the Java class into a jar file and copy it to some location.
3) Add the jar file into the Hive CLI.
4) Create a temporary function in Hive.
5) Use the Hive UDF in a query.

Prerequisite: the table should have some data.

Problem statement-1

Find the maximum marks obtained out of four subjects by a student.
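The notes skip step 1 (the Java code); a minimal sketch of the UDF class, assuming the legacy org.apache.hadoop.hive.ql.exec.UDF API, int marks columns, and the udfs.GetMaxMarks name registered below:

package udfs;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;

// Regular UDF: evaluate() is called once per row.
public class GetMaxMarks extends UDF {
    public IntWritable evaluate(IntWritable m1, IntWritable m2,
                                IntWritable m3, IntWritable m4) {
        // Propagate nulls the way the built-in functions do.
        if (m1 == null || m2 == null || m3 == null || m4 == null) {
            return null;
        }
        int max = Math.max(Math.max(m1.get(), m2.get()),
                           Math.max(m3.get(), m4.get()));
        return new IntWritable(max);
    }
}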

Package the Java class into a jar file and copy it to some location.
-----------
SELECT THE CLASS IN ECLIPSE-->RIGHT-CLICK-->EXPORT-->JAVA-->JAR-->BROWSE THE LOCATION-->PROVIDE A FILENAME WITH .JAR EXTENSION.

Add the jar file into the Hive CLI
---------------------
hive> add jar /home/cloudera/training/HiveUDFS/getMaxMarks.jar;

Create temporary function in hive
-----------------
hive> create temporary function getmaxmarks as 'udfs.GetMaxMarks';

Use the Hive UDF in a query
-------------
hive> select getmaxmarks(10,20,30,40) from dummy;   -- sanity test


2 types of UDF

1) Regular UDF (UDF) ---> applied row by row; returns one output value per input row.
2) User Defined Aggregate Function (UDAF) --> aggregates a group of rows into a single result.

Problem statement-2: Find the mean of the marks obtained in maths by all the students.
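The notes skip the Java step here as well; a minimal sketch of the UDAF class, assuming the legacy UDAF/UDAFEvaluator API, the udaf.GetMeanMarks name registered below, and an int marks column:

package udaf;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

public class GetMeanMarks extends UDAF {

    // Intermediate state shared between map-side and reduce-side aggregation.
    public static class MeanState {
        public long sum;
        public long count;
    }

    public static class MeanEvaluator implements UDAFEvaluator {
        private MeanState state;

        public MeanEvaluator() {
            state = new MeanState();
            init();
        }

        // Reset state before aggregating a new group.
        public void init() {
            state.sum = 0;
            state.count = 0;
        }

        // Called once per row with the column value.
        public boolean iterate(Integer marks) {
            if (marks != null) {
                state.sum += marks;
                state.count++;
            }
            return true;
        }

        // Partial aggregation result handed to combiners/reducers.
        public MeanState terminatePartial() {
            return state;
        }

        // Merge a partial result produced by another task.
        public boolean merge(MeanState other) {
            if (other != null) {
                state.sum += other.sum;
                state.count += other.count;
            }
            return true;
        }

        // Final result for the group.
        public Double terminate() {
            return state.count == 0 ? null : (double) state.sum / state.count;
        }
    }
}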

Package the Java class into a jar file and copy it to some location.
----------
Right-click on the package-->Export-->Java-->JAR-->provide the jar file name.

Add the jar file into the Hive CLI
-----
hive> add jar /home/cloudera/training/HiveUDFS/getMeanMarks.jar;

Create temporary function in hive
----------------
hive> create temporary function getmeanmarks as 'udaf.GetMeanMarks';

Use functions with queries
-----------------------
hive> select getmeanmarks(social) from t_student_record;


Hive clients
------------------
1) hive CLI 
2) beeline
3) Hue ( Web UI) ; http://192.168.13.135:8888
4) JDBC Client ( Java program)
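A minimal sketch of option 4, assuming HiveServer2 is listening on its default port 10000 (host, database, and credentials below are placeholders for this VM):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Connects to HiveServer2 over JDBC and runs a simple query.
public class HiveJdbcClient {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://192.168.13.135:10000/default", "cloudera", "cloudera");
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("select * from students limit 5");
        while (rs.next()) {
            System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
        }
        con.close();
    }
}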

Hive Joins:
Joins are used for getting data from multiple tables.

1) Inner join
2) Left outer join
3) right outer join
4) full outer join

T1: 1 3 5 7 (left table keys)
T2: 1 2 5 8 9 (right table keys)

inner join: 1,5 (only the records whose keys appear in both tables)

left outer join: 1,5,3,7 (common records + the extra records from the left table; missing right-side values come back as nulls)
right outer join: 1,5,2,8,9 (common records + the extra records from the right table; missing left-side values come back as nulls)

full outer join: 1,5,3,7,2,8,9 (common records + the extra records from both tables, with nulls for whichever side is missing)

Tables
-----------------
orders
customers

hive> create table customers(id int,name string,age int,address string,salary int) row format delimited fields terminated by ',' lines terminated by '\n';
hive> load data local inpath '/home/cloudera/training/hivedata/customers.txt' into table customers;

hive> create table orders(oid int,date date,customer_id int,amount int)row format delimited fields terminated by ',' lines terminated by '\n';
hive> load data local inpath '/home/cloudera/training/hivedata/orders.txt' into table orders;

Join Queries

INNER JOIN:

SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

LEFT OUTER JOIN

SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c LEFT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

RIGHT OUTER JOIN

SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c RIGHT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

FULL OUTER JOIN

SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c FULL OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);


Processing XML

hive> create table xmltable(line string);
hive> load data local inpath '/home/cloudera/training/hivedata/sampxml1.xml' into table xmltable;


hive> select * from xmltable;
OK
xmltable.line
<rec><name>Ravi</name><age>25</age><city>hyd</city></rec>
<rec><name>Rani</name><age>24</age><sex>f</sex></rec>
<rec><name>Sampath</name><sex>m</sex><city>Del</city></rec>
Time taken: 0.064 seconds, Fetched: 3 row(s)
hive> select xpath_string(line,'rec/name') from xmltable;
OK
_c0
Ravi
Rani
Sampath
Time taken: 0.108 seconds, Fetched: 3 row(s)
hive> select xpath_string(line,'rec/city') from xmltable;
OK
_c0
hyd

Del
Time taken: 0.1 seconds, Fetched: 3 row(s)
hive>
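Other xpath UDFs follow the same pattern, e.g. xpath_int for numeric values (illustrative query, not from the original run):

hive> select xpath_int(line,'rec/age') from xmltable;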

Hive WebUI (Hue application)

Generating data: http://generatedata.com/

1) Querying tables

2) Create a new table and import the data from an external file.

step 1) Data Browsers-->Metastore Tables
step 2) Click on ' + ' at the right corner.
