- Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
- Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as an open-source project under the name Apache Hive.
- It stores the schema in a database (the metastore) and the processed data in HDFS.
- It is designed for OLAP not for OLTP.
- It provides SQL type language for querying called HiveQL or HQL.
- Hive is not RDBMS.
- Hive is a system for managing and querying data by projecting a structured format onto data stored in HDFS. It uses MapReduce for query execution.
Command Line Interface (CLI client)
Hive CLI
Beeline
Web Browser (GUI)
http://192.168.17.144:8888/
Hue : a web-based application that provides an interface to interact with the Hadoop ecosystem.
http://localhost:8888/
Hive Databases & Tables
Create database in Hive
We can connect to Hive in 3 different ways:
1) Hive CLI
2) Beeline
3) Hue ( browser application)
IP: 192.168.13.132
Browser Hue Application: http://192.168.13.132:8888/accounts/login/?next=/
Login : cloudera/cloudera
Hive databases
Display databases:
hive> show databases;
Creating new databases:
hive> create database userdb;
hive> create schema userdb1;
Pointing to specific database
hive> use userdb;
Display the current database
hive> set hive.cli.print.current.db=true;
hive (userdb)>
Dropping databases
hive (userdb)> drop database userdb1;
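If a database still contains tables, a plain drop fails; adding the CASCADE option drops the database together with its tables:

hive (userdb)> drop database userdb1 cascade;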
Hive data types
Numeric Types:
TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL
String Types:
CHAR, VARCHAR, STRING
Date/Time Types:
TIMESTAMP, DATE
Complex Types:
ARRAY, MAP, STRUCT
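A quick illustration of the type names in a DDL (typedemo is a hypothetical table used only for this example):

hive> create table typedemo(id int, price decimal(10,2), name varchar(50), joined date, tags array<string>);
hive> describe typedemo;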
Hive Tables
1) Internal Tables (Managed tables)
2) External Tables
Internal Tables (Managed Tables)
- First we have to create table and load the data.
- We can call this one as "data on schema".
- By dropping this table, both data and schema will be removed.
- The stored location of this table will be at /user/hive/warehouse.
- Data is temporary: dropping the table deletes it.
- Use a managed table when Hive should manage the table data completely and no external source (Sqoop, Pig, MapReduce, etc.) needs to use the files directly.
- Use it when you do not need the data after the table is dropped.
External Tables
- The data is already available in HDFS; the table is created on top of that HDFS data.
- We can call this one "schema on data".
- Dropping the table drops only the schema; the data remains in HDFS as before.
- Avoids deleting the data every time the schema is updated.
- Use an external table when the processing data is already available in HDFS.
- Useful when the files are being used outside of Hive.
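To check whether an existing table is managed or external, DESCRIBE FORMATTED reports a Table Type field:

hive> describe formatted employee;
-- look for "Table Type: MANAGED_TABLE" (or "EXTERNAL_TABLE")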
Ways to create an internal table:
1) create a table and insert data
2) create a table then load the data from Local File System
3) create a table then load the data from HDFS
4) create a table based on another table
5) create a table then load the data from another table
Method-1
hive (userdb)> create table employees(empid int,ename string,salary int,job string,deptno int);
hive (userdb)> insert into employees values(101,'pavan',60000,'engineer',10);
hive (userdb)> select * from employees;
Method-2
hive (userdb)> create table IF NOT EXISTS employee(eid int,name string,salary string,designation string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
hive (userdb)> LOAD DATA LOCAL INPATH '/home/cloudera/training/hivedata/emp.txt' OVERWRITE INTO TABLE employee;
Method-3
-> copy the file in to HDFS
[cloudera@quickstart hivedata]$ hdfs dfs -put /home/cloudera/training/hivedata/emp.txt /user/hive/warehouse/
hive (userdb)> Create table IF NOT EXISTS employee ( eid int, name String,salary String, designation String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
hive (userdb)> LOAD DATA INPATH '/user/hive/warehouse/emp.txt' OVERWRITE INTO TABLE employee;
Method-4
->Describe table
hive (userdb)> describe employee;
hive (userdb)> create table employee_new as select eid,name from employee;
Method-5
hive (userdb)> create table empnames(names string)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
hive (userdb)> insert into table empnames select name from employee;
External tables
- External tables are created on top of existing data.
- If table is deleted data still persists in HDFS.
1) create external table and insert the data.
2) create external table and load the data from Local file system.
3) create external table and load the data from HDFS.
Method-1
hive> create EXTERNAL table employee_external(eid int,name string,salary string,designation string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' LOCATION '/user/training/employee_external';
hive> insert into employee_external values(111,'pavan','50000','manager');
Method-2
hive> create EXTERNAL table employee_external(eid int,name string,salary string,designation string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' LOCATION '/user/training/employee_external';
hive> LOAD DATA LOCAL INPATH '/home/cloudera/training/hivedata/emp.txt' INTO TABLE employee_external;
hive> LOAD DATA LOCAL INPATH '/home/cloudera/training/hivedata/emp.txt' OVERWRITE INTO TABLE employee_external;
** If you use OVERWRITE in the load statement, the existing data will be overwritten.
** Otherwise, an additional copy of the file is created alongside the existing one.
** If you drop the table, the data file will not be removed from HDFS.
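That last point can be verified directly: drop the external table, then list its location; the files are still there (sketch, assuming the location used above):

hive> drop table employee_external;
[cloudera@quickstart hivedata]$ hdfs dfs -ls /user/training/employee_external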
Method-3
hive> create EXTERNAL table employee_external(eid int,name string,salary string,designation string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' LOCATION '/user/training/employee_external';
hive> LOAD DATA INPATH '/user/training/emp.txt' INTO TABLE employee_external;
describe
hive> describe employee_external;
OK
eid int
name string
salary string
designation string
Time taken: 0.09 seconds, Fetched: 4 row(s)
hive> describe extended employee_external;
OK
eid int
name string
salary string
designation string
Complex Data Types
array : stores multiple elements of the same data type.
map : stores key-value pairs.
struct : stores a group of named fields, which can be of different types.
Array
hive> create table arrayTable(id int,name string,sal bigint,sub array<string>,city string)
row format delimited fields terminated by ',' collection items terminated by '$';
hive> load data local inpath '/home/cloudera/training/hivedata/arrrayfile.txt' into table arrayTable;
hive> select * from arrayTable;
OK
1 abc 40000 ["a","b","c"] hyd
2 def 3000 ["d","f"] bang
Time taken: 1.029 seconds, Fetched: 2 row(s)
hive> select sub[2] from arrayTable;
OK
c
NULL
Time taken: 0.402 seconds, Fetched: 2 row(s)
hive> select sub[1] from arrayTable;
OK
b
f
Time taken: 0.244 seconds, Fetched: 2 row(s)
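A couple of other useful array operations (size and explode are built-in functions):

hive> select size(sub) from arrayTable;
hive> select explode(sub) as subject from arrayTable;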
Map
hive> create table maptable(id int,name string,sal bigint,sub array<string>,pfund map<string,int>)
row format delimited fields terminated by ',' collection items terminated by '$' map keys terminated by '#';
hive> load data local inpath '/home/cloudera/training/hivedata/mapfile.txt' into table maptable;
hive> select * From maptable;
hive> select pfund["pf"],pfund["epf"] from maptable;
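The built-in functions map_keys and map_values return the keys and values of a map column as arrays:

hive> select map_keys(pfund), map_values(pfund) from maptable;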
Struct
hive> create table structtable(id int,name string,sal bigint,sub array<string>,pfund map<string,int>,
address struct<city:string,state:string,pin:bigint>)
row format delimited fields terminated by ',' collection items terminated by '$' map keys terminated by '#';
hive> load data local inpath '/home/cloudera/training/hivedata/structfile.txt' into table structtable;
hive> select * From structtable;
OK
1 abc 40000 ["a","b","c"] {"pf":500,"epf":200} {"city":"hyd","state":"ap","pin":500001}
2 def 3000 ["d","f"] {"pf":500} {"city":"bang","state":"kar","pin":600038}
Time taken: 0.136 seconds, Fetched: 2 row(s)
hive> select address.city from structtable;
OK
hyd
bang
Time taken: 0.14 seconds, Fetched: 2 row(s)
hive> select address.pin from structtable;
OK
500001
600038
Time taken: 0.13 seconds, Fetched: 2 row(s)
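Struct fields can also be used in a where clause:

hive> select name from structtable where address.city='hyd';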
Partitioning
1) create Hive table
hive> create table all_students(sno int,sname string, passout_year string) row format delimited fields terminated by ',';
2) Load the data
hive> load data local inpath '/home/cloudera/training/hivedata/Students.csv' into table all_students;
3) create partition table
hive> create table students_part(sno int,sname string) PARTITIONED BY(passout_year string);
4) For dynamic partitioning we have to set this property
set hive.exec.dynamic.partition.mode=nonstrict;
5) Load data into partition table
hive> INSERT OVERWRITE TABLE students_part PARTITION(passout_year) SELECT sno,sname, passout_year from all_students;
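Once loaded, the partitions can be listed, and a query that filters on the partition column reads only the matching partition directory (partition pruning):

hive> show partitions students_part;
hive> select * from students_part where passout_year='2011';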
Bucketing
1) create Hive table
hive> create table all_students(sno int,sname string, passout_year string) row format delimited fields terminated by ',';
2) Load the data
hive> load data local inpath '/home/cloudera/training/hivedata/Students.csv' into table all_students;
3) creating bucket table:
create table students_buck(sno int,sname string, passout_year string) clustered by (passout_year) sorted by (sno) into 3 buckets;
4) Enable bucketing:
set hive.enforce.bucketing=true;
5) Insert data into bucket table:
insert overwrite table students_buck select * from all_students;
** After bucketing splits will be stored in /user/hive/warehouse/students_buck
6) Test Bucket table:
select * from students_buck tablesample(bucket 2 out of 3 on sname);
7) Test normal table:
select * from all_students limit 10;
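In tablesample(bucket x out of y on col), sampling on the same column the table is clustered by lets Hive read only the stored bucket file instead of scanning the whole table, e.g.:

hive> select * from students_buck tablesample(bucket 1 out of 3 on passout_year);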
-------------------------------------
Alter table - rename table name, rename column
Views & Indexes
Select ..where
Select ..order by
Select..Group By
-----------------------------
Renaming table
hive> alter table all_students rename to students;
Renaming the column
hive> alter table students change passout_year pyear String;
View ---> A view is a logical object; it can be created for 1 or more columns from a table.
Creating view
hive> create view students_view as select sno,pyear from students where pyear='2011';
hive> select * from students_view;
OK
5 2011
6 2011
7 2011
18 2011
19 2011
20 2011
Dropping view
hive> drop view students_view;
Indexes
Index--> improves performance when fetching data from tables.
Creating index
hive> create index sno_idx on table students(sno) as 'compact' with deferred rebuild;
hive> show index on students;
Dropping Index
hive> drop index sno_idx on students;
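An index created WITH DEFERRED REBUILD is empty until it is rebuilt, so rebuild it before expecting any benefit:

hive> alter index sno_idx on students rebuild;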
HQL--Select with where clause
where --> used for filtering the rows based on condition.
Display students who are passed in the year 2012.
hive> select * from students where pyear=2012;
HQL-Select with order by clause
Display students in ascending order based on pyear.
hive> select * from students order by pyear;
HQL-Select with group by clause
Count number of students passed in each year.
hive> select pyear,count(*) from students group by pyear;
Group by with having clause
having--> applies a condition on top of the group by result.
hive> select pyear,count(*) from students group by pyear having count(*)>=5;
Hive Built-in functions
dummy --> a table with one row and one column.
hive> select * From dummy;
OK
xxx
Time taken: 2.575 seconds, Fetched: 1 row(s)
hive> describe dummy;
hive> select concat('abc','xyz') from dummy;
hive> select concat(sname,pyear) from students;
hive> select substr('welcome',2,4) from dummy;
elco
hive> select substr('welcome',3,2) from dummy;
lc
hive> select substr(sname,1,4) from students;
hive> select upper('welcome') from dummy;
WELCOME
hive> select ucase('welcome') from dummy;
WELCOME
hive> select lower('WELCOME') from dummy;
welcome
hive> select lcase('WELCOME') from dummy;
welcome
hive> select upper(sname) from students;
hive> select lower(sname) from students;
hive> select ' welcome ' from dummy;
welcome
hive> select trim(' welcome ') from dummy;
welcome
hive> select rtrim(' welcome ') from dummy;
welcome
hive> select ltrim(' welcome ') from dummy;
welcome
hive> select CURRENT_DATE from dummy;
2017-12-28
hive> select year(CURRENT_DATE) from dummy;
hive> select month(CURRENT_DATE) from dummy;
hive> select day(CURRENT_DATE) from dummy;
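A few more date functions (date_add, date_sub and datediff are all built-ins):

hive> select date_add(CURRENT_DATE, 7) from dummy;
hive> select date_sub(CURRENT_DATE, 7) from dummy;
hive> select datediff('2018-01-01', CURRENT_DATE) from dummy;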
Aggregation Functions
Aggregate functions operate on groups of rows.
Tables: owners , cars
hive> create table owners(id int,name string,age int,car_id int) row format delimited fields terminated by ','
lines terminated by '\n';
hive> load data local inpath '/home/cloudera/training/hivedata/owners.txt' into table owners;
hive> create table cars(id int,make string,model string,year string)row format delimited fields terminated by ','
lines terminated by '\n';
hive> load data local inpath '/home/cloudera/training/hivedata/cars.txt' into table cars;
count(*) --> returns the total number of rows, including nulls.
count(expr) --> returns the number of rows for which the supplied expression is non-NULL.
sum(col) --> returns the sum of the elements in the group.
sum(distinct col) --> returns the sum of the unique elements in the group (eliminates duplicates).
avg(col) --> returns the average value of the elements in the group.
min(col) --> returns the minimum value in the group.
max(col) --> returns the maximum value in the group.
Query: table : cars
I want to see the car makers that have more than 1 model in cars table.
hive> select make,count(distinct model) from cars group by make having count(distinct model)>1;
Hive User Defined Functions
Steps to create and test UDF's
1) Implement the code for the UDF in Java
2) Package the java class into a jar file and copy it to some location
3) Add the jar file to the Hive CLI
4) Create a temporary function in hive
5) Use the hive UDF in a query
Prerequisite: the table should have some data.
Problem statement-1
Find the maximum marks obtained out of four subjects by a student.
Package the java class into a jar file and copy it to some location.
-----------
Select the class in Eclipse --> right-click --> Export --> Java --> JAR --> browse the location --> provide a file name with the .jar extension.
Add jar file in to Hive CLI
---------------------
hive> add jar /home/cloudera/training/HiveUDFS/getMaxMarks.jar;
Create temporary function in hive
-----------------
hive> create temporary function getmaxmarks as 'udfs.GetMaxMarks';
Use hive UDF BY using Query
-------------
hive> select getmaxmarks(10,20,30,40) from dummy; // sanity test
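The same UDF can then be applied row by row on a real table (student_marks and its column names are hypothetical, for illustration only):

hive> select sid, getmaxmarks(maths, physics, chemistry, social) from student_marks;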
2 types of UDFs
1) Regular UDF (UDF) ---> applied row by row to the rows of a table
2) User Defined Aggregate Function (UDAF) --> operates on a group of rows and returns an aggregated result.
Problem statement -2: Find the mean of marks obtained in maths by all the students.
Package java class into jar file copy in some location
----------
Right-click on the package --> Export --> Java --> provide the jar file name.
Add jar file in to Hive CLI
-----
hive> add jar /home/cloudera/training/HiveUDFS/getMeanMarks.jar;
Create temporary function in hive
----------------
hive> create temporary function getmeanmarks as 'udaf.GetMeanMarks';
Use functions with queries
-----------------------
hive> select getmeanmarks(social) from t_student_record;
Hive clients
------------------
1) hive CLI
2) beeline
3) Hue ( Web UI) ; http://192.168.13.135:8888
4) JDBC Client ( Java program)
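With Beeline, the connection goes through HiveServer2 over JDBC (port 10000 is the default; the host and credentials depend on the cluster):

[cloudera@quickstart ~]$ beeline
beeline> !connect jdbc:hive2://localhost:10000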
Hive Joins:
Join is used for getting the data from multiple tables.
1) Inner join
2) Left outer join
3) right outer join
4) full outer join
T1: 1 3 5 7 (left table)
T2: 1 2 5 8 9 (right table)
inner join: 1, 5 (records common to both tables)
left outer join: 1, 5, 3, 7 (common records + extra records from the left table; missing right-side values become NULL)
right outer join: 1, 5, 2, 8, 9 (common records + extra records from the right table; missing left-side values become NULL)
full outer join: 1, 5, 3, 7, 2, 8, 9 (all records from both tables; missing values on either side become NULL)
Tables
-----------------
orders
customers
hive> create table customers(id int,name string,age int,address string,salary int) row format delimited fields terminated by ',' lines terminated by '\n';
hive> load data local inpath '/home/cloudera/training/hivedata/customers.txt' into table customers;
hive> create table orders(oid int,date date,customer_id int,amount int)row format delimited fields terminated by ',' lines terminated by '\n';
hive> load data local inpath '/home/cloudera/training/hivedata/orders.txt' into table orders;
Join Queries
INNER JOIN:
SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
LEFT OUTER JOIN
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c LEFT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
RIGHT OUTER JOIN
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c RIGHT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
FULL OUTER JOIN
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c FULL OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
Processing XML
hive> create table xmltable(line string);
hive> load data local inpath '/home/cloudera/training/hivedata/sampxml1.xml' into table xmltable;
hive> select * from xmltable;
OK
xmltable.line
(the three loaded XML records are printed here, one per line)
Time taken: 0.064 seconds, Fetched: 3 row(s)
hive> select xpath_string(line,'rec/name') from xmltable;
OK
_c0
Ravi
Rani
Sampath
Time taken: 0.108 seconds, Fetched: 3 row(s)
hive> select xpath_string(line,'rec/city') from xmltable;
OK
_c0
hyd
Del
Time taken: 0.1 seconds, Fetched: 3 row(s)
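xpath_string returns only the first match as a string; the xpath function returns all matches as an array:

hive> select xpath(line,'rec/name/text()') from xmltable;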
Hive WebUI (Hue application)
Generating data: http://generatedata.com/
1) Querying tables
2) Create new table and import the data from external file.
step 1) Data Browsers --> Metastore Tables
step 2) Click on the '+' icon at the right corner to create a new table from a file.