Apache Sqoop Notes


Overview of Sqoop

Sqoop is open-source software from Apache used to transfer data between an RDBMS (Oracle, SQL Server, MySQL, ...) and HDFS.

MySQL Database

Connecting to MySQL Database:
root user: root/cloudera
other user: cloudera/cloudera


[cloudera@quickstart ~]$ mysql -u root -p

Create new user
---------------
mysql> create user cloudera1 identified by 'cloudera1';
mysql> grant all privileges on *.* to 'cloudera1';
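To verify the new account, log in as it and inspect its privileges (a quick check; the password is the one set above):

[cloudera@quickstart ~]$ mysql -u cloudera1 -p
mysql> show grants for 'cloudera1';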

listing databases in MySQL
--------------
mysql> show databases;

switching to db
--------------------
mysql> USE retail_db;

listing tables in retail_db database
---------
mysql> show tables;
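To see the columns available for the import examples below, describe a table first:

mysql> describe customers;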


Connecting to Sqoop
--------------------

version of sqoop
-------------------
[cloudera@quickstart hivedata]$ sqoop version

[cloudera@quickstart hivedata]$ sqoop help

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

import command
----------------

import --> imports data from an RDBMS into HDFS

1) import data from a table to HDFS (all the rows and columns)
2) import data from a table to HDFS (all the rows but specific columns)
3) import data from a table to HDFS (all the columns, filter the rows using where)
4) import data from a table to HDFS (specific columns, filter the rows using where)


Table:  retail_db.customers

import data from a table to HDFS(all the rows and columns)
---------------------------------------------
[cloudera@quickstart hivedata]$ sqoop import --connect jdbc:mysql://192.168.13.142/retail_db --username cloudera --password cloudera --table customers --m 1;

imported data will be stored in /user/cloudera/customers   (default location: /user/<username>/<table>)

view the data
-------
[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/customers/part-m-00000
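--m 1 (short form -m) sets the number of mappers; with one mapper the output is a single file, part-m-00000. A sketch of a parallel import, assuming customer_id is a numeric primary key suitable for splitting:

[cloudera@quickstart hivedata]$ sqoop import --connect jdbc:mysql://192.168.13.142/retail_db --username cloudera --password cloudera --table customers --split-by customer_id -m 4

This writes four files, part-m-00000 through part-m-00003.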


import data from a table to HDFS (all the rows but specific columns)
---------------------------------------------------------------------
[cloudera@quickstart hivedata]$ sqoop import --connect jdbc:mysql://192.168.13.142/retail_db --username cloudera --password cloudera --table customers --columns "customer_id","customer_fname","customer_email" --m 1;

imported data will be stored in /user/cloudera/customers   (default location: /user/<username>/<table>)

view the data
-------
[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/customers/part-m-00000


import data from a table to HDFS (all the columns, filter the rows using where)
-------------------------------------------------------------------------------------
[cloudera@quickstart hivedata]$ sqoop import --connect jdbc:mysql://192.168.13.142/retail_db --username cloudera --password cloudera --table customers --where "customer_state='IN'" --m 1 --target-dir /user/hive/warehouse/customers

view the data
-------
[cloudera@quickstart ~]$ hdfs dfs -cat /user/hive/warehouse/customers/part-m-00000
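Note: re-running an import fails if the target directory already exists. Sqoop's --delete-target-dir option removes it first (a sketch, same connection as above):

[cloudera@quickstart hivedata]$ sqoop import --connect jdbc:mysql://192.168.13.142/retail_db --username cloudera --password cloudera --table customers --where "customer_state='IN'" --m 1 --target-dir /user/hive/warehouse/customers --delete-target-dir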

4) import data from a table to HDFS (specific columns, filter the rows using where)
---------------------
[cloudera@quickstart ~]$ sqoop import --connect jdbc:mysql://192.168.13.142/retail_db --username cloudera --password cloudera --table customers --columns "customer_id","customer_fname" --where "customer_state='IN'" --m 1 --target-dir /user/hive/warehouse/customers

view the data
-------
[cloudera@quickstart ~]$ hdfs dfs -cat /user/hive/warehouse/customers/part-m-00000

-------------------------------------------------------
Sqoop commands
------------------------------------------------------
import-all-tables
list-databases
list-tables
create-hive-table
hive-import
eval
export
---------------------

import-all-tables : import data from all MySQL tables to HDFS
--------------------
[cloudera@quickstart ~]$ hdfs dfsadmin -safemode leave   // run this first if the import fails because HDFS is in safe mode.

[cloudera@quickstart ~]$ sqoop import-all-tables --connect jdbc:mysql://192.168.13.142/retail_db --username cloudera --password cloudera
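import-all-tables also supports --warehouse-dir (parent HDFS directory for all the table directories) and --exclude-tables (comma-separated tables to skip); a sketch:

[cloudera@quickstart ~]$ sqoop import-all-tables --connect jdbc:mysql://192.168.13.142/retail_db --username cloudera --password cloudera --warehouse-dir /user/cloudera/retail_db --exclude-tables departments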

list-databases: list available databases in MySQL
-----------------
[cloudera@quickstart ~]$ sqoop list-databases --connect jdbc:mysql://192.168.13.142/ --username cloudera --password cloudera

list-tables: list available tables in the database
---------------------------
[cloudera@quickstart ~]$ sqoop list-tables --connect jdbc:mysql://192.168.13.142/retail_db --username cloudera --password cloudera


create-hive-table : import a table definition into Hive
-----------------------------------------------------------
step 1) import MySQL table data into HDFS.
[cloudera@quickstart ~]$ sqoop import --connect jdbc:mysql://192.168.13.142/retail_db --username cloudera --password cloudera --table customers --m 1;

step 2) import the MySQL table definition into Hive (create-hive-table).
-----------------------------------------------------
[cloudera@quickstart ~]$ sqoop create-hive-table --connect jdbc:mysql://192.168.13.142/retail_db --username cloudera --password cloudera --table customers --fields-terminated-by ',' --lines-terminated-by '\n';

step 3) load the data from HDFS to hive table.
--------------------------------------
hive> load data inpath '/user/cloudera/customers' into table customers;
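verify the load from the Hive shell:

hive> select * from customers limit 5;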

'hive-import' option for the import command  (combines the above steps into one)
------------------------------------------------------------------------------
[cloudera@quickstart ~]$ sqoop import --connect jdbc:mysql://192.168.13.142/retail_db --username cloudera --password cloudera --table customers --m 1 --hive-import;
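Related options can be added alongside --hive-import, such as --hive-table to pick the Hive table name and --hive-overwrite to replace existing data (customers_copy below is a hypothetical table name):

[cloudera@quickstart ~]$ sqoop import --connect jdbc:mysql://192.168.13.142/retail_db --username cloudera --password cloudera --table customers --m 1 --hive-import --hive-table customers_copy --hive-overwrite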

eval: evaluate a SQL statement and display the result
------------------------------------------------
[cloudera@quickstart ~]$ sqoop eval --connect jdbc:mysql://192.168.13.142/retail_db --username cloudera --password cloudera --query "select * from customers limit 10";
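eval accepts any SQL the database understands, so it is handy for quick checks, e.g. a row count:

[cloudera@quickstart ~]$ sqoop eval --connect jdbc:mysql://192.168.13.142/retail_db --username cloudera --password cloudera --query "select count(*) from customers"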


export: export the data from HDFS to MySQL
---------------------------------------------
insert mode
update mode

mysql> create database hr;   // creating a new database in MySQL
mysql> use hr;
mysql> create table employees(name varchar(30),email varchar(40));  // creating table
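The export examples below assume comma-delimited employee data already exists in HDFS at the --export-dir path. A minimal way to stage it (employees.csv is a hypothetical local file with name,email rows):

[cloudera@quickstart hivedata]$ echo "john,john@example.com" > employees.csv
[cloudera@quickstart hivedata]$ hdfs dfs -put employees.csv /user/hive/warehouse/Employees.csv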

insert mode
-------------
[cloudera@quickstart hivedata]$ sqoop export --connect jdbc:mysql://192.168.13.142/hr --username cloudera --password cloudera --table employees --export-dir /user/hive/warehouse/Employees.csv;

update mode
--------------
[cloudera@quickstart hivedata]$ sqoop export --connect jdbc:mysql://192.168.13.142/hr --username cloudera --password cloudera --table employees --export-dir /user/hive/warehouse/Employees.csv --update-key name;
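In update mode, rows whose key value is missing from the MySQL table are skipped. To update existing rows and insert new ones in the same run, Sqoop supports --update-mode allowinsert (a sketch):

[cloudera@quickstart hivedata]$ sqoop export --connect jdbc:mysql://192.168.13.142/hr --username cloudera --password cloudera --table employees --export-dir /user/hive/warehouse/Employees.csv --update-key name --update-mode allowinsert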



Sqoop Commands (Cont..)
----------------------------
Sqoop Jobs
codegen command
-----------------------

Sqoop job
--------
is a saved file containing a Sqoop command; we can run the job at any time.

create   --create
listing jobs   --list
detailed information about the job    --show
execute job   --exec
delete job   --delete

create a new job to import data from a MySQL table to HDFS.
----------------------
sqoop job --create myjob1 -- import --connect jdbc:mysql://192.168.13.136/retail_db --username cloudera --password cloudera --table customers -m 1

listing available jobs
---------------------------
[cloudera@quickstart ~]$ sqoop job --list

display detailed information about the job
---------------------------------------
[cloudera@quickstart ~]$ sqoop job --show myjob1


delete existing job
-----------------------------
[cloudera@quickstart ~]$ sqoop job --delete myjob1


[cloudera@quickstart ~]$ hdfs dfsadmin -safemode leave             ---> turning HDFS safe mode off (run if a job fails because HDFS is in safe mode)

Execute sqoop job
-----------------------
[cloudera@quickstart ~]$ sqoop job --exec myjob1
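By default the job does not store the database password, so --exec prompts for it. Arguments given after -- override the saved ones at execution time, e.g. supplying credentials interactively:

[cloudera@quickstart ~]$ sqoop job --exec myjob1 -- --username cloudera -P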

codegen command : generate code to interact with database records
--------------------------------------------
[cloudera@quickstart ~]$ sqoop codegen --connect jdbc:mysql://192.168.13.136/retail_db --username cloudera --password cloudera --table customers

Verify files
----------------
[cloudera@quickstart ~]$ ls /tmp/sqoop-cloudera/compile/752b65100f5001f20bd57eb85a460b51/
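The directory should contain the generated files for the table, e.g. customers.java, customers.class, and customers.jar. The hash-named subdirectory differs on every run; listing the newest entry finds it:

[cloudera@quickstart ~]$ ls -t /tmp/sqoop-cloudera/compile/ | head -1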
