Apache Sqoop Overview & Import Data From MySQL to HDFS


Overview of Sqoop

Sqoop is an open-source tool from Apache used to transfer data between an RDBMS (Oracle, SQL Server, MySQL) and HDFS.

MySQL Database
Connecting to the MySQL database in the Cloudera QuickStart VM:

root user: root/cloudera
other user: cloudera/cloudera

[cloudera@quickstart ~]$ mysql -u root -p
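The password is prompted interactively; enter cloudera. If you prefer, the password can also be passed inline (no space after -p), which is convenient inside the VM but not recommended on shared systems:

[cloudera@quickstart ~]$ mysql -u root -pcloudera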




Creating a new user in MySQL
mysql> create user cloudera1 identified by 'cloudera1';
mysql> grant all privileges on *.* to 'cloudera1';
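As a quick check, the new account and its privileges can be verified (the user was created as 'cloudera1' for any host above):

mysql> show grants for 'cloudera1';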

Listing databases in MySQL
mysql> show databases;

Switching to a database
mysql> USE retail_db;

Listing tables in retail_db database
mysql> show tables;
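Before importing, the table structure can be inspected; the customers table used in the examples below is part of retail_db:

mysql> describe customers;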

Working with Sqoop

Version of Sqoop
[cloudera@quickstart hivedata]$ sqoop version

[cloudera@quickstart hivedata]$ sqoop help

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information
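Help is also available per tool; for example, the full list of options accepted by the import tool can be shown with:

[cloudera@quickstart hivedata]$ sqoop help import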




import command

import --> imports data from an RDBMS into HDFS

1) Import data from a table to HDFS (all the rows and columns)
2) Import data from a table to HDFS (all the rows, but only specific columns)
3) Import data from a table to HDFS (all the columns, filter the rows using where)
4) Import data from a table to HDFS (specific columns, filter the rows using where)


Table:  retail_db.customers
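Before running the imports, connectivity and the presence of the table can be verified with the list-tables tool (same connection string, user, and password as the import commands below):

[cloudera@quickstart hivedata]$ sqoop list-tables --connect jdbc:mysql://192.168.13.135/retail_db --username cloudera --password cloudera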

1) Import data from a table to HDFS (all the rows and columns)

[cloudera@quickstart hivedata]$ sqoop import --connect jdbc:mysql://192.168.13.135/retail_db --username cloudera --password cloudera --table customers -m 1

The imported data is stored under /user/cloudera/customers (the default location is /user/<username>/<table-name>).

View the data:
[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/customers/part-m-00000
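Sqoop writes one part file per mapper (only part-m-00000 here because -m 1 was used); the output directory can be listed to see all files. Note that re-running an import into a directory that already exists fails with an "output directory already exists" error, so remove it before the next example (the import tool also accepts a --delete-target-dir flag for this):

[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/customers
[cloudera@quickstart ~]$ hdfs dfs -rm -r /user/cloudera/customers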


2) Import data from a table to HDFS (all the rows, but only specific columns)

[cloudera@quickstart hivedata]$ sqoop import --connect jdbc:mysql://192.168.13.135/retail_db --username cloudera --password cloudera --table customers --columns "customer_id,customer_fname,customer_email" -m 1

The imported data is again stored under /user/cloudera/customers (the default location).

View the data:
[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/customers/part-m-00000
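Optionally, to confirm that only the three requested columns were imported, look at just the first few records:

[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/customers/part-m-00000 | head -5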


3) Import data from a table to HDFS (all the columns, filter the rows using where)

[cloudera@quickstart hivedata]$ sqoop import --connect jdbc:mysql://192.168.13.135/retail_db --username cloudera --password cloudera --table customers --where "customer_state='IN'" -m 1 --target-dir /user/hive/warehouse/customers

View the data:
[cloudera@quickstart ~]$ hdfs dfs -cat /user/hive/warehouse/customers/part-m-00000
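A quick way to check the where filter is to count the imported rows. Also, since the next example reuses the same --target-dir, remove the directory first (or add --delete-target-dir to the import command):

[cloudera@quickstart ~]$ hdfs dfs -cat /user/hive/warehouse/customers/part-m-00000 | wc -l
[cloudera@quickstart ~]$ hdfs dfs -rm -r /user/hive/warehouse/customers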

4) Import data from a table to HDFS (specific columns, filter the rows using where)

[cloudera@quickstart ~]$ sqoop import --connect jdbc:mysql://192.168.13.135/retail_db --username cloudera --password cloudera --table customers --columns "customer_id,customer_fname" --where "customer_state='IN'" -m 1 --target-dir /user/hive/warehouse/customers

View the data:
[cloudera@quickstart ~]$ hdfs dfs -cat /user/hive/warehouse/customers/part-m-00000
