Hadoop HDFS Commands

HDFS Commands
  • jps
Command to print the running Hadoop (Java) processes. jps ships with the JDK rather than with HDFS.
[root@quickstart Desktop]# jps

  • fsck
HDFS Command to check the health of the Hadoop file system.
[cloudera@quickstart training]$ hdfs fsck /

  • ls
HDFS Command to display the list of Files and Directories in HDFS.
[cloudera@quickstart training]$ hdfs dfs -ls /

  • mkdir
HDFS Command to create the directory in HDFS.
[cloudera@quickstart training]$ hdfs dfs -mkdir /bigdatatesting
[cloudera@quickstart training]$ hdfs dfs -ls /
drwxr-xr-x   - cloudera supergroup          0 2018-05-23 00:46 /bigdatatesting
Note: Here we are creating a directory named “bigdatatesting” in HDFS.
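
The listing line above can also be read by a script. The following is a minimal sketch, assuming the eight whitespace-separated fields shown in that output; the names LsEntry and parse_ls_line are hypothetical and not part of Hadoop.

```python
from typing import NamedTuple


class LsEntry(NamedTuple):
    permissions: str
    replication: str   # "-" for directories
    owner: str
    group: str
    size: int
    date: str
    time: str
    path: str


def parse_ls_line(line: str) -> LsEntry:
    # Split into at most 8 fields so a path containing spaces stays intact.
    perms, repl, owner, group, size, date, time, path = line.split(None, 7)
    return LsEntry(perms, repl, owner, group, int(size), date, time, path)


entry = parse_ls_line(
    "drwxr-xr-x   - cloudera supergroup          0 2018-05-23 00:46 /bigdatatesting"
)
# A leading "d" in the permission string marks a directory.
is_directory = entry.permissions.startswith("d")
```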

  • touchz
HDFS Command to create a file in HDFS with file size 0 bytes.
[cloudera@quickstart training]$ hdfs dfs -touchz /bigdatatesting/test.dat
[cloudera@quickstart training]$ hdfs dfs -ls /bigdatatesting/
Found 1 items
-rw-r--r--   1 cloudera supergroup          0 2018-05-23 00:48 /bigdatatesting/test.dat

Note: Here we are creating a file named “test.dat”, with a size of 0 bytes, in the HDFS directory “bigdatatesting”.

  • du
HDFS Command to check the file size. 
[cloudera@quickstart training]$ hdfs dfs -du -s /bigdatatesting/test.dat
0  0  /bigdatatesting/test.dat

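  • appendToFile
HDFS Command to append the contents of one or more local sources to the given destination file on HDFS. With - as the source, the input is read from standard input. The destination file will be created if it does not exist.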
[cloudera@quickstart training]$ hdfs dfs -appendToFile - /bigdatatesting/test.dat

  • cat
HDFS Command that reads a file on HDFS and prints the content of that file to the standard output.
[cloudera@quickstart training]$ hdfs dfs -cat /bigdatatesting/test.dat

  • copyFromLocal
HDFS Command to copy the file from a Local file system to HDFS.
Step1: Create a file in Local File System.
[cloudera@quickstart training]$ cat>> test1.dat
[cloudera@quickstart training]$ ls test1.dat
Step2: Copy file from Local File system to HDFS
[cloudera@quickstart training]$ hdfs dfs -copyFromLocal test1.dat /bigdatatesting/
Note: Here test1.dat is the file present in the local directory /home/cloudera/training and after the command gets executed the test1.dat file will be copied to the /bigdatatesting directory of HDFS.

  • copyToLocal
HDFS Command to copy the file from HDFS to Local File System.
Step1: Check whether the test.dat file is present in the local file system.
[cloudera@quickstart training]$ ls test.dat
ls: cannot access test.dat: No such file or directory
Step2: Copy the test.dat file from HDFS to the local file system.
[cloudera@quickstart training]$ hdfs dfs -copyToLocal /bigdatatesting/test.dat /home/cloudera/training
Step3: Check again whether the test.dat file is present in the local file system.
[cloudera@quickstart training]$ ls test.dat
Note: Here test.dat is a file present in the bigdatatesting directory of HDFS and after the command gets executed the test.dat file will be copied to the local directory /home/cloudera/training

  • put
HDFS Command to copy single source or multiple sources from local file system to the destination file system.
Step1: Create a file in Local File System.
[cloudera@quickstart training]$ cat>> test2.dat
[cloudera@quickstart training]$ ls test2.dat
Step2: Copy file from Local File system to HDFS
[cloudera@quickstart training]$ hdfs dfs -put test2.dat /bigdatatesting/
Note: Here the test2.dat is the file present in the local directory /home/cloudera/training and after the command gets executed the test2.dat file will be copied in /bigdatatesting directory of HDFS.
Note: The command put is similar to the copyFromLocal command.

  • get
HDFS Command to copy files from hdfs to the local file system.
Step1: Create a new file test3.dat on HDFS.
[cloudera@quickstart training]$ hdfs dfs -touchz /bigdatatesting/test3.dat

Step2: Copy test3.dat file from HDFS to local file system. 
[cloudera@quickstart training]$ hdfs dfs -get /bigdatatesting/test3.dat /home/cloudera/training

Step3: Check that the test3.dat file is now present in the local file system.
[cloudera@quickstart training]$ ls test3.dat

Note1: Here test3.dat is a file present in the bigdatatesting directory of HDFS and after the command gets executed the test3.dat file will be copied to the local directory /home/cloudera/training

Note2: The command get is similar to the copyToLocal command.

  • cp
HDFS Command to copy files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.
[cloudera@quickstart training]$ hdfs dfs -mkdir /hadooptesting/
[cloudera@quickstart training]$ hdfs dfs -cp /bigdatatesting/test.dat /hadooptesting

  • mv
HDFS Command to move files from source to destination. This command allows multiple sources as well, in which case the destination needs to be a directory.
[cloudera@quickstart training]$ hdfs dfs -mv /bigdatatesting/test1.dat /hadooptesting/

  • rm
HDFS Command to remove the file from HDFS.
[cloudera@quickstart training]$ hdfs dfs -rm /bigdatatesting/test2.dat
Deleted /bigdatatesting/test2.dat

  • rm -r
HDFS Command to remove the entire directory and all of its content from HDFS.
[cloudera@quickstart training]$ hdfs dfs -rm -r /hadooptesting
Deleted /hadooptesting

  • rmdir
HDFS Command to remove the directory if it is empty.
[cloudera@quickstart training]$ hdfs dfs -rmdir /bigdatatesting

  • usage
HDFS Command that returns the help for an individual command.
[cloudera@quickstart training]$ hdfs dfs -usage mkdir
Note: By using usage command you can get information about any command.

  • help
HDFS Command that displays help for given command or all commands if none is specified.
[cloudera@quickstart training]$ hdfs dfs -help


SDET (Software Development Engineer in Test)

What is an SDET?

An SDET is an IT professional who can work equally effectively in development and testing. SDET stands for Software Development Engineer in Test, and an SDET takes part in the complete software development process.

An SDET professional's knowledge is focused on testability, robustness, and performance. SDETs can also play a contributing or reviewing role in the design of production software.

Difference between SDET and Tester?

SDET | Manual Tester
Knows the entire system start to end | Limited knowledge about the system
SDET is involved in every step of the software development process, like designing, development, and testing | QA is only involved in the testing life cycle of the software development process
Highly skilled professional with development as well as testing knowledge | Software tester is only involved in preparing and executing the test cases
SDET can participate in test automation tool development and may make it for generic use | Not expected to develop test automation tools or frameworks
SDETs need to perform duties like performance testing, automated generation of test data, etc. | Only testing-related tasks will be performed by the tester
Knows requirements and guidelines for the products | No such knowledge expected from QA professionals

When do you need SDET?

Today organizations are looking for professionals who can take part in software development and also handle testing of the developed software. Hiring SDETs helps here, because they can work on developing high-performance code as well as on designing the testing framework.

What are the roles and responsibilities of an SDET?
  • SDET should be able to perform test automation and set up frameworks on multiple application platforms like Web, Mobile, and Desktop.
  • Investigate customer problems referred by the technical support team.
  • Create and manage bug reports and communicate with the team.
  • Build different test scenarios and acceptance tests.
  • SDET needs to handle technical communications with partners to understand the client's systems or APIs.
  • SDET also works with deployment teams and resolves issues of any level for the system.
  • SDET should also be able to set up, maintain, and operate test automation frameworks.

Career Progression

Your career progression as an SDET in a typical CMMI level 5 company will look like the following, but will vary from company to company:
SDET (Fresher) => Sr. SDET (2-3 years' experience) => SDET Team Coordinator (5-8 years' experience) => SDET Manager (8+ years' experience)


An SDET professional is a mix of developer and tester who also has exposure to project management. This all-in-one skill set makes SDET jobs more challenging and highly in demand in the current market.


10 Advantages and Disadvantages of Selenium

Selenium is at present one of the most popular free, open source automation tools. It was originally developed by Jason Huggins and his team. It is released under the Apache 2.0 license and can be downloaded and used without any charge. Selenium is easy to get started with for simple functional testing of web applications. It supports record and playback for testing web-based applications (through Selenium IDE). Selenium also supports parallel execution, i.e. multiple instances of a script can be run on different browsers.


Advantages

1. Selenium is a pure open source, free, and portable tool.
2. Selenium supports a variety of languages, including Java, Perl, Python, C#, Ruby, Groovy, JavaScript, and VBScript.
3. Selenium supports many operating systems like Windows, Macintosh, Linux, Unix etc.
4. Selenium supports many browsers like Internet Explorer, Chrome, Firefox, Opera, Safari etc.
5. Selenium can be integrated with build tools such as ANT or Maven for source code compilation.
6. Selenium can be integrated with the TestNG testing framework for testing our applications and generating reports.
7. Selenium can be integrated with Jenkins or Hudson for continuous integration.
8. Selenium can be integrated with other open source tools to support other features.
9. Selenium can be used for testing Android, iPhone, BlackBerry etc. based applications.
10. Selenium consumes very little CPU and RAM during script execution.

Disadvantages

1. Selenium needs highly skilled resources, who should also be well versed in framework architecture.
2. Selenium only supports web-based applications and does not support Windows-based applications.
3. It is difficult to test image-based applications.
4. Selenium needs outside support for report generation, such as TestNG or Jenkins.
5. Selenium has no built-in add-in support.
6. Selenium users lack online support for the problems they face.
7. Selenium does not provide a built-in IDE for writing scripts and needs another IDE, such as Eclipse.
8. Script creation time in Selenium is somewhat high.
9. Selenium has no built-in file upload facility.
10. Selenium only partially supports dialog boxes.


Apache Sqoop Jobs

A Sqoop job is a saved Sqoop command, together with its parameters, that can be run at any time. The sqoop job tool takes the following options:

Create a job: --create
List jobs: --list
Show detailed information about a job: --show
Execute a job: --exec
Delete a job: --delete

Create a new job to import data from MySQL table to HDFS.

sqoop job --create myjob -- import --connect jdbc:mysql:// --username cloudera --password cloudera --table customers -m 1

Listing available jobs

[cloudera@quickstart ~]$ sqoop job --list

Display detailed information about the job

[cloudera@quickstart ~]$ sqoop job --show myjob

Delete existing job

[cloudera@quickstart ~]$ sqoop job --delete myjob

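If a command fails because HDFS is in safe mode, turn safe mode off first:
[cloudera@quickstart ~]$ hdfs dfsadmin -safemode leave

Execute the Sqoop job (the job must still exist, so run this before deleting it)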

[cloudera@quickstart ~]$ sqoop job --exec myjob

codegen command: generates the Java class files that encapsulate the records of the given table

[cloudera@quickstart ~]$ sqoop codegen --connect jdbc:mysql:// --username cloudera --password cloudera --table customers

Verify files
[cloudera@quickstart ~]$ ls /tmp/sqoop-cloudera/compile/752b65100f5001f20bd57eb85a460b51/

Apache Sqoop Commands

Sqoop commands
  1. import-all-tables
  2. list-databases
  3. list-tables
  4. create-hive-table
  5. hive-import
  6. eval
  7. export

import-all-tables: imports data from all MySQL tables to HDFS

[cloudera@quickstart ~]$ hdfs dfsadmin -safemode leave   # run this first if the import fails because HDFS is in safe mode

[cloudera@quickstart ~]$ sqoop import-all-tables --connect jdbc:mysql:// --username cloudera --password cloudera

list-databases: lists the available databases in MySQL

[cloudera@quickstart ~]$ sqoop list-databases --connect jdbc:mysql:// --username cloudera --password cloudera

list-tables: lists the available tables in the database

[cloudera@quickstart ~]$ sqoop list-tables --connect jdbc:mysql:// --username cloudera --password cloudera

create-hive-table: imports a table definition into Hive

Step 1) Import the MySQL table data into HDFS.
[cloudera@quickstart ~]$ sqoop import --connect jdbc:mysql:// --username cloudera --password cloudera --table customers --m 1;

Step 2) Import the MySQL table definition into Hive (create-hive-table).

[cloudera@quickstart ~]$ sqoop create-hive-table --connect jdbc:mysql:// --username cloudera --password cloudera --table customers --fields-terminated-by ',' --lines-terminated-by '\n';

Step 3) Load the data from HDFS into the Hive table.

hive> load data inpath '/user/cloudera/customers' into table customers;

'hive-import' option for the import command (combines the three steps above into one)

[cloudera@quickstart ~]$ sqoop import --connect jdbc:mysql:// --username cloudera --password cloudera --table customers --m 1 --hive-import;

eval: evaluate SQL statement and display the result

[cloudera@quickstart ~]$ sqoop eval --connect jdbc:mysql:// --username cloudera --password cloudera --query "select * from customers limit 10";

export: exports data from HDFS to MySQL

  • insert mode
  • update mode

mysql> create database hr;   -- create a new database in MySQL
mysql> use hr;
mysql> create table employees(name varchar(30),email varchar(40));   -- create the target table

insert mode

[cloudera@quickstart hivedata]$ sqoop export --connect jdbc:mysql:// --username cloudera --password cloudera --table employees --export-dir /user/hive/warehouse/Employees.csv;

update mode

[cloudera@quickstart hivedata]$ sqoop export --connect jdbc:mysql:// --username cloudera --password cloudera --table employees --export-dir /user/hive/warehouse/Employees.csv --update-key name;
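
The difference between the two export modes can be sketched without a cluster. The following is a rough illustration only, using an in-memory SQLite table in place of MySQL; export_rows and the sample rows are hypothetical. It assumes that, by default, update mode issues one UPDATE per record keyed on the --update-key column and does not insert records that match no existing row.

```python
import sqlite3


def export_rows(conn, rows, update_by_name=False):
    """Mimic the two export modes against an employees(name, email) table."""
    for name, email in rows:
        if update_by_name:
            # Update mode: each record becomes an UPDATE keyed on "name".
            # A record that matches no existing row changes nothing.
            conn.execute(
                "UPDATE employees SET email = ? WHERE name = ?", (email, name)
            )
        else:
            # Insert mode: each record becomes an INSERT.
            conn.execute(
                "INSERT INTO employees(name, email) VALUES (?, ?)", (name, email)
            )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees(name VARCHAR(30), email VARCHAR(40))")

# First export in insert mode, then a second export in update mode.
export_rows(conn, [("asha", "asha@old.example"), ("ravi", "ravi@old.example")])
export_rows(
    conn,
    [("asha", "asha@new.example"), ("zoe", "zoe@new.example")],
    update_by_name=True,
)
result = conn.execute("SELECT name, email FROM employees ORDER BY name").fetchall()
```

In this sketch asha's email is updated, ravi is untouched, and zoe is not added because no existing row has that key.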


HBase Interview Questions

What are the different commands used in HBase operations?
There are five atomic commands which carry out different operations in HBase:
Get, Put, Delete, Scan and Increment.
How to connect to HBase?
A connection to HBase can be established through the HBase shell, a JRuby-based command-line client built on top of the Java client API.
What is the role of the Master server in HBase?
The Master server assigns regions to region servers and handles load balancing in the cluster.
What is the role of ZooKeeper in HBase?
ZooKeeper maintains configuration information, provides distributed synchronization, and also maintains the communication between clients and region servers.
When do we need to disable a table in HBase?
In HBase a table is disabled so that it can be modified or have its settings changed. While a table is disabled, it cannot be accessed through the scan command.
Give a command to check if a table is disabled.
hbase> is_disabled 'table name'
What does the following command do?
hbase> disable_all 'p.*'
The command disables all tables whose names start with the letter p.
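
To see which names a pattern such as 'p.*' would select, here is a small sketch using Python's re module. It assumes the whole table name has to match the regular expression; the table names and the helper tables_matching are made up for illustration.

```python
import re


def tables_matching(pattern: str, tables: list[str]) -> list[str]:
    # The whole table name has to match the pattern, hence fullmatch.
    regex = re.compile(pattern)
    return [t for t in tables if regex.fullmatch(t)]


# "shipping" contains a p but does not start with one, so it is not selected.
matched = tables_matching("p.*", ["products", "people", "orders", "shipping"])
```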

What are the different types of filters used in HBase?
Filters are used to get specific data from an HBase table rather than all the records.
They are of the following types.
  • Column Value filters
  • Column Value comparators
  • KeyValue Metadata filters
  • RowKey filters
Name three disadvantages HBase has as compared to RDBMS?
  • HBase does not have an in-built authentication/permission mechanism.
  • Indexes can be created only on the key column, whereas in an RDBMS they can be created on any column.
  • With one HMaster node there is a single point of failure.
What are catalog tables in HBase?
The catalog tables in HBase maintain the metadata information. They are named -ROOT- and .META. The -ROOT- table stores the location of the .META. table, and the .META. table holds information about all regions and their locations. (Since HBase 0.96, -ROOT- no longer exists and .META. is named hbase:meta.)
Is HBase a scale out or scale up process?
HBase runs on top of Hadoop, which is a distributed system. Hadoop scales out as and when required by adding more machines on the fly. So HBase is a scale out process.
What are the steps in writing something into HBase by a client?
In HBase the client does not write directly into the HFile. The write is first recorded in the WAL (Write Ahead Log) and then placed in the MemStore. The MemStore flushes its data to permanent storage (a new HFile) from time to time.
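
The write path described above can be sketched as a toy model. This is a simplified illustration, not HBase code: TinyStore, its fields, and the flush threshold counted in entries are all assumptions made for the example.

```python
class TinyStore:
    """Toy model of one store: WAL, MemStore, and flushed files."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only log of every edit
        self.memstore = {}   # in-memory buffer of recent edits
        self.hfiles = []     # immutable sorted files produced by flushes
        self.flush_threshold = flush_threshold

    def put(self, row, value):
        self.wal.append((row, value))   # 1. log the edit first
        self.memstore[row] = value      # 2. then buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                # 3. flush once the buffer is full

    def flush(self):
        # Write the buffered edits out as one file sorted by row key.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}


store = TinyStore(flush_threshold=2)
store.put("row2", "b")
store.put("row1", "a")   # the second put reaches the threshold and flushes
store.put("row3", "c")   # stays in the MemStore until the next flush
```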
What is compaction in HBase?
As more and more data is written to HBase, many HFiles get created. Compaction is the process of merging these HFiles into one file and, once the merged file has been created successfully, discarding the old files.
What are the different compaction types in HBase?
There are two types of compaction: major and minor. In a minor compaction, some adjacent small HFiles are merged into a single HFile without removing deleted cells. The files to be merged are selected by the compaction policy.
In a major compaction, all the HFiles of a column family are merged and a single HFile is created. Deleted cells are discarded. It is often triggered manually.
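
The difference between the two compaction types can be shown with a small sketch. It is a toy model under stated assumptions: each HFile is a sorted list of (row, value) pairs, and TOMBSTONE is a hypothetical marker standing in for a delete marker.

```python
TOMBSTONE = None  # hypothetical marker for a deleted cell


def compact(hfiles, major=False):
    """Merge files (oldest first) into one sorted file; newer values win."""
    merged = {}
    for hfile in hfiles:          # later files overwrite earlier ones
        merged.update(dict(hfile))
    if major:
        # Only a major compaction physically drops deleted cells.
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return sorted(merged.items())


older = [("row1", "a"), ("row2", "b")]
newer = [("row2", TOMBSTONE), ("row3", "c")]

minor_result = compact([older, newer])              # delete marker is kept
major_result = compact([older, newer], major=True)  # row2 disappears
```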
What is the difference between the commands delete column and delete family?
The Delete column command deletes all versions of a column but the delete family deletes all columns of a particular family.
What is a cell in HBase?
A cell in HBase is the smallest unit of an HBase table; it holds a piece of data identified by the tuple {row, column, version}.
What is the role of the class HColumnDescriptor in HBase?
This class is used to store information about a column family such as the number of versions, compression settings, etc. It is used as input when creating a table or adding a column.
What is the lower bound of versions in HBase?
The lower bound of versions is the minimum number of versions to keep in HBase for a column, even after older versions have passed their TTL. For example, if the value is set to 3, the three latest versions are kept and only versions older than those can be removed.
What is TTL (Time to live) in HBase?
TTL is a data retention technique by which a version of a cell is preserved only for a specific time period. Once that period has elapsed, the version is removed.
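
How TTL and the lower bound of versions interact can be sketched as follows. This is a simplified model, assuming timestamps and TTL share one arbitrary unit; retained_versions is a hypothetical helper, not an HBase API.

```python
def retained_versions(timestamps, now, ttl, min_versions=0):
    """Return the version timestamps that survive, newest first."""
    newest_first = sorted(timestamps, reverse=True)
    # Versions still inside the TTL window are always kept.
    kept = [ts for ts in newest_first if now - ts <= ttl]
    # The lower bound keeps the newest versions even if they have expired.
    if len(kept) < min_versions:
        kept = newest_first[:min_versions]
    return kept


within_ttl = retained_versions([100, 200, 300], now=400, ttl=250)
all_expired = retained_versions([100, 200, 300], now=1000, ttl=500)
expired_with_floor = retained_versions(
    [100, 200, 300], now=1000, ttl=500, min_versions=1
)
```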
Does HBase support table joins?
HBase does not support table joins. But using a MapReduce job we can implement joins to retrieve data from multiple HBase tables.
What is a rowkey in HBase?
Each row in HBase is identified by a unique byte array called the row key.
What are the two ways in which you can access data from HBase?
The data in HBase can be accessed in two ways.
  • Using the rowkey, or a table scan over a range of row key values.
  • Using MapReduce in a batch manner.
What are the two types of table design approach in HBase?
They are (i) Short and Wide (ii) Tall and Thin
In which scenario should we consider creating a short and wide HBase table?
The short and wide table design is considered when
  • There is a small number of rows
  • There is a large number of columns
In which scenario should we consider a tall and thin table design?
The tall and thin table design is considered when
  • There is a large number of rows
  • There is a small number of columns
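
The two designs can be compared on the same data. The sketch below uses plain dictionaries to stand in for tables (row key -> {column: value}); the messages, the "msg" column family, and the "#" separator in the row key are all made up for illustration.

```python
messages = [("alice", 1, "hi"), ("alice", 2, "lunch?"), ("bob", 1, "ok")]

# Short and wide: one row per user, one column per message.
wide = {}
for user, msg_id, text in messages:
    wide.setdefault(user, {})[f"msg:{msg_id}"] = text

# Tall and thin: one row per message, with the message id folded into the row key.
tall = {
    f"{user}#{msg_id}": {"msg:text": text}
    for user, msg_id, text in messages
}
```

The wide layout ends up with two rows and a growing number of columns, while the tall layout has three rows of one column each.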
What does the following command do?
major_compact 'tablename'
Run a major compaction on the table.
How does HBase support bulk data loading?
There are two main steps to do a bulk data load in HBase.
  • Generate HBase data files (StoreFiles) from the data source using a custom MapReduce job. The StoreFiles are created in HBase's internal format, which can be loaded efficiently.
  • Import the prepared files into a running cluster using a tool such as completebulkload. Each file gets loaded into one specific region.
How does HBase provide high availability?
HBase uses a feature called region replication. With this feature, each region of a table has multiple replicas that are opened in different RegionServers. The Load Balancer ensures that the region replicas are not co-hosted on the same RegionServer.
What is HMaster?
HMaster is the Master server responsible for monitoring all RegionServer instances in the cluster, and it is the interface for all metadata changes. In a distributed cluster, it typically runs on the NameNode.
What is HRegionServer in HBase?
HRegionServer is the RegionServer implementation. It is responsible for serving and managing regions. In a distributed cluster, a RegionServer runs on a DataNode.
What are the different Block Caches in HBase?
HBase provides two different BlockCache implementations: the default on-heap LruBlockCache and the BucketCache, which is (usually) off-heap.
How does WAL help when a RegionServer crashes?
The Write Ahead Log (WAL) records all changes to data in HBase to file-based storage. If a RegionServer crashes or becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to the data can be replayed.
Why is MultiWAL needed?
With a single WAL per RegionServer, the RegionServer must write to the WAL serially, because HDFS files must be sequential. This causes the WAL to be a performance bottleneck. MultiWAL allows a RegionServer to write multiple WAL streams in parallel.
In HBase what is log splitting?
All regions on a RegionServer share the same WAL file. When a region is opened, the edits in the WAL file which belong to that region need to be replayed. Therefore, edits in the WAL file must be grouped by region so that particular sets can be replayed to regenerate the data in a particular region. The process of grouping the WAL edits by region is called log splitting.
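
The grouping step itself is simple to sketch. The following toy example assumes each WAL edit is a (region, row, value) tuple; the region names and the helper split_log are hypothetical.

```python
from collections import defaultdict


def split_log(wal_edits):
    """Group WAL edits by region, preserving their original order."""
    by_region = defaultdict(list)
    for region, row, value in wal_edits:
        by_region[region].append((row, value))
    return dict(by_region)


# One shared log holding interleaved edits for two regions.
wal = [
    ("region-a", "row1", "x"),
    ("region-b", "row9", "y"),
    ("region-a", "row2", "z"),
]
split = split_log(wal)
```

Each region can then replay only its own list of edits.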
How can you disable WAL? What is the benefit?
The WAL can be disabled to remove it as a performance bottleneck, at the risk of losing data if a RegionServer fails before the MemStore is flushed.
This is done by calling the HBase client field Mutation.writeToWAL(false).
When do we do manual region splitting?
Manual region splitting is done when a table has an unexpected hotspot because many clients are querying the same region of the table.
What is an HBase Store?
An HBase Store hosts a MemStore and 0 or more StoreFiles (HFiles). A Store corresponds to a column family for a table for a given region.
Which file in HBase is designed after the SSTable file of BigTable?
The HFile in HBase, which stores the actual data (not metadata), is designed after the SSTable file of BigTable.
Why do we pre-create empty regions?
Tables in HBase are initially created with one region by default. Then for bulk imports, all clients will write to the same region until it is large enough to split and become distributed across the cluster. So empty regions are created to make this process faster.
What is the scope of a rowkey in HBase?
Rowkeys are scoped to ColumnFamilies. The same rowkey could exist in each ColumnFamily that exists in a table without collision.
What is the information stored in the hbase:meta table?
The hbase:meta table stores details of the regions in the system in the following format.
info:regioninfo (serialized HRegionInfo instance for this region)
info:server (server:port of the RegionServer containing this region)
info:serverstartcode (start-time of the RegionServer process containing this region)
What is a Namespace in HBase?
A Namespace is a logical grouping of tables. It is similar to a database object in a relational database system.
How do we get the complete list of columns that exist in a column family?
The complete list of columns in a column family can be obtained only by querying all the rows for that column family.
When records are fetched from an HBase table, in which order are they sorted?
The records fetched from HBase are always sorted in the order of rowkey -> column family -> column qualifier -> timestamp (newest first).
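
This ordering can be reproduced with a sort key. The sketch below is an illustration with made-up cells of the form (rowkey, family, qualifier, timestamp); HBase compares raw bytes, which gives the same order as Python string comparison for these ASCII values.

```python
cells = [
    ("row2", "cf", "a", 5),
    ("row1", "cf", "b", 7),
    ("row1", "cf", "a", 3),
    ("row1", "cf", "a", 9),
]

# Ascending by rowkey, family, and qualifier; descending by timestamp,
# so the newest version of a cell comes first.
ordered = sorted(cells, key=lambda c: (c[0], c[1], c[2], -c[3]))
```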