Hadoop Interview Questions and Answers


Question.1  What is Hadoop?

Answer:  Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of the Google File System and of MapReduce. For more details, see the HadoopMapReduce wiki page.
Question.2   What platforms and Java versions does Hadoop run on?
Answer:   Java 1.6.x or higher, preferably from Sun; see HadoopJavaVersions.
Linux and Windows are the supported operating systems, but BSD, Mac OS X, and OpenSolaris are also known to work. (Windows requires the installation of Cygwin.)

Question.3   How well does Hadoop scale?
Answer:   Hadoop has been demonstrated on clusters of up to 4000 nodes. Sort performance on 900 nodes is good (sorting 9TB of data on 900 nodes takes around 1.8 hours) and can be improved using these non-default configuration values:
dfs.block.size = 134217728
dfs.namenode.handler.count = 40
mapred.reduce.parallel.copies = 20
mapred.child.java.opts = -Xmx512m
fs.inmemory.size.mb = 200
io.sort.factor = 100
io.sort.mb = 200
io.file.buffer.size = 131072
Sort performance on 1400 and 2000 nodes is also good: sorting 14TB of data on a 1400-node cluster takes 2.2 hours, and sorting 20TB on a 2000-node cluster takes 2.5 hours. The updates to the above configuration for those runs are:
mapred.job.tracker.handler.count = 60
mapred.reduce.parallel.copies = 50
tasktracker.http.threads = 50
mapred.child.java.opts = -Xmx1024m
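The cluster-wide settings above (e.g. dfs.block.size and the handler counts) belong in the *-site.xml configuration files, but the per-job knobs can also be set programmatically. Below is a minimal sketch, assuming the old mapred Java API (Hadoop 0.20.x era); the SortJobTuning class name is illustrative:
import org.apache.hadoop.mapred.JobConf;

public class SortJobTuning {
  // Illustrative helper: builds a JobConf carrying some of the per-job tuning values above.
  public static JobConf tunedConf() {
    JobConf conf = new JobConf(SortJobTuning.class);
    conf.setInt("mapred.reduce.parallel.copies", 20); // parallel map-output fetches per reduce
    conf.set("mapred.child.java.opts", "-Xmx512m");   // task JVM heap size
    conf.setInt("io.sort.factor", 100);               // number of streams merged at once
    conf.setInt("io.sort.mb", 200);                   // in-memory sort buffer, in MB
    return conf;
  }
}
Values set this way apply only to the submitted job; they do not change the cluster-wide defaults.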
Question.4   What kind of hardware scales best for Hadoop?
Answer:   The short answer is dual-processor/dual-core machines with 4-8GB of RAM using ECC memory, depending upon workflow needs. To be most cost-effective, machines should be moderately high-end commodity hardware, typically 1/2 to 2/3 the cost of normal production application servers but not desktop-class machines. This cost tends to be $2-5K. For a more detailed discussion, see the MachineScaling page.
Question.5   I have a new node I want to add to a running Hadoop cluster; how do I start services on just one node?
Answer:   This also applies to the case where a machine has crashed and rebooted and you need it to rejoin the cluster; you do not need to shut down and/or restart the entire cluster in this case.
First, add the new node’s DNS name to the conf/slaves file on the master node.
Then log in to the new slave node and execute:
$ cd path/to/hadoop
$ bin/hadoop-daemon.sh start datanode
$ bin/hadoop-daemon.sh start tasktracker
If you are using the dfs.include/mapred.include functionality, you will additionally need to add the node to the dfs.include/mapred.include file, then issue hadoop dfsadmin -refreshNodes and hadoop mradmin -refreshNodes so that the NameNode and JobTracker know of the additional node that has been added.
Question.6   Is there an easy way to see the status and health of a cluster?
Answer:   There are web-based interfaces to both the JobTracker (MapReduce master) and the NameNode (HDFS master) which display status pages about the state of the entire system. The JobTracker status page displays the state of all nodes, as well as the job queue and the status of all currently running jobs and tasks. The NameNode status page displays the state of all nodes and the amount of free space, and provides the ability to browse the DFS via the web.
You can also see some basic HDFS cluster health data by running:
$ bin/hadoop dfsadmin -report
Question.7   How much network bandwidth might I need between racks in a medium size (40-80 node) Hadoop cluster?
Answer:   The true answer depends on the types of jobs you’re running. As a back-of-the-envelope calculation, one might figure something like this:
60 nodes total on 2 racks = 30 nodes per rack.
Each node might process about 100MB/sec of data.
In the case of a sort job where the intermediate data is the same size as the input data, each node needs to shuffle 100MB/sec of data.
In aggregate, each rack is then producing about 3GB/sec of data.
However, given an even reducer spread across the racks, each rack will need to send 1.5GB/sec to reducers running on the other rack.
Since the connection is full duplex, that means you need 1.5GB/sec of bisection bandwidth for this theoretical job, i.e. 12Gbps.
However, the above calculations are probably somewhat of an upper bound. A large number of jobs see significant data reduction during the map phase, either through filtering/selection in the Mapper itself or through good usage of Combiners. Additionally, intermediate data compression can cut the intermediate data transfer by a significant factor. Lastly, although your disks can probably provide 100MB/sec sustained throughput, it’s rare to see an MR job which can sustain disk-speed I/O through the entire pipeline. So, I’d say my estimate is at least a factor of 2 too high.
So, the simple answer is that 4-6Gbps is most likely just fine for most practical jobs. If you want to be extra safe, many inexpensive switches can operate in a “stacked” configuration where the bandwidth between them is essentially backplane speed. That should scale you to 96 nodes with plenty of headroom. Many inexpensive gigabit switches also have one or two 10GigE ports which can be used effectively to connect to each other or to a 10GE core.
Question.8   How can I help to make Hadoop better?
Answer:   If you have trouble figuring out how to use Hadoop, then, once you’ve figured something out (perhaps with the help of the mailing lists), pass that knowledge on to others by adding something to the Hadoop wiki.
If you find something that you wish were done better, and know how to fix it, read HowToContribute, and contribute a patch.
Question.9    I am seeing connection refused in the logs. How do I troubleshoot this?
Answer:   See the ConnectionRefused wiki page.
Question.10   Why is the ‘hadoop.tmp.dir’ config default user.name dependent?
Answer:    We need a directory that a user can write to without interfering with other users. If we didn’t include the username, then different users would share the same tmp directory. This can cause authorization problems if folks’ default umask doesn’t permit writes by others. It can also result in folks stomping on each other when they’re, e.g., playing with HDFS and re-formatting their filesystem.
Question.11   Does Hadoop require SSH?
Answer:   The Hadoop-provided scripts (e.g., start-mapred.sh and start-dfs.sh) use ssh to start and stop the various daemons, as do some other utilities. The Hadoop framework in itself does not require ssh; daemons (e.g. the TaskTracker and DataNode) can also be started manually on each node without the scripts’ help.
Question.12   What mailing lists are available for more help?
Answer:  The general list is for people interested in the administrivia of Hadoop (e.g., new release discussions).
user@hadoop.apache.org is for people using the various components of the framework.
The *-dev mailing lists are for people who are changing the source code of the framework. For example, if you are implementing a new file system and want to know about the FileSystem API, hdfs-dev would be the appropriate mailing list.
Question.13  What does “NFS: Cannot create lock on (some dir)” mean?
Answer:     This actually is not a problem with Hadoop, but rather a problem with the setup of the environment in which it is operating.
Usually, this error means that the NFS server to which the process is writing does not support file system locks. NFS prior to v4 requires a locking service daemon to run (typically rpc.lockd) in order to provide this functionality. NFSv4 has file system locks built into the protocol.
In some (rarer) instances, it might represent a problem with certain Linux kernels that did not implement the flock() system call properly.
It is highly recommended that the only NFS mount in a Hadoop setup be the location where the NameNode writes a secondary or tertiary copy of the fsimage and edits log. For optimal performance, all other uses of NFS are not recommended.
Question.14    Do I have to write my job in Java?
Answer:   No. There are several ways to incorporate non-Java code.
HadoopStreaming permits any shell command to be used as a map or reduce function.
libhdfs, a JNI-based C API for talking to hdfs (only).
Hadoop Pipes, a SWIG-compatible C++ API (non-JNI) to write map-reduce jobs.
Question.15   How do I submit extra content (jars, static files, etc) for my job to use during runtime?
Answer:   The distributed cache feature is used to distribute large read-only files that are needed by map/reduce jobs to the cluster. The framework will copy the necessary files from a URL (either hdfs: or http:) on to the slave node before any tasks for the job are executed on that node. The files are only copied once per job and so should not be modified by the application.
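A minimal sketch of using the distributed cache from a job driver, assuming the old mapred API (org.apache.hadoop.filecache.DistributedCache); the HDFS paths and the MyJobDriver class are illustrative:
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class MyJobDriver {
  public static void addCachedContent(JobConf conf) throws Exception {
    // Put an extra jar (already uploaded to HDFS) on the tasks' classpath.
    DistributedCache.addFileToClassPath(new Path("/apps/lib/extra.jar"), conf);
    // Ship a read-only lookup file; "#lookup" exposes it as a symlink named "lookup"
    // in each task's working directory once symlink creation is enabled.
    DistributedCache.createSymlink(conf);
    DistributedCache.addCacheFile(new URI("/apps/data/lookup.txt#lookup"), conf);
  }
}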
For streaming, see the HadoopStreaming wiki for more information.
Copying content into Hadoop's lib directory is highly discouraged; changes in that directory will require Hadoop services to be restarted.
Question.16   How do I get my MapReduce Java Program to read the Cluster’s set configuration and not just defaults?
Answer:   The configuration property files ({core|mapred|hdfs}-site.xml) available in the various conf/ directories of your Hadoop installation need to be on the CLASSPATH of your Java application for them to be found and applied. Another way of ensuring that no set configuration gets overridden by any job is to mark those properties as final; for example:
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>400</value>
  <final>true</final>
</property>
Setting configuration properties as final is a common thing Administrators do, as is noted in the Configuration API docs.
A better alternative would be to have a service serve up the cluster’s configuration to you upon request, in code. https://issues.apache.org/jira/browse/HADOOP-5670 may be of some interest in this regard.
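If placing the conf/ directory on the CLASSPATH is inconvenient, a minimal alternative sketch is to add the site files as explicit resources; the /etc/hadoop/conf path below is illustrative and depends on your installation:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ClusterConfLoader {
  // Illustrative helper: loads the cluster's site files on top of the built-in defaults.
  public static Configuration load() {
    Configuration conf = new Configuration();
    conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
    conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
    conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));
    return conf;
  }
}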
Question.17   Can I create/write-to hdfs files directly from map/reduce tasks?
Answer:   Yes. (Clearly, you want this since you need to create/write-to files other than the output-file written out by OutputCollector.)
Caveats:
${mapred.output.dir} is the eventual output directory for the job (JobConf.setOutputPath / JobConf.getOutputPath).
${taskid} is the actual id of the individual task-attempt (e.g. task_200709221812_0001_m_000000_0), a TIP is a bunch of ${taskid}s (e.g. task_200709221812_0001_m_000000).
With speculative-execution on, one could face issues with 2 instances of the same TIP (running simultaneously) trying to open/write-to the same file (path) on hdfs. Hence the app-writer will have to pick unique names (e.g. using the complete taskid i.e. task_200709221812_0001_m_000000_0) per task-attempt, not just per TIP. (Clearly, this needs to be done even if the user doesn’t create/write-to files directly via reduce tasks.)
To get around this the framework helps the application-writer out by maintaining a special ${mapred.output.dir}/_${taskid} sub-dir for each reduce task-attempt on hdfs where the output of the reduce task-attempt goes. On successful completion of the task-attempt the files in the ${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to ${mapred.output.dir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is completely transparent to the application.
The application-writer can take advantage of this by creating any side-files required in ${mapred.output.dir} during execution of his reduce-task, and the framework will move them out similarly – thus you don’t have to pick unique paths per task-attempt.
Fine-print: the value of ${mapred.output.dir} during execution of a particular reduce task-attempt is actually ${mapred.output.dir}/_${taskid}, not the value set by JobConf.setOutputPath. So, just create any hdfs files you want in ${mapred.output.dir} from your reduce task to take advantage of this feature.
For map task attempts, the automatic substitution of ${mapred.output.dir}/_${taskid} for ${mapred.output.dir} does not take place. You can still access the map task attempt directory, though, by using FileOutputFormat.getWorkOutputPath(TaskInputOutputContext). Files created there will be dealt with as described above.
The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces) since output of the map, in that case, goes directly to hdfs.
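A minimal sketch of the side-file pattern described above, assuming the old mapred API; the SideFileReducer class and the side-data.txt file name are illustrative:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SideFileReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
  private FSDataOutputStream side;

  public void configure(JobConf job) {
    try {
      // ${mapred.output.dir}/_${taskid} for this attempt; its contents are promoted to
      // ${mapred.output.dir} only if the attempt succeeds.
      Path workDir = FileOutputFormat.getWorkOutputPath(job);
      FileSystem fs = workDir.getFileSystem(job);
      side = fs.create(new Path(workDir, "side-data.txt"));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output,
                     Reporter reporter) throws IOException {
    side.writeBytes("saw key: " + key + "\n"); // side output, written outside the OutputCollector
    while (values.hasNext()) {
      output.collect(key, values.next());      // normal job output
    }
  }

  public void close() throws IOException {
    side.close();
  }
}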
Question.18   How do I get each of a job’s maps to work on one complete input-file and not allow the framework to split-up the files?
Answer:   Essentially a job’s input is represented by the InputFormat (interface) / FileInputFormat (base class).
For this purpose one would need a ‘non-splittable’ FileInputFormat, i.e. an input format which tells the map-reduce framework that its files cannot be split up and processed in pieces. To do this, your particular input format needs to return false from the isSplitable call.
E.g. org.apache.hadoop.mapred.SortValidator.RecordStatsChecker.NonSplitableSequenceFileInputFormat in src/test/org/apache/hadoop/mapred/SortValidator.java
In addition to implementing the InputFormat interface and having isSplitable(…) return false, it is also necessary to implement the RecordReader interface so that the whole content of the input file is returned as one record (the default is LineRecordReader, which splits the file into separate lines). The other, quick-fix option is to set mapred.min.split.size to a large enough value.
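A minimal sketch of such a non-splittable input format, assuming the old mapred API; it reuses TextInputFormat (and hence LineRecordReader), so a whole-file RecordReader would still be needed to deliver the entire file as a single record:
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Illustrative input format: never split files, so each map processes one whole file.
public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}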
Question.19    Why do I see broken images in the jobdetails.jsp page?
Answer:    In hadoop-0.15, Map/Reduce task completion graphics were added. The graphs are produced as SVG (Scalable Vector Graphics) images, which are basically XML files embedded in HTML content. The graphics have been tested successfully in Firefox 2 on Ubuntu and Mac OS X; for other browsers, one should install an additional browser plugin to see the SVG images.
