Hadoop Interview Questions
Big Data is an assortment of data so huge and complex that it becomes very tedious to capture, store, process, retrieve and analyze it with on-hand database management tools or traditional data processing techniques. In fact, what counts as BIG DATA may vary from company to company depending on its size, capacity, competence, human resources, techniques and so on. For some companies, managing a few gigabytes is already a cumbersome job; for others, it takes some terabytes before the data creates a hassle for the entire organization.
HDFS
What is BIG DATA?
Big Data is an assortment of data so huge and complex that it becomes very tedious to capture, store, process, retrieve and analyze it with on-hand database management tools or traditional data processing techniques.
Can you give a detailed overview of the Big Data being generated by Facebook?
As of December 31, 2012, Facebook had 1.06 billion monthly active users and 680 million mobile users. On average, 3.2 billion likes and comments are posted on Facebook every day, and 72% of the web audience is on Facebook. And why not! There are so many activities going on Facebook: wall posts, sharing images and videos, writing comments, liking posts, and so on. In fact, Facebook started using Hadoop in mid-2009 and was one of its initial users.
What is Hadoop?
Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
Technically speaking, Hadoop is an open-source software framework that supports data-intensive distributed applications. Hadoop is licensed under the Apache v2 license and is therefore generally known as Apache Hadoop. Hadoop was developed based on a paper originally written by Google about its MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and is a top-level Apache project built and used by a global community of contributors. Hadoop was developed by Doug Cutting and Michael J. Cafarella. And the charming yellow elephant you see is named after Doug's son's toy elephant!
Hadoop Ecosystem:
Once you are familiar with what Hadoop is, let's probe into its ecosystem. The Hadoop Ecosystem is nothing but the various components that make Hadoop so powerful, among which HDFS and MapReduce are the core components.
1. HDFS:
The Hadoop Distributed File System (HDFS) is a very robust component of Apache Hadoop. HDFS is designed to store gigantic amounts of data reliably and to transfer data at high speed between nodes, and it allows the system to continue working smoothly even if some of the nodes fail. HDFS takes care of where data is placed, how it is replicated and how it is read back, so that higher-level programs can process it and generate the final outcomes. In fact, HDFS manages around 40 petabytes of data at Yahoo! The key components of HDFS are the NameNode, the DataNodes and the Secondary NameNode.
2. MapReduce:
It all started with Google applying the concepts of functional programming to the problem of managing the large amounts of data on the internet. Google named its approach the MapReduce system and described it in a paper it published in 2004. MapReduce helped Google search and index its enormous quantity of web pages in a matter of seconds, or even a fraction of a second. With the ever-increasing amount of data generated on the web, Yahoo stepped in to develop Hadoop as an open-source implementation of the MapReduce technique. The key components of MapReduce are the JobTracker, the TaskTrackers and the JobHistoryServer.
3. Apache Pig:
Apache Pig is another component of Hadoop; it is used to analyze huge data sets and provides a high-level language for doing so. In fact, Pig was initiated with the idea of creating and executing commands on Big Data sets. The basic attribute of Pig programs is parallelization, which helps them manage large data sets. Apache Pig consists of a compiler that generates series of MapReduce programs and a language layer, Pig Latin, that facilitates SQL-like queries to be run on the distributed data stored in Hadoop.
4. Apache Hive:
As the name suggests, Hive is Hadoop's data warehouse system. It enables quick data summarization, handles queries and evaluates huge data sets located in Hadoop's file systems, while maintaining full support for map/reduce. Another striking feature of Apache Hive is that it provides indexes, such as bitmap indexes, in order to speed up queries. Apache Hive was originally developed by Facebook, but it is now developed and used by other companies too, including Netflix.
5. Apache HCatalog
Apache HCatalog is another important component of Apache Hadoop; it provides a table and storage management service for data created with Apache Hadoop. HCatalog offers features like a shared schema and data type mechanism, a table abstraction for users, and smooth interoperation with other components of Hadoop such as Pig, MapReduce, Streaming and Hive.
6. Apache HBase
HBase stands for Hadoop DataBase. HBase is a distributed, column-oriented database that uses HDFS for storage. On one hand it supports batch-style computations using MapReduce, and on the other hand it handles point queries (random reads). The key components of Apache HBase are the HBase Master and the RegionServers.
7. Apache Zookeeper
Apache ZooKeeper is another significant part of the Hadoop ecosystem. Its major function is to keep a record of configuration information and naming, and to provide distributed synchronization and group services, all of which are crucial for various distributed systems. In fact, HBase depends on ZooKeeper for its functioning.
What is HDFS?
HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. Whenever you store a file in HDFS, it automatically gets replicated at two other locations as well, so even if one or two of the systems collapse, the file is still available on the third system.
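For illustration, here is a minimal Java sketch (not part of the original answer; the cluster address and file path are hypothetical) of a client writing a file to HDFS and reading back its replication factor with the standard FileSystem API:

// A minimal sketch of writing a file to HDFS and inspecting its replication
// factor; the NameNode URI and path below are placeholders.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumes the NameNode is reachable at this (hypothetical) address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/user/demo/sample.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");
        }

        // Each file reports its replication factor (3 by default), i.e. the
        // number of physically separate copies HDFS keeps of every block.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());
    }
}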
Since the data is replicated thrice in HDFS, does it mean that any calculation
done on one node will also be replicated on the other two?
Since there are three replicas, when we send the MapReduce programs, the calculations are done only on the original data. The master node knows which node has that particular data. If one of the nodes is not responding, it is assumed to have failed, and only then is the required calculation done on the second replica.
What is a Namenode?
The Namenode is the master node on which the JobTracker runs, and it holds the metadata. It maintains and manages the blocks that are present on the Datanodes. It is a high-availability machine and the single point of failure in HDFS.
What is metadata?
Metadata is the information about the data stored in the Datanodes, such as the location of a file, the size of the file and so on.
What is a Datanode?
Datanodes are the slaves which are deployed on each machine and provide the actual storage. They are responsible for serving read and write requests from clients.
Why do we use HDFS for applications having large data sets and not when there are lots of small files?
HDFS is more suitable for a large amount of data stored in a single file than for small amounts of data spread across many files. This is because the Namenode is a very expensive, high-performance system that keeps the metadata for every file and block in memory, so it is not prudent to fill the Namenode with the unnecessary metadata generated by a multitude of small files. When a large amount of data is kept in a single file, the Namenode occupies less space. Hence, for optimized performance, HDFS favors large data sets over many small files.
What is a daemon?
A daemon is a process or service that runs in the background. The term is generally used in the UNIX environment; the equivalent of a daemon in Windows is a service, and in DOS it is a TSR.
What is a block in HDFS?
A block is the minimum amount of data that HDFS can read or write; files in HDFS are broken into block-sized chunks that are stored as independent units of the storage subsystem. Blocks provide fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.
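To make the block and replica layout concrete, the following sketch (assumed usage of the standard FileSystem API; the file path is hypothetical, not taken from the original answer) lists each block of a file and the datanodes holding its replicas:

// A minimal sketch that prints, for each block of a file, the datanodes that
// hold a replica of that block.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt"));

        // One BlockLocation per block; getHosts() returns the datanodes that
        // hold a replica of that block (three by default).
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                + " length=" + block.getLength()
                + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}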
If we want to copy 10 blocks from one machine to another, but the other machine can copy only 8.5 blocks, can the blocks be broken at the time of replication?
In HDFS, blocks cannot be broken down. Before copying blocks from one machine to another, the master node figures out the actual amount of space required, how many blocks are being used and how much space is available, and it allocates the blocks accordingly.
What is a rack?
A rack is a storage area in which datanodes are put together; it is a physical collection of datanodes stored at a single location. Different racks can be physically located at different places, and there can be multiple racks in a single location.
What is the difference between Gen1 and Gen2 Hadoop with regards to the
Namenode?
In Gen 1 Hadoop, the Namenode is the single point of failure. In Gen 2 Hadoop, we have an Active and Passive Namenode structure: if the Active Namenode fails, the Passive Namenode takes charge.
What is MapReduce?
MapReduce is the heart of Hadoop and consists of two parts: map and reduce. Maps and reduces are programs for processing data. Map processes the data first to produce intermediate output, which is further processed by Reduce to generate the final output. Thus, MapReduce allows for distributed processing of the map and reduce operations.
Why does the Namenode require a high-end machine while the Datanodes do not?
The Namenode stores the metadata about all the files stored on the different Datanodes, so it requires a reliable, high-configuration system; on the other hand, the Datanodes require only low-configuration systems.
HADOOP CLUSTER
Which are the three modes in which Hadoop can be run?
The three modes in which Hadoop can be run are:
1. Standalone (local) mode
2. Pseudo-distributed mode
3. Fully distributed mode
What are the port numbers of Namenode, job tracker and task tracker?
The default web UI port number for the Namenode is 50070, for the JobTracker it is 50030 and for the TaskTracker it is 50060.
What happens if you get a connection refused java exception when you type
hadoop fsck /?
It could mean that the Namenode is not working on your VM.
We are using the Ubuntu operating system with Cloudera, but from where can we download Hadoop, or does it come by default with Ubuntu?
Hadoop does not come by default with Ubuntu; it is a configuration that you have to download from Cloudera or from Edureka's dropbox and then run on your systems. You can also proceed with your own configuration, but you need a Linux box, be it Ubuntu or Red Hat. The installation steps are available at the Cloudera site or in Edureka's dropbox; you can go either way.
Can you give us some more details about SSH communication between Masters
and the Slaves?
SSH is a password-less, secure communication channel over which data packets are sent to the slaves in a defined format. SSH is not only used between masters and slaves; it can be set up between any two hosts.
MAP REDUCE
What is MapReduce?
It is a framework or a programming model that is used for processing large data sets over
clusters of computers using distributed programming.
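As an illustration of the two phases, here is the classic WordCount example written against the Java MapReduce API (a standard sketch, not code taken from the original answer): the map step emits (word, 1) pairs and the reduce step sums them.

// WordCount: map emits (word, 1); reduce sums the counts per word.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: emit (word, 1) for every word in the input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the intermediate counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}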
How can we change the split size if our commodity hardware has less
storage space?
If our commodity hardware has less storage space, we can change the split size by writing a custom splitter. Hadoop offers this kind of customization, which can be invoked from the main method of the job; the split size can also be capped through standard settings, as sketched below.
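As a lighter-weight alternative to writing a full custom splitter, the split size can be capped through the new-API FileInputFormat settings. The sketch below is an assumption-based illustration: the input path and the 32 MB cap are hypothetical values, not part of the original answer.

// Capping the input split size so each map task handles a smaller chunk.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-split-job");

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        // Cap each split at 32 MB (hypothetical figure) so a node with little
        // local storage processes smaller chunks of the input at a time.
        FileInputFormat.setMaxInputSplitSize(job, 32 * 1024 * 1024L);
        FileInputFormat.setMinInputSplitSize(job, 1L);
        // ... set mapper, reducer and output path, then job.waitForCompletion(true)
    }
}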
What is Streaming?
Streaming is a feature of the Hadoop framework that allows us to write MapReduce programs in any programming language that can accept standard input and produce standard output; it could be Perl, Python or Ruby and does not necessarily have to be Java. However, customization in MapReduce can only be done using Java and not any other programming language.
What is a Combiner?
A Combiner is a mini reducer that performs the local reduce task. It receives the input from the mapper on a particular node and sends its output on to the reducers. Combiners help enhance the efficiency of MapReduce by reducing the quantum of data that needs to be sent to the reducers.
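For example, a summing reducer can often be reused as the combiner. The sketch below is an assumed job wiring that reuses the mapper and reducer classes from the WordCount sketch earlier in this section; only the combiner-related setup is shown.

// Wiring a combiner into a job: counts are pre-aggregated on each mapper node
// before being shuffled across the network to the reducers.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CombinerWiring {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
        job.setMapperClass(WordCount.TokenizerMapper.class);
        // The same summing reducer works as a local "mini reducer" because
        // addition is associative and commutative.
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // ... add input/output paths, then job.waitForCompletion(true)
    }
}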
PIG
Can you give us some examples of how Hadoop is used in a real-time environment?
Let us assume that we have an exam consisting of 10 multiple-choice questions, and 20 students appear for that exam. Every student attempts each question. For each question and each answer option, a key is generated, so we have a set of key-value pairs for all the questions and all the answer options for every student. Based on the options that the students have selected, you have to analyze and find out how many students answered correctly. This isn't an easy task. Here Hadoop comes into the picture! Hadoop helps you solve these problems quickly and without much effort. You may also take the case of how many students wrongly attempted a particular question; a mapper for the first case is sketched below.
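The following is a hypothetical sketch of a mapper for this exam example. The input record layout (studentId, questionId, selectedOption, correctOption on each comma-separated line) is an assumption made purely for illustration; it emits (questionId, 1) whenever the selected option is correct, so a summing reducer yields the number of correct answers per question.

// Emits one (questionId, 1) pair per correctly answered question.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CorrectAnswerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text questionId = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed line format: studentId,questionId,selectedOption,correctOption
        String[] fields = value.toString().split(",");
        if (fields.length == 4 && fields[2].equals(fields[3])) {
            questionId.set(fields[1]);
            context.write(questionId, ONE); // one correct answer for this question
        }
    }
}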
What is PIG?
PIG is a platform for analyzing large data sets; it consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs. PIG's infrastructure layer consists of a compiler that produces sequences of MapReduce programs.
The compiler first builds a logical plan from the script and then produces a physical plan; the physical plan describes the physical operators that are needed to execute the script. In other words, once you write a program in Pig Latin, the Pig compiler will convert the program into MapReduce jobs.
Are there any problems which can only be solved by MapReduce and cannot be solved by PIG? In which kinds of scenarios will MR jobs be more useful than PIG?
Let us take a scenario where we want to count the population of two cities. I have a data set and a sensor list for different cities, and I want to count the population of two cities, say Bangalore and Noida, using one MapReduce job. To do this, I need to make the key for Bangalore look the same as the key for Noida, so that the population data of these two cities reaches one reducer. The idea is that I somehow have to instruct the MapReduce program: whenever you find a city with the name Bangalore or a city with the name Noida, create an alias that serves as a common key for the two cities, so that both get passed to the same reducer. For this, we have to write a custom partitioner.
In MapReduce, when you make the city the key, the framework treats every different city as a different key. Hence, we need a customized partitioner. MapReduce provides for this: you can write your own partitioner and specify that if city = Bangalore or Noida, the same hash code is returned, as sketched below. However, we cannot create a custom partitioner in Pig; since Pig is not a framework, we cannot direct the execution engine to customize the partitioner. In such scenarios, MapReduce works better than Pig.
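A minimal sketch of such a custom partitioner follows. The city names, key/value types and return values are illustrative assumptions rather than code from the original answer; the key idea is simply that the two cities map to the same partition and therefore the same reducer.

// Forces Bangalore and Noida into the same partition (and hence reducer).
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class CityPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String city = key.toString();
        // Treat the two cities as one alias so they share a reducer.
        if (city.equalsIgnoreCase("Bangalore") || city.equalsIgnoreCase("Noida")) {
            return 0;
        }
        // Default hash partitioning for every other city.
        return (city.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
// Wired into a job with: job.setPartitionerClass(CityPartitioner.class);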
Does Pig give any warning when there is a type mismatch or missing field?
No, Pig will not show any warning if there is a missing field or a type mismatch, and even if it did give such a warning, it would be difficult to find in the log file. If a mismatch is found, Pig assumes a null value.
What is the syntax of the FOREACH statement in Pig?
The syntax is: FOREACH bagname GENERATE expression1, expression2, .....
The meaning of this statement is that the expressions mentioned after GENERATE will be applied to
the current record of the data bag.
What is bag?
A bag is one of the data models in Pig. It is an unordered collection of tuples, with possible duplicates. Bags are used to store collections while grouping. A bag can grow up to the size of the local disk, which means its size is limited; when a bag becomes too large to hold in memory, Pig spills it to the local disk and keeps only parts of the bag in memory. There is no necessity that the complete bag fit into memory. We represent bags with {}.