Top Hadoop Interview Q&A
Apache Hadoop is an open-source software framework used to manage data processing and storage in
big data applications. Hadoop helps analyze vast amounts of data in parallel and more quickly.
Apache Hadoop was introduced to the public in 2012 by the Apache Software Foundation (ASF).
Hadoop is economical to use because data is stored on affordable commodity servers that run as
clusters.
Before the digital era, the volume of data gathered was small and could be examined and stored
with a single storage format; data received for similar purposes also had the same format.
However, with the development of the Internet and digital platforms like social
media, data now comes in multiple formats (structured, semi-structured, and unstructured), and its
velocity has also grown massively. This data came to be known as big data. The need then arose
for multiple processors and storage units to handle the big data, and Hadoop was introduced as the
solution.
Simply put, big data refers to larger, more complex data sets, particularly from new data sources. These data
sets are so large that conventional data processing software can’t manage them. But these massive
volumes of data can be used to address business problems you wouldn’t have been able to tackle
before.
Hadoop is a famous big data tool utilized by many companies globally. A few successful Hadoop
users include:
Uber
The Bank of Scotland
Netflix
The National Security Agency (NSA) of the United States
Twitter
HDFS, the Hadoop Distributed File System, is the storage layer of Hadoop. The files in HDFS are
split into block-sized parts called data blocks. These blocks are saved on the slave nodes in the
cluster. By default, the size of a block is 128 MB, which can be configured as per our
requirements. HDFS follows the master-slave architecture and contains two daemons: the NameNode
and the DataNodes.
NameNode
The NameNode is the master daemon that operates on the master node. It saves the filesystem
metadata, that is, file names, data about the blocks of a file, block locations, permissions, etc. It
manages the DataNodes.
DataNode
The DataNodes are the slave daemons that operate on the slave nodes. They save the actual business
data and serve client read/write requests based on the NameNode's instructions. They store the
blocks of the files, while the NameNode stores the metadata such as block locations and permissions.
Fault Tolerance
Hadoop framework divides data into blocks and creates various copies of blocks on several
machines in the cluster. So, when any device in the cluster fails, clients can still access their
data from the other machine containing the exact copy of data blocks.
High Availability
In the HDFS environment, data is duplicated by generating copies of the blocks. So,
whenever a user wants to obtain this data, or in case of an unfortunate situation such as a node
failure, the user can simply access the data from other nodes, because duplicate copies of the
blocks are already present on other nodes of the HDFS cluster.
High Reliability
HDFS splits the data into blocks; these blocks are stored by the Hadoop framework on nodes
existing in the cluster. It saves data by generating a duplicate of every block present in the
cluster, hence providing fault tolerance. By default, it creates 3 replicas of each block
of data present on the nodes. Therefore, the data is promptly available to the
users, and the user does not face the difficulty of data loss. Hence, HDFS is very reliable.
Replication
Replication resolves the problem of data loss in adverse conditions like device failure, crashing
of nodes, etc. It manages the process of replication at frequent intervals of time. Thus, there is
a low probability of a loss of user data.
Scalability
HDFS stores the data on multiple nodes. So, in case of an increase in demand, the
cluster can be scaled.
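As a quick, hedged illustration of block size and replication (assuming a running cluster, a local file named data.txt, and placeholder paths), both settings can be controlled and inspected from the command line:
hadoop fs -D dfs.blocksize=268435456 -put data.txt /user/demo/data.txt    # store the file with 256 MB blocks instead of the default 128 MB
hadoop fs -setrep -w 2 /user/demo/data.txt                                # change this file's replication factor to 2 and wait until done
hadoop fs -stat "block size: %o, replication: %r" /user/demo/data.txt     # print the block size and replication factor actually recorded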
HDFS: HDFS is a distributed file system that is mainly used to store data on commodity hardware.
NAS: NAS is a file-level computer data storage server connected to a computer network that provides network access to a heterogeneous group of clients.
Hadoop MapReduce is a software framework for processing enormous data sets. It is the main
component for data processing in the Hadoop framework. It divides the input data into several parts
and runs a program on every data part in parallel. The word MapReduce refers to two
separate and distinct tasks: Map and Reduce.
Let us take an example of a text file called example_data.txt and understand how MapReduce
works.
Now, assume we have to find out the word count on the example_data.txt using MapReduce. So, we
will be looking for the unique words and the number of times those unique words appeared.
First, we break the input into three divisions, as seen in the figure. This will share the work
among all the map nodes.
Then, all the words are tokenized in each of the mappers, and a hardcoded value (1) is given to
each of the tokens. The reason behind giving a hardcoded value equal to 1 is that every word, by
itself, occurs at least once.
Now, a list of key-value pairs will be created where the key is nothing but the individual word
and the value is one. So, for the first line (Coding Ice Jamming), we have three key-value pairs:
Coding, 1; Ice, 1; Jamming, 1.
The mapping process remains the same on all the nodes.
Next, a partition process occurs, where sorting and shuffling take place so that all the tuples with the
same key are sent to the same reducer.
Subsequent to the sorting and shuffling phase, every reducer will have a unique key and a list of
values matching that very key. For example, Coding, [1,1]; Ice, [1,1,1].., etc.
Now, each Reducer adds the values which are present in that list of values. As shown in the
example, the reducer gets a list of values [1,1] for the key Jamming. Then, it adds the number of
ones in the same list and gives the final output as – Jamming, 2.
Lastly, all the output key/value pairs are then assembled and written in the output file.
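The same word count can be reproduced with the example JAR that ships with Hadoop. This is only a sketch: the exact JAR path and version suffix vary by distribution, and example_data.txt and the /wordcount paths are assumed names.
hadoop fs -mkdir -p /wordcount/input                                      # create the input directory in HDFS
hadoop fs -put example_data.txt /wordcount/input/                         # stage the local text file
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /wordcount/input /wordcount/output
hadoop fs -cat /wordcount/output/part-r-00000                             # each line is a word and its count, e.g. Jamming 2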
In Hadoop MapReduce, shuffling is used to transfer data from the mappers to the appropriate
reducers. It is the process in which the system sorts the unstructured map output and transfers it
as input to the reducer. It is a significant process for reducers; otherwise, they would
not receive any input. Moreover, since shuffling can begin even before the map phase is
completed, it helps to save time and complete the job in a lesser amount of time.
Apache Spark comprises the Spark Core Engine, Spark Streaming, MLlib, GraphX, Spark SQL, and
Spark R.
The Spark Core Engine can be used along with any of the other five components mentioned. It is not
required to use all the Spark components together; depending on the use case and requirement, one
or more of them can be used along with Spark Core.
Hive is an open-source system that processes structured data in Hadoop, residing on top of it
to summarize big data and facilitate analysis and queries. In addition, Hive enables SQL developers
to write Hive Query Language (HQL) statements, similar to standard SQL statements, for data query and
analysis. It was created to make MapReduce programming easier, because you don't have to know and write
lengthy Java code.
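As a brief, hedged sketch (the employees table and the HQL file name are assumed examples; newer installations may use beeline instead of the classic hive CLI), HQL can be run straight from the shell:
hive -e "SELECT department, COUNT(*) FROM employees GROUP BY department;"   # run a single HQL statement
hive -f daily_report.hql                                                    # run a file containing HQL statements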
The Apache Pig architecture includes a Pig Latin interpreter that uses Pig Latin scripts to process and
analyze massive datasets. Programmers use the Pig Latin language to examine huge datasets in the
Hadoop environment. Apache Pig has a rich set of operators for performing different data operations
like join, filter, sort, load, group, etc.
Programmers must use the Pig Latin language to write a Pig script to perform a particular task.
Pig transforms these Pig scripts into a series of MapReduce jobs to reduce programmers' work. Pig
Latin programs are executed via various mechanisms such as UDFs, embedded programs, and the Grunt shell.
Parser: The Parser handles the Pig Scripts and checks the syntax of the script.
Optimizer: The optimizer receives the logical plan (DAG) and carries out logical
optimizations such as projection and pushdown.
Compiler: The compiler converts the logical plan into a series of MapReduce jobs.
Execution Engine: In the end, the MapReduce jobs get submitted to Hadoop in sorted order.
Execution Mode: Apache Pig is executed in local or MapReduce mode. The selection of
execution mode depends on where the data is stored and where you want to run the Pig script, as shown below.
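A minimal sketch of selecting the execution mode from the command line (wordcount.pig is an assumed script name):
pig -x local wordcount.pig        # local mode: reads data from the local file system, handy for testing
pig -x mapreduce wordcount.pig    # MapReduce mode (the default): runs against data stored in HDFS
pig -x local                      # no script given: opens the interactive Grunt shell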
YARN stands for Yet Another Resource Negotiator. It is the resource management layer of Hadoop
and was introduced in Hadoop 2.x. YARN supports many data processing engines, such as graph
processing, batch processing, interactive processing, and stream processing, to execute and process
data saved in the Hadoop Distributed File System. YARN also offers job scheduling. It extends the
capability of Hadoop to other evolving technologies so that they can take good advantage of HDFS
and economical clusters.
Apache YARN is the data operating system for Hadoop 2.x. It consists of a master daemon known as
the ResourceManager, a slave daemon called the NodeManager, and the ApplicationMaster.
Resource Manager: It runs as the master daemon and controls resource allocation in the
cluster.
Node Manager: It runs on the slave daemons and executes tasks on every DataNode.
Application Master: It manages the user job lifecycle and resource needs of individual
applications. It works with the Node Manager and monitors the execution of tasks.
Container: It is a combination of resources, including RAM, CPU, Network, HDD, etc., on a single
node.
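A few standard YARN commands (a sketch run on a cluster node; <application_id> is a placeholder) show these daemons in action:
yarn node -list                               # NodeManagers currently registered with the ResourceManager
yarn application -list                        # applications the ResourceManager is tracking
yarn application -status <application_id>    # lifecycle details negotiated by the ApplicationMaster
yarn logs -applicationId <application_id>    # aggregated logs of the containers that ran the application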
Apache ZooKeeper is an open-source service that helps manage a huge set of hosts.
Management and coordination in a distributed environment are complex. ZooKeeper automates this
process and enables developers to concentrate on building software features rather than worrying
about its distributed nature.
Simple distributed coordination process: The coordination process among all nodes in
Zookeeper is straightforward.
Synchronization: Mutual exclusion and co-operation among server processes.
Ordered Messages: ZooKeeper stamps each update with a number denoting its order, so all
messages are ordered.
Serialization: Data is encoded according to specific rules, ensuring your application runs
consistently.
Reliability: The zookeeper is very reliable. In case of an update, it keeps all the data until
forwarded.
Atomicity: Data transfer either succeeds or fails, but no transaction is partial.
Persistent Znodes:
The default znode in ZooKeeper is the persistent znode. It stays in the ZooKeeper
server permanently until a client explicitly deletes it.
Ephemeral Znodes:
These are temporary znodes. An ephemeral znode is destroyed whenever the creator client logs out of the
ZooKeeper server. For example, assume client1 created eznode1. Once client1 logs out of the
ZooKeeper server, eznode1 gets destroyed.
Sequential Znodes:
A sequential znode is assigned a 10-digit number, in numerical order, appended to the end of its name.
Assume client1 created sznode1. In the ZooKeeper server, sznode1 will be named like
this:
sznode0000000001
If client1 creates another sequential znode, it will get the next number in the
sequence, so the subsequent sequential znode is named <znode name>0000000002.
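All three znode types can be tried out from the ZooKeeper command-line client; a small sketch assuming a local server on the default port:
zkCli.sh -server localhost:2181    # open the ZooKeeper client shell
create /pznode "some data"         # persistent znode (the default type)
create -e /eznode1 "temp data"     # ephemeral znode: deleted when this client session ends
create -s /sznode "ordered data"   # sequential znode: created as /sznode0000000001, /sznode0000000002, ...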
C) cat: The cat command is used to display the contents of a file present in an HDFS
directory.
hadoop fs -cat /path_to_file_in_hdfs
D) mv: The HDFS mv command moves files or directories from a source to a destination within
HDFS.
hadoop fs -mv <src> <dest>
E) copyToLocal: This command copies a file from an HDFS directory to the local file
system.
hadoop fs -copyToLocal <hdfs source> <localdst>
F) get: Copies the file from the Hadoop File System to the Local File System.
hadoop fs -get <src> <localdst>
It is a tool used for copying a very large amount of data to and from Hadoop file systems in
parallel. It uses MapReduce to effect its distribution, error handling, recovery, and reporting. It
expands a list of files and directories into input for map tasks, each of which copies a partition of
the files specified in the source list.
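A hedged sketch of DistCp usage (cluster addresses and paths are placeholders):
hadoop distcp hdfs://namenode1:8020/source/path hdfs://namenode2:8020/destination/path            # copy a directory tree between clusters using map tasks
hadoop distcp -update hdfs://namenode1:8020/source/path hdfs://namenode2:8020/destination/path    # copy only files that are missing or changed at the destination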
By default, the size of an HDFS data block is 128 MB. The reasons for the large block size are:
To reduce the cost of seeks: With large blocks, the time taken to transfer the data from the
disk is much longer than the time taken to seek to the start of the block. As a result,
data is transferred at the disk transfer rate.
If the blocks were small, there would be too many blocks in Hadoop HDFS and too much
metadata to store. Managing such a vast number of blocks and metadata would create overhead
and lead to network traffic.
Hadoop provides an option where a particular set of bad input records can be skipped when
processing map inputs. Applications can control this feature through the SkipBadRecords class.
This feature can be used when map tasks fail deterministically on particular inputs, which usually
happens due to faults in the map function. The user would then have to fix these issues.
26. Where does the NameNode server store the two types of metadata?
The NameNode server stores its metadata in two places: on disk and in RAM.
Metadata is linked to two files which are:
EditLogs: It contains all the recent changes made to the file system since the most recent FsImage.
FsImage: It contains the whole state of the namespace of the file system from the origination of
the NameNode.
As soon as a file is deleted from HDFS, the NameNode immediately records this change in the EditLog.
The Secondary NameNode continuously reads all the file system metadata present in the NameNode's
RAM and writes it to the file system or hard disk. The EditLogs are combined with the FsImage in
the NameNode: periodically, the Secondary NameNode downloads the EditLogs from the NameNode and
applies them to the FsImage. The new FsImage is then copied back to the NameNode and is used the
next time the NameNode starts.
27. Which command is used to find the status of the blocks and file-system health?
The command used to find the status of the blocks is: hdfs fsck <path> -files -blocks
And the command used to find the file-system health is: hdfs fsck / -files -blocks -locations > dfs-fsck.log
The dfsadmin tools are a specific set of tools designed to help you root out information about your
Hadoop Distributed File system (HDFS). As a bonus, you can use them to perform some
administration operations on HDFS as well.
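A few commonly used dfsadmin operations (a sketch, run with HDFS administrator privileges):
hdfs dfsadmin -report          # capacity, remaining space, and per-DataNode status of the cluster
hdfs dfsadmin -safemode get    # check whether the NameNode is currently in safe mode
hdfs dfsadmin -refreshNodes    # re-read the include/exclude host files after commissioning or decommissioning nodes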
Distributed Cache is a significant feature provided by the MapReduce Framework, practiced when
you want to share the files across all nodes in a Hadoop cluster. These files can be jar files or simple
properties files.
Hadoop's MapReduce framework provides the facility to cache small to moderate read-only files such
as text files, zip files, jar files, etc., and distribute them to all the DataNodes (worker nodes)
where MapReduce jobs are running. Every DataNode gets a local copy of the file sent via the
Distributed Cache.
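One common way to populate the Distributed Cache from the command line is the -files generic option. This is a sketch: it assumes the driver class uses ToolRunner/GenericOptionsParser, and my-job.jar, MyDriver, lookup.txt, and ref-data.zip are illustrative names.
hadoop jar my-job.jar MyDriver -files lookup.txt /input /output        # ships lookup.txt to every node running the job's tasks as a local file
hadoop jar my-job.jar MyDriver -archives ref-data.zip /input /output   # ships an archive and unpacks it on every node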
Both the JobTracker and the NameNode detect the failure and identify which blocks were on the
failed DataNode.
All tasks scheduled on the failed node are rescheduled by locating other DataNodes with copies of
these blocks.
The NameNode replicates the user's data to another node to maintain the configured
replication factor.
The primary parameters of a mapper are LongWritable and Text, and Text and IntWritable. The first two
represent the input key-value parameters, and the other two signify the intermediate output key-value parameters.
34. Mention the main configuration parameters that have to be specified by the
user to run MapReduce.
The chief configuration parameters that the user of the MapReduce framework needs to specify are:
Job’s input Location
Job’s Output Location
The Input format
The Output format
The Class including the Map function
The Class including the reduce function
JAR file, which includes the mapper, the Reducer, and the driver classes.
Resilient Distributed Datasets (RDDs) are the basic data structure of Apache Spark, embedded in Spark
Core. They are immutable and fault-tolerant. RDDs are created by transforming existing
RDDs or by loading an external dataset from stable storage like HDFS or HBase.
Since they are distributed collections of objects, they can be operated on in parallel. Resilient
Distributed Datasets are divided into partitions so that they can be executed on various nodes of a
cluster.
36. Give a brief on how Spark is good at low-latency workloads like graph
processing and machine learning.
Apache Spark stores data in memory for faster processing and for building machine learning models,
which may require machine learning algorithms to run multiple iterations and
various conceptual steps to create an optimized model. In the case of graph algorithms, Spark
traverses all the nodes and edges to build the graph. These low-latency workloads, which need many
iterations, benefit from the improved performance.
37. What applications are supported by Apache Hive?
Java
PHP
Python
C++
Ruby
The Metastore is used to store Hive's metadata information; it is also possible to use an RDBMS and the
open-source ORM layer, which converts object representations into a relational schema. It is the central
repository of Apache Hive metadata. It stores metadata for Hive tables (such as their schema and
location) and partitions in a relational database, and it gives clients access to this information via
the metastore service API. Disk storage for the Hive metadata is separate from HDFS storage.
Local mode: The metastore can also connect to a separate database running in a separate JVM on the same or a separate machine.
Remote mode: The main advantage of remote mode over local mode is that remote mode does not require the administrator to share JDBC login information for the metastore database.
Apache Hive organizes tables into partitions. Partitioning is the manner in which a table is split into
related parts based on the values of particular columns like date, city, and department.
Every table in Hive can have one or more partition keys to identify a distinct partition.
With the help of partitions, it is effortless to run queries on slices of the data.
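A short, hedged HiveQL sketch of partitioning (the sales table and its columns are illustrative):
hive -e "CREATE TABLE sales (id INT, amount DOUBLE) PARTITIONED BY (sale_date STRING, city STRING);"
hive -e "SELECT SUM(amount) FROM sales WHERE sale_date = '2023-01-01';"    # only the matching partition slice is scanned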
Heterogeneity: The design of applications should allow users to access services and run
applications over a heterogeneous collection of computers and networks, taking into
consideration hardware devices, operating systems, networks, and programming languages.
Transparency: Distributed system designers must hide the complexity of the system as much
as they can. Some forms of transparency are location, access, migration, relocation, and so on.
Openness: It is a characteristic that determines whether the system can be extended and
reimplemented in various ways.
Security: Distributed system Designers must take care of confidentiality, integrity, and
availability.
Scalability: A system is said to be scalable if it can handle the addition of users and resources
without suffering a noticeable loss of performance.
43. How can you restart the NameNode and all the daemons in Hadoop?
The following commands will help you restart the NameNode and all the daemons:
You can stop the NameNode with the ./sbin/hadoop-daemon.sh stop namenode command and
then start the NameNode using the ./sbin/hadoop-daemon.sh start namenode command.
You can stop all the daemons with the ./sbin/stop-all.sh command and then start the daemons
using the ./sbin/start-all.sh command.
44. How do you differentiate an inner bag and an outer bag in Pig?
Outer bag: A relation in Pig is a bag of tuples, which is nothing but an outer bag.
Example: {(park, New York), (Hollywood, Los Angeles)}
Inner bag: A bag nested inside a tuple of a relation.
Example: (4,{(4,2,1),(4,3,3)}). In this example, the complete relation is an outer bag and {(4,2,1),(4,3,3)} is an inner bag.
If the source data gets updated in a very short interval of time, the synchronization of data in HDFS
that is imported by Sqoop is done with the help of incremental parameters.
We should use incremental import with the append option when the table is continuously being
refreshed with new rows. Here, a check column is examined, and only rows whose value in that
column is greater than the last imported value are imported as new rows. With the lastmodified
option, the source has a date column that is examined for all records modified after the last
import, based on the previously recorded last value; those records are then updated.
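A hedged sketch of both incremental modes (the connection string, table, and column names are placeholders):
# Append mode: import only rows whose id is greater than the last imported value
sqoop import --connect jdbc:mysql://dbhost/sales --username user -P --table orders --incremental append --check-column id --last-value 1000
# Lastmodified mode: import rows modified after the last import and merge them on the id key
sqoop import --connect jdbc:mysql://dbhost/sales --username user -P --table orders --incremental lastmodified --check-column updated_at --last-value "2023-01-01 00:00:00" --merge-key id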
47. What is the default file format to import data using Apache Sqoop?
There are basically two file formats in which Sqoop allows data to be imported: delimited text
files (the default) and SequenceFiles.
The --compression-codec parameter is generally used to get the output files of a Sqoop import
compressed in a format other than the default .gz (gzip).
1. Flume Source
2. Flume Channel
3. Flume Sink
4. Flume Agent
5. Flume Event
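These components are wired together in an agent configuration file, and the agent is then started from the shell; a minimal sketch (the conf directory, flume-demo.conf, and the agent name a1 are assumptions):
flume-ng agent --conf conf --conf-file conf/flume-demo.conf --name a1 -Dflume.root.logger=INFO,console    # starts agent a1 with the source, channel, and sink defined in the file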