
Hadoop Framework - Components and Uses

If you are learning about Big Data, you are bound to come across mentions of the "Hadoop
Framework". The rise of big data and its analytics has made the Hadoop framework very
popular. Hadoop is open-source software, meaning the bare software is freely available and
can be customized to individual needs.
This helps in tailoring the software to the specific needs of the big data that has to be handled.
As we know, big data is a term used to refer to huge volumes of data that cannot be stored,
processed, or analyzed using traditional mechanisms. This is due to several characteristics of
big data: it has a high volume, is generated at great speed, and comes in many varieties.

Since the traditional frameworks are ineffective in handling big data, new techniques had to
be developed to combat it. This is where the Hadoop framework comes in. The Hadoop
framework is primarily based on Java and is used to deal with big data.
The following components together form the Hadoop framework architecture.

https://hadoop.apache.org/docs/r3.4.0/
• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Overview of Hadoop Architecture
Big data, with its immense volume and varying data structures, has overwhelmed traditional networking
frameworks and tools. Using high-performance hardware and specialized servers can help, but they are
inflexible and come with a considerable price tag.
Hadoop manages to process and store vast amounts of data by using interconnected, affordable
commodity hardware: hundreds or even thousands of low-cost dedicated servers working together to
store and process data within a single ecosystem.
The Hadoop Distributed File System (HDFS), YARN, and MapReduce are at the heart of that ecosystem.
HDFS is a set of protocols used to store large data sets, while MapReduce efficiently processes the
incoming data.
Understanding the Layers of Hadoop Architecture
Separating the elements of distributed systems into functional layers helps streamline data
management and development. Developers can work on frameworks without negatively impacting
other processes in the broader ecosystem. Hadoop can be divided into four distinct layers.
1. Distributed Storage Layer
Each node in a Hadoop cluster has its own disk space, memory, bandwidth, and processing. The incoming data is split into
individual data blocks, which are then stored within the HDFS distributed storage layer. HDFS assumes that every disk drive
and slave node within the cluster is unreliable. As a precaution, HDFS stores three copies of each data set throughout the
cluster. The HDFS master node (NameNode) keeps the metadata for the individual data block and all its replicas.
2. Cluster Resource Management
Hadoop needs to coordinate nodes perfectly so that countless applications and users effectively share their resources. Initially,
MapReduce handled both resource management and data processing. YARN separates these two functions. As the de-facto
resource management tool for Hadoop, YARN is now able to allocate resources to different frameworks written for Hadoop.
These include projects such as Apache Pig, Hive, Giraph, Zookeeper, as well as MapReduce itself.
3. Processing Framework Layer
The processing layer consists of frameworks that analyze and process datasets coming into the cluster. The structured and
unstructured datasets are mapped, shuffled, sorted, merged, and reduced into smaller manageable data blocks. These
operations are spread across multiple nodes as close as possible to the servers where the data is located. Computation
frameworks such as Spark, Storm, and Tez now enable real-time processing, interactive query processing, and other
programming options that complement the MapReduce engine and utilize HDFS much more efficiently.

4. Application Programming Interface


The introduction of YARN in Hadoop 2 has led to the creation of new processing frameworks and APIs. Big data
continues to expand and the variety of tools needs to follow that growth. Projects that focus on search platforms,
data streaming, user-friendly interfaces, programming languages, messaging, failovers, and security are all an
integral part of a comprehensive Hadoop ecosystem.
Core Components of Hadoop Architecture
1. HDFS (Hadoop Distributed File System):
It is a data storage system. Since the data sets are huge, it uses a distributed system to store this data. Data is
stored in blocks, where each block is around 128 MB by default. HDFS consists of a NameNode and DataNodes;
there is a single active NameNode but multiple DataNodes.
Features:
I. The storage is distributed to handle a large data pool
II. Distribution increases data security
III. It is fault-tolerant: if a block is lost, its replicas on other nodes can take over.
One of the most critical components of Hadoop architecture is the Hadoop Distributed File System (HDFS).
HDFS is the primary storage system used by Hadoop applications. It’s designed to scale to petabytes of
data and runs on commodity hardware. What sets HDFS apart is its ability to maintain large data sets
across multiple nodes in a distributed computing environment.
HDFS operates on the basic principle of storing large files across multiple machines. It achieves high
throughput by dividing large data into smaller blocks, which are managed by different nodes in the
network. This nature of HDFS makes it an ideal choice for applications with large data sets.
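As a rough sketch of how an application interacts with HDFS (assuming a reachable cluster; the NameNode URI and file path below are placeholders), the Java FileSystem API can be used to write a file, with HDFS transparently splitting it into blocks and replicating them across DataNodes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "hdfs://namenode:9000" is a placeholder; use your cluster's NameNode URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/user/demo/sample.txt");
        // The client only sees a single file; block splitting (128 MB by default)
        // and replication are handled by HDFS behind this call.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```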
HDFS Explained
The Hadoop Distributed File System (HDFS) is fault-tolerant by design. Data is stored in individual data
blocks in three separate copies across multiple nodes and server racks. If a node or even an entire rack
fails, the impact on the broader system is negligible.
DataNodes process and store data blocks, while NameNodes manage the many DataNodes, maintain
data block metadata, and control client access.
NameNode:
Initially, data is broken into abstract data blocks. The file metadata for these blocks, which includes the file name, file
permissions, IDs, locations, and the number of replicas, is stored in the fsimage, in the NameNode's local memory.
Should a NameNode fail, HDFS would not be able to locate any of the data sets distributed throughout the
DataNodes. This makes the NameNode the single point of failure for the entire cluster. This vulnerability is resolved
by implementing a Secondary NameNode or a Standby NameNode.
Secondary NameNode
The Secondary NameNode served as the primary backup solution in early Hadoop versions. The Secondary
NameNode, every so often, downloads the current fsimage instance and edit logs from the NameNode and merges
them. The edited fsimage can then be retrieved and restored in the primary NameNode.
The failover is not an automated process as an administrator would need to recover the data from the Secondary
NameNode manually.
Standby NameNode
The High Availability feature was introduced in Hadoop 2.0 and subsequent versions to avoid any downtime in case of
NameNode failure. This feature allows you to maintain two NameNodes running on separate dedicated master
nodes.
The Standby NameNode provides automated failover in case the Active NameNode becomes unavailable. The Standby
NameNode additionally carries out the check-pointing process. Due to this property, the Secondary and Standby
NameNode are not compatible; a Hadoop cluster can maintain one or the other, but not both.
Zookeeper
Zookeeper is a lightweight tool that supports high availability and redundancy. A Standby NameNode maintains an active
session with the Zookeeper daemon. If an Active NameNode falters, the Zookeeper daemon detects the failure and carries
out the failover process to a new NameNode. Use Zookeeper to automate failovers and minimize the impact a NameNode
failure can have on the cluster.
DataNode
Each DataNode in a cluster uses a background process to store the individual blocks of data on slave servers.
By default, HDFS stores three copies of every data block on separate DataNodes. The NameNode uses a rack-aware
placement policy. This means that the DataNodes that contain the data block replicas cannot all be located on the same
server rack.
A DataNode communicates and accepts instructions from the NameNode roughly twenty times a minute. Also, it reports the
status and health of the data blocks located on that node once an hour. Based on the provided information, the NameNode
can request the DataNode to create additional replicas, remove them, or decrease the number of data blocks present on the
node.
Rack Aware Placement Policy
One of the main objectives of a distributed storage system like HDFS is to maintain high availability and replication.
Therefore, data blocks need to be distributed not only on different DataNodes but on nodes located on different server racks.
This ensures that the failure of an entire rack does not terminate all data replicas. The HDFS NameNode maintains a default
rack-aware replica placement policy:
•The first data block replica is placed on the same node as the client.
•The second replica is automatically placed on a random DataNode on a different rack.
•The third replica is placed in a separate DataNode on the same rack as the second replica.
•Any additional replicas are stored on random DataNodes throughout the cluster.
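As a hedged illustration of replication and placement (the file path and replication factor of 3 below are illustrative), the standard FileSystem API can report where a file's blocks actually landed and change its replication factor:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.util.Arrays;

public class ReplicaInspector {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/sample.txt");   // illustrative path

        // Ask HDFS to keep 3 replicas of this file's blocks (the default policy).
        fs.setReplication(file, (short) 3);

        // List the DataNodes that currently host each block of the file.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " hosted on " + Arrays.toString(block.getHosts()));
        }
    }
}
```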
2. YARN (Yet Another Resource Negotiator):
It is the resource management unit of the Hadoop framework. The data which is stored can
be processed with the help of YARN using data processing engines such as interactive processing.
It can be used to support any sort of data analysis.
Features:
I. It acts as an operating system for the cluster, managing resources for the data stored on HDFS
II. It helps schedule tasks to avoid overloading any system

Yet Another Resource Negotiator (YARN) is responsible for managing resources in the cluster
and scheduling tasks for users. It is a key element in Hadoop architecture as it allows
multiple data processing engines such as interactive processing, graph processing, and batch
processing to handle data stored in HDFS.
YARN separates the functionalities of resource management and job scheduling into
separate daemons*. This design ensures a more scalable and flexible Hadoop architecture,
accommodating a broader array of processing approaches and a wider array of applications.

*A daemon is a program that runs in the background of a computer, performing tasks without direct
user interaction. Daemons are often found in Unix and Unix-like operating systems, such as Linux.
YARN Explained
YARN (Yet Another Resource Negotiator) is the default cluster management resource for Hadoop 2 and Hadoop 3. In
previous Hadoop versions, MapReduce used to conduct both data processing and resource allocation. Over time the
necessity to split processing and resource management led to the development of YARN.
YARN’s resource allocation role places it between the storage layer, represented by HDFS, and the MapReduce processing
engine. YARN also provides a generic interface that allows you to implement new processing engines for various data types.
ResourceManager
The ResourceManager (RM) daemon controls all the processing resources in a Hadoop cluster. Its primary
purpose is to designate resources to individual applications located on the slave nodes. It maintains a
global overview of the ongoing and planned processes, handles resource requests, and schedules and
assigns resources accordingly. The ResourceManager is vital to the Hadoop framework and should run on
a dedicated master node.
The RM's sole focus is on scheduling workloads. Unlike MapReduce, it has no interest in failovers or
individual processing tasks. This separation of tasks in YARN is what makes Hadoop inherently scalable
and turns it into a fully developed computing platform.

NodeManager
Each slave node has a NodeManager processing service and a DataNode storage service. Together they
form the backbone of a Hadoop distributed system.
The DataNode, as mentioned previously, is an element of HDFS and is controlled by the NameNode. The
NodeManager, in a similar fashion, acts as a slave to the ResourceManager. The primary function of the
NodeManager daemon is to track processing-resources data on its slave node and send regular reports to
the ResourceManager.
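A minimal, hedged sketch of this relationship: the YarnClient API (assuming a client machine with a valid yarn-site.xml on the classpath) asks the ResourceManager for the node reports it has aggregated from its NodeManagers:

```java
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListNodeManagers {
    public static void main(String[] args) throws Exception {
        // Picks up the ResourceManager address from yarn-site.xml on the classpath.
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // The ResourceManager aggregates the reports each NodeManager sends it.
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId()
                    + "  capacity=" + node.getCapability()
                    + "  used=" + node.getUsed());
        }
        yarn.stop();
    }
}
```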
Containers
Processing resources in a Hadoop cluster are always deployed in containers. A container has memory, system files, and
processing space.
A container deployment is generic and can run any requested custom resource on any system. If a requested amount of cluster
resources is within the limits of what’s acceptable, the RM approves and schedules that container to be deployed.
The container processes on a slave node are initially provisioned, monitored, and tracked by the NodeManager on that specific
slave node.

Application Master
Every container on a slave node has its dedicated Application Master. Application Masters are deployed in a container as well.
Even MapReduce has an Application Master that executes map and reduce tasks.
As long as it is active, an Application Master sends messages to the Resource Manager about its current status and the state
of the application it monitors. Based on the provided information, the Resource Manager schedules additional resources or
assigns them elsewhere in the cluster if they are no longer needed.
The Application Master oversees the full lifecycle of an application, all the way from requesting the needed containers from the
RM to submitting container lease requests to the NodeManager.
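The following is a heavily simplified, hedged sketch of the calls an Application Master makes: register with the RM, request a container, and hand the granted container to a NodeManager to launch. A real Application Master runs inside a container started by the RM and needs proper error handling; the container size and command below are purely illustrative.

```java
import java.util.Collections;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TinyApplicationMaster {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();

        // Channel to the ResourceManager: register, request containers, heartbeat.
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(conf);
        rm.start();
        rm.registerApplicationMaster("", 0, "");

        // Channel to NodeManagers: actually launch the granted containers.
        NMClient nm = NMClient.createNMClient();
        nm.init(conf);
        nm.start();

        // Ask the RM for one 1 GB / 1 vcore container (illustrative sizes).
        rm.addContainerRequest(new ContainerRequest(
                Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));

        boolean launched = false;
        while (!launched) {
            AllocateResponse response = rm.allocate(0.1f);   // heartbeat with progress
            for (Container container : response.getAllocatedContainers()) {
                // The lease granted by the RM is handed to the NodeManager
                // together with the command the container should run.
                ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
                        Collections.emptyMap(), Collections.emptyMap(),
                        Collections.singletonList("echo hello-from-yarn"),
                        Collections.emptyMap(), null, Collections.emptyMap());
                nm.startContainer(container, ctx);
                launched = true;
            }
            Thread.sleep(1000);
        }

        rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
    }
}
```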

JobHistory Server
The JobHistory Server allows users to retrieve information about applications that have completed their activity. The REST API
provides interoperability and can dynamically inform users on current and completed jobs served by the server in question.
How Does YARN Work?
A basic workflow for deployment in YARN starts when a client application submits a request to the ResourceManager.
1. The ResourceManager instructs a NodeManager to start an Application Master for this request, which is then started in a
container.
2. The newly created Application Master registers itself with the RM. The Application Master proceeds to contact the HDFS
NameNode, determine the location of the needed data blocks, and calculate the number of map and reduce tasks needed to
process the data.
3. The Application Master then requests the needed resources from the RM and continues to communicate the resource
requirements throughout the life-cycle of the container.
4. The RM schedules the resources along with the requests from all the other Application Masters and queues their requests.
As resources become available, the RM makes them available to the Application Master on a specific slave node.
5. The Application Master contacts the NodeManager for that slave node and requests it to create a container by providing
variables, authentication tokens, and the command string for the process. Based on that request, the NodeManager creates
and starts the container.

6. The Application Master then monitors the process and reacts in the event of failure by restarting the process on the next
available slot. If it fails after four different attempts, the entire job fails. Throughout this process, the Application Master
responds to client status requests.
Once all tasks are completed, the Application Master sends the result to the client application, informs the RM that the
application has completed its task, deregisters itself from the ResourceManager, and shuts itself down.
The RM can also instruct the NodeManager to terminate a specific container during the process in case of a processing priority
change.
3. MapReduce:
The MapReduce framework is the processing unit. All data is distributed and processed in
parallel. A MasterNode distributes data amongst the SlaveNodes; the SlaveNodes do the
processing and send the results back to the MasterNode.
Features:
I. Consists of two phases, the Map phase and the Reduce phase.
II. Processes big data faster, with multiple nodes working in parallel on the same job
MapReduce Programming Model
MapReduce is a programming model integral to Hadoop architecture. It is designed to process
large volumes of data in parallel by dividing the work into a set of independent tasks. The
MapReduce model simplifies the processing of vast data sets, making it an indispensable part
of Hadoop.
MapReduce is characterized by two primary tasks, Map and Reduce. The Map task takes a set
of data and converts it into another set of data, where individual elements are broken down
into tuples. On the other hand, the Reduce task takes the output from the Map as input and
combines those tuples into a smaller set of tuples.
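The canonical illustration of this model is word counting. The sketch below is a minimal version of that classic example (the class and field names are our own): the Map task turns each input line into (word, 1) tuples, and the Reduce task combines the tuples for each word into a total.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map task: break each input line into (word, 1) tuples.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce task: combine all tuples for the same word into one total.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
```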
MapReduce Explanation
MapReduce is a programming algorithm that processes data dispersed across the Hadoop cluster. As with any
process in Hadoop, once a MapReduce job starts, the ResourceManager requisitions an Application Master to
manage and monitor the MapReduce job lifecycle.
The Application Master locates the required data blocks based on the information stored on the NameNode. The
AM also informs the Resource Manager to start a MapReduce job on the same node the data blocks are located
on. Whenever possible, data is processed locally on the slave nodes to reduce bandwidth usage and improve
cluster efficiency.
The input data is mapped, shuffled, and then reduced to an aggregate result. The output of the MapReduce job is
stored and replicated in HDFS.

The Hadoop servers that perform the mapping and reducing tasks are often referred to as Mappers and
Reducers.
The Resource Manager decides how many mappers to use. This decision depends on the size of the processed
data and the memory block available on each mapper server.
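For completeness, a hedged sketch of a driver that submits such a job follows (it assumes the hypothetical WordCount classes from the previous sketch and takes HDFS input and output paths as arguments); submitting it triggers the ResourceManager/Application Master sequence described above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // args[0] = HDFS input directory, args[1] = HDFS output directory
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);   // local pre-aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // waitForCompletion submits the job to YARN and blocks until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

It would typically be packaged into a jar and launched with a command along the lines of hadoop jar wordcount.jar WordCountDriver /input /output (paths illustrative).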
4. Hadoop Common
Hadoop Common, often referred to as the ‘glue’ that holds Hadoop architecture together,
contains libraries and utilities needed by other Hadoop modules. It provides the necessary
Java files and scripts required to start Hadoop. This component plays a crucial role in ensuring
that the hardware failures are managed by the Hadoop framework itself, offering a high
degree of resilience and reliability.
Difference Between RDBMS and Hadoop
RDBMS (Relational Database Management System): RDBMS is an information management
system based on a data model. In RDBMS, tables are used for information storage.
Each row of a table represents a record and each column represents an attribute of the data.
The organization of data and its manipulation processes distinguish RDBMS from other
databases. RDBMS ensures the ACID (atomicity, consistency, isolation, durability) properties
required for designing a database. The purpose of RDBMS is to store, manage, and retrieve
data as quickly and reliably as possible.
Hadoop: It is an open-source software framework used for storing data and running
applications on clusters of commodity hardware. It provides large storage capacity and high
processing power and can manage many concurrent processes. It is used in
predictive analysis, data mining, and machine learning. It can handle both structured and
unstructured forms of data and is more flexible in storing, processing, and managing data than
a traditional RDBMS. Unlike traditional systems, Hadoop enables multiple analytical processes on
the same data at the same time, and it scales very flexibly.
Below is a table of differences between RDBMS and Hadoop:
S.No. | RDBMS | Hadoop
1. | Traditional row-column based databases, basically used for data storage, manipulation and retrieval. | An open-source software framework used for storing data and running applications or processes concurrently.
2. | Mostly processes structured data. | Processes both structured and unstructured data.
3. | Best suited for OLTP environments. | Best suited for big data.
4. | Less scalable than Hadoop. | Highly scalable.
5. | Data normalization is required. | Data normalization is not required.
6. | Stores transformed and aggregated data. | Stores huge volumes of data.
7. | Very low latency in response. | Some latency in response.
8. | The data schema is static. | The data schema is dynamic.
9. | High data integrity. | Lower data integrity than RDBMS.
10. | Cost applies for licensed software. | Free of cost, as it is open-source software.
Hadoop History
Hadoop, developed by Doug Cutting and Mike Cafarella in 2005, was inspired by Google’s
technologies for handling large datasets. Initially created to improve Yahoo’s indexing capabilities, it
consists of the Hadoop Distributed File System (HDFS) and the MapReduce programming model.
HDFS enables the storage of data across thousands of servers, while MapReduce processes this data
in parallel, significantly improving efficiency and scalability. Released as an open-source project
under the Apache Foundation in 2006, Hadoop quickly became a fundamental tool for companies
needing to store and analyze vast amounts of unstructured data, thereby playing a pivotal role in
the emergence and growth of the big data industry.
Hadoop is an open-source framework for big data analytics with several key strengths:

Scalability : Hadoop's distributed environment allows more servers to be added easily to handle
increased data or processing workloads. Its architecture extends horizontally, handling massive volumes
of data simply by adding more commodity hardware to the cluster.
Cost-effectiveness : Hadoop offers a cost-effective storage solution for businesses. Hadoop makes use of
less costly commodity hardware. Because of its affordability, Hadoop is a popular choice for companies
wishing to manage and store massive volumes of data without going over budget.
Fault tolerance: Data stored in a node is replicated in other cluster nodes, ensuring fault tolerance. Data is
replicated among several cluster nodes via the Hadoop Distributed File System (HDFS). This redundancy
increases system resilience even if one node fails, lowering the risk of data loss.
Data processing : Hadoop can process all types of data, including structured, semi-structured, and
unstructured.
Data locality : Hadoop's architecture considers data locality to improve the efficiency of data processing.
Real-time analytics : Hadoop can support real-time analytics applications to help drive better operational
decision-making.
Batch workloads: Hadoop can support batch workloads for historical analysis.
Security : Hadoop's security framework revolves around five pillars: administration, authentication/
perimeter security, authorization, audit and data protection
Challenges of Hadoop:
Hadoop, despite its robust capabilities in handling big data, faces several challenges:

1. Complexity in Management: Managing a Hadoop cluster is complex. It requires expertise in cluster
configuration, maintenance, and optimization. The setup and maintenance of Hadoop can be
resource-intensive and requires a deep understanding of the underlying architecture.
2. Performance Limitations: While efficient for batch processing, Hadoop is not optimized for real-
time processing. The latency in Hadoop’s MapReduce can be a significant drawback for applications
requiring real-time data analysis.
3. Security Concerns: By default, Hadoop does not include robust security measures. It lacks
encryption at storage and network levels, making sensitive data vulnerable. Adding security features
often involves integrating additional tools, which can complicate the system further.
4. Scalability Issues: Although Hadoop is designed to scale up easily, adding nodes to a cluster does
not always lead to linear improvements in performance. The management overhead and network
congestion can diminish the benefits of scaling.
Challenges of Hadoop contd…
5. Resource Management: Hadoop's resource management, originally handled by the MapReduce
framework, is often inefficient. This has led to the development of alternatives like YARN (Yet
Another Resource Negotiator), which improves resource management but also adds to the
complexity.
6. High Costs of Skilled Personnel: The demand for professionals skilled in Hadoop is high, and so is
their cost. Finding and retaining personnel with the necessary expertise can be challenging and
expensive.
7. Data Replication Overhead: HDFS's default method of ensuring data reliability through replication
consumes a lot of storage space, which can become inefficient and costly as data volumes grow.
Key Features of Hadoop
Several Hadoop features simplify the handling of large amounts of data. Let us explore the
range of fundamental features of Hadoop that make it preferred by professionals in big data
and various industries. They are as follows –
1. Open-Source Framework
Hadoop is a free and open-source framework. That means the source code is freely available
online to everyone. This code can be customized to meet the needs of businesses.
2. Cost-Effectiveness
Hadoop is a cost-effective model because it makes use of open-source software and low-cost
commodity hardware. Numerous nodes make up the Hadoop cluster. These nodes are a
collection of commodity hardware (servers or physical workstations). They are also reasonably
priced and offer a practical way to store and process large amounts of data.
3. High-Level Scalability
The Hadoop cluster is both horizontally and vertically scalable. Scalability refers to –
(i)Horizontal Scalability: This means the addition of any number of nodes in the cluster.
(ii) Vertical Scalability: This means an increase in the hardware capacity (data storage
capacity) of the nodes.
This high-level and flexible scalability of Hadoop offers powerful processing capabilities.
Key Features of Hadoop. Continuation…
4. Fault Tolerance
Hadoop uses replication to store copies of each data block on different machines (nodes). This
replication mechanism makes Hadoop a fault-tolerant framework: if any of the machines fail or crash,
a replica copy of the same data can be accessed on other machines. With the more recent Hadoop 3
version, the fault tolerance feature has been enhanced further; it offers a mechanism called
'Erasure Coding' that provides fault tolerance while using less space, with a storage overhead of no
more than 50 percent.
5. High-Availability of Data
The high availability of data and fault tolerance features are complementary to each other. The
fault tolerance feature keeps data available at all times even if a DataNode or the NameNode fails
or crashes, because a copy of the same data is available on other nodes in the cluster.
6. Data-Reliability
Data reliability in Hadoop comes from the –
(i) Replication mechanism in HDFS (Hadoop Distributed File System), which creates a
duplicate of each block so that HDFS reliably stores data on the nodes.
(ii) Block Scanner, Volume Scanner, Disk Checker, and Directory Scanner, which are built-in
mechanisms provided by the framework itself to ensure data reliability.
7. Faster Data Processing
Hadoop overcomes the challenges of traditional data processing, which was slow and sluggish
because enough resources were not available to process data smoothly and quickly. The distributed
storage across the Hadoop cluster of nodes enables much faster processing of huge amounts of data.
8. Data Locality
This is one of the most distinctive features of Hadoop. The MapReduce component has a feature
called data locality. This attribute helps by placing calculation logic close to where the actual data is
located on the node. As a result, network congestion is reduced and system performance is improved
overall.
9. Possibility of Processing All Types of Data
The Hadoop framework can process all types of data – structured, semi-structured, and unstructured –
including databases, images, videos, audio, graphics, etc. It makes it possible for users to work with any
kind of data for analysis and interpretation, irrespective of its format or size. In the Hadoop ecosystem,
various tools like Apache Hive, Pig, Sqoop, Zookeeper, GraphX, etc., can process the data.
10. Easy Operability
Hadoop is an open-source software framework written in Java. To work with it, users must be familiar
with programming languages such as SQL and Java. The framework handles the complete data
processing and storage dissemination process, which makes it easy to operate.
13 Big Limitations of Hadoop for Big Data Analytics
We will discuss various limitations of Hadoop in this section along with their solution
1. Issue with Small Files
Hadoop is not suited to small data. The Hadoop Distributed File System (HDFS) lacks the ability to
efficiently support random reading of small files because of its high-capacity design.

Small files are a major problem in HDFS. A small file is significantly smaller than the HDFS block size
(default 128 MB). HDFS cannot handle a huge number of such files, because it is designed to work
with a small number of large files storing large data sets rather than a large number of small files.
If there are too many small files, the NameNode gets overloaded, since it stores the namespace of HDFS.

Solution: The solution to this drawback of Hadoop is simple: merge the small files to create bigger
files and then copy the bigger files to HDFS, as sketched below.
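A minimal sketch of that merge, assuming line-oriented text files and illustrative directory paths, using the HDFS FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SmallFileMerger {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path smallFilesDir = new Path("/user/demo/small-files");      // illustrative
        Path mergedFile = new Path("/user/demo/merged/big-file.txt"); // illustrative

        try (FSDataOutputStream out = fs.create(mergedFile, true)) {
            // Append every small file in the directory into one large HDFS file,
            // so the NameNode tracks one set of blocks instead of thousands of files.
            for (FileStatus status : fs.listStatus(smallFilesDir)) {
                if (status.isFile()) {
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        IOUtils.copyBytes(in, out, 4096, false);
                    }
                }
            }
        }
    }
}
```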
2. Slow Processing Speed
In Hadoop, MapReduce processes large data sets with a parallel and distributed algorithm. There
are two tasks to perform, Map and Reduce, and MapReduce requires a lot of time to perform these
tasks, thereby increasing latency. Data is distributed and processed over the cluster in MapReduce,
which increases the time and reduces processing speed.

Solution: As a solution to this limitation of Hadoop, Spark overcomes the issue with in-memory
processing of data. In-memory processing is faster as no time is spent moving data and processes
in and out of the disk. Spark is up to 100 times faster than MapReduce as it processes everything in
memory. We can also use Flink, which processes data faster than Spark because of its streaming
architecture; Flink can be instructed to process only the parts of the data that have actually
changed, which significantly increases the performance of the job.
3. Support for Batch Processing only
Hadoop supports batch processing only; it does not process streamed data, and hence
overall performance is slower. The MapReduce framework of Hadoop does not leverage the
memory of the Hadoop cluster to the maximum.
Solution-
To address this limitation of Hadoop, Spark can be used to improve performance,
but Spark's stream processing is not as efficient as Flink's, since it uses micro-batch
processing. Flink improves the overall performance as it provides a single run-time for
streaming as well as batch processing. Flink also uses native closed-loop iteration operators,
which make machine learning and graph processing faster.

4. No Real-time Data Processing


Apache Hadoop is designed for batch processing: it takes a huge amount of data as input,
processes it, and produces the result. Although batch processing is very efficient for processing a high
volume of data, the output can be delayed significantly depending on the size of the data being
processed and the computational power of the system. Hadoop is not suitable for real-time data processing.

Solution:
(i) Apache Spark supports stream processing. Stream processing involves continuous input and
output of data. It emphasizes the velocity of the data, and data is processed within a small
period of time. Learn more about the Spark Streaming APIs.
(ii) Apache Flink provides a single run-time for streaming as well as batch processing, so one
common run-time is utilized for data streaming applications and batch processing applications.
Flink is a stream processing system that is able to process row after row in real time.
5. No Delta Iteration
Hadoop is not so efficient for iterative processing, as it does not support cyclic data
flow (i.e. a chain of stages in which the output of each stage is the input to the next
stage).
Solution: We can use Apache Spark to overcome this limitation of Hadoop, as it
accesses data from RAM instead of disk, which dramatically improves the performance of
iterative algorithms that access the same dataset repeatedly. Spark iterates over its data in
batches; for iterative processing in Spark, each iteration is scheduled and executed separately.
6. Latency
In Hadoop, the MapReduce framework is comparatively slow, since it is designed to support many
different formats and structures and huge volumes of data. In MapReduce, Map takes a set of data and
converts it into another set of data, where individual elements are broken down into key-value pairs,
and Reduce takes the output from the Map as input and processes it further. MapReduce requires a lot
of time to perform these tasks, thereby increasing latency.
Solution:
Spark is used to reduce this limitation of Hadoop. Apache Spark is yet another batch system, but it is
relatively faster since it caches much of the input data in memory using RDDs (Resilient Distributed
Datasets) and keeps intermediate data in memory itself. Flink's data streaming achieves low latency and
high throughput.
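To illustrate the in-memory caching Spark relies on, here is a small hedged sketch in Spark's Java API (the file path is illustrative): the RDD is cached, so the second pass over the same data is served from executor memory rather than re-read from disk.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CachedCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cached-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Read once from HDFS and keep the partitions in executor memory.
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/big-file.txt").cache();

            long total = lines.count();                                    // first pass reads from disk
            long errors = lines.filter(l -> l.contains("ERROR")).count();  // second pass served from memory

            System.out.println("lines=" + total + " errors=" + errors);
        }
    }
}
```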
7. Not Easy to Use
In Hadoop, MapReduce developers need to hand code each and every operation, which
makes it very difficult to work with. MapReduce has no interactive mode, but tools such
as Hive and Pig make working with MapReduce a little easier for adopters.
Solution: To solve this drawback of Hadoop, we can use Spark. Spark has an interactive
mode, so that developers and users alike can get intermediate feedback for queries and other
activities. Spark is easy to program as it has tons of high-level operators. We can also easily use
Flink, as it too has high-level operators. In this way, Spark can address many limitations of Hadoop.
8. Security
Managing a complex application such as Hadoop is challenging. If the person managing the platform
does not know how to secure it properly, the data can be at huge risk. At the storage and network
levels, Hadoop is missing encryption, which is a major point of concern. Hadoop supports Kerberos
authentication, which is hard to manage.
HDFS supports access control lists (ACLs) and a traditional file permissions model. However, third-
party vendors have enabled organizations to leverage Active Directory Kerberos and LDAP for
authentication.
Solution: Spark provides a security bonus to overcome these limitations of Hadoop. If we run the
spark in HDFS, it can use HDFS ACLs and file-level permissions. Additionally, Spark can run on YARN
giving it the capability of using Kerberos authentication.
9. No Abstraction
Hadoop does not provide any type of abstraction, so MapReduce developers need to hand code
each and every operation, which makes it very difficult to work with.
Solution-
To overcome these drawbacks of Hadoop, Spark is used in which we have RDD abstraction
for the batch. Flink has Dataset abstraction.
10. Vulnerable by Nature
Hadoop is written entirely in Java, one of the most widely used languages. Java has been
heavily exploited by cybercriminals and, as a result, has been implicated in numerous security
breaches.
11. No Caching
Hadoop is not efficient for caching. In Hadoop, MapReduce cannot cache the intermediate data in
memory for further use, which diminishes the performance of Hadoop.
Solution: Spark and Flink can overcome this limitation of Hadoop, as Spark and Flink cache data in
memory for further iterations which enhance the overall performance.
12. Lengthy Line of Code
Hadoop has about 120,000 lines of code; more lines of code mean more potential bugs and
more time to execute the program.
Solution: Although Spark and Flink are written in Scala and Java, the core implementation is
in Scala, so the number of lines of code is smaller than in Hadoop. Programs therefore take less
time to execute, which addresses the lengthy-line-of-code limitation of Hadoop.

13. Uncertainty
Hadoop only ensures that the data job is complete, but it is unable to guarantee when the job will
be completed.
Hadoop

Hadoop is a data handling framework written primarily in Java, with some secondary
code in C++. It uses a simple programming model and is able to deal with large
datasets

This framework uses distributed storage and parallel processing to store and manage
big data. It is one of the most widely used pieces of big data software

Hadoop consists mainly of three components: Hadoop HDFS, Hadoop MapReduce, and
Hadoop YARN. These components come together to handle big data effectively. These
components are also known as Hadoop modules.
