Apache Hadoop
Historical Review of Big Data
Apache Hadoop
• Apache Hadoop is an open-source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware.
• Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
• Doug Cutting, who was working at Yahoo! at the time, named the project after his son's toy elephant.
• Cutting's son was 2 years old at the time and just beginning to talk. He called his beloved stuffed yellow elephant "Hadoop".
The Apache Hadoop framework modules
• Hadoop Common: contains libraries and utilities needed by other
Hadoop modules
• Hadoop Distributed File System (HDFS): a distributed file-system that
stores data on the commodity machines, providing very high
aggregate bandwidth across the cluster
• Hadoop YARN: a resource-management platform responsible for
managing compute resources in clusters and using them for
scheduling users' applications
• Hadoop MapReduce: a programming model for large-scale data
processing
Hadoop use by major companies
• In 2010, Facebook claimed to have one of the largest HDFS clusters, storing 21 petabytes of data.
• In 2012, Facebook declared that it had the largest single HDFS cluster, with more than 100 PB of data.
• Yahoo! has more than 100,000 CPUs in over 40,000 servers running Hadoop, with its biggest Hadoop cluster running 4,500 nodes. All told, Yahoo! stores 455 petabytes of data in HDFS.
• In fact, by 2013, most of the big names in the Fortune 50 started
using Hadoop.
Facebook and Yahoo! have since shifted from heavy reliance on Hadoop toward more cloud-native solutions. While Yahoo! still uses Hadoop extensively, both companies are adopting modern tools like Apache Spark and cloud storage systems to handle big data more efficiently. The overall industry trend is moving away from traditional Hadoop clusters in favor of more scalable and cost-effective cloud-based alternatives.
• Spotify (2023): Spotify continues to use Hadoop to process large datasets, particularly for its
recommendation systems. The scalability of Hadoop allows Spotify to manage and analyze vast amounts
of user data effectively.
• Airbnb (2022): Airbnb leverages Hadoop for its data infrastructure, focusing on storage and batch
processing. This enables them to perform extensive data analytics, helping in service optimization and
decision-making processes.
• LinkedIn (2021): LinkedIn remains invested in Hadoop for its data lake architecture. Hadoop's
capabilities in handling large-scale data processing are essential for LinkedIn's algorithms, especially in
delivering personalized content and recommendations.
• Twitter (2020): Twitter continues to rely on Hadoop for managing and processing the massive volumes
of data generated by its platform. Hadoop is integral to their analytics and machine learning
frameworks, helping to enhance user experience.
Let’s start
• Hadoop has two fundamental units: storage and processing. The storage part of Hadoop is HDFS, the Hadoop Distributed File System. So, let's start with HDFS.
• What is HDFS?
• Advantages of HDFS
• Features of HDFS
What is a Distributed File System?
• A distributed file system manages data, i.e. files and folders, across multiple computers or servers.
• A DFS is a file system that allows us
• to store data over multiple nodes or machines in a cluster and allows multiple users to access that data.
• So basically, it serves the same purpose as the file system available on your own machine: NTFS (New Technology File System) on Windows, or HFS (Hierarchical File System) on a Mac.
• The only difference is that, in the case of a distributed file system,
• you store data on multiple machines rather than a single machine.
• Even though the files are stored across the network,
• the DFS organizes and presents the data in such a way that a user sitting at one machine feels as if all the data were stored on that very machine.
HDFS versus NFS

Network File System (NFS)
• A single machine makes part of its file system available to other machines
• Sequential or random access
• PRO: simplicity, generality, transparency
• CON: storage capacity and throughput limited by a single server

Hadoop Distributed File System (HDFS)
• A single virtual file system spread over many machines
• Optimized for sequential reads and local accesses
• PRO: high throughput, high capacity
• CON: specialized for particular types of applications
Next topic :
Comparing Hadoop and RDBMS
1. RDBMS: traditional row-column based databases, basically used for data storage, manipulation and retrieval. Hadoop: an open-source software framework used for storing data and running applications or processes concurrently.
2. RDBMS: mostly processes structured data. Hadoop: processes both structured and unstructured data.
3. RDBMS: best suited for an OLTP environment. Hadoop: best suited for big data.
4. RDBMS: less scalable than Hadoop. Hadoop: highly scalable.
5. RDBMS: data normalization is required. Hadoop: data normalization is not required.
6. RDBMS: stores transformed and aggregated data. Hadoop: stores huge volumes of data.
7. RDBMS: responses have low latency. Hadoop: responses have some latency.
8. RDBMS: the data schema is static. Hadoop: the data schema is dynamic.
9. RDBMS: high data integrity. Hadoop: lower data integrity than RDBMS.
10. RDBMS: licensing costs apply. Hadoop: free of cost, as it is open-source software.
In transaction processing, ACID (atomicity, consistency, isolation, and durability) is an acronym and mnemonic device used to refer to the four essential properties a transaction should possess to ensure the integrity and reliability of the data.
Hadoop Distributed File System
• HDFS is a filesystem designed for storing very large files with
streaming data access patterns, running on clusters of commodity
hardware
• HDFS is not a good fit for the following:
• Low-latency data access
• Lots of small files (see the rough estimate after this list)
• Multiple writers and arbitrary file modifications
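To see why lots of small files are a problem, here is a rough back-of-the-envelope estimate. It assumes the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per file-system object (file, directory, or block); the exact figure varies by version and is only an approximation, since the NameNode keeps all metadata in memory.

// Rough estimate of NameNode memory consumed by many small files.
// Assumption (rule of thumb, not an exact figure): ~150 bytes of
// NameNode heap per file-system object (file, directory, or block).
public class SmallFilesEstimate {
    public static void main(String[] args) {
        long bytesPerObject = 150;        // approximate rule of thumb
        long files = 10_000_000L;         // ten million small files
        long objects = files * 2;         // roughly one file entry + one block each
        long heapMb = objects * bytesPerObject / (1024 * 1024);
        System.out.println("~" + heapMb + " MB of NameNode heap for metadata alone");
        // prints roughly 2861 MB, i.e. close to 3 GB, no matter how small the files are
    }
}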
HDFS concepts we are going to study
• Block size
• HDFS block size
• Difference between HDFS and other file system
• Why is a block size in HDFS so large?
Main HDFS services
1. Name Node
2. Secondary Name node
3. Job Tracker
4. Data Node
5. Task Tracker
[Figure: a client asks the NameNode to store File.txt (440 MB). With a 128 MB block size, the file is divided into four input splits: a.txt (128 MB), b.txt (128 MB), c.txt (128 MB) and d.txt (56 MB). The NameNode acknowledges the request and assigns the splits to DataNodes out of the eight in the cluster: a.txt to DataNode 1, b.txt to DataNode 3, c.txt to DataNode 5 and d.txt to DataNode 7.]
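As a small worked example of the figure above (purely illustrative; the splitting is done by HDFS itself, not by client code like this), here is how the 440 MB file maps onto 128 MB blocks:

// Worked example: splitting a 440 MB file into 128 MB HDFS blocks.
public class BlockSplitSketch {
    public static void main(String[] args) {
        long fileSizeMb = 440;                        // size of File.txt in the figure
        long blockSizeMb = 128;                       // default HDFS block size
        long fullBlocks = fileSizeMb / blockSizeMb;   // 3 full blocks
        long lastBlockMb = fileSizeMb % blockSizeMb;  // 56 MB left over
        System.out.println(fullBlocks + " blocks of " + blockSizeMb + " MB + 1 block of " + lastBlockMb + " MB");
        // prints: 3 blocks of 128 MB + 1 block of 56 MB
    }
}

Note that, unlike many conventional file systems, the final 56 MB block occupies only 56 MB on disk, not a full 128 MB.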
HDFS Architecture (may be skipped)
NameNode : summary
• NameNode is the master node in the Apache Hadoop HDFS
Architecture that maintains and manages the blocks present on the
DataNodes (slave nodes).
• The NameNode is a highly available server that manages the File System Namespace and controls access to files by clients.
Functions of the NameNode: how it achieves them
• It is the master daemon that maintains and manages the DataNodes (slave nodes).
• It records the metadata of all the files stored in the cluster, e.g.
The location of blocks stored, the size of the files, permissions,
hierarchy, etc. There are two files associated with the metadata:
• FsImage (file system image): It contains the file system namespace's complete
state since the NameNode's start.
• EditLogs: It contains all the recent modifications made to the file system with
respect to the most recent FsImage.
• It records each change that takes place to the file system metadata. For
example, if a file is deleted in HDFS, the NameNode will immediately
record this in the EditLog.
• The EditLog records every change to the HDFS metadata (such as file creation, deletion, and block replication) in chronological order.
• Purpose: it ensures that all changes are captured and can be replayed to reconstruct the current state of the file system after a restart or crash. Example edit-log entries:
txn_id=1001, timestamp=2024-08-01 10:00:00, operation=MKDIR, path=/user/data
txn_id=1002, timestamp=2024-08-01 10:05:00, operation=CREATE, path=/user/data/file1.txt, replication=3,
blocksize=128MB
txn_id=1003, timestamp=2024-08-01 10:06:00, operation=RENAME, src=/user/data/file1.txt, dst=/user/data/file1_renamed.txt
txn_id=1004, timestamp=2024-08-01 10:07:00, operation=DELETE, path=/user/data/file1_renamed.txt
txn_id=1005, timestamp=2024-08-01 10:08:00, operation=SET_REPLICATION, path=/user/data/file2.txt, replication=2
• FSImage is a snapshot of the entire file system's metadata at a specific point in time. It represents the state of the file system when it was last saved.
• Purpose: it allows the NameNode to load the current state of the file system quickly during startup, without replaying all the transactions from the EditLog.
Functions of the NameNode: how it achieves them (continued)
• It regularly receives a Heartbeat and a block report from all the
DataNodes in the cluster to ensure that the DataNodes are live.
• It keeps a record of all the blocks in HDFS and in which nodes these
blocks are located.
• The NameNode is also responsible for maintaining the replication factor of all the blocks, which we will discuss in detail later.
• In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage and manages the communication traffic to the DataNodes.
Data Nodes
• DataNodes are the slave nodes in HDFS. Unlike the NameNode, DataNodes run on commodity hardware, that is, inexpensive systems that are not of high quality or high availability.
Functions of Data Node
• These are slave daemons or processes which run on each slave machine.
• The actual data is stored on DataNodes.
• The DataNodes perform the low-level read and write requests from
the file system’s clients.
• They send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds.
Secondary NameNode
• Apart from these two daemons, there is a third daemon or a process
called Secondary NameNode. The Secondary NameNode works
concurrently with the primary NameNode as a helper daemon.
• And don’t be confused about the Secondary NameNode being
a backup NameNode because it is not.
Functions of Secondary
NameNode
• The Secondary NameNode constantly reads the file system state and metadata from the NameNode's RAM and writes it to the hard disk or the file system.
• It is responsible for combining the EditLogs with the FsImage from the NameNode.
• It downloads the EditLogs from the NameNode at regular intervals
and applies to FsImage. The new FsImage is copied back to the
NameNode, which is used whenever the NameNode is started the
next time.
1. Checkpointing:
• The primary function of the Secondary NameNode is to periodically merge the Edit Log and the FSImage (File
System Image) stored by the NameNode. This process is known as checkpointing.
• The Edit Log is a file where the NameNode records every change made to the HDFS, while the FSImage is a
snapshot of the filesystem's metadata at a certain point in time.
• Over time, the Edit Log can grow very large, making the recovery process slow and cumbersome because the
NameNode would need to replay all the transactions from a potentially large Edit Log to rebuild the filesystem
state.
• The Secondary NameNode periodically downloads the current FSImage and Edit Log from the NameNode, merges
them into a new FSImage, and then uploads this new FSImage back to the NameNode. The Edit Log is then
truncated to start fresh.
2. Reducing NameNode Load:
• By performing the checkpointing task, the Secondary NameNode helps reduce the workload on the NameNode. The
NameNode does not have to continuously merge the Edit Log and FSImage, as this is handled by the Secondary
NameNode at intervals.
3. Assisting in Recovery:
• In the event of a NameNode failure, the Secondary NameNode’s latest checkpointed FSImage and a much smaller
Edit Log can be used to restart the NameNode more quickly. However, it is important to note that the Secondary
NameNode itself does not automatically take over in case of a NameNode failure; it simply provides the latest
metadata to assist in recovery.
• Note:
• The NameNode collects block reports from the DataNodes periodically to maintain the replication factor.
• Therefore, whenever a block is over-replicated or under-replicated, the NameNode deletes or adds replicas as needed.
Rack Awareness
• The NameNode also ensures that all the replicas of a block are not stored on the same (single) rack.
• It follows an in-built Rack Awareness
Algorithm to reduce latency as well as provide
fault tolerance.
• Considering the replication factor is 3, the Rack Awareness Algorithm says that
• the first replica of a block will be stored on a local rack, and
• the next two replicas will be stored on a different (remote) rack, but on different DataNodes within that (remote) rack.
• If you have more replicas, the rest of the replicas will be placed on random DataNodes, provided no more than two replicas end up on the same rack, where possible (see the sketch below).
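Below is a toy sketch of the placement rule just described, assuming a simple map from rack names to DataNode names. It only illustrates the idea for a replication factor of 3; it is not Hadoop's actual block placement policy implementation.

import java.util.*;

// Toy illustration of the rack-aware placement rule: first replica on the
// local rack, next two on one remote rack but on different DataNodes.
public class RackAwareSketch {
    static List<String> placeReplicas(Map<String, List<String>> rackToNodes, String localRack) {
        List<String> replicas = new ArrayList<>();
        // 1st replica: a node on the writer's local rack
        replicas.add(rackToNodes.get(localRack).get(0));
        // choose one remote rack for the remaining two replicas
        String remoteRack = rackToNodes.keySet().stream()
                .filter(rack -> !rack.equals(localRack))
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("need at least two racks"));
        List<String> remoteNodes = rackToNodes.get(remoteRack);
        // 2nd and 3rd replicas: two different DataNodes on that remote rack
        replicas.add(remoteNodes.get(0));
        replicas.add(remoteNodes.get(1));
        return replicas;
    }

    public static void main(String[] args) {
        Map<String, List<String>> cluster = Map.of(
                "/rack1", List.of("DataNode1", "DataNode2", "DataNode3"),
                "/rack2", List.of("DataNode4", "DataNode5", "DataNode6"));
        System.out.println(placeReplicas(cluster, "/rack1"));
        // e.g. [DataNode1, DataNode4, DataNode5]: one local-rack replica, two on a remote rack
    }
}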
why do we need a Rack Awareness
algorithm?
• The reasons are:
• To improve network performance: communication between nodes on different racks is routed through switches, and in most cases network bandwidth within a rack is greater than between racks, so writing two of the three replicas to a single remote rack reduces inter-rack traffic.
• To prevent loss of data: we don't have to worry about the data even if an entire rack fails because of a switch failure or power failure. And if you think about it, it makes sense; as the proverb says, never put all your eggs in the same basket (i.e. don't risk everything on the success of one venture).
NETWORK TOPOLOGY AND
HADOOP
What does it mean for two nodes in a local network to be “close” to
each other?
• Processes on the same node
• Different nodes on the same rack
• Nodes on different racks in the same data center
• Nodes in different data centers
• Let's imagine your cluster as a tree with the following levels:
• Abstract global root (Top or root)
• Data centers (1st level)
• Racks (2nd level)
• Nodes (3rd level or leaves)
• If we draw this tree there should be something like this:
For example, imagine a node n1 on rack r1 in data center d1. This can be
represented as /d1/r1/n1. Using this notation, here are the distances for the four
scenarios:
• distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
• distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
• distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same
data center)
• distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers).
• distance(/d1/r1/n1, /d2/r3/n10) = ?
What is the network distance?
• Let's count the distance between any circle and its parent as 1.
• Then the distance between any two circles is the sum of their distances to their closest common ancestor, or 0 for the same node.
• So it is always 6 for any two nodes in different data centers (like between /d1/r1/n1 and /d2/r4/n10).
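The rule above is easy to turn into code. The sketch below computes the distance from two topology paths such as /d1/r1/n1; it is an illustration only, not Hadoop's actual NetworkTopology class.

// Network distance between two nodes given their topology paths,
// e.g. distance("/d1/r1/n1", "/d1/r2/n3") == 4.
public class NetworkDistance {
    static int distance(String a, String b) {
        String[] pathA = a.substring(1).split("/");   // ["d1", "r1", "n1"]
        String[] pathB = b.substring(1).split("/");
        int common = 0;                               // length of the shared prefix
        while (common < pathA.length && common < pathB.length
                && pathA[common].equals(pathB[common])) {
            common++;
        }
        // hops from each node up to the closest common ancestor
        return (pathA.length - common) + (pathB.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1"));  // 0: same node
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2"));  // 2: same rack
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3"));  // 4: same data center
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4"));  // 6: different data centers
    }
}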
HDFS Write Architecture:
• Suppose the NameNode provided the following lists of IP addresses to the client:
• For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of DataNode 6}
• For Block B, list B = {IP of DataNode 3, IP of DataNode 7, IP of DataNode 9}
• Each block will be copied to three different DataNodes to keep the replication factor consistent throughout the cluster.
• Now the whole data copy process happens in three stages: set-up of the pipeline, data streaming and replication, and shutdown of the pipeline (the acknowledgement stage).
• For Block A: 1A -> 2A -> 3A -> 4A
• For Block B: 1B -> 2B -> 3B -> 4B -> 5B -> 6B
• While serving a read request from the client, HDFS selects the replica that is closest to the client.
• This reduces the read latency and the bandwidth consumption.
• Therefore, the replica that resides on the same rack as the reader node is selected, if possible.
Anatomy of a File Read
Anatomy of a File Write
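The figures for these two slides are not reproduced here. As a complementary sketch of what happens on the client side, the snippet below writes and then reads a file through the HDFS Java API; the file path and the localhost:9000 NameNode address are just example values (see the configuration files on the following slides).

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Minimal HDFS client: write a file, then read it back.
public class HdfsReadWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");   // example NameNode address
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/data/demo.txt");         // example path

        // Write: the client asks the NameNode where to place the blocks and
        // then streams the data through the DataNode pipeline.
        try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns the block locations and the client reads
        // from the closest replica.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}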
Configuration XML files in Hadoop
1. hadoop-env.sh
• This file specifies the environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop), most importantly JAVA_HOME.
2. hdfs-site.xml
• This is the main configuration file for HDFS. It defines the NameNode and DataNode paths as well as the replication factor.
• The replication factor is specified by the dfs.replication property; since this is a single-node cluster, we set it to 1.
• The default heartbeat interval is 3 seconds. One can change it by using
dfs.heartbeat.interval in hdfs-site.xml.
3. mapred-site.xml
• To specify which framework should be used for MapReduce, we use the mapreduce.framework.name property; yarn is used here.
• yarn-site.xml
• To specify the auxiliary service that needs to run with the NodeManager, the "yarn.nodemanager.aux-services" property is used.
• Here shuffling is used as the auxiliary service, and to specify the class that should be used for shuffling we use "yarn.nodemanager.aux-services.mapreduce.shuffle.class".
• core-site.xml
• This file informs the Hadoop daemons where the NameNode runs in the cluster and contains configuration settings for Hadoop core, such as I/O settings common to HDFS and MapReduce. It also defines port numbers, memory limits and the size of the read/write buffers used by Hadoop. Find this file in the etc/hadoop directory.
• The location of the NameNode is specified by the fs.defaultFS property, e.g. a NameNode running at port 9000 on localhost.
• The hadoop.tmp.dir property specifies the location where temporary as well as permanent data of Hadoop will be stored.
• Sample single-node versions of all four XML files are shown after this list.
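For illustration, here is what minimal single-node versions of these files might look like. The property names are the standard ones discussed above; the paths, the port and the values are example choices, not requirements.

<!-- core-site.xml : NameNode location and temporary data directory -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-data</value>   <!-- example location -->
  </property>
</configuration>

<!-- hdfs-site.xml : replication factor and NameNode/DataNode directories -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>                  <!-- single-node cluster -->
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/hadoop/namenode</value>   <!-- example path -->
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/hadoop/datanode</value>   <!-- example path -->
  </property>
</configuration>

<!-- mapred-site.xml : run MapReduce on YARN -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- yarn-site.xml : auxiliary shuffle service for the NodeManager -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>  <!-- service name used by recent Hadoop versions -->
  </property>
</configuration>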