Introduction To Hadoop
Hadoop is an open-source software framework that is used for storing and processing large
amounts of data in a distributed computing environment. It is designed to handle big data
and is based on the MapReduce programming model, which allows for the parallel
processing of large datasets.
What is Hadoop?
Hadoop is an open-source software framework for storing large amounts of data and performing computation on it. The framework is written mainly in Java, with some native code in C and shell scripts.
Hadoop has two main components:
HDFS (Hadoop Distributed File System): This is the storage component of Hadoop,
which allows for the storage of large amounts of data across multiple machines. It is
designed to work with commodity hardware, which makes it cost-effective.
YARN (Yet Another Resource Negotiator): This is the resource management
component of Hadoop, which manages the allocation of resources (such as CPU and
memory) for processing the data stored in HDFS.
Hadoop also includes several related projects that provide additional
functionality, such as Hive (a SQL-like query engine), Pig (a high-level platform
for creating MapReduce programs), and HBase (a non-relational, distributed
database).
Hadoop is commonly used in big data scenarios such as data warehousing, business
intelligence, and machine learning. It’s also used for data processing, data analysis,
and data mining. It enables the distributed processing of large data sets across
clusters of computers using a simple programming model.
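To make the MapReduce programming model mentioned above more concrete, here is a minimal word-count sketch written against the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). It is only an illustrative sketch: the class name and the input/output paths passed on the command line are placeholders, and in practice the job would be packaged as a JAR and submitted to a YARN cluster.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Once packaged, such a job is typically submitted with the hadoop jar command, pointing it at input and output directories in HDFS.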
History of Hadoop:
Hadoop was developed at the Apache Software Foundation, and its co-founders
are Doug Cutting and Mike Cafarella. Co-founder Doug Cutting named it after
his son's toy elephant. In October 2003, Google published the Google File
System paper. In January 2006, MapReduce development started on Apache Nutch,
which consisted of around 6,000 lines of code for MapReduce and around 5,000 lines of code for
HDFS. In April 2006, Hadoop 0.1.0 was released.
Hadoop is an open-source software framework for storing and processing big data.
It was created at the Apache Software Foundation in 2006, based on papers
published by Google describing the Google File System (GFS, 2003) and the
MapReduce programming model (2004).
The Hadoop framework allows for the distributed processing of large data sets
across clusters of computers using simple programming models. It is designed to
scale up from single servers to thousands of machines, each offering local
computation and storage.
It is used by many organizations, including Yahoo, Facebook, and IBM, for a
variety of purposes such as data warehousing, log processing, and research. Hadoop
has been widely adopted in the industry and has become a key technology for big
data processing.
Features of Hadoop:
It is fault tolerant.
It is highly available.
Its programming model is simple.
It offers huge, flexible storage.
It is low cost.
Hadoop has several key features that make it well-suited for big data processing:
Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for
the storage and processing of extremely large amounts of data.
Scalability: Hadoop can scale from a single server to thousands of machines, making it easy
to add more capacity as needed.
Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can continue to
operate even in the presence of hardware failures.
Data Locality: Hadoop tries to process data on the same node where it is stored; this reduces
network traffic and improves performance.
High Availability: Hadoop provides high-availability features that help ensure the data is
always accessible and is not lost.
Flexible Data Processing: Hadoop's MapReduce programming model allows data to be
processed in a distributed fashion, making it easy to implement a wide variety of data
processing tasks.
Data Integrity: Hadoop uses built-in checksums to help ensure that stored data is consistent
and correct.
Data Replication: Hadoop replicates data blocks across the cluster for fault tolerance.
Data Compression: Hadoop supports built-in data compression, which reduces storage space
and improves performance.
YARN: A resource management platform that allows multiple data processing engines, such as
real-time streaming, batch processing, and interactive SQL, to run on and process data stored in
HDFS.
Hadoop Distributed File Systems (HDFS):
HDFS is a distributed file system that handles large data sets running on commodity
hardware. It is used to scale a single Apache Hadoop cluster to hundreds (and even
thousands) of nodes. HDFS is one of the major components of Apache Hadoop, the others
being MapReduce and YARN.
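As a rough illustration of how a client application interacts with HDFS, here is a minimal sketch that writes and then reads a small file through the Hadoop FileSystem Java API. It is only a sketch: the NameNode URI and the file path are placeholders, and fs.defaultFS would normally come from core-site.xml rather than being set in code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address; normally taken from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt"); // hypothetical path

    // Write: the client streams the data to DataNodes, while the NameNode
    // records the file's block metadata.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back and print its contents.
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }

    fs.close();
  }
}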
The goals of HDFS:
Fast recovery from hardware failures:
Because one HDFS instance may consist of thousands of servers, failure of at least one server
is inevitable. HDFS has been built to detect faults and automatically recover quickly.
Access to streaming data:
HDFS is intended more for batch processing versus interactive use, so the emphasis in the
design is for high data throughput rates, which accommodate streaming access to data sets.
Accommodation of large data sets:
HDFS accommodates applications that have data sets typically gigabytes to terabytes in size.
HDFS provides high aggregate data bandwidth and can scale to hundreds of nodes in a single
cluster.
Portability:
To facilitate adoption, HDFS is designed to be portable across multiple hardware platforms
and to be compatible with a variety of underlying operating systems.
Nodes:
NameNode (Master Node):
Manages all the slave nodes and assigns work to them.
It executes filesystem namespace operations such as opening, closing, and renaming files and
directories.
It should be deployed on reliable, high-specification hardware, not on
commodity hardware.
DataNode (Slave Node):
The actual worker nodes, which do the actual work such as reading, writing, and processing data.
They also perform block creation, deletion, and replication upon instruction from the
master.
They can be deployed on commodity hardware.
HDFS daemons: Daemons are the processes running in the background.
NameNode:
Runs on the master node.
Stores metadata (data about data) such as file paths, the number of blocks, block
IDs, etc. (see the sketch after this list).
Requires a large amount of RAM.
Stores the metadata in RAM for fast retrieval, i.e. to reduce seek time, though a
persistent copy of it is kept on disk.
DataNode:
Runs on the slave nodes.
Requires a large amount of disk space, as the actual data is stored here.
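As mentioned in the NameNode list above, the metadata kept by the NameNode (file length, block size, replication factor, and block locations) can also be inspected from client code. The sketch below uses the Hadoop FileSystem Java API; the file path is hypothetical and the cluster configuration is assumed to be on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfoExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Hypothetical file; replace with a real HDFS path.
    FileStatus status = fs.getFileStatus(new Path("/user/demo/big-dataset.csv"));

    System.out.println("File size (bytes):   " + status.getLen());
    System.out.println("Block size (bytes):  " + status.getBlockSize());
    System.out.println("Replication factor:  " + status.getReplication());

    // Each BlockLocation lists the DataNodes that hold a replica of that block.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("Block at offset " + block.getOffset()
          + ", length " + block.getLength()
          + ", hosts: " + String.join(", ", block.getHosts()));
    }

    fs.close();
  }
}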
Data storage in HDFS: Now let's see how the data is stored in a distributed manner.
Let's assume a 100 TB file is inserted. The master node (NameNode) will first divide the
file into blocks, say of 10 TB each in this illustration (the default block size is 128 MB in
Hadoop 2.x and above). These blocks are then stored across different DataNodes (slave nodes).
The DataNodes replicate the blocks among themselves, and the information about which blocks
they contain is sent to the master. The default replication factor is 3, which means that for each
block 3 replicas are created (including the original). In hdfs-site.xml we can increase or
decrease the replication factor, i.e. we can edit this configuration there.
Note: The master node has a record of everything; it knows the location and details of each and
every DataNode and the blocks they contain, i.e. nothing is done without the
master node's permission.
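For illustration, the replication factor mentioned above can also be adjusted from client code rather than only in hdfs-site.xml. The sketch below assumes a reachable cluster whose configuration is on the classpath; the file path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Override the default replication factor (dfs.replication) for files
    // created by this client; the cluster-wide default lives in hdfs-site.xml.
    conf.set("dfs.replication", "3");
    FileSystem fs = FileSystem.get(conf);

    // Change the replication factor of an existing (hypothetical) file; the
    // NameNode then schedules creation or removal of replicas on the DataNodes.
    boolean accepted = fs.setReplication(new Path("/user/demo/hello.txt"), (short) 5);
    System.out.println("Replication change accepted: " + accepted);

    fs.close();
  }
}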
Why divide the file into blocks?
Let's assume that we don't divide the file. It is very difficult to store a 100 TB file on a single
machine, and even if we could store it, each read and write operation on that whole file would
have a very high seek time. But if we have multiple blocks of size 128 MB, it becomes
easy to perform various read and write operations on them compared to doing it on the whole file
at once. So we divide the file to get faster data access, i.e. to reduce seek time.
Why replicate the blocks in data nodes while storing?
Let's assume we don't replicate and a particular block is present only on DataNode D1. Now
if DataNode D1 crashes, we will lose that block, which will make the overall data
inconsistent and faulty. So we replicate the blocks to achieve fault tolerance.
Terms related to HDFS:
HeartBeat: It is the signal that a DataNode continuously sends to the NameNode. If the NameNode
doesn't receive a heartbeat from a DataNode, it will consider that node dead.
Balancing: If a DataNode crashes, the blocks present on it are gone too, and those
blocks become under-replicated compared to the remaining blocks. The master
node (NameNode) will then signal the DataNodes that contain replicas of the lost blocks
to replicate them, so that the overall distribution of blocks remains balanced.
Replication: It is carried out by the DataNodes.
Note: No two replicas of the same block are present on the same datanode.
Advantages of HDFS:
It is inexpensive, immutable in nature, stores data reliably, tolerates faults,
is scalable and block-structured, can process a large amount of data simultaneously, and
more.
Disadvantages of HDFS:
Its biggest disadvantage is that it is not a good fit for small quantities of data.