Unit 2 Big Data Notes

INTRODUCTION:

Hadoop is an open-source software framework that is used for storing and processing
large amounts of data in a distributed computing environment. It is designed to handle
big data and is based on the MapReduce programming model, which allows for the
parallel processing of large data sets.
What is Hadoop?
Hadoop is an open-source software framework for storing large amounts of data and
performing computation. The framework is written primarily in Java, with some native
code in C and shell scripts.
Hadoop has two main components:

 HDFS (Hadoop Distributed File System): This is the storage component of Hadoop,
which allows for the storage of large amounts of data across multiple machines. It is
designed to work with commodity hardware, which makes it cost-effective.
 YARN (Yet Another Resource Negotiator): This is the resource management
component of Hadoop, which manages the allocation of resources (such as CPU and
memory) for processing the data stored in HDFS.
 Hadoop also includes several modules that provide additional functionality, such as
Hive (a SQL-like query language), Pig (a high-level platform for creating MapReduce
programs), and HBase (a non-relational, distributed database).
 Hadoop is commonly used in big data scenarios such as data warehousing, business
intelligence, and machine learning. It’s also used for data processing, data analysis,
and data mining. It enables the distributed processing of large data sets across
clusters of computers using a simple programming model.

History of Hadoop

Hadoop was developed under the Apache Software Foundation, and its co-founders are
Doug Cutting and Mike Cafarella. Co-founder Doug Cutting named it after his son’s toy
elephant. In October 2003, Google released its paper on the Google File System. In
January 2006, MapReduce development started on Apache Nutch, which consisted of
around 6,000 lines of code for MapReduce and around 5,000 lines of code for HDFS. In
April 2006, Hadoop 0.1.0 was released.
Hadoop is an open-source software framework for storing and processing big data. It
was created under the Apache Software Foundation in 2006, building on papers published
by Google that described the Google File System (GFS, 2003) and the MapReduce
programming model (2004). The Hadoop framework allows for the distributed processing
of large data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage. It is used by many organizations, including Yahoo, Facebook,
and IBM, for a variety of purposes such as data warehousing, log processing, and
research. Hadoop has been widely adopted in the industry and has become a key
technology for big data processing.
Features of Hadoop:

1. It is fault-tolerant.
2. It is highly available.
3. Its programming model is simple.
4. It offers huge, flexible storage.
5. It is low cost.

Hadoop has several key features that make it well-suited for big data
processing:

 Distributed Storage: Hadoop stores large data sets across multiple machines,
allowing for the storage and processing of extremely large amounts of data.
 Scalability: Hadoop can scale from a single server to thousands of machines, making
it easy to add more capacity as needed.
 Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can
continue to operate even in the presence of hardware failures.
 Data Locality: Hadoop stores data on the same node where it will be processed, which
helps reduce network traffic and improve performance.
 High Availability: Hadoop provides a high-availability feature, which helps ensure
that the data is always accessible and is not lost.
 Flexible Data Processing: Hadoop’s MapReduce programming model allows for the
processing of data in a distributed fashion, making it easy to implement a wide
variety of data processing tasks.
 Data Integrity: Hadoop provides a built-in checksum feature, which helps ensure that
the stored data is consistent and correct.
 Data Replication: Hadoop replicates data across the cluster, which provides fault
tolerance.
 Data Compression: Hadoop provides built-in data compression, which helps reduce
storage space and improve performance.
 YARN: A resource management platform that allows multiple data processing
engines, such as real-time streaming, batch processing, and interactive SQL, to run and
process data stored in HDFS.

Hadoop Distributed File System (HDFS)

Hadoop’s distributed file system, HDFS, splits files into blocks and distributes them
across the various nodes of large clusters. In case of a node failure, the system keeps
operating, and data transfer between the nodes is facilitated by HDFS.

Advantages of HDFS:

It is inexpensive, immutable in nature, stores data reliably, tolerates faults, is scalable
and block-structured, and can process large amounts of data simultaneously, among other
benefits.

Disadvantages of HDFS:

Its biggest disadvantage is that it is not a good fit for small quantities of data. It also has
issues related to stability and can be restrictive and rough in nature. Hadoop also supports
a wide range of software packages such as Apache Flume, Apache Oozie, Apache HBase,
Apache Sqoop, Apache Spark, Apache Storm, Apache Pig, Apache Hive, and Apache
Phoenix.

Introduction:

The Hadoop ecosystem is a platform, or suite, that provides various services to solve big
data problems. It includes Apache projects and various commercial tools and solutions.
There are four major elements of Hadoop, i.e. HDFS, MapReduce, YARN, and Hadoop
Common Utilities. Most of the tools or solutions are used to supplement or support these
major elements. All these tools work collectively to provide services such as ingestion,
analysis, storage, and maintenance of data.

Following are the components that collectively form a Hadoop ecosystem:


 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling

Note: Apart from the above-mentioned components, there are many other components
too that are part of the Hadoop ecosystem.
All these toolkits and components revolve around one thing: data. That is the beauty of
Hadoop: it revolves around data, which makes its processing and analysis easier.

HDFS:

 HDFS is the primary or major component of the Hadoop ecosystem and is responsible
for storing large data sets of structured or unstructured data across various nodes,
thereby maintaining the metadata in the form of log files.
 HDFS consists of two core components i.e.
1. Name node
2. Data Node
 Name Node is the prime node; it contains the metadata (data about data) and requires
comparatively fewer resources than the Data Nodes, which store the actual data. These
Data Nodes run on commodity hardware in the distributed environment, which
undoubtedly makes Hadoop cost-effective. (A minimal client-side sketch of interacting
with HDFS follows this list.)
 HDFS maintains all the coordination between the clusters and hardware, thus
working at the heart of the system.
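
The sketch below shows, from an application’s point of view, how a client writes to and
reads from HDFS through the Hadoop Java FileSystem API: the Name Node resolves the
path to block locations, while the Data Nodes serve the actual bytes. This is a minimal
sketch, not the notes’ own example; the NameNode address and file path are hypothetical.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; normally taken from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/demo/notes.txt");   // hypothetical path

            // Write: the client asks the NameNode for target DataNodes,
            // then streams the bytes to those DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the NameNode returns the block locations and the bytes
            // are read from the closest DataNode.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }

            fs.close();
        }
    }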

YARN:

 Yet Another Resource Negotiator: as the name implies, YARN helps to manage the
resources across the clusters. In short, it performs scheduling and resource allocation
for the Hadoop system.
 It consists of three major components, i.e.
1. Resource Manager
2. Node Manager
3. Application Manager
 The Resource Manager has the privilege of allocating resources for the applications in
the system, whereas the Node Managers work on the allocation of resources such as
CPU, memory, and bandwidth per machine and later acknowledge the Resource
Manager. The Application Manager works as an interface between the Resource
Manager and the Node Managers and performs negotiations as per the requirements of
the two.

MapReduce:

 By making use of distributed and parallel algorithms, MapReduce carries over the
processing logic and helps to write applications that transform big data sets into
manageable ones.
 MapReduce uses two functions, i.e. Map() and Reduce(), whose tasks are (a minimal
word-count sketch follows this list):
1. Map() performs sorting and filtering of data, thereby organizing it into groups.
Map() generates key-value pair results which are later processed by the Reduce()
method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped
data. In simple terms, Reduce() takes the output generated by Map() as input and
combines those tuples into a smaller set of tuples.
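
To make the division of work between Map() and Reduce() concrete, here is a minimal
word-count sketch using the Hadoop Java MapReduce API: the mapper emits a (word, 1)
pair for every word it sees, and the reducer sums the counts for each word. The class
names are illustrative only.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountSketch {

        // Map(): breaks each input line into (word, 1) key-value pairs.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);            // emit (word, 1)
                }
            }
        }

        // Reduce(): aggregates the mapped data by summing the counts per word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum)); // emit (word, total count)
            }
        }
    }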
PIG:

Pig was basically developed by Yahoo. It works on the Pig Latin language, which is a
query-based language similar to SQL.
 It is a platform for structuring the data flow and for processing and analyzing huge
data sets.
 Pig does the work of executing commands, and in the background all the activities of
MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
 The Pig Latin language is specially designed for this framework and runs on Pig
Runtime, just the way Java runs on the JVM.
 Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop Ecosystem.

HIVE:

 With the help of an SQL-like methodology and interface, HIVE performs reading and
writing of large data sets. Its query language is called HQL (Hive Query Language).
 It is highly scalable, as it allows both real-time and batch processing. Also, all SQL
data types are supported by Hive, making query processing easier.
 Similar to other query processing frameworks, HIVE comes with two components:
JDBC drivers and the HIVE command line.
 JDBC, along with ODBC drivers, works on establishing the data storage permissions
and connection, whereas the HIVE command line helps in the processing of queries.
Mahout:
 Mahout brings machine learnability to a system or application. Machine learning, as
the name suggests, helps a system to develop itself based on patterns,
user/environment interaction, or algorithms.
 It provides various libraries and functionalities such as collaborative filtering,
clustering, and classification, which are concepts of machine learning. It allows
invoking algorithms as per our need with the help of its own libraries.

Apache Spark:

 It’s a platform that handles all the process-intensive tasks like batch processing,
interactive or iterative real-time processing, graph conversions, visualization, etc.
 It uses in-memory resources, and is thus faster than disk-based MapReduce in terms
of optimization.
 Spark is best suited for real-time data, whereas Hadoop is best suited for structured
data or batch processing; hence both are used in most companies interchangeably.

Apache HBase:

 It’s a NoSQL database which supports all kinds of data and is thus capable of handling
anything within a Hadoop database. It provides capabilities similar to Google’s
BigTable and is thus able to work on big data sets effectively.
 At times when we need to search or retrieve the occurrences of something small in a
huge database, the request must be processed within a short span of time. At such
times, HBase comes in handy, as it gives us a tolerant way of storing and looking up
limited data. (A minimal client sketch follows this list.)
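
As a minimal illustration of the kind of fast, key-based lookup described above, the sketch
below writes and reads a single cell through the HBase Java client API. The table name,
column family, and values are hypothetical, and this is a sketch rather than a complete
application.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("notes"))) { // hypothetical table

                // Write one cell: row "unit2", column family "cf", qualifier "topic".
                Put put = new Put(Bytes.toBytes("unit2"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("topic"), Bytes.toBytes("hadoop"));
                table.put(put);

                // Point lookup by row key: the fast retrieval case described above.
                Result result = table.get(new Get(Bytes.toBytes("unit2")));
                byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("topic"));
                System.out.println(Bytes.toString(value));
            }
        }
    }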

Other Components:
Apart from all of these, there are some other components too that carry out a huge task
in order to make Hadoop capable of processing large datasets. They are as follows:

 Solr, Lucene: These are two services that perform the task of searching and indexing
with the help of Java libraries. In particular, Lucene is a Java library that also provides
a spell-check mechanism, and Solr is built on top of Lucene.
 Zookeeper: There was a huge issue of managing coordination and synchronization
among the resources and components of Hadoop, which often resulted in
inconsistency. Zookeeper overcame all these problems by performing synchronization,
inter-component communication, grouping, and maintenance.
 Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding
them together as a single unit. There are two kinds of jobs, i.e. Oozie workflow jobs
and Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in
a sequentially ordered manner, whereas Oozie coordinator jobs are those that are
triggered when some data or an external stimulus is given to them.

Hadoop – Architecture


As we all know, Hadoop is a framework written in Java that utilizes a large cluster of
commodity hardware to maintain and store big data. Hadoop works on the MapReduce
programming algorithm that was introduced by Google. Today, lots of big-brand
companies use Hadoop in their organizations to deal with big data, e.g. Facebook, Yahoo,
Netflix, eBay, etc. The Hadoop architecture mainly consists of 4 components.

 MapReduce
 HDFS (Hadoop Distributed File System)
 YARN (Yet Another Resource Negotiator)
 Common Utilities or Hadoop Common

Let’s understand the role of each of these components in detail.

1. MapReduce

MapReduce is nothing but an algorithm or data structure that is based on the YARN
framework. The major feature of MapReduce is that it performs distributed processing in
parallel in a Hadoop cluster, which is what makes Hadoop work so fast. When you are
dealing with big data, serial processing is no longer of any use. MapReduce has mainly 2
tasks, which are divided phase-wise: in the first phase Map is utilized, and in the next
phase Reduce is utilized.
Here, we can see that the input is provided to the Map() function, then its output is used
as an input to the Reduce() function, and after that we receive our final output. Let’s
understand what Map() and Reduce() do.
As we can see, an input is provided to Map(); since we are using big data, the input is a
set of data. The Map() function here breaks these data blocks into tuples, which are
nothing but key-value pairs. These key-value pairs are then sent as input to Reduce(). The
Reduce() function combines these broken tuples, or key-value pairs, based on their key
value, forms a set of tuples, and performs operations like sorting, summation, etc., which
are then sent to the final output node. Finally, the output is obtained.
The data processing is always done in the Reducer, depending upon the business
requirement of that industry. This is how first Map() and then Reduce() is utilized, one
after the other.
Let’s understand the Map task and the Reduce task in detail.
Map Task:

 RecordReader: The purpose of the RecordReader is to break the records apart. It is
responsible for providing key-value pairs to the Map() function. The key is actually the
record’s locational information, and the value is the data associated with it.
 Map: A map is nothing but a user-defined function whose work is to process the
tuples obtained from the RecordReader. The Map() function either does not generate
any key-value pair or generates multiple pairs of these tuples.
 Combiner: The Combiner is used for grouping the data in the Map workflow. It is
similar to a local reducer. The intermediate key-value pairs that are generated in the
Map phase are combined with the help of this combiner. Using a combiner is not
necessary, as it is optional.
 Partitioner: The Partitioner is responsible for fetching the key-value pairs generated in
the Mapper phase. The partitioner generates the shards corresponding to each reducer.
The hashcode of each key is also fetched by this partitioner. The partitioner then
performs a modulus of the hashcode with the number of reducers
(key.hashCode() % numberOfReducers); a minimal partitioner sketch follows this list.
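
As a concrete illustration of the hashcode-modulo rule described above, here is a minimal
custom partitioner sketch in Java; Hadoop’s default HashPartitioner behaves in
essentially the same way. The class name is illustrative.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes each intermediate (key, value) pair to a reducer shard
    // using key.hashCode() % numberOfReducers.
    public class HashModPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Mask the sign bit so the result is never negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }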

Reduce Task

 Shuffle and Sort: The task of the Reducer starts with this step. The process in which
the Mapper generates intermediate key-value pairs and transfers them to the Reducer
task is known as shuffling. Using the shuffling process, the system can sort the data by
key. Shuffling begins once some of the map tasks are done, which is why it is a faster
process that does not wait for the completion of all the tasks performed by the Mapper.
 Reduce: The main task of Reduce is to gather the tuples generated by Map and then
perform some sorting and aggregation on those key-value pairs, depending on their
key element.
 OutputFormat: Once all the operations are performed, the key-value pairs are written
into a file with the help of a record writer, each record on a new line, with the key and
value in a space-separated manner.

2. HDFS

HDFS (Hadoop Distributed File System) is utilized for storage. It is mainly designed for
working on commodity hardware (inexpensive devices) and follows a distributed file
system design. HDFS is designed in such a way that it prefers storing the data in large
blocks rather than storing many small data blocks.
HDFS in Hadoop provides fault tolerance and high availability to the storage layer and
the other devices present in the Hadoop cluster. Data storage nodes in HDFS:
 NameNode(Master)
 DataNode(Slave)

NameNode:

The NameNode works as the master in a Hadoop cluster and guides the DataNodes
(slaves). The NameNode is mainly used for storing the metadata, i.e. the data about the
data. Metadata can be the transaction logs that keep track of the user’s activity in the
Hadoop cluster. Metadata can also be the name of a file, its size, and the information
about the location (block number, block IDs) of the DataNodes, which the NameNode
stores to find the closest DataNode for faster communication. The NameNode instructs
the DataNodes with operations like delete, create, replicate, etc.
DataNode:

DataNodes work as slaves. DataNodes are mainly utilized for storing the data in a
Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The
more DataNodes there are, the more data the Hadoop cluster will be able to store. It is
therefore advised that DataNodes have a high storage capacity to store a large number of
file blocks.

High Level Architecture Of Hadoop


File Block In HDFS:

Data in HDFS is always stored in terms of blocks. So a single file is divided into multiple
blocks of size 128 MB, which is the default, and you can also change it manually.

Let’s understand this concept of breaking down a file into blocks with an example.
Suppose you have uploaded a file of 400 MB to your HDFS; this file gets divided into
blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB, meaning 4 blocks are
created, each of 128 MB except the last one. Hadoop doesn’t know or care about what
data is stored in these blocks, so it treats the final file block as a partial record, as it has no
idea about the record boundaries. In the Linux file system, the size of a file block is about
4 KB, which is much smaller than the default block size in the Hadoop file system. As we
all know, Hadoop is mainly configured for storing large data, in the range of petabytes;
this is what makes the Hadoop file system different from other file systems, as it can be
scaled. Nowadays, file blocks of 128 MB to 256 MB are used in Hadoop.
Replication In HDFS

Replication ensures the availability of the data. Replication means making a copy of
something, and the number of times you make a copy of that particular thing is expressed
as its replication factor. As we have seen in the section on file blocks, HDFS stores the
data in the form of various blocks, and at the same time Hadoop is also configured to
make copies of those file blocks.
By default, the replication factor for Hadoop is set to 3, which is configurable, meaning
you can change it manually as per your requirement. In the above example we made 4
file blocks, which means that 3 replicas (copies) of each file block are made, so a total of
4 × 3 = 12 blocks are stored for backup purposes.
This is because for running Hadoop we use commodity hardware (inexpensive system
hardware), which can crash at any time; we are not using supercomputers for our Hadoop
setup. That is why we need a feature in HDFS that can make copies of the file blocks for
backup purposes; this is known as fault tolerance.
One thing we also need to note is that after making so many replicas of our file blocks we
use much more storage, but for big organizations the data is far more important than the
storage, so nobody worries about this extra space. You can configure the replication
factor in your hdfs-site.xml file; a minimal example follows.
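
For reference, a minimal hdfs-site.xml fragment that changes the replication factor might
look like the following sketch; the value 2 is purely illustrative.

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description>Number of replicas kept for each file block.</description>
      </property>
    </configuration>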
Rack Awareness: A rack is nothing but the physical collection of nodes in our Hadoop
cluster (maybe 30 to 40). A large Hadoop cluster consists of many racks. With the help of
this rack information, the NameNode chooses the closest DataNode to achieve maximum
performance while performing read/write operations, which reduces network traffic.

HDFS Architecture
3. YARN(Yet Another Resource Negotiator)

YARN is the framework on which MapReduce works. YARN performs 2 operations: job
scheduling and resource management. The purpose of the Job Scheduler is to divide a big
task into small jobs so that each job can be assigned to various slaves in a Hadoop cluster
and processing can be maximized. The Job Scheduler also keeps track of which job is
important, which job has more priority, the dependencies between the jobs, and all other
information like job timing, etc. The use of the Resource Manager is to manage all the
resources that are made available for running the Hadoop cluster.
Features of YARN

 Multi-Tenancy
 Scalability
 Cluster-Utilization
 Compatibility

4. Hadoop common or Common Utilities

Hadoop Common, or the common utilities, is nothing but our Java library and Java files,
or we can say the Java scripts, that we need for all the other components present in a
Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce for running
the cluster. Hadoop Common assumes that hardware failure in a Hadoop cluster is
common, so it needs to be handled automatically in software by the Hadoop framework.
Hive Architecture
The following architecture explains the flow of query submission into Hive.

Hive Client

Hive allows writing applications in various languages, including Java, Python, and C++. It
supports different types of clients, such as:

o Thrift Server - It is a cross-language service provider platform that serves requests
from all programming languages that support Thrift.
o JDBC Driver - It is used to establish a connection between Hive and Java applications.
The JDBC driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver. (A
minimal connection sketch follows this list.)
o ODBC Driver - It allows applications that support the ODBC protocol to connect
to Hive.
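
Below is a minimal sketch of connecting to Hive from a Java application over JDBC. It
assumes a HiveServer2 instance at a hypothetical host and port and the newer driver class
org.apache.hive.jdbc.HiveDriver; older Hive servers use the
org.apache.hadoop.hive.jdbc.HiveDriver class mentioned above together with a
jdbc:hive:// URL instead.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcSketch {
        public static void main(String[] args) throws Exception {
            // Register the Hive JDBC driver (HiveServer2 variant).
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Hypothetical HiveServer2 endpoint and default database.
            String url = "jdbc:hive2://hiveserver-host:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hiveuser", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }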

Hive Services
The following are the services provided by Hive:-

 Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.
 Hive Web User Interface - The Hive Web UI is just an alternative to the Hive CLI. It
provides a web-based GUI for executing Hive queries and commands.
 Hive MetaStore - It is a central repository that stores all the structural information of
the various tables and partitions in the warehouse. It also includes metadata about each
column and its type, the serializers and deserializers used to read and write data, and
the corresponding HDFS files where the data is stored.
 Hive Server - It is referred to as the Apache Thrift Server. It accepts requests from
different clients and provides them to the Hive Driver.
 Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
 Hive Compiler - The purpose of the compiler is to parse the query and perform
semantic analysis on the different query blocks and expressions. It converts HiveQL
statements into MapReduce jobs.
 Hive Execution Engine - The optimizer generates the logical plan in the form of a
DAG of map-reduce tasks and HDFS tasks. In the end, the execution engine executes
the incoming tasks in the order of their dependencies.

RDBMS and Hadoop are both used for handling, storing, and processing data, but they
differ in design, implementation, and use cases. An RDBMS primarily stores structured
data and processes it with SQL, while Hadoop stores and handles both structured and
unstructured data and processes it using MapReduce or Spark. In this article, we will go
into detail about RDBMS and Hadoop and also look into the differences between them.

What is an RDBMS?

An RDBMS is an information management system based on the relational data model.
In an RDBMS, tables are used for information storage. Each row of a table represents a
record and each column represents an attribute of the data. The organization of data and
its manipulation processes in an RDBMS differ from those of other databases. An
RDBMS ensures the ACID (atomicity, consistency, isolation, durability) properties
required for designing a database. The purpose of an RDBMS is to store, manage, and
retrieve data as quickly and reliably as possible.

Advantages

 Supports high data integrity.
 Provides multiple levels of security.
 Creates replicas of the data so that the data can be used in case of disasters.
 Normalization is present.
Disadvantages

 It is less scalable as compared to Hadoop.
 Huge costs are required.
 Complex design.
 The structure of the database is fixed.
What is Hadoop?

It is an open-source software framework used for storing data and running applications
on a group of commodity hardware. It has a large storage capacity and high processing
power. It can manage multiple concurrent processes at the same time. It is used in
predictive analysis, data mining, and machine learning. It can handle both structured
and unstructured forms of data. It is more flexible in storing, processing, and managing
data than traditional RDBMS. Unlike traditional systems, Hadoop enables multiple
analytical processes on the same data at the same time. It supports scalability very
flexibly.
Advantages
 Highly scalable.
 Cost effective because it is an open source software.
 Store any kind of data (structured, semi-structured, and unstructured).
 High throughput.
Disadvantages
 Not effective for small files.
 Security is not enabled by default.
 Higher processing overhead.
 Supports only batch processing (MapReduce).

Differences Between RDBMS and Hadoop

RDBMS: A traditional row-and-column based database, basically used for data storage,
manipulation, and retrieval.
Hadoop: An open-source software framework used for storing data and running
applications or processes concurrently.

RDBMS: Mostly structured data is processed.
Hadoop: Both structured and unstructured data are processed.

RDBMS: It is best suited for OLTP environments.
Hadoop: It is best suited for big data.

RDBMS: It is less scalable than Hadoop.
Hadoop: It is highly scalable.

RDBMS: Data normalization is required.
Hadoop: Data normalization is not required.

RDBMS: It stores transformed and aggregated data.
Hadoop: It stores huge volumes of data.

RDBMS: It has no latency in response.
Hadoop: It has some latency in response.

RDBMS: The data schema is static.
Hadoop: The data schema is dynamic.

RDBMS: High data integrity is available.
Hadoop: Lower data integrity than an RDBMS.

RDBMS: Cost applies for the licensed software.
Hadoop: Free of cost, as it is open-source software.

RDBMS: Data is processed using SQL.
Hadoop: Data is processed using MapReduce.

RDBMS: Follows ACID properties.
Hadoop: Does not follow ACID properties.

Introduction to Hadoop Distributed File System (HDFS)

With growing data velocity, the data size easily outgrows the storage limit of a single
machine. A solution is to store the data across a network of machines. Such filesystems
are called distributed filesystems. Since data is stored across a network, all the
complications of a network come in.
This is where Hadoop comes in. It provides one of the most reliable filesystems. HDFS
(Hadoop Distributed File System) is a uniquely designed filesystem that provides storage
for extremely large files with a streaming data access pattern, and it runs on commodity
hardware. Let’s elaborate on these terms:
 Extremely large files: Here we are talking about data in the range of petabytes (1 PB =
1000 TB).
 Streaming data access pattern: HDFS is designed on the principle of write once, read
many times. Once data is written, large portions of the dataset can be processed any
number of times.
 Commodity hardware: Hardware that is inexpensive and easily available in the
market. This is one of the features that especially distinguishes HDFS from other file
systems.
Note: Master and slave nodes typically form the HDFS cluster.

1. NameNode(MasterNode):

 Manages all the slave nodes and assigns work to them.
 It executes filesystem namespace operations like opening, closing, and renaming files
and directories.
 It should be deployed on reliable, high-configuration hardware, not on commodity
hardware.

2. DataNode(SlaveNode):
 Actual worker nodes, which do the actual work like reading, writing, processing, etc.
 They also perform creation, deletion, and replication upon instruction from the
master.
 They can be deployed on commodity hardware.
HDFS daemons:
Daemons are the processes running in background.
 Namenodes:
o Run on the master node.
o Store metadata (data about data) like file paths, the number of blocks, block IDs,
etc.
o Require a high amount of RAM.
o Store metadata in RAM for fast retrieval, i.e. to reduce seek time, though a
persistent copy of it is kept on disk.
 DataNodes:
o Run on slave nodes.
o Require a large amount of storage, as the data is actually stored here.
Data storage in HDFS: Now let’s see how the data is stored in a distributed manner.

Let’s assume that a 100 TB file is inserted. The master node (NameNode) will first divide
the file into blocks, say of 10 TB each for illustration (the actual default block size is
128 MB in Hadoop 2.x and above). Then these blocks are stored across different
DataNodes (slave nodes).
The DataNodes (slave nodes) replicate the blocks among themselves, and the information
about which blocks they contain is sent to the master. The default replication factor is 3,
meaning for each block 3 replicas are created (including the original). In hdfs-site.xml we
can increase or decrease the replication factor, i.e. we can edit this configuration there.
Note: The master node has a record of everything; it knows the location and information
of each and every DataNode and the blocks they contain, i.e. nothing is done without the
permission of the master node.
Why divide the file into blocks?
Answer: Let’s assume that we don’t divide; now it’s very difficult to store a 100 TB file
on a single machine. Even if we do store it, each read and write operation on that whole
file is going to take a very high seek time. But if we have multiple blocks of size 128 MB,
then it becomes easy to perform various read and write operations on them compared to
doing them on the whole file at once. So we divide the file to have faster data access, i.e.
to reduce seek time.
Why replicate the blocks in data nodes while storing?
Answer: Let’s assume we don’t replicate and only one copy of a given block is present on
DataNode D1. If DataNode D1 crashes, we will lose the block, which will make the
overall data inconsistent and faulty. So we replicate the blocks to achieve fault tolerance.
Terms related to HDFS:
 HeartBeat: It is the signal that a DataNode continuously sends to the NameNode. If the
NameNode doesn’t receive a heartbeat from a DataNode, it will consider it dead.
 Balancing: If a DataNode crashes, the blocks present on it will be gone too and those
blocks will be under-replicated compared to the remaining blocks. The master node
(NameNode) will then signal the DataNodes containing replicas of those lost blocks to
replicate them, so that the overall distribution of blocks is balanced.
 Replication: It is done by the DataNodes.
Note: No two replicas of the same block are stored on the same DataNode.
Features:
 Distributed data storage.
 Blocks reduce seek time.
 The data is highly available as the same block is present at multiple datanodes.
 Even if multiple datanodes are down we can still do our work, thus making it highly
reliable.
 High fault tolerance.
Limitations: Though HDFS provides many features, there are some areas where it
doesn’t work well.
 Low-latency data access: Applications that require low-latency access to data, i.e. in
the range of milliseconds, will not work well with HDFS, because HDFS is designed
keeping in mind that we need high throughput of data even at the cost of latency.
 Small file problem: Having lots of small files results in lots of seeks and lots of
movement from one DataNode to another to retrieve each small file; this whole
process is a very inefficient data access pattern.

MapReduce Architecture


MapReduce and HDFS are the two major components of Hadoop which make it so
powerful and efficient to use. MapReduce is a programming model used for efficient
parallel processing over large data sets in a distributed manner. The data is first split and
then combined to produce the final result. The libraries for MapReduce are written in
many programming languages with various different optimizations. The purpose of
MapReduce in Hadoop is to map each of the jobs and then reduce them to equivalent
tasks, providing less overhead over the cluster network and reducing the processing
power needed. The MapReduce task is mainly divided into two phases: the Map phase
and the Reduce phase.
MapReduce Architecture:

Components of MapReduce Architecture:

1. Client: The MapReduce client is the one who brings the Job to the MapReduce for
processing. There can be multiple clients available that continuously send jobs for
processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce job is the actual work that the client wants to do, which is
comprised of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs obtained after dividing the main job. The results of
all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after the processing.
In MapReduce, we have a client. The client submits a job of a particular size to the
Hadoop MapReduce Master. The MapReduce master then divides this job into further
equivalent job-parts. These job-parts are then made available to the Map and Reduce
tasks. The Map and Reduce tasks contain the program as per the requirement of the use
case that the particular company is solving. The developer writes their logic to fulfill the
requirement that the industry requires. The input data is then fed to the Map task, and the
Map generates intermediate key-value pairs as its output. The output of Map, i.e. these
key-value pairs, is then fed to the Reducer, and the final output is stored on HDFS. There
can be any number of Map and Reduce tasks made available for processing the data, as
per the requirement. The algorithms for Map and Reduce are made in a very optimized
way so that the time complexity and space complexity are minimal.
Let’s discuss the MapReduce phases to get a better understanding of its architecture:
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs.
The input to the map may be a key-value pair, where the key can be the id of some
kind of address and the value is the actual value that it keeps. The Map() function is
executed in its memory repository on each of these input key-value pairs and
generates the intermediate key-value pairs, which work as input for the Reducer or
Reduce() function.

2. Reduce: The intermediate key-value pairs that work as input for the Reducer are
shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups
the data based on its key-value pairs, as per the reducer algorithm written by the
developer. (A minimal job driver that wires the Map and Reduce phases together
follows below.)
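
To show how the two phases are wired together in code, here is a minimal job driver
sketch that reuses the TokenMapper and SumReducer classes from the WordCountSketch
shown earlier in these notes; the job name and the HDFS input/output paths are
hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count sketch");
            job.setJarByClass(WordCountDriver.class);

            // Map phase: produce intermediate (word, 1) pairs.
            job.setMapperClass(WordCountSketch.TokenMapper.class);
            // Optional combiner: a "local reducer" that pre-aggregates map output.
            job.setCombinerClass(WordCountSketch.SumReducer.class);
            // Reduce phase: after shuffle and sort by key, sum the counts.
            job.setReducerClass(WordCountSketch.SumReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Hypothetical HDFS input and output paths.
            FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }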
How Job tracker and the task tracker deal with MapReduce:
1. Job Tracker: The work of the Job Tracker is to manage all the resources and all the
jobs across the cluster, and also to schedule each map task on a Task Tracker running
on the same data node, since there can be hundreds of data nodes available in the
cluster.

2. Task Tracker: The Task Trackers can be considered the actual slaves that work on the
instructions given by the Job Tracker. A Task Tracker is deployed on each of the
nodes available in the cluster and executes the Map and Reduce tasks as instructed by
the Job Tracker.
There is also one important component of the MapReduce architecture known as the Job
History Server. The Job History Server is a daemon process that saves and stores
historical information about the task or application; the logs generated during or after job
execution are stored on the Job History Server.
