
UNIT – V

HDFS and its Architecture


5.1 Introduction to Hadoop Distributed File System
and Google File System.
5.2 Architecture of HDFS
5.3 Comparison of the Distributed File System with Traditional Technology
5.4 What is Big Data?
5.5 Human Generated Data and Machine Generated Data
5.6 Where does Big Data come from
5.7 Examples of Big Data in the Real world
5.8 Challenges of Big Data
Hadoop Distributed File System
 Hadoop Distributed File System (HDFS) is a Java-based distributed file system that allows you
to store large data sets across multiple nodes in a Hadoop cluster. So, if you install
Hadoop, you get HDFS as the underlying storage system for storing data in the
distributed environment.
 HDFS was developed using distributed file system design. It runs on
commodity hardware. Unlike some other distributed systems, HDFS is highly fault-tolerant
and designed to run on low-cost hardware.
 HDFS holds very large amounts of data and provides easy access. To store such huge
data, files are stored across multiple machines. These files are stored in a redundant
fashion to protect the system from possible data loss in case of failure. HDFS also
makes applications available for parallel processing.
 HDFS runs on top of the existing file systems on each node in a Hadoop cluster. It is
a block-structured file system where each file is divided into blocks of a pre-determined
size. These blocks are stored across a cluster of one or several machines.
Features of HDFS
 It is suitable for distributed storage and processing.
 Hadoop provides a command interface to interact with HDFS; a sketch of the
underlying Java API follows this list.
 The built-in servers of the NameNode and DataNodes help users easily
check the status of the cluster.
 Streaming access to file system data.
 HDFS provides file permissions and authentication.
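To make that interface concrete, here is a minimal, hypothetical sketch of a client interacting with HDFS through Hadoop's Java FileSystem API. It assumes a running cluster and the hadoop-client library on the classpath; the NameNode URI and file paths are placeholders, not values from this chapter.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class HdfsBasics {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder URI; substitute your cluster's NameNode address.
        FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), conf);

        // Copy a local file into HDFS (paths are illustrative).
        fs.copyFromLocalFile(new Path("/tmp/sample.txt"), new Path("/user/demo/sample.txt"));

        // List a directory, much like `hdfs dfs -ls /user/demo` on the command line.
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.printf("%s\t%d bytes\treplication=%d%n",
                    status.getPath(), status.getLen(), status.getReplication());
        }
        fs.close();
    }
}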
Google File System
 Google Inc. developed the Google File System (GFS), a scalable
distributed file system (DFS), to meet the company’s growing data
processing needs.
 GFS offers fault tolerance, dependability, scalability, availability, and
performance to big networks and connected nodes.
 GFS was designed to tolerate frequent hardware failures while taking
advantage of inexpensive, commercially available servers.
 GoogleFS is another name for GFS. It manages two types of data,
namely file metadata and file data.
 The GFS node cluster consists of a single master and several chunk
servers that various client systems regularly access.
Use the link below for more details:
https://www.geeksforgeeks.org/google-file-system/
Architecture of HDFS
 Hadoop comes with a distributed file system called HDFS (Hadoop Distributed File System).
Hadoop-based applications make use of HDFS.
 HDFS is designed for storing very large data files, running on clusters of commodity hardware.
 It is fault-tolerant, scalable, and extremely simple to expand.
 Hadoop HDFS has a Master/Slave architecture in which the Master is the NameNode and the Slaves are DataNodes.
 The HDFS architecture consists of a single NameNode; all the other nodes are DataNodes.
 HDFS NameNode
It is also known as the Master node.
The HDFS NameNode stores metadata, i.e. the number of data blocks, replicas, and other details.
This metadata is kept in memory on the master for faster retrieval of data.
The NameNode maintains and manages the slave nodes and assigns tasks to them.
It should be deployed on reliable hardware, as it is the centerpiece of HDFS.
 Tasks of the NameNode
 Manages the file system namespace.
 Regulates clients' access to files.
 Executes file system operations such as naming, closing, and opening files/directories.
 All DataNodes send a heartbeat and a block report to the NameNode in the Hadoop cluster.
 The heartbeats confirm that the DataNodes are alive; a block report lists all blocks on a DataNode.
 The NameNode is also responsible for maintaining the replication factor of all the blocks
(see the sketch below).
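Because the NameNode owns the replication factor, a client can ask it to change the factor for an individual file. Below is a small, hypothetical sketch using the same Java FileSystem API; the path is a placeholder and a configured, running cluster is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/sample.txt"); // illustrative path

        // Ask the NameNode to set this file's replication factor to 2.
        boolean accepted = fs.setReplication(file, (short) 2);
        System.out.println("Replication change accepted: " + accepted);

        // Read the replication factor back from the file's metadata.
        short replication = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication factor: " + replication);
    }
}

Note that the call only records the new factor in the NameNode's metadata; the NameNode then instructs DataNodes in the background to create or delete replicas until the target is met.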
 The files holding the NameNode metadata are as follows:
1. FsImage – It is an "image file". The FsImage contains the entire filesystem namespace and is stored as a file in the
NameNode's local file system. It also contains a serialized form of all the directories and file inodes in the filesystem.
Each inode is an internal representation of a file's or directory's metadata.
2. EditLogs – It contains all the modifications made to the file system since the most recent FsImage. When the NameNode
receives a create/update/delete request from a client, the request is first recorded in the edits file.
 HDFS DataNode
 It is also known as a Slave.
 In the Hadoop HDFS architecture, DataNodes store the actual data in HDFS.
 They perform read and write operations as per the requests of clients.
 DataNodes can be deployed on commodity hardware.

 Tasks of the DataNode
 Block replica creation, deletion, and replication according to the instructions of the NameNode.
 The DataNode manages the data storage of the system.
 DataNodes send heartbeats to the NameNode to report the health of HDFS. By default, this frequency
is set to 3 seconds (see the configuration sketch below).
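The heartbeat interval, default replication factor, and block size all live in the cluster configuration. The sketch below reads them with Hadoop's Configuration class; the property names (dfs.heartbeat.interval, dfs.replication, dfs.blocksize) are standard Hadoop keys, the fallback values passed as second arguments are the stock defaults, and hdfs-site.xml is assumed to be on the classpath.

import org.apache.hadoop.conf.Configuration;

public class HdfsDefaults {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // hdfs-site.xml is added explicitly so HDFS settings are visible too.
        conf.addResource("hdfs-site.xml");

        long heartbeatSecs = conf.getLong("dfs.heartbeat.interval", 3);
        int replication = conf.getInt("dfs.replication", 3);
        long blockSize = conf.getLong("dfs.blocksize", 128L * 1024 * 1024);

        System.out.println("DataNode heartbeat interval (s): " + heartbeatSecs);
        System.out.println("Default replication factor: " + replication);
        System.out.println("Default block size (bytes): " + blockSize);
    }
}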
 Secondary NameNode:
The Secondary NameNode downloads the FsImage and EditLogs from the NameNode and merges the
EditLogs into the FsImage (FileSystem Image). This keeps the edit log size within a limit. It stores the merged
FsImage in persistent storage, which can be used in the case of a NameNode failure. The Secondary NameNode
thus performs a regular checkpoint in HDFS.
 Checkpoint Node
The Checkpoint node is a node which periodically creates checkpoints of the namespace. The Checkpoint
node first downloads the FsImage and edits from the active NameNode, then merges them
locally, and at last uploads the new image back to the active NameNode.
 Backup Node
In Hadoop, the Backup node keeps an in-memory, up-to-date copy of the file system namespace. The Backup
node's checkpoint process is more efficient, as it only needs to save the namespace into the local FsImage file
and reset the edits. The NameNode supports one Backup node at a time.
 Blocks
HDFS in Apache Hadoop splits huge files into chunks known as blocks. These are the smallest unit of
data in the filesystem. We (clients and admins) do not have any control over block details such as block location;
the NameNode decides all such things. A client can, however, ask where the blocks of a file ended up,
as the sketch below shows.
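Here is a hypothetical sketch of that query using the Java FileSystem API; the file path is a placeholder and a running cluster with hadoop-client on the classpath is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/bigfile.dat"); // illustrative path

        // Ask the NameNode for the block layout of the whole file.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (int i = 0; i < blocks.length; i++) {
            System.out.printf("Block %d: offset=%d length=%d hosts=%s%n",
                    i, blocks[i].getOffset(), blocks[i].getLength(),
                    String.join(",", blocks[i].getHosts()));
        }
    }
}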
Comparison of the Distributed File System
with Traditional Technology
There are many reasons that we might want a DFS. Some of the more
common ones are listed below:
 More storage than can fit on a single system
 More fault tolerance than can be achieved if "all of the eggs are in one
basket."
 The user is "distributed" and needs to access the file system from many
places
What is Data
 Data: Data is the collection of raw facts and figures.
 Information: Processed data is called information. Information is
data that has been processed in such a way as to be meaningful to
the person who receives it.
 Ex: The data collected in a survey report is 'HYD20M'.
If we process the above data, we understand that the code is information about a
person as follows: HYD is the city name 'Hyderabad', 20 is the age, and M
represents 'MALE'. A tiny sketch of this processing step follows below.
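As a toy illustration of turning that raw datum into information, here is a short, hypothetical Java sketch; the decoding rules are exactly the ones stated above.

public class SurveyCode {
    public static void main(String[] args) {
        String data = "HYD20M"; // raw fact from the survey report

        // Processing: decode the positional fields into meaningful values.
        String city = data.startsWith("HYD") ? "Hyderabad" : "Unknown";
        int age = Integer.parseInt(data.substring(3, 5));
        String gender = data.endsWith("M") ? "MALE" : "FEMALE";

        System.out.println("City: " + city + ", Age: " + age + ", Gender: " + gender);
    }
}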
 Units of data: bit, byte, kilobyte (KB), megabyte (MB), gigabyte (GB),
terabyte (TB), petabyte (PB), exabyte (EB), zettabyte (ZB), yottabyte (YB).
What is Big Data
 Big Data is a term used for collections of data sets that are so large and
complex that they are difficult to store and process using available
database management tools or traditional data processing
applications.
 According to Gartner, the definition of Big Data is:
"Big data is high-volume, high-velocity, and high-variety information
assets that demand cost-effective, innovative forms of
information processing for enhanced insight and decision
making."
Characteristics of Big Data
 Back in 2001, Gartner analyst Doug Laney listed the 3 ‘V’s of Big Data – Variety,
Velocity, and Volume.
 Let's discuss the characteristics of big data. Taken together, these characteristics
are enough to tell what makes data "big". Let's look at them in depth:
a) Variety: Variety refers to the structured, unstructured, and semi-structured
data that is gathered from multiple sources. While in the past data could
only be collected from spreadsheets and databases, today data comes in an array of
forms such as emails, PDFs, photos, videos, audio, social media posts, and much more.
Variety is one of the important characteristics of big data.
b) Velocity: Velocity essentially refers to the speed at which data is being
created in real time. In a broader perspective, it comprises the rate of change,
the linking of incoming data sets arriving at varying speeds, and activity bursts.
c) Volume: Volume is one of the defining characteristics of big data. We already know
that big data implies huge volumes of data being generated on a
daily basis from various sources such as social media platforms, business
processes, machines, networks, human interactions, etc. Such large
amounts of data are stored in data warehouses.
Human Generated Data and Machine
Generated Data
 Human-generated data includes emails, documents, photos, and tweets. We
are generating this data faster than ever. Just imagine the number of
videos uploaded to YouTube and the tweets swirling around. This data can
be big data too.
 Machine-generated data is a newer breed of data. This category consists
of sensor data and logs generated by 'machines', such as email logs,
clickstream logs, etc. Machine-generated data is orders of magnitude
larger than human-generated data.
Where does Big Data come from
The major sources of Big Data:
social media sites,
sensor networks,
digital images/videos,
cell phones,
purchase transaction records,
web logs,
medical records,
archives,
military surveillance,
e-commerce,
complex scientific research, and so on.
Examples of Big Data in the Real world
1. The New York Stock Exchange generates about one terabyte of new trade data per day.
2. Social media impact: statistics show that 500+ terabytes of new data are ingested into
the databases of the social media site Facebook every day. This data is mainly generated from
photo and video uploads, message exchanges, comments, etc.
3. Walmart handles more than 1 million customer transactions every hour.
4. Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data.
5. 230+ million tweets are created every day.
6. More than 5 billion people are calling, texting, tweeting, and browsing on mobile phones
worldwide.
7. YouTube users upload 48 hours of new video every minute of the day.
8. Amazon processes 15 million customer clickstream records per day to recommend
products.
9. 294 billion emails are sent every day. Email services analyze this data to detect spam.
10. A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With
many thousands of flights per day, data generation reaches many petabytes; a rough
check of this arithmetic follows below.
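As a rough check of item 10 (the flight counts here are illustrative assumptions, not sourced figures): at 10 TB per 30 minutes, one engine on a one-hour flight produces about 20 TB, so 5,000 such flights with a single instrumented engine each would yield 5,000 × 20 TB = 100,000 TB, i.e. roughly 100 petabytes in a single day.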
Challenges of Big Data
1. Need For Synchronization Across Disparate Data Sources
As data sets become bigger and more diverse, it is a big challenge to
incorporate them into a single analytical platform. If this is overlooked, it
creates gaps and leads to wrong conclusions and insights.
2. Acute Shortage Of Professionals Who Understand Big Data
Analysis
Analysis is what makes the voluminous amount of data produced every
minute useful. With the exponential rise of data, a
huge demand for big data scientists and Big Data analysts has been created in
the market. It is important for business organizations to hire data scientists
with varied skills, as the job of a data scientist is multidisciplinary.
Yet there is a sharp shortage of data scientists
in comparison to the massive amount of data being produced.
3. Getting Meaningful Insights Through The Use Of Big Data
Analytics
It is imperative for business organizations to gain important insights from Big Data
analytics, and it is also important that only the relevant departments have access to this
information. A big challenge faced by companies in Big Data analytics is
bridging this wide gap in an effective manner.
4. Getting Voluminous Data Into The Big Data Platform
It is hardly surprising that data is growing with every passing day. This simply
means that business organizations need to handle large amounts of data on a daily
basis. The amount and variety of data available these days can overwhelm any data
engineer, which is why it is considered vital to make data accessibility easy and
convenient for brand owners and managers.
5. Uncertainty Of Data Management Landscape
With the rise of Big Data, new technologies and companies are being developed
every day. However, a big challenge faced by companies in Big Data analytics
is finding out which technology will be best suited to them without introducing
new problems and potential risks.
6. Data Storage And Quality
Business organizations are growing at a rapid pace. As companies and large business
organizations grow, the amount of data they produce increases.
Storing this massive amount of data is becoming a real challenge for everyone.
Popular data storage options like data lakes and warehouses are commonly used to gather
and store large quantities of unstructured and structured data in its native format. The
real problem arises when a data lake or warehouse tries to combine unstructured and
inconsistent data from diverse sources: it encounters errors. Missing data, inconsistent
data, logic conflicts, and duplicate data all result in data quality challenges.
7. Security And Privacy Of Data
Once business enterprises discover how to use Big Data, it brings them a wide range
of possibilities and opportunities. However, it also involves potential risks
when it comes to the privacy and the security of the data. The
Big Data tools used for analysis and storage utilize data from disparate sources. This
eventually leads to a high risk of exposure of the data, making it vulnerable. Thus, the
rise in the volume of data increases privacy and security concerns.
Use the link below for more on this chapter:

https://hadoopilluminated.com/hadoop_illuminated/Big_Data.html#d1575e261
