CH-05_cc
Tasks of the DataNode
The DataNode creates, deletes, and replicates block replicas according to the instructions of the NameNode, and manages the data storage of the system.
DataNodes send heartbeats to the NameNode to report the health of HDFS. By default, the heartbeat interval is 3 seconds.
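The heartbeat interval also determines how long the NameNode waits before declaring a DataNode dead. A minimal sketch of the standard timeout formula follows, assuming the stock defaults (both values are configurable via dfs.heartbeat.interval and dfs.namenode.heartbeat.recheck-interval):

```python
# Sketch: how long until the NameNode marks a silent DataNode as dead.
# Defaults assumed: heartbeat every 3 s, recheck interval 5 minutes.
HEARTBEAT_INTERVAL_S = 3              # dfs.heartbeat.interval
RECHECK_INTERVAL_MS = 5 * 60 * 1000   # dfs.namenode.heartbeat.recheck-interval

# Standard HDFS formula: 2 * recheck interval + 10 * heartbeat interval
dead_timeout_s = 2 * (RECHECK_INTERVAL_MS / 1000) + 10 * HEARTBEAT_INTERVAL_S
print(dead_timeout_s)  # 630.0 seconds, i.e. 10.5 minutes
```

So with the defaults, a DataNode that misses heartbeats for about 10.5 minutes is considered dead, and its blocks are re-replicated elsewhere.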
Architecture of HDFS
Secondary NameNode:
The Secondary NameNode downloads the FsImage and EditLogs from the NameNode, merges the EditLogs into the FsImage (FileSystem Image), and stores the modified FsImage in persistent storage. This keeps the edit log size within a limit, and the merged image can be used in the case of a NameNode failure. The Secondary NameNode thus performs regular checkpoints in HDFS.
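A checkpoint is triggered either after a fixed period or once the edit log accumulates enough transactions. The sketch below shows that trigger logic, assuming the common defaults (configurable via dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns):

```python
# Sketch of the checkpoint trigger condition used by the Secondary NameNode.
# Defaults assumed: checkpoint every hour, or every 1,000,000 transactions.
CHECKPOINT_PERIOD_S = 3600      # dfs.namenode.checkpoint.period
CHECKPOINT_TXNS = 1_000_000     # dfs.namenode.checkpoint.txns

def should_checkpoint(seconds_since_last: int, uncheckpointed_txns: int) -> bool:
    """Return True when either the time limit or the edit-log limit is reached."""
    return (seconds_since_last >= CHECKPOINT_PERIOD_S
            or uncheckpointed_txns >= CHECKPOINT_TXNS)

print(should_checkpoint(120, 50_000))     # False: neither limit reached yet
print(should_checkpoint(120, 1_200_000))  # True: edit log grew past the limit
```

Either condition alone is enough to start a checkpoint, which is how the edit log is kept from growing without bound.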
Checkpoint Node
The Checkpoint Node periodically creates checkpoints of the namespace. It first downloads the FsImage and edits from the active NameNode, merges them locally, and finally uploads the new image back to the active NameNode.
Backup Node
In Hadoop, the Backup Node keeps an in-memory, up-to-date copy of the file system namespace. Its checkpoint process is more efficient because it only needs to save the namespace to the local FsImage file and reset the edits. The NameNode supports one Backup Node at a time.
Blocks
HDFS in Apache Hadoop splits huge files into small chunks known as blocks. A block is the smallest unit of data in the filesystem. Clients and administrators have no control over block details such as block location; the NameNode decides all such things.
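The way a file's size maps onto blocks can be sketched as follows, assuming the common default block size of 128 MB (dfs.blocksize; older Hadoop versions defaulted to 64 MB):

```python
# Sketch: how a file of a given size is split into HDFS blocks.
# Assumes the default 128 MB block size (dfs.blocksize).
BLOCK_SIZE = 128 * 1024 * 1024  # bytes

def block_layout(file_size: int) -> list[int]:
    """Return the sizes, in bytes, of the blocks a file would occupy."""
    full, remainder = divmod(file_size, BLOCK_SIZE)
    sizes = [BLOCK_SIZE] * full
    if remainder:
        sizes.append(remainder)  # the last block stores only the remainder
    return sizes

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
print([s // (1024 * 1024) for s in block_layout(300 * 1024 * 1024)])  # [128, 128, 44]
```

Note that the final block occupies only as much space as the remaining data, not a full block.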
Comparison with Traditional Technology
Why a distributed file system? There are many reasons we might want a DFS. Some of the more common ones are listed below:
More storage than can fit on a single system
More fault tolerance than can be achieved if "all of the eggs are in one basket"
Users are "distributed" and need to access the file system from many places
What is Data
Data: Data is a collection of raw facts and figures.
Information: Processed data is called information. Information is data that has been processed in such a way as to be meaningful to the person who receives it.
Example: the data collected in a survey report is 'HYD20M'.
If we process the above data, we can interpret the code as information about a person: HYD is the city name 'Hyderabad', 20 is the age, and M represents 'MALE'.
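The processing step that turns this raw datum into information can be sketched in a few lines. The fixed layout (3-letter city code, 2-digit age, 1-letter gender) is taken from the example above; the lookup table is hypothetical:

```python
# Sketch: turning the raw survey datum 'HYD20M' into information.
# Layout assumed from the text: 3-letter city code, 2-digit age, 1-letter gender.
CITY_CODES = {"HYD": "Hyderabad"}  # hypothetical city-code lookup table

def parse_record(code: str) -> dict:
    """Split a fixed-width survey code into meaningful fields."""
    city, age, gender = code[:3], code[3:5], code[5]
    return {
        "city": CITY_CODES.get(city, city),
        "age": int(age),
        "gender": "MALE" if gender == "M" else "FEMALE",
    }

print(parse_record("HYD20M"))  # {'city': 'Hyderabad', 'age': 20, 'gender': 'MALE'}
```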
Units of data: bit, byte, kilobyte (KB), megabyte (MB), gigabyte (GB), terabyte (TB), petabyte (PB), exabyte (EB), zettabyte (ZB), yottabyte (YB).
What is Big Data
Big Data is a term used for a collection of data sets that are large and
complex, which is difficult to store and process using available
database management tools or traditional data processing
applications.
According to Gartner, the definition of Big Data is:
“Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
Characteristics of Big Data
Back in 2001, Gartner analyst Doug Laney listed the 3 ‘V’s of Big Data – Variety,
Velocity, and Volume.
Let’s discuss the characteristics of big data; together they capture what big data is. Let’s look at them in depth:
a) Variety: Variety refers to the structured, unstructured, and semi-structured data that is gathered from multiple sources. While in the past data could only be collected from spreadsheets and databases, today data comes in an array of forms such as emails, PDFs, photos, videos, audio, social media posts, and much more. Variety is one of the most important characteristics of big data.
b) Velocity: Velocity refers to the speed at which data is being created in real time. In a broader sense, it comprises the rate of change, the linking of incoming data sets arriving at varying speeds, and bursts of activity.
c) Volume: Volume is one of the defining characteristics of big data. Big Data implies huge volumes of data generated on a daily basis from sources such as social media platforms, business processes, machines, networks, human interactions, etc. Such large amounts of data are stored in data warehouses. This concludes the characteristics of big data.
Human Generated Data and Machine
Generated Data
Human-generated data includes emails, documents, photos, and tweets. We are generating this data faster than ever; just imagine the number of videos uploaded to YouTube and the tweets swirling around. This data can be Big Data too.
Machine-generated data is a new breed of data. This category consists of sensor data and logs generated by machines, such as email logs, click-stream logs, etc. Machine-generated data is orders of magnitude larger than human-generated data.
Where does Big Data come from
The major sources of Big Data:
social media sites,
sensor networks,
digital images/videos,
cell phones,
purchase transaction records,
web logs,
medical records,
archives,
military surveillance,
e-commerce,
complex scientific research, and so on.
Examples of Big Data in the Real world
1.The New York Stock Exchange generates about one terabyte of new trade data per day.
2. Social Media Impact: Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, posting comments, etc.
3. Walmart handles more than 1 million customer transactions every hour.
4. Facebook stores, accesses, and analyzes 30+ Petabytes of user generated data.
5. 230+ million tweets are created every day.
6. More than 5 billion people are calling, texting, tweeting and browsing on mobile phones
worldwide.
7.YouTube users upload 48 hours of new video every minute of the day.
8. Amazon handles click-stream data from 15 million customers per day to recommend products.
9. 294 billion emails are sent every day. Email services analyze this data to filter out spam.
10. A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Challenges of Big Data
1. Need For Synchronization Across Disparate Data Sources
As data sets become bigger and more diverse, incorporating them into a single analytical platform is a big challenge. If this is overlooked, it will create gaps and lead to wrong messages and insights.
2. Acute Shortage Of Professionals Who Understand Big Data Analysis
Analysis is what makes the voluminous amount of data produced every minute useful. With the exponential rise of data, a huge demand for big data scientists and Big Data analysts has been created in the market. It is important for business organizations to hire data scientists with varied skills, as the job of a data scientist is multidisciplinary. A major challenge faced by businesses is the shortage of professionals who understand Big Data analysis: there is a sharp shortage of data scientists in comparison to the massive amount of data being produced.
4. Getting Meaningful Insights Through The Use Of Big Data
Analytics
It is imperative for business organizations to gain important insights from Big Data analytics, and it is also important that only the relevant department has access to this information. A big challenge companies face in Big Data analytics is bridging this gap in an effective manner.
5. Getting Voluminous Data Into The Big Data Platform
It is hardly surprising that data grows with every passing day. This simply means that business organizations need to handle a large amount of data on a daily basis. The amount and variety of data available these days can overwhelm any data engineer, which is why it is vital to make data accessibility easy and convenient for brand owners and managers.
6. Uncertainty Of Data Management Landscape
With the rise of Big Data, new technologies and companies are being developed every day. However, a big challenge faced by companies in Big Data analytics is finding out which technology will suit them best without introducing new problems and potential risks.
7. Data Storage And Quality
Business organizations are growing at a rapid pace, and as companies and large organizations grow, the amount of data they produce increases. Storing this massive amount of data is becoming a real challenge for everyone. Popular data storage options like data lakes and warehouses are commonly used to gather and store large quantities of unstructured and structured data in its native format. The real problem arises when a data lake or warehouse tries to combine unstructured and inconsistent data from diverse sources: it encounters errors. Missing data, inconsistent data, logic conflicts, and duplicate data all result in data quality challenges.
8. Security And Privacy Of Data
Once business enterprises discover how to use Big Data, it brings them a wide range of possibilities and opportunities. However, Big Data also involves potential risks when it comes to the privacy and security of the data. The Big Data tools used for analysis and storage draw on data from disparate sources, which eventually leads to a high risk of data exposure, making the data vulnerable. Thus, the rise of voluminous amounts of data increases privacy and security concerns.
For Chapter V, use the link below:
https://hadoopilluminated.com/hadoop_illuminated/Big_Data.html