Apache Hadoop
Historical Review of Big Data
Apache Hadoop
• Apache Hadoop is an open-source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware.
• Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
• Doug Cutting, who was working at Yahoo! at the time, named the project after his son's toy elephant.
• Cutting's son was 2 years old at the time and just beginning to talk. He called his beloved stuffed yellow elephant "Hadoop".
The Apache Hadoop framework modules
• Hadoop Common: contains libraries and utilities needed by other
Hadoop modules
• Hadoop Distributed File System (HDFS): a distributed file-system that
stores data on the commodity machines, providing very high
aggregate bandwidth across the cluster
• Hadoop YARN: a resource-management platform responsible for
managing compute resources in clusters and using them for
scheduling users' applications
• Hadoop MapReduce: a programming model for large-scale data
processing
Hadoop use by major companies
• In 2010, Facebook claimed to have one of the largest HDFS clusters, storing 21 petabytes of data.
• In 2012, Facebook declared that it had the largest single HDFS cluster, with more than 100 PB of data.
• Yahoo! has more than 100,000 CPUs in over 40,000 servers running Hadoop, with its biggest Hadoop cluster running 4,500 nodes. All told, Yahoo! stores 455 petabytes of data in HDFS.
• In fact, by 2013, most of the big names in the Fortune 50 started
using Hadoop.
Facebook and Yahoo! have since shifted from heavy reliance on Hadoop toward more cloud-native solutions. While Yahoo! still uses Hadoop extensively, both companies are adopting modern tools like Apache Spark and cloud storage systems to handle big data more efficiently. The overall industry trend is moving away from traditional Hadoop clusters in favor of more scalable and cost-effective cloud-based alternatives.
• Spotify (2023): Spotify continues to use Hadoop to process large datasets, particularly for its
recommendation systems. The scalability of Hadoop allows Spotify to manage and analyze vast amounts
of user data effectively.
• Airbnb (2022): Airbnb leverages Hadoop for its data infrastructure, focusing on storage and batch
processing. This enables them to perform extensive data analytics, helping in service optimization and
decision-making processes.
• LinkedIn (2021): LinkedIn remains invested in Hadoop for its data lake architecture. Hadoop's
capabilities in handling large-scale data processing are essential for LinkedIn's algorithms, especially in
delivering personalized content and recommendations.
• Twitter (2020): Twitter continues to rely on Hadoop for managing and processing the massive volumes
of data generated by its platform. Hadoop is integral to their analytics and machine learning
frameworks, helping to enhance user experience.
Let’s start
• Hadoop has two fundamental units: storage and processing. The storage part of Hadoop is HDFS, the Hadoop Distributed File System. So, let's start with HDFS.
• What is HDFS?
• Advantages of HDFS
• Features of HDFS
What is a Distributed File System?
• A distributed file system manages data, i.e. files and folders, across multiple computers or servers.
• A DFS is a file system that allows us
• to store data over multiple nodes or machines in a cluster and allows multiple users to access that data.
• So basically, it serves the same purpose as the file system available on your own machine: NTFS (New Technology File System) on Windows, or HFS (Hierarchical File System) on a Mac.
• The only difference is that, in the case of a distributed file system,
• you store data on multiple machines rather than a single machine.
• Even though the files are stored across the network,
• the DFS organizes and presents the data in such a way that a user sitting at one machine feels as if all the data were stored on that very machine.
HDFS versus NFS

Network File System (NFS)
• A single machine makes part of its file system available to other machines
• Sequential or random access
• PRO: simplicity, generality, transparency
• CON: storage capacity and throughput limited by a single server

Hadoop Distributed File System (HDFS)
• A single virtual file system spread over many machines
• Optimized for sequential reads and local accesses
• PRO: high throughput, high capacity
• CON: specialized for particular types of applications
Next topic :
Comparing Hadoop and RDBMS
1. RDBMS: traditional row-column based databases, basically used for data storage, manipulation and retrieval. Hadoop: an open-source software framework used for storing data and running applications or processes concurrently.
2. RDBMS: mostly processes structured data. Hadoop: processes both structured and unstructured data.
3. RDBMS: best suited for an OLTP environment. Hadoop: best suited for big data.
4. RDBMS: less scalable than Hadoop. Hadoop: highly scalable.
5. RDBMS: data normalization is required. Hadoop: data normalization is not required.
6. RDBMS: stores transformed and aggregated data. Hadoop: stores huge volumes of data.
7. RDBMS: responses have low latency. Hadoop: responses have some latency.
8. RDBMS: the data schema is static. Hadoop: the data schema is dynamic.
9. RDBMS: high data integrity. Hadoop: lower data integrity than RDBMS.
10. RDBMS: licensing costs apply. Hadoop: free of cost, as it is open-source software.
In transaction processing, ACID (atomicity, consistency, isolation, and durability) is an acronym and mnemonic device used to refer to the four essential properties a transaction should possess to ensure the integrity and reliability of the data.
Hadoop Distributed File System
• HDFS is a filesystem designed for storing very large files with
streaming data access patterns, running on clusters of commodity
hardware
• HDFS is not a good fit for the following:
• Low-latency data access
• Lots of small files (see the rough estimate after this list)
• Multiple writers and arbitrary file modifications
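To see why lots of small files are a problem, here is a rough back-of-the-envelope estimate. It assumes the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per file-system object (file, directory, or block); the exact figure varies by version and is only an approximation, since the NameNode keeps all metadata in memory.

// Rough estimate of NameNode memory consumed by many small files.
// Assumption (rule of thumb, not an exact figure): ~150 bytes of
// NameNode heap per file-system object (file, directory, or block).
public class SmallFilesEstimate {
    public static void main(String[] args) {
        long bytesPerObject = 150;        // approximate rule of thumb
        long files = 10_000_000L;         // ten million small files
        long objects = files * 2;         // roughly one file entry + one block each
        long heapMb = objects * bytesPerObject / (1024 * 1024);
        System.out.println("~" + heapMb + " MB of NameNode heap for metadata alone");
        // prints roughly 2861 MB, i.e. close to 3 GB, no matter how small the files are
    }
}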
HDFS concepts we are going to study
• Block size
• HDFS block size
• Difference between HDFS and other file system
• Why is a block size in HDFS so large?
Main HDFS services
1. Name Node
2. Secondary Name node
3. Job Tracker
4. Data Node
5. Task Tracker
[Figure: a client asks the NameNode to store File.txt (440 MB). With a 128 MB block size, the file is divided into four input splits: a.txt (128 MB), b.txt (128 MB), c.txt (128 MB) and d.txt (56 MB). The NameNode acknowledges the request and assigns the splits to DataNodes out of the eight in the cluster: a.txt to DataNode 1, b.txt to DataNode 3, c.txt to DataNode 5 and d.txt to DataNode 7.]
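As a small worked example of the figure above (purely illustrative; the splitting is done by HDFS itself, not by client code like this), here is how the 440 MB file maps onto 128 MB blocks:

// Worked example: splitting a 440 MB file into 128 MB HDFS blocks.
public class BlockSplitSketch {
    public static void main(String[] args) {
        long fileSizeMb = 440;                        // size of File.txt in the figure
        long blockSizeMb = 128;                       // default HDFS block size
        long fullBlocks = fileSizeMb / blockSizeMb;   // 3 full blocks
        long lastBlockMb = fileSizeMb % blockSizeMb;  // 56 MB left over
        System.out.println(fullBlocks + " blocks of " + blockSizeMb + " MB + 1 block of " + lastBlockMb + " MB");
        // prints: 3 blocks of 128 MB + 1 block of 56 MB
    }
}

Note that, unlike many conventional file systems, the final 56 MB block occupies only 56 MB on disk, not a full 128 MB.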
HDFS Architecture (may be skipped)
NameNode : summary
• NameNode is the master node in the Apache Hadoop HDFS
Architecture that maintains and manages the blocks present on the
DataNodes (slave nodes).
• The NameNode is a highly available server that manages the File System Namespace and controls access to files by clients.
Functions of the NameNode: how it achieves them
• It is the master daemon that maintains and manages the DataNodes (slave nodes).
• It records the metadata of all the files stored in the cluster, e.g.
The location of blocks stored, the size of the files, permissions,
hierarchy, etc. There are two files associated with the metadata:
• FsImage (file system image): It contains the file system namespace's complete
state since the NameNode's start.
• EditLogs: It contains all the recent modifications made to the file system with
respect to the most recent FsImage.
• It records each change that takes place to the file system metadata. For
example, if a file is deleted in HDFS, the NameNode will immediately
record this in the EditLog.
• The EditLog records every change to the HDFS metadata (such as file creation, deletion, and block replication) in chronological order.
• Purpose: it ensures that all changes are captured and can be replayed to reconstruct the current state of the file system after a restart or crash. Example edit-log entries:
txn_id=1001, timestamp=2024-08-01 10:00:00, operation=MKDIR, path=/user/data
txn_id=1002, timestamp=2024-08-01 10:05:00, operation=CREATE, path=/user/data/file1.txt, replication=3,
blocksize=128MB
txn_id=1003, timestamp=2024-08-01 10:06:00, operation=RENAME, src=/user/data/file1.txt, dst=/user/data/file1_renamed.txt
txn_id=1004, timestamp=2024-08-01 10:07:00, operation=DELETE, path=/user/data/file1_renamed.txt
txn_id=1005, timestamp=2024-08-01 10:08:00, operation=SET_REPLICATION, path=/user/data/file2.txt, replication=2
• FSImage is a snapshot of the entire file system's metadata at a specific point in time. It represents the state of the file system when it was last saved.
• Purpose: it allows the NameNode to load the current state of the file system quickly during startup, without replaying all the transactions from the EditLog.
Functions of the NameNode: how it achieves them (continued)
• It regularly receives a Heartbeat and a block report from all the
DataNodes in the cluster to ensure that the DataNodes are live.
• It keeps a record of all the blocks in HDFS and in which nodes these
blocks are located.
• The NameNode is also responsible for maintaining the replication factor of all the blocks, which we will discuss in detail later.
• In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage and manages the communication traffic to the DataNodes.
Data Nodes
• DataNodes are the slave nodes in HDFS. Unlike the NameNode, DataNodes run on commodity hardware, that is, inexpensive systems that are not of high quality or high availability.
Functions of Data Node
• These are slave daemons or processes which run on each slave machine.
• The actual data is stored on DataNodes.
• The DataNodes perform the low-level read and write requests from
the file system’s clients.
• They send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds.
Secondary NameNode
• Apart from these two daemons, there is a third daemon or a process
called Secondary NameNode. The Secondary NameNode works
concurrently with the primary NameNode as a helper daemon.
• And don’t be confused about the Secondary NameNode being
a backup NameNode because it is not.
Functions of Secondary
NameNode
• The Secondary NameNode constantly reads the file system state and metadata from the NameNode's RAM and writes it to the hard disk or the file system.
• It is responsible for combining the EditLogs with the FsImage from the NameNode.
• It downloads the EditLogs from the NameNode at regular intervals
and applies to FsImage. The new FsImage is copied back to the
NameNode, which is used whenever the NameNode is started the
next time.
1. Checkpointing:
• The primary function of the Secondary NameNode is to periodically merge the Edit Log and the FSImage (File
System Image) stored by the NameNode. This process is known as checkpointing.
• The Edit Log is a file where the NameNode records every change made to the HDFS, while the FSImage is a
snapshot of the filesystem's metadata at a certain point in time.
• Over time, the Edit Log can grow very large, making the recovery process slow and cumbersome because the
NameNode would need to replay all the transactions from a potentially large Edit Log to rebuild the filesystem
state.
• The Secondary NameNode periodically downloads the current FSImage and Edit Log from the NameNode, merges
them into a new FSImage, and then uploads this new FSImage back to the NameNode. The Edit Log is then
truncated to start fresh.
2. Reducing NameNode Load:
• By performing the checkpointing task, the Secondary NameNode helps reduce the workload on the NameNode. The
NameNode does not have to continuously merge the Edit Log and FSImage, as this is handled by the Secondary
NameNode at intervals.
3. Assisting in Recovery:
• In the event of a NameNode failure, the Secondary NameNode’s latest checkpointed FSImage and a much smaller
Edit Log can be used to restart the NameNode more quickly. However, it is important to note that the Secondary
NameNode itself does not automatically take over in case of a NameNode failure; it simply provides the latest
metadata to assist in recovery.
• Note:
• The NameNode collects block reports from the DataNodes periodically to maintain the replication factor.
• Therefore, whenever a block is over-replicated or under-replicated, the NameNode deletes or adds replicas as needed.
Rack Awareness
• The NameNode also ensures that all the replicas of a block are not stored on the same (single) rack.
• It follows an in-built Rack Awareness
Algorithm to reduce latency as well as provide
fault tolerance.
• Considering the replication factor is 3, the Rack Awareness Algorithm says that
• the first replica of a block will be stored on a local rack, and
• the next two replicas will be stored on a different (remote) rack, but on different DataNodes within that (remote) rack.
• If you have more replicas, the rest of the replicas will be placed on random DataNodes, provided no more than two replicas end up on the same rack, where possible (see the sketch below).
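Below is a toy sketch of the placement rule just described, assuming a simple map from rack names to DataNode names. It only illustrates the idea for a replication factor of 3; it is not Hadoop's actual block placement policy implementation.

import java.util.*;

// Toy illustration of the rack-aware placement rule: first replica on the
// local rack, next two on one remote rack but on different DataNodes.
public class RackAwareSketch {
    static List<String> placeReplicas(Map<String, List<String>> rackToNodes, String localRack) {
        List<String> replicas = new ArrayList<>();
        // 1st replica: a node on the writer's local rack
        replicas.add(rackToNodes.get(localRack).get(0));
        // choose one remote rack for the remaining two replicas
        String remoteRack = rackToNodes.keySet().stream()
                .filter(rack -> !rack.equals(localRack))
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("need at least two racks"));
        List<String> remoteNodes = rackToNodes.get(remoteRack);
        // 2nd and 3rd replicas: two different DataNodes on that remote rack
        replicas.add(remoteNodes.get(0));
        replicas.add(remoteNodes.get(1));
        return replicas;
    }

    public static void main(String[] args) {
        Map<String, List<String>> cluster = Map.of(
                "/rack1", List.of("DataNode1", "DataNode2", "DataNode3"),
                "/rack2", List.of("DataNode4", "DataNode5", "DataNode6"));
        System.out.println(placeReplicas(cluster, "/rack1"));
        // e.g. [DataNode1, DataNode4, DataNode5]: one local-rack replica, two on a remote rack
    }
}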
why do we need a Rack Awareness
algorithm?
• The reasons are:
• To improve network performance: communication between nodes on different racks is routed through switches, and in most cases network bandwidth within a rack is greater than between racks, so writing two of the three replicas to a single remote rack reduces inter-rack traffic.
• To prevent loss of data: we don't have to worry about the data even if an entire rack fails because of a switch failure or power failure. And if you think about it, it makes sense; as the proverb says, never put all your eggs in the same basket (i.e. don't risk everything on the success of one venture).
NETWORK TOPOLOGY AND
HADOOP
What does it mean for two nodes in a local network to be “close” to
each other?
• Processes on the same node
• Different nodes on the same rack
• Nodes on different racks in the same data center
• Nodes in different data centers
• Let's imagine your cluster as a tree with the following levels:
• Abstract global root (Top or root)
• Data centers (1st level)
• Racks (2nd level)
• Nodes (3rd level or leaves)
• If we draw this tree there should be something like this:
For example, imagine a node n1 on rack r1 in data center d1. This can be
represented as /d1/r1/n1. Using this notation, here are the distances for the four
scenarios:
• distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
• distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
• distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same
data center)
• distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers).
• distance(/d1/r1/n1, /d2/r3/n10) = ?
What is the network distance?
• Let's count the distance between any circle and its parent as 1.
• Then the distance between any two circles is the sum of their distances to their closest common ancestor, or 0 for the same node.
• So it is always 6 for any two nodes in different data centers (like between /d1/r1/n1 and /d2/r4/n10).
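The rule above is easy to turn into code. The sketch below computes the distance from two topology paths such as /d1/r1/n1; it is an illustration only, not Hadoop's actual NetworkTopology class.

// Network distance between two nodes given their topology paths,
// e.g. distance("/d1/r1/n1", "/d1/r2/n3") == 4.
public class NetworkDistance {
    static int distance(String a, String b) {
        String[] pathA = a.substring(1).split("/");   // ["d1", "r1", "n1"]
        String[] pathB = b.substring(1).split("/");
        int common = 0;                               // length of the shared prefix
        while (common < pathA.length && common < pathB.length
                && pathA[common].equals(pathB[common])) {
            common++;
        }
        // hops from each node up to the closest common ancestor
        return (pathA.length - common) + (pathB.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1"));  // 0: same node
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2"));  // 2: same rack
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3"));  // 4: same data center
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4"));  // 6: different data centers
    }
}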
HDFS Write Architecture:
• Suppose the NameNode provided the following lists of IP addresses to the client:
• For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of DataNode 6}
• For Block B, list B = {IP of DataNode 3, IP of DataNode 7, IP of DataNode 9}
• Each block will be copied to three different DataNodes to keep the replication factor consistent throughout the cluster.
• Now the whole data copy process happens in three stages: set-up of the pipeline, data streaming and replication, and shutdown of the pipeline (the acknowledgement stage).
• For Block A: 1A -> 2A -> 3A -> 4A
• For Block B: 1B -> 2B -> 3B -> 4B -> 5B -> 6B
• While serving a read request from the client, HDFS selects the replica that is closest to the client.
• This reduces the read latency and the bandwidth consumption.
• Therefore, the replica that resides on the same rack as the reader node is selected, if possible.
Anatomy of a File Read
Anatomy of a File Write
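The figures for these two slides are not reproduced here. As a complementary sketch of what happens on the client side, the snippet below writes and then reads a file through the HDFS Java API; the file path and the localhost:9000 NameNode address are just example values (see the configuration files on the following slides).

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Minimal HDFS client: write a file, then read it back.
public class HdfsReadWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");   // example NameNode address
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/data/demo.txt");         // example path

        // Write: the client asks the NameNode where to place the blocks and
        // then streams the data through the DataNode pipeline.
        try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns the block locations and the client reads
        // from the closest replica.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}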
Configuration XML files in Hadoop
1. hadoop-env.sh
• This file specifies the environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop), most importantly JAVA_HOME.
2. hdfs-site.xml
• This is the main configuration file for HDFS. It defines the NameNode and DataNode paths as well as the replication factor.
• The replication factor is specified by the dfs.replication property; since this is a single-node cluster, we set it to 1.
• The default heartbeat interval is 3 seconds. One can change it by using
dfs.heartbeat.interval in hdfs-site.xml.
3. mapred-site.xml
• To specify which framework should be used for MapReduce, we use the mapreduce.framework.name property; yarn is used here.
• yarn-site.xml
• To specify the auxiliary service that needs to run with the NodeManager, the "yarn.nodemanager.aux-services" property is used.
• Here shuffling is used as the auxiliary service, and to specify the class that should be used for shuffling we use "yarn.nodemanager.aux-services.mapreduce.shuffle.class".
• core-site.xml
• This file informs the Hadoop daemons where the NameNode runs in the cluster and contains configuration settings for Hadoop core, such as I/O settings common to HDFS and MapReduce. It also defines port numbers, memory limits and the size of the read/write buffers used by Hadoop. Find this file in the etc/hadoop directory.
• The location of the NameNode is specified by the fs.defaultFS property, e.g. a NameNode running at port 9000 on localhost.
• The hadoop.tmp.dir property specifies the location where temporary as well as permanent data of Hadoop will be stored.
• Sample single-node versions of all four XML files are shown after this list.
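For illustration, here is what minimal single-node versions of these files might look like. The property names are the standard ones discussed above; the paths, the port and the values are example choices, not requirements.

<!-- core-site.xml : NameNode location and temporary data directory -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-data</value>   <!-- example location -->
  </property>
</configuration>

<!-- hdfs-site.xml : replication factor and NameNode/DataNode directories -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>                  <!-- single-node cluster -->
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/hadoop/namenode</value>   <!-- example path -->
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/hadoop/datanode</value>   <!-- example path -->
  </property>
</configuration>

<!-- mapred-site.xml : run MapReduce on YARN -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- yarn-site.xml : auxiliary shuffle service for the NodeManager -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>  <!-- service name used by recent Hadoop versions -->
  </property>
</configuration>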