Introduction to Hadoop

Hadoop is an open-source framework designed for storing and processing large datasets in a distributed computing environment, utilizing the MapReduce model for parallel processing. Its core components include HDFS for storage and YARN for resource management, providing features such as fault tolerance, scalability, and cost-effectiveness. The architecture supports both structured and unstructured data, differentiating it from traditional RDBMS systems which are more suited for structured data and OLTP environments.


Introduction to Hadoop:

• Hadoop is an open-source software framework used for storing and processing large amounts of data in a distributed computing environment.
• It is designed to handle big data and is based on the MapReduce programming model, which allows for the parallel processing of large datasets.
• Hadoop provides both the storage for large amounts of data and the framework for performing computation on that data.
• Its framework is based on Java programming, with some native code in C and shell scripts.
Hadoop has two main components:

• HDFS (Hadoop Distributed File System):

This is the storage component of Hadoop, which allows for the storage of large amounts of data across multiple machines. It is designed to work with commodity hardware, which makes it cost-effective.

• YARN (Yet Another Resource Negotiator):

This is the resource management component of Hadoop, which manages the allocation of resources (such as CPU and memory) for processing the data stored in HDFS.
Features of Hadoop:

• 1. It is fault tolerant.
• 2. It is highly available.
• 3. Its programming model is easy to use.
• 4. It has huge, flexible storage.
• 5. It is low cost.
Differences Between RDBMS and Hadoop

RDBMS: Traditional row-column based databases, basically used for data storage, manipulation and retrieval.
Hadoop: An open-source software framework used for storing data and running applications or processes concurrently.

RDBMS: In this, mostly structured data is processed.
Hadoop: In this, both structured and unstructured data is processed.

RDBMS: It is best suited for OLTP environments.
Hadoop: It is best suited for Big Data.

RDBMS: It is less scalable than Hadoop.
Hadoop: It is highly scalable.

RDBMS: Data normalization is required.
Hadoop: Data normalization is not required.

RDBMS: It stores transformed and aggregated data.
Hadoop: It stores huge volumes of data.

RDBMS: It has no latency in response.
Hadoop: It has some latency in response.

RDBMS: The data schema is static.
Hadoop: The data schema is dynamic.

RDBMS: High data integrity is available.
Hadoop: Lower data integrity than RDBMS.
Hadoop – Architecture

1. MapReduce

2. HDFS (Hadoop Distributed File System)

3. YARN (Yet Another Resource Negotiator)

4. Common Utilities or Hadoop Common
1. MapReduce

• MapReduce is essentially an algorithm or data-processing model that is based on the YARN framework.
• The major feature of MapReduce is that it performs distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop work so fast.
• When you are dealing with Big Data, serial processing is no longer of any use.
• MapReduce has mainly two tasks, which are divided phase-wise: in the first phase Map is used, and in the next phase Reduce is used.
Map Task:

• RecordReader: The purpose of the RecordReader is to break the input into records. It is responsible for providing key-value pairs to the Map() function; the key is the record's location information and the value is the data associated with it.

• Map: A map is simply a user-defined function whose job is to process the tuples obtained from the RecordReader. The Map() function may generate no key-value pair at all, or multiple pairs, from these tuples.

• Combiner: The Combiner is used for grouping the data in the Map workflow. It is similar to a local reducer: the intermediate key-value pairs generated by the Map are combined with its help. Using a combiner is optional.

• Partitioner: The Partitioner is responsible for fetching the key-value pairs generated in the Mapper phase. It produces the shards (partitions) corresponding to each reducer, using the hash code of each key to decide which partition a pair belongs to.
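To make the Map task concrete, here is a minimal word-count Mapper sketch using the standard Apache Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). It is not part of the original slides; the class name TokenizerMapper and the choice of word counting are only illustrative. The RecordReader of the default text input format supplies (byte offset, line of text) pairs, and map() emits one (word, 1) pair per token.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count Mapper: input key = byte offset of the line (from the RecordReader),
// input value = the line itself; output = one (word, 1) pair per token.
public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(line.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // intermediate key-value pair for the shuffle
        }
    }
}
```

The intermediate (word, 1) pairs produced here are what the Combiner can pre-aggregate locally and what the Partitioner routes to reducers by key hash.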
Reduce Task

• Shuffle and Sort: The Reducer's task starts with this step. The process in which the Mapper generates the intermediate key-value pairs and transfers them to the Reducer task is known as Shuffling. Using the shuffling process, the system can sort the data by key. Shuffling begins as soon as some of the Map tasks are done, which is why it is a fast process: it does not wait for the Mapper to finish all of its tasks.
• Reduce: The main task of Reduce is to gather the tuples generated by Map and then perform sorting and aggregation on those key-value pairs, depending on the key element.

• OutputFormat: Once all the operations are performed, the key-value pairs are written into the output file with the help of the RecordWriter, each record on a new line, with the key and value separated by a space.
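Continuing the word-count illustration, below is a hedged sketch of the Reduce side plus a small driver that wires the phases together with the standard Hadoop Job API. It assumes the TokenizerMapper from the previous sketch is on the classpath; the job name and the use of the reducer as combiner are conventional choices, not something stated in the slides.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word-count Reducer: after shuffle and sort, all counts for one word arrive
// together and are summed into a single (word, total) pair.
public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        result.set(sum);
        context.write(word, result);   // handed to the OutputFormat's RecordWriter
    }

    // Driver: wires the Mapper, optional Combiner, and Reducer into one job.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(IntSumReducer.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // combiner acts as a local reducer
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The default TextOutputFormat writes each (key, value) pair on its own line, separated by a tab by default, which corresponds to the RecordWriter behaviour described above.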
HDFS

• HDFS (Hadoop Distributed File System) is the storage layer of Hadoop. It is mainly designed to work on commodity hardware devices (inexpensive devices), following a distributed file system design.
• HDFS is designed in such a way that it prefers storing data in large blocks rather than in many small blocks.
• HDFS provides fault tolerance and high availability to the storage layer and the other devices present in the Hadoop cluster.

Data storage nodes in HDFS:
• NameNode (Master)
• DataNode (Slave)
• NameNode: The NameNode works as the Master in a Hadoop cluster and guides the DataNodes (Slaves). The NameNode is mainly used for storing the metadata, i.e. the data about the data. Metadata can include the transaction logs that keep track of user activity in the Hadoop cluster.
• DataNode: DataNodes work as Slaves. They are mainly used for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The more DataNodes there are, the more data the Hadoop cluster can store. It is therefore advised that DataNodes have high storage capacity, so they can hold a large number of file blocks.
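As an illustration of how a client uses HDFS, the sketch below writes and reads a file through the standard Hadoop Java FileSystem API. It is an assumption: the file path is made up, and the cluster address is taken from core-site.xml rather than hard-coded. Behind these calls, the NameNode resolves the path to block locations and the actual bytes flow to and from DataNodes.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (e.g. hdfs://namenode:9000) from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/hadoop/demo/hello.txt");  // hypothetical path

        // Write: the client asks the NameNode where to place the blocks,
        // then streams the bytes to the chosen DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns the block locations; data is read from DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```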
YARN (Yet Another Resource Negotiator)

• YARN is the framework on which MapReduce works.
• YARN performs two operations: Job Scheduling and Resource Management.
• The purpose of the Job Scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in the Hadoop cluster and processing can be maximized.
• The Job Scheduler also keeps track of which job is important, which job has higher priority, dependencies between jobs, and other information such as job timing.
• The Resource Manager is used to manage all the resources that are made available for running the Hadoop cluster.
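As a small illustration of how YARN's resource management surfaces to a MapReduce job, the sketch below requests container memory and CPU through standard Hadoop configuration properties. The numeric values and the queue name are placeholders, and the property names assume the usual Hadoop 2.x/3.x MapReduce-on-YARN setup; they are not taken from the slides.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnResourceConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Memory (MB) and vcores that YARN's ResourceManager should allocate
        // to each map and reduce container for this job (placeholder values).
        conf.setInt("mapreduce.map.memory.mb", 2048);
        conf.setInt("mapreduce.map.cpu.vcores", 1);
        conf.setInt("mapreduce.reduce.memory.mb", 4096);
        conf.setInt("mapreduce.reduce.cpu.vcores", 2);

        // Scheduler queue the job is submitted to (relevant for multi-tenancy).
        conf.set("mapreduce.job.queuename", "default");

        Job job = Job.getInstance(conf, "resource-aware job");
        // ... set the mapper, reducer, and input/output paths as in the
        //     word-count driver shown earlier, then submit the job ...
    }
}
```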
Features of YARN
1. Multi-Tenancy

2. Scalability

3. Cluster-Utilization

4. Compatibility
Hadoop Common or Common Utilities

• Hadoop Common, or the Common Utilities, is nothing but the Java library and Java files that all the other components present in a Hadoop cluster need.
• These utilities are used by HDFS, YARN, and MapReduce for running the cluster.
• Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so it needs to be handled automatically in software by the Hadoop framework.
