HDFS MAP REDUCE

The document provides an overview of the Hadoop Distributed File System (HDFS), highlighting its architecture, components, and benefits such as streaming access, data integrity management, and fault tolerance. It also discusses the MapReduce programming model for distributed data processing and introduces Google App Engine (GAE) as a Platform-as-a-Service for web applications, detailing its components and programming environment. Additionally, the document mentions the Google File System (GFS) designed for processing large data amounts with a focus on performance and reliability.

Uploaded by

rbsraja

Hadoop Distributed File System (HDFS)

• The Hadoop Distributed File System (HDFS) is Hadoop's
implementation of a distributed file system design that holds
large amounts of data.
• It provides easy access to the stored data for many clients
distributed across the network.
• An HDFS file is composed of fixed-size blocks, or chunks,
that are stored on data nodes.
• The name node stores the metadata for each file: attributes
such as file type, size, date and time of creation, and other
file properties, as well as the mapping of the file's blocks to
the data nodes that hold them.
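The metadata kept by the name node can be pictured with a small sketch. This is only an illustration in Python, not HDFS code: the class names `FileMetadata` and `NameNodeSketch`, the path, the block ids, and the data node names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileMetadata:
    """Hypothetical per-file record; field names are assumptions."""
    size_bytes: int
    created: str  # creation timestamp as text
    # block id -> names of the data nodes that hold a copy of the block
    block_locations: Dict[str, List[str]] = field(default_factory=dict)


class NameNodeSketch:
    """Toy in-memory namespace mapping file paths to their metadata."""

    def __init__(self) -> None:
        self.namespace: Dict[str, FileMetadata] = {}

    def register_file(self, path: str, meta: FileMetadata) -> None:
        self.namespace[path] = meta

    def locate_blocks(self, path: str) -> Dict[str, List[str]]:
        # The name node answers "where are the blocks?"; the block
        # contents themselves live on the data nodes.
        return self.namespace[path].block_locations


nn = NameNodeSketch()
nn.register_file(
    "/logs/app.log",
    FileMetadata(
        size_bytes=100 * 1024 * 1024,
        created="2024-01-01T00:00:00",
        block_locations={"blk_0": ["dn1", "dn2"], "blk_1": ["dn2", "dn3"]},
    ),
)
print(nn.locate_blocks("/logs/app.log"))
```

The point of the sketch is the separation of roles: the name node holds only the file attributes and the block-to-data-node mapping, never the file contents.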
The significant benefits provided by HDFS are as follows:
• It provides streaming access to file system data.
• It is suitable for distributed storage and processing.
• It is optimized to support high streaming read
operations with limited set.
• It supports file operations such as read, write, delete,
and append, but not update.
• It provides Java APIs and command-line interfaces
to interact with HDFS.
• It provides file permissions and authentication for files
on HDFS.
• It continuously monitors name nodes and data nodes through
“heartbeat” communication from the data nodes to the
name node.
• It rebalances data nodes to equalize the load by migrating
blocks of data from one data node to another.
• It uses checksums and digital signatures to manage the integrity
of data stored in a file.
• It has built-in metadata replication to recover data after
a failure or to protect against corruption.
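The checksum-based integrity check listed above can be sketched as follows. This is a minimal illustration that assumes CRC32 as the checksum function (the slides do not name one); `store_block` and `verify_block` are hypothetical helper names, not HDFS APIs.

```python
import zlib
from typing import Tuple


def store_block(data: bytes) -> Tuple[bytes, int]:
    """Return the block with a CRC32 checksum computed at write time."""
    return data, zlib.crc32(data)


def verify_block(data: bytes, expected_checksum: int) -> bool:
    """Recompute the checksum at read time and compare with the stored one."""
    return zlib.crc32(data) == expected_checksum


block, checksum = store_block(b"hello hdfs")
print(verify_block(block, checksum))          # True: block is intact
print(verify_block(b"hellO hdfs", checksum))  # False: one byte was altered
```

When verification fails, a system with replicated blocks can read another copy instead of returning corrupted data.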
Architecture of HDFS
HDFS is composed of the following components:
1. Name Node
An HDFS cluster consists of a single name node, called the master
server, that manages the file system namespace and regulates
access to files by clients.

2. Data Node

An HDFS cluster has multiple data nodes, each of which manages the
storage attached to the node it runs on. They are used to store
users’ data on HDFS clusters.
3. HDFS Client
User applications access the file system through the HDFS client.
Like other file systems, HDFS supports operations to read, write,
and delete files, and operations to create and delete directories.

4. HDFS Blocks
User data is stored in HDFS in terms of blocks. Files in the file
system are divided into one or more segments called blocks.
The default HDFS block size is 64 MB in Hadoop 1.x (128 MB in
Hadoop 2.x and later) and can be increased as needed.
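The division of a file into blocks is simple arithmetic, sketched below. The function `split_into_blocks` is a hypothetical helper written for this illustration, and the 200 MB file is an invented example; the block size is the 64 MB default stated above.

```python
from typing import List

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default stated above


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the size in bytes of each block for a file of file_size bytes."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes


sizes = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
print(len(sizes), [s // (1024 * 1024) for s in sizes])  # 4 [64, 64, 64, 8]
```

A 200 MB file therefore occupies three full 64 MB blocks plus one 8 MB block; a larger block size means fewer blocks for the name node to track.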
Map Reduce
MapReduce is a programming model provided by Hadoop for
expressing distributed computations on huge amounts of data.
It provides easy scaling of data processing over multiple
computational nodes or clusters.
In the MapReduce model, the data processing primitives are
called the mapper and the reducer.
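The mapper and reducer primitives can be illustrated with the classic word-count example. This is a single-process Python sketch of the model, not Hadoop's Java API: `mapper`, `reducer`, and `run_job` are hypothetical names, and the grouping step inside `run_job` stands in for the shuffle that a real framework performs between the two phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in the input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Combine all values emitted for one key into a single total."""
    return word, sum(counts)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    # Group the mapper output by key, then reduce each group.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


result = run_job(["big data big cluster", "data node"])
print(result)  # {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```

Because each mapper call looks at one line only and each reducer call looks at one key only, both phases can be spread over many machines without changing the two functions.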
Features of MapReduce
• Synchronization: MapReduce supports the execution of
concurrent tasks, and concurrent tasks need synchronization.
Synchronization is provided by reading the state of each
MapReduce operation during execution and using shared
variables for those states.
• Data locality: Although the data resides on different clusters,
it appears local to the user's application. For best results, the
application's code and data should reside on the same machine.

• Error handling: The MapReduce engine provides different fault
tolerance mechanisms in case of failure.

• Scheduling: MapReduce involves map and reduce operations that
divide large problems into smaller chunks, which are run in
parallel on different machines.
Working of MapReduce Framework

The unit of work in MapReduce is a job. During the map phase, the
input data is divided into input splits for analysis, and each split
is processed as an independent task. These tasks run in parallel
across the Hadoop cluster.
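The idea of input splits processed as independent parallel tasks can be sketched locally. This illustration makes several assumptions: threads in one process stand in for the machines of a Hadoop cluster, the split size of two records is arbitrary, and `make_splits` and `map_task` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def make_splits(records: List[str], split_size: int) -> List[List[str]]:
    """Divide the input records into fixed-size input splits."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


def map_task(split: List[str]) -> int:
    """One independent task: count the words in a single split."""
    return sum(len(line.split()) for line in split)


records = ["a b c", "d e", "f", "g h i j", "k"]
splits = make_splits(records, split_size=2)

# Each split is handed to a separate worker; no task depends on another.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(map_task, splits))

total_words = sum(partial_counts)
print(len(splits), partial_counts, total_words)  # 3 [5, 5, 1] 11
```

The final sum plays the role of a reduce step: it combines the partial results that the independent tasks produced.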
Google App Engine
Google App Engine (GAE) is a Platform-as-a-Service cloud
computing model that supports many programming languages. GAE
is a scalable runtime environment mostly devoted to executing web
applications.

It allows developers to use a ready-made platform to develop and
deploy web applications using development tools, a runtime engine,
databases, and middleware solutions. It supports languages such as
Java, Python, .NET, PHP, Ruby, and Node.js.
The GAE platform comprises five main components:
• Application runtime environment: offers a platform with a built-in execution engine for
scalable web programming and execution.
• Software Development Kit (SDK): for local application development and deployment on
Google Cloud Platform.
• Datastore: provisions object-oriented, distributed, structured data storage for
applications and data. It also provides secure data management operations based on
Bigtable techniques.
• Admin console: used for easy management of user application development and resource
management.
• GAE web service: provides APIs and interfaces.
Programming Environment for Google App Engine
Google provides programming support for its cloud
environment, that is, Google App Engine, through the Google File
System (GFS), Bigtable, and Chubby.
The Google File System (GFS)

Google designed a distributed file system, named GFS, to meet
its exacting demands for processing large amounts of data.
Most of the design objectives of GFS are similar to those of
earlier distributed systems. They include availability,
performance, reliability, and scalability.
Architecture of GFS clusters
