Introduction to Hadoop

Hadoop is an open-source framework designed for storing and processing large datasets in a distributed computing environment, utilizing the MapReduce model for parallel processing. Its core components include HDFS for storage and YARN for resource management, providing features such as fault tolerance, scalability, and cost-effectiveness. The architecture supports both structured and unstructured data, differentiating it from traditional RDBMS systems which are more suited for structured data and OLTP environments.


Introduction to Hadoop:

• Hadoop is an open-source software framework used for storing and processing large amounts of data in a distributed computing environment.
• It is designed to handle big data and is based on the MapReduce programming model, which allows for the parallel processing of large datasets.
• Hadoop provides both the storage for large amounts of data and the framework for performing computation on that data.
• Its framework is based on Java programming, with some native code in C and shell scripts.
Hadoop has two main components:

• HDFS (Hadoop Distributed File System):

This is the storage component of Hadoop, which allows for the storage of large amounts of data across multiple machines. It is designed to work with commodity hardware, which makes it cost-effective.

• YARN (Yet Another Resource Negotiator):

This is the resource management component of Hadoop, which manages the allocation of resources (such as CPU and memory) for processing the data stored in HDFS.
Features of Hadoop:

• 1. It is fault tolerant.
• 2. It is highly available.
• 3. Its programming model is easy to use.
• 4. It has huge, flexible storage.
• 5. It is low cost.
Differences Between RDBMS and Hadoop

RDBMS: Traditional row-column based databases, basically used for data storage, manipulation and retrieval.
Hadoop: An open-source software framework used for storing data and running applications or processes concurrently.

RDBMS: In this, mostly structured data is processed.
Hadoop: In this, both structured and unstructured data is processed.

RDBMS: It is best suited for OLTP environments.
Hadoop: It is best suited for Big Data.

RDBMS: It is less scalable than Hadoop.
Hadoop: It is highly scalable.

RDBMS: Data normalization is required.
Hadoop: Data normalization is not required.

RDBMS: It stores transformed and aggregated data.
Hadoop: It stores huge volumes of data.

RDBMS: It has no latency in response.
Hadoop: It has some latency in response.

RDBMS: The data schema is static.
Hadoop: The data schema is dynamic.

RDBMS: High data integrity is available.
Hadoop: Lower data integrity than RDBMS.
Hadoop – Architecture

1. MapReduce

2. HDFS (Hadoop Distributed File System)

3. YARN (Yet Another Resource Negotiator)

4. Common Utilities or Hadoop Common
1. MapReduce

• MapReduce is essentially an algorithm or data-processing model that is based on the YARN framework.
• The major feature of MapReduce is that it performs distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop work so fast.
• When you are dealing with Big Data, serial processing is no longer of any use.
• MapReduce has mainly two tasks, which are divided phase-wise: in the first phase Map is used, and in the next phase Reduce is used.
Map Task:

• RecordReader: The purpose of the RecordReader is to break the input into records. It is responsible for providing key-value pairs to the Map() function; the key is the record's location information and the value is the data associated with it.

• Map: A map is simply a user-defined function whose job is to process the tuples obtained from the RecordReader. The Map() function may generate no key-value pair at all, or multiple pairs, from these tuples.

• Combiner: The Combiner is used for grouping the data in the Map workflow. It is similar to a local reducer: the intermediate key-value pairs generated by the Map are combined with its help. Using a combiner is optional.

• Partitioner: The Partitioner is responsible for fetching the key-value pairs generated in the Mapper phase. It produces the shards (partitions) corresponding to each reducer, using the hash code of each key to decide which partition a pair belongs to.
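To make the Map task concrete, here is a minimal word-count Mapper sketch using the standard Apache Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). It is not part of the original slides; the class name TokenizerMapper and the choice of word counting are only illustrative. The RecordReader of the default text input format supplies (byte offset, line of text) pairs, and map() emits one (word, 1) pair per token.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count Mapper: input key = byte offset of the line (from the RecordReader),
// input value = the line itself; output = one (word, 1) pair per token.
public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(line.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // intermediate key-value pair for the shuffle
        }
    }
}
```

The intermediate (word, 1) pairs produced here are what the Combiner can pre-aggregate locally and what the Partitioner routes to reducers by key hash.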
Reduce Task

• Shuffle and Sort: The Reducer's task starts with this step. The process in which the Mapper generates the intermediate key-value pairs and transfers them to the Reducer task is known as Shuffling. Using the shuffling process, the system can sort the data by key. Shuffling begins as soon as some of the Map tasks are done, which is why it is a fast process: it does not wait for the Mapper to finish all of its tasks.
• Reduce: The main task of Reduce is to gather the tuples generated by Map and then perform sorting and aggregation on those key-value pairs, depending on the key element.

• OutputFormat: Once all the operations are performed, the key-value pairs are written into the output file with the help of the RecordWriter, each record on a new line, with the key and value separated by a space.
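Continuing the word-count illustration, below is a hedged sketch of the Reduce side plus a small driver that wires the phases together with the standard Hadoop Job API. It assumes the TokenizerMapper from the previous sketch is on the classpath; the job name and the use of the reducer as combiner are conventional choices, not something stated in the slides.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word-count Reducer: after shuffle and sort, all counts for one word arrive
// together and are summed into a single (word, total) pair.
public class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        result.set(sum);
        context.write(word, result);   // handed to the OutputFormat's RecordWriter
    }

    // Driver: wires the Mapper, optional Combiner, and Reducer into one job.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(IntSumReducer.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // combiner acts as a local reducer
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The default TextOutputFormat writes each (key, value) pair on its own line, separated by a tab by default, which corresponds to the RecordWriter behaviour described above.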
HDFS

• HDFS (Hadoop Distributed File System) is the storage layer of Hadoop. It is mainly designed to work on commodity hardware devices (inexpensive devices), following a distributed file system design.
• HDFS is designed in such a way that it prefers storing data in large blocks rather than in many small blocks.
• HDFS provides fault tolerance and high availability to the storage layer and the other devices present in the Hadoop cluster.

Data storage nodes in HDFS:
• NameNode (Master)
• DataNode (Slave)
• NameNode: The NameNode works as the Master in a Hadoop cluster and guides the DataNodes (Slaves). The NameNode is mainly used for storing the metadata, i.e. the data about the data. Metadata can include the transaction logs that keep track of user activity in the Hadoop cluster.
• DataNode: DataNodes work as Slaves. They are mainly used for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The more DataNodes there are, the more data the Hadoop cluster can store. It is therefore advised that DataNodes have high storage capacity, so they can hold a large number of file blocks.
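As an illustration of how a client uses HDFS, the sketch below writes and reads a file through the standard Hadoop Java FileSystem API. It is an assumption: the file path is made up, and the cluster address is taken from core-site.xml rather than hard-coded. Behind these calls, the NameNode resolves the path to block locations and the actual bytes flow to and from DataNodes.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (e.g. hdfs://namenode:9000) from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/hadoop/demo/hello.txt");  // hypothetical path

        // Write: the client asks the NameNode where to place the blocks,
        // then streams the bytes to the chosen DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns the block locations; data is read from DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```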
YARN (Yet Another Resource Negotiator)

• YARN is the framework on which MapReduce works.
• YARN performs two operations: Job Scheduling and Resource Management.
• The purpose of the Job Scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in the Hadoop cluster and processing can be maximized.
• The Job Scheduler also keeps track of which job is important, which job has higher priority, dependencies between jobs, and other information such as job timing.
• The Resource Manager is used to manage all the resources that are made available for running the Hadoop cluster.
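As a small illustration of how YARN's resource management surfaces to a MapReduce job, the sketch below requests container memory and CPU through standard Hadoop configuration properties. The numeric values and the queue name are placeholders, and the property names assume the usual Hadoop 2.x/3.x MapReduce-on-YARN setup; they are not taken from the slides.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnResourceConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Memory (MB) and vcores that YARN's ResourceManager should allocate
        // to each map and reduce container for this job (placeholder values).
        conf.setInt("mapreduce.map.memory.mb", 2048);
        conf.setInt("mapreduce.map.cpu.vcores", 1);
        conf.setInt("mapreduce.reduce.memory.mb", 4096);
        conf.setInt("mapreduce.reduce.cpu.vcores", 2);

        // Scheduler queue the job is submitted to (relevant for multi-tenancy).
        conf.set("mapreduce.job.queuename", "default");

        Job job = Job.getInstance(conf, "resource-aware job");
        // ... set the mapper, reducer, and input/output paths as in the
        //     word-count driver shown earlier, then submit the job ...
    }
}
```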
Features of YARN
1. Multi-Tenancy

2. Scalability

3. Cluster-Utilization

4. Compatibility
Hadoop Common or Common Utilities

• Hadoop Common, or the Common Utilities, is nothing but the Java library and Java files that all the other components present in a Hadoop cluster need.
• These utilities are used by HDFS, YARN, and MapReduce for running the cluster.
• Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so it needs to be handled automatically in software by the Hadoop framework.
