Hadoop: Er. Gursewak Singh Dsce
Hadoop is written in Java and is currently used by Google, Facebook, LinkedIn, Yahoo, Twitter, etc.
YARN (Yet Another Resource Negotiator): used for job scheduling and for managing the cluster.
MapReduce: This is a framework that helps Java programs perform parallel computation on
data using key-value pairs. The Map task takes input data and converts it into a data set that can
be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the
output of the reducer gives the desired result.
Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop
modules.
Features of Hadoop Which Make It Popular
1. Open Source:
Hadoop is open-source, which means it is free to use. Since it is an open-source project, the source code is available online for
anyone to understand or modify to suit their requirements.
2. Highly Scalable Cluster:
Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive machines in a cluster and
processed in parallel. The number of these machines or nodes can be increased or decreased as per the enterprise's
requirements.
3. Fault Tolerance:
Hadoop uses commodity hardware (inexpensive systems) that can crash at any moment. In Hadoop, data is replicated
on various DataNodes in a Hadoop cluster, which ensures the availability of data even if one of your systems crashes.
You can read all of the data from a single machine; if that machine faces a technical issue, the data can also be read from the other
nodes that hold a replica.
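As a concrete illustration of replication, the sketch below uses the HDFS Java FileSystem API to read and change the replication factor of a file. It is only a minimal sketch: the file path is a hypothetical example, and a reachable cluster configuration (core-site.xml / hdfs-site.xml on the classpath) is assumed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up the cluster configuration
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");   // hypothetical example file

        // Print the current number of replicas kept for each block of this file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication: " + status.getReplication());

        // Ask HDFS to keep 3 copies of each block of this file.
        fs.setReplication(file, (short) 3);
    }
}
```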
4. High Availability:
High availability means the availability of data on the Hadoop cluster. Thanks to fault tolerance, if any DataNode
goes down, the same data can be retrieved from another node where it is replicated. A highly available Hadoop
cluster also has two or more NameNodes, i.e. an Active NameNode and a Passive NameNode, also known as a standby
NameNode. If the Active NameNode fails, the Passive NameNode takes over its responsibilities and serves
the same data as the Active NameNode, which can easily be utilized by the user.
5. Cost-Effective:
Hadoop is open-source and uses cost-effective commodity hardware, which provides a cost-efficient model, unlike traditional
relational databases that require expensive hardware and high-end processors to deal with Big Data.
6. Flexible:
Hadoop is designed in such a way that it can deal with any kind of dataset, whether structured (e.g. MySQL data), semi-
structured (XML, JSON), or unstructured (images and videos), very efficiently. This means it can easily process any kind of
data independent of its structure, which makes it highly flexible.
7. Easy to Use:
Hadoop is easy to use, since developers need not worry about any of the distributed processing work; it is managed by
Hadoop itself. The Hadoop ecosystem is also very large and comes with many tools such as Hive, Pig, Spark, HBase,
Mahout, etc.
8. Data Locality:
The concept of data locality is used to make Hadoop processing fast. With data locality, the computation
logic is moved near the data rather than moving the data to the computation logic. Moving data across HDFS
is costly, and the data locality concept minimizes the bandwidth utilization in the system.
9. Faster Data Processing:
Hadoop uses a distributed file system to manage its storage, i.e. HDFS (Hadoop Distributed File System). In
a DFS (Distributed File System), a large file is broken into small file blocks that are then distributed among the
nodes available in a Hadoop cluster; because this massive number of file blocks is processed in parallel, Hadoop
is faster.
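To make the block-distribution and data-locality ideas above concrete, the following minimal sketch uses the HDFS Java client to list the blocks of a file and the DataNodes that hold them; this is the information the framework uses when it schedules computation close to the data. The file path is a hypothetical example, and an accessible cluster configuration is assumed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/large.log"));  // example file

        // One entry per block: its offset, its length, and the hosts storing a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```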
History of Hadoop
Hadoop Architecture
The Hadoop architecture mainly consists of four components: MapReduce, HDFS (Hadoop Distributed File System), YARN, and Hadoop Common.
MapReduce
Map Task:
RecordReader: The purpose of the RecordReader is to break the input into records. It is responsible for providing
key-value pairs to the Map() function. The key is its locational information and the value is the
data associated with it.
Map: A map is nothing but a user-defined function whose job is to process the tuples obtained from the
RecordReader. The Map() function may generate no key-value pairs at all, or it may generate multiple
such pairs (a word-count sketch follows at the end of this section).
Combiner: The combiner is used for grouping the data in the Map workflow. It is similar to a local
reducer. The intermediate key-value pairs that are generated in the Map phase are combined with the help of this
combiner. Using a combiner is not necessary, as it is optional.
Partitioner: The partitioner is responsible for fetching the key-value pairs generated in the Mapper phase.
It generates the shards corresponding to each reducer. The hashcode of each key is
fetched by the partitioner, which then takes its modulus with the number of
reducers (key.hashCode() % numberOfReducers) to decide which reducer receives the pair.
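Below is a minimal word-count sketch of the Map side, assuming the standard Hadoop MapReduce Java API (the org.apache.hadoop.mapreduce classes). The Mapper emits a (word, 1) pair for each token of an input line, and the partitioner applies the key.hashCode() % numberOfReducers rule described above. Class names here are illustrative, not part of any existing codebase.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordCountMap {

    // The RecordReader of the text input format hands each line to map() as
    // (byte offset within the file, line text).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // intermediate key-value pair
                }
            }
        }
    }

    // Applies the rule described above: key.hashCode() modulo the number of
    // reducers decides which reducer receives the pair.
    public static class ModuloPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReducers) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
        }
    }
}
```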
Reduce Task
Shuffle and Sort: The task of the Reducer starts with this step. The process in which the Mapper
generates the intermediate key-value pairs and transfers them to the Reducer task is known
as shuffling. Using the shuffling process, the system can sort the data by its key. Shuffling
begins as soon as some of the Map tasks are done, which is why it is a faster process that does
not wait for all of the Mapper tasks to complete.
Reduce: The main task of the Reduce step is to gather the tuples generated by the Map step
and then perform sorting and aggregation on those key-value pairs, depending
on the key element.
OutputFormat: Once all the operations are performed, the key-value pairs are written into a
file with the help of the RecordWriter, each record on a new line, and the key and value in a space-
separated manner.
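Continuing the same illustrative word-count example, the sketch below shows the Reduce side and a job driver that wires together the mapper and partitioner from the earlier sketch, an optional combiner, and the reducer. The input and output paths are placeholders, and the output separator is set to a space to match the description above.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountReduce {

    // After shuffle and sort, reduce() receives each word together with all of its counts.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();             // aggregate all values for this key
            }
            context.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Write "key value" separated by a space rather than the default tab.
        conf.set("mapreduce.output.textoutputformat.separator", " ");

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountReduce.class);
        job.setMapperClass(WordCountMap.TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);   // optional local reducer on the map side
        job.setPartitionerClass(WordCountMap.ModuloPartitioner.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/input"));    // placeholder paths
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, such a job would typically be submitted to the cluster with the hadoop jar command.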
Usage of MapReduce
It can be used in various applications like document clustering, distributed sorting, and web
link-graph reversal.
It was used by Google to regenerate Google's index of the World Wide Web.