Hadoop
History:
Hadoop was started by Doug Cutting and Mike Cafarella in 2002, when they
both began working on the Apache Nutch project.
Apache Nutch was a project to build a search engine system that could
index one billion pages. After a lot of research on Nutch, they concluded
that such a system would cost around half a million dollars in hardware,
along with a monthly running cost of approximately $30,000, which is very
expensive.
HDFS (Hadoop Distributed File System): files are broken into blocks and
stored on nodes across the distributed architecture.
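
To make the block idea concrete, here is a minimal sketch in Java of how a
file's size maps to a block count, assuming the Hadoop 2.x default block
size of 128 MB; the actual splitting is done by HDFS itself, not by client
code like this.

// Illustrative only: how HDFS splits a file into fixed-size blocks.
// Assumes the Hadoop 2.x default block size of 128 MB.
public class BlockSplitSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB

    public static void main(String[] args) {
        long fileSize = 500L * 1024 * 1024; // a hypothetical 500 MB file
        long fullBlocks = fileSize / BLOCK_SIZE;
        long totalBlocks = fullBlocks + (fileSize % BLOCK_SIZE > 0 ? 1 : 0);
        // 500 MB -> three full 128 MB blocks plus one 116 MB block = 4
        // blocks, each stored (and replicated) on DataNodes in the cluster.
        System.out.println("Blocks needed: " + totalBlocks);
    }
}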
YARN (Yet Another Resource Negotiator) is used for job scheduling and for
managing the cluster.
MapReduce: This is a framework that helps Java programs perform parallel
computation on data using key-value pairs. The Map task takes input data
and converts it into a data set that can be computed over as key-value
pairs. The output of the Map task is consumed by the Reduce task, and the
output of the reducer gives the desired result.
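
As an illustration of these two tasks, below is the classic word-count
example written against the org.apache.hadoop.mapreduce API; it closely
follows the standard example from the Hadoop documentation. The Map task
emits (word, 1) pairs and the Reduce task sums them per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: turns each line of input into (word, 1) key-value pairs.
public class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE); // emit (word, 1)
        }
    }
}

// Reduce task: sums the counts emitted for each word key.
class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result); // emit (word, total count)
    }
}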
Hadoop Common: These Java libraries are used to start Hadoop and are
used by other Hadoop modules.
Hadoop Architecture
1. HDFS
NameNode (Master):
The NameNode works as the Master in a Hadoop cluster and guides the
DataNodes (Slaves).
The NameNode is mainly used for storing the metadata, i.e. the data about
the data. Metadata includes the transaction logs that keep track of user
activity in the Hadoop cluster, as well as each file's name and size and
the location information (block numbers, block IDs) of its blocks on the
DataNodes.
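
As a small sketch of the metadata the NameNode serves, the snippet below
uses Hadoop's FileSystem client API to ask for a file's size and the
DataNode hosts holding each of its blocks; the path /data/sample.txt is a
placeholder, and the cluster address is taken from the usual configuration
files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Queries the NameNode for a file's metadata: its size and the
// DataNodes that hold each of its blocks.
public class BlockLocationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/sample.txt"); // hypothetical file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Size: " + status.getLen() + " bytes");

        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block reports its offset, length, and host DataNodes.
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}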
DataNode (Slave):
DataNodes work as Slaves. DataNodes are mainly utilized for storing the
data in a Hadoop cluster; the number of DataNodes can range from 1 to 500
or even more.
The more DataNodes the cluster has, the more data it can store. It is
therefore advised that DataNodes have a high storage capacity, so that
each can hold a large number of file blocks.
2. MapReduce
MapReduce is essentially an algorithm, or a processing model, that runs on
top of the YARN framework.
The major feature of MapReduce is that it performs distributed processing
in parallel across the Hadoop cluster, which is what makes Hadoop so fast.
When you are dealing with Big Data, serial processing is no longer
practical.
MapReduce has two main tasks, divided phase-wise: the Map task runs in the
first phase and the Reduce task in the second, wired together by a driver
like the one sketched below.
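
A minimal driver, reusing the TokenizerMapper and IntSumReducer classes
from the word-count sketch above, shows how the two phases are configured;
the input and output paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver that wires the two phases together: the Map phase runs
// TokenizerMapper, the Reduce phase runs IntSumReducer.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class); // phase 1: Map
        job.setCombinerClass(IntSumReducer.class); // optional local combine
        job.setReducerClass(IntSumReducer.class);  // phase 2: Reduce

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}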
Job Tracker:
The Job Tracker is the component that manages all activities related to
MapReduce jobs in the Hadoop cluster, such as initiating tasks, managing
resources, and killing tasks.
It acts as the master for all Task Tracker processes in MapReduce, just as
the NameNode is the master for all DataNode processes in HDFS.
The Job Tracker has all the information about the Task Tracker nodes and
their health at any given point in time.
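
In Hadoop 1.x, the Job Tracker's address is set in mapred-site.xml so that
Task Trackers know where to report; a minimal sketch, with a placeholder
host and port:

<!-- mapred-site.xml (Hadoop 1.x); host and port are placeholders -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master-node:9001</value>
    <description>Host and port the Job Tracker listens on; Task
    Trackers on the slave nodes send their heartbeats here.</description>
  </property>
</configuration>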
Task Tracker:
The Task Tracker is a process that runs on every slave node alongside the
DataNode process.
It updates the Job Tracker about resource and task status on the machine
on which it is running.
3. YARN
YARN stands for "Yet Another Resource Negotiator". It was introduced in
Hadoop 2.0 to remove the Job Tracker bottleneck that was present in
Hadoop 1.0.
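
As a small client-side illustration, and assuming a reachable cluster
configured via the usual yarn-site.xml, the sketch below uses the
YarnClient API to ask the ResourceManager (which took over the Job
Tracker's resource-management role) for the applications it is tracking.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Lists the applications known to the YARN ResourceManager.
public class YarnAppList {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration(); // reads yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getApplicationId() + " "
                    + app.getName() + " " + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}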