Hadoop Important Lecture
1. What is Hadoop?
2. Architecture in detail
3. Hadoop in industry
What is Hadoop?
[Timeline figure: web search engines from 1996 to 2013; Google appears in 1998]
Hadoop’s Developers: Doug Cutting and Mike Cafarella
Some Hadoop Milestones
• 2003 – Google publishes the Google File System (GFS) paper
• 2004 – Google publishes the MapReduce paper
• 2006 – Hadoop becomes a standalone Apache project; Doug Cutting joins Yahoo!
• Hadoop: an open-source framework for distributed storage and processing of very large data sets on clusters of commodity hardware
• Goals / Requirements:
– Fault-tolerance: the system keeps working when individual nodes fail
NameNode: the master server that holds the file-system namespace and the block-to-DataNode metadata
DataNode: a worker that stores and serves the actual data blocks
MapReduce Engine: the JobTracker/TaskTracker pair that schedules and runs jobs over the stored data
• Design requirements:
• System requirements:
o Low latency
o Disk-efficient sequential and random read performance
Hadoop in the Wild
• Classic alternatives
DataNode Responsibilities
• A Block Server
– Stores data in local file system
– Stores meta-data of a block - checksum
– Serves data and meta-data to clients
• Block Report
– Periodically sends a report of all existing blocks to
NameNode
• Facilitate Pipelining of Data
– Forwards data to other specified DataNodes
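The pipelining bullet above can be sketched as a chain of DataNodes, each persisting a packet locally and forwarding it to the next node. This is a toy in-memory model, not Hadoop code; class and field names here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of HDFS write pipelining: each DataNode stores the packet
// and forwards it downstream. Illustrative only, not Hadoop's API.
class ToyDataNode {
    final String id;
    final List<byte[]> storedPackets = new ArrayList<>();
    ToyDataNode next; // downstream node in the pipeline, or null

    ToyDataNode(String id) { this.id = id; }

    void receive(byte[] packet) {
        storedPackets.add(packet);              // persist locally
        if (next != null) next.receive(packet); // forward downstream
    }
}

public class PipelineDemo {
    public static void main(String[] args) {
        ToyDataNode a = new ToyDataNode("dn1");
        ToyDataNode b = new ToyDataNode("dn2");
        ToyDataNode c = new ToyDataNode("dn3");
        a.next = b; b.next = c; // client -> dn1 -> dn2 -> dn3

        a.receive("block-packet".getBytes());

        // every node in the pipeline now holds the packet
        System.out.println(a.storedPackets.size() + " "
                + b.storedPackets.size() + " " + c.storedPackets.size()); // 1 1 1
    }
}
```

The client writes only to the first DataNode; the chain itself fans the data out, which keeps the client's outbound bandwidth constant regardless of the replication factor.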
Block Placement
• Replication Strategy
– One replica on local node
– Second replica on a remote rack
– Third replica on the same remote rack, on a different node
– Additional replicas are randomly placed
• Clients read from nearest replica
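A minimal sketch of that default placement policy, in plain Java rather than Hadoop's actual `BlockPlacementPolicy` machinery; the rack/host encoding and names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Toy version of HDFS's default replica placement: first replica on the
// writer's node, second on a node in a different rack, third on another
// node in that same remote rack. Nodes are encoded as "rack/host".
public class PlacementDemo {
    static List<String> place(String localNode, List<String> cluster) {
        String localRack = localNode.split("/")[0];
        List<String> replicas = new ArrayList<>();
        replicas.add(localNode); // 1st replica: local node

        String remoteRack = null;
        for (String node : cluster) {
            String rack = node.split("/")[0];
            if (remoteRack == null && !rack.equals(localRack)) {
                replicas.add(node);          // 2nd replica: a remote rack
                remoteRack = rack;
            } else if (remoteRack != null && rack.equals(remoteRack)
                       && !replicas.contains(node)) {
                replicas.add(node);          // 3rd replica: same remote rack
                break;
            }
        }
        return replicas;
    }

    public static void main(String[] args) {
        List<String> cluster = List.of("r1/h1", "r1/h2", "r2/h3", "r2/h4");
        System.out.println(place("r1/h1", cluster)); // [r1/h1, r2/h3, r2/h4]
    }
}
```

Putting two of the three replicas on one remote rack trades a little rack-failure tolerance for lower cross-rack write traffic: only one copy crosses the rack boundary during the write pipeline.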
Data Correctness
• Data is validated with per-block checksums (CRC32), computed on write and verified by the client on read
Hadoop in the Wild: typical workloads
• Log processing
• Web search indexing
• Ad-hoc queries
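HDFS's data-correctness checks rest on CRC32 checksums computed at write time and re-verified at read time; the idea can be shown with the JDK's own `CRC32` class (a sketch of the principle, not Hadoop code):

```java
import java.util.zip.CRC32;

// Compute a CRC32 checksum when a "block" is written and verify it on
// read; a mismatch signals corruption. Pure JDK, not Hadoop's API.
public class ChecksumDemo {
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "some block data".getBytes();
        long stored = checksum(block);          // computed at write time

        // At read time, recompute and compare against the stored value.
        boolean intact = checksum(block) == stored;

        block[0] ^= 1;                          // flip one bit: simulate corruption
        boolean corrupted = checksum(block) != stored;

        System.out.println(intact + " " + corrupted); // true true
    }
}
```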
Closer Look
• MapReduce Components
– JobClient
– JobTracker
– TaskTracker
– Child
• Job Creation/Execution Process
MapReduce Process
(org.apache.hadoop.mapred)
• JobClient
– Submit job
• JobTracker
– Manage and schedule job, split job into tasks
• TaskTracker
– Start and monitor the task execution
• Child
– The process that actually executes the task
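The division of labor above (map tasks emit key/value pairs, a shuffle groups them by key, reduce tasks aggregate) can be illustrated with an in-memory word count in plain Java; this uses no Hadoop classes, whereas the real API lives in `org.apache.hadoop.mapred`:

```java
import java.util.HashMap;
import java.util.Map;

// In-memory word count following the MapReduce shape: "map" emits a
// (word, 1) pair per word, and the shuffle+reduce steps are folded into
// grouping by key and summing the values. Plain Java, not the Hadoop API.
public class WordCountDemo {
    static Map<String, Integer> mapReduce(String[] lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {              // one "map task" per line
            for (String word : line.split("\\s+")) {
                counts.merge(word, 1, Integer::sum); // group by key, sum values
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] input = { "to be or not", "to be" };
        System.out.println(mapReduce(input).get("to")); // 2
    }
}
```

In a real job, the per-line loop would run as parallel map tasks on TaskTracker nodes, and the grouping/summing would happen in separate reduce tasks after the shuffle.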
Inter Process Communication
IPC/RPC (org.apache.hadoop.ipc)
• Protocol
JobSubmissionProtocol
– JobClient <-------------> JobTracker
InterTrackerProtocol
– TaskTracker <------------> JobTracker
TaskUmbilicalProtocol
– TaskTracker <-------------> Child
• JobTracker implements both protocols and acts as the server in
both IPC channels
• TaskTracker implements the TaskUmbilicalProtocol; Child
gets task information and reports task status through it.
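That arrangement can be compressed into a plain-Java sketch: two protocol interfaces with the JobTracker implementing both, so each client type only sees its own interface. The interface names match the ones above, but the method names and return values here are illustrative, not the real signatures from `org.apache.hadoop.mapred`:

```java
// Sketch of the IPC layout: JobTracker is the server end of both the
// job-submission protocol (used by JobClient) and the inter-tracker
// protocol (used by TaskTracker). Method names are illustrative.
interface JobSubmissionProtocol {
    String submitJob(String jobFile);
}

interface InterTrackerProtocol {
    String heartbeat(String taskTrackerName);
}

class ToyJobTracker implements JobSubmissionProtocol, InterTrackerProtocol {
    public String submitJob(String jobFile) {
        return "job_0001";   // would split the job into tasks and schedule them
    }
    public String heartbeat(String taskTrackerName) {
        return "LaunchTask"; // would hand back task assignments
    }
}

public class IpcDemo {
    public static void main(String[] args) {
        ToyJobTracker jt = new ToyJobTracker();
        // JobClient sees the same server only through JobSubmissionProtocol...
        JobSubmissionProtocol asSeenByJobClient = jt;
        // ...while TaskTracker sees it only through InterTrackerProtocol.
        InterTrackerProtocol asSeenByTaskTracker = jt;

        System.out.println(asSeenByJobClient.submitJob("wordcount.xml")
                + " " + asSeenByTaskTracker.heartbeat("tt-01")); // job_0001 LaunchTask
    }
}
```

Separating the protocols this way lets each client depend only on the calls it needs, even though one server object answers both; Hadoop's RPC layer (`org.apache.hadoop.ipc`) exposes these interfaces over the network.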