http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-MIE-PDB/
Lecture 10:
MapReduce, Hadoop
26. 4. 2016
Shared memory
Tasks share a common address space
Tasks interact by reading and writing from/to this space
Asynchronously
Data parallelization
Data are partitioned across tasks
Tasks execute a sequence of independent operations
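A toy illustration of this model (plain Java parallel streams, not Hadoop; the array contents are made up): each element is processed by the same independent operation, and the runtime handles the partitioning.

import java.util.stream.IntStream;

public class DataParallelToy {
  public static void main(String[] args) {
    int[] data = {4, 8, 15, 16, 23, 42};  // arbitrary example values
    // The runtime partitions the array across worker threads;
    // each task applies the same independent operation (squaring).
    long sumOfSquares = IntStream.of(data)
        .parallel()
        .mapToLong(x -> (long) x * x)
        .sum();
    System.out.println(sumOfSquares);
  }
}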
MapReduce Framework
Divide-and-conquer paradigm
Map breaks down a problem into sub-problems
Processes input data to generate a set of intermediate key/value pairs
Reduce receives and combines the sub-solutions to solve the problem
Processes intermediate values associated with the same intermediate key
Many real world tasks can be expressed this way
Programmer focuses on map/reduce code
Framework takes care of data partitioning, scheduling execution across machines, handling machine failures, managing inter-machine communication, …
MapReduce
A Bit More Formally
Map
Input: a key/value pair
Output: a set of intermediate key/value pairs
Usually different domain
(k1,v1) → list(k2,v2)
Reduce
Input: an intermediate key and a set of values for that key
Output: a possibly smaller set of values
The same domain
(k2,list(v2)) → (k2,possibly smaller list(v2))
MapReduce
Example: Word Frequency
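A minimal sketch of the two word-frequency functions, written against the standard Hadoop MapReduce Java API (class names are illustrative; both classes are shown together for brevity):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (k1, v1) = (line offset, line text) -> list(k2, v2) = list((word, 1))
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);  // emit (word, 1) for each occurrence
      }
    }
  }
}

// Reduce: (k2, list(v2)) = (word, [1, 1, ...]) -> (word, total count)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));  // total frequency of the word
  }
}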
MapReduce
Fault Tolerance
The implementation described in the original paper aborts the whole computation if the single master fails
Clients can check for this condition and retry the MapReduce operation if they desire
MapReduce
Stragglers
Straggler = a machine that takes an unusually long time to complete one of the map/reduce tasks in the computation
Example: a machine with a bad disk
Solution:
When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks
A task is marked as completed whenever either the primary or the backup execution completes
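Hadoop implements this idea as speculative execution. A minimal sketch of enabling it when configuring a job (property names as in Hadoop 2.x; verify them against your version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeJobSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Allow the framework to launch backup ("speculative") attempts
    // for unusually slow map and reduce tasks.
    conf.setBoolean("mapreduce.map.speculative", true);
    conf.setBoolean("mapreduce.reduce.speculative", true);
    Job job = Job.getInstance(conf, "word-frequency");  // job name is arbitrary
    // ... the rest of the job setup would follow here
  }
}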
MapReduce
Task Granularity
The Map phase is divided into M pieces, the Reduce phase into R pieces
Ideally both much larger than the number of worker machines
How to set them?
Master makes O(M + R) scheduling decisions
Master keeps O(M * R) status information in memory
For each Map/Reduce task: state (idle/in-progress/completed)
For each non-idle task: identity of worker machine
For each completed Map task: locations and sizes of the R intermediate file regions
R is often constrained by users
The output of each Reduce task ends up in a separate output file
Practical recommendation (Google):
Choose M so that each individual task is roughly 16 – 64 MB of input data
Make R a small multiple of the number of worker machines we expect to use
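A worked example (input size and cluster size assumed for illustration): with 200 GB of input at 64 MB per map task, M ≈ 200 GB / 64 MB ≈ 3,200; with 100 worker machines, R might then be set to a few hundred. For comparison, the original MapReduce paper reports computations with M = 200,000 and R = 5,000 on 2,000 worker machines.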
MapReduce Criticism
David DeWitt and Michael Stonebraker – 2008
1. MapReduce is a step backwards in database access, which should be based on:
Schema describing data structure
Separating schema from the application
Advanced query languages
2. MapReduce is a poor implementation
Instead of indexes it uses brute force
3. MapReduce is not novel (the ideas are more than 20 years old and have since been superseded)
4. MapReduce is missing features common in DBMSs
Indexes, transactions, integrity constraints, views, …
5. MapReduce is incompatible with applications implemented over DBMSs
Data mining, business intelligence, …
Apache Hadoop
Open-source software framework
Runs applications on large clusters of commodity hardware
Multi-terabyte data sets
Thousands of nodes
Implements MapReduce
Derived from Google's MapReduce and Google File System (GFS)
Neither of which is open-source
http://hadoop.apache.org/
Apache Hadoop
Modules
Hadoop Common
Common utilities
Support for other Hadoop modules
Hadoop Distributed File System (HDFS)
Distributed file system
High-throughput access to application data
Hadoop YARN
Framework for job scheduling and cluster resource management
Hadoop MapReduce
YARN-based system for parallel processing of large data sets
HDFS (Hadoop Distributed File System)
Basic Features
Fault-tolerant
Highly scalable
HDFS
Data Characteristics
Assumes:
Streaming data access
Batch processing rather than interactive user access
Large data sets and files
Write-once / read-many
A file, once created, written, and closed, does not need to be changed
Or at least not often
This assumption simplifies data coherency
Optimal applications for this model: MapReduce, web crawlers, …
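A minimal sketch of the write-once / read-many pattern using the HDFS Java API (the file path is made up for illustration):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnce {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up core-site.xml
    FileSystem fs = FileSystem.get(conf);            // HDFS if so configured
    Path file = new Path("/data/crawl/part-00000");  // hypothetical path

    // Write once: create the file, stream the data, close it ...
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeBytes("example record\n");
    }

    // ... then read it many times with streaming access.
    try (BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(file)))) {
      System.out.println(in.readLine());
    }
  }
}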
HDFS
Fault Tolerance
MapReduce
JobTracker
Like a scheduler:
1. A client application is sent to the JobTracker
2. It “talks” to the NameNode (= HDFS master) and locates the TaskTracker (Hadoop client) near the data
3. It moves the work to the chosen TaskTracker node
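From the client's perspective, this exchange starts with an ordinary job submission. A sketch of such a driver, reusing the word-count classes sketched earlier (input and output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word-frequency");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Submission hands the job over to the framework,
    // which schedules the map and reduce tasks near the data.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}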
MapReduce
TaskTracker (Client)
Accepts tasks from JobTracker
Map, Reduce, Combine, …
Input, output paths
Has a number of slots for the tasks
Execution slots available on the machine (or machines on the same rack)
Spawns a separate JVM for execution of a task
Indicates the number of available slots through the heartbeat message to the JobTracker
A failed task is re-executed by the JobTracker
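The number of slots is part of the cluster configuration; for instance (Hadoop 1.x property names, given here as an assumption to be checked against your version), mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum in mapred-site.xml set how many map and reduce tasks a TaskTracker may run concurrently.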