Big Data Technologies - Hadoop
• Critique
• Introduction to the Hadoop Ecosystem
• Open Source
• Software Framework
• Storage
• Large Scale Processing
• Clusters of Commodity Hardware
• Scalability
• Fault tolerance
What is Hadoop
Hadoop is about raising the level of abstraction to create
building blocks for programmers who have
‒ Lots of data to store
Cluster
- Group of computers working together
Node
- Individual computers in the cluster
- Can be of different types depending on the role they play (e.g.
Master, Slave, NameNode, DataNode, etc.)
Daemon
- A computer program that runs as a background process
- Typically long running processes that control system resources and
core functions
- Not under direct control of an interactive user
- Term originated in Unix; Windows equivalent can be a “service”
NameNode start-up:
1. Read the fsimage file from disk and load it into memory
2. Read the actions that are logged in the edits log and apply them to the in-memory image
3. Write the modified in-memory representation back to fsimage
https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Replication
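The same load-replay-checkpoint sequence can be sketched in plain Java. This is an illustrative toy, not NameNode source code: the namespace map, the shape of an edit entry and all names are hypothetical.

// Toy sketch of the start-up sequence above (hypothetical types and names,
// not actual NameNode code).
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NameNodeStartupSketch {
    // Simplified in-memory namespace: file path -> block ids
    private final Map<String, List<Long>> image = new HashMap<>();

    void startUp(Map<String, List<Long>> fsimageOnDisk,
                 List<String[]> editsLog) {           // each edit: {path, blockId}
        image.putAll(fsimageOnDisk);                  // 1. load fsimage into memory
        for (String[] edit : editsLog) {              // 2. replay the edits log
            image.computeIfAbsent(edit[0], p -> new ArrayList<>())
                 .add(Long.parseLong(edit[1]));
        }
        writeCheckpoint();                            // 3. persist the new image
    }

    private void writeCheckpoint() {
        /* write `image` back to the fsimage file and truncate the edits log */
    }
}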
HDFS – NameNode and DataNodes
DataNodes
• Responsible for serving read and
write requests from the file
system’s clients.
• Perform block creation, deletion,
and replication upon instruction
from the NameNode.
• DataNodes manage storage
attached to the nodes that they
run on.
• Internally, a file is split into one or
more blocks and these blocks are
stored in a set of DataNodes.
https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Replication
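To make the "file split into blocks stored on a set of DataNodes" point concrete, here is a small sketch using the public Hadoop FileSystem Java API; the file path is hypothetical and the output formatting is illustrative.

// Sketch: inspect which DataNodes hold each block of a file
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/data.bin")); // hypothetical path
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            // one line per block: byte offset -> DataNodes hosting its replicas
            System.out.println(b.getOffset() + " -> " + String.join(",", b.getHosts()));
        }
    }
}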
HDFS – Data Replication
BLOCKS AND REPLICATION
• Each file is stored as a sequence of blocks – all blocks except the last are of the same size
• Blocks of a file are replicated for fault tolerance
• Block size and replication factor are configurable per file and can be application specified (see the sketch after this list)

KEY CONCEPTS
• Optimization of block placement – rack-aware policy improves write performance without compromising data reliability or read performance
• Replica selection – read requests are served from the replica closest to the reader
• Safemode for the NameNode on start-up
• Persistence of the file system namespace is maintained by a write-ahead journal and checkpoints
‒ Journal transactions are persisted into the EditLog before replying to the client
‒ Checkpoints are periodically written to the fsimage file
‒ Block locations are discovered from DataNodes during startup via block reports
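Since block size and replication factor are per-file and application specified, a client can set them at create time. A minimal sketch with the Hadoop FileSystem Java API; the path, buffer size and chosen values are illustrative, not recommendations.

// Sketch: per-file block size and replication via the HDFS Java API
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/data.bin");   // hypothetical path
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, 128L * 1024 * 1024);
        out.write(new byte[] {1, 2, 3});
        out.close();
        fs.setReplication(file, (short) 2);            // change the factor later
    }
}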
HDFS – Block Size
• Block size is the smallest unit of data that a file system can store
‒ For a 1000 MB file and a block size of 4 KB, you would need 256,000 requests to
get the file; with a 64 MB block size, only 16 requests are needed (worked check below)
‒ The default block size in Hadoop is 64 MB, chosen to optimize space, seek time,
node traffic and metadata on the NameNode
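The request counts above follow from requests = ceil(file size / block size), taking 1 MB = 1024 KB. A quick arithmetic check in Java:

// Number of read requests = ceil(file size / block size)
public class BlockCountCheck {
    public static void main(String[] args) {
        long fileBytes = 1000L * 1024 * 1024;        // 1000 MB
        long smallBlock = 4L * 1024;                 // 4 KB blocks
        long largeBlock = 64L * 1024 * 1024;         // 64 MB blocks
        System.out.println((fileBytes + smallBlock - 1) / smallBlock); // 256000
        System.out.println((fileBytes + largeBlock - 1) / largeBlock); // 16
    }
}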
[Diagram: staging – the client buffers file data into a temporary file; the NameNode supplies DataNode details for storage; Block 1 is replicated across several DataNodes]
MapReduce
Split
• Splits the input data-set into independent chunks
• Provides oversight to the parallel processing
Map
• Map tasks process the chunks of data using the logic defined in the map task
• Map() methods commonly perform functions like filtering and sorting
Shuffle
• Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node
Reduce
• Takes as input the output of the map or shuffle task and combines the pieces into a composite output
• Reduce() methods commonly perform functions like summary operations and aggregations
MapReduce Daemons
JobTracker
• Accepts jobs from clients
• Schedules/assigns TaskTrackers
• Maintains data locality
• Fault monitoring
[Diagram: several clients submit jobs to the JobTracker, which assigns Map Tasks and Reduce Tasks to run on each Data Node]
Solution
• Mapper
‒ For every word in a document, output a key-value pair, e.g. (word, 1)
• Reducer
‒ Sum all occurrences of words and output (word, total_count)
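The classic Hadoop WordCount implements exactly this mapper and reducer. The sketch below follows the standard example from the Hadoop MapReduce tutorial; the class names are conventional, not prescribed.

// Word count with the Hadoop MapReduce Java API
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Mapper: for every word in the input split, emit (word, 1)
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);          // (word, 1)
            }
        }
    }

    // Reducer: sum all occurrences of each word, emit (word, total_count)
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);            // (word, total_count)
        }
    }
}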
Solution
• A naïve solution with basic MR needs two MR jobs (driver sketch below)
‒ MR 1: count all words in the documents
‒ MR 2: count the number of each word and divide it by the total count from MR 1
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36249.pdf
Hadoop Ecosystem
• Execution Engines – MapReduce, Spark, Tez, Giraph
• Resource Management – YARN
• Data Storage and Management – HDFS, HBase, Cassandra, HCatalog
• Data Integration – Sqoop, Flume
• Configuration, Synchronization – ZooKeeper
• Provision, Manage, Monitor – Ambari
• Scheduling – Oozie
Apache Flume
• Distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large
amounts of log data.
• It has a simple and flexible architecture based on
streaming data flows.
• It is robust and fault tolerant with tunable reliability
mechanisms and many failover and recovery
mechanisms.
• It uses a simple, extensible data model that allows
for online analytic applications.