
MITB B.5 Big Data Technologies


Reference Notes

Chapter 2 – Hadoop Overview and Architecture


Agenda Today

• Motivation for Hadoop


• Introduction to Hadoop
• Core Components, Architecture and Operations
‒ HDFS
‒ MapReduce

• Critique
• Introduction to the Hadoop Ecosystem

MITB B.5 Big Data Technologies


Motivation for
Hadoop



Fundamental assumptions invalidated
Assumption: hardware can be reliable
• Hardware with MTTF > lifespan is available only at a premium
• For a 1000-node cluster:
‒ MTTF of 4 years ≈ 5 failures / week
‒ MTTF of 2 years ≈ 10 failures / week (commodity hardware)
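The failure rates quoted above follow from simple arithmetic; a minimal sketch, assuming node failures arrive independently at a rate of 1/MTTF per node:

```python
# Expected weekly node failures in a cluster, assuming each node fails
# independently at a rate of 1/MTTF.
def failures_per_week(nodes, mttf_years, weeks_per_year=52):
    return nodes / (mttf_years * weeks_per_year)

print(round(failures_per_week(1000, 4)))   # ~5 failures per week
print(round(failures_per_week(1000, 2)))   # ~10 failures per week
```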



Fundamental assumptions invalidated
Assumption: machines have identities
• Explicit machine-to-machine communication is not feasible at scale
‒ The verification problem becomes at least as large as the data processing itself
• Replaced by the idea of “some” machine sending data or communication to “some other” machine



Fundamental assumptions invalidated
Assumption: data can be stored on a single machine
• Today’s data sets exceed the capacity of one machine
• Intractable for one machine to compute on
• If each machine stores a small part of each data set, any machine can process part of the data set by reading from local disk



Big Data has changed computation requirements

Failure Support
• Support partial failure
• Graceful degradation of performance

Component Recovery
• On recovery, rejoin the system without requiring a restart

Consistency
• Component failures should not affect the outcome of a job
• No loss of data

Shared Nothing
• Adding load causes only a graceful decline in performance
• Increase in resources yields an increase in load capacity
Traditional approaches limited



“Shared Everything” Distributed Computing Architecture
• “Scale up” architecture
• All processors share the same memory space
• Complexity
‒ Programming model
‒ Data synchronization
• Failures are expensive and need to be managed
• Low speed due to data movement
• Sharing the same system bus chokes bandwidth for memory access
“Shared Nothing” Distributed Computing Architecture
• Data partitioned among nodes
• Computation is local to nodes
• No synchronization – simple implementation
• Data is local and does not have to move across the network
• Designed to handle failure
• Consistent architecture – individual failures do not fail the job


Common Distributed Computing Architectures

Architecture        | Storage / Disk Space | Memory | Compute / Processing
Shared Everything   | Common               | Common | Local
Shared Nothing      | Local                | Local  | Local


Common Distributed Computing Architectures

Shared Everything
• Fault tolerance: Low – single point of failure on the master
• Scalability: Low – communication overhead; resource constraint
• Programming complexity: High – data synchronization and programming model
• Performance: Low – high disk/network traffic; same system bus

Shared Nothing
• Fault tolerance: High – each node is independent
• Scalability: High – relatively easy to add nodes
• Programming complexity: Medium – low synchronization issues; programming model (use-case dependent)
• Performance: High – data and compute are local to the node


Introduction to
Hadoop



What is Hadoop

Apache Hadoop is an open source software framework for


storage and large scale processing of data sets on clusters
of commodity hardware



What is Hadoop

• Open Source
• Software Framework
• Storage
• Large Scale Processing
• Clusters of Commodity Hardware



What is Hadoop

Apache Hadoop provides an integrated storage and


processing fabric that scales horizontally with commodity
hardware and provides fault tolerance through software



What is Hadoop

• Open Source
• Software Framework
• Storage
• Large Scale Processing
• Clusters of Commodity Hardware
• Scalability
• Fault tolerance
What is Hadoop

Hadoop is about raising the level of abstraction: it provides building blocks for programmers who have
‒ lots of data to store,
‒ or lots of data to analyze,
but don’t have the time, skill or resources to become distributed-systems experts and build the infrastructure to handle it.


Evolution of Hadoop



Core Components
and Architecture



Core components of Hadoop – Hadoop Core

• HDFS – designed for storing very large files
• MapReduce – designed for fast and efficient computation of work by taking code to data


HDFS
• Storage layer for Hadoop
• Filesystem written in Java, based on Google’s GFS
• Sits on top of a native filesystem – Linux filesystems such as xfs etc.
• Distributes data when ingested or stored
• Takes care of replication of data
• Designed for a write-once, read-many model

Layered view: HDFS → native OS filesystem → disk storage
Hadoop Cluster Terminology

Cluster
- Group of computers working together
Node
- Individual computers in the cluster
- Can be of different types depending on the role they play (E.g.
Master, Slave, Name Node, Data Node etc.)



Hadoop Cluster Terminology

Daemon
- A computer program that runs as a background process
- Typically long running processes that control system resources and
core functions
- Not under direct control of an interactive user
- Term originated in Unix; Windows equivalent can be a “service”



Simplified view of a Hadoop cluster



How HDFS stores Files

1. Data is split into blocks – a large file becomes Block 1 … Block 4
2. Blocks are replicated and stored on different Data Nodes – each block is held by several nodes (e.g. Block 1 on Data Nodes 1, 2 and 5)
3. The Namenode stores the metadata
• Stores (in RAM) the location information of the files
• Maintains a persistent checkpoint of the metadata in a file stored on disk, called the fsimage file


What happens when the NameNode starts up
• The Namenode is a single point of failure
• Checkpoint data maintained in the fsimage file can be written to local disk as well as remote disk
• High Availability options in Hadoop 2.x support two Namenodes in an active/passive configuration

Start-up sequence:
1. Read the fsimage file from disk and load it into memory
2. Read the actions logged in the edits log and apply them to the in-memory image
3. Write the modified in-memory representation back to fsimage
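The start-up sequence can be sketched in a few lines of Python. This is illustrative only: the dict-based fsimage and the (op, path, value) edit tuples are stand-ins, not Hadoop’s on-disk formats.

```python
# Sketch of NameNode start-up: load the fsimage checkpoint, replay the
# edits log over it, and return the merged namespace (which would then
# be written back to disk as the new fsimage).
def start_namenode(fsimage, edits):
    namespace = dict(fsimage)            # 1. load checkpoint into memory
    for op, path, value in edits:        # 2. replay logged actions
        if op == "create":
            namespace[path] = value
        elif op == "delete":
            namespace.pop(path, None)
    return namespace                     # 3. written back as the new fsimage

fsimage = {"/data/a": "blk_1"}
edits = [("create", "/data/b", "blk_2"), ("delete", "/data/a", None)]
print(start_namenode(fsimage, edits))  # {'/data/b': 'blk_2'}
```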


HDFS – Master Slave Architecture
NameNode
• An HDFS cluster typically consists
of a single NameNode, a master
server that manages the file
system namespace and regulates
access to files by clients
• Executes file system namespace
operations like opening, closing,
and renaming files and directories.
• Determines the mapping of blocks
to DataNodes.

https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Replication
HDFS – NameNode and DataNodes
DataNodes
• Responsible for serving read and
write requests from the file
system’s clients.
• Perform block creation, deletion,
and replication upon instruction
from the NameNode.
• DataNodes manage storage
attached to the nodes that they
run on.
• Internally, a file is split into one or
more blocks and these blocks are
stored in a set of DataNodes.

https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Replication
HDFS – NameNode and DataNodes

File System Namespace


• HDFS supports a traditional
hierarchical file organization.
• A user or an application can create
directories and store files inside these
directories.
• The NameNode maintains the file
system namespace. Any change to the
file system namespace or its
properties is recorded by the
NameNode.
• An application can specify the
replication factor of a file which is
stored by the NameNode.

https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Replication
HDFS – Data Replication

Blocks and Replication
• Each file is stored as a sequence of blocks – all blocks except the last are of the same size
• Blocks of a file are replicated for fault tolerance
• Block size and replication factor are configurable per file and can be application-specified

Key Concepts
• Optimization of block placement – rack-aware policy; improves write performance without compromising data reliability or read performance
• Replica selection – read requests are served from the replica closest to the reader
• Safemode for the NameNode on start-up
• Persistence of the File System Name Space is maintained by a write-ahead journal and checkpoints
‒ Journal transactions are persisted into the EditLog before replying to the client
‒ Checkpoints are periodically written to the fsimage file
‒ Block locations are discovered from DataNodes during startup via block reports
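The rack-aware policy mentioned above can be illustrated with a sketch (not Hadoop’s actual code) of the default placement for replication factor 3: first replica on the writer’s node, second on a node in a different rack, third on a different node in that same remote rack.

```python
import random

# Illustrative rack-aware replica placement for replication factor 3.
# nodes_by_rack maps rack name -> list of node names; the remote rack
# chosen is assumed to contain at least two nodes.
def place_replicas(writer, nodes_by_rack):
    writer_rack = next(r for r, ns in nodes_by_rack.items() if writer in ns)
    remote_rack = random.choice([r for r in nodes_by_rack if r != writer_rack])
    second = random.choice(nodes_by_rack[remote_rack])
    third = random.choice([n for n in nodes_by_rack[remote_rack] if n != second])
    return [writer, second, third]

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", topology))  # e.g. ['n1', 'n3', 'n4']
```

Losing one rack still leaves a live replica, while two of the three replicas share a rack to keep write traffic between racks low.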
HDFS – Block Size
• Block size is the smallest unit of data that a file system can store
‒ A 1 KB file will still take up one block of 64 MB
‒ For a 1000 MB file and a block size of 4 KB, you would need 256,000 requests to get the file; with a 64 MB block size, 16 requests are needed
‒ The default block size in Hadoop is 64 MB, to optimize space, seek time, node traffic and metadata on the Namenode
• Can be configured by the user through the hdfs-site.xml file
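The request counts above are just a ceiling division of file size by block size, which a few lines make explicit:

```python
# Number of block reads needed to fetch a file of a given size,
# as a function of block size.
def requests_needed(file_mb, block_kb):
    file_kb = file_mb * 1024
    return -(-file_kb // block_kb)  # ceiling division

print(requests_needed(1000, 4))          # 256000 requests with 4 KB blocks
print(requests_needed(1000, 64 * 1024))  # 16 requests with 64 MB blocks
```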


HDFS – Small File Problem
• A small file is one which is smaller than the HDFS block size (default 64 MB)
• Memory overhead on the Namenode
‒ Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies about 150 bytes, as a rule of thumb
‒ 10 million files, each using a block, would use about 3 gigabytes of memory
• Performance and seek time
‒ Reading small files causes a lot of seeks and hops across nodes, which is inefficient
‒ Map tasks process a block at a time; small files cause a lot of inefficient map tasks with associated overhead
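The 3 GB figure follows directly from the 150-bytes-per-object rule of thumb; a minimal sketch of the estimate:

```python
# Rule-of-thumb NameNode heap estimate: ~150 bytes per namespace
# object, where each file contributes one file object plus its blocks.
def namenode_memory_gb(files, blocks_per_file=1, bytes_per_object=150):
    objects = files + files * blocks_per_file   # file objects + block objects
    return objects * bytes_per_object / 1e9

print(namenode_memory_gb(10_000_000))  # 3.0 GB for 10 million single-block files
```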


HDFS – Data Flow (Client → Name Node → Data Nodes)

1. The client buffers data in a temporary local staging file until it reaches block size
2. When the temp file reaches block size, the client asks the NameNode for a block list for replication
3. The NameNode inserts the file name into the file system hierarchy, allocates data blocks for it and returns the DataNode details to the client
4. The client flushes the block from the temp location to the specified DataNode with replication details
5. Each DataNode receives the block from the client (or the previous DataNode), stores it and further replicates it down the pipeline; on file closure, remaining un-flushed data is sent to the DataNode and the NameNode is informed
6. The NameNode closes the file and commits it to the persistent store
HDFS – Robustness

Scenario – Data Disk Failure
• Detected via heartbeat messages
‒ Stop I/O to the failed node
‒ Node data made unavailable to HDFS
‒ Re-initiate block replication

Scenario – Data Integrity
• The HDFS client implements checksum checking on the contents of files
• Block-level checksums are stored in the same HDFS namespace

Scenario – Cluster Rebalancing
• Manage disk space
• Additional replica creation on sudden high demand

Scenario – Metadata Disk Failure
• Corruption of the EditLog and FsImage
• Multiple copies of these files, synchronously updated
• Performance degradation for namespace transactions
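The data-integrity scenario can be sketched with a CRC32 check: store a checksum with each block on write, recompute and compare on read. (This is a simplified stand-in; real HDFS keeps per-chunk CRC32 checksums in a hidden sidecar file per block.)

```python
import zlib

# Write path: return the block together with its checksum.
def write_block(data: bytes):
    return data, zlib.crc32(data)

# Read path: recompute the checksum and fail loudly on mismatch, so
# the client can fall back to another replica.
def read_block(data: bytes, stored_checksum: int) -> bytes:
    if zlib.crc32(data) != stored_checksum:
        raise IOError("checksum mismatch - block corrupt, try another replica")
    return data

block, checksum = write_block(b"some block contents")
print(read_block(block, checksum) == b"some block contents")  # True
```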


HDFS – Summary
• Resiliency and Fault Tolerance – Data is distributed in multiple nodes as it is
initially stored in the system
• Data Locality – nodes work on the data that they have locally
• Shared Nothing architecture – nodes talk to each other as little as possible
• Node failures are handled systemically by the file system
• Provides a high level programming abstraction for the distributed system
• HDFS is typically set up for High Availability – two Name Nodes (Active and
Standby)



MapReduce

• Based on the MapReduce programming model, inspired by map/reduce in functional programming languages
• The core Hadoop processing engine before Spark was introduced
• Many existing ecosystem tools are abstractions on top of MR
• Built to take advantage of parallelism
• Extensive and mature fault tolerance built into the framework
• Data locality optimization
MapReduce – Overview

MapReduce Job
• Splits the input data-set into independent chunks
• Provides oversight to the parallel processing

Map Function
• Map tasks process the chunks of data using the logic defined in the map task
• Map() methods commonly perform functions like filtering and sorting

Shuffle Operation
• Worker nodes redistribute data based on the output keys (produced by the “map()” function), such that all data belonging to one key is located on the same worker node

Reduce Function
• Takes as input the output of the map or shuffle task and combines the pieces into a composite output
• Reduce() methods commonly perform functions like summary operations and aggregations
MapReduce Daemons

JobTracker
• Accepts jobs from clients
• Schedules/assigns TaskTrackers
• Maintains data locality
• Fault monitoring

TaskTrackers run the individual Map and Reduce tasks on the worker nodes


Typical MapReduce Cluster

Clients submit jobs to the JobTracker, which assigns Map and Reduce tasks to the TaskTrackers

TaskTracker
• Accepts tasks from the JobTracker
• Performs the actual task (Map or Reduce or both)
• Sends periodic heartbeats to the JobTracker
• Sends data and compute availability to the JobTracker


Typical MapReduce Cluster

• The JobTracker typically runs alongside the NameNode and Secondary NameNode on the master nodes
• Each worker node runs a TaskTracker co-located with a DataNode, so compute can be scheduled next to the data


MapReduce – Example of Word Count
• Input: Large number of text documents
• Task: compute word count across all the documents

Solution
• Mapper
‒ For every word in a document, output key value pair, e.g. (word, 1)
• Reducer
‒ Sum all occurrences of words and output (word, total_count)

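The word-count solution above can be sketched as an in-memory simulation of the three phases: map emits (word, 1) pairs, the shuffle groups pairs by key, and reduce sums each group.

```python
from collections import defaultdict

# Map phase: emit a (word, 1) pair for every word in every document.
def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield word, 1

# Shuffle: group all values by key, as the framework would across nodes.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: sum the occurrences for each word.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data", "big hadoop"]
print(reduce_phase(shuffle(map_phase(docs))))  # {'big': 2, 'data': 1, 'hadoop': 1}
```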


MapReduce – Example of Word Frequency
• Input: Large number of text documents
• Task: Compute word frequency across all the documents, where frequency uses the total word count

Solution
• A naïve solution with basic MR needs two MR jobs
‒ MR 1: count all words in the documents
‒ MR 2: count occurrences of each word and divide by the total count from MR 1
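The naïve two-job solution can be sketched in-memory, with each function standing in for one MR job:

```python
from collections import Counter

# Job 1: count all words across the documents.
def job1_total_words(documents):
    return sum(len(doc.split()) for doc in documents)

# Job 2: count each word, then divide by job 1's total.
def job2_frequencies(documents, total):
    counts = Counter(w for doc in documents for w in doc.split())
    return {word: count / total for word, count in counts.items()}

docs = ["big data", "big hadoop"]
total = job1_total_words(docs)        # 4
print(job2_frequencies(docs, total))  # {'big': 0.5, 'data': 0.25, 'hadoop': 0.25}
```

The second job cannot start until the first finishes, which is exactly why chaining MR jobs this way is considered naïve.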


MapReduce



MapReduce – Node View



MapReduce Roles





MapReduce in Google
The Googler’s hammer for 80% of data crunching¹

• Large-scale web search indexing
• Clustering problems for Google News
• Producing reports for popular web queries, e.g. Google Trends
• Processing of satellite imagery data
• Language model processing for statistical machine translation
• System health monitoring

Note:
1. Data for the period ~2006
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36249.pdf


Critique of Hadoop



Key Advantages of Hadoop



Key Limitations of Hadoop



Some commercial Hadoop deployments
Yahoo! cluster – 2010
- 70 million files, 80 million blocks
- 15 PB capacity
- 4000+ nodes, 24,000 clients
- 50 GB heap for the NN

Data warehouse Hadoop cluster at Facebook – 2010
- 55 million files, 80 million blocks; estimated 200 million objects (files + blocks)
- 21 PB capacity
- 2000 nodes, 30,000 clients
- 108 GB heap for the NN (should allow for 400 million objects)

Analytics cluster at eBay (Ares)
- 77 million files, 85 million blocks
- 1000 nodes; 24 TB of local disk storage, 72 GB of RAM, 12-core CPU per node
- Cluster capacity of 19 PB raw disk space
- Runs up to 38,000 MR tasks simultaneously

Source: Konstantin Shvachko, HDFS Design Principles


Introduction to the
Hadoop Stack



Introduction to the Hadoop Stack

Data Access
• Scripting: Pig
• SQL: Hive, Impala, Spark
• ML: Mahout
• Streaming: Spark Streaming, Storm
• Search: Solr

Execution Engines
• MapReduce, Spark, Tez, Giraph

Resource Management
• YARN

Data Storage and Management
• HDFS, HBase, Cassandra, HCatalog

Data Integration
• Sqoop, Flume

Operations and Management
• Configuration, Synchronization: ZooKeeper
• Provision, Manage, Monitor: Ambari
• Scheduling: Oozie


Introduction to the Hadoop Stack – Data Integration
Apache Sqoop
• System for bulk data transfer between HDFS and structured datastores such as RDBMSs
• Supports both import into HDFS and export back to the RDBMS (unlike Flume, which ingests streaming log data)

Apache Flume
• Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
• It has a simple and flexible architecture based on streaming data flows
• It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms
• It uses a simple extensible data model that allows for online analytic applications


Introduction to the Hadoop Stack – Data Storage
HDFS
• Distributed file system to store large files across multiple machines
HBase
• A scalable, distributed database that supports structured data storage for large tables
• Inspired by Google BigTable; a non-relational distributed database
• Random, real-time read/write operations on column-oriented, very large tables
Cassandra
• A scalable multi-master database with no single points of failure
HCatalog
• HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS)
Introduction to the Hadoop Stack – Resource Management
YARN
• A framework for job scheduling and cluster resource management
• YARN stands for “Yet Another Resource Negotiator”
• Facilitates writing arbitrary distributed processing frameworks and applications
• YARN’s execution model is more generic than the earlier MapReduce implementation
• YARN can run applications that do not follow the MapReduce model, unlike the original Apache Hadoop MapReduce (also called MR1)
• Hadoop YARN takes Apache Hadoop beyond MapReduce for data processing


Introduction to the Hadoop Stack – Execution Engines
MapReduce
• Programming model for processing large data sets with a parallel, distributed algorithm on a cluster
Spark
• Fast and general compute engine for Hadoop; provides a simple and expressive programming model that supports a wide range of applications (ETL, ML, stream processing, and graph computation)
Tez
• Generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases
Giraph
• Iterative graph processing system built for high scalability
• Currently used at Facebook to analyze the social graph formed by users and their connections


Introduction to the Hadoop Stack – Data Access
Scripting (Pig)
• A high-level data-flow language and execution framework for parallel computation
• Pig uses MapReduce to execute all of its data processing
SQL (Hive)
• A data warehouse infrastructure that provides data summarization and ad hoc querying (SQL-like)
Streaming (Storm)
• CEP processor; distributed real-time computation system for processing fast, large streams of data
• Storm is an architecture based on the master-workers paradigm
Search (Solr)
• Search and navigation engine providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more
Introduction to the Hadoop Stack – Operations and
Management
Configuration, Synchronization (ZooKeeper)
• A high-performance coordination service for distributed
applications developed at Yahoo! Research
Provision, Manage, Monitor (Ambari)
• Web-based tool for provisioning, managing, and
monitoring Apache Hadoop clusters
• Ambari also provides a dashboard for viewing cluster
health such as heatmaps
• Ability to view MapReduce, Pig and Hive applications
visually along with features to diagnose their
performance characteristics in a user-friendly manner.
Scheduling (Oozie)
• Workflow scheduler system for MR jobs using DAGs.
Oozie Coordinator can trigger jobs by time (frequency)
and data availability



Introduction to the Hadoop Stack – Commercial
Distributions



Introduction to the Hadoop Stack – Hortonworks Stack



Introduction to the Hadoop Stack – Cloudera Stack



References
Standard Text
• Hadoop: The Definitive Guide, 4th Edition; Tom White; O’Reilly Media, Inc.
Other References, Papers and Articles
• 5 Common Questions About Apache Hadoop
  http://blog.cloudera.com/blog/2009/05/5-common-questions-about-hadoop/
• HDFS Architecture Guide
  https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Replication
• Small Files and Block Size
  http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
• MapReduce – The Programming Model and Practice
  http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36249.pdf
