BIA BigData Overview
[Figure: projected data growth from 2010 to 2020 – roughly 50x, to 35 ZB and counting; 30 billion RFID sensors; 80% of the world's data is unstructured]
What is Apache Hadoop?
A flexible framework for storing and processing large volumes of data
Inspired by Google technologies (MapReduce, GFS, BigTable, …)
Initiated at Yahoo
CPU Speeds:
– 1990 – 44 MIPS at 40 MHz
– 2010 – 147,600 MIPS at 3.3 GHz
RAM Memory
– 1990 – 640K conventional memory (256K extended memory recommended)
– 2010 – 8-32GB (and more)
Disk Capacity
– 1990 – 20MB
– 2010 – 1TB
Disk Transfer Speed (reads and writes) – not much improvement in the last 7-10 years, currently
around 70 – 80 MB/sec
The file system is built from a cluster of data nodes, each of which serves
up blocks of data over the network using a block protocol specific to HDFS.
Enables applications to work with thousands of nodes and petabytes
of data in a highly cost effective manner
CPU + storage disks = “node”
Each node has a Linux O/S
Nodes can be combined into clusters
New nodes can be added as needed without reconfiguring the existing cluster
[Figure: a logical file divided into four numbered blocks, with each block stored on several different nodes in the cluster]
(Very quick) Introduction to MapReduce
Driving principles
– Data is stored across the entire cluster
– Programs are brought to the data, not the data to the program: rather than bringing the data to your programs, as in traditional programming, you write your program in a specific way that allows it to be moved to the data.
In a traditional DW you already know what you can ask, while in
Hadoop, you dump all the raw data into HDFS and then start
asking the questions.
[Figure: traditional approach – a schema filters the data before storage (pre-filtered data); Hadoop approach – unfiltered raw data goes straight into storage, and the schema is applied at read time to produce the output]
HDFS
• Distributed
• Reliable
• Commodity gear
MapReduce
• Parallel Programming
• Fault Tolerant
HDFS – Architecture
Master / Slave architecture
CPU + storage disks = “node”; each node has a Linux O/S
Master: NameNode
– manages the file system namespace and metadata
  • FsImage
  • EditLog
– regulates client access to files
– holds the replication factors and the properties and addresses of the data
Slave: DataNode
– many per cluster
– manages storage attached to the node
– periodically reports status to the NameNode
– each DataNode holds blocks of data
[Figure: the NameNode maps File1 to blocks a–d; the blocks are replicated across four DataNodes]
Hadoop Distributed File System (HDFS)
Files split into blocks
Data in a Hadoop cluster is broken down into smaller pieces (called blocks).
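The block math is simple: a file occupies as many blocks as its size divided by the block size, rounded up. A minimal sketch (the 128 MB figure is assumed here as the common modern default; the block size is configurable):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed default block size of 128 MB

def blocks_needed(file_size_bytes, block_size=BLOCK_SIZE):
    # A file is stored as ceil(size / block_size) blocks;
    # the last block may be smaller than block_size.
    return math.ceil(file_size_bytes / block_size)

# e.g. a 1 GB file occupies 8 full 128 MB blocks
one_gb_blocks = blocks_needed(1024 ** 3)
```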
HDFS – Replication
Blocks of data are replicated to multiple nodes
– Behavior is controlled by replication factor, configurable per file
– Default is 3 replicas
Common case:
– one replica on one node in the local rack
– another replica on a different node in the local rack
– and the last on a different node in a different rack
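The common-case placement above can be sketched in a few lines. This is a deliberately simplified illustration of the rule as stated, not HDFS's actual placement code (which also weighs node load and the writer's location):

```python
def place_replicas(local_rack, remote_rack, replication=3):
    # Common-case rule: 1st replica on a node in the local rack,
    # 2nd on a different node in the same rack,
    # 3rd on a node in a different rack.
    placement = [local_rack[0], local_rack[1], remote_rack[0]]
    return placement[:replication]

nodes = place_replicas(["rack1-node1", "rack1-node2"], ["rack2-node1"])
```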
Summary
The Hadoop Distributed File System (HDFS) is where Hadoop stores
its data. This file system spans all the nodes in a cluster. Effectively,
HDFS links together the data that resides on many local nodes, making
the data part of one big file system. Furthermore, HDFS assumes nodes
will fail, so it replicates a given chunk of data across multiple nodes to
achieve reliability. The degree of replication can be customized by the
Hadoop administrator or programmer. However, the default is to
replicate every chunk of data across 3 nodes: 2 on the same rack, and
1 on a different rack.
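The cluster-wide replication degree is normally set in the hdfs-site.xml configuration file. A minimal sketch using the standard `dfs.replication` property (the per-file override mentioned above is applied when a file is written):

```xml
<!-- hdfs-site.xml: cluster-wide default replication factor -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```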
MapReduce Explained
MapReduce is a software framework introduced by Google to support
distributed computing on large data sets across clusters of computers.
This is essentially a representation of the divide-and-conquer processing
model, where your input is split into many small pieces and the pieces are
processed in parallel on the Hadoop nodes (the map step).
Once these pieces are processed, the results are distilled (in the reduce
step) down to a single answer.
• "Map" step: Each worker node applies the "map()" function to its
local data and writes the output to temporary storage. A master node
ensures that of the redundant copies of input data, only one is
processed.
• "Shuffle" step: Worker nodes redistribute data based on the output
keys (produced by the "map()" function), such that all data belonging to
one key is located on the same worker node.
• "Reduce" step: Worker nodes now process each group of output data,
per key, in parallel.
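The three steps above can be simulated in plain Python with the classic word-count example. This is a single-process sketch of the data flow, not Hadoop code; the function names are illustrative:

```python
from collections import defaultdict

def map_step(record):
    # "Map": emit (key, value) pairs from one local input record
    for word in record.split():
        yield word.lower(), 1

def shuffle_step(pairs):
    # "Shuffle": group all values belonging to one key together
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    # "Reduce": distill each key's group down to a single answer
    return key, sum(values)

splits = ["the quick brown fox", "the lazy dog the fox"]
mapped = [pair for split in splits for pair in map_step(split)]
counts = dict(reduce_step(k, v) for k, v in shuffle_step(mapped).items())
```

In a real cluster each `map_step` runs on the node holding that input split, and the shuffle moves data over the network so each reducer sees one key's complete group.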
© 2013 IBM Corporation
Introduction to MapReduce
Driving principles
– Data is stored across the entire cluster
– Programs are brought to the data, not the data to the
program
Data is stored across the entire cluster (the DFS) and replicated
– Blocks of a single file are distributed across the cluster
– The default HDFS block size is 128 MB (64 MB in older Hadoop releases)
[Figure: a logical file divided into four numbered blocks, with each block stored on several different nodes in the cluster]
MapReduce Overview
Results can be written to HDFS or to a database.
"If you write your programs in a special way," the programs can be
brought to the data. This special way is called MapReduce, and it
involves breaking your program down into two discrete parts: Map
and Reduce.