Big Data Overview
February 5, 2024 © 2013 IBM Corporation
Big Data is a term that
applies to data that can’t
be analyzed or processed
using traditional means.
Increasingly, organizations are facing more challenges related to Big Data.
They have access to a wealth of data, but they don't know how to get value
out of it, because it sits in its rawest, often unstructured, form.
Characteristics of Big Data
Volume: addressing the rapidly increasing volume of data available
Velocity (speed of data generation): addressing the analysis and handling of
streaming data that arrives very fast and changes rapidly
Variety: addressing the different forms of data available: videos,
tweets, Facebook posts, etc.
Veracity: another way to think of veracity is "accuracy," "fidelity," or
"truthfulness." It is critical when you want to actually DO something
with all the data you collected.
Characteristics of Big Data
The 4 V's = Volume, Velocity, Variety, Veracity
– Cost-efficiently processing the growing Volume: from 2010 to 2020, an
estimated 50x growth to 35 ZB
– Responding to the increasing Velocity: 30 billion RFID sensors and counting
– Collectively analyzing the broadening Variety: 80% of the world's data is
unstructured
– Establishing the Veracity of big data sources: 1 in 3 business leaders don't
trust the information they use to make decisions
There are five main Big Data use cases
that we feel represent the majority of
activity in this space that is of
commercial interest.
The 5 Key Big Data Use Cases
– Big Data Exploration: find, visualize, and understand all big data to
improve decision making
– Enhanced 360° View of the Customer: extend existing customer views by
incorporating additional internal and external data sources
– Security/Intelligence Extension: lower risk, detect fraud, and monitor
cyber security in real time
– Operations Analysis: analyze a variety of machine data for improved
business results
– Data Warehouse Augmentation: integrate big data and data warehouse
capabilities to increase operational efficiency
Traditional RDBMSs can't handle the huge volume of the BIG DATA
explosion, and the data is not only big but unstructured.
That is why Hadoop is there; it was built on technology started by
Google: the distributed file system GFS, the "Google File System."
Data sits in different nodes on a cluster, and whenever you want to
add more data you can simply add a new node.
Hadoop
IBM introduces BigInsights
based on Hadoop
What is Apache Hadoop?
A flexible framework for storing and processing large volumes of data
Inspired by Google technologies (MapReduce, GFS, BigTable, …)
Initiated at Yahoo
Hardware improvements through the years...
CPU Speeds:
– 1990 – 44 MIPS at 40 MHz
– 2010 – 147,600 MIPS at 3.3 GHz
RAM Memory
– 1990 – 640K conventional memory (256K extended memory recommended)
– 2010 – 8-32GB (and more)
Disk Capacity
– 1990 – 20MB
– 2010 – 1TB
Disk Latency (speed of reads and writes) – not much improvement in the last 7-10 years; currently
around 70-80 MB/sec
How long will it take to read 1TB of data?
1TB (at 80 MB/sec):
1 disk - 3.4 hours
10 disks - 20 min
100 disks - 2 min
1000 disks - 12 sec
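The arithmetic behind these figures can be checked with a short Python snippet (using decimal units, 1 TB = 1,000,000 MB):

```python
# Time to read 1 TB at a sustained 80 MB/s, spread across N disks in parallel.
TB_IN_MB = 1_000_000      # 1 TB expressed in MB (decimal units)
RATE_MB_PER_S = 80        # per-disk sustained read rate

for disks in (1, 10, 100, 1000):
    seconds = TB_IN_MB / (RATE_MB_PER_S * disks)
    print(f"{disks:4d} disk(s): {seconds:8.1f} s (~{seconds / 3600:.2f} h)")
```

One disk takes 12,500 seconds (about 3.47 hours), and 1000 disks take 12.5 seconds, matching the slide's numbers.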
Parallel data processing is the answer!
The GRID computing idea is to combine the processing power of multiple
computers and bring the data to where processing capacity is available.
A distributed workload does the opposite: it brings the processing to the
data. This is great, but writing applications that do this is very hard.
You need to worry about how to exchange data between the different nodes,
and you need to think about what happens if one of those nodes goes down.
It is really difficult: you spend more time coding the passing of data
than dealing with the problem itself.
Hadoop, as we will see next, is the answer to parallel data processing
without the issues of GRID-style distributed workloads.
You may be familiar with OLTP (Online Transaction Processing), where
structured data, such as a relational database, is accessed randomly.
For example, when you access your bank account.
You may also be familiar with OLAP (Online Analytical Processing) or DSS
(Decision Support Systems), where multidimensional access to structured data,
such as a relational database, generates reports that provide business
intelligence.
Now, you may not be as familiar with the concept of "Big Data".
Big Data is a term used to describe large collections of data (also
known as datasets) that may be unstructured and that grow so large and
so quickly that they are difficult to manage with a regular database.
Hadoop is optimized to handle massive amounts of data, structured or
unstructured, using commodity hardware, that is, relatively inexpensive
computers.
Hadoop replicates its data across different computers, so that if
one goes down, the data is processed on one of the replicated
computers.
Hadoop is not used for OLTP nor
OLAP, but for Big Data.
So Hadoop is NOT a replacement for
an RDBMS.
Hadoop is not for all types of work
Not good when work cannot be parallelized
Not good for processing lots of small files
Not good for intensive calculations with little data
What is Hadoop?
Consists of 3 sub-projects:
−Hadoop Distributed File System (HDFS)
−MapReduce
−Hadoop Common
Two Key Aspects of Hadoop
Hadoop Distributed File System = HDFS
– Where Hadoop stores data
– A file system that spans all the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to make them
into one big file system
– Hadoop replicates its data across different computers, so that if one goes
down, the data is processed on one of the replicated computers.
MapReduce framework
– MapReduce is a software framework to support distributed processing on
large data sets across clusters of computers.
– How Hadoop understands and assigns work to the nodes
(machines)
What is the Hadoop Distributed File System?
HDFS manages the storage of data across multiple nodes.
HDFS assumes nodes will fail, so it achieves reliability by replicating data
across multiple nodes.
The degree of replication can be customized by the Hadoop administrator; by
default, every chunk of data is replicated across 3 nodes: 2 on the same
rack and 1 on a different rack.
Because of this replication, HDFS does not require RAID storage on hosts.
Data nodes can talk to each other to rebalance data and move copies around.
The file system is built from a cluster of data nodes, each of which serves
up blocks of data over the network using a block protocol specific to HDFS.
Enables applications to work with thousands of nodes and petabytes
of data in a highly cost effective manner
CPU + storage disks = “node”
Each node has a Linux O/S
Nodes can be combined into clusters
New nodes can be added as needed without changing data formats or how
data is loaded
Data is stored across the entire cluster (the DFS)
The Distributed File System (DFS) is responsible for spreading data across
the cluster, by making the entire cluster look like one giant file system.
When a file is written to the cluster, blocks of the file are spread out and
replicated across the whole cluster.
– The entire cluster participates in the file system
– Blocks of a single file are replicated across the cluster
[Figure: a logical file is split into blocks 1-4, which are replicated and
spread across the nodes of the cluster]
(Very quick) Introduction to MapReduce
Driving principles
– Data is stored across the entire cluster
– Programs are brought to the data, not the data to the program
– Rather than bringing the data to your program, as you do in traditional
programming, you write your program in a specific way that allows it to be
moved to the data.
The Distributed File System (DFS) is at the heart of MapReduce.
MapReduce chooses which node on the cluster processes each block of data,
preferring nodes that hold a replica of that block.
In a traditional DW you already know in advance what you can ask, while in
Hadoop you dump all the raw data into HDFS and then start asking the
questions.
Traditional warehouses are mostly ideal for analyzing structured data from
various systems and producing insights. A Hadoop-based platform is well
suited to dealing with unstructured data.
In addition, when you consider where data should be stored, you need to
understand how data is stored today. When storing data in a traditional
data warehouse, this data typically must shine with respect to quality
(good data): it is cleaned up via cleansing, matching, modeling, and other
services before it is ready for analysis. The data that lands in the
warehouse is of high value, and it has a broad purpose: it is going to be
used in reports and dashboards where the accuracy of that data is key.
In contrast, big data repositories rarely undergo the full quality control
that data receives before being injected into a warehouse. We could say
that data warehouse data is trusted enough to be "public," while
Hadoop data isn't as trusted.
Big Difference:
– Regular database: raw data passes through a schema that filters it before
storage; storage holds pre-filtered data, which is then queried for output
(schema-on-write).
– Big Data (Hadoop): raw data goes straight into storage, unfiltered; a
schema is applied as a filter at read time to produce the output
(schema-on-read).
RDBMS vs Hadoop
– Data sources: structured data with known schemas (RDBMS) vs. unstructured
and structured (Hadoop)
– Data type: records, long fields, objects, XML (RDBMS) vs. files (Hadoop)
– Data updates: updates allowed (RDBMS) vs. only inserts and deletes (Hadoop)
– Language: SQL & XQuery (RDBMS) vs. Pig (Pig Latin), Hive (HiveQL), Jaql
(Hadoop)
– Processing type: quick response, random access (RDBMS) vs. batch
processing (Hadoop)
– Compression: sophisticated data compression (RDBMS) vs. simple file
compression (Hadoop)
– Hardware: enterprise hardware (RDBMS) vs. commodity hardware (Hadoop)
– Data access: random access (indexing) (RDBMS) vs. access files only
(streaming) (Hadoop)
– History: ~40 years of innovation (RDBMS) vs. less than 5 years old (Hadoop)
– Community: widely used, abundant resources (RDBMS) vs. not widely adopted
yet (Hadoop)
Two Key Aspects of Hadoop
HDFS MapReduce
• Distributed • Parallel Programming
• Reliable • Fault Tolerant
• Commodity gear
Hadoop Distributed File System (HDFS)
Distributed, scalable, fault tolerant, high throughput
Data access through MapReduce
Files split into blocks
3 replicas for each piece of data by default
Can create, delete, copy, but NOT update
Designed for streaming reads, not random access
Data locality: processing data where it is physically stored, to decrease
the transmission of data
HDFS – Architecture
Master / Slave architecture
CPU + storage disks = "node"; each node has a Linux O/S
Master: NameNode
– manages the file system namespace and metadata
• FsImage
• EditLog
– regulates client access to files
– holds the replication factors and the properties and addresses of the data
Slave: DataNode
– many per cluster
– manages storage attached to the node
– periodically reports status to the NameNode
– each DataNode holds blocks of data
[Figure: File1 is split into blocks a, b, c, d, each replicated across the
DataNodes]
Hadoop Distributed File System (HDFS)
Files split into blocks
Data in a Hadoop cluster is broken down into smaller pieces (called blocks).
Data locality: processing data where it is physically stored, to decrease
the transmission of data
HDFS – Replication
Blocks of data are replicated to multiple nodes
– Behavior is controlled by replication factor, configurable per file
– Default is 3 replicas
Common case:
– one replica on one node in the local rack
– another replica on a different node in the local rack
– and the last on a different node in a different rack
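As a toy illustration of the placement policy described above (the function and data-structure names here are invented for this sketch; real placement decisions are made by the HDFS NameNode):

```python
import random

def place_replicas(racks):
    """Toy model of the default policy: two replicas on nodes in one rack,
    and the third on a node in a different rack. `racks` maps a rack name
    to its list of node names. Purely illustrative."""
    local_rack = random.choice(sorted(racks))
    remote_rack = random.choice([r for r in sorted(racks) if r != local_rack])
    # Two distinct nodes in the local rack, one node in the remote rack.
    first, second = random.sample(racks[local_rack], 2)
    third = random.choice(racks[remote_rack])
    return [(local_rack, first), (local_rack, second), (remote_rack, third)]

racks = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
print(place_replicas(racks))  # e.g. [('rack1', 'n2'), ('rack1', 'n3'), ('rack2', 'n5')]
```

Whatever the random choices, the result always spans exactly two racks with three distinct nodes, mirroring the "2 local + 1 remote" rule.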
Summary
The Hadoop Distributed File System (HDFS) is where Hadoop stores
its data. This file system spans all the nodes in a cluster. Effectively,
HDFS links together the data that resides on many local nodes, making
the data part of one big file system. Furthermore, HDFS assumes nodes
will fail, so it replicates a given chunk of data across multiple nodes to
achieve reliability. The degree of replication can be customized by the
Hadoop administrator or programmer; by default, every chunk of data is
replicated across 3 nodes: 2 on the same rack and 1 on a different rack.
MapReduce
MapReduce Explained
MapReduce is a software framework introduced by Google to support
distributed computing on large data sets across clusters of computers.
It is essentially a divide-and-conquer processing model: your input is
split into many small pieces that are distributed to the Hadoop nodes
(the map step) and processed in parallel.
Once these pieces are processed, the results are distilled (in the reduce
step) down to a single answer.
• "Map" step: each worker node applies the map() function to its local
data and writes the output to temporary storage. A master node ensures
that, of the redundant copies of the input data, only one is processed.
• "Shuffle" step: worker nodes redistribute data based on the output
keys (produced by the map() function), such that all data belonging to
one key ends up on the same worker node.
• "Reduce" step: worker nodes now process each group of output data,
per key, in parallel.
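The three steps above can be sketched in a minimal, single-process Python program using word counting, the classic MapReduce example (the function names here are ours, not a real Hadoop API):

```python
from collections import defaultdict

def map_step(block):
    """'Map': emit a (word, 1) pair for every word in one block of input."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle_step(pairs):
    """'Shuffle': group values by key, as if routing each key to one worker."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(groups):
    """'Reduce': collapse each key's list of values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

blocks = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for block in blocks for pair in map_step(block)]
counts = reduce_step(shuffle_step(mapped))
print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

In real Hadoop, each `map_step` call would run on the node holding that block, and the shuffle would move data across the network so each reducer sees all values for its keys.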
Introduction to MapReduce
Driving principles
– Data is stored across the entire cluster
– Programs are brought to the data, not the data to the
program
Data is stored across the entire cluster (the DFS) and replicated
– Blocks of a single file are distributed across the cluster
– The HDFS block size used by MapReduce is large: 64 MB by default
(128 MB in later Hadoop versions)
MapReduce Overview
Map → Shuffle → Reduce
Input: the distributed file system (HDFS), with data in blocks
Output: results can be written to HDFS or a database
"If you write your programs in a special way," the programs can be
brought to the data. This special way is called MapReduce, and it
involves breaking your program down into two discrete parts: Map
and Reduce.
A mapper is typically a relatively small program with a relatively
simple task: it is responsible for reading a portion of the input data,
interpreting, filtering, or transforming the data as necessary, and
then finally producing a stream of <key, value> pairs.
As shown in the diagram, the MapReduce environment automatically takes
your small "map" program (the blue boxes) and pushes it out to every
machine that has a block of the file you are trying to process. This
means that the bigger the file and the bigger the cluster, the more
mappers get involved in processing the data! That's a pretty powerful
idea.
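The mapper described above (small, streaming, emitting <key, value> pairs) can be sketched as a Python generator; this is an illustration of the idea, not Hadoop's actual Java API:

```python
def mapper(lines):
    """Reads a portion of the input, filters and transforms it as needed,
    and yields a stream of (key, value) pairs."""
    for line in lines:
        line = line.strip()
        if not line:                  # filtering: skip blank lines
            continue
        for word in line.split():     # transforming: split into words
            yield (word.lower(), 1)   # the (key, value) stream

pairs = list(mapper(["Hello world", "", "hello Hadoop"]))
# pairs == [('hello', 1), ('world', 1), ('hello', 1), ('hadoop', 1)]
```

Because it is a generator, the mapper never needs the whole input in memory; it processes its block record by record, which is exactly why mappers scale out so well.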
Notes
The main objective of Big Data is distributed processing, and
distributed storage follows from that, "not the opposite."
Normally we bring the data to the mainframe processor and process it
there, but here we keep the data on the nodes ("maybe a PC with a
Linux O/S") and process it in place, by sending the program over
there ("MapReduce").
What comes after:
tools for reporting analysis ("to describe
what happened"), such as Cognos,
and
tools for predictive analysis ("to
describe what may happen"),
such as IBM SPSS and IBM SPSS Modeler
Questions?