Big Data Overview
February 5, 2024 © 2013 IBM Corporation
Big Data is a term that
applies to data that can’t
be analyzed or processed
using traditional means.
Increasingly, organizations are facing more challenges related to Big Data.
They have access to a wealth of data, but they don't know how to get value
out of it, because it sits in its rawest, often unstructured, form.
Characteristics of Big Data
Volume: addressing the rapidly increasing volume of data available
Velocity (speed of data generation): addressing the analysis and handling of
streaming data that arrives very fast and changes rapidly
Variety: addressing the different forms of data available: videos,
tweets, Facebook posts, etc.
Veracity: another way to think of veracity is "accuracy," "fidelity," or
"truthfulness." It is critical when you want to actually DO something
with all the data you collected.
Characteristics of Big Data
The 4 V's = Volume, Velocity, Variety, Veracity
– Cost-efficiently processing the growing Volume: from 2010 to 2020, an
estimated 50x growth to 35 ZB
– Responding to the increasing Velocity: 30 billion RFID sensors and counting
– Collectively analyzing the broadening Variety: 80% of the world's data is
unstructured
– Establishing the Veracity of big data sources: 1 in 3 business leaders don't
trust the information they use to make decisions
There are five main Big Data use cases
that we feel represent the majority of
activity in this space that is of
commercial interest.
The 5 Key Big Data Use Cases
– Big Data Exploration: find, visualize, and understand all big data to
improve decision making
– Enhanced 360° View of the Customer: extend existing customer views by
incorporating additional internal and external data sources
– Security/Intelligence Extension: lower risk, detect fraud, and monitor
cyber security in real time
– Operations Analysis: analyze a variety of machine data for improved
business results
– Data Warehouse Augmentation: integrate big data and data warehouse
capabilities to increase operational efficiency
Traditional RDBMSs can't handle the huge volume of the BIG DATA
explosion, and the data is not only big but unstructured.
That is why Hadoop is there; it was built on technology started by
Google: the distributed file system GFS, the "Google File System."
Data sits in different nodes on a cluster, and whenever you want to
add more data you can simply add a new node.
Hadoop
IBM introduces BigInsights
based on Hadoop
What is Apache Hadoop?
A flexible framework for storing and processing large volumes of data
Inspired by Google technologies (MapReduce, GFS, BigTable, …)
Initiated at Yahoo
Hardware improvements through the years...
CPU Speeds:
– 1990 – 44 MIPS at 40 MHz
– 2010 – 147,600 MIPS at 3.3 GHz
RAM Memory
– 1990 – 640K conventional memory (256K extended memory recommended)
– 2010 – 8-32GB (and more)
Disk Capacity
– 1990 – 20MB
– 2010 – 1TB
Disk Latency (speed of reads and writes) – not much improvement in the last 7-10 years; currently
around 70-80 MB/sec
How long will it take to read 1TB of data?
1TB (at 80 MB/sec):
1 disk - 3.4 hours
10 disks - 20 min
100 disks - 2 min
1000 disks - 12 sec
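The arithmetic behind these figures can be checked with a short Python snippet (using decimal units, 1 TB = 1,000,000 MB):

```python
# Time to read 1 TB at a sustained 80 MB/s, spread across N disks in parallel.
TB_IN_MB = 1_000_000      # 1 TB expressed in MB (decimal units)
RATE_MB_PER_S = 80        # per-disk sustained read rate

for disks in (1, 10, 100, 1000):
    seconds = TB_IN_MB / (RATE_MB_PER_S * disks)
    print(f"{disks:4d} disk(s): {seconds:8.1f} s (~{seconds / 3600:.2f} h)")
```

One disk takes 12,500 seconds (about 3.47 hours), and 1000 disks take 12.5 seconds, matching the slide's numbers.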
Parallel data processing is the answer!
The GRID computing idea is to combine the processing power of multiple
computers and bring the data to where processing capacity is available.
A distributed workload does the opposite: it brings the processing to the
data. This is great, but writing applications that do this is very hard.
You need to worry about how to exchange data between the different nodes,
and you need to think about what happens if one of those nodes goes down.
It is really difficult: you spend more time coding the passing of data
than dealing with the problem itself.
Hadoop, as we will see next, is the answer to parallel data processing
without the issues of GRID-style distributed workloads.
You may be familiar with OLTP (Online Transaction Processing), where
structured data, such as a relational database, is accessed randomly.
For example, when you access your bank account.
You may also be familiar with OLAP (Online Analytical Processing) or DSS
(Decision Support Systems), where multidimensional access to structured data,
such as a relational database, generates reports that provide business
intelligence.
Now, you may not be as familiar with the concept of "Big Data".
Big Data is a term used to describe large collections of data (also
known as datasets) that may be unstructured and that grow so large and
so quickly that they are difficult to manage with a regular database.
Hadoop is optimized to handle massive amounts of data, structured or
unstructured, using commodity hardware, that is, relatively inexpensive
computers.
Hadoop replicates its data across different computers, so that if
one goes down, the data is processed on one of the replicated
computers.
Hadoop is not used for OLTP nor
OLAP, but for Big Data.
So Hadoop is NOT a replacement for
an RDBMS.
Hadoop is not for all types of work
Not good when work cannot be parallelized
Not good for processing lots of small files
Not good for intensive calculations with little data
What is Hadoop?
Consists of 3 sub-projects:
−Hadoop Distributed File System (HDFS)
−MapReduce
−Hadoop Common
Two Key Aspects of Hadoop
Hadoop Distributed File System = HDFS
– Where Hadoop stores data
– A file system that spans all the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to make them
into one big file system
– Hadoop replicates its data across different computers, so that if one goes
down, the data is processed on one of the replicated computers.
MapReduce framework
– MapReduce is a software framework to support distributed processing on
large data sets across clusters of computers.
– How Hadoop understands and assigns work to the nodes
(machines)
What is the Hadoop Distributed File System?
HDFS manages the storage of data across multiple nodes.
HDFS assumes nodes will fail, so it achieves reliability by replicating data
across multiple nodes.
The degree of replication can be customized by the Hadoop administrator; by
default, every chunk of data is replicated across 3 nodes: 2 on the same
rack and 1 on a different rack.
Because of this replication, HDFS does not require RAID storage on hosts.
Data nodes can talk to each other to rebalance data and move copies around.
The file system is built from a cluster of data nodes, each of which serves
up blocks of data over the network using a block protocol specific to HDFS.
Enables applications to work with thousands of nodes and petabytes
of data in a highly cost effective manner
CPU + storage disks = “node”
Each node has a Linux O/S
Nodes can be combined into clusters
New nodes can be added as needed without changing data formats or how
data is loaded
Data is stored across the entire cluster (the DFS)
The Distributed File System (DFS) is responsible for spreading data across
the cluster, by making the entire cluster look like one giant file system.
When a file is written to the cluster, blocks of the file are spread out and
replicated across the whole cluster.
– The entire cluster participates in the file system
– Blocks of a single file are replicated across the cluster
[Figure: a logical file is split into blocks 1-4, which are replicated and
spread across the nodes of the cluster]
(Very quick) Introduction to MapReduce
Driving principles
– Data is stored across the entire cluster
– Programs are brought to the data, not the data to the program
– Rather than bringing the data to your program, as you do in traditional
programming, you write your program in a specific way that allows it to be
moved to the data.
The Distributed File System (DFS) is at the heart of MapReduce.
MapReduce chooses which node on the cluster processes each block of data,
preferring nodes that hold a replica of that block.
In a traditional DW you already know in advance what you can ask, while in
Hadoop you dump all the raw data into HDFS and then start asking the
questions.
Traditional warehouses are mostly ideal for analyzing structured data from
various systems and producing insights. A Hadoop-based platform is well
suited to dealing with unstructured data.
In addition, when you consider where data should be stored, you need to
understand how data is stored today. When storing data in a traditional
data warehouse, this data typically must shine with respect to quality
(good data): it is cleaned up via cleansing, matching, modeling, and other
services before it is ready for analysis. The data that lands in the
warehouse is of high value, and it has a broad purpose: it is going to be
used in reports and dashboards where the accuracy of that data is key.
In contrast, big data repositories rarely undergo the full quality control
that data receives before being injected into a warehouse. We could say
that data warehouse data is trusted enough to be "public," while
Hadoop data isn't as trusted.
Big Difference:
– Regular database: raw data passes through a schema that filters it before
storage; storage holds pre-filtered data, which is then queried for output
(schema-on-write).
– Big Data (Hadoop): raw data goes straight into storage, unfiltered; a
schema is applied as a filter at read time to produce the output
(schema-on-read).
RDBMS vs Hadoop
– Data sources: structured data with known schemas (RDBMS) vs. unstructured
and structured (Hadoop)
– Data type: records, long fields, objects, XML (RDBMS) vs. files (Hadoop)
– Data updates: updates allowed (RDBMS) vs. only inserts and deletes (Hadoop)
– Language: SQL & XQuery (RDBMS) vs. Pig (Pig Latin), Hive (HiveQL), Jaql
(Hadoop)
– Processing type: quick response, random access (RDBMS) vs. batch
processing (Hadoop)
– Compression: sophisticated data compression (RDBMS) vs. simple file
compression (Hadoop)
– Hardware: enterprise hardware (RDBMS) vs. commodity hardware (Hadoop)
– Data access: random access (indexing) (RDBMS) vs. access files only
(streaming) (Hadoop)
– History: ~40 years of innovation (RDBMS) vs. less than 5 years old (Hadoop)
– Community: widely used, abundant resources (RDBMS) vs. not widely adopted
yet (Hadoop)
Two Key Aspects of Hadoop
HDFS MapReduce
• Distributed • Parallel Programming
• Reliable • Fault Tolerant
• Commodity gear
Hadoop Distributed File System (HDFS)
Distributed, scalable, fault tolerant, high throughput
Data access through MapReduce
Files split into blocks
3 replicas for each piece of data by default
Can create, delete, copy, but NOT update
Designed for streaming reads, not random access
Data locality: processing data where it is physically stored, to decrease
the transmission of data
HDFS – Architecture
Master / Slave architecture
CPU + storage disks = "node"; each node has a Linux O/S
Master: NameNode
– manages the file system namespace and metadata
• FsImage
• EditLog
– regulates client access to files
– holds the replication factors and the properties and addresses of the data
Slave: DataNode
– many per cluster
– manages storage attached to the node
– periodically reports status to the NameNode
– each DataNode holds blocks of data
[Figure: File1 is split into blocks a, b, c, d, each replicated across the
DataNodes]
Hadoop Distributed File System (HDFS)
Files split into blocks
Data in a Hadoop cluster is broken down into smaller pieces (called blocks).
Data locality: processing data where it is physically stored, to decrease
the transmission of data
HDFS – Replication
Blocks of data are replicated to multiple nodes
– Behavior is controlled by replication factor, configurable per file
– Default is 3 replicas
Common case:
– one replica on one node in the local rack
– another replica on a different node in the local rack
– and the last on a different node in a different rack
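As a toy illustration of the placement policy described above (the function and data-structure names here are invented for this sketch; real placement decisions are made by the HDFS NameNode):

```python
import random

def place_replicas(racks):
    """Toy model of the default policy: two replicas on nodes in one rack,
    and the third on a node in a different rack. `racks` maps a rack name
    to its list of node names. Purely illustrative."""
    local_rack = random.choice(sorted(racks))
    remote_rack = random.choice([r for r in sorted(racks) if r != local_rack])
    # Two distinct nodes in the local rack, one node in the remote rack.
    first, second = random.sample(racks[local_rack], 2)
    third = random.choice(racks[remote_rack])
    return [(local_rack, first), (local_rack, second), (remote_rack, third)]

racks = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
print(place_replicas(racks))  # e.g. [('rack1', 'n2'), ('rack1', 'n3'), ('rack2', 'n5')]
```

Whatever the random choices, the result always spans exactly two racks with three distinct nodes, mirroring the "2 local + 1 remote" rule.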
Summary
The Hadoop Distributed File System (HDFS) is where Hadoop stores
its data. This file system spans all the nodes in a cluster. Effectively,
HDFS links together the data that resides on many local nodes, making
the data part of one big file system. Furthermore, HDFS assumes nodes
will fail, so it replicates a given chunk of data across multiple nodes to
achieve reliability. The degree of replication can be customized by the
Hadoop administrator or programmer; by default, every chunk of data is
replicated across 3 nodes: 2 on the same rack and 1 on a different rack.
MapReduce
MapReduce Explained
MapReduce is a software framework introduced by Google to support
distributed computing on large data sets across clusters of computers.
It is essentially a divide-and-conquer processing model: your input is
split into many small pieces that are distributed to the Hadoop nodes
(the map step) and processed in parallel.
Once these pieces are processed, the results are distilled (in the reduce
step) down to a single answer.
• "Map" step: each worker node applies the map() function to its local
data and writes the output to temporary storage. A master node ensures
that, of the redundant copies of the input data, only one is processed.
• "Shuffle" step: worker nodes redistribute data based on the output
keys (produced by the map() function), such that all data belonging to
one key ends up on the same worker node.
• "Reduce" step: worker nodes now process each group of output data,
per key, in parallel.
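The three steps above can be sketched in a minimal, single-process Python program using word counting, the classic MapReduce example (the function names here are ours, not a real Hadoop API):

```python
from collections import defaultdict

def map_step(block):
    """'Map': emit a (word, 1) pair for every word in one block of input."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle_step(pairs):
    """'Shuffle': group values by key, as if routing each key to one worker."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(groups):
    """'Reduce': collapse each key's list of values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

blocks = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for block in blocks for pair in map_step(block)]
counts = reduce_step(shuffle_step(mapped))
print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

In real Hadoop, each `map_step` call would run on the node holding that block, and the shuffle would move data across the network so each reducer sees all values for its keys.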
Introduction to MapReduce
Driving principles
– Data is stored across the entire cluster
– Programs are brought to the data, not the data to the
program
Data is stored across the entire cluster (the DFS) and replicated
– Blocks of a single file are distributed across the cluster
– The HDFS block size used by MapReduce is large: 64 MB by default
(128 MB in later Hadoop versions)
MapReduce Overview
Map → Shuffle → Reduce
Input: the distributed file system (HDFS), with data in blocks
Output: results can be written to HDFS or a database
"If you write your programs in a special way," the programs can be
brought to the data. This special way is called MapReduce, and it
involves breaking your program down into two discrete parts: Map
and Reduce.
A mapper is typically a relatively small program with a relatively
simple task: it is responsible for reading a portion of the input data,
interpreting, filtering, or transforming the data as necessary, and
then finally producing a stream of <key, value> pairs.
As shown in the diagram, the MapReduce environment automatically takes
your small "map" program (the blue boxes) and pushes it out to every
machine that has a block of the file you are trying to process. This
means that the bigger the file and the bigger the cluster, the more
mappers get involved in processing the data! That's a pretty powerful
idea.
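The mapper described above (small, streaming, emitting <key, value> pairs) can be sketched as a Python generator; this is an illustration of the idea, not Hadoop's actual Java API:

```python
def mapper(lines):
    """Reads a portion of the input, filters and transforms it as needed,
    and yields a stream of (key, value) pairs."""
    for line in lines:
        line = line.strip()
        if not line:                  # filtering: skip blank lines
            continue
        for word in line.split():     # transforming: split into words
            yield (word.lower(), 1)   # the (key, value) stream

pairs = list(mapper(["Hello world", "", "hello Hadoop"]))
# pairs == [('hello', 1), ('world', 1), ('hello', 1), ('hadoop', 1)]
```

Because it is a generator, the mapper never needs the whole input in memory; it processes its block record by record, which is exactly why mappers scale out so well.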
Notes
The main objective of Big Data is distributed processing, and
distributed storage follows from that, "not the opposite."
Normally we bring the data to the mainframe processor and process it
there, but here we keep the data on the nodes ("maybe a PC with a
Linux O/S") and process it in place, by sending the program over
there ("MapReduce").
What comes after:
tools for reporting analysis ("to describe
what happened"), such as Cognos,
and
tools for predictive analysis ("to
describe what may happen"),
such as IBM SPSS and IBM SPSS Modeler
Questions?