07-BigData-DataAnalysis

The document provides an overview of Big Data, including its definition, architecture components, storage methods, and processing techniques such as MapReduce and Spark. It discusses various data storage systems like distributed file systems and key-value stores, as well as the importance of data analytics for business decision-making. Additionally, it highlights the role of data warehousing and the use of machine learning for predictive modeling.

Uploaded by

hieutm0507
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd

BIG DATA – DATA ANALYSIS

Lê Hồng Hải
UET-VNUH
Big Data Overview

1 Introduction
2 Big Data storages
3 Big Data processing
4 Streaming
Big Data

 The definition of big data is data that contains greater variety, arriving in increasing volumes and with more velocity. This is also known as the 3 Vs.
Big data architecture components

• Data sources – relational databases, files (e.g., web server log files) produced by applications, and real-time data produced by IoT devices.
• Big data storage – stores high volumes of data of different types before filtering, aggregating, and preparing the data for analysis.
• Real-time message ingestion store – captures and stores real-time messages for stream processing.
• Analytical data store – relational databases for preparing and structuring big data for further analytical querying.
• Big data analytics and reporting – may include OLAP cubes, ML tools, BI tools, etc., to provide big data insights to end users.
Big data architecture

Big Data Storage

1. Distributed file systems
2. Sharding across multiple databases
3. Key-value storage systems
4. Parallel and distributed databases
Distributed File Systems

 A distributed file system stores data across a large collection of machines, but provides a single file-system view
 Provides redundant storage of massive amounts of data on cheap and unreliable computers
◼ Google File System (GFS)
◼ Hadoop File System (HDFS)
Hadoop File System Architecture

▪ Single Namespace for the entire cluster
▪ Files are broken up into blocks
• Typically 64 MB block size
• Each block replicated on multiple DataNodes
▪ Client
• Finds the location of blocks from the NameNode
• Accesses data directly from DataNodes
Hadoop Distributed File System (HDFS)

 Data Coherency
◼ Write-once-read-many access model
◼ Clients can only append to existing files
 Distributed file systems are good for millions of large files
Big Data Storage

1. Distributed file systems
2. Sharding across multiple databases
3. Key-value storage systems
4. Parallel and distributed databases
Sharding

 Sharding: partitioning data across multiple databases
 Partitioning is usually done on some partitioning attributes (also known as partitioning keys or shard keys), e.g., user ID
◼ E.g., records with key values from 1 to 100,000 on database 1, records with key values from 100,001 to 200,000 on database 2, etc.
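The range-partitioning scheme above can be sketched in a few lines (a minimal illustration; the function and variable names are hypothetical, and a real system would route to actual database connections rather than dictionaries):

```python
# Route records to shards by range-partitioning on the shard key,
# mirroring the slide's example: keys 1-100,000 on database 1,
# keys 100,001-200,000 on database 2, and so on.

SHARD_SIZE = 100_000

def shard_for(key: int) -> int:
    """Return the database (shard) number holding the given key."""
    return (key - 1) // SHARD_SIZE + 1

shards = {}  # shard number -> {key: record}; stands in for real databases

def insert(key, record):
    shards.setdefault(shard_for(key), {})[key] = record

def lookup(key):
    return shards.get(shard_for(key), {}).get(key)

insert(42, "user-42")
insert(150_000, "user-150000")
```

Note that key 42 lands on shard 1 and key 150,000 on shard 2, so a lookup only ever touches one database.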
Key Value Storage Systems

 Key-value storage systems store large numbers (billions or even more) of small (KB–MB) sized records
 Records are partitioned across multiple machines, and queries are routed by the system to the appropriate machine
 Records are also replicated across multiple machines, to ensure availability even if a machine fails
◼ Key-value stores ensure that updates are applied to all replicas, so that their values stay consistent
Key Value Storage Systems

 Key-value stores may store
◼ Uninterpreted bytes, with an associated key
 E.g., Amazon S3, Amazon Dynamo
◼ Wide tables (can have arbitrarily many attribute names) with an associated key
▪ Google BigTable, Apache Cassandra, Apache HBase, Amazon DynamoDB
◼ JSON
 E.g., MongoDB, CouchDB (document model)
 Document stores hold semi-structured data, typically JSON
 Some key-value stores support multiple versions of data, with timestamps/version numbers
Data Representation

 An example of a JSON object is:

{
  "ID": "22222",
  "name": {
    "firstname": "Albert",
    "lastname": "Einstein"
  },
  "deptname": "Physics",
  "children": [
    { "firstname": "Hans", "lastname": "Einstein" },
    { "firstname": "Eduard", "lastname": "Einstein" }
  ]
}
Key Value Storage Systems

 Key-value stores support
◼ put(key, value): stores the value with the associated key
◼ get(key): retrieves the stored value associated with the specified key
◼ delete(key): removes the key and its associated value
 Some systems also support range queries on key values
 Document stores also support queries on non-key attributes
◼ See the book for MongoDB queries
◼ Such systems are also called NoSQL systems
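The put/get/delete interface above can be illustrated with a minimal in-memory sketch (illustrative only; a real key-value store partitions and replicates these records across machines, and the class and method names here simply mirror the slide):

```python
class KVStore:
    """Toy single-machine model of the key-value interface."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Store the value with the associated key.
        self._data[key] = value

    def get(self, key):
        # Retrieve the stored value, or None if the key is absent.
        return self._data.get(key)

    def delete(self, key):
        # Remove the key and its associated value.
        self._data.pop(key, None)

    def range(self, lo, hi):
        # Range query on key values, supported by some systems.
        return {k: v for k, v in sorted(self._data.items()) if lo <= k <= hi}

db = KVStore()
db.put("user:1", {"name": "Einstein"})
```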
Replication and Consistency

 Availability (the system can run even if parts have failed) is essential for parallel/distributed databases
◼ Achieved via replication, so even if a node has failed, another copy is available
 Consistency is important for replicated data
◼ All live replicas have the same value, and each read sees the latest version
 Network partitions can occur (the network breaks into two or more parts, each with active systems that can't talk to the other parts)
 In the presence of partitions, one cannot guarantee both availability and consistency
◼ Brewer's CAP "Theorem"
Big data architecture

Big Data Processing

 Map-Reduce
 Spark
 Streaming

The MapReduce Paradigm

 A platform for reliable, scalable parallel computing
 Abstracts issues of the distributed and parallel environment away from the programmer
◼ The programmer provides the core logic (via map() and reduce() functions)
◼ The system takes care of parallelization of the computation, coordination, etc.
MapReduce - Dataflow

The MapReduce Paradigm

 The paradigm dates back many decades
◼ But very large-scale implementations running on clusters with 10^3 to 10^4 machines are more recent
◼ Google MapReduce, Hadoop, ...
 Data storage/access is typically done using distributed file systems or key-value stores
MapReduce Programming Model

 Input: a set of key/value pairs
 The user supplies two functions:
◼ map(k, v) → list(k1, v1)
◼ reduce(k1, list(v1)) → v2
 (k1, v1) is an intermediate key/value pair
 The output is the set of (k1, v2) pairs
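The map/reduce contract above — map(k, v) → list(k1, v1), then group by intermediate key, then reduce(k1, list(v1)) → v2 — can be simulated in plain Python. This is a hedged sketch of the programming model for word count, not the Hadoop API; the comma in the slide's sample sentence is dropped so that splitting on whitespace yields clean words:

```python
from collections import defaultdict

def map_fn(_key, line):
    # Emit an intermediate (word, 1) pair for each word in the line.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Sum all the 1s emitted for this word.
    return sum(counts)

def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)          # the shuffle: group by intermediate key
    for k, v in inputs:
        for k1, v1 in map_fn(k, v):
            groups[k1].append(v1)
    return {k1: reduce_fn(k1, vs) for k1, vs in groups.items()}

result = map_reduce([(0, "I am a tiger you are also a tiger")],
                    map_fn, reduce_fn)
```

In a real cluster, the map calls, the shuffle, and the reduce calls each run in parallel across machines; the single-process loop above only shows the data flow.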
Flow of Keys and Values

 Flow of keys and values in a map-reduce task

[Figure: map inputs (mk, mv) produce map outputs (rk, rv); the map outputs are grouped by key into reduce inputs (rk, list of rv)]

https://www.geeksforgeeks.org/how-to-execute-wordcount-program-
in-mapreduce-using-cloudera-distribution-hadoop-cdh/
Example

 Input: "I am a tiger, you are also a tiger"

[Figure: three map tasks emit (word, 1) pairs; Hadoop sorts the intermediate data; two reduce tasks aggregate the counts — a,2 also,1 am,1 are,1 I,1 written to part0 and tiger,2 you,1 written to part1]

 The JobTracker generates three TaskTrackers for the map tasks and two TaskTrackers for the reduce tasks; Hadoop sorts the intermediate data
Parallel Processing of MapReduce Job

[Figure: the user program is copied to a master and to the workers; the master assigns map and reduce tasks; map tasks read input file partitions and write intermediate files to local disk; reduce tasks remotely read and sort the intermediate partitions and write the output files]
Map Reduce vs. Databases

 MapReduce is widely used for parallel processing
◼ Google, Yahoo, and hundreds of other companies
◼ Example uses: computing PageRank, building keyword indices, analyzing web click logs, ...
 Many real-world uses of MapReduce cannot be expressed in SQL
 But many computations are much easier to express in SQL
Map Reduce vs. Databases (Cont.)

 Relational operations (select, project, join, aggregation, etc.) can be expressed using MapReduce
 SQL queries can be translated into the MapReduce infrastructure for execution
◼ Apache Hive SQL, Apache Pig Latin, Microsoft SCOPE
Where is MapReduce Inefficient?

 Long pipelines sharing data
 Interactive applications
 Streaming applications

(MapReduce would need to write to and read from disk a lot)
Spark

 The key idea of Spark is the Resilient Distributed Dataset (RDD)
 It supports in-memory computation
RDD Spark

 The Resilient Distributed Dataset (RDD) abstraction
◼ A collection of records that can be stored across multiple machines
 A read-only partitioned collection of records (like a DFS), but with a record of how the dataset was created as a combination of transformations from other dataset(s)
Word Count in Spark

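The word-count slide (the code itself was an image) follows the classic Spark shape: flatMap lines into words, map each word to (word, 1), and reduceByKey with addition. A plain-Python equivalent of that chain, as an illustrative sketch rather than actual Spark code:

```python
from collections import Counter
from itertools import chain

lines = ["I am a tiger you are also a tiger"]

# flatMap: split every line into words and flatten into one stream
words = chain.from_iterable(line.split() for line in lines)

# map + reduceByKey: Counter maps each word to 1 and sums per key
counts = Counter(words)
```

In Spark itself this would be roughly `lines.flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`, with each stage executing in parallel across RDD partitions.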
Spark DataFrames and Datasets

 RDDs in Spark can be typed in programs, but not dynamically
 The Dataset type allows types to be specified dynamically
 Row is a row type, with attribute names
◼ In the code below, the attribute names/types of instructor and department are inferred from the files read
Spark DataFrames and Datasets

 Operations such as filter, join, groupBy, agg, etc., are defined on Dataset and can execute in parallel

Dataset<Row> instructor = spark.read().parquet("...");
Dataset<Row> department = spark.read().parquet("...");
instructor.filter(instructor.col("salary").gt(100000))
    .join(department, instructor.col("dept name")
        .equalTo(department.col("dept name")))
    .groupBy(department.col("building"))
    .agg(count(instructor.col("ID")));
Streaming Data
Streaming Data and Applications

 Streaming data refers to data that arrives in a continuous fashion
 Applications include:
◼ Stock markets: streams of trades
◼ Sensors: sensor readings
 Internet of Things
◼ Network monitoring data
◼ Social media: tweets and posts can be viewed as a stream
 Queries on streams can be very useful
◼ Monitoring, alerts, automated triggering of actions
Publish Subscribe Systems

 Publish-subscribe (pub-sub) systems provide a convenient abstraction for processing streams
◼ Tuples in a stream are published to a topic
◼ Consumers subscribe to topics
Apache Kafka

 Apache Kafka is a popular parallel pub-sub system widely used to manage streaming data
 Parallel pub-sub systems allow the tuples in a topic to be partitioned across multiple machines
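The partitioned-topic idea can be sketched in memory (this is the spirit of Kafka's model, not the Kafka client API; the class, the trades topic, and the sample values are all made up for illustration). Tuples published to a topic are assigned to one of several partitions by hashing the message key, so tuples with the same key stay in order on one partition while the topic as a whole is spread across machines:

```python
class Topic:
    """Toy partitioned pub-sub topic: each partition is an append-only log."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Hash the key to pick a partition; same key -> same partition.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p

    def consume(self, partition, offset=0):
        # Consumers read a partition sequentially from a given offset.
        return self.partitions[partition][offset:]

trades = Topic(num_partitions=3)
p = trades.publish("AAPL", 189.5)
```

A second publish with the key "AAPL" lands on the same partition p, so a consumer of that partition sees the trades for that stock in publish order.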
Big data architecture

Data Analytics

1. Overview
2. Data Warehousing (DW)
3. Online Analytical Processing (OLAP)
4. Data Mining

Overview

 Data analytics: the processing of data to infer patterns, correlations, or models for prediction
 Primarily used to make business decisions
◼ E.g., what product to suggest for purchase
◼ E.g., what products to manufacture/stock, and in what quantity
 Critical for businesses today
Common steps in data analytics

 Gather data from multiple sources into one location
 Data warehouses also integrate data into a common schema
 Data often needs to be extracted from source formats, transformed into the common schema, and loaded into the data warehouse (ETL)
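The extract-transform-load steps above can be sketched as a toy pipeline (all names and the CSV source here are made up for illustration; a real ETL job would pull from databases or log files and load into a warehouse system):

```python
import csv
import io

# Source data in its original format (extract step reads this).
source = "id,amount_usd\n1,10.50\n2,3.25\n"

def extract(text):
    # Extract: parse rows out of the source format (CSV here).
    return list(csv.DictReader(io.StringIO(text)))

def transform(row):
    # Transform: rename fields and coerce types into the common schema.
    return {"sale_id": int(row["id"]),
            "amount": float(row["amount_usd"])}

warehouse = []  # stands in for the warehouse's fact table

def load(rows):
    # Load: append the transformed rows into the warehouse.
    warehouse.extend(transform(r) for r in rows)

load(extract(source))
```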
Data Analytics

 Generate aggregates and reports summarizing the data
◼ Dashboards showing graphical charts/reports
◼ Online analytical processing (OLAP) systems allow interactive querying
◼ Statistical analysis using tools such as R/SAS/SPSS
 Build predictive models and use the models for decision making
Overview (Cont.)

 Predictive models are widely used today
◼ E.g., use customer profile features and the history of a customer to predict the likelihood of default on a loan
◼ E.g., use the history of sales to predict future sales
 Other examples of business decisions:
◼ What items to stock?
◼ What insurance premium to charge?
◼ To whom to send advertisements?
Overview (Cont.)

 Machine learning techniques are key to finding patterns in data and making predictions
 Data mining extends techniques developed by the machine-learning community to run them on very large datasets
 The term business intelligence (BI) is a synonym for data analytics
Data Warehousing

 A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site
Warehouse Design issues

 Data transformation and data cleansing
◼ E.g., correcting mistakes in addresses (misspellings, zip code errors)
 How to propagate updates
 What data to summarize
Multidimensional Data

 Data in warehouses can usually be divided into
◼ Fact tables, which are large
 E.g., sales(item_id, store_id, customer_id, date, number, price)
◼ Dimension tables, which are relatively small
 Store extra information about stores, items, etc.
Fact Tables

 Attributes of fact tables can usually be viewed as
◼ Measure attributes
 Measure some value, and can be aggregated upon
 E.g., the number and price attributes of the sales relation
◼ Dimension attributes
 Dimensions on which the measure attributes are viewed
Data Warehouse Star Schema

More on Data Warehouse Star Schema

Multidimensional Data and Warehouse Schemas

 More complicated schema structures exist
◼ Snowflake schema: multiple levels of dimension tables
Data lakes

 Some applications do not find it worthwhile to bring data to a common schema
◼ Data lakes are repositories that allow data to be stored in multiple formats, without schema integration
◼ Less upfront effort, but more effort during querying
Database Support for Data Warehouses

 Data in warehouses is usually append-only, not updated, so concurrency-control overheads can be avoided
 Data warehouses often use column-oriented storage
Column-oriented storage

 Arrays are compressed, reducing storage, I/O, and memory costs significantly
 Queries can fetch only the attributes they care about, reducing I/O and memory cost
 Data warehouses often use parallel storage and query-processing infrastructure
Data Analysis and OLAP

 Online Analytical Processing (OLAP)
 Interactive analysis of data, allowing data to be summarized and viewed in different ways in an online fashion (with negligible delay)
Cross Tabulation

 The table below is an example of a cross-tabulation (cross-tab), also referred to as a pivot table
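A cross-tab like the one referred to above can be built from flat fact rows: one dimension on the rows, one on the columns, and the measure aggregated in each cell, plus row, column, and grand totals. The item names and numbers below are made up for illustration:

```python
from collections import defaultdict

# Flat fact rows: (item_name, size, number sold)
sales = [("skirt", "S", 2), ("skirt", "M", 5),
         ("dress", "S", 3), ("dress", "M", 7), ("skirt", "S", 1)]

def cross_tab(rows):
    table = defaultdict(int)
    for item, size, number in rows:
        table[(item, size)] += number        # cell aggregate
        table[(item, "total")] += number     # row margin
        table[("total", size)] += number     # column margin
        table[("total", "total")] += number  # grand total
    return dict(table)

ct = cross_tab(sales)
```

Pivoting is then just swapping which attribute indexes the rows and which the columns, and slicing is fixing one attribute's value before tabulating.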
Data Cube

 A data cube is a multidimensional generalization of a cross-tab
 It can have n dimensions; we show 3 below
 Cross-tabs can be used as views on a data cube
Online Analytical Processing Operations

 Pivoting: changing the dimensions used in a cross-tab
 Slicing: creating a cross-tab for fixed values only
 Rollup: moving from finer-granularity data to a coarser granularity
 Drill down: the opposite operation, moving from coarser-granularity data to finer-granularity data
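Rollup, as defined above, is just re-aggregating finer-granularity cells into coarser ones; drill down goes back to the finer-grained table. A small sketch with made-up sales figures, rolling up from (item, month) granularity to item granularity:

```python
from collections import defaultdict

# Finer granularity: sales by (item, month)
by_item_month = {("skirt", "2024-01"): 4, ("skirt", "2024-02"): 6,
                 ("dress", "2024-01"): 9}

def rollup(cube):
    # Drop the month dimension and sum the measure per item.
    coarse = defaultdict(int)
    for (item, _month), n in cube.items():
        coarse[item] += n
    return dict(coarse)

by_item = rollup(by_item_month)
```

Drilling down simply means querying by_item_month again rather than the rolled-up by_item view.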
Hierarchies on Dimensions

 A hierarchy on dimension attributes lets dimensions be viewed at different levels of detail
Cross Tabulation With Hierarchy

 Cross-tabs can be easily extended to deal with hierarchies
 Can drill down or roll up on a hierarchy
 E.g., hierarchy: item_name → category
Reporting and Visualization

 Reporting tools help create formatted reports with tabular/graphical representations of data
 Data visualization tools help create interactive visualizations of data
◼ E.g., PowerBI, Tableau, FusionCharts, plotly, Datawrapper, Google Charts, etc.
Reporting and Visualization

Data Mining

 Data mining is the process of semi-automatically analyzing large databases to find useful patterns
 Some types of knowledge can be represented as rules
 More generally, knowledge is discovered by applying machine learning techniques to past instances of data to form a model
Types of Data Mining Tasks

 Prediction based on past history
◼ Predict whether a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history
 Some examples of prediction mechanisms:
◼ Classification
 Items (with associated attributes) belong to one of several classes
 Training instances have attribute values and classes provided
◼ Regression formulae
 Given a set of mappings for an unknown function, predict the function result for a new parameter value
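Classification as described above — training instances with attribute values and a class, a new instance assigned a class — can be illustrated with a tiny nearest-neighbour sketch (one simple classifier among many; the credit-risk data and features below are entirely made up):

```python
# Training instances: (income in $k, age) -> credit risk class
train = [
    ((20, 25), "bad"), ((90, 45), "good"),
    ((75, 38), "good"), ((25, 52), "bad"),
]

def classify(x):
    # Assign the class of the nearest training instance
    # (squared Euclidean distance over the attribute values).
    def dist(a):
        return sum((ai - xi) ** 2 for ai, xi in zip(a, x))
    return min(train, key=lambda t: dist(t[0]))[1]

risk = classify((80, 40))  # a new applicant's attributes
```

A regression mechanism would instead fit a formula (e.g., by least squares) to the training mappings and evaluate it on the new attribute values.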
THANK YOU
