07-BigData-DataAnalysis

The document provides an overview of Big Data, including its definition, architecture components, storage methods, and processing techniques such as MapReduce and Spark. It discusses various data storage systems like distributed file systems and key-value stores, as well as the importance of data analytics for business decision-making. Additionally, it highlights the role of data warehousing and the use of machine learning for predictive modeling.

Uploaded by

hieutm0507
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd

BIG DATA – DATA ANALYSIS

Lê Hồng Hải
UET-VNUH
Big Data Overview

1 Introduction
2 Big Data storages
3 Big Data processing
4 Streaming
Big Data

 The definition of big data is data that contains greater variety, arriving in increasing volumes and with more velocity. This is also known as the 3 Vs.
Big data architecture components

• Data sources – relational databases, files (e.g., web server log files) produced by applications, and real-time data produced by IoT devices.
• Big data storage – stores high volumes of data of different types before filtering, aggregating, and preparing the data for analysis.
• Real-time message ingestion store – captures and stores real-time messages for stream processing.
• Analytical data store – relational databases for preparing and structuring big data for further analytical querying.
• Big data analytics and reporting – may include OLAP cubes, ML tools, BI tools, etc., to provide big data insights to end users.
Big data architecture

Big Data Storage

1. Distributed file systems
2. Sharding across multiple databases
3. Key-value storage systems
4. Parallel and distributed databases
Distributed File Systems

 A distributed file system stores data across a large collection of machines, but provides a single file-system view
 Provides redundant storage of massive amounts of data on cheap and unreliable computers
◼ Google File System (GFS)
◼ Hadoop File System (HDFS)
Hadoop File System Architecture

▪ Single Namespace for the entire cluster
▪ Files are broken up into blocks
• Typically 64 MB block size
• Each block replicated on multiple DataNodes
▪ Client
• Finds the location of blocks from the NameNode
• Accesses data directly from DataNodes
Hadoop Distributed File System (HDFS)

 Data Coherency
◼ Write-once-read-many access model
◼ Clients can only append to existing files
 Distributed file systems are good for millions of large files
Big Data Storage

1. Distributed file systems
2. Sharding across multiple databases
3. Key-value storage systems
4. Parallel and distributed databases
Sharding

 Sharding: partitioning data across multiple databases
 Partitioning is usually done on some partitioning attributes (also known as partitioning keys or shard keys), e.g., user ID
◼ E.g., records with key values from 1 to 100,000 on database 1, records with key values from 100,001 to 200,000 on database 2, etc.
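The range-partitioning scheme above can be sketched in a few lines (a minimal illustration; the function and variable names are hypothetical, and a real system would route to actual database connections rather than dictionaries):

```python
# Route records to shards by range-partitioning on the shard key,
# mirroring the slide's example: keys 1-100,000 on database 1,
# keys 100,001-200,000 on database 2, and so on.

SHARD_SIZE = 100_000

def shard_for(key: int) -> int:
    """Return the database (shard) number holding the given key."""
    return (key - 1) // SHARD_SIZE + 1

shards = {}  # shard number -> {key: record}; stands in for real databases

def insert(key, record):
    shards.setdefault(shard_for(key), {})[key] = record

def lookup(key):
    return shards.get(shard_for(key), {}).get(key)

insert(42, "user-42")
insert(150_000, "user-150000")
```

Note that key 42 lands on shard 1 and key 150,000 on shard 2, so a lookup only ever touches one database.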
Key Value Storage Systems

 Key-value storage systems store large numbers (billions or even more) of small (KB–MB) sized records
 Records are partitioned across multiple machines, and queries are routed by the system to the appropriate machine
 Records are also replicated across multiple machines, to ensure availability even if a machine fails
◼ Key-value stores ensure that updates are applied to all replicas, so that their values stay consistent
Key Value Storage Systems

 Key-value stores may store
◼ Uninterpreted bytes, with an associated key
 E.g., Amazon S3, Amazon Dynamo
◼ Wide tables (can have arbitrarily many attribute names) with an associated key
▪ Google BigTable, Apache Cassandra, Apache HBase, Amazon DynamoDB
◼ JSON
 E.g., MongoDB, CouchDB (document model)
 Document stores hold semi-structured data, typically JSON
 Some key-value stores support multiple versions of data, with timestamps/version numbers
Data Representation

 An example of a JSON object is:

{
  "ID": "22222",
  "name": {
    "firstname": "Albert",
    "lastname": "Einstein"
  },
  "deptname": "Physics",
  "children": [
    { "firstname": "Hans", "lastname": "Einstein" },
    { "firstname": "Eduard", "lastname": "Einstein" }
  ]
}
Key Value Storage Systems

 Key-value stores support
◼ put(key, value): stores the value with the associated key
◼ get(key): retrieves the stored value associated with the specified key
◼ delete(key): removes the key and its associated value
 Some systems also support range queries on key values
 Document stores also support queries on non-key attributes
◼ See the book for MongoDB queries
◼ Such systems are also called NoSQL systems
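The put/get/delete interface above can be illustrated with a minimal in-memory sketch (illustrative only; a real key-value store partitions and replicates these records across machines, and the class and method names here simply mirror the slide):

```python
class KVStore:
    """Toy single-machine model of the key-value interface."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Store the value with the associated key.
        self._data[key] = value

    def get(self, key):
        # Retrieve the stored value, or None if the key is absent.
        return self._data.get(key)

    def delete(self, key):
        # Remove the key and its associated value.
        self._data.pop(key, None)

    def range(self, lo, hi):
        # Range query on key values, supported by some systems.
        return {k: v for k, v in sorted(self._data.items()) if lo <= k <= hi}

db = KVStore()
db.put("user:1", {"name": "Einstein"})
```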
Replication and Consistency

 Availability (the system can run even if parts have failed) is essential for parallel/distributed databases
◼ Achieved via replication, so even if a node has failed, another copy is available
 Consistency is important for replicated data
◼ All live replicas have the same value, and each read sees the latest version
 Network partitions can occur (the network breaks into two or more parts, each with active systems that can't talk to the other parts)
 In the presence of partitions, one cannot guarantee both availability and consistency
◼ Brewer's CAP "Theorem"
Big data architecture

Big Data Processing

 Map-Reduce
 Spark
 Streaming

The MapReduce Paradigm

 A platform for reliable, scalable parallel computing
 Abstracts issues of the distributed and parallel environment away from the programmer
◼ The programmer provides the core logic (via map() and reduce() functions)
◼ The system takes care of parallelization of the computation, coordination, etc.
MapReduce - Dataflow

The MapReduce Paradigm

 The paradigm dates back many decades
◼ But very large-scale implementations running on clusters with 10^3 to 10^4 machines are more recent
◼ Google MapReduce, Hadoop, ...
 Data storage/access is typically done using distributed file systems or key-value stores
MapReduce Programming Model

 Input: a set of key/value pairs
 The user supplies two functions:
◼ map(k, v) → list(k1, v1)
◼ reduce(k1, list(v1)) → v2
 (k1, v1) is an intermediate key/value pair
 The output is the set of (k1, v2) pairs
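The map/reduce contract above — map(k, v) → list(k1, v1), then group by intermediate key, then reduce(k1, list(v1)) → v2 — can be simulated in plain Python. This is a hedged sketch of the programming model for word count, not the Hadoop API; the comma in the slide's sample sentence is dropped so that splitting on whitespace yields clean words:

```python
from collections import defaultdict

def map_fn(_key, line):
    # Emit an intermediate (word, 1) pair for each word in the line.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Sum all the 1s emitted for this word.
    return sum(counts)

def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)          # the shuffle: group by intermediate key
    for k, v in inputs:
        for k1, v1 in map_fn(k, v):
            groups[k1].append(v1)
    return {k1: reduce_fn(k1, vs) for k1, vs in groups.items()}

result = map_reduce([(0, "I am a tiger you are also a tiger")],
                    map_fn, reduce_fn)
```

In a real cluster, the map calls, the shuffle, and the reduce calls each run in parallel across machines; the single-process loop above only shows the data flow.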
Flow of Keys and Values

 Flow of keys and values in a map-reduce task

[Figure: map inputs (mk, mv) produce map outputs (rk, rv); the map outputs are grouped by key into reduce inputs (rk, list of rv)]

https://www.geeksforgeeks.org/how-to-execute-wordcount-program-
in-mapreduce-using-cloudera-distribution-hadoop-cdh/
Example

 Input: "I am a tiger, you are also a tiger"

[Figure: three map tasks emit (word, 1) pairs; Hadoop sorts the intermediate data; two reduce tasks aggregate the counts — a,2 also,1 am,1 are,1 I,1 written to part0 and tiger,2 you,1 written to part1]

 The JobTracker generates three TaskTrackers for the map tasks and two TaskTrackers for the reduce tasks; Hadoop sorts the intermediate data
Parallel Processing of MapReduce Job

[Figure: the user program is copied to a master and to the workers; the master assigns map and reduce tasks; map tasks read input file partitions and write intermediate files to local disk; reduce tasks remotely read and sort the intermediate partitions and write the output files]
Map Reduce vs. Databases

 MapReduce is widely used for parallel processing
◼ Google, Yahoo, and hundreds of other companies
◼ Example uses: computing PageRank, building keyword indices, analyzing web click logs, ...
 Many real-world uses of MapReduce cannot be expressed in SQL
 But many computations are much easier to express in SQL
Map Reduce vs. Databases (Cont.)

 Relational operations (select, project, join, aggregation, etc.) can be expressed using MapReduce
 SQL queries can be translated into the MapReduce infrastructure for execution
◼ Apache Hive SQL, Apache Pig Latin, Microsoft SCOPE
Where is MapReduce Inefficient?

 Long pipelines sharing data
 Interactive applications
 Streaming applications

(MapReduce would need to write to and read from disk a lot)
Spark

 The key idea of Spark is the Resilient Distributed Dataset (RDD)
 It supports in-memory computation
RDD Spark

 The Resilient Distributed Dataset (RDD) abstraction
◼ A collection of records that can be stored across multiple machines
 A read-only partitioned collection of records (like a DFS), but with a record of how the dataset was created as a combination of transformations from other dataset(s)
Word Count in Spark

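The word-count slide (the code itself was an image) follows the classic Spark shape: flatMap lines into words, map each word to (word, 1), and reduceByKey with addition. A plain-Python equivalent of that chain, as an illustrative sketch rather than actual Spark code:

```python
from collections import Counter
from itertools import chain

lines = ["I am a tiger you are also a tiger"]

# flatMap: split every line into words and flatten into one stream
words = chain.from_iterable(line.split() for line in lines)

# map + reduceByKey: Counter maps each word to 1 and sums per key
counts = Counter(words)
```

In Spark itself this would be roughly `lines.flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`, with each stage executing in parallel across RDD partitions.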
Spark DataFrames and Datasets

 RDDs in Spark can be typed in programs, but not dynamically
 The Dataset type allows types to be specified dynamically
 Row is a row type, with attribute names
◼ In the code below, the attribute names/types of instructor and department are inferred from the files read
Spark DataFrames and Datasets

 Operations such as filter, join, groupBy, agg, etc., are defined on Dataset and can execute in parallel

Dataset<Row> instructor = spark.read().parquet("...");
Dataset<Row> department = spark.read().parquet("...");
instructor.filter(instructor.col("salary").gt(100000))
    .join(department, instructor.col("dept name")
        .equalTo(department.col("dept name")))
    .groupBy(department.col("building"))
    .agg(count(instructor.col("ID")));
Streaming Data
Streaming Data and Applications

 Streaming data refers to data that arrives in a continuous fashion
 Applications include:
◼ Stock markets: streams of trades
◼ Sensors: sensor readings
 Internet of Things
◼ Network monitoring data
◼ Social media: tweets and posts can be viewed as a stream
 Queries on streams can be very useful
◼ Monitoring, alerts, automated triggering of actions
Publish Subscribe Systems

 Publish-subscribe (pub-sub) systems provide a convenient abstraction for processing streams
◼ Tuples in a stream are published to a topic
◼ Consumers subscribe to topics
Apache Kafka

 Apache Kafka is a popular parallel pub-sub system widely used to manage streaming data
 Parallel pub-sub systems allow the tuples in a topic to be partitioned across multiple machines
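The partitioned-topic idea can be sketched in memory (this is the spirit of Kafka's model, not the Kafka client API; the class, the trades topic, and the sample values are all made up for illustration). Tuples published to a topic are assigned to one of several partitions by hashing the message key, so tuples with the same key stay in order on one partition while the topic as a whole is spread across machines:

```python
class Topic:
    """Toy partitioned pub-sub topic: each partition is an append-only log."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Hash the key to pick a partition; same key -> same partition.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p

    def consume(self, partition, offset=0):
        # Consumers read a partition sequentially from a given offset.
        return self.partitions[partition][offset:]

trades = Topic(num_partitions=3)
p = trades.publish("AAPL", 189.5)
```

A second publish with the key "AAPL" lands on the same partition p, so a consumer of that partition sees the trades for that stock in publish order.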
Big data architecture

Data Analytics

1. Overview
2. Data Warehousing (DW)
3. Online Analytical Processing (OLAP)
4. Data Mining

Overview

 Data analytics: the processing of data to infer patterns, correlations, or models for prediction
 Primarily used to make business decisions
◼ E.g., what product to suggest for purchase
◼ E.g., what products to manufacture/stock, and in what quantity
 Critical for businesses today
Common steps in data analytics

 Gather data from multiple sources into one location
 Data warehouses also integrate data into a common schema
 Data often needs to be extracted from source formats, transformed into the common schema, and loaded into the data warehouse (ETL)
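The extract-transform-load steps above can be sketched as a toy pipeline (all names and the CSV source here are made up for illustration; a real ETL job would pull from databases or log files and load into a warehouse system):

```python
import csv
import io

# Source data in its original format (extract step reads this).
source = "id,amount_usd\n1,10.50\n2,3.25\n"

def extract(text):
    # Extract: parse rows out of the source format (CSV here).
    return list(csv.DictReader(io.StringIO(text)))

def transform(row):
    # Transform: rename fields and coerce types into the common schema.
    return {"sale_id": int(row["id"]),
            "amount": float(row["amount_usd"])}

warehouse = []  # stands in for the warehouse's fact table

def load(rows):
    # Load: append the transformed rows into the warehouse.
    warehouse.extend(transform(r) for r in rows)

load(extract(source))
```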
Data Analytics

 Generate aggregates and reports summarizing the data
◼ Dashboards showing graphical charts/reports
◼ Online analytical processing (OLAP) systems allow interactive querying
◼ Statistical analysis using tools such as R/SAS/SPSS
 Build predictive models and use the models for decision making
Overview (Cont.)

 Predictive models are widely used today
◼ E.g., use customer profile features and the history of a customer to predict the likelihood of default on a loan
◼ E.g., use the history of sales to predict future sales
 Other examples of business decisions:
◼ What items to stock?
◼ What insurance premium to charge?
◼ To whom to send advertisements?
Overview (Cont.)

 Machine learning techniques are key to finding patterns in data and making predictions
 Data mining extends techniques developed by the machine-learning community to run them on very large datasets
 The term business intelligence (BI) is a synonym for data analytics
Data Warehousing

 A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site
Warehouse Design issues

 Data transformation and data cleansing
◼ E.g., correcting mistakes in addresses (misspellings, zip code errors)
 How to propagate updates
 What data to summarize
Multidimensional Data

 Data in warehouses can usually be divided into
◼ Fact tables, which are large
 E.g., sales(item_id, store_id, customer_id, date, number, price)
◼ Dimension tables, which are relatively small
 Store extra information about stores, items, etc.
Fact Tables

 Attributes of fact tables can usually be viewed as
◼ Measure attributes
 Measure some value, and can be aggregated upon
 E.g., the number and price attributes of the sales relation
◼ Dimension attributes
 Dimensions on which the measure attributes are viewed
Data Warehouse Star Schema

More on Data Warehouse Star Schema

Multidimensional Data and Warehouse Schemas

 More complicated schema structures exist
◼ Snowflake schema: multiple levels of dimension tables
Data lakes

 Some applications do not find it worthwhile to bring data to a common schema
◼ Data lakes are repositories that allow data to be stored in multiple formats, without schema integration
◼ Less upfront effort, but more effort during querying
Database Support for Data Warehouses

 Data in warehouses is usually append-only, not updated, so concurrency-control overheads can be avoided
 Data warehouses often use column-oriented storage
Column-oriented storage

 Arrays are compressed, reducing storage, I/O, and memory costs significantly
 Queries can fetch only the attributes they care about, reducing I/O and memory cost
 Data warehouses often use parallel storage and query-processing infrastructure
Data Analysis and OLAP

 Online Analytical Processing (OLAP)
 Interactive analysis of data, allowing data to be summarized and viewed in different ways in an online fashion (with negligible delay)
Cross Tabulation

 The table below is an example of a cross-tabulation (cross-tab), also referred to as a pivot table
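A cross-tab like the one referred to above can be built from flat fact rows: one dimension on the rows, one on the columns, and the measure aggregated in each cell, plus row, column, and grand totals. The item names and numbers below are made up for illustration:

```python
from collections import defaultdict

# Flat fact rows: (item_name, size, number sold)
sales = [("skirt", "S", 2), ("skirt", "M", 5),
         ("dress", "S", 3), ("dress", "M", 7), ("skirt", "S", 1)]

def cross_tab(rows):
    table = defaultdict(int)
    for item, size, number in rows:
        table[(item, size)] += number        # cell aggregate
        table[(item, "total")] += number     # row margin
        table[("total", size)] += number     # column margin
        table[("total", "total")] += number  # grand total
    return dict(table)

ct = cross_tab(sales)
```

Pivoting is then just swapping which attribute indexes the rows and which the columns, and slicing is fixing one attribute's value before tabulating.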
Data Cube

 A data cube is a multidimensional generalization of a cross-tab
 It can have n dimensions; we show 3 below
 Cross-tabs can be used as views on a data cube
Online Analytical Processing Operations

 Pivoting: changing the dimensions used in a cross-tab
 Slicing: creating a cross-tab for fixed values only
 Rollup: moving from finer-granularity data to a coarser granularity
 Drill down: the opposite operation, moving from coarser-granularity data to finer-granularity data
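Rollup, as defined above, is just re-aggregating finer-granularity cells into coarser ones; drill down goes back to the finer-grained table. A small sketch with made-up sales figures, rolling up from (item, month) granularity to item granularity:

```python
from collections import defaultdict

# Finer granularity: sales by (item, month)
by_item_month = {("skirt", "2024-01"): 4, ("skirt", "2024-02"): 6,
                 ("dress", "2024-01"): 9}

def rollup(cube):
    # Drop the month dimension and sum the measure per item.
    coarse = defaultdict(int)
    for (item, _month), n in cube.items():
        coarse[item] += n
    return dict(coarse)

by_item = rollup(by_item_month)
```

Drilling down simply means querying by_item_month again rather than the rolled-up by_item view.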
Hierarchies on Dimensions

 A hierarchy on dimension attributes lets dimensions be viewed at different levels of detail
Cross Tabulation With Hierarchy

 Cross-tabs can be easily extended to deal with hierarchies
 Can drill down or roll up on a hierarchy
 E.g., hierarchy: item_name → category
Reporting and Visualization

 Reporting tools help create formatted reports with tabular/graphical representations of data
 Data visualization tools help create interactive visualizations of data
◼ E.g., PowerBI, Tableau, FusionCharts, plotly, Datawrapper, Google Charts, etc.
Reporting and Visualization

Data Mining

 Data mining is the process of semi-automatically analyzing large databases to find useful patterns
 Some types of knowledge can be represented as rules
 More generally, knowledge is discovered by applying machine learning techniques to past instances of data to form a model
Types of Data Mining Tasks

 Prediction based on past history
◼ Predict whether a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history
 Some examples of prediction mechanisms:
◼ Classification
 Items (with associated attributes) belong to one of several classes
 Training instances have attribute values and classes provided
◼ Regression formulae
 Given a set of mappings for an unknown function, predict the function result for a new parameter value
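Classification as described above — training instances with attribute values and a class, a new instance assigned a class — can be illustrated with a tiny nearest-neighbour sketch (one simple classifier among many; the credit-risk data and features below are entirely made up):

```python
# Training instances: (income in $k, age) -> credit risk class
train = [
    ((20, 25), "bad"), ((90, 45), "good"),
    ((75, 38), "good"), ((25, 52), "bad"),
]

def classify(x):
    # Assign the class of the nearest training instance
    # (squared Euclidean distance over the attribute values).
    def dist(a):
        return sum((ai - xi) ** 2 for ai, xi in zip(a, x))
    return min(train, key=lambda t: dist(t[0]))[1]

risk = classify((80, 40))  # a new applicant's attributes
```

A regression mechanism would instead fit a formula (e.g., by least squares) to the training mappings and evaluate it on the new attribute values.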
THANK YOU
