07-BigData-DataAnalysis
Lê Hồng Hải
UET-VNUH
Big Data Overview
1 Introduction
4 Streaming
Big Data
Big data architecture components
Big Data Storage
Distributed File Systems
Hadoop File System Architecture
Hadoop Distributed File System (HDFS)
Data Coherency
◼ Write-once-read-many access model
◼ Client can only append to existing files
Distributed file systems are good for storing millions of large files
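The write-once-read-many model above can be illustrated with a toy, in-memory sketch. This is not the HDFS client API; the class `AppendOnlyFile` and its methods are hypothetical names used only to show that data can be appended and read, but never overwritten in place.

```python
class AppendOnlyFile:
    """Toy model of a write-once-read-many file (hypothetical, not HDFS code)."""

    def __init__(self):
        # Blocks that have been written are never modified afterwards.
        self._blocks = []

    def append(self, data: bytes) -> None:
        # The only supported mutation: add new data at the end of the file.
        self._blocks.append(bytes(data))

    def read(self) -> bytes:
        # Any number of readers can read the full contents.
        return b"".join(self._blocks)

    def write_at(self, offset: int, data: bytes) -> None:
        # In-place updates are not part of the access model.
        raise PermissionError("in-place writes are not supported; append instead")


log = AppendOnlyFile()
log.append(b"event-1\n")
log.append(b"event-2\n")
contents = log.read()
```

Because existing data never changes, readers do not need to coordinate with writers about updates to data they have already read.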
Big Data Storage
Sharding
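As a minimal sketch of sharding, the example below partitions records across several stores by hashing the key. Hash partitioning is only one possible strategy (range partitioning is another); `NUM_SHARDS`, `shard_for`, `put`, and `get` are hypothetical names, and the shards are plain dictionaries standing in for separate database servers.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical number of database servers


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Each dictionary stands in for one independent database.
shards = {i: {} for i in range(NUM_SHARDS)}


def put(key: str, value) -> None:
    shards[shard_for(key)][key] = value


def get(key: str):
    # The same hash routes the read to the shard that received the write.
    return shards[shard_for(key)].get(key)


put("user:42", {"name": "An"})
put("user:7", {"name": "Binh"})
```

A stable hash (rather than Python's built-in `hash`, which is randomized per process for strings) keeps the key-to-shard mapping the same across runs.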
Key Value Storage Systems
Big data architecture
Big Data Processing
Map-Reduce
Spark
Streaming
The MapReduce Paradigm
MapReduce - Dataflow
The MapReduce Paradigm
MapReduce Programming Model
Flow of Keys and Values
https://www.geeksforgeeks.org/how-to-execute-wordcount-program-in-mapreduce-using-cloudera-distribution-hadoop-cdh/
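The flow of keys and values in a word count job can be sketched in plain Python, without Hadoop. This is a single-process illustration under stated assumptions: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the grouping step stands in for the shuffle and sort that a real cluster performs between the map and reduce phases.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all counts collected for one word."""
    yield word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase, followed by grouping of intermediate values by key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: one call per distinct key, visited in sorted key order.
    output = {}
    for out_key in sorted(groups):
        for final_key, final_value in reduce_fn(out_key, groups[out_key]):
            output[final_key] = final_value
    return output


# Input records are (line number, line text) pairs.
lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "the fox")]
counts = run_mapreduce(lines, map_fn, reduce_fn)
```

The programmer supplies only `map_fn` and `reduce_fn`; partitioning, grouping, and scheduling belong to the framework, which is what `run_mapreduce` imitates here.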
Example
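[Figure: MapReduce execution. The user program copies itself to a master and to worker nodes. The master assigns map and reduce tasks. Map workers (Map 1 to Map n) read the input file partitions (Part 1 to Part n) and write intermediate files to local disk. Reduce workers (Reduce 1 to Reduce m) remotely read and sort the intermediate files, then write the output files (File 1 to File m).]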
Map Reduce vs. Databases
Map Reduce vs. Databases (Cont.)
Where is MapReduce Inefficient?
Spark
Spark RDDs
Word Count in Spark
Spark DataFrames and DataSet
Streaming Data
Streaming Data and Applications
Publish Subscribe Systems
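The publish-subscribe pattern can be sketched with a small in-memory broker. This is an assumption-laden toy, not the API of any real messaging system: `Broker`, `subscribe`, and `publish` are hypothetical names, and persistence, partitioning, and network delivery are not modeled.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory broker: publishers and subscribers only share topic names."""

    def __init__(self):
        # topic name -> list of subscriber callbacks
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every subscriber of the topic.
        for callback in self._subscribers[topic]:
            callback(message)
        return len(self._subscribers[topic])


broker = Broker()
alerts, audit = [], []
broker.subscribe("orders", alerts.append)
broker.subscribe("orders", audit.append)

delivered = broker.publish("orders", {"id": 1})   # two subscribers receive it
ignored = broker.publish("payments", {"id": 2})   # no subscribers on this topic
```

The publisher never references a subscriber directly, which is the decoupling that lets producers and consumers be added or removed independently.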
Apache Kafka
Big data architecture
Data Analytics
1. Overview
2. Data Warehousing (DW)
3. Online Analytical Processing (OLAP)
4. Data Mining
Overview
Common steps in data analytics
Data Analytics
Overview (Cont.)
Data Warehousing
Warehouse Design issues
Multidimensional Data
Fact Tables
Data Warehouse Star Schema
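A star schema places one fact table at the center, with foreign keys pointing to the surrounding dimension tables. The sketch below builds a tiny example with Python's standard `sqlite3` module; the table names (`sales`, `item`, `store`), the columns, and all data values are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes.
CREATE TABLE item  (item_id INTEGER PRIMARY KEY, item_name TEXT, category TEXT);
CREATE TABLE store (store_id INTEGER PRIMARY KEY, city TEXT);
-- Fact table: measures plus one foreign key per dimension.
CREATE TABLE sales (
    item_id  INTEGER REFERENCES item(item_id),
    store_id INTEGER REFERENCES store(store_id),
    quantity INTEGER,
    price    REAL
);
INSERT INTO item  VALUES (1, 'skirt', 'womenswear'), (2, 'shirt', 'menswear');
INSERT INTO store VALUES (10, 'Hanoi'), (20, 'Da Nang');
INSERT INTO sales VALUES (1, 10, 2, 15.0), (2, 10, 1, 20.0), (1, 20, 3, 15.0);
""")

# A typical analytic query: join the fact table to its dimensions, then aggregate.
rows = conn.execute("""
    SELECT i.category, s.city, SUM(f.quantity) AS total_quantity
    FROM sales AS f
    JOIN item  AS i ON i.item_id = f.item_id
    JOIN store AS s ON s.store_id = f.store_id
    GROUP BY i.category, s.city
    ORDER BY i.category, s.city
""").fetchall()
```

The fact table stays narrow (keys and measures), while the descriptive attributes used for grouping live in the dimension tables.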
More on Data Warehouse Star Schema
Multidimensional Data and Warehouse Schemas
Data lakes
Database Support for Data Warehouses
Column-oriented storage
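The difference between row-oriented and column-oriented layouts can be shown with plain Python lists. This is a conceptual sketch with hypothetical data, not the storage format of any particular database: the same three records are held once as a list of rows and once as one array per attribute.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"item": "skirt", "city": "Hanoi", "quantity": 2},
    {"item": "shirt", "city": "Hanoi", "quantity": 1},
    {"item": "skirt", "city": "Da Nang", "quantity": 3},
]

# Column-oriented layout: each attribute is stored as its own array.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one attribute reads only that attribute's array.
total_quantity = sum(columns["quantity"])


def fetch_row(i):
    """Reassembling a full record needs one lookup in every column."""
    return {name: values[i] for name, values in columns.items()}
```

Analytic queries usually touch a few attributes of many rows, which favors the column layout; fetching or updating whole records is where it costs more.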
Data Analysis and OLAP
Cross Tabulation
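A cross tabulation (pivot table) summarizes a measure by two attributes, one along the rows and one along the columns, with totals on the margins. The sketch below computes one from hypothetical sales records; the function name `cross_tab` and the data are assumptions for illustration, and the label `"total"` is assumed not to occur as a real attribute value.

```python
from collections import defaultdict

# Hypothetical (item, color, quantity) records.
sales = [
    ("skirt", "dark", 8),
    ("skirt", "pastel", 35),
    ("dress", "dark", 20),
    ("dress", "pastel", 10),
    ("skirt", "dark", 2),
]


def cross_tab(records):
    """Sum the measure per (row, column) cell, plus row, column, and grand totals."""
    table = defaultdict(lambda: defaultdict(int))
    for row_key, col_key, value in records:
        table[row_key][col_key] += value      # cell
        table[row_key]["total"] += value      # row total
        table["total"][col_key] += value      # column total
        table["total"]["total"] += value      # grand total
    return {row: dict(cols) for row, cols in table.items()}


tab = cross_tab(sales)
```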
Data Cube
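A data cube generalizes the cross tabulation to n dimensions: it holds the aggregate for every combination of dimensions, with the remaining dimensions rolled up. The sketch below computes all 2^n groupings for hypothetical data; `data_cube`, `DIMENSIONS`, and the placeholder value `"all"` are illustrative assumptions, and `"all"` is assumed not to occur as a real dimension value.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("item", "color", "size")

# Hypothetical fact records with one measure, "quantity".
facts = [
    {"item": "skirt", "color": "dark", "size": "S", "quantity": 2},
    {"item": "skirt", "color": "dark", "size": "M", "quantity": 5},
    {"item": "dress", "color": "pastel", "size": "M", "quantity": 3},
]


def data_cube(facts, dimensions, measure):
    """Aggregate the measure over every subset of the dimensions."""
    cube = defaultdict(int)
    for fact in facts:
        for r in range(len(dimensions) + 1):
            for kept in combinations(dimensions, r):
                # Dimensions not kept are rolled up into the value "all".
                key = tuple(fact[d] if d in kept else "all" for d in dimensions)
                cube[key] += fact[measure]
    return dict(cube)


cube = data_cube(facts, DIMENSIONS, "quantity")
```

For example, `cube[("skirt", "all", "all")]` is the total quantity of skirts across all colors and sizes, and `cube[("all", "all", "all")]` is the grand total.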
Online Analytical Processing Operations
Hierarchies on Dimensions
Cross Tabulation With Hierarchy
Reporting and Visualization
Data Mining
Types of Data Mining Tasks
THANK YOU