Introduction To Big Data Analytics

This document provides an introduction to big data analytics. It (1) reviews the history and evolution of big data, from tools such as MapReduce and Hadoop to newer systems such as Spark; (2) outlines the key characteristics of big data, including the 3Vs, 4Vs, and 6Vs models that describe volume, velocity, variety, and other attributes; and (3) introduces machine learning and its use in big data analytics, along with the role of cloud computing in providing scalable infrastructure for analyzing large datasets.

Uploaded by

Trần Nguyên Thái Bảo

INTRODUCTION TO

BIG DATA ANALYTICS

Quách Đình Hoàng
Content
• A historical review for Big Data
• 3Vs, 4Vs, and 6Vs characteristics of Big Data
• Machine Learning (ML)
• Big Data and cloud computing
• Hadoop, Hadoop distributed file system (HDFS),
MapReduce, Spark
• BDA = ML + CC (Cloud Computing)

A Short History of Big Data (1)

A Short History of Big Data (2)

Typical Size of Different Data Files

Media     | Average Size | Notes (2014)
Web page  | 1.6–2 MB     | Average 100 objects
eBook     | 1–5 MB       | 200–350 pages
Song      | 3.5–5.8 MB   | Average 1.9 MB per minute (MP3), 256 Kbps rate (3 mins)
Movie     | 100–120 GB   | 60 frames per second (MPEG-4 format, Full High Definition, 2 hours)
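
The Song row can be cross-checked with a little arithmetic, and the Movie row implies an average bitrate. The sketch below assumes decimal units (1 MB = 1,000,000 bytes) and a constant bitrate; the function names are illustrative and do not come from the slides.

```python
def mb_per_minute(bitrate_kbps: float) -> float:
    """Size in megabytes of one minute of audio at a constant bitrate."""
    bits_per_minute = bitrate_kbps * 1000 * 60
    return bits_per_minute / 8 / 1_000_000


def average_bitrate_mbps(size_gb: float, duration_hours: float) -> float:
    """Average bitrate in megabits per second implied by a size and duration."""
    bits = size_gb * 1_000_000_000 * 8
    seconds = duration_hours * 3600
    return bits / seconds / 1_000_000


song_rate = mb_per_minute(256)               # MB per minute at 256 Kbps
song_size = song_rate * 3                    # a 3-minute song
movie_rate = average_bitrate_mbps(100, 2)    # 100 GB spread over 2 hours

print(f"{song_rate:.2f} MB/min, {song_size:.2f} MB per 3-minute song")
print(f"{movie_rate:.0f} Mbit/s average for a 100 GB, 2-hour movie")
```

Under these assumptions the result is 1.92 MB per minute, which agrees with the 1.9 MB per minute quoted in the table, and a 3-minute song lands at the upper end of the 3.5–5.8 MB range.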

The data evolution over the years

Big Data Phenomenon - Data Never Sleeps

Source: https://www.domo.com/learn/infographic/data-never-sleeps-9

3V Characteristics of Big Data

3-6Vs Characteristics of Big Data

Machine learning process

Replacing humans in the learning process
• The ultimate goal of ML is to build systems that perform complex
tasks at the level of human competence

Big Data Analytics and Cloud Computing

• Cloud Computing (CC) plays a critical role in the Big Data Analytics
(BDA) process
• It offers subscription-oriented access to computing infrastructure, data, and
application services
• The original objective of BDA was to leverage commodity hardware to
build computing clusters and scale out the computing capacity
• Cost: enables many small and medium-sized companies to implement BDA
(pay as you go)
• Scalability: almost “infinite” capacity
• Elasticity: easily scale out and scale back down

Scale out vs. scale up
• Scale out = horizontal scaling (add more machines)
• Scale up = vertical scaling (add resources to an existing machine)
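
As a rough illustration of the two strategies, the sketch below models a cluster as a node count and a per-node capacity. The `Cluster` class and its abstract "work units" are hypothetical; the point is only that both strategies raise total capacity, in different ways.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    nodes: int               # number of machines
    capacity_per_node: int   # abstract work units each machine can handle

    @property
    def total_capacity(self) -> int:
        return self.nodes * self.capacity_per_node

    def scale_out(self, extra_nodes: int) -> "Cluster":
        """Horizontal scaling: add more machines of the same size."""
        return Cluster(self.nodes + extra_nodes, self.capacity_per_node)

    def scale_up(self, extra_capacity: int) -> "Cluster":
        """Vertical scaling: make each existing machine bigger."""
        return Cluster(self.nodes, self.capacity_per_node + extra_capacity)


base = Cluster(nodes=4, capacity_per_node=10)
wider = base.scale_out(4)    # 8 nodes x 10 units
taller = base.scale_up(10)   # 4 nodes x 20 units
print(base.total_capacity, wider.total_capacity, taller.total_capacity)
```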

Cloud computing services
• Infrastructure as a Service (IaaS)
• Serve computing resources: CPU, storage, networks, …
• Amazon EC2, Rackspace, …
• Platform as a Service (PaaS)
• Serve API, maintenance, upgrades
• Google App Engine, …
• Software as a Service (SaaS)
• Serve applications
• Gmail, Dropbox, …

Scope of Controls between Provider and
Consumer

Big Data Storage Systems
• Structured data: Data with a defined format and structure
• CSV files, spreadsheets, traditional relational databases, and OLAP
data cubes
• Semi-structured data: Textual data files with a flexible
structure that can be parsed
• XML, JSON
• Unstructured data: Data that have no inherent structure
• text documents, images, PDF files, and videos
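
The three categories can be contrasted with a small sketch that uses only the Python standard library. The sample records below are invented for illustration and do not come from the slides.

```python
import csv
import io
import json

# Structured: every row follows the same fixed columns.
csv_text = "id,name,city\n1,An,Hanoi\n2,Binh,Hue\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: records can be parsed, but fields may vary or nest.
json_text = '[{"id": 1, "name": "An", "tags": ["vip"]}, {"id": 2, "name": "Binh"}]'
records = json.loads(json_text)

# Unstructured: free text has no fields; any structure must be derived.
note = "An called on Monday about a late delivery."
word_count = len(note.split())

print(rows[0]["city"])               # field access by column name
print(records[1].get("tags", []))    # optional field handled with a default
print(word_count)
```

The second JSON record has no `tags` field, which is the kind of flexibility that a fixed-column format does not allow without empty cells.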

Types of NoSQL data stores

Hadoop ecosystem

Hadoop kernel
• HDFS (file storage), Map (distribute function), and
Reduce (parallel processing function)

Brief history of Hadoop

Google file system (GFS)
• The GFS architecture consists of three components
• Single master server (name node in Hadoop)
• Multiple chunk servers (data nodes in Hadoop)
• Multiple clients
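
How the three components interact can be sketched in a few lines: the master keeps metadata about which chunk servers hold which chunk, and a client asks the master where to read. The 64 MB chunk size and three replicas follow the values reported in the GFS paper; the round-robin placement and the names `place_chunks` and `locate` are simplifications assumed for this sketch.

```python
import math

CHUNK_SIZE_MB = 64   # chunk size reported in the GFS paper
REPLICAS = 3         # default replication factor reported in the GFS paper


def place_chunks(file_size_mb: int, servers: list[str]) -> dict[int, list[str]]:
    """Master-side metadata: map each chunk index to the servers holding a replica.

    Round-robin placement is a simplification chosen for this sketch.
    """
    n_chunks = math.ceil(file_size_mb / CHUNK_SIZE_MB)
    return {
        i: [servers[(i + r) % len(servers)] for r in range(REPLICAS)]
        for i in range(n_chunks)
    }


def locate(metadata: dict[int, list[str]], offset_mb: int) -> list[str]:
    """Client-side lookup: which chunk servers hold the data at this offset?"""
    return metadata[offset_mb // CHUNK_SIZE_MB]


servers = ["cs1", "cs2", "cs3", "cs4"]
metadata = place_chunks(200, servers)   # a 200 MB file needs 4 chunks
print(len(metadata))
print(locate(metadata, 130))            # offset 130 MB falls in chunk 2
```

The sketch needs at least as many servers as replicas for the copies of a chunk to land on distinct machines.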

MapReduce programming model
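
The model can be illustrated with the classic word-count example. The sketch below runs in a single Python process and only mimics the map, shuffle, and reduce phases; it is not Hadoop's API, and the function names are chosen for this illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


splits = ["big data big clusters", "data never sleeps"]
mapped = chain.from_iterable(map_phase(s) for s in splits)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster each split would be mapped on a different node and the shuffle would move data over the network; here both steps are ordinary function calls.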

Evolution of GFS, HDFS, MapReduce, and
Hadoop

The origin of Hadoop project
• Lucene
• a high-performance, scalable information retrieval (IR) library
• written in Java by Doug Cutting in 2000
• In September 2001, Lucene joined the Apache Software Foundation (ASF)
• Nutch
• Nutch is the predecessor of Hadoop, built by Doug Cutting in 2002
• There are two main reasons to develop Nutch
• Create a Lucene index (web crawler)
• Help developers query their index
• Mahout
• a Java-based ML library covering major families of ML algorithms
• Collaborative filtering (recommender engines)
• Clustering
• Classification

Apache Lucene

Spark
• Spark was developed by the UC Berkeley AMP Lab
• The main contributors are Matei Zaharia et al.
• It aims to replace the MapReduce model with a better solution
• It can be 10-20 times faster than MapReduce for certain types of
workloads
• Although it attempts to replace MapReduce, it leverages Hadoop’s file
storage system
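
One commonly cited reason for the speedup is that Spark can keep intermediate results in memory and reuse them across iterations, while a chain of MapReduce jobs writes them to disk between stages. The sketch below is a toy model of that idea in plain Python. `LazyDataset` is a hypothetical class, not Spark's API; it only counts how often the source has to be re-read.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset with optional caching."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function producing a list
        self._cached = None
        self._use_cache = False

    def map(self, fn):
        """Record a transformation; nothing runs until collect() is called."""
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        """Ask for the result to be kept in memory after the first evaluation."""
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


stats = {"loads": 0}


def load_source():
    """Pretend this is an expensive read from distributed storage."""
    stats["loads"] += 1
    return [1, 2, 3, 4]


squares = LazyDataset(load_source).map(lambda x: x * x).cache()
totals = [sum(squares.collect()) for _ in range(3)]   # three "iterations"
print(totals, stats["loads"])
```

With the `cache()` call the source is read once for all three iterations; without it, every `collect()` would trigger a fresh read.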

Differences on data transfer speed

Spark framework vs Hadoop framework

Spark history

Spark analytic stack

Big Data 2.0 processing systems

BDA = ML + CC
• Big Data Analytics: the execution of machine learning tasks on
large datasets in cloud computing environments
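
A minimal sketch of what "an ML task on partitioned data" can look like: each partition computes small summary statistics, and only those summaries are combined to fit a line by least squares. The partitions stand in for worker nodes; everything runs in one process here, and the function names are assumptions made for this illustration.

```python
from functools import reduce


def partial_stats(partition):
    """Map step: per-partition sums needed for simple linear regression."""
    n = len(partition)
    sx = sum(x for x, _ in partition)
    sy = sum(y for _, y in partition)
    sxx = sum(x * x for x, _ in partition)
    sxy = sum(x * y for x, y in partition)
    return n, sx, sy, sxx, sxy


def combine(a, b):
    """Reduce step: element-wise sum of two partial results."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(partitions):
    """Fit y = slope * x + intercept from partitioned (x, y) pairs."""
    n, sx, sy, sxx, sxy = reduce(combine, map(partial_stats, partitions))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Points on y = 2x + 1, split across three hypothetical worker nodes.
partitions = [[(0, 1), (1, 3)], [(2, 5), (3, 7)], [(4, 9)]]
slope, intercept = fit_line(partitions)
print(slope, intercept)
```

Only five numbers per partition cross the network in this scheme, regardless of how many points each partition holds.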

References
• Caesar Wu, Rajkumar Buyya, and Kotagiri Ramamohanarao, Big Data
Analytics = Machine Learning + Cloud Computing, In Big Data:
Principles and Paradigms, Morgan Kaufmann, 2016.
http://www.cloudbus.org/papers/BigDataAnalytics2016.pdf
• Domo, Data Never Sleeps 9, 2021.
https://www.domo.com/learn/infographic/data-never-sleeps-9
• Sherif Sakr, Big Data 2.0 Processing Systems: A Survey, 2nd Edition,
Springer, 2020.
