Introduction To Big Data Analytics

This document introduces big data analytics. It (1) reviews the history and evolution of big data, from tools such as MapReduce and Hadoop to newer systems such as Spark; (2) outlines the key characteristics of big data, including the 3Vs, 4Vs, and 6Vs models that describe volume, velocity, variety, and other attributes; and (3) introduces machine learning and its use in big data analytics, as well as the role of cloud computing in providing scalable infrastructure for analyzing large datasets.

INTRODUCTION TO BIG DATA ANALYTICS


Quách Đình Hoàng
Content
• A historical review for Big Data
• 3Vs, 4Vs, and 6Vs characteristics of Big Data
• Machine Learning (ML)
• Big Data and cloud computing
• Hadoop, Hadoop Distributed File System (HDFS), MapReduce, Spark
• BDA = ML + CC (Cloud Computing)

A Short History of Big Data (1)

A Short History of Big Data (2)

Typical Size of Different Data Files

Media      Average Size   Notes (2014)
Web page   1.6–2 MB       Average 100 objects
eBook      1–5 MB         200–350 pages
Song       3.5–5.8 MB     Average 1.9 MB per minute (MP3), 256 Kbps rate (3 mins)
Movie      100–120 GB     60 frames per second (MPEG-4 format, Full High Definition, 2 hours)
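
The song figures in the table can be checked with simple arithmetic: a constant bitrate multiplied by the duration gives the file size. The sketch below assumes decimal units (1 Kbps = 1,000 bits per second, 1 MB = 10^6 bytes, 1 GB = 10^9 bytes); the function names are illustrative and do not come from the slides.

```python
def media_size_mb(bitrate_kbps: float, seconds: float) -> float:
    """Approximate size in MB of a constant-bitrate stream (decimal units assumed)."""
    bits = bitrate_kbps * 1000 * seconds
    return bits / 8 / 1_000_000


def implied_bitrate_mbps(size_gb: float, seconds: float) -> float:
    """Average bitrate in Mbps implied by a file size and a duration."""
    megabits = size_gb * 1000 * 8
    return megabits / seconds


# One minute of 256 Kbps audio, then a 3-minute song.
per_minute_mb = media_size_mb(256, 60)
song_mb = media_size_mb(256, 3 * 60)

# Average bitrate implied by a 100 GB, 2-hour movie.
movie_mbps = implied_bitrate_mbps(100, 2 * 60 * 60)

print(f"{per_minute_mb:.2f} MB per minute, {song_mb:.2f} MB per song")
print(f"{movie_mbps:.1f} Mbps average for the movie")
```

At 256 Kbps this gives roughly 1.9 MB per minute and a 3-minute song near the top of the 3.5–5.8 MB range, which agrees with the table.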

The data evolution over the years


Big Data Phenomenon - Data Never Sleeps
Source: https://www.domo.com/learn/infographic/data-never-sleeps-9

3V Characteristics of Big Data

3-6Vs Characteristics of Big Data

Machine learning process

Replacing humans in the learning process
• The ultimate goal of ML is to build systems that perform complex tasks
at a human level of competence

Big Data Analytics and Cloud Computing

• Cloud Computing (CC) plays a critical role in the Big Data Analytics
(BDA) process
  • It offers subscription-oriented access to computing infrastructure, data, and
  application services
• The original objective of BDA was to leverage commodity hardware to
build computing clusters and scale out the computing capacity
  • Cost: pay-as-you-go pricing enables many small to medium companies to
  implement BDA
  • Scalability: almost "infinite" capacity
  • Elasticity: capacity can easily be scaled out and back down

Scale out vs. scale up
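• Scale out = horizontal scaling (add more machines to the cluster)
• Scale up = vertical scaling (add more resources, such as CPU or memory, to a single machine)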

Cloud computing services
• Infrastructure as a Service (IaaS)
  • Provides computing resources: CPU, storage, networks, …
  • Amazon EC2, Rackspace, …
• Platform as a Service (PaaS)
  • Provides APIs, maintenance, and upgrades
  • Google App Engine, …
• Software as a Service (SaaS)
  • Provides applications
  • Gmail, Dropbox, …

Scope of Control between Provider and Consumer

Big Data Storage Systems
• Structured data: data with a defined format and structure
  • CSV files, spreadsheets, traditional relational databases, and OLAP data cubes
• Semi-structured data: textual data files with a flexible structure that can be parsed
  • XML, JSON
• Unstructured data: data that have no inherent structure
  • Text documents, images, PDF files, and videos
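
The three categories above differ mainly in how much structure a program can rely on when reading the data. The sketch below illustrates this using only the Python standard library; the sample records and field names are made up for illustration.

```python
import csv
import io
import json

# Structured: every row follows the same fixed schema (hypothetical sample data).
csv_text = "id,name,city\n1,An,Hanoi\n2,Binh,Hue\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: each record can be parsed, but fields may differ per record.
json_text = '[{"id": 1, "name": "An", "tags": ["vip"]}, {"id": 2, "name": "Binh"}]'
records = json.loads(json_text)
tags_per_record = [record.get("tags", []) for record in records]

# Unstructured: free text has no schema; the analysis must impose one,
# here by naively splitting on whitespace.
text = "An visited Hanoi. Binh stayed in Hue."
word_count = len(text.split())

print(rows[1]["city"], tags_per_record, word_count)
```

The CSV rows can be addressed by column name without any checks, the JSON records need a default for the optional `tags` field, and the free text offers nothing beyond what the tokenizer decides to extract.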

Types of NoSQL data stores

Hadoop ecosystem

Hadoop kernel
• HDFS (distributed file storage), Map (distributes the computation across
input splits), and Reduce (aggregates the intermediate results in parallel)
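
The Map and Reduce roles can be illustrated with the classic word-count example. This is a single-process sketch in plain Python, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative, and a real cluster would run the map and reduce calls on different nodes with the input read from HDFS.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key before reducing."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents: List[str]) -> Dict[str, int]:
    # Each document stands in for one input split processed by one mapper.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "big clusters process big data"])
print(counts)
```

Because each `map_phase` call depends only on its own document and each `reduce_phase` call only on its own key, both phases can be spread across many machines, which is what makes the model suitable for scale-out clusters.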

Brief history of Hadoop

Google file system (GFS)
• The GFS architecture consists of three components:
  • A single master server (the name node in Hadoop)
  • Multiple chunk servers (the data nodes in Hadoop)
  • Multiple clients
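
The division of labour between the three components can be sketched as a toy in-memory model: the master keeps only metadata, the chunk servers keep the bytes, and the client consults the master before talking to the chunk servers directly. All class and attribute names here are hypothetical, the chunk size is tiny on purpose, and the sketch omits replication, leases, and failure handling.

```python
from dataclasses import dataclass, field
from typing import Dict, List

CHUNK_SIZE = 4  # bytes; tiny for illustration (GFS itself used 64 MB chunks)


@dataclass
class ChunkServer:
    """Stores the actual chunk bytes (a data node in Hadoop terms)."""
    chunks: Dict[str, bytes] = field(default_factory=dict)


@dataclass
class Master:
    """Holds metadata only: which chunks form a file and where each one lives."""
    file_chunks: Dict[str, List[str]] = field(default_factory=dict)
    chunk_location: Dict[str, int] = field(default_factory=dict)


class Client:
    def __init__(self, master: Master, servers: List[ChunkServer]) -> None:
        self.master = master
        self.servers = servers

    def write(self, name: str, data: bytes) -> None:
        handles = []
        for offset in range(0, len(data), CHUNK_SIZE):
            index = offset // CHUNK_SIZE
            handle = f"{name}#{index}"
            server_id = index % len(self.servers)  # simple round-robin placement
            self.servers[server_id].chunks[handle] = data[offset:offset + CHUNK_SIZE]
            self.master.chunk_location[handle] = server_id
            handles.append(handle)
        self.master.file_chunks[name] = handles

    def read(self, name: str) -> bytes:
        # Ask the master for metadata, then fetch bytes from the chunk servers.
        handles = self.master.file_chunks[name]
        return b"".join(
            self.servers[self.master.chunk_location[h]].chunks[h] for h in handles
        )


master = Master()
servers = [ChunkServer(), ChunkServer()]
client = Client(master, servers)
client.write("log.txt", b"hello world")
restored = client.read("log.txt")
print(restored, master.file_chunks["log.txt"])
```

Keeping file contents off the master is what lets a single master coordinate a large cluster: it answers small metadata queries while the bulk data flows between clients and chunk servers.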

MapReduce programming model

Evolution of GFS, HDFS, MapReduce, and Hadoop

The origin of Hadoop project
• Lucene
  • A high-performance, scalable information retrieval (IR) library
  • Written in Java by Doug Cutting in 2000
  • In September 2001, Lucene joined the Apache Software Foundation (ASF)
• Nutch
  • Nutch is the predecessor of Hadoop, built by Doug Cutting in 2002
  • There were two main reasons to develop Nutch:
    • Create a Lucene index (web crawler)
    • Assist developers in querying their index
• Mahout
  • A Java-based ML library whose main algorithm families include:
    • Collaborative filtering (recommender engines)
    • Clustering
    • Classification

Apache Lucene
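
A search library such as Lucene is built around an inverted index, which maps each term to the documents that contain it. The sketch below is a minimal illustration in plain Python and is not Lucene's API: the function names and sample documents are made up, and real indexes add tokenization, scoring, and on-disk storage.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lowercase term to the set of document ids containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


docs = {
    1: "Hadoop stores big data",
    2: "Spark processes big data fast",
    3: "Lucene builds an index",
}
index = build_index(docs)
print(search(index, "big data"))
print(search(index, "lucene"))
```

Answering a query only touches the posting sets of the query terms, so lookup cost depends on those sets rather than on the total number of documents.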

Spark
• Spark was developed by the UC Berkeley AMPLab
• The main contributors are Matei Zaharia et al.
• It aims to replace the MapReduce model with a better solution
• It can be 10–20 times faster than MapReduce for certain types of
workloads
• Although it attempts to replace MapReduce, it still leverages Hadoop's file
storage system (HDFS)

Differences on data transfer speed

Spark framework vs Hadoop framework

Spark history

Spark analytic stack

Big Data 2.0 processing systems

BDA = ML + CC
• Big Data Analytics: the execution of machine learning tasks on
large datasets in cloud computing environments

References
• Caesar Wu, Rajkumar Buyya, and Kotagiri Ramamohanarao, Big Data
Analytics = Machine Learning + Cloud Computing, in Big Data:
Principles and Paradigms, Morgan Kaufmann, 2016.
http://www.cloudbus.org/papers/BigDataAnalytics2016.pdf
• Domo, Data Never Sleeps 9.0, 2021.
https://www.domo.com/learn/infographic/data-never-sleeps-9
• Sherif Sakr, Big Data 2.0 Processing Systems: A Survey, 2nd Edition,
Springer, 2020.
