Introduction To Big Data Analytics
A Short History of Big Data (1)
A Short History of Big Data (2)
Typical Size of Different Data Files
The data evolution over the years
Big Data Phenomenon - Data Never Sleeps
Source: https://www.domo.com/learn/infographic/data-never-sleeps-9
3V Characteristics of Big Data
3-6Vs Characteristics of Big Data
Machine learning process
Replacing humans in the learning process
• The ultimate goal of ML is to build systems that perform complex tasks
at the level of human competence
Big Data Analytics and Cloud Computing
• Cloud Computing (CC) plays a critical role in the Big Data Analytics
(BDA) process
• It offers subscription-oriented access to computing infrastructure, data, and
application services
• The original objective of BDA was to leverage commodity hardware to
build computing clusters and scale out the computing capacity
• Cost: enables many small and medium-sized companies to implement BDA
(pay as you go)
• Scalability: almost “infinite” capacity
• Elasticity: easily scale out and scale back in as demand changes
Scale out vs. scale up
• Scale out = horizontal scaling (add more machines to the cluster)
• Scale up = vertical scaling (add more resources to a single machine)
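The difference can be illustrated with a small capacity calculation. This is only a sketch: the function name `nodes_needed` and the load and capacity figures are hypothetical, chosen to show that the same workload can be served either by many small nodes or by one large node.

```python
import math

def nodes_needed(total_load, per_node_capacity):
    """Return how many nodes are required to serve total_load."""
    return math.ceil(total_load / per_node_capacity)

# Hypothetical workload of 1000 requests per second.
# Scale out: keep nodes small and add more of them.
scaled_out = nodes_needed(total_load=1000, per_node_capacity=100)

# Scale up: keep a single node and make it ten times more powerful.
scaled_up = nodes_needed(total_load=1000, per_node_capacity=1000)

print(scaled_out, scaled_up)
```

Scaling out keeps each machine cheap and lets capacity grow in small steps, which is why it fits clusters built from commodity hardware.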
Cloud computing services
• Infrastructure as a Service (IaaS)
• Serve computing resources: CPU, storage, networks, …
• Amazon EC2, Rackspace, …
• Platform as a Service (PaaS)
• Serve a development platform: APIs, maintenance, upgrades
• Google App Engine, Apple App Store, …
• Software as a Service (SaaS)
• Serve applications
• Gmail, Dropbox, …
Scope of Controls between Provider and Consumer
Big Data Storage Systems
• Structured data: Data with a defined format and structure
• CSV files, spreadsheets, traditional relational databases, and OLAP
data cubes
• Semi-structured data: Textual data files with a flexible
structure that can be parsed
• XML, JSON
• Unstructured data: Data that have no inherent structure
• text documents, images, PDF files, and videos
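The three categories differ mainly in how much a program can assume before reading the data. The following is a minimal sketch using only the Python standard library; the sample records and field names are made up for illustration.

```python
import csv
import io
import json

# Structured: every row follows the same fixed schema (hypothetical sample).
csv_text = "id,name,age\n1,Ada,36\n2,Alan,41\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: records can be parsed, but fields may vary per record.
json_text = '[{"id": 1, "name": "Ada", "tags": ["math"]}, {"id": 2, "name": "Alan"}]'
records = json.loads(json_text)

# Unstructured: there is no schema to parse against, so structure has to be
# imposed by the program (here, a naive whitespace split).
note = "Ada met Alan in London."
words = note.split()

print(rows[1]["age"])              # column access works for every row
print(records[1].get("tags", []))  # optional field needs a default
print(len(words))
```

Note that the CSV reader returns every value as a string, so typed columns still need an explicit conversion step.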
Types of NoSQL data stores
Hadoop ecosystem
Hadoop kernel
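• The kernel has three parts: HDFS (distributed file storage), Map (the
function distributed across the input data), and Reduce (the function
that combines the intermediate results in parallel)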
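The Map and Reduce roles can be sketched with the classic word-count example. This is a single-process illustration in plain Python, not Hadoop code: the helper names (`map_phase`, `shuffle`, `reduce_phase`) are assumptions made for this sketch, and a real cluster would run the map and reduce calls on different nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # Each line could be mapped independently on a different machine.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    # Each key could likewise be reduced independently.
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data big clusters", "data never sleeps"])
print(counts)
```

Because each map call touches only one line and each reduce call touches only one key, both phases can be parallelised without shared state.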
Brief history of Hadoop
Google file system (GFS)
• The GFS architecture consists of three components
• Single master server (the name node in Hadoop)
• Multiple chunk servers (the data nodes in Hadoop)
• Multiple clients
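How the three components interact can be sketched as a toy in-memory model. This is a simplified illustration under stated assumptions, not the GFS implementation: the class and method names are hypothetical, the chunk size is tiny on purpose, placement is round-robin, and replication is left out. The point it shows is that the master holds only metadata, while file data moves directly between clients and chunk servers.

```python
CHUNK_SIZE = 4  # bytes; deliberately tiny, an illustrative value only

class ChunkServer:
    """Stores chunk contents keyed by chunk handle."""
    def __init__(self):
        self.chunks = {}

class Master:
    """Holds metadata only: which chunks make up a file and where they live."""
    def __init__(self, num_servers):
        self.num_servers = num_servers
        self.file_chunks = {}  # filename -> list of (handle, server_id)

    def allocate(self, filename):
        chunks = self.file_chunks.setdefault(filename, [])
        index = len(chunks)
        handle = f"{filename}#{index}"
        server_id = index % self.num_servers  # round-robin placement (assumption)
        chunks.append((handle, server_id))
        return handle, server_id

    def lookup(self, filename):
        return list(self.file_chunks[filename])

class Client:
    """Asks the master where chunks live, then talks to chunk servers directly."""
    def __init__(self, master, servers):
        self.master = master
        self.servers = servers

    def write(self, filename, data):
        for start in range(0, len(data), CHUNK_SIZE):
            handle, server_id = self.master.allocate(filename)
            self.servers[server_id].chunks[handle] = data[start:start + CHUNK_SIZE]

    def read(self, filename):
        return b"".join(
            self.servers[server_id].chunks[handle]
            for handle, server_id in self.master.lookup(filename)
        )

servers = [ChunkServer() for _ in range(3)]
master = Master(num_servers=len(servers))
client = Client(master, servers)
client.write("log.txt", b"hello big data")
restored = client.read("log.txt")
print(restored)
```

Keeping file contents off the master is what lets a single master coordinate many chunk servers without becoming a data bottleneck.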
MapReduce programming model
Evolution of GFS, HDFS, MapReduce, and Hadoop
The origin of Hadoop project
• Lucene
• A high-performance, scalable information retrieval (IR) library
• Written in Java by Doug Cutting in 2000
• In September 2001, Lucene joined the Apache Software Foundation (ASF)
• Nutch
• Nutch is the predecessor of Hadoop, built by Doug Cutting in 2002
• Nutch was developed for two main reasons
• To build a Lucene index from crawled web pages (web crawler)
• To help developers query that index
• Mahout
• A Java-based ML library that covers the main families of ML algorithms
• Collaborative filtering (recommender engines)
• Clustering
• Classification
Apache Lucene
Spark
• Spark was developed by the UC Berkeley AMP Lab
• The main contributors are Matei Zaharia et al.
• It aims to replace the MapReduce model with a faster alternative
• It can be 10-20 times faster than MapReduce for certain types of
workloads
• Although it aims to replace MapReduce, it still leverages Hadoop’s file
storage system
Differences on data transfer speed
Spark framework vs Hadoop framework
Spark history
Spark analytic stack
Big Data 2.0 processing systems
BDA = ML + CC
• Big Data Analytics: the execution of machine learning tasks on
large datasets in cloud computing environments
References
• Caesar Wu, Rajkumar Buyya, and Kotagiri Ramamohanarao, Big Data
Analytics = Machine Learning + Cloud Computing, In Big Data:
Principles and Paradigms, Morgan Kaufmann, 2016.
http://www.cloudbus.org/papers/BigDataAnalytics2016.pdf
• Domo, Data Never Sleeps 9, 2021.
https://www.domo.com/learn/infographic/data-never-sleeps-9
• Sherif Sakr, Big Data 2.0 Processing Systems: A Survey, 2nd Edition,
Springer, 2020.