BDA 01 - Introduction
Lesson 1: Introduction
Big Data???
Analysis???
Business
Knowledge???
Contents
What's Big Data?
Using the Google Trends site, check how the popularity of the search term “Big Data” has
declined compared with similar search terms: “machine learning” and “artificial intelligence”.
Popularity of ‘Big Data’ Search Term on Google
What exactly is Big Data?
Characteristics of Big Data: The x V’s
The 4Vs of Big Data
History of Big Data
Examples of Big Data
Articles talking about Big Data volumes
Very Big Data at Netflix
Sources of Big Data
Some sources
Applications of Big Data
Industries Using Big Data
Online Retail, Search, Finance, Manufacturing, Automobile, Medicine
What Do They Do with the Data?
1. Amazon uses its massive data to build recommendation systems
Big Data Tools and Ecosystem
What Do We Want to Do with Big Data?
• Store, manage and retrieve
• Analyze and visualize
• Build models (ML)
The Big Data tools, methodologies, frameworks and ecosystems address these tasks. Most
technologies cover more than one of them. For instance, Hadoop gives you both storage
and analysis capabilities.
Tools and Platforms
• Cloud providers
• Hadoop, HDFS, MapReduce
• Apache Spark
• NoSQL Databases
How Some of these Components Look in a Full System
Summary on Big Data
• When it comes to size, Big Data is a relative term, but you will know
you have a large dataset if you can’t process it on a single computer
or can’t handle it with traditional software such as Excel.
• The Big Data technology landscape is now dominated by cloud
providers such as AWS, Google, Azure and others. Other notable
vendors include Databricks, Cloudera/Hortonworks and MapR.
• Big Data has penetrated many industries, including government.
Further Reading on Big Data Basics
Big Data and Parallel Processing
What is Parallel Processing?
Most Big Data Frameworks utilize both data and task parallelism
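The difference between the two kinds of parallelism can be sketched with the standard library alone. This is only an illustration: the helper names (`chunked`, `data_parallel_sum`, `task_parallel_stats`) are made up for this example and do not come from any Big Data framework. Threads are used to keep the sketch short; in CPython they will not speed up CPU-bound work like this, so the point is the structure, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, n_chunks):
    """Split a list into at most n_chunks contiguous slices of similar size."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division, at least 1
    return [values[i:i + size] for i in range(0, len(values), size)]


def data_parallel_sum(values, n_chunks=4):
    """Data parallelism: the SAME operation (sum) runs on DIFFERENT chunks."""
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partial_sums = list(pool.map(sum, chunked(values, n_chunks)))
    return sum(partial_sums)  # combine the partial results


def task_parallel_stats(values):
    """Task parallelism: DIFFERENT operations run at once on the SAME data."""
    tasks = (("min", min), ("max", max), ("sum", sum))
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, values) for name, fn in tasks}
    return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    numbers = list(range(1, 101))
    print(data_parallel_sum(numbers))
    print(task_parallel_stats(numbers))
```

A framework that mixes both would typically split the data into partitions (data parallelism) and also run independent stages of a job side by side (task parallelism).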
Similar Concepts to Parallelism
• Distributed computing
• Concurrent processing
Why Parallel Processing?
Parallel vs. Linear Processing
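[Figure: two flowcharts. Each breaks a Problem into Instruction-1, Instruction-2, ..., Instruction-N, with Error branches and a final Output. One panel runs the instructions side by side, the other runs them one after another.]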
Advantages of Parallel Processing
Vertical Scaling Vs. Horizontal Scaling
Eventually vertical scaling fails
Horizontal Scaling is Better
A Cluster of computers
Concurrency and Parallelism in Python
Before we turn to the big guns (Big Data frameworks) to handle our Big Data, we
will look at how to achieve simple parallelism with vanilla Python.
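As a first taste of simple parallelism with vanilla Python, the sketch below spreads a CPU-bound toy task over several worker processes with `concurrent.futures.ProcessPoolExecutor` from the standard library. The task (counting primes by trial division) and the function names are assumptions chosen for illustration, not part of the course material.

```python
import math
from concurrent.futures import ProcessPoolExecutor


def count_primes(limit):
    """CPU-bound toy task: count the primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, math.isqrt(n) + 1)):
            count += 1
    return count


def count_primes_parallel(limits, max_workers=None):
    """Run count_primes for each limit in a separate worker process."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(count_primes, limits))


if __name__ == "__main__":
    # The guard matters: worker processes may re-import this module.
    limits = [20_000, 30_000, 40_000, 50_000]
    print(count_primes_parallel(limits))
```

Processes are used here instead of threads because each process has its own interpreter, so the work can run on several CPU cores at once.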
Concurrency and Parallelism in Python
Summary of Different Concurrency Types in Python
CPU-Bound and I/O-Bound Programs in Python
I/O Bound Vs. CPU Bound Processes
How to Achieve Concurrency in Python
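One common way to get concurrency for I/O-bound work is a thread pool. The sketch below is a minimal illustration under stated assumptions: `fake_download` is a hypothetical stand-in that only sleeps, where a real program would wait on a network or disk call, and `download_all` is likewise a made-up helper name.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(name, delay=0.1):
    """Stand-in for an I/O-bound call (network or disk); it only sleeps."""
    time.sleep(delay)
    return f"{name}: done"


def download_all(names, max_workers=8):
    """Overlap the waiting time of many I/O-bound calls using threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_download, names))  # results keep input order


if __name__ == "__main__":
    names = [f"file-{i}" for i in range(8)]
    start = time.perf_counter()
    results = download_all(names)
    elapsed = time.perf_counter() - start
    print(results)
    print(f"elapsed: {elapsed:.2f}s")
```

Because the threads spend their time waiting rather than computing, the calls overlap and the total time should be much closer to one delay than to eight. For CPU-bound work, separate processes are usually the better fit.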
How to Solve Big Data Problems in Python
Working with data in a distributed fashion is inherently
difficult, so make sure that you exhaust all other options
before jumping into Spark, Hadoop or other Big Data
frameworks.
Advice on Tackling a Big Data Problem
Some questions to ask yourself before you jump to the big guns
1. Can I optimize pandas to solve the problem? If you are using pandas for
data munging, you can often tune how it loads large datasets, depending
on the nature of your problem.
2. How about drawing a sample from the large dataset? Depending on your
use case, working on a sample may or may not be enough.
Just be careful that you sample correctly.
3. Can I use simple Python parallelism to solve the problem on my laptop?
Sometimes the data isn't that big, but you need to run more intensive
computations on it; multiprocessing can help.
4. Can I use a big data framework on my laptop? For some tasks, even with a
25GB dataset, frameworks like Spark and Dask can work on a single
laptop.
5. Do I need to build a cluster? Take time to think about which distribution of
Hadoop to use, which vendors to use, and whether you will put the cluster in
the cloud or on-premise. You will need input from IT people for this one.
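For question 2, one way to sample correctly from a dataset that does not fit in memory is reservoir sampling, which keeps a uniform random sample while reading the data once. The sketch below uses only the standard library. The function name `reservoir_sample` and the generated records are assumptions for illustration; in practice the stream would be the lines of a large file.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return up to k items drawn uniformly at random from a stream.

    The stream is read once and only k items are held in memory.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)        # fill the reservoir first
        else:
            j = rng.randint(0, i)     # inclusive on both ends
            if j < k:
                sample[j] = row       # keep the new row with probability k / (i + 1)
    return sample


if __name__ == "__main__":
    # A generator stands in for a large file read line by line.
    stream = (f"record-{n}" for n in range(100_000))
    print(reservoir_sample(stream, k=5, seed=42))
```

If the dataset has groups that must stay represented (for example, rare classes), a plain uniform sample may not be enough and a stratified sample would be the safer choice.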
Exercises
• Register an account on Microsoft Planetary Computer
• Analysis with Python
• Computer repair time
• Observing the repair times at a computer repair center, we have the following data table:
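No. Days | No. Days | No. Days | No. Days | No. Days | No. Days | No. Days
  1   18 |   8   29 |  15   11 |  22   12 |  29   16 |  36   11 |  43   13
  2   15 |   9   10 |  16   14 |  23   34 |  30   14 |  37   10 |  44    8
  3   17 |  10   14 |  17   13 |  24   29 |  31   15 |  38   13 |  45   10
  4    9 |  11   17 |  18   16 |  25   13 |  32    7 |  39   14 |  46   13
  5   37 |  12   12 |  19   13 |  26   19 |  33   40 |  40    9 |  47   16
  6   15 |  13   13 |  20   15 |  27   12 |  34   16 |  41   18 |  48    9
  7    8 |  14   12 |  21   16 |  28   15 |  35   11 |  42    8 |  49    9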
• Times are in days. Use the data to decide what appointment time to give the customer
when a new computer repair request comes in.
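One possible starting point for the repair-time exercise, offered as a sketch rather than the expected solution: load the 49 observations and ask how many days would have covered a given share of past repairs. The helper `days_to_promise` and the coverage levels are assumptions made for this example; other approaches (for instance, fitting a distribution) are equally valid.

```python
import math
import statistics

# Repair times in days, in observation order 1..49, transcribed from the table.
REPAIR_DAYS = [
    18, 15, 17, 9, 37, 15, 8,
    29, 10, 14, 17, 12, 13, 12,
    11, 14, 13, 16, 13, 15, 16,
    12, 34, 29, 13, 19, 12, 15,
    16, 14, 15, 7, 40, 16, 11,
    11, 10, 13, 14, 9, 18, 8,
    13, 8, 10, 13, 16, 9, 9,
]


def days_to_promise(times, coverage):
    """Smallest number of days that covers at least `coverage` of past repairs."""
    ordered = sorted(times)
    needed = math.ceil(coverage * len(ordered))  # repairs that must be covered
    return ordered[max(needed, 1) - 1]


if __name__ == "__main__":
    print("observations:", len(REPAIR_DAYS))
    print("mean:", round(statistics.mean(REPAIR_DAYS), 2))
    print("median:", statistics.median(REPAIR_DAYS))
    for coverage in (0.5, 0.8, 0.9):
        days = days_to_promise(REPAIR_DAYS, coverage)
        print(f"{coverage:.0%} of past repairs finished within {days} days")
```

The few very long repairs pull the mean above the median, so a promise based on a high coverage level will be noticeably longer than one based on the typical repair.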