Ch3 - Introduction To Big Data Analytics
Chapter 3
Introduction to Big Data Analytics
Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing
and Presenting Data.
Contents
o We live in the information age.
o Information leads to power and success, so analyzing such data is an important need.
Data Deluge
o It is difficult to keep up with this huge influx of data.
o Even more challenging is analyzing such vast amounts of data.
1. Huge volume: billions of rows and millions of columns.
Data Structures
1. Structured data:
• Has a predefined data type, format, and structure.
• Therefore straightforward to analyse.
• Conforms to a tabular format with relationships between the different rows and columns.
• The most traditional form of data storage.
• Examples: Excel files, SQL databases, transaction data.
Structured Data: Example
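As a minimal sketch of why structured data is straightforward to analyse, the snippet below loads a few rows into an in-memory SQL table and aggregates them. The table name, column names, and rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical transaction table; the schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 5.5), ("alice", 12.5)],
)

# Every row has the same typed columns, so aggregation is a single query.
totals = dict(
    conn.execute(
        "SELECT customer, SUM(amount) FROM transactions GROUP BY customer"
    ).fetchall()
)
conn.close()
print(totals)
```

The predefined schema is what makes the query possible without any parsing step.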
2. Semi-structured data:
• Textual data files with a pattern that enables parsing.
• A form of structured data that contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
• Easier to analyse than unstructured data.
• Examples: XML, JSON.
Semi-Structured Data: Example
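A small sketch of how tags and markers enable parsing, using JSON and Python's standard `json` module. The record and its field names are hypothetical.

```python
import json

# Hypothetical record; the field names are illustrative only.
raw = (
    '{"customer": {"name": "Alice", '
    '"orders": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}}'
)

record = json.loads(raw)

# The keys act as tags, so nested fields can be reached by name.
name = record["customer"]["name"]
total_qty = sum(order["qty"] for order in record["customer"]["orders"])
print(name, total_qty)
```

Unlike a fixed table, another record could carry extra or missing fields, so real code usually checks for a key before reading it.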
3. Quasi-structured data:
• Textual data with erratic data formats that can be
formatted with effort, tools, and time.
• Example: web clickstream data.
Quasi-structured Data: Example
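To illustrate the "effort and tools" that quasi-structured data needs, the sketch below turns a few clickstream URLs into uniform rows with Python's standard `urllib.parse` module. The URLs and the choice of extracted fields are assumptions made for this example; real clickstream logs vary widely.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical clickstream entries; real logs differ by site and tooling.
clicks = [
    "https://www.example.com/search?q=big+data&page=2",
    "https://www.example.com/products/item?id=42",
    "https://www.example.com/",
]

rows = []
for url in clicks:
    parts = urlparse(url)
    query = parse_qs(parts.query)  # values are lists; a key may be absent
    rows.append(
        {
            "host": parts.netloc,
            "path": parts.path,
            "search_term": query.get("q", [None])[0],
        }
    )

for row in rows:
    print(row)
```

Each URL has a different shape, so the code has to tolerate missing pieces (here, a missing `q` parameter becomes `None`).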
4. Unstructured data:
• No inherent structure; does not have a predefined
composition
• Typically text-heavy, but may contain dates, numbers, and
facts, resulting in irregularities and ambiguities
• Difficult to understand and analyse using traditional
programs
• Examples: PDFs, images, and video.
Unstructured Data: Example
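Unstructured text has no fields to query, but dates and numbers embedded in it can sometimes be pulled out with pattern matching. The sketch below uses regular expressions on a made-up sentence. The text and the two patterns (ISO-style dates, amounts with two decimals) are assumptions, and real free text would need far more robust handling.

```python
import re

# Made-up free text, for illustration only.
text = (
    "Customer called on 2023-04-12 about a 49.99 charge "
    "and again on 2023-05-03."
)

# Assumed formats: YYYY-MM-DD dates and amounts with exactly two decimals.
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)
amounts = re.findall(r"\b\d+\.\d{2}\b", text)

print(dates)
print(amounts)
```

A date written as "12 April" or an amount written as "fifty dollars" would be missed, which is the kind of irregularity and ambiguity that makes unstructured data difficult for traditional programs.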
Business Drivers
Business Intelligence (BI) Versus Data Science
(1) Data Devices and the “SensorNet” collect data from multiple locations and continuously generate new data about this data.
Examples: smartphone -> busy roads; shopping loyalty cards -> promotions.
(2) Data Collectors include entities that collect data from the devices and users.
Examples: TV provider -> TV content; shopping carts with RFID chips -> products.
(3) Data Aggregators make sense of the data collected from the various entities of the “SensorNet” or the “Internet of Things”.
Example: sell the data to brokers.
(4) At the outer edges of this web of the ecosystem are Data Users and Buyers. These groups directly benefit from the information.
Example: banks -> demographics of people with a specific threshold of debt who are searching the web for home remodeling.
Key Roles of the Data Ecosystem
• Quantitative skill (e.g., math, statistics)
• Technical aptitude (e.g., software engineering, programming)
• Critical thinking (ability to examine work critically)
Profile of Data Scientist
Read more: The Power of Habit: Why We Do What We Do in Life and Business
Churn Analysis
• Churn prediction is one of the most popular Big Data use cases in
business.
• It consists of detecting customers “who are likely to cancel a
subscription to a service.”
• It relies on how customers use the service.
• It asks the following question for each current customer: “Is this
customer going to leave us within the next X months?”
• There are only two possible answers, yes or no, which makes this a
binary classification task.
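The yes/no question above can be sketched as a tiny binary classifier. The slides do not name an algorithm, so logistic regression trained by gradient descent is used here purely as one common choice. The two usage features, the toy history, the scaling constants, and the 0.5 threshold are all hypothetical.

```python
import math

# Hypothetical history: (monthly_usage_hours, support_calls) -> 1 if the
# customer left, 0 if the customer stayed. Values are illustrative only.
history = [
    ((40.0, 0.0), 0),
    ((35.0, 1.0), 0),
    ((30.0, 0.0), 0),
    ((5.0, 4.0), 1),
    ((8.0, 3.0), 1),
    ((2.0, 5.0), 1),
]

def scale(features):
    # Assumed rough maxima (40 hours, 5 calls) bring both features near [0, 1].
    usage, calls = features
    return (usage / 40.0, calls / 5.0)

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train(data, epochs=2000, lr=0.5):
    # Batch gradient descent on the logistic loss.
    w0, w1, b = 0.0, 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        g0 = g1 = gb = 0.0
        for features, label in data:
            x0, x1 = scale(features)
            err = sigmoid(w0 * x0 + w1 * x1 + b) - label
            g0 += err * x0
            g1 += err * x1
            gb += err
        w0 -= lr * g0 / n
        w1 -= lr * g1 / n
        b -= lr * gb / n
    return w0, w1, b

def churn_probability(model, features):
    w0, w1, b = model
    x0, x1 = scale(features)
    return sigmoid(w0 * x0 + w1 * x1 + b)

def will_churn(model, features, threshold=0.5):
    # Binary answer to "is this customer going to leave us?"
    return churn_probability(model, features) >= threshold

model = train(history)
print(will_churn(model, (3.0, 4.0)))   # low usage, many support calls
print(will_churn(model, (38.0, 0.0)))  # heavy usage, no support calls
```

In practice the features would come from real usage records, and the model would be evaluated on customers it has not seen during training.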
Churn Analysis for Mobile Telecom
Using Social Network Analysis to Improve Churn Prediction