
Big Data Analytics

The document provides an overview of Big Data Analytics, covering key concepts such as the 7 Vs of Big Data, the importance of data science for business, and various data analytics skills and job roles. It outlines the data analytics process, including data preparation, preprocessing, and machine learning techniques, along with practical use cases like customer segmentation and churn prediction. Additionally, it discusses methodologies for data analysis, including A/B testing and market basket analysis.


Big Data Analytics

Annisa Aurelia Mufid


Data Analytics Lead - Telkom Indonesia

2022

Outline

Introduction to Data Analytics
➔ What’s Big Data?
➔ What’s Data Science?
➔ The Importance of Data Science for Business
➔ Data Science Job Roles
➔ Data Analytics Skills
➔ Level of Data Analytics
➔ Data Analytics Process
➔ Machine Learning Overview

Data Analytics Use Case
➔ Data Analytics Use Case
➔ Exploratory Data Analysis
➔ Cohort Customer Retention
➔ Customer Segmentation
➔ Association Rules
➔ Price Elasticity

Data Storytelling & Presentation
➔ Theory
➔ Presentation Example

Introduction to Data Analytics

What’s Big Data?

The 7 Vs of Big Data:

1. Value: the usefulness of the gathered data for your business
2. Volume: the size of the data
3. Variability: how much the meaning of the data keeps changing
4. Visualization: presenting the data in a manner that’s readable and accessible
5. Veracity: how accurate or truthful a data set may be
6. Variety: the different types of data
7. Velocity: how quickly data is generated and how quickly that data moves

What’s Data Science?

➔ an interdisciplinary field of study that uses data to extract knowledge and insight

The Importance of Data Science for Business

- Improve strategy and business decision-making
- Gain customer insights
- Predict market trends
- Make a better product
- Manage the business efficiently

Data Science Job Roles

Data Engineer
Data engineers work in a variety of settings to build systems that collect, manage, and convert raw data into usable information for data scientists and data analysts to interpret.

Data Analyst
Data analysts work closely with business stakeholders, creating reports and dashboards or generating insights to support them in decision-making.

Data Scientist
Data scientists need advanced skills and techniques, such as designing data modeling processes and creating algorithms and predictive models, to embed insights into the business.

Data Analytics Skills

Data analytics is simply a branch under the wider concept of data science.

Data analytics involves an inquiry into a hypothesis with the primary objective of uncovering
insights that would support and grow a business in a particular area.

Skills:

- SQL
- Microsoft Excel
- R or Python Programming
- Data Visualization
- Presentation Skills
- Critical Thinking
- Machine Learning

Level of Analytics

● Descriptive: analysis of historical data.

● Diagnostic: use of historical data to identify a product failure pattern and determine the failure’s root cause.

● Predictive: use of modeling, data mining, and machine learning to analyze both real-time and historical data in order to predict and anticipate future events based on patterns found in the data.

● Prescriptive: a suggested next step and/or decision is identified, evaluated, and can be automatically enabled.

Analytics Process

1. Business Objectives: What do you want to achieve?
2. KPI Metrics: What indicators will you measure?
3. Initiatives: What actions do you need to take?
4. Data Science Solution: Which data science solutions do you need to implement?
5. Implement and Evaluate the Results: What have you achieved and learned?

Workflow stages, in order:

1. Business Understanding
2. Data Preparation
3. Data Preprocessing
4. Data Analysis / Exploration
5. Modeling
6. Interpret and Present

Business Understanding

➔ Dig beneath the surface to uncover the structure of the business problem and the data that are available, and then match them to one or more data mining tasks for which we may have substantial science and technology to apply.

1. Gather background information
- Compiling the business background
- Defining business objectives
- Business success criteria

2. Assessing the situation
- Requirements, assumptions, and constraints

3. Determining data science goals
- Data science goals
- Data science success criteria

Data Preparation

➔ How to cast the business problem as one or more data science problems. Framing a business problem in terms of expected value can allow us to systematically decompose it into data mining tasks.

1. Collect Initial Data
- Data gathering from various sources (csv, database)

2. Describe Data
- Amount of data
- Data description
- Data types

3. Explore Data and Verify Data Quality
- Data distribution
- Missing data
- Outliers

Level of Measurements

Data is either categorical or numerical.

Categorical (can be grouped)
- Nominal: cannot be arranged in any particular order (e.g. gender)
- Ordinal: can be arranged in order (e.g. high/medium/low)

Numerical (measure)
- Interval: doesn’t have a “true zero” value (e.g. temperature)
- Ratio: has a “true zero” value (e.g. score (0-100))

Data Types

Numeric
- Integer: number without decimal point (e.g. 1, 2, 3)
- Float: number with decimal point (e.g. 1.25, 1.50)

Character String
- Char: fixed length (e.g. ABC, ABD, ABE)
- Varchar: variable length (e.g. ABC, ABCD, AB)

Date Time
- Consists of day, month, year, hour, minute, second (e.g. 12/03/2021 01:20:05)

Binary
- 0 or 1, true or false

Database Relationship

Transaction (Fact Table)
- Transaction ID
- Customer ID
- Product ID
- Time ID
- Location ID
- Price
- Discount
- Qty Sold
- Sales

Customer
- Customer ID
- Customer Name
- Customer LTV

Product
- Product ID
- Product Name
- Product Category

Time
- Time ID
- Day
- Month
- Year

Location
- Location ID
- City
- Province

Describe Data
Column | Description | Data Type | Completeness | Distribution | Example
Transaction ID | Identity number for transaction | Char (10) | 100% | Unique values: 200,000 | TRX0000001
Customer ID | Identity number for customer | Char (10) | 80% | Unique values: 10,000 | CS00000001
Product ID | Identity number for product | Char (10) | 95% | Unique values: 500 | TA00000001
Time ID | Identity number for transaction time | Numeric (8) | 100% | Unique values: 365; Min: 20210101; Max: 20211231 | 20210101
Location ID | Identity number for transaction location | Char (3) | 70% | Unique values: 100 | JKT
Price | Transaction price | Float | 100% | Min: 10,000; Max: 1,000,000; Med: 400,000; Avg: 350,000 | 100,000
Discount | Discount amount for this transaction | Float | 80% | Min: 10,000; Max: 1,00,000; Med: 50,000; Avg: 20,000 | 10,000
Qty Sold | Quantity sold for this transaction | Integer | 100% | Min: 1; Max: 1,000; Med: 25; Avg: 20 | 1
Sales | Transaction sales | Float | 100% | Min: 10,000; Max: 10,000,000; Med: 400,000; Avg: 350,000 | 1,000,000

Data Preprocessing

➔ The data are manipulated and converted into forms that yield better results.

1. Select Right Data
- Select the tables and columns/features needed

2. Clean Data
- Handle missing values
- Handle outliers
- Delete duplicate rows

3. Extend Data
- Add new features by converting or transforming the data

Handle Missing Value


1. Drop or use a variable based on its proportion of missing values

Column | Missing Values | Completeness | Action
Age | 80% | 20% | Drop Age
Salary | 10% | 90% |

2. Drop rows with missing values

Row | Salary | Sales
1 | 10,000,000 | 100,000
2 | NULL | 200,000

Action: drop row 2.

3. Fill missing values with the average / median / mode

Row | Salary | Sales
1 | 10,000,000 | 100,000
2 | [Avg Salary] | 200,000

Action: fill Salary with the average salary.
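
The three options above can be sketched in a few lines of plain Python. This is a minimal illustration, not the deck's own script: the row layout, the helper names, and the completeness threshold are assumptions made for the example.

```python
from statistics import mean

# Hypothetical rows mirroring the Salary / Sales example; None marks a missing value.
rows = [
    {"row": 1, "salary": 10_000_000, "sales": 100_000},
    {"row": 2, "salary": None, "sales": 200_000},
]


def completeness(rows, column):
    """Share of rows in which `column` is present (not None)."""
    present = sum(1 for r in rows if r[column] is not None)
    return present / len(rows)


def columns_to_drop(rows, columns, min_completeness=0.5):
    """Option 1: list columns whose completeness is below an assumed threshold."""
    return [c for c in columns if completeness(rows, c) < min_completeness]


def drop_missing(rows, column):
    """Option 2: drop every row where `column` is missing."""
    return [r for r in rows if r[column] is not None]


def fill_with_mean(rows, column):
    """Option 3: replace missing values with the mean of the observed values."""
    avg = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: avg if r[column] is None else r[column]} for r in rows]


kept_rows = drop_missing(rows, "salary")
filled_rows = fill_with_mean(rows, "salary")
```

Which option fits depends on how much data is missing and why; the threshold passed to `columns_to_drop` is a judgment call rather than a fixed rule.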



Handle Outlier
➔ Outliers are values at the extreme ends of a dataset.

Steps:

1. Sort your data from low to high.
2. Identify the first quartile (Q1), the median, and the third quartile (Q3).
3. Calculate the interquartile range: IQR = Q3 – Q1
4. Calculate the upper fence = Q3 + (1.5 * IQR)
5. Calculate the lower fence = Q1 – (1.5 * IQR)
6. Use the fences to flag outliers: all values that fall outside the fences.
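
The steps above can be sketched with the Python standard library. The sample values are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption; other conventions give slightly different fences.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) computed with the IQR rule."""
    # quantiles() sorts the data itself; n=4 yields Q1, the median, and Q3.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return every value that falls outside the fences."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]


sample = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]  # hypothetical data
lower_fence, upper_fence = iqr_fences(sample)
outliers = find_outliers(sample)
```

For this sample the fences are 9.375 and 16.375, so only the value 102 is flagged.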

Data Transformation

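1. Categorical Data
- Convert to numeric

ID | GENDER | EDUCATION | GENDER_FLAG | EDUCATION_FLAG_1 | EDUCATION_FLAG_2 | EDUCATION_FLAG_3
1 | Female | SMP | 0 | 0 | 0 | 1
2 | Male | SMA | 1 | 0 | 1 | 0
3 | Male | S1 | 1 | 1 | 0 | 0
4 | Female | S1 | 0 | 1 | 0 | 0

(SMP = junior high school, SMA = senior high school, S1 = bachelor’s degree.)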
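
As a rough sketch, the flag columns in the table can be produced with a small encoder. The function and key names are illustrative; the mapping of S1, SMA, and SMP to flags 1 to 3 is read off the table.

```python
# Rows copied from the table.
rows = [
    {"id": 1, "gender": "Female", "education": "SMP"},
    {"id": 2, "gender": "Male", "education": "SMA"},
    {"id": 3, "gender": "Male", "education": "S1"},
    {"id": 4, "gender": "Female", "education": "S1"},
]

# Order matches EDUCATION_FLAG_1..3 in the table.
EDUCATION_LEVELS = ["S1", "SMA", "SMP"]


def encode(row):
    """Binary flag for gender, one flag column per education level (one-hot)."""
    encoded = {"id": row["id"], "gender_flag": 1 if row["gender"] == "Male" else 0}
    for i, level in enumerate(EDUCATION_LEVELS, start=1):
        encoded[f"education_flag_{i}"] = 1 if row["education"] == level else 0
    return encoded


encoded_rows = [encode(r) for r in rows]
```

Exactly one education flag is set per row, which is what makes this a one-hot encoding.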

2. Numerical Data
- Feature scaling
- Convert to group
- Transformation

ID | AGE | SALARY | AGE_SCALING | AGE_GROUP | SALARY_LOG
1 | 20 | 15,000,000 | -1.2 | 17 - 25 | 7.17
2 | 30 | 25,000,000 | -0.3 | 25 - 40 | 7.39
3 | 40 | 15,000,000 | 0.6 | 25 - 40 | 7.17
4 | 50 | 48,000,000 | 1.6 | 40 - 65 | 7.68
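
A minimal sketch of the three numeric transformations, using the four rows from the table. Several details are assumptions: z-score scaling with the population standard deviation, upper-inclusive age bins, and a base-10 logarithm (which matches SALARY_LOG up to rounding). The AGE_SCALING figures in the table will not be reproduced exactly by these four rows alone; presumably they were computed on a larger dataset.

```python
import math
from statistics import mean, pstdev

ages = [20, 30, 40, 50]
salaries = [15_000_000, 25_000_000, 15_000_000, 48_000_000]

AGE_BINS = [(17, 25), (25, 40), (40, 65)]  # groups shown in the table


def standardize(values):
    """Feature scaling: subtract the mean, divide by the population std deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def age_group(age):
    """Convert to group: upper bound of each bin is inclusive (assumption)."""
    if age < AGE_BINS[0][0]:
        return "other"
    for low, high in AGE_BINS:
        if age <= high:
            return f"{low} - {high}"
    return "other"


age_scaled = standardize(ages)
age_groups = [age_group(a) for a in ages]
# Transformation: base-10 log compresses the long right tail of salaries.
salary_log = [round(math.log10(s), 2) for s in salaries]
```

Scaling puts features on a comparable range, grouping trades precision for interpretability, and the log transform reduces the influence of very large values.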



Machine Learning Techniques

Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn.

Supervised Learning
- Techniques: Classification, Regression
- Labeled dataset: N input variables plus an output / target variable

Unsupervised Learning
- Techniques: Clustering, Association Rules
- Unlabeled dataset: N input variables only

Supervised Learning

Supervised learning is defined by its use of labeled datasets to train algorithms to make predictions.

| Classification | Regression
Output label | Categorical (discrete values) | Numerical (continuous values)
Example use case | Churn prediction; sentiment analysis (text mining, NLP) | Price elasticity of demand; sales forecasting
Algorithm | Logistic Regression, Decision Tree, Random Forest, Support Vector Machine, Naive Bayes, Nearest Neighbour, Neural Network | Linear Regression, Decision Tree, Random Forest, Support Vector Regressor, Neural Network

Unsupervised Learning

Unsupervised learning uses machine learning algorithms to analyze and cluster unlabeled datasets.

Clustering
- Interprets the input data and automatically discovers natural groupings in the data
- Example use case: customer segmentation
- Algorithm: K-Means, Hierarchical, DBSCAN, etc.

Association Rules
- Discovers interesting relations between variables in large databases
- Example use case: market basket analysis
- Algorithm: Apriori, FP-growth

Data Science Use Case

- Customer Segmentation
- Sentiment Analysis
- Churn Prediction
- Pricing
- Recommendation Engine
- A/B Testing
- Market Basket Analysis
- Promotion Effectiveness

Customer Segmentation

Business Questions: How to spend marketing costs effectively on the right customer segment?

Dataset: Customer ID, Transaction ID, Transaction Time, Sales

Methodology: RFM segmentation by building a rule-based or clustering model

Pricing

Business Questions: How to predict demand by adjusting price?

Dataset: Transaction ID, Transaction Time, Quantity, Price

Methodology: Price elasticity by building a log-log linear regression model

Example: “PROMO THIS MONTH !!!” The price is cut from 10,000 to 8,000. Period: 1 - 30 Sep 2021.

Price | Demand | Sales
10,000 | 10 | 100,000
8,000 | 15 | 120,000

Price Elasticity of Demand = (+50% change in demand) / (-20% change in price) = -2.5 (elastic, magnitude 2.5)
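
The arithmetic behind this figure can be checked with a short function. It uses the simple percentage-change definition with the original price and demand as the base; the function name is illustrative. Computed with signs, the result is -2.5, whose magnitude is the 2.5 quoted for this example.

```python
def price_elasticity(old_price, new_price, old_qty, new_qty):
    """Percentage change in demand divided by percentage change in price."""
    pct_qty = (new_qty - old_qty) / old_qty
    pct_price = (new_price - old_price) / old_price
    return pct_qty / pct_price


# Values from the promo example: price 10,000 -> 8,000, demand 10 -> 15.
ped = price_elasticity(10_000, 8_000, 10, 15)
is_elastic = abs(ped) > 1
```

Because demand rises by 50% for a 20% price cut, sales also rise (100,000 to 120,000), which is the practical meaning of an elastic product.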



Market Basket Analysis


➔ Sell related or complementary products to an existing customer.

Questions:
- What do customers buy?
- Which products are bought together?

Goals: To find associations and correlations between the different items that customers buy.

Example:
- Customer 1: Rice, chicken, cola
- Customer 2: Rice, chicken
- Customer 3: Rice, chicken, burger
- Customer N: Rice (offer: Chicken)

Promotion Effectiveness

➔ The process of tracking marketing channels that lead to conversions or sales.

Channels:
- Email
- Paid search
- Referral
- Social media

Methodology: First click, Last click, Linear approach, Markov model, etc.

Recommendation Engine

Business Questions: How to recommend personalized products for each customer?

Dataset: Customer ID, Product ID, Rating

Methodology: Build a recommendation engine by building a collaborative filtering model

Image source: https://medium.com/@humansforai

Sentiment Analysis

Business Questions: How to evaluate our product by analyzing the product’s reviews?

Dataset: Product ID, Review, Label

Methodology: Build sentiment analysis by creating a classification model from text data

Example:
- Positive: “I love the product!”
- Negative: “The product is hard to use!”

Churn Prediction

➔ A churn model estimates the likelihood that a customer will leave in the next period of time.

Input variables: service level, tenure, payment history, demographics, purchase behaviour, etc.

Output variable: Churn or not churn

Methodology: Classification

Action: Run an experiment by applying a marketing action to customers who are likely to churn

A/B Testing

➔ The process of comparing two different versions of a web page or email so as to determine which version generates more conversions.

Examples: A/B testing for promotion, A/B testing for product
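
One common way to decide whether one version really converts better is a two-proportion z-test. The slides do not name a specific test, so this choice is an assumption, and the conversion counts below are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both versions convert at the same rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: version A converts 200 of 5,000 visitors, version B 250 of 5,000.
z, p_value = two_proportion_z_test(200, 5_000, 250, 5_000)
b_wins = z > 0 and p_value < 0.05
```

A small p-value suggests the difference in conversion rate is unlikely to be due to chance alone; the 0.05 cut-off is a convention, not a requirement.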



Simulation Link

➔ Dataset: Dataset Link


➔ Exploratory Data Analysis: Script Link
➔ Cohort Customer Retention: Script Link
➔ Customer Segmentation: Script Link
➔ Market Basket Analysis: Script Link
➔ Price Elasticity: Script Link
Data Storytelling Theory


The Importance of Context

Exploratory Analysis
What you do to understand the data and figure out what might be noteworthy or interesting to highlight to others.

Explanatory Analysis
A specific thing you want to explain, a specific story you want to tell. Turn the data into information that can be consumed by an audience.

Don’t give 100 oysters to your audience; give them the 2 pearls.

How to do Explanatory Analysis?

Define your WHO, WHAT, and HOW

WHO: To whom are you communicating?
● Your Audience
Identify a specific audience, and create different communications for different audiences.
● You
Think about the relationship that you have with your audience and how you expect that they will perceive you.

WHAT: What do you want your audience to know or do?
● Action
Communicate what is relevant for your audience and form a clear understanding of what you want your audience to know or do.
● Mechanism
Level of detail and amount of control.
● Tone
The tone you want your communication to convey to your audience.

HOW: How can you use data to help make your point?
Data becomes supporting evidence of the story you will build and tell, including the way you visualize it.

Storytelling (1/2)

➔ Constructing the story

The Beginning
Introduce the plot, building the context for your audience. This section sets up the essential elements of the story.

The Middle
Throughout your communication, make the information specific and relevant to your audience. The story should ultimately be about your audience, not about you.

The End
End with a call to action. Make it totally clear to your audience what you want them to do with the new understanding or knowledge that you’ve imparted to them.

Storytelling (2/2)

➔ The narrative structure

Narrative has to be central to the communication. These are words written, spoken, or a combination
of the two that tell the story in an order that makes sense and convinces the audience why it’s important
or interesting.

- Narrative flow, the order of your story
- The spoken and written narrative
- The power of repetition

Create Presentation

1. Define questions/problem/context/background/objectives
2. Create hypothesis / list of analyses to answer the questions
3. Create story and outline presentation
4. Create visualization
5. Create executive summary
6. Create recommendation

Data Visualization
Presentation Example

Business Objectives

How to retain customers in order to gain more demand or sales and spend promotion costs effectively?

➔ Exploratory Data Analysis: analyze current sales performance
➔ Customer Retention: evaluate the current customer retention rate
➔ Customer Segmentation: focus promotions on more profitable customers
➔ Association Rules: promote products that are associated with the products a customer bought
➔ Price Elasticity: give discounts on elastic products in order to gain more demand

Sales Performance

The top 10% of products already contribute 62% of sales.

Key figures:
- Products: 4,147
- Total Sales: 10Mn
- Quantity Sold: 5.3Mn
- Avg. Sales / product: 2,423
- Avg. Trx / product: 127
- Avg. Qty / product: 1,281

Top Products

Product | Sales
DOTCOM POSTAGE | 206K
REGENCY CAKESTAND 3 TIER | 164K
PARTY BUNTING | 98K
WHITE HANGING HEART T-LIGHT HOLDER | 97K
JUMBO BAG RED RETROSPOT | 92K

Sales by Country
United Kingdom already contributes 85.6% of sales

Monthly Sales and Customer


Sales are trending upward along with the number of customers.

Customer Performance
One strategy to increase sales is to persuade customers to transact more and spend more money.

Key figures:
- Customers: 4,380
- Avg. Sales: 1,890
- Avg. Trx: 5.07
- Avg. Quantity: 1,117
- Avg. Sales/Trx: 315
- Avg. Product: 61

Top Customers

CustomerID | Sales
14646 | 279K
18102 | 256K
17450 | 187K
14911 | 132K
12415 | 123K

* 10%, 25%, 50%, 75%, 90% distribution

New Customer vs. Repeated Customer


While the number of repeat customers gradually increased, the number of new customers gradually decreased.

Customer Retention
The customer retention rate went up and down month by month, with an average retention rate of around 20-30%.

RFM Segmentation

RECENCY
The freshness of the customer’s visits. The more recent the visit, the more responsive the customer is to promotion.

FREQUENCY
The frequency of the customer’s visits. The more frequently the customer visits, the more engaged and satisfied they are.

MONETARY
How much money the customer spends. The more the customer spends, the more interested they are.

The combination of the scores for these metrics creates different groups, and the groups can then be clustered into several segments.
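
A minimal sketch of how the three metrics and their scores might be computed from a transaction list. The transactions, the snapshot date, and the rank-based scoring are all assumptions made for a tiny example; in practice quantile-based scores (for instance 1 to 5) are common.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (customer_id, transaction_date, sales)
transactions = [
    ("C1", date(2021, 12, 20), 500),
    ("C1", date(2021, 12, 28), 700),
    ("C2", date(2021, 6, 1), 100),
    ("C3", date(2021, 11, 15), 300),
    ("C3", date(2021, 12, 1), 200),
    ("C3", date(2021, 12, 30), 900),
]
snapshot = date(2021, 12, 31)  # assumed analysis date


def rfm_table(transactions, snapshot):
    """Per customer: (days since last purchase, number of purchases, total sales)."""
    last, freq, money = {}, defaultdict(int), defaultdict(float)
    for cust, day, sales in transactions:
        last[cust] = max(last.get(cust, day), day)
        freq[cust] += 1
        money[cust] += sales
    return {c: ((snapshot - last[c]).days, freq[c], money[c]) for c in last}


def rank_scores(values, lower_is_better=False):
    """Score = 1 + number of customers who are strictly worse; higher is better."""
    return {
        c: 1 + sum((o > v) if lower_is_better else (o < v) for o in values.values())
        for c, v in values.items()
    }


rfm = rfm_table(transactions, snapshot)
r = rank_scores({c: v[0] for c, v in rfm.items()}, lower_is_better=True)
f = rank_scores({c: v[1] for c, v in rfm.items()})
m = rank_scores({c: v[2] for c, v in rfm.items()})
scores = {c: (r[c], f[c], m[c]) for c in rfm}
```

Each customer ends up with an (R, F, M) score triple; segments are then defined by rules over those triples or by clustering them.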

RFM Segmentation Rule



Customer Segmentation
➔ Champion segment already contributes more than 60% of sales
➔ Focusing your efforts on critical segments of customers is likely to give you much higher return on investment.

Customer Segmentation - Recommendation

No. | Segment | Description | Recommendation | # CustomerID
1 | Champion | Transacted recently, buy often, and spend the most | Reward them; they can be early adopters for new products and will promote your brand. | 779 (18%)
2 | Loyal | Spend good money often | Upsell higher value products. Ask for reviews. Engage them. | 522 (12%)
3 | Potential Loyalist | Recent customers, but spent a good amount and bought more than once | Offer membership/loyalty programs and recommend other products. | 461 (11%)
4 | Recent Customer | Bought most recently, but not often | Provide onboarding support, start building a relationship. | 358 (8%)
5 | Cannot Lose Them | Made the biggest purchases, and often, but haven't returned for a long time | Win them back with aggressive promos; don't lose them to a competitor. | 199 (5%)
6 | Average | Average recency, frequency, and monetary values | Build a good relationship with them. | 260 (6%)
7 | About to Sleep | Below average recency, frequency, and monetary values | Recommend popular products/renewals at a discount, reconnect with them. | 1067 (25%)
8 | Lost Customer | Lowest recency, frequency, and monetary values | Run a reach-out campaign, but don't put extra effort into retaining them. | 651 (15%)

Product Bundling Recommendations Based on Association Rules

Product Bundling 1 (Support 5%, Confidence 96%)
- PINK REGENCY TEACUP AND SAUCER
- ROSES REGENCY TEACUP AND SAUCER
- REGENCY CAKESTAND 3 TIER
- GREEN REGENCY TEACUP AND SAUCER

Product Bundling 2 (Support 6%, Confidence 82%)
- ALARM CLOCK BAKELIKE GREEN
- ALARM CLOCK BAKELIKE RED

Product Bundling 3 (Support 5.5%, Confidence 81%)
- BAKING SET SPACEBOY DESIGN
- BAKING SET 9 PIECE RETROSPOT

Product Bundling 4 (Support 7%, Confidence 81%)
- RED HANGING HEART T-LIGHT HOLDER
- WHITE HANGING HEART T-LIGHT HOLDER

* Support: how popular an itemset is, as measured by the proportion of transactions in which an itemset appears.
* Confidence: how likely item Y is purchased when item X is purchased
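
Both measures follow directly from their definitions. The sketch below uses a handful of hypothetical baskets (the first three echo the rice, chicken, cola example from the Market Basket Analysis slide); the function names are illustrative.

```python
# Hypothetical transactions; each basket is the set of items in one purchase.
baskets = [
    {"rice", "chicken", "cola"},
    {"rice", "chicken"},
    {"rice", "chicken", "burger"},
    {"cola", "burger"},
]


def support(itemset, baskets):
    """Proportion of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(x, y, baskets):
    """How often Y is bought when X is bought: support(X and Y) / support(X)."""
    return support(set(x) | set(y), baskets) / support(x, baskets)


rice_chicken_support = support({"rice", "chicken"}, baskets)
rice_to_chicken = confidence({"rice"}, {"chicken"}, baskets)
cola_to_rice = confidence({"cola"}, {"rice"}, baskets)
```

Algorithms such as Apriori avoid evaluating every possible itemset by pruning those whose support falls below a chosen minimum.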

Price Elasticity of Demand (PED) Type

PED > -1 (between -1 and 0): “Inelastic”
The change in demand is only weakly affected by the change in price.

PED = 0: “Perfectly Inelastic”
The change in demand is not affected at all by the change in price.

PED < -1: “Elastic”
The change in demand is substantially affected by the change in price (demand is sensitive to price).

Price Elasticity
Demand varies with price, so we can find the optimum price and sales for each product. Decreasing the price of an elastic product gives us higher sales.

Price elasticity: -3.821 (elastic)

Note: Data filtered to country = United Kingdom
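
In a log-log linear regression, the slope of ln(quantity) on ln(price) is the elasticity estimate. The sketch below fits that slope by ordinary least squares on synthetic data generated with a known elasticity of -2; it does not reproduce the -3.821 figure, which comes from the deck's own dataset.

```python
import math


def log_log_elasticity(prices, quantities):
    """Least-squares slope of ln(quantity) regressed on ln(price)."""
    xs = [math.log(p) for p in prices]
    ys = [math.log(q) for q in quantities]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Synthetic data: quantity = 1000 * price ** -2, so the true elasticity is -2.
prices = [1.0, 2.0, 4.0, 8.0]
quantities = [1000 * p ** -2 for p in prices]
elasticity = log_log_elasticity(prices, quantities)
```

The log-log form is convenient because the coefficient reads directly as "a 1% price change goes with an `elasticity`% change in demand".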



Price Optimization Simulation


The optimum daily projected sales (50.64) were achieved when decreasing the price by 35%.

Note: Data filtered to country = United Kingdom

Price Optimization - Top 10 Products

StockCode | Elasticity | Optimum Price | Increase Price | Projected Sales | Total Sales (historical)
22423 | -3.82 | 16.224 | -0.35 | 50.56 | 107K
85123A | -4.85 | 3.474 | -0.40 | 16.6 | 93K
23166 | -31.44 | 0.625 | -0.50 | 151.74 | 80K
85099B | -8.23 | 1.144 | -0.45 | 258.0 | 75K
47566 | -6.05 | 2.970 | -0.40 | 265.5 | 62K
84879 | -9.81 | 0.929 | -0.45 | 347.8 | 50K
22502 | -7.74 | 3.273 | -0.45 | 96.89 | 46K
79321 | -3.09 | 8.099 | -0.35 | 31.32 | 43K
22086 | -2.65 | 4.053 | -0.30 | 77.81 | 36K
23284 | -2.62 | 11.641 | -0.30 | 51.99 | 33K

Notes:
- "Increase Price" is the fractional price change (e.g. -0.35 means decreasing the price by 35%).
- Data filtered to country = United Kingdom.

Executive Summary

➔ The 4,147 products and 4,380 customers contribute 10Mn in sales. The top 10% of products already contribute 62% of sales, and the United Kingdom contributes the largest share of sales (85.6%).

➔ The average monthly customer retention rate is around 20-30%. We need initiatives that raise the retention rate, and with it overall sales, such as personalized promotion, product bundling recommendations, and pricing based on price elasticity.

➔ Give personalized promotions based on customer segment. Focusing promotion on the critical customer segments is likely to give a much higher return on investment.

➔ Adjust prices toward the optimum price for elastic products in order to increase demand.

Recommendation

➔ Run personalized promotions based on customer segment.
➔ Prioritize promotion spending on the most profitable customer segments (Champion and Loyal).
➔ Try building product bundles for top products, and prioritize groups that have higher confidence.
➔ Adjust prices toward the optimum price for elastic products, focusing on top products.

Book Recommendation

Appendix

Linear Regression
Linear regression is a linear model, i.e. a model that assumes a linear relationship between the input variables (x) and the single output variable (y).

More specifically, y can be calculated from a linear combination of the input variables (x).

For a single input variable:

ŷ = w0 + w1 * x

where:
ŷ = the predicted value of y (the dependent or output variable) in the regression equation
x = the independent or input variable
w0 = intercept
w1 = slope
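
A minimal sketch of fitting w0 and w1 for a single input variable with ordinary least squares. The closed-form estimator and the toy data (generated from y = 1 + 2x) are assumptions for illustration; the slides do not specify how the coefficients are estimated.

```python
def fit_line(xs, ys):
    """Least-squares estimates (w0, w1) for the model y = w0 + w1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    w0 = y_bar - w1 * x_bar
    return w0, w1


def predict(x, w0, w1):
    """Predicted value of y for input x."""
    return w0 + w1 * x


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # toy data generated from y = 1 + 2x
w0, w1 = fit_line(xs, ys)
y_hat = predict(5, w0, w1)
```

With noise-free data the fit recovers the generating intercept and slope; with real data the same formulas give the line that minimizes the squared prediction error.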
K-Means Clustering

Algorithm:

1. Choose the number of clusters k
2. Select k random points from the data as centroids
3. Assign all the points to the closest cluster centroid
4. Recompute the centroids of the newly formed clusters
5. Repeat steps 3 and 4 until one of the stopping criteria is met

Stopping criteria:

1. Centroids of newly formed clusters do not change
2. Points remain in the same cluster
3. The maximum number of iterations is reached
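
The algorithm above can be sketched in plain Python. The points, the seed, and the choice to keep an old centroid when a cluster becomes empty are assumptions for the example. The loop stops when assignments no longer change (which also means the centroids no longer change) or when `max_iter` is reached.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, assignment), where assignment[i] is the cluster of points[i]."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 2: k random points as centroids
    assignment = None
    for _ in range(max_iter):  # stopping criterion 3
        # Step 3: assign each point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # stopping criteria 1 and 2
            break
        assignment = new_assignment
        # Step 4: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


# Hypothetical 2-D points forming two well-separated groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, assignment = k_means(points, k=2)
```

Results depend on the random initial centroids, so real analyses usually run the algorithm several times and keep the best clustering.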
