Data Analytics Using R (DA-R)

This document provides an overview of data analytics using R (DA-R). It begins with an introduction and outline of topics to be covered, including different types of data analytics, why R is used, and types of problems in data analytics. The document then presents four scenarios to demonstrate applications of data analytics. The first scenario describes staff scheduling challenges at a hospital and how data analytics could help optimize scheduling. The second scenario discusses high default rates at an auto loan company and how analytics could reduce losses. The third scenario involves reducing costs from candidates reneging job offers. The fourth scenario is about identifying high-value customers for a retail company. In each scenario, the document provides details about the problem, objectives, and available data.

Uploaded by

RiturajPaul

Data Analytics using R (DA-R)

INTRODUCTION
Outline
• Introduction
• Different Types of Data Analytics
• Prediction Effect
• Types of Problems in Data Analytics
• Why R?
• Course Plan
Scenario 1
STAFF SCHEDULING AT VUMC
Staff Scheduling at VUMC
• Vanderbilt University Medical Center (VUMC) is one
of the leading hospitals.
• VUMC maintains 55 operating rooms across
different sites.
• VUMC schedules elective (non-emergency)
surgeries primarily on weekdays.
VUMC Operations
• The charge nurse reports the schedule for
the next day to the admin director.
• If the number of cases booked is low, the
admin director decides to close some
operating rooms.
• The charge nurse also asks some
operating room nurses to take a paid
holiday.
• If the number of booked cases is high,
the admin director asks the charge nurse
to call in extra operating room nurses.
Challenges at VUMC
• VUMC assumes that surgeries
occur equally across all weekdays in a
month.
• Recently, VUMC has observed a large
variation in daily surgical case volume
(number of surgeries) to be performed.
• This is creating a major problem for
surgical staff scheduling.
• Case mix: elective surgeries 94%, add-on cases 6%.
Potential Causes of Variation
• Surgeries are generally scheduled
earlier in the week and earlier in the
day.
• Sometimes no surgeries are scheduled
in a week for various reasons.
• 6% of cases are add-on cases.
Why is Staff Scheduling so Important?
Overstaffing
• It may not be possible to cancel staff at short notice (labour relations). Even if it is possible, last-minute
changes may hurt employee satisfaction, as most employees want predictable
schedules.

Understaffing
• May not be able to find someone available to work on short notice. Understaffing
of nurses may delay the surgeries.
Objective
• To resolve issues related to staff scheduling.
Data
• Actual Number of Surgeries
• Number of Surgeries booked in advance
• Day of the Week
Scenario 2
FINDING RIGHT CUSTOMERS AT AUTO FINANCE LTD.
Finding Right Customers
at Auto Finance Ltd.
• Auto Finance Ltd. is a major player in the two-wheeler
business in India.
• Many of the people buying two-wheelers belong to the
lower-middle class of India and do not have access to enough
capital.
• Auto Finance Ltd. provides loans, typically at a fixed interest
rate for 3-5 years, to enable cash-strapped customers to buy
the vehicle.
• The loan facility has enabled Auto Finance Ltd. to attract a
new customer segment.
Challenges at Auto Finance Ltd.
• Recently, Auto Finance Ltd. has faced
a major issue.
• Around 70% of the customers have
delayed their repayments.
• In order to decide whether to grant
credit, the credit provider considers
the trade-off between the interest
income and the possibility of the borrower
defaulting.
• Payment status: timely payment 30%, delayed payment 70%.
Objective
• To reduce the loss due to high default rate.
Data
• Auto Finance Ltd. records the default status for
each customer.
• It also maintains a large database with
customer-specific information such as age, gender,
income, employment details, etc.
Scenario 3
TALENT ACQUISITION BY SCALENEWORKS
Talent Acquisition by
Scaleneworks
• Scaleneworks, a Bangalore-based start-up company,
supports many IT companies in India with talent
acquisition.
• Advises its customers on the status of modern talent
acquisition practices.
• Recommends and implements individually tailored,
viable solutions.
Business Problem
• The top management has observed
that several persons have not joined
the organization even after accepting
the offer.
• Owing to this, the cost of hiring increased
by between 10% and 15%.
Example
• In an IT firm, suppose 12,000 offers
are rolled out every year.
• At a 30% renege rate, approximately
3,600 candidates accept the offer and
then do not join the company.
• The company would have spent 15 man-hours
per candidate in the recruitment
lifecycle.
• 54,000 man-hours are wasted by one
client alone.
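The man-hours figure above follows directly from the three numbers on the slide. A minimal R sketch of the arithmetic is given below; the variable names are illustrative and not taken from the original material.

```r
# Figures taken from the slide; variable names are illustrative.
offers_per_year     <- 12000  # offers rolled out per year
renege_rate         <- 0.30   # share of candidates who accept but do not join
hours_per_candidate <- 15     # man-hours spent per candidate

reneged_candidates <- offers_per_year * renege_rate
wasted_hours       <- reneged_candidates * hours_per_candidate

cat("Candidates who renege:", reneged_candidates, "\n")
cat("Man-hours wasted:", wasted_hours, "\n")
```

Changing `renege_rate` shows how many man-hours would be saved if the renege rate could be reduced.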
Objective
• To reduce the cost associated with the candidates not joining the company even
after accepting the offer.
Data
• Scaleneworks has access to the data on joining
status for several candidates.
• Scaleneworks also has important candidate-specific
information such as age, gender, whether a joining
bonus was offered, employment location, pay
band, etc.
Scenario 4
MARKET SEGMENTATION AT EASY SHOPPING
Background Details
• Easy Shopping is a registered non-store online retail
company.
• The company mainly sells unique all-occasion gifts.
Business Problem
• Easy Shopping wants to run a targeted marketing campaign that will resonate
with high-value customers, but not with others.
• This targeted group will receive messages tailored to their needs and interests.
• However, Easy Shopping does not know how to divide its customers into
groups.
Objective
• To identify high-value customers.
Data
• Easy Shopping maintains the transaction details for
each customer.
What is Analytics?

Data → Analytics → Decision

Analytics is the discovery, interpretation, and communication of meaningful patterns in data, and the
process of applying those patterns towards effective decision making.
Objective
• By using analytics, one can
✓Find Patterns
✓Understand the meaning of underlying patterns
✓Make Predictions
✓Recommend Decisions
Some Recent News
• Bernie Sanders, and How Indian Food Can Predict Vote Choice! (Check:
https://www.nytimes.com/2020/01/30/upshot/bernie-sanders-indian-food.html)
• Can an Algorithm Predict the Pandemic’s Next Moves? (Check:
https://www.nytimes.com/2020/07/02/health/santillana-coronavirus-model-forecast.html)
• Are Penalty Kicks Easier Without Fans? Maybe Not. (Check:
https://fivethirtyeight.com/features/are-penalty-kicks-easier-without-fans-maybe-not/)
• Leaving Airplane Middle Seats Empty Could Cut Coronavirus Risk Almost In Half, A Study Says. (Check:
https://www.forbes.com/sites/carlieporterfield/2020/07/11/leaving-airplane-middle-seats-empty-could-cut-coronavirus-risk-almost-in-half-a-study-says/#3d3cedc61a0c)
• Using Patent Analytics To See Why Amazon Bought Zoox (Check:
https://www.forbes.com/sites/louiscolumbus/2020/07/12/using-patent-analytics-to-see-why-amazon-bought-zoox/#72e6d7745ab6)
Different Types of Data Analytics
[Figure: the different types of data analytics]
Source: https://www.sciencedirect.com/science/article/pii/S0148296318302480
Descriptive Analytics
• Descriptive analytics consists of a set of techniques
that describe what has happened in the past.
• Examples: Data Queries, Reports, Descriptive
Statistics, Data Visualization, etc.
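As an illustration of descriptive statistics and a simple report, the sketch below summarizes daily surgery counts by weekday in base R. The data are synthetic and the column names (`weekday`, `surgeries`) and the weekday averages are assumptions made for the example, not figures from VUMC.

```r
set.seed(3)  # synthetic daily case counts; not real hospital data

days <- c("Mon", "Tue", "Wed", "Thu", "Fri")
cases <- data.frame(
  weekday   = factor(rep(days, times = 8), levels = days),
  # Assumed pattern: more surgeries early in the week than late in the week
  surgeries = rpois(40, lambda = rep(c(120, 115, 105, 95, 85), times = 8))
)

# Descriptive statistics for the whole sample
print(summary(cases$surgeries))

# A simple report: mean and standard deviation of daily volume per weekday
by_day <- aggregate(surgeries ~ weekday, data = cases,
                    FUN = function(x) c(mean = mean(x), sd = sd(x)))
print(by_day)
```

A report like this only describes what has already happened; it does not explain the variation or predict the next day's volume.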
Diagnostic Analytics
• Diagnostic analytics (as a natural extension of
descriptive analytics) examines data or content to
answer the question “why did it happen?”
• It requires exploratory data analysis of the existing
data, or sometimes additional data, using tools and
techniques such as visualization, data discovery, and data
mining in order to discover the root causes of a
problem.
Predictive Analytics
• Predictive analytics comprises the set of
techniques that use models constructed from
past data to predict the future or to study the impact
of one variable on another.
• Examples: Linear Regression, Logistic Regression,
etc.
Prescriptive Analytics
• Prescriptive analytics provides the best course of
action to take, i.e., the output from a prescriptive
analytics model is the best solution.
• A common example is portfolio models in finance,
which determine the mix of investments that yield
the highest expected return while limiting the
exposure to risk.
Predictive Analytics
WHY IS PREDICTION SO IMPORTANT?

Source: Siegel, E. (2016). Predictive Analytics. Wiley.


Application: Direct Marketing
Imagine you have a company with a mailing list of a million customers.

The cost of mailing each customer is $2.

Suppose you have observed that one out of 100 of them will buy your product (i.e.,
10,000 responses).

Also, your profit is $220 for each positive response.


Profit Calculation
• Overall Profit
= Revenue − Cost
= ($220 × 10,000 responses) − ($2 × 1 million)
= $200,000.
Prediction Effect
Now suppose you use Predictive Analytics (PA) in the same context.

Suppose PA earmarks a quarter of the entire list and says: “These folks are
three times more likely to respond than average!”

So you now have a short list of 250,000 customers.

At a 3% response rate, this gives 7,500 responses.


Prediction Effect
• Overall Profit
= Revenue − Cost
= ($220 × 7,500 responses) − ($2 × 250,000)
= $1,150,000.
• You just improved your profit 5.75 times by
mailing to fewer people.
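The two profit calculations can be reproduced with a small R function. This is a minimal sketch using the figures from the slides; the function name `campaign_profit` and its argument names are illustrative.

```r
# Profit from a mailing campaign: profit from responders minus mailing cost.
campaign_profit <- function(n_mailed, response_rate,
                            profit_per_response = 220, cost_per_mail = 2) {
  responses <- n_mailed * response_rate
  profit_per_response * responses - cost_per_mail * n_mailed
}

# Mass mailing: the full list of 1 million customers at a 1% response rate
mass_mailing <- campaign_profit(n_mailed = 1e6, response_rate = 0.01)

# Targeted mailing: a quarter of the list at a 3% response rate
targeted_mailing <- campaign_profit(n_mailed = 250000, response_rate = 0.03)

cat("Mass mailing profit:",
    format(mass_mailing, big.mark = ",", scientific = FALSE), "\n")
cat("Targeted mailing profit:",
    format(targeted_mailing, big.mark = ",", scientific = FALSE), "\n")
cat("Improvement factor:", targeted_mailing / mass_mailing, "\n")
```

Varying `response_rate` for the targeted list shows how sensitive the improvement factor is to the quality of the prediction.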
Predictive Analytics
What’s predicted?
• Which customers will respond to marketing contact?

What’s done about it?


• Contact customers more likely to respond.
Types of Problems
SUPERVISED LEARNING VS UNSUPERVISED LEARNING
Supervised Learning
• Both the set of independent variables (X) and the
dependent variable (Y) are observed.
Unsupervised Learning
• Only a set of variables (X) is observed; there is no dependent variable (Y).
Supervised Learning
Objective in Supervised Learning
Inference: To understand the relationship
between 𝑌 and 𝑋

Prediction: To predict 𝑌 based on 𝑋


Supervised Learning Problems
• Regression Problem
• Classification Problem
Types of Supervised Learning
Regression Problem
• Response 𝑌: quantitative
• Predictors: 𝑋1, 𝑋2, 𝑋3
• Regression Model → predicted 𝑌̂: quantitative
Regression Problem: Examples
• Staff Scheduling at VUMC
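A regression model for the staffing scenario could use the two predictors listed in the scenario data (surgeries booked in advance and day of the week) to predict the actual number of surgeries. The sketch below uses synthetic data, since the real VUMC data are not part of these slides; the column names and the data-generating process are assumptions made only for illustration.

```r
set.seed(1)  # reproducible synthetic data; not real VUMC figures

n <- 200
days <- c("Mon", "Tue", "Wed", "Thu", "Fri")
weekday <- factor(sample(days, n, replace = TRUE), levels = days)
booked  <- rpois(n, lambda = 100)  # surgeries booked in advance

# Assumed data-generating process: booked cases plus add-ons plus noise
actual <- round(1.05 * booked + 3 + rnorm(n, sd = 4))

surgeries <- data.frame(actual, booked, weekday)

# Linear regression of the actual case volume on the two predictors
fit <- lm(actual ~ booked + weekday, data = surgeries)
print(coef(fit))

# Predict the volume for a Tuesday with 110 booked cases
new_day <- data.frame(booked = 110,
                      weekday = factor("Tue", levels = levels(surgeries$weekday)))
predicted_volume <- predict(fit, newdata = new_day)
print(predicted_volume)
```

The response is a count of surgeries, so it is quantitative, which is what makes this a regression problem rather than a classification problem.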
Classification Problem

• Response 𝑌: qualitative
• Predictors: 𝑋1, 𝑋2, 𝑋3
• Classification Model → predicted class labels: qualitative
Classification Problem: Examples
• Finding Right Customers at Auto Finance Ltd.
• Talent Acquisition by Scaleneworks
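For the loan-default scenario, one possible classification model is logistic regression, which appears in the list of predictive techniques above. The sketch below fits it to synthetic customers; the variables `age` and `income`, their distributions, and the assumed relationship with default are hypothetical and are not taken from Auto Finance Ltd.

```r
set.seed(42)  # synthetic customers; not Auto Finance Ltd. data

n <- 500
age    <- round(runif(n, min = 21, max = 60))
income <- round(rlnorm(n, meanlog = log(20000), sdlog = 0.4))  # hypothetical income

# Assumed relationship: lower income and younger age raise the default probability
log_odds  <- 3 - 0.00012 * income - 0.02 * age
defaulted <- rbinom(n, size = 1, prob = plogis(log_odds))

customers <- data.frame(defaulted, age, income)

# Logistic regression: models P(default) as a function of age and income
fit <- glm(defaulted ~ age + income, data = customers, family = binomial)

# Predicted default probabilities and class labels at a 0.5 cut-off
prob_default    <- predict(fit, type = "response")
predicted_class <- ifelse(prob_default > 0.5, "Default", "No default")

# Confusion matrix of actual versus predicted labels
confusion <- table(Actual = customers$defaulted, Predicted = predicted_class)
print(confusion)
```

The 0.5 cut-off is only a default choice; a lender weighing interest income against default losses might prefer a different threshold.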
Supervised Learning: Techniques
• Linear Regression
• Logistic Regression
• Decision Trees
• Bagging
• Random Forest
• Boosting
• Support Vector Machines

Unsupervised Learning
Unsupervised Learning
• A set of statistical tools intended for the setting in which we have only a set of
features 𝑋1, 𝑋2, …, 𝑋𝑝 measured on 𝑛 observations.
• We are not interested in prediction, because we do not have an associated
response variable 𝑌.
• The goal is to discover interesting things about the measurements on
𝑋1 , 𝑋2 , … , 𝑋𝑝 .
Objectives in Unsupervised Learning
• Is there any informative way to
visualize the data?
• Can we discover subgroups
among the variables or among
the observations?
Unsupervised Learning: Examples
• Customer segmentation is the process of dividing customers into groups based
on common characteristics.
• The most common characteristics are demographics (e.g., age, gender, marital
status, income), psychographics (e.g., interests, lifestyle, group affiliations),
geographical region, and purchase behaviour (e.g., previously purchased items,
shipping preferences, page views on your website, etc.).
Customer Segmentation at Easy
Shopping
• Easy Shopping decides to work with metrics such as each customer’s recency of
last purchase, frequency of purchase, and monetary value.
• These three variables, collectively known as RFM, are often used in customer
segmentation for marketing purposes.
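The RFM variables can be derived from a transaction table and then passed to a clustering method. The sketch below builds them with base R and groups customers with k-means; the transaction data are synthetic, the column names and the analysis date are assumptions, and k = 3 is an arbitrary illustrative choice rather than a recommendation.

```r
set.seed(7)  # synthetic transactions; not Easy Shopping data

n_tx <- 1000
transactions <- data.frame(
  customer_id = sample(1:100, n_tx, replace = TRUE),
  date        = as.Date("2020-01-01") + sample(0:364, n_tx, replace = TRUE),
  amount      = round(rexp(n_tx, rate = 1 / 50), 2)
)
reference_date <- as.Date("2021-01-01")  # assumed analysis date
transactions$days_ago <- as.numeric(reference_date - transactions$date)

# Recency: days since the last purchase; Frequency: number of purchases;
# Monetary: total amount spent
recency   <- aggregate(days_ago ~ customer_id, data = transactions, FUN = min)
frequency <- aggregate(amount ~ customer_id, data = transactions, FUN = length)
monetary  <- aggregate(amount ~ customer_id, data = transactions, FUN = sum)

rfm <- data.frame(customer_id = recency$customer_id,
                  recency     = recency$days_ago,
                  frequency   = frequency$amount,
                  monetary    = monetary$amount)

# k-means on standardized RFM values
clusters <- kmeans(scale(rfm[, c("recency", "frequency", "monetary")]),
                   centers = 3, nstart = 20)
rfm$segment <- clusters$cluster

# Average RFM profile of each segment
print(aggregate(cbind(recency, frequency, monetary) ~ segment,
                data = rfm, FUN = mean))
```

A segment with low recency, high frequency, and high monetary value would be a natural candidate for the high-value group.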
Unsupervised Learning:
Challenges
• The exercise tends to be more subjective, and there
is no simple goal for the analysis, such as prediction
of a response.
• Unsupervised learning is often performed as a part
of an exploratory data analysis.
• It can be very hard to assess the results obtained
from the unsupervised learning methods since
there is no universally accepted mechanism for
validating results on an independent data set.
Unsupervised Learning: Techniques
Principal Component Analysis
• A tool used for data visualization or data pre-
processing before supervised techniques are applied.
Clustering
• A broad class of methods for discovering unknown
subgroups in data.
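Both techniques are available in base R. The sketch below applies them to `USArrests`, a small dataset that ships with R and is used here only as a convenient stand-in; the choice of hierarchical clustering and of four groups is an assumption made for illustration.

```r
# USArrests ships with base R: 50 US states, 4 numeric variables
data(USArrests)

# Principal component analysis on standardized variables
pca <- prcomp(USArrests, scale. = TRUE)
variance_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(variance_explained, 3))

# Hierarchical clustering on the same standardized data, cut into 4 groups
# (the number of groups is an arbitrary illustrative choice)
hc     <- hclust(dist(scale(USArrests)), method = "complete")
groups <- cutree(hc, k = 4)
print(table(groups))
```

Plotting the first two columns of `pca$x`, coloured by `groups`, is one way to visualize the subgroups in two dimensions.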
Objective of This Course
• Introduces several supervised and
unsupervised learning
techniques.
• Implements all these techniques
in R.
Why R?
• Open Source Software, available on every major platform.
• Massive set of packages for visualization, statistical modelling,
machine learning, and importing and manipulating data.
• Readily available tools for data analysis.
• A great community. Easy to get help from experts.
• Readily available tools for communicating results.
• Can connect to high-performance programming languages like
C, C++, Fortran.
(Check Advanced R by Hadley Wickham for more details)
Course Plan
Books
1. Acharya, S. (2018). Data Analytics using R. McGraw Hill Education. [Ref 1]
2. James, G., Witten, D., Hastie, T. & Tibshirani, R. (2013). An Introduction to Statistical
Learning: with Applications in R. New York: Springer-Verlag. (web:
http://www-bcf.usc.edu/~gareth/ISL/) [Ref 2]
3. Hyndman, R. J. & Athanasopoulos, G. (2016). Forecasting: Principles and Practice.
OTexts. (web: https://www.otexts.org/fpp/) [Ref 3]
4. Lander, J. (2013). R for Everyone: Advanced Analytics and Graphics. New Jersey:
Addison-Wesley.
5. Siegel, E. (2016). Predictive Analytics. Wiley.
Internet Websites
https://fivethirtyeight.com/
http://analytics-magazine.org/
https://medium.com/
http://www.r-bloggers.com/
https://stat.ethz.ch/mailman/listinfo/r-help
http://stackoverflow.com/questions/tagged/r
http://blog.revolutionanalytics.com/r/
http://chance.amstat.org/
http://www.statslife.org.uk/significance
Journals
Management Science (web: https://pubsonline.informs.org/journal/mnsc)
Marketing Science (web: https://pubsonline.informs.org/journal/mksc)
INFORMS Journal on Applied Analytics (web:
https://www.informs.org/Publications/INFORMS-Journals/INFORMS-Journal-on-Applied-Analytics)
Computational Statistics and Data Analysis (web:
https://www.journals.elsevier.com/computational-statistics-and-data-analysis/)
Computational Statistics (web: https://link.springer.com/journal/180)
Interfaces (web: https://pubsonline.informs.org/journal/inte)
The R Journal (web: https://journal.r-project.org/)
Journal of Statistical Software (web: http://www.jstatsoft.org/index)
Reading Material
• [Ref 2] James, G., Witten, D., Hastie, T. & Tibshirani, R. (2013). An Introduction to
Statistical Learning: with Applications in R. New York: Springer-Verlag. (web:
http://www-bcf.usc.edu/~gareth/ISL/).
➢Chapter 2
✓Section 2.1
✓Sub-sections 2.1.4 and 2.1.5
✓Section 2.3
