
Chapter 2: Overview

Data Mining for Business Analytics
Concepts, Techniques and Applications with JMP Pro
Shmueli, Bruce, Stephens & Patel

© Galit Shmueli, Peter Bruce, and Mia Stephens 2016
Core Ideas in Data Mining

Data Mining

Supervised Learning (Predictive Analytics)
 Prediction
 Classification
 Time Series

Unsupervised Learning (Descriptive Analytics)
 Association Rules
 Data Exploration & Visualization
 Data Reduction & Dimension Reduction
Supervised Learning

Goal: Predict a single “target” or “outcome” variable
 Training data, where the target value is known
 Score data, where the target value is not known
Methods: Classification and Prediction
Predictive Modeling Example

Who will cheat on their taxes? Use what we know to predict what we don’t.
(Refund and Marital Status are categorical, Taxable Income is continuous, and Cheat on Taxes is the target class.)

Training data (target known):

Tid  Refund  Marital Status  Taxable Income  Cheat on Taxes
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Score data (target unknown):

Refund  Marital Status  Taxable Income  Cheat on Taxes
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Married         80K             ?
Supervised: Prediction

Goal: Predict a numerical target (outcome) variable
Examples: sales, revenue, performance
Each row is a case (customer, tax return, applicant)
Each column is a variable
Supervised Prediction Example

 X   Y
 0   1
 1   3
 2   5
 3   7
 4   9
 5  11
 6  13
 7  15
 8  17
 9  19
10  21

Known data for X equal to 0 through 10
Can we predict Y when X equals 11? Or when X equals 5.5?
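A minimal sketch of this idea: fit a straight line to the known (X, Y) pairs above with numpy, then score new values of X. The variable names are illustrative, not from the slides.

```python
import numpy as np

# Known data from the table above: Y = 2X + 1 for X = 0..10
x = np.arange(11)
y = 2 * x + 1

# Fit a straight line by least squares (degree-1 polynomial)
slope, intercept = np.polyfit(x, y, 1)

# Score new values of X
y_at_11 = slope * 11 + intercept    # extrapolation beyond the known data
y_at_5_5 = slope * 5.5 + intercept  # interpolation between known points
```

Since the data fall exactly on a line, the fitted model recovers Y = 2X + 1 and predicts 23 at X = 11 and 12 at X = 5.5.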
Supervised: Classification

Goal: Predict a categorical target (outcome) variable
Examples: purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy…
Each row is a case (customer, tax return, applicant)
Each column is a variable
Target variable is often binary (yes/no)
Unsupervised Learning

Goal: Segment data into meaningful segments; detect patterns
There is no target (outcome) variable to predict or classify
Methods: Association rules, data reduction & exploration, visualization
Unsupervised: Association Rules

Goal: Produce rules that define “what goes with what”
Example: “If X was purchased, Y was also purchased”
Rows are transactions
Used in recommender systems – “Our records show you bought X, you may also like Y”
Also called “affinity analysis”

Note: Association Rules are available in JMP Pro as of version 13.
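A rule “if X then Y” is usually scored by its support (how often X and Y appear together) and confidence (given X, how often Y also appears). A minimal sketch with made-up transaction data, not from the slides:

```python
# Illustrative transactions: each row is one basket of purchased items
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "eggs"},
    {"bread", "milk", "butter"},
]

def rule_stats(transactions, x, y):
    """Support and confidence of the rule 'if x was purchased, y was also purchased'."""
    n = len(transactions)
    both = sum(1 for t in transactions if x in t and y in t)
    with_x = sum(1 for t in transactions if x in t)
    support = both / n          # fraction of all transactions containing x and y
    confidence = both / with_x  # of transactions with x, fraction also containing y
    return support, confidence

support, confidence = rule_stats(transactions, "bread", "milk")
# bread and milk appear together in 3 of 5 baskets; bread appears in 4
```

Here the rule “bread → milk” has support 0.6 and confidence 0.75.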
Unsupervised: Data Exploration

Data sets are typically large, complex & messy
Need to review the data to help refine the task
Use techniques of Reduction and Visualization
Unsupervised: Data Reduction

Distillation of complex/large data into simpler/smaller data
Reducing the number of variables/columns (e.g., principal components)
Reducing the number of records/rows (e.g., clustering)
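A numpy-only sketch of column reduction via principal components (computed here with the singular value decomposition; the data are random and purely illustrative):

```python
import numpy as np

# Illustrative data: 100 records, 5 variables
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))

# PCA works on centered data: subtract each column's mean
centered = data - data.mean(axis=0)

# Rows of vt are the principal directions, ordered by variance explained
_, _, vt = np.linalg.svd(centered, full_matrices=False)

# Project onto the first k components: 5 columns reduced to 2
k = 2
reduced = centered @ vt[:k].T   # shape (100, 2)
```

The 100 records are kept, but each is now described by 2 derived variables instead of 5.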
Unsupervised: Data Visualization

Graphs and plots of data
Histograms, boxplots, bar charts, scatterplots
Especially useful to examine relationships between pairs of variables
The Process of Data Mining
Steps in Data Mining

1. Define/understand the purpose of the project
2. Obtain data (may involve random sampling)
3. Explore, clean, and preprocess the data
4. Reduce the data dimensionality
5. Specify the data mining task (classification, clustering, ...)
6. Partition the data (if supervised)
7. Choose the techniques (regression, CART, neural networks, etc.)
8. Perform the task
9. Interpret the results and assess the model – compare models and select the best model
10. Deploy the best model
Data Mining Effort

[Bar chart: relative effort (0–100) across project phases – Business Objective Determination, Data Preparation, Data Modeling, Analysis of Results and Knowledge Assimilation]
SEMMA Approach to Data Mining

Sample
 You gather the data by creating one or more tables. The sample should be large enough to contain the significant information, but small enough to process.
Explore
 You search for anticipated relationships, unanticipated trends, and anomalies to gain understanding and ideas.
Modify
 You change the data by creating, selecting, and transforming the variables to focus the model selection process.
Model
 You model the data by using the analytical tools to search for a combination of the data that reliably predicts a desired outcome.
Assess
 You evaluate competing predictive models by building charts to evaluate the usefulness and reliability of the findings from the data mining process.
Not a one-time initiative
 Models need to be maintained over time
 Adapt to changes in the business environment
 Leverage new data as it becomes available
Obtaining Data: Sampling

Data mining typically deals with huge databases
Algorithms and models are often applied to a sample from a database
Once you develop and select a final model, you use it to “score” the observations in the larger database
Pre-processing Data
Types of Variables

Determine the types of pre-processing needed, and the algorithms used
Main distinction: Categorical vs. numeric
Modeling types are used in JMP to indicate the types of variables
Numeric
 Continuous or Integer – Continuous in JMP
Categorical
 Ordered (low, medium, high) – Ordinal in JMP
 Unordered (male, female) – Nominal in JMP
Types of Variables

 The icons next to the variable names (under Columns) indicate the modeling type
Variable handling

Numeric
 Most algorithms in JMP can handle numeric data
 May occasionally need to “bin” into categories

Categorical
 JMP algorithms can generally handle categorical data
 Some stat packages require conversion to binary dummy or indicator variables (number of dummies = number of categories – 1)
 JMP automatically codes categorical variables behind the scenes if needed
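A sketch of dummy coding in plain Python, illustrating why 3 categories need only 2 dummy columns (the dropped baseline level is implied when all dummies are 0; the data are illustrative):

```python
# Illustrative categorical variable
marital = ["Single", "Married", "Divorced", "Married", "Single"]

categories = sorted(set(marital))   # ['Divorced', 'Married', 'Single']
baseline = categories[0]            # dropped level, implied by all-zero dummies
dummy_levels = categories[1:]       # one dummy column per remaining category

# One row of 0/1 indicators per record: number of dummies = categories - 1
rows = [[1 if value == level else 0 for level in dummy_levels]
        for value in marital]
# 'Divorced' -> [0, 0], 'Married' -> [1, 0], 'Single' -> [0, 1]
```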
Detecting Outliers

An outlier is an observation that is “extreme”, being distant from the rest of the data (the definition of “distant” is deliberately vague)
Outliers can have a disproportionate influence on models (a problem if the outlier is spurious)
An important step in data pre-processing is detecting outliers
Once an outlier is detected, domain knowledge is required to determine whether it is an error or truly extreme
Detecting Outliers

In some contexts, finding outliers is the purpose of the DM exercise (airport security screening)
This is called “anomaly detection”

 The Explore Outliers utility provides procedures for detecting outliers
 Multivariate outliers can also be identified using numerical summaries and visualization tools, and the Outlier Analysis option in the Multivariate platform
Detecting Outliers – Quantile Range

[Figure: distribution on an axis from -400 to 500, with tail quantiles at 0.1 and 0.9 bounding the middle 80% of the data; outlier regions in both tails]

In this example:
Tail Quantile = 0.1 = 10%
Mid Interval = 1 – 2 × Tail Quantile = 0.8 = 80%
Inter-Quantile Range = 100
Q = number of intervals to accept before calling something an outlier = 3
Outliers are anything < -300 or > 400
Detecting Outliers

Tail Quantile = 10%
 Lower 10% Quantile = 860.65
 Upper 10% is the 90% Quantile = 9717.4
Inter-Quantile Range = 9717.4 – 860.65 = 8856.75
Q indicates how many multiples of the IQR you tolerate before calling something an outlier
Low Threshold = Low Quantile – (IQR × Q) = 860.65 – (8856.75 × 3) = -25710
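The quantile-range rule above can be sketched directly, using the example's quantiles (the symmetric high threshold is an assumption following the earlier slide's two-tailed picture):

```python
# Tail quantiles from the example above
low_q, high_q = 860.65, 9717.4   # 10% and 90% quantiles
Q = 3                            # multiples of the IQR to tolerate

iqr = high_q - low_q                  # inter-quantile range = 8856.75
low_threshold = low_q - Q * iqr       # 860.65 - 26570.25 ~ -25710
high_threshold = high_q + Q * iqr     # symmetric rule for the upper tail

def is_outlier(value):
    """Flag values lying more than Q inter-quantile ranges beyond the tail quantiles."""
    return value < low_threshold or value > high_threshold
```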

Handling Missing Data

Most algorithms will not process records with missing values. JMP will generally omit these records.
Solution 1: Omission
 If a small number of records have missing values, can omit them
 If many records are missing values on a small set of variables, can drop those variables (or use proxies)
 If many records have missing values, omission is not practical
Solution 2: Imputation
 Replace missing values with reasonable substitutes
 Lets you keep the record and use the rest of its values
Handling Missing Data

There are several features in JMP Pro for evaluating missing values or working with missing values
 Columns Viewer – reports the number of values missing for a variable
 Missing Data Pattern – finds patterns of missing data
 Explore Missing Values utility – provides methods for imputing missing data for continuous variables
 Multivariate platform – provides imputation for continuous variables
 Recode – can be used to recode missing values into a “missing” category
 Informative Missing – available in most platforms to handle missing values
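One simple form of imputation is replacing missing values with the mean of the observed values. A numpy sketch with illustrative data (this stands in for, and is much simpler than, the JMP Pro utilities listed above):

```python
import numpy as np

# Illustrative continuous variable with two missing values (NaN)
income = np.array([125.0, 100.0, np.nan, 120.0, 95.0, np.nan])

# Mean of the observed (non-missing) values
mean = np.nanmean(income)   # (125 + 100 + 120 + 95) / 4 = 110

# Replace each missing value with the mean, keeping the record
imputed = np.where(np.isnan(income), mean, income)
```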
Normalizing (Standardizing) Data

Used in some techniques when variables with the largest scales would dominate and skew results
Puts all variables on the same scale
Normalizing function: Subtract the mean and divide by the standard deviation
Generally applied behind the scenes in JMP if needed, or a built-in option is provided in model dialogs
Select a variable in the data table and select New Formula Column > Transform > Standardize
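The normalizing function above, sketched in numpy with illustrative values:

```python
import numpy as np

# Illustrative variable on its original scale
values = np.array([2.0, 4.0, 6.0, 8.0])

# Standardize: subtract the mean, divide by the (sample) standard deviation
standardized = (values - values.mean()) / values.std(ddof=1)
# The result has mean 0 and sample standard deviation 1
```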

The Problem of Overfitting

Statistical models can produce highly complex explanations of relationships between variables
The “fit” may be excellent
When used with new data, models of great complexity do not do so well
100% fit – not useful for new data

[Figure: scatterplot of Revenue (0–1600) against Expenditure (200–1000), with a model curve fit through every point]
Overfitting (cont.)

Causes:
 Too many predictors
 A model with too many parameters
 Trying many different models

Consequence: The deployed model will not work as well as expected with completely new data.
Partitioning the Data

Problem: How well will our model perform with new data?

Solution: Separate the data into two parts
 Training partition to develop the model
 Validation partition to implement the model and evaluate its performance on “new” data
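A minimal sketch of a random training/validation split (the 60/40 proportion and record count are illustrative assumptions, not from the slides):

```python
import numpy as np

# Illustrative dataset size and split proportion
rng = np.random.default_rng(42)
n_records = 100

# Shuffle record indices, then cut into two partitions
indices = rng.permutation(n_records)
n_train = int(0.6 * n_records)
train_idx = indices[:n_train]   # develop the model on these records
valid_idx = indices[n_train:]   # evaluate it on these "new" records
```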
Test Partition

 When a model is developed on training data, it can overfit the training data (hence the need to assess on validation)
 Assessing multiple models on the same validation data can overfit the validation data
 Some methods use the validation data to choose a model. This too can lead to overfitting the validation data
 Solution: the final selected model is applied to a test partition to give an unbiased estimate of its performance on new data
Example – Linear Regression
West Roxbury Housing Data

First 10 records
Example – Linear Regression
West Roxbury Housing Data
Partitioning the data
(Using the Make Validation Column Utility)
Fitting a Multiple Linear Regression Using Fit Model
Saving values to the data table

Click on the red triangle in the Fit Least Squares analysis window to select additional options:
 Save Prediction Formula
 Save Residuals
Predictions for First 10 Observations

Pred Formula TOTAL VALUE is the value predicted by our regression model
Residual TOTAL VALUE is the difference between the actual and predicted value
Highlighted rows are in the Validation Set
Summary of errors and cross validation statistics
RMSE (and RASE)

Error = actual - predicted
RMSE = Root-mean-squared error = square root of the average squared error
(RMSE is reported as RASE in Fit Model and some other JMP platforms)

In the previous example, the sizes of the training and validation sets differ, so only averaged measures such as RMSE and Average Error are comparable across partitions
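The definition above, sketched in numpy with illustrative actual and predicted values:

```python
import numpy as np

# Illustrative actual and predicted values for four records
actual = np.array([300.0, 420.0, 380.0, 500.0])
predicted = np.array([310.0, 400.0, 390.0, 480.0])

# Error = actual - predicted
errors = actual - predicted

# RMSE = square root of the average squared error
rmse = np.sqrt(np.mean(errors ** 2))

# Average (signed) error, for comparison
avg_error = errors.mean()
```

Here the squared errors average to 250, so RMSE = √250 ≈ 15.81, while the signed errors partly cancel, giving an average error of 5.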
Summary

Data Mining consists of supervised methods (Classification & Prediction) and unsupervised methods (Association Rules, Data Reduction, Data Exploration & Visualization)
Before algorithms can be applied, data must be characterized and pre-processed
To evaluate performance and to avoid overfitting, data partitioning is used
Data mining methods are usually applied to a sample from a large database, and then the best model is used to score the entire database