Chapter 02 Overview
Supervised Learning
Goal: Predict a single “target” or “outcome” variable.
Predictive Modeling Example
Who will cheat on their taxes? Use what we know to predict what we don’t.

Training data (Refund and Marital Status are categorical, Taxable Income is continuous, and Cheat on Taxes is the target class):

TID  Refund  Marital Status  Taxable Income  Cheat on Taxes
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

New records to predict:

Refund  Marital Status  Taxable Income  Cheat on Taxes
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Married         80K             ?
Supervised: Prediction
Goal: Predict a numerical target (outcome) variable.
Examples: sales, revenue, performance.
Each row is a case (customer, tax return, applicant).
Each column is a variable.
Supervised Prediction Example
X Y
0 1
1 3
2 5
3 7
4 9
5 11
6 13
7 15
8 17
9 19
10 21
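The table follows Y = 2X + 1 exactly, so a least-squares line fit recovers slope 2 and intercept 1. A quick check in Python (the slides use JMP; this is just an illustrative sketch):

```python
import numpy as np

# X/Y pairs from the example table above
x = np.arange(11)          # 0, 1, ..., 10
y = 2 * x + 1              # 1, 3, 5, ..., 21

# Least-squares line fit: returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)    # ~2.0, ~1.0
```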
Unsupervised Learning
Unsupervised: Association Rules

Goal: Produce rules that define “what goes with what.”
Example: “If X was purchased, Y was also purchased.”
Rows are transactions.
Used in recommender systems – “Our records show you bought X, you may also like Y.”
Also called “affinity analysis.”
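At its core, affinity analysis counts co-occurrences. A minimal sketch in Python, on a hypothetical market-basket dataset (invented for illustration, not from the slides), computes the support and confidence of a rule “if X was purchased, Y was also purchased”:

```python
# Hypothetical transactions: each row is one basket
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def rule_stats(antecedent, consequent, transactions):
    """Support and confidence for the rule: antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent in t and consequent in t)
    ante = sum(1 for t in transactions if antecedent in t)
    support = both / n                         # P(X and Y)
    confidence = both / ante if ante else 0.0  # P(Y | X)
    return support, confidence

print(rule_stats("bread", "milk", transactions))   # → (0.6, 0.75)
```

Real association-rule miners (e.g., Apriori) search all frequent item sets; this sketch only scores one candidate rule.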
Unsupervised: Data Reduction

Distillation of complex/large data into simpler/smaller data:
Reducing the number of variables/columns (e.g., principal components).
Reducing the number of records/rows (e.g., clustering).
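Column reduction via principal components can be sketched with nothing more than an SVD of the mean-centered data; the two-variable dataset below is synthetic, invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two strongly correlated variables -> most variance lies along one direction
x = rng.normal(size=100)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=100)])

# Principal components via SVD of the mean-centered data matrix
centered = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Fraction of total variance captured by the first component
explained = s[0] ** 2 / (s ** 2).sum()
print(explained)   # close to 1.0 for these data
```

Because the first component captures nearly all the variance here, the two columns could be replaced by a single score with little information loss.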
Unsupervised: Data Visualization

Graphs and plots of data: histograms, boxplots, bar charts, scatterplots.
Especially useful to examine relationships between pairs of variables.
The Process of Data Mining
Steps in Data Mining

1. Define/understand the purpose of the project
2. Obtain data (may involve random sampling)
3. Explore, clean, and preprocess the data
4. Reduce the data dimensionality
5. Specify the data mining task (classification, clustering, ...)
6. Partition the data (if supervised)
7. Choose the techniques (regression, CART, neural networks, etc.)
8. Perform the task
9. Interpret the results and assess the model – compare models and select the best model
10. Deploy the best model
Data Mining Effort

[Bar chart: percentage of effort (0–100) for each phase – Business Objective Determination, Data Preparation, Data Modeling, and Analysis of Results and Knowledge Assimilation]
SEMMA Approach to Data Mining

Sample
You gather the data by creating one or more tables. The sample should be large enough to contain the significant information, but small enough to process.

Explore
You search for anticipated relationships, unanticipated trends, and anomalies to gain understanding and ideas.

Modify
You change the data by creating, selecting, and transforming the variables to focus the model selection process.

Model
You model the data by using the analytical tools to search for a combination of the data that reliably predicts a desired outcome.

Assess
You evaluate competing predictive models by building charts to evaluate the usefulness and reliability of the findings from the data mining process.
Not a one-time initiative:
Models need to be maintained over time.
Adapt to changes in the business environment.
Leverage new data as it becomes available.
Obtaining Data: Sampling
Pre-processing Data
Types of Variables

Variable types determine the pre-processing needed and the algorithms that can be used.
Main distinction: categorical vs. numeric.
Modeling types are used in JMP to indicate the types of variables:
Numeric: continuous or integer – Continuous in JMP
Categorical, ordered (low, medium, high) – Ordinal in JMP
Categorical, unordered (male, female) – Nominal in JMP
Types of Variables
The icons next to the variable names (under Columns) indicate the modeling type.
Variable Handling

Numeric: Most algorithms in JMP can handle numeric data; it may occasionally need to be “binned” into categories.
Categorical: JMP algorithms can generally handle categorical data. Some stat packages require conversion to binary dummy or indicator variables (number of dummies = number of categories – 1). JMP automatically codes categorical variables behind the scenes if needed.
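The dummy-coding rule (k categories → k – 1 indicator columns) can be sketched by hand; the helper name and the choice of the first sorted category as the baseline are my own, not JMP's:

```python
def dummy_code(values, categories=None):
    """One indicator column per category except a baseline: k - 1 dummies."""
    if categories is None:
        categories = sorted(set(values))
    baseline, *rest = categories          # first category becomes the baseline
    return [[1 if v == c else 0 for c in rest] for v in values]

marital = ["Single", "Married", "Divorced", "Married"]
# Baseline = "Divorced"; columns = (Married, Single)
print(dummy_code(marital))   # → [[0, 1], [1, 0], [0, 0], [1, 0]]
```

The baseline category is encoded as all zeros, which is why only k – 1 columns are needed.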
Detecting Outliers

An outlier is an observation that is “extreme,” being distant from the rest of the data (the definition of “distant” is deliberately vague).
Outliers can have disproportionate influence on models (a problem if the outlier is spurious).
An important step in data pre-processing is detecting outliers.
Once detected, domain knowledge is required to determine whether it is an error or truly extreme.
Detecting Outliers

In some contexts, finding outliers is the purpose of the DM exercise (e.g., airport security screening). This is called “anomaly detection.”

In this example:
Tail Quantile = 0.1 = 10%
Mid interval = 1 – 2 × Tail Quantile = 0.8 = 80%
Interquantile Range = 100
Q = number of interquantile ranges to accept before calling something an outlier = 3
Outliers are anything < –300 or > 400.
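The quantile-range rule above is easy to reproduce outside JMP. A sketch (the function name and sample data are hypothetical; the bounds follow the slide's recipe of tail quantiles ± Q interquantile ranges):

```python
import numpy as np

def quantile_range_outliers(x, tail=0.10, q=3):
    """Flag values beyond q interquantile ranges outside the tail quantiles."""
    x = np.asarray(x, dtype=float)
    low, high = np.quantile(x, [tail, 1 - tail])
    iqr = high - low                       # interquantile range
    lower, upper = low - q * iqr, high + q * iqr
    return (x < lower) | (x > upper), (lower, upper)

# Mostly regular values, with two extreme points appended
x = list(range(0, 101, 10)) + [-500, 900]
flags, bounds = quantile_range_outliers(x)
print(flags.sum(), bounds)   # the two extreme points are flagged
```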
Detecting Outliers
[JMP screenshot: the column has 28 missing values]
Normalizing (Standardizing) Data

Used in some techniques when variables with the largest scales would dominate and skew results.
Puts all variables on the same scale.
Normalizing function: subtract the mean and divide by the standard deviation.
Generally applied behind the scenes in JMP if needed, or a built-in option is provided in model dialogs.
To standardize manually, select a variable in the data table and select New Formula Column > Transform > Standardize.
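The normalizing function itself is one line of arithmetic. A sketch, assuming the sample (n – 1) standard deviation:

```python
import numpy as np

def standardize(x):
    """z-score: subtract the mean, divide by the (sample) standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

z = standardize([50, 60, 70, 80, 90])
print(z.mean(), z.std(ddof=1))   # ~0.0, ~1.0
```

After standardizing, every variable has mean 0 and standard deviation 1, so no variable dominates purely because of its scale.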
The Problem of Overfitting

Statistical models can produce highly complex explanations of relationships between variables.
The “fit” to the sample may be excellent.
But when used with new data, models of great complexity do not perform so well.
100% Fit – Not Useful for New Data

[Scatterplot: Revenue (0–1600) vs. Expenditure (200–1000), illustrating a model that fits the sample data perfectly]
Overfitting (cont.)
Causes:
Too many predictors
A model with too many parameters
Trying many different models
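The "too many parameters" cause is easy to demonstrate: a polynomial with as many coefficients as data points fits the sample perfectly yet extrapolates poorly. A synthetic sketch (data invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(6, dtype=float)
y = x + rng.normal(scale=0.5, size=6)   # roughly linear data, plus noise

# A degree-5 polynomial through 6 points fits the sample perfectly (100% fit)
overfit = np.polyfit(x, y, deg=5)
train_err = np.abs(np.polyval(overfit, x) - y).max()   # ~0

# A simple line does not fit the sample exactly, but is far more stable
line = np.polyfit(x, y, deg=1)
print(train_err)
print(np.polyval(overfit, 8.0), np.polyval(line, 8.0))  # wild vs. sensible extrapolation
```

The degree-5 model has memorized the noise; its perfect in-sample fit says nothing about how it behaves on new data.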
Partitioning the Data

Problem: How well will our model perform with new data?
Solution: Partition the data. The model is fit to a training partition, then applied to a test partition to give an unbiased estimate of its performance on new data.
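A minimal sketch of random partitioning into training, validation, and test sets (the 60/20/20 fractions and function name are illustrative choices, not prescribed by the slides):

```python
import random

def partition(n_rows, train_frac=0.6, valid_frac=0.2, seed=42):
    """Randomly split row indices into training / validation / test sets."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)            # fixed seed -> reproducible split
    n_train = int(n_rows * train_frac)
    n_valid = int(n_rows * valid_frac)
    return (idx[:n_train],
            idx[n_train:n_train + n_valid],
            idx[n_train + n_valid:])

train_idx, valid_idx, test_idx = partition(100)
print(len(train_idx), len(valid_idx), len(test_idx))   # → 60 20 20
```

Every row lands in exactly one partition, so performance measured on the held-out rows is not inflated by the fitting process.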
Example – Linear Regression
West Roxbury Housing Data
First 10 records
Example – Linear Regression
West Roxbury Housing Data
Partitioning the Data
(Using the Make Validation Column Utility)
Fitting a Multiple Linear Regression
Using Fit Model
Saving Values to the Data Table

Click the red triangle in the Fit Least Squares analysis window to select additional options:
Save Prediction Formula
Save Residuals
Predictions for First 10 Observations

Pred Formula TOTAL VALUE is the value predicted by our regression model.
Residual TOTAL VALUE is the difference between the actual and predicted values.
Highlighted rows are in the Validation Set.
Summary of Errors and Cross-Validation Statistics
RMSE (and RASE)

Error = actual – predicted
RMSE = root-mean-squared error = square root of the average squared error
(RMSE is reported as RASE in Fit Model and some other JMP platforms.)
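As a quick check of the definition, a hand-rolled RMSE on made-up numbers:

```python
import math

def rmse(actual, predicted):
    """Root-mean-squared error: sqrt of the average squared error."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Errors are 0, 0, -2, so RMSE = sqrt(4/3)
print(rmse([1, 2, 3], [1, 2, 5]))   # ≈ 1.155
```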
Summary
Data mining consists of supervised methods (Classification & Prediction) and unsupervised methods (Association Rules, Data Reduction, Data Exploration & Visualization).
Before algorithms can be applied, data must be characterized and pre-processed.
To evaluate performance and to avoid overfitting, data partitioning is used.
Data mining methods are usually applied to a sample from a large database, and then the best model is used to score the entire database.