Chap2 Overview
Data Mining
Shmueli, Patel & Bruce
Prediction
Goal: Predict numerical target (outcome) variable
Examples: sales, revenue, performance
Taken together, classification and prediction
constitute predictive analytics
Unsupervised Learning
Categorical
In most other algorithms, must create binary
dummies (number of dummies = number of
categories – 1)
Example: work status – employed (yes/no),
unemployed (yes/no), retired (yes/no), student
(yes/no); with four categories, three of these
dummies suffice, since the fourth is implied
XLMiner can convert categorical into binary dummies
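A minimal sketch of dummy coding for the work-status example, written in plain Python rather than XLMiner. The function name make_dummies, the drop_first flag, and the sample values are assumptions made for illustration; the dropped (alphabetically first) category serves as the reference level.

```python
def make_dummies(values, drop_first=True):
    """Return (column_names, rows) of 0/1 dummies for a list of category labels."""
    categories = sorted(set(values))
    # Dropping one category leaves (number of categories - 1) dummies.
    kept = categories[1:] if drop_first else categories
    rows = [[1 if value == category else 0 for category in kept] for value in values]
    return kept, rows


# Hypothetical sample data: one work status per record.
work_status = ["employed", "retired", "student", "unemployed", "employed"]
columns, dummies = make_dummies(work_status)

for status, row in zip(work_status, dummies):
    print(status, dict(zip(columns, row)))
```

A record whose dummies are all 0 belongs to the dropped category (here, employed), which is why the fourth dummy carries no extra information.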
Creating dummy variables
For categorical variables, it is sometimes
necessary to create dummy variables.
Example:
Pre-processing steps (very subjective;
different people clean data differently)
Outliers
An outlier is an observation that is “extreme”,
being distant from the rest of the data
Outliers can have disproportionate influence
on models (a problem if the outlier is spurious)
An important step in data pre-processing is
detecting outliers
Once an outlier is detected, domain knowledge is required
to determine whether it is an error or a truly extreme value
Statistical definition of outliers: a value that is
> Q3 + 1.5*IQR or < Q1 – 1.5*IQR, where IQR = Q3 – Q1
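The 1.5*IQR rule can be sketched in a few lines of Python. This is only an illustration: the function name iqr_outliers and the sample values are assumptions, and quartiles can be computed by several conventions (the "inclusive" method of statistics.quantiles is used here), so borderline points may be flagged differently by other tools.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements with one suspicious entry.
sample = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
outliers = iqr_outliers(sample)
print(outliers)
```

Flagging a value this way only marks it for review; deciding whether it is an error still takes domain knowledge.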
Handling Missing Data
Most algorithms will not process records with
missing values. The default is to drop those records.
Solution 1: Omission
If a small number of records have missing values, can
omit them
If many records are missing values on a small set of
variables, can drop those variables
If many records have missing values, omission is not
practical
Solution 2: Imputation
Replace missing values with reasonable substitutes
Lets you keep the record and use the rest of its
(non-missing) information
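The two solutions can be compared on a small hypothetical table. In this sketch, the records, the column names, and the helper names omit_incomplete and impute_median are assumptions; the median is used as one example of a "reasonable substitute", not as the only choice.

```python
import statistics

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]


def omit_incomplete(rows):
    """Solution 1: keep only records with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_median(rows, column):
    """Solution 2: replace missing values in one column with that column's median."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = statistics.median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


complete = omit_incomplete(records)
imputed = impute_median(impute_median(records, "age"), "income")

print(len(complete), "records survive omission")
print(len(imputed), "records survive imputation")
```

Omission halves this small data set, while imputation keeps every record along with its non-missing values.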
Partitioning the Data
Problem: How well will our model
perform with new data?