
Overview: Chapter 2

Data Mining
Shmueli, Patel & Bruce

Presented by Yinfei Kong, Ph.D.

Associate Professor of ISDS


Core Ideas in Data Mining
- Classification
- Prediction
- Association Rules
- Data management and exploration

Two types of methods:
- Supervised learning
- Unsupervised learning
Supervised Learning
- Goal: Predict a single “target” or “outcome” variable
- Training data, from which the algorithm “learns”: the value of the outcome of interest is known
- Apply to test data, where the value is not known and will be predicted
- Methods: Classification and Prediction
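The train-then-apply workflow above can be sketched in a few lines. This toy example uses a 1-nearest-neighbor rule standing in for any classifier, with made-up spending data; it is an illustration, not the book's own method:

```python
def predict_1nn(train, x_new):
    # train: list of (feature, label) pairs whose outcome is known
    # return the label of the training point whose feature is closest to x_new
    nearest_x, nearest_label = min(train, key=lambda pair: abs(pair[0] - x_new))
    return nearest_label

# hypothetical training data: spending score -> purchase outcome
train = [(1.0, "no purchase"), (2.0, "no purchase"), (8.0, "purchase")]

# apply to a "test" observation whose outcome is unknown
predicted = predict_1nn(train, 7.5)
```

The outcome is known in `train` (supervised learning) and predicted for the new observation.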


Supervised: Classification
- Goal: Predict a categorical target (outcome) variable
- Examples: purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy
- The target variable is often binary (yes/no)

Supervised: Prediction
- Goal: Predict a numerical target (outcome) variable
- Examples: sales, revenue, performance

Taken together, classification and prediction constitute predictive analytics.
Unsupervised Learning
- Goal: Segment data into meaningful segments; detect patterns
- There is no target (outcome) variable to predict or classify, so there is no need to partition the data
- Methods: Association rules, data reduction & exploration, visualization, clustering
Unsupervised: Association Rules
- Goal: Produce rules that define “what goes with what”
- Example: “If X was purchased, Y was also purchased”
- Rows are transactions
- Used in recommender systems (Amazon.com, Netflix.com): “Our records show you bought X; you may also like Y”
- Also called affinity analysis or market basket analysis
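A minimal sketch of the idea behind association rules, assuming hypothetical basket data: counting how often pairs of items are purchased together is the raw material for “what goes with what” rules (real association-rule mining goes further, computing support, confidence, and lift):

```python
from collections import Counter
from itertools import combinations

def pair_counts(transactions):
    # count how often each unordered pair of items appears in the same basket
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# hypothetical transactions; each row is one market basket
baskets = [["milk", "bread"], ["milk", "bread", "eggs"], ["bread", "eggs"]]
together = pair_counts(baskets)
```

Here `("bread", "milk")` co-occurs in two of three baskets, suggesting a rule like “if bread was purchased, milk was also purchased.”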
Pre-processing Data

Types of Variables
- The variable types determine the pre-processing needed and the algorithms used
- Main distinction: categorical vs. numeric
- Numeric: continuous or integer
- Categorical (or nominal): ordered (low, medium, high) or unordered (male, female)
Variable handling
- Numeric
  - Most algorithms in XLMiner can handle numeric data
  - May occasionally need to “bin” numeric values into categories
- Categorical
  - Most other algorithms require binary dummies (number of dummies = number of categories - 1)
  - Example: work status: employed (yes/no), unemployed (yes/no), retired (yes/no), student (yes/no)
  - XLMiner can convert categorical variables into binary dummies
Creating dummy variables
- For categorical variables, it is sometimes necessary to create dummy variables.
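As a sketch of dummy coding, using the hypothetical work-status variable from above: the last category serves as the reference level (all zeros), which is why k categories yield k - 1 dummies:

```python
def dummy_code(value, categories):
    # k categories -> k-1 binary dummies; the last category is the
    # reference level (encoded as all zeros)
    return [1 if value == c else 0 for c in categories[:-1]]

# hypothetical work-status variable with four categories
statuses = ["employed", "unemployed", "retired", "student"]
```

For example, `dummy_code("employed", statuses)` gives `[1, 0, 0]`, while the reference category "student" is encoded as `[0, 0, 0]`.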
Pre-processing steps (very subjective; people clean data differently)

Outliers
- An outlier is an observation that is “extreme”, lying far from the rest of the data
- Outliers can have a disproportionate influence on models (a problem if the outlier is spurious)
- An important step in data pre-processing is detecting outliers
- Once detected, domain knowledge is required to determine whether an outlier is an error or truly extreme
- Statistical definition of an outlier: > Q3 + 1.5*IQR or < Q1 - 1.5*IQR
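The IQR rule above can be applied directly; a minimal sketch in Python follows (quartile conventions vary slightly between tools, so cutoffs may differ a little at the margins):

```python
import statistics

def iqr_outliers(values):
    # quartiles via the default "exclusive" method; other conventions
    # give slightly different Q1/Q3 and hence slightly different cutoffs
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# hypothetical data with one extreme value
flagged = iqr_outliers([10, 12, 11, 13, 12, 11, 10, 95])
```

Flagged values still need a domain-knowledge check: the 95 here could be a data-entry error or a genuine extreme.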
Handling Missing Data
- Most algorithms will not process records with missing values; the default is to drop those records
- Solution 1: Omission
  - If a small number of records have missing values, omit those records
  - If many records are missing values on a small set of variables, drop those variables
  - If many records have missing values, omission is not practical
- Solution 2: Imputation
  - Replace missing values with reasonable substitutes
  - Lets you keep the record and use the rest of its (non-missing) information
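A minimal sketch of imputation on hypothetical data, using the column mean as the “reasonable substitute” (other common choices include the median or a model-based estimate):

```python
def impute_mean(column):
    # replace missing entries (None) with the mean of the observed values,
    # keeping the record's non-missing information intact
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

# hypothetical column with one missing value
filled = impute_mean([1.0, None, 3.0])
```

The missing entry becomes 2.0 (the mean of 1.0 and 3.0), so the whole record can still be used by the algorithm.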
Partitioning the Data
- Problem: How well will our model perform with new data?
- Solution: Separate the data into two parts
  - Training partition: used to develop the model
  - Validation partition: used to apply the model, evaluate its performance on new data, and compare models to pick the best one
- Addresses the issues of overfitting and bias
Test Partition
- When a model is developed on training data, it can overfit the training data (hence the need to assess it on validation data)
- Assessing multiple models on the same validation data can overfit the validation data
- Some methods use the validation data to choose a parameter, which can also lead to overfitting the validation data
- Solution: the final selected model is applied to a test partition to give an unbiased estimate of its performance on new data
Types of Partition

Partition in 2 parts: X train / y train, X validation / y validation
- Training part is used to train the model
- Validation part is used to evaluate the trained model

Partition in 3 parts: X train / y train, X validation / y validation, X test / y test
- Training part is used to train the model
- Validation part is used to select a model (when there are multiple models available) or select parameters (when a method has tuning parameters)
- Test part is used to evaluate the model
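A minimal sketch of a three-way partition, assuming a simple shuffled split; the 60/20/20 fractions here are illustrative, not prescribed by the slides:

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=0):
    # shuffle, then split into train / validation / test partitions
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    return (shuffled[:n_train],                     # train the model
            shuffled[n_train:n_train + n_valid],    # select model/parameters
            shuffled[n_train + n_valid:])           # final unbiased evaluation

train, valid, test = partition(list(range(100)))
```

Every record lands in exactly one partition, so the test set stays untouched until the final evaluation.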
Summary
- Data mining consists of supervised methods (classification & prediction) and unsupervised methods (association rules, data reduction, data exploration & visualization)
- Before algorithms can be applied, data must be cleaned and pre-processed
- Issues to keep in mind: missing data, outliers, overfitting
- Data partitioning is used to avoid bias and overfitting
