
Overview: Chapter 2

Data Mining
Shmueli, Patel & Bruce

Presented by Yinfei Kong, Ph.D.

Associate Professor of ISDS


Core Ideas in Data Mining
- Classification
- Prediction
- Association Rules
- Data management and exploration

Two types of methods:
- Supervised learning
- Unsupervised learning
Supervised Learning
- Goal: Predict a single “target” or “outcome” variable
- Training data, from which the algorithm “learns”: the value of the outcome of interest is known
- Apply to test data, where the value is not known and will be predicted
- Methods: Classification and Prediction
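The train-then-apply workflow above can be sketched in a few lines. This toy example uses a 1-nearest-neighbor rule standing in for any classifier, with made-up spending data; it is an illustration, not the book's own method:

```python
def predict_1nn(train, x_new):
    # train: list of (feature, label) pairs whose outcome is known
    # return the label of the training point whose feature is closest to x_new
    nearest_x, nearest_label = min(train, key=lambda pair: abs(pair[0] - x_new))
    return nearest_label

# hypothetical training data: spending score -> purchase outcome
train = [(1.0, "no purchase"), (2.0, "no purchase"), (8.0, "purchase")]

# apply to a "test" observation whose outcome is unknown
predicted = predict_1nn(train, 7.5)
```

The outcome is known in `train` (supervised learning) and predicted for the new observation.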


Supervised: Classification
- Goal: Predict a categorical target (outcome) variable
- Examples: purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy
- The target variable is often binary (yes/no)

Supervised: Prediction
- Goal: Predict a numerical target (outcome) variable
- Examples: sales, revenue, performance

Taken together, classification and prediction constitute predictive analytics.
Unsupervised Learning
- Goal: Segment data into meaningful segments; detect patterns
- There is no target (outcome) variable to predict or classify, so there is no need to partition the data
- Methods: Association rules, data reduction & exploration, visualization, clustering
Unsupervised: Association Rules
- Goal: Produce rules that define “what goes with what”
- Example: “If X was purchased, Y was also purchased”
- Rows are transactions
- Used in recommender systems (Amazon.com, Netflix.com): “Our records show you bought X; you may also like Y”
- Also called affinity analysis or market basket analysis
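A minimal sketch of the idea behind association rules, assuming hypothetical basket data: counting how often pairs of items are purchased together is the raw material for “what goes with what” rules (real association-rule mining goes further, computing support, confidence, and lift):

```python
from collections import Counter
from itertools import combinations

def pair_counts(transactions):
    # count how often each unordered pair of items appears in the same basket
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# hypothetical transactions; each row is one market basket
baskets = [["milk", "bread"], ["milk", "bread", "eggs"], ["bread", "eggs"]]
together = pair_counts(baskets)
```

Here `("bread", "milk")` co-occurs in two of three baskets, suggesting a rule like “if bread was purchased, milk was also purchased.”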
Pre-processing Data

Types of Variables
- The variable types determine the pre-processing needed and the algorithms used
- Main distinction: categorical vs. numeric
- Numeric: continuous or integer
- Categorical (or nominal): ordered (low, medium, high) or unordered (male, female)
Variable handling
- Numeric
  - Most algorithms in XLMiner can handle numeric data
  - May occasionally need to “bin” numeric values into categories
- Categorical
  - Most other algorithms require binary dummies (number of dummies = number of categories - 1)
  - Example: work status: employed (yes/no), unemployed (yes/no), retired (yes/no), student (yes/no)
  - XLMiner can convert categorical variables into binary dummies
Creating dummy variables
- For categorical variables, it is sometimes necessary to create dummy variables.
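As a sketch of dummy coding, using the hypothetical work-status variable from above: the last category serves as the reference level (all zeros), which is why k categories yield k - 1 dummies:

```python
def dummy_code(value, categories):
    # k categories -> k-1 binary dummies; the last category is the
    # reference level (encoded as all zeros)
    return [1 if value == c else 0 for c in categories[:-1]]

# hypothetical work-status variable with four categories
statuses = ["employed", "unemployed", "retired", "student"]
```

For example, `dummy_code("employed", statuses)` gives `[1, 0, 0]`, while the reference category "student" is encoded as `[0, 0, 0]`.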
Pre-processing steps (very subjective; people clean data differently)

Outliers
- An outlier is an observation that is “extreme”, lying far from the rest of the data
- Outliers can have a disproportionate influence on models (a problem if the outlier is spurious)
- An important step in data pre-processing is detecting outliers
- Once detected, domain knowledge is required to determine whether an outlier is an error or truly extreme
- Statistical definition of an outlier: > Q3 + 1.5*IQR or < Q1 - 1.5*IQR
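The IQR rule above can be applied directly; a minimal sketch in Python follows (quartile conventions vary slightly between tools, so cutoffs may differ a little at the margins):

```python
import statistics

def iqr_outliers(values):
    # quartiles via the default "exclusive" method; other conventions
    # give slightly different Q1/Q3 and hence slightly different cutoffs
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# hypothetical data with one extreme value
flagged = iqr_outliers([10, 12, 11, 13, 12, 11, 10, 95])
```

Flagged values still need a domain-knowledge check: the 95 here could be a data-entry error or a genuine extreme.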
Handling Missing Data
- Most algorithms will not process records with missing values; the default is to drop those records
- Solution 1: Omission
  - If a small number of records have missing values, omit those records
  - If many records are missing values on a small set of variables, drop those variables
  - If many records have missing values, omission is not practical
- Solution 2: Imputation
  - Replace missing values with reasonable substitutes
  - Lets you keep the record and use the rest of its (non-missing) information
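A minimal sketch of imputation on hypothetical data, using the column mean as the “reasonable substitute” (other common choices include the median or a model-based estimate):

```python
def impute_mean(column):
    # replace missing entries (None) with the mean of the observed values,
    # keeping the record's non-missing information intact
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

# hypothetical column with one missing value
filled = impute_mean([1.0, None, 3.0])
```

The missing entry becomes 2.0 (the mean of 1.0 and 3.0), so the whole record can still be used by the algorithm.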
Partitioning the Data
- Problem: How well will our model perform with new data?
- Solution: Separate the data into two parts
  - Training partition: used to develop the model
  - Validation partition: used to apply the model, evaluate its performance on new data, and compare models to pick the best one
- Addresses the issues of overfitting and bias
Test Partition
- When a model is developed on training data, it can overfit the training data (hence the need to assess it on validation data)
- Assessing multiple models on the same validation data can overfit the validation data
- Some methods use the validation data to choose a parameter, which can also lead to overfitting the validation data
- Solution: the final selected model is applied to a test partition to give an unbiased estimate of its performance on new data
Types of Partition

Partition in 2 parts: X train / y train, X validation / y validation
- Training part is used to train the model
- Validation part is used to evaluate the trained model

Partition in 3 parts: X train / y train, X validation / y validation, X test / y test
- Training part is used to train the model
- Validation part is used to select a model (when there are multiple models available) or select parameters (when a method has tuning parameters)
- Test part is used to evaluate the model
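A minimal sketch of a three-way partition, assuming a simple shuffled split; the 60/20/20 fractions here are illustrative, not prescribed by the slides:

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=0):
    # shuffle, then split into train / validation / test partitions
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    return (shuffled[:n_train],                     # train the model
            shuffled[n_train:n_train + n_valid],    # select model/parameters
            shuffled[n_train + n_valid:])           # final unbiased evaluation

train, valid, test = partition(list(range(100)))
```

Every record lands in exactly one partition, so the test set stays untouched until the final evaluation.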
Summary
- Data mining consists of supervised methods (classification & prediction) and unsupervised methods (association rules, data reduction, data exploration & visualization)
- Before algorithms can be applied, data must be cleaned and pre-processed
- Issues to keep in mind: missing data, outliers, overfitting
- Data partitioning is used to avoid bias and overfitting
