Chap1 Intro
Chapter 1
Prediction Methods
– Use some variables to predict unknown or
future values of other variables.
Description Methods
– Find human-interpretable patterns that
describe the data.
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996
[Figure: a central data table surrounded by the four data mining tasks: Clustering, Predictive Modeling, Anomaly Detection, and Association Rules.]

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
Tid  Employed  Level of Education  # years at present address  Credit Worthy (class)
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

[Figure: decision tree learned from the table. Node labels: Employed (Yes / No), Education ({High school, Undergrad} / Graduate), Number of years (two nodes). Leaves: Yes / No.]

Training Set --> Learn Classifier --> Model
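The "Training Set, Learn Classifier, Model" flow can be sketched in a few lines. This is a minimal illustration, not the decision tree from the figure: it learns a one-level rule (a decision stump) on the number of years at the present address, using only the four records listed above. The function names and the new applicant record are hypothetical.

```python
# Training set: the four records listed above (the elided rows are omitted).
training_set = [
    {"employed": "Yes", "education": "Graduate",    "years": 5,  "credit_worthy": "Yes"},
    {"employed": "Yes", "education": "High School", "years": 2,  "credit_worthy": "No"},
    {"employed": "No",  "education": "Undergrad",   "years": 1,  "credit_worthy": "No"},
    {"employed": "Yes", "education": "High School", "years": 10, "credit_worthy": "Yes"},
]

def learn_stump(rows, attr="years", label="credit_worthy"):
    """Learn a one-level rule: predict 'Yes' when row[attr] >= threshold.

    Returns (threshold, training_errors) for the threshold with the fewest errors.
    """
    values = sorted({row[attr] for row in rows})
    # Candidate thresholds are midpoints between consecutive distinct values.
    candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    best = None
    for t in candidates:
        errors = sum(("Yes" if row[attr] >= t else "No") != row[label] for row in rows)
        if best is None or errors < best[1]:
            best = (t, errors)
    return best

def classify(threshold, record, attr="years"):
    """Apply the learned rule to a record whose class is unknown."""
    return "Yes" if record[attr] >= threshold else "No"

# Learn the model from the training set, then apply it to an unseen record.
threshold, training_errors = learn_stump(training_set)
new_applicant = {"employed": "Yes", "education": "Undergrad", "years": 8}  # hypothetical
prediction = classify(threshold, new_applicant)
print(threshold, training_errors, prediction)
```

On these four records the learned threshold falls between 2 and 5 years and makes no training errors. A real classifier would use all attributes and be evaluated on records held out from training.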
Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
Use credit card transactions and the information
on their account-holders as attributes.
– When does the customer buy, what do they buy, how
often do they pay on time, etc.
Label past transactions as fraud or fair
transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit
card transactions on an account.
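The approach above (label past transactions, learn a model, apply it to new transactions) might look like the following sketch. Everything in it is an assumption made for illustration: the two features, the labeled history, and the choice of a 1-nearest-neighbor classifier, which the slides do not prescribe.

```python
import math

# Hypothetical labeled history. Each entry is
# ((amount / customer's average amount, on-time payment rate), class label).
labeled_transactions = [
    ((1.0, 0.95), "fair"),
    ((0.8, 0.90), "fair"),
    ((1.2, 0.99), "fair"),
    ((6.0, 0.40), "fraud"),
    ((8.5, 0.55), "fraud"),
]

def predict(features, history=labeled_transactions):
    """Return the class of the closest labeled transaction (Euclidean distance)."""
    _, label = min(history, key=lambda item: math.dist(item[0], features))
    return label

# Observe new transactions on an account and flag them with the model.
incoming = [(7.0, 0.50), (1.1, 0.90)]
flags = [predict(t) for t in incoming]
print(flags)
```

In practice the features would need scaling so that no single attribute dominates the distance, and fraud cases are typically rare, so accuracy alone is a poor measure of such a model.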
Introduction to Data Mining, 2nd Edition
09/09/2020 19
Tan, Steinbach, Karpatne, Kumar
Classification: Application 2
object.
Model the class based on these features.
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Use of K-means to partition Sea Surface Temperature (SST)
and Net Primary Production (NPP) into clusters that
reflect the Northern and Southern Hemispheres.

[Figure: world map of the resulting clusters, with longitude from -180 to 180 on the horizontal axis. Surviving legend entries: Land Cluster 2, Ice or No NPP, Sea Cluster 2, Sea Cluster 1.]
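For readers unfamiliar with K-means, here is a minimal sketch of the standard iteration (Lloyd's algorithm). It runs on a handful of hypothetical 2-D points, not on SST or NPP data, and seeds the centroids with the first k points so the result is reproducible. Practical implementations usually choose seeds more carefully.

```python
import math

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; the first k points seed the centroids (deterministic)."""
    centroids = [points[i] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The first three points end up in one cluster and the last three in the other, even though both initial centroids start inside the first group.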
Clustering: Application 1
Market Segmentation:
– Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
– Approach:
Collect different attributes of customers based on
their geographical and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying
patterns of customers in the same cluster vs. those
from different clusters.
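One way to make the quality check above concrete is to compare how far apart buying patterns are within a cluster versus across clusters. The sketch below assumes each customer already has a cluster id and a small numeric buying-pattern vector. The customers, features, and function name are all hypothetical.

```python
import math
from itertools import combinations

# Hypothetical customers: (cluster id, (purchases per month, average basket value)).
customers = [
    (0, (2.0, 15.0)), (0, (3.0, 18.0)), (0, (2.5, 16.0)),
    (1, (9.0, 60.0)), (1, (10.0, 55.0)), (1, (8.0, 58.0)),
]

def mean_pairwise_distance(items, same_cluster):
    """Average distance over pairs that are (or are not) in the same cluster."""
    dists = [
        math.dist(a, b)
        for (ca, a), (cb, b) in combinations(items, 2)
        if (ca == cb) == same_cluster
    ]
    return sum(dists) / len(dists)

within = mean_pairwise_distance(customers, same_cluster=True)
between = mean_pairwise_distance(customers, same_cluster=False)
print(within < between)
```

A useful segmentation should give a within-cluster average that is clearly smaller than the between-cluster average.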
Document Clustering:
– Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
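Similarity "based on the terms appearing in them" is often measured with cosine similarity over term-count vectors. The sketch below assumes that measure and a naive lowercase, whitespace-based tokenizer. The slides name neither, and the three documents are hypothetical. It also uses raw counts, whereas selecting "important" terms would normally involve weighting or stop-word removal.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term counts; lowercase whitespace tokenization (an assumption)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = {  # hypothetical documents
    "d1": "stock market trading shares",
    "d2": "shares trading on the stock exchange",
    "d3": "football match final score",
}
vectors = {name: term_vector(text) for name, text in docs.items()}
sim_12 = cosine_similarity(vectors["d1"], vectors["d2"])
sim_13 = cosine_similarity(vectors["d1"], vectors["d3"])
print(sim_12 > sim_13)
```

Documents d1 and d2 share three terms and score well above d1 and d3, which share none. A clustering algorithm would use such pairwise scores to group d1 with d2.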
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
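The two rules can be checked against the five transactions using support and confidence, the usual measures of rule strength. These measures are not defined in this section, so treat their use here as an assumption. Support is the fraction of transactions containing every item in the rule. Confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side.

```python
# The five transactions from the table above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions)

def support(lhs, rhs):
    """Fraction of all transactions containing both sides of the rule."""
    return count(lhs | rhs) / len(transactions)

def confidence(lhs, rhs):
    """Fraction of transactions containing lhs that also contain rhs."""
    return count(lhs | rhs) / count(lhs)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
results = [(support(lhs, rhs), confidence(lhs, rhs)) for lhs, rhs in rules]
for (lhs, rhs), (s, c) in zip(rules, results):
    print(sorted(lhs), "-->", sorted(rhs), f"support={s:.2f} confidence={c:.2f}")
```

{Milk} --> {Coke} holds in 3 of the 4 transactions containing Milk (support 0.60, confidence 0.75). {Diaper, Milk} --> {Beer} holds in 2 of the 3 transactions containing both Diaper and Milk (support 0.40, confidence about 0.67).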
Market-basket analysis
– Rules are used for sales promotion, shelf
management, and inventory management
Medical Informatics
– Rules are used to find combinations of patient
symptoms and test results associated with certain
diseases
Association Analysis: Applications
Scalability
High Dimensionality
Non-traditional Analysis