02 - Data Mining
Prediction Tasks
Use some variables to predict unknown or future values of other
variables
Description Tasks
Find human-interpretable patterns that describe the data.
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class
attribute.
Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
Type of business, where they live, how much they earn, etc.
Use this information as input attributes to learn a classifier
model.
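A minimal sketch of the approach above, assuming a categorical naive Bayes classifier with Laplace smoothing (the slides do not name a specific classifier). The customer records, attribute values, and function names are hypothetical and only illustrate how input attributes and a {buy, don't buy} class attribute fit together.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    values_seen = defaultdict(set)        # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen

def predict(model, row):
    """Return the class with the highest log-probability for one customer."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        score = math.log(n_label / total)
        for i, value in enumerate(row):
            # Laplace smoothing keeps unseen values from zeroing the product.
            numerator = value_counts[(i, label)][value] + 1
            denominator = n_label + len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical history from a similar, earlier product:
# (type of business, where they live, income band)
customers = [
    ("retail", "urban", "high"),
    ("retail", "urban", "medium"),
    ("farming", "rural", "low"),
    ("farming", "rural", "medium"),
    ("services", "urban", "high"),
    ("services", "rural", "low"),
]
decisions = ["buy", "buy", "don't buy", "don't buy", "buy", "don't buy"]

model = train_naive_bayes(customers, decisions)
print(predict(model, ("retail", "urban", "high")))
```

Mailing only to customers predicted as "buy" is what reduces the mailing cost in this setting.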
Fraud Detection
Goal: Predict fraudulent cases in credit card
transactions.
Approach:
Use credit card transactions and information about the
account holder as attributes.
When a customer buys, what they buy, how often they
pay on time, etc.
Label past transactions as fraud or fair transactions. This forms
the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card
transactions on an account.
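The steps above can be sketched with a k-nearest-neighbour classifier. This is one possible model, not one prescribed by the slides. The three numeric attributes (scaled amount, night-time purchase flag, late-payment rate) and all data values are hypothetical, and the features are assumed to be scaled to comparable ranges already.

```python
import math
from collections import Counter

def knn_predict(labeled_transactions, query, k=3):
    """Label a transaction by majority vote among its k nearest labeled neighbours."""
    nearest = sorted(
        labeled_transactions,
        key=lambda record: math.dist(record[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical past transactions: (amount, night purchase, late-payment rate) -> class
history = [
    ((0.10, 0.0, 0.05), "fair"),
    ((0.15, 0.0, 0.10), "fair"),
    ((0.20, 1.0, 0.00), "fair"),
    ((0.05, 0.0, 0.20), "fair"),
    ((0.90, 1.0, 0.60), "fraud"),
    ((0.85, 1.0, 0.70), "fraud"),
    ((0.95, 0.0, 0.80), "fraud"),
]

# Observe new transactions on an account and classify each one.
for transaction in [(0.88, 1.0, 0.65), (0.12, 0.0, 0.08)]:
    print(transaction, knn_predict(history, transaction))
```

In practice fraud cases are rare compared with fair ones, so a real model would also need to handle the class imbalance; the sketch ignores that.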
Market Segmentation:
Goal: subdivide a market into distinct subsets
of customers where any subset may
conceivably be selected as a market target to
be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on
their geographical and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those from
different clusters.
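A small sketch of this approach, assuming k-means (Lloyd's algorithm) as the clustering method and Euclidean distance between purchase vectors as the measure of buying-pattern similarity. Neither choice comes from the slides, and the customer profiles, purchase counts, and function names are hypothetical.

```python
import math
from itertools import combinations

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]

def kmeans(points, initial_centroids, iterations=10):
    """Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = list(initial_centroids)
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign(points, centroids)

def mean_pair_distance(vectors, labels, same_cluster):
    """Average distance over pairs in the same cluster (or in different clusters)."""
    dists = [
        math.dist(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
        if (labels[i] == labels[j]) == same_cluster
    ]
    return sum(dists) / len(dists)

# Hypothetical customers: (scaled age, urban score) used for clustering ...
profiles = [(0.2, 0.9), (0.25, 0.8), (0.3, 0.85), (0.8, 0.2), (0.75, 0.1), (0.9, 0.15)]
# ... and purchases of two products, used only to judge cluster quality.
purchases = [(5, 0), (4, 1), (5, 1), (0, 4), (1, 5), (0, 5)]

labels = kmeans(profiles, [profiles[0], profiles[3]])
within = mean_pair_distance(purchases, labels, same_cluster=True)
between = mean_pair_distance(purchases, labels, same_cluster=False)
print(labels, within, between)
```

A clustering is useful for segmentation when buying patterns differ less within clusters than between them, that is, when `within` is clearly smaller than `between`.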
Document Clustering:
Goal: To find groups of documents that are
similar to each other based on the important
terms appearing in them.
Approach: Identify frequently occurring
terms in each document. Form a similarity
measure based on the frequencies of different
terms. Use it to cluster the documents.
Gain: Information Retrieval can utilize the
clusters to relate a new document or search
term to clustered documents.
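As an illustration of the approach, the sketch below counts term frequencies, uses cosine similarity as the similarity measure, and groups documents greedily by a similarity threshold. Cosine similarity, the threshold of 0.5, the grouping rule, and the sample documents are all assumptions made for this example; the slides do not fix any of them.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def group_documents(vectors, threshold=0.5):
    """Greedy grouping: join the first group whose first member is similar enough."""
    groups = []  # each group is a list of document indices
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine_similarity(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups

# Hypothetical documents
documents = [
    "team coach game score team win",
    "coach team game lost season",
    "stock market price stock trade",
]
vectors = [term_frequencies(d) for d in documents]
groups = group_documents(vectors)
print(groups)
```

A new document or search term can be related to the existing clusters the same way: build its term-frequency vector and compare it with each group.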
TID  Items
1    Bread, Coke, Milk
2    Juice, Bread
3    Juice, Coke, Diaper, Milk
4    Juice, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Juice}
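To see how strongly the five transactions back the two rules, the sketch below computes support and confidence for each. Support and confidence are the standard measures for association rules; the function names are chosen for this example.

```python
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Juice", "Bread"},
    {"Juice", "Coke", "Diaper", "Milk"},
    {"Juice", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions)

def support(lhs, rhs):
    """Fraction of all transactions containing both sides of the rule."""
    return count(lhs | rhs) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return count(lhs | rhs) / count(lhs)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Juice"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs), support(lhs, rhs), confidence(lhs, rhs))
```

{Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Juice} holds in 2 of the 3 transactions that contain both Diaper and Milk.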
+4*NumberOfChildren+0.0001*Income+…
Predicting wind velocities as a function of
temperature, humidity, and pressure.
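A rough sketch of the wind-velocity example, assuming a linear model fitted by ordinary least squares through the normal equations. The slides do not say which regression model is meant. The measurements are synthetic: they are generated from made-up coefficients so that the fit can be checked against them.

```python
def fit_linear(X, y):
    """Ordinary least squares: solve (X^T X) w = X^T y, with an intercept term."""
    rows = [[1.0] + list(r) for r in X]  # leading 1.0 models the intercept
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * target for r, target in zip(rows, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        pivot_row = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot_row] = A[pivot_row], A[col]
        pivot = A[col][col]
        A[col] = [v / pivot for v in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]

def predict(weights, x):
    """Evaluate the fitted linear function at one observation."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))

# Synthetic observations: (temperature in C, humidity in %, pressure in kPa)
observations = [
    (10, 30, 101), (15, 45, 100), (20, 50, 102),
    (25, 35, 99), (30, 60, 101), (12, 70, 103),
]
# Made-up "true" relationship used only to generate the target values.
true_weights = [1.0, 0.2, -0.05, 0.1]
wind = [predict(true_weights, obs) for obs in observations]

weights = fit_linear(observations, wind)
print(weights)
print(predict(weights, (18, 40, 100)))
```

Because the targets are noise-free, the fitted weights should come out close to the made-up ones; real measurements would leave a residual error.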
Deviation/Anomaly Detection
Ordered
Sequential Data
Spatial Data
Temporal Data
Document-term matrix (term counts per document)
Terms (column order not preserved): timeout, season, coach, game, score, team, ball, lost, play, win
Document 1 3 0 5 0 2 6 0 2 0 2
Document 2 0 7 0 2 1 0 0 3 0 0
Document 3 0 1 0 0 1 2 2 0 3 0
An element of the sequence
Spatio-Temporal Data
Average Monthly Temperature of land and ocean