01 Intro to Data Mining
Description Methods
✓ Find human-interpretable patterns that describe the
data.
[Figure: example data table with columns Tid, Refund, Marital Status, Taxable Income, Cheat]
Find a model for the class attribute as a function of the values of
the other attributes.

Training set (class attribute: Credit Worthy):
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Test set:
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …

[Figure: model for predicting credit worthiness, a decision tree with splits on Employed (No / Yes), Education ({High school, Undergrad} / Graduate), and Number of years, ending in Yes / No leaves]
[Figure: the training set is used to learn a classifier (the model), which is then applied to the test set]
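The learn-then-apply flow can be sketched in a few lines of Python. This is a minimal illustration, not the decision tree from the slide: it learns a single-split rule (a decision stump) on the "# years at present address" attribute alone, using only the four training rows visible in the table. The function names `learn_stump` and `predict`, and the choice to predict "Yes" above the threshold, are assumptions made for the example.

```python
# Visible training rows: (employed, education, years_at_address, credit_worthy)
train = [
    ("Yes", "Graduate", 5, "Yes"),
    ("Yes", "High School", 2, "No"),
    ("No", "Undergrad", 1, "No"),
    ("Yes", "High School", 10, "Yes"),
]
# Visible test rows: the class label is unknown ("?") and must be predicted.
test = [
    ("Yes", "Undergrad", 7),
    ("No", "Graduate", 3),
    ("Yes", "High School", 2),
]

def learn_stump(rows):
    """Pick the threshold on years that misclassifies the fewest training rows.

    Assumed rule shape: predict "Yes" when years > threshold, else "No".
    """
    years = sorted({r[2] for r in rows})
    # Candidate thresholds are midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(years, years[1:])]

    def errors(t):
        return sum(("Yes" if r[2] > t else "No") != r[3] for r in rows)

    return min(candidates, key=errors)

def predict(threshold, row):
    """Apply the learned rule to one row (years is at index 2)."""
    return "Yes" if row[2] > threshold else "No"

threshold = learn_stump(train)                        # learn on the training set
predictions = [predict(threshold, r) for r in test]   # apply to the test set
print(threshold, predictions)
```

A real classifier would use all attributes and far more rows; the point here is only the separation between learning a model from labeled records and applying it to unlabeled ones.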
• Classifying credit card transactions as legitimate or fraudulent
• Classifying land covers (water bodies, urban areas, forests, etc.) using
satellite data
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Identifying intruders in cyberspace
• Predicting tumor cells as benign or malignant
• Classifying protein secondary structures as alpha-helix, beta-sheet, or
random coil
GOAL:
Predict fraudulent cases in credit card transactions
APPROACH:
▪ Use credit card transactions and information about the account holder
as attributes.
▪ When does a customer buy, what do they buy, how often do they pay
on time, etc.
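As a rough sketch of what "using transactions as attributes" might look like, the snippet below turns raw records into one attribute vector per account. Everything in it is hypothetical: the record fields (`hour`, `category`, `amount`), the payment history, the attribute names, and the "before 6 a.m." cutoff are illustrative assumptions, not part of the original material.

```python
from collections import Counter

# Hypothetical transaction records for one account holder (illustrative only).
transactions = [
    {"hour": 14, "category": "grocery", "amount": 42.0},
    {"hour": 2, "category": "electronics", "amount": 900.0},
    {"hour": 15, "category": "grocery", "amount": 38.5},
    {"hour": 16, "category": "fuel", "amount": 50.0},
]
# Hypothetical payment history: True means the bill was paid on time.
payments_on_time = [True, True, False, True]

def account_attributes(transactions, payments_on_time):
    """Summarize when, what, and how reliably a customer buys and pays."""
    n = len(transactions)
    night = sum(1 for t in transactions if t["hour"] < 6)  # assumed cutoff
    categories = Counter(t["category"] for t in transactions)
    return {
        "night_purchase_rate": night / n,                   # when they buy
        "top_category": categories.most_common(1)[0][0],    # what they buy
        "mean_amount": sum(t["amount"] for t in transactions) / n,
        "on_time_rate": sum(payments_on_time) / len(payments_on_time),
    }

attrs = account_attributes(transactions, payments_on_time)
print(attrs)
```

Attribute vectors like this one, paired with a fraud / legitimate label for past transactions, would form the training set for a classifier.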
▪ Examples:
✓ Predicting sales amounts of a new product based on advertising
expenditure.
✓ Predicting wind velocities as a function of temperature, humidity, air
pressure, etc.
✓ Time series prediction of stock market indices.
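The first example, predicting a continuous value such as sales from advertising expenditure, can be illustrated with ordinary least squares on a single input. This is only one simple way to fit such a model, and the numbers below are made up for the illustration; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend and resulting sales (arbitrary units).
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 5.0 + intercept  # prediction for a spend of 5.0
print(slope, intercept, predicted_sales)
```

Unlike classification, the predicted target here is a number rather than a class label.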
▪ Finding groups of objects such that the objects in a group will be
similar (or related) to one another and different from (or unrelated to)
the objects in other groups.
▪ Intra-cluster distances are minimized; inter-cluster distances are
maximized.
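One common way to find such groups is k-means, which alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its members. The sketch below is a minimal pure-Python version under stated assumptions: the six 2-D points and the initial centroids are hypothetical, and k-means is named here only as an example technique, not as the method the slides prescribe.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means on tuples of floats; returns (labels, centroids)."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Hypothetical points forming two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels, centroids)
```

Points that share a label are close to their own centroid (small intra-cluster distance) and far from the other centroid (large inter-cluster distance).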
▪ Given a set of records, each of which contains some number of items from
a given collection
▪ Produce dependency rules that predict the occurrence of an item
based on the occurrences of other items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
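The strength of a rule like these is usually measured with support and confidence. Those two measures are not named in the text above; they are introduced here as the standard definitions: support is the fraction of transactions containing every item in the rule, and confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side. A short sketch over the five transactions listed:

```python
# The five transactions from the table, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support(lhs, rhs):
    """Fraction of all transactions containing both sides of the rule."""
    return count(lhs | rhs) / len(transactions)

def confidence(lhs, rhs):
    """Among transactions containing lhs, the fraction also containing rhs."""
    return count(lhs | rhs) / count(lhs)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          "support:", support(lhs, rhs),
          "confidence:", confidence(lhs, rhs))
```

For {Milk} --> {Coke}, Milk appears in four transactions and three of those also contain Coke, so the rule holds in three of the five transactions overall and in three of the four where it applies.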
▪ Market-basket analysis
▪ Rules are used for sales promotion, shelf management, and inventory
management
▪ Medical Informatics
▪ Rules are used to find combinations of patient symptoms and test results
associated with certain diseases
▪ Detect significant deviations from normal behavior
▪ Applications:
▪ Credit Card Fraud Detection
▪ Network Intrusion Detection
▪ Identify anomalous behavior from sensor networks for monitoring and surveillance.
▪ Detecting changes in the global forest cover.
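A very simple way to flag "significant deviations from normal behavior" is a z-score test: measure how many standard deviations each value lies from the mean and flag those beyond a cutoff. The sketch below assumes hypothetical sensor readings and a hypothetical cutoff of 2.0; practical detectors are typically more robust, since a large outlier inflates the very mean and standard deviation used to judge it.

```python
import statistics

def zscore_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` std devs from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    if sd == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / sd > threshold]

# Hypothetical sensor readings; the last one is far from the rest.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 25.0]
anomalies = zscore_anomalies(readings)
print(anomalies)
```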
1. Problem Definition
2. Data Preparation
3. Data Exploration
4. Modeling
5. Evaluation
6. Deployment
▪ Understanding the project objectives and requirements from
a business perspective and then converting this knowledge
into a data mining problem definition with a preliminary plan
designed to achieve the objectives.