Unit 4 Intro DM
Data Mining: Introduction
How?
Database
– Find all credit applicants with the last name Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items that are frequently purchased with milk. (association rules)
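The difference between a database query and a data mining question can be sketched with the milk example. This is a minimal illustration, not a full association-rule algorithm: the transaction data, the function names, and the `min_confidence` threshold are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical transactions: (customer, set of items bought together).
transactions = [
    ("ann", {"milk", "bread", "eggs"}),
    ("bob", {"milk", "bread"}),
    ("cara", {"beer", "chips"}),
    ("dev", {"milk", "bread", "butter"}),
    ("eli", {"milk", "eggs"}),
]


def customers_who_bought(item, transactions):
    """Database-style query: an exact condition with an exact answer."""
    return sorted(customer for customer, items in transactions if item in items)


def frequently_bought_with(item, transactions, min_confidence=0.5):
    """Mining-style question: which items co-occur with `item` often enough?

    Returns {other_item: fraction of baskets containing `item` that also
    contain other_item}, keeping only fractions >= min_confidence.
    """
    baskets = [items for _, items in transactions if item in items]
    if not baskets:
        return {}
    counts = Counter(other for items in baskets for other in items if other != item)
    return {
        other: n / len(baskets)
        for other, n in counts.items()
        if n / len(baskets) >= min_confidence
    }


print(customers_who_bought("milk", transactions))
print(frequently_bought_with("milk", transactions))
```

The first call answers a fully specified question. The second has no fixed answer in the data: its result depends on the threshold chosen, which is typical of mining tasks.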
Data Mining Algorithm
Characterized as consisting of 3 parts:
[Figure: Tridas, Vickie, and Mike, with the labels Honest and Crooked; panel title: Prediction]
Basic Data Mining Tasks
Classification: maps data into predefined groups or classes
⚫ Example: an airport security screening station used to determine whether passengers are terrorists or criminals.
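As a rough sketch of what "mapping data into predefined classes" can look like, the code below labels a new record with the class of its closest labelled example. Nearest-neighbour is only one possible classifier and is not named in these slides. The credit-risk data, the feature choice, and the `classify` function are assumptions made for illustration.

```python
import math

# Hypothetical labelled examples: (income, debt) in $1000s -> predefined class.
training = [
    ((80.0, 5.0), "good risk"),
    ((65.0, 10.0), "good risk"),
    ((30.0, 25.0), "poor risk"),
    ((25.0, 40.0), "poor risk"),
]


def classify(point, training):
    """Return the class of the training example nearest to `point` (1-NN)."""
    _, nearest_label = min(
        training, key=lambda example: math.dist(point, example[0])
    )
    return nearest_label


print(classify((70.0, 8.0), training))   # closest to (65, 10)
print(classify((28.0, 30.0), training))  # closest to (30, 25)
```

The classes ("good risk", "poor risk") are fixed before any new record is seen. That is what separates classification from clustering, where the groups are not known in advance.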
Data Mining Issues
Outliers
Interpretation
Visualization
Irrelevant Data
Noisy Data
Changing Data
Integration
Application
Social Implications of DM
Privacy
Profiling
Unauthorized use
Database Perspective on Data Mining
Implementation issues of concern:
◦ Scalability (massive datasets)
◦ Real-world data (noisier)
◦ Update (data is dynamic at times)
◦ Ease of use (quite complex to understand)
Data Pre-processing
Data cleaning
⚫ Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
Data integration
⚫ Integration of multiple databases, data cubes, or files
Data transformation
⚫ Normalization and aggregation
Data reduction
⚫ Obtains a representation that is reduced in volume but produces the same or
similar analytical results
Data discretization
⚫ Part of data reduction but with particular importance, especially for
numerical data
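Three of the tasks above (cleaning, transformation, discretization) can be sketched on a single numeric attribute. This is a minimal illustration under assumptions: mean imputation, min-max normalization, and equal-width binning are common choices but only some of many, and the function names and the `ages` data are hypothetical.

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace None with the mean of the observed values.

    Assumes at least one value is not None.
    """
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Data discretization: map each value to a bin index 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [20, None, 40, 60]  # hypothetical attribute with one missing value
cleaned = fill_missing_with_mean(ages)
scaled = min_max_normalize(cleaned)
binned = equal_width_bins(cleaned, 2)
print(cleaned, scaled, binned)
```

For this input the missing age is filled with 40.0, the scaled values are 0.0, 0.5, 0.5, 1.0, and the two-bin discretization yields 0, 1, 1, 1.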
Why Data Preprocessing?
Data in the real world is dirty
⚫ incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
⚫ noisy: containing errors or outliers
⚫ inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
⚫ Quality decisions must be based on quality data
⚫ Data warehouse needs consistent integration of quality
data
Forms of data preprocessing
Data Cleaning