DMlecture1
Course Overview
Data Mining Overview
Understanding Data
Classification: Decision Trees and Bayesian classifiers,
ANN, SVM
Association Rules Mining: APriori, FP-growth
Clustering: Hierarchical and Partition approaches
Dimensionality Reduction
Advanced topics: social network graph mining, outlier detection
What is Data Mining?
Overview of terms
Knowledge Discovery
Examples of Data Mining Applications
How Data Mining is used
The Data Mining Process
Origins of Data Mining
[Figure: only two labels survive the extraction: "Heterogeneous, distributed nature of data" and "Database systems"]
Data Mining Tasks
Prediction Methods
Use some variables to predict unknown or future
values of other variables.
Description Methods
Find human-interpretable patterns that describe the
data.
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Data Mining Methods
Why Data Preprocessing?
Why can Data be Incomplete?
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
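The first two cleaning tasks can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed method: the income values are hypothetical, missing values are filled with the median, outliers are flagged with Tukey's 1.5 * IQR rule, and noise is smoothed by replacing each value with the mean of its bin. All function names are made up for this example.

```python
from statistics import mean, median, quantiles

# Hypothetical incomes in thousands; None marks a missing value.
incomes = [125, 100, None, 120, 95, 60, 2200, 85, 75, 90]

def fill_missing(values):
    """Replace each None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def find_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

def smooth_by_bin_means(values, bin_size=3):
    """Sort, split into bins of bin_size, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        smoothed.extend([mean(chunk)] * len(chunk))
    return smoothed

filled = fill_missing(incomes)
outliers = find_outliers(filled)
cleaned = [v for v in filled if v not in outliers]
print("filled:  ", filled)
print("outliers:", outliers)
print("smoothed:", smooth_by_bin_means(cleaned))
```

Correcting inconsistent data usually depends on domain rules (for example, conflicting codes for the same category), so it is left out of this sketch.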
Classification: Definition
Classification Example
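[Figure: a training set is used to learn a classifier model. Table rows recoverable from the extraction, with the columns Tid, Home Owner, Marital Status, Taxable Income, Default:]
7    Yes  Divorced  220K  No
8    No   Single    85K   Yes
9    No   Married   75K   No
10   No   Single    90K   Yes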
Example of a Decision Tree
Training data columns: Tid, Home Owner, Marital Status, Taxable Income, Default
[Figure: decision tree over this table, with its splitting attributes labelled]
Another Example of Decision Tree
Tid  Home Owner  Marital Status  Taxable Income  Default
1    Yes         Single          125K            No
2    No          Married         100K            No
3    No          Single          70K             No
4    Yes         Married         120K            No
5    No          Divorced        95K             Yes
6    No          Married         60K             No
7    Yes         Divorced        220K            No
8    No          Single          85K             Yes
9    No          Married         75K             No
10   No          Single          90K             Yes

MarSt
  Married: NO
  Single, Divorced: HO
    Yes: NO
    No: TaxInc
      < 80K: NO
      > 80K: YES

There could be more than one tree that fits the same data!
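The tree that splits on marital status first, then home ownership, then taxable income can be written as nested conditionals and checked against the ten training records. This is a small sketch: the function and variable names are chosen for illustration, and because the slide only labels the income branches "< 80K" and "> 80K", sending an income of exactly 80K to the YES branch is an assumption.

```python
# Training records from the slide:
# (Tid, home_owner, marital_status, taxable_income_in_thousands, default)
RECORDS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

def classify(home_owner, marital_status, taxable_income_k):
    """Predict the Default label by walking the tree from the root."""
    if marital_status == "Married":      # MarSt = Married -> NO
        return "No"
    if home_owner == "Yes":              # MarSt = Single/Divorced, HO = Yes -> NO
        return "No"
    # HO = No: split on taxable income (exactly 80K goes to YES by assumption)
    return "No" if taxable_income_k < 80 else "Yes"

correct = sum(
    classify(owner, status, income) == label
    for _, owner, status, income, label in RECORDS
)
print(f"{correct} of {len(RECORDS)} training records classified correctly")
```

Fitting every training record says nothing about accuracy on unseen data, which is why a separate test set is used to evaluate the model.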
Classification: Application 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers
likely to buy a new cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class attribute.
Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
Type of business, where they stay, how much they earn, etc.
Use this information as input attributes to learn a classifier model.
Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and information on the account holder as attributes.
When a customer buys, what they buy, how often they pay on time, etc.
Label past transactions as fraud or fair transactions. This forms the
class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card
transactions on an account.
Clustering Definition
Illustrating Clustering
Euclidean distance-based clustering in 3-D space.
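One common way to cluster points by Euclidean distance is k-means, sketched below for 3-D points. The slide does not name an algorithm, so the choice of k-means is an assumption, as are the sample points, the naive initialisation (the first k points), and the function names.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def assign(points, centroids):
    """Group points by their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, iterations=20):
    """Alternate between assigning points and recomputing centroids."""
    centroids = list(points[:k])  # naive initialisation (assumption)
    for _ in range(iterations):
        clusters = assign(points, centroids)
        centroids = [
            # mean of each coordinate; keep the old centroid if a cluster is empty
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, assign(points, centroids)

# Hypothetical 3-D points forming two well-separated groups.
POINTS = [(1, 1, 1), (1, 2, 1), (2, 1, 2), (8, 8, 8), (9, 8, 9), (8, 9, 9)]

centroids, clusters = kmeans(POINTS, k=2)
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

Points in the same cluster end up closer to their own centroid than to the other one, which is the intra-cluster versus inter-cluster distance idea the figure illustrates.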
Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a market
target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their
geographical and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by comparing the buying patterns of
customers in the same cluster with those of customers in different clusters.
Clustering: Application 2
Document Clustering:
Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
Approach: To identify frequently occurring terms in
each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
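A similarity measure built from term frequencies could look like the sketch below, which uses cosine similarity between word-count vectors. The slide does not say which measure is intended, so cosine similarity is an assumption; the three documents and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def term_frequencies(text):
    """Count how often each lower-cased, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical documents: two about data mining, one about sport.
DOCS = [
    "data mining finds patterns in data",
    "mining patterns from large data",
    "the team won the football match",
]

tf = [term_frequencies(d) for d in DOCS]
for i in range(len(DOCS)):
    for j in range(i + 1, len(DOCS)):
        print(i, j, round(cosine_similarity(tf[i], tf[j]), 3))
```

Any clustering algorithm that accepts a pairwise similarity (hierarchical clustering, for instance) can then group the documents; here the first two documents score far higher with each other than with the third.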
Association Rule Discovery: Definition
Given a set of records, each of which contains some
number of items from a given collection,
produce dependency rules that predict the occurrence of an
item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
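The strength of the two discovered rules can be checked on the five transactions with the standard support and confidence measures. The slide does not state which measures or thresholds were used, so treat this as an illustrative sketch; the function names are made up for the example.

```python
# The five transactions from the slide, indexed by TID.
TRANSACTIONS = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= items for items in TRANSACTIONS.values())

def support(lhs, rhs):
    """Fraction of all transactions containing both sides of the rule."""
    return count(lhs | rhs) / len(TRANSACTIONS)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return count(lhs | rhs) / count(lhs)

RULES = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in RULES:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs, rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

Milk appears in four transactions and three of them also contain Coke, so {Milk} --> {Coke} has support 3/5 and confidence 3/4; {Diaper, Milk} --> {Beer} has support 2/5 and confidence 2/3.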
Association Rule Discovery: Application 1
Association Rule Discovery: Application 2
Inventory Management:
Goal: A consumer appliance repair company wants to
anticipate the nature of repairs on its consumer
products and keep the service vehicles equipped with
the right parts to reduce the number of visits to consumer
households.
Approach: Process the data on tools and parts
required in previous repairs at different consumer
locations and discover the co-occurrence patterns.
Regression
Deviation/Anomaly Detection
Detect significant deviations from normal behavior
Applications:
Credit Card Fraud Detection
Network Intrusion Detection
Typical network traffic at the university level may exceed 100 million connections per day.
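A very simple form of deviation detection flags observations that lie far from the mean in units of standard deviation (a z-score test). The sketch below is only one of many possible detectors; the daily connection counts, the threshold of 2, and the function name are assumptions for illustration.

```python
from statistics import mean, stdev

def find_deviations(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily connection counts, in millions.
connections = [98, 102, 101, 99, 100, 103, 97, 180]

print("mean:", mean(connections))
print("flagged:", find_deviations(connections))
```

Because an extreme value inflates both the mean and the standard deviation, this test can miss anomalies in small samples; median-based statistics are a common, more robust alternative.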
Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data
Data Compression
[Figure labels: "Original Data", "Approximated"]
Numerosity Reduction:
Reduce the volume of data
Parametric methods
Assume the data fits some model, estimate model parameters,
store only the parameters, and discard the data (except
possible outliers)
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
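Both families can be illustrated briefly. In the sketch below, the parametric example fits a straight line by least squares and keeps only its two parameters, while the non-parametric examples summarise values with an equal-width histogram and a simple random sample. The (x, y) series is hypothetical, the income values are taken from the decision tree table, and the function names, bin count, and seed are assumptions.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept; returns the two parameters."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def equal_width_histogram(values, bins):
    """Return (lowest value, bin width, count per bin)."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    if width == 0:                       # all values identical
        return low, 0.0, [len(values)]
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)  # top edge goes in last bin
        counts[index] += 1
    return low, width, counts

# Parametric: store (slope, intercept) instead of the ten hypothetical points.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
slope, intercept = fit_line(xs, ys)

# Non-parametric: taxable incomes (in thousands) from the decision tree table.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
low, width, counts = equal_width_histogram(incomes, bins=4)
sample = random.Random(0).sample(incomes, 3)  # simple random sample, fixed seed

print("line parameters:", slope, intercept)
print("histogram:", low, width, counts)
print("sample:", sample)
```

The histogram replaces ten values with a start point, a width, and four counts; the lone count in the last bin is the kind of possible outlier that a parametric summary would store separately.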
Clustering
Recommended Reference Books