Comp 6838
LECTURE 1: Introduction
Course Objectives
Understand the basic concepts needed to carry out
data mining and knowledge discovery in
databases.
Implement the most well-known data mining
algorithms on real-world datasets.
Office: M314
Office Hours: Monday 7:30-9:00am, Tuesday
7:30-8:30am, and Thursday 9:30-10:30am.
Extension x3287
E-mail: [email protected],
[email protected]
TA: Roxana Aparicio (M 309, M108)
References
Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data
Mining, Pearson Addison Wesley, 2005.
Jiawei Han, Micheline Kamber, Data Mining: Concepts and
Techniques, 2nd edition, Morgan Kaufmann, 2006.
Ian Witten and Eibe Frank, Data Mining: Practical Machine Learning
Tools and Techniques, 2nd Edition, Morgan Kaufmann, 2005.
Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of
Statistical Learning: Data Mining, Inference, and Prediction, Springer
Verlag, 2001.
Mehmed Kantardzic, Data Mining: Concepts, Models, Methods, and
Algorithms, Wiley-IEEE Press, 2002.
Michael Berry & Gordon Linoff, Mastering Data Mining, John Wiley &
Sons, 2000.
Graham Williams, Data Mining Desktop Survival Guide, on-line book
(PDF).
David J. Hand, Heikki Mannila and Padhraic Smyth, Principles of Data
Mining, MIT Press, 2000.
Software
Free:
R (cran.r-project.org). Statistically oriented.
Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ):
written in Java; a manual is available in Spanish. There is an R
interface to Weka (RWeka).
RapidMiner (YALE) ( http://rapid-i.com ). It has more
features than Weka.
Orange ( http://www.ailab.si/orange ). It requires
Python and other programs.
Software
Commercial:
Microsoft SQL Server 2008: Analysis Services. Includes 9
data mining procedures, 6 of which will be discussed in
this course.
Oracle,
Statistica Miner,
SAS Enterprise Miner,
SPSS Clementine.
XLMiner, an add-in for Excel.
There is also specialized software for performing specific data
mining tasks.
[Screenshots: the RapidMiner and Weka interfaces]
Evaluation
Homeworks (4): 40%
Partial exam: 30%
Project: 30%
Course Content
Introduction to Data Mining: 3 hrs.
Data Preprocessing: 15 hrs.
Visualization: 5 hrs.
Outlier Detection: 5 hrs.
Supervised Classification: 9 hrs.
Clustering: 7 hrs.
Motivation
The mechanisms for automatic collection of data
and the development of database technology have
made it possible for large amounts of data to be
available in databases, data warehouses, and other
repositories of information. Nowadays, there is a
need to convert this data into information and
knowledge.
"Every time the amount of data increases by a factor of
ten, we should totally rethink how we analyze it."
J. H. Friedman (1997). Data Mining and Statistics:
What's the Connection?
Size of datasets
Description     Size in bytes       Mode of storage
very small      10^2                piece of paper
small           10^4                -
medium          10^6 (megabyte)     floppy disk
large           10^9 (gigabyte)     a TV movie
massive         10^12 (terabyte)    a hard disk
super-massive   10^15 (petabyte)    -
[Figure: contrasting dataset shapes: microarray data (~10 MB) with about 100 instances and 100,000 features, versus network intrusion data (~120 MB)]
[Figure: Venn diagram placing Data Mining at the intersection of Machine Learning, Statistics, Databases, and Visualization]
Visualization, Databases
Visualization (~15% of DM)
The dataset is explored in a visual fashion.
It can be used in either the pre- or post-processing step of the
knowledge discovery process.
Relational Databases (~20% of DM)
DM Applications
Science: astronomy, bioinformatics (genomics,
proteomics, metabolomics), drug discovery.
Business: marketing, credit risk, security and fraud
detection.
Government: detection of tax cheaters, anti-terrorism.
Text Mining: discovering distinct groups of potential
buyers according to a user's text-based profile; drawing
information from different written sources (e-mails).
Web Mining: identifying groups of competitors' web
pages; e-commerce (Amazon.com).
Data Mining
[Figure: the knowledge discovery process: Databases → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Visualization]
Star plots
Chernoff faces
Parallel Coordinate
plots
Radviz
Survey plots
Star Coordinates
Unsupervised
DM
Hierarchical Clustering
Partitional Clustering
Self Organizing Maps
Association Rules
Market Basket
Supervised DM
Linear Regression
Logistic Regression
Discriminant
Analysis
Decision Trees
K-nn classifiers
SVM
MLP, RBF
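To make one of the listed techniques concrete, here is a minimal parallel-coordinates sketch; pandas and matplotlib are assumptions here (they are not among the course tools), and the tiny dataset is invented for illustration.

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Invented toy dataset: each row is an object, "group" is its class.
df = pd.DataFrame({
    "age":    [25, 30, 62, 58, 27, 60],
    "weight": [70, 80, 95, 90, 75, 98],
    "height": [175, 180, 170, 168, 178, 172],
    "group":  ["A", "A", "B", "B", "A", "B"],
})

# One polyline per record across the coordinate axes;
# records from the same group trace similar paths.
parallel_coordinates(df, "group")
plt.show()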
Regression
The value of a continuous response variable is
predicted from the values of other
variables (the predictors), assuming that there is a
functional relation among them.
Statistical models, decision trees, and neural
networks can be used.
Example: predicting a dealer's car sales from the
sellers' experience, advertising, the types
of cars, etc.
Regression [2]
Linear regression: Y = b0 + b1*X1 + ... + bp*Xp.
Non-linear regression: Y = g(X1, ..., Xp),
where g is a non-linear function. For
example, g(X1, ..., Xp) = X1*Xp*e^(X1+Xp).
Non-parametric regression: Y = g(X1, ..., Xp),
where g is estimated from the available
data.
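As a concrete illustration of the linear model above, a minimal ordinary-least-squares sketch in Python with NumPy; the data values are invented for the example.

import numpy as np

# Invented predictors (e.g., sellers' experience, advertising spend)
# and a continuous response (e.g., car sales).
X = np.array([[2.0, 10.0],
              [5.0, 25.0],
              [7.0, 30.0],
              [9.0, 45.0],
              [4.0, 20.0]])
y = np.array([30.0, 62.0, 78.0, 100.0, 50.0])

# Prepend a column of ones so the intercept b0 is estimated as well.
X1 = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimize ||X1 @ b - y||^2.
b, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("estimated b0..bp:", b)
print("prediction for a new record:", np.array([1.0, 6.0, 28.0]) @ b)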
Supervised Classification
The response variable is categorical.
Given a set of records, called the training set (each
record contains a set of attributes, of which usually the last
one is the class), a model for the class attribute as a function
of the other attributes is constructed. This model is called
the classifier.
Goal: assign previously unseen records (the test set) to a
class as accurately as possible.
Usually a given data set is divided into a training set and a
test set; the first is used to construct the model
and the second to validate it. The accuracy of
the model is measured on the test set.
It is a decision process.
Training set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set (class unknown):
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: the training set is fed to a learning algorithm to estimate a classifier (model), which is then applied to the test set]
[Figure: example decision tree: the root splits on weight > 90 kg (18800 patients vs. 1200 patients); the 1200 heavier patients are split on gender = male into a leaf of 400 patients (80% diabetic) and a leaf of 800 patients (10% diabetic); the other branches continue similarly]
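The whole train-then-predict cycle on the table above can be sketched as follows; scikit-learn and pandas are assumptions here (the course tools are Weka, R, etc.), so this is illustrative only.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The training set from the table above.
train = pd.DataFrame({
    "Refund":  ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Marital": ["Single", "Married", "Single", "Married", "Divorced",
                "Married", "Divorced", "Single", "Married", "Single"],
    "Income":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Cheat":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
# Previously unseen records (the test set); their class is unknown.
test = pd.DataFrame({
    "Refund":  ["No", "Yes", "No", "Yes", "No", "No"],
    "Marital": ["Single", "Married", "Married", "Divorced", "Single", "Married"],
    "Income":  [75, 50, 150, 90, 40, 80],
})

# One-hot encode the categorical attributes so the tree can split on them.
X_train = pd.get_dummies(train.drop(columns="Cheat"))
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

# Estimate the classifier on the training set, then assign the unseen records.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, train["Cheat"])
print(clf.predict(X_test))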
Unsupervised Classification
(Clustering)
Find groups of objects (clusters) such that the objects
within the same cluster are quite similar to each other,
whereas objects in distinct clusters are not.
A similarity measure is needed to establish whether two
objects belong to the same cluster or to distinct clusters.
Examples of similarity measures: Euclidean distance,
Manhattan distance, correlation, Gower distance, Hamming
distance, etc.
Problems: choice of the similarity measure, choice of the
number of clusters, cluster validation.
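For instance, some of the dissimilarity measures listed above can be computed directly; the two vectors are arbitrary examples.

import numpy as np

a = np.array([1.0, 4.0, 2.0])
b = np.array([3.0, 1.0, 2.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # sqrt(4 + 9 + 0) ~= 3.61
manhattan = np.sum(np.abs(a - b))          # 2 + 3 + 0 = 5
hamming   = np.sum(a != b)                 # positions that disagree: 2

print(euclidean, manhattan, hamming)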
[Figure: scatter plot of objects in a 2-D pattern space, with axes such as age and weight]
In 2- or 3-dimensional pattern space one could simply visualize
the data and leave the recognition to a human end user.
Clustering [2]
Three-dimensional clustering based on Euclidean distance.
The intra-cluster distances are minimized.
Clustering Algorithms
Partitioning algorithms: K-means, PAM,
SOM.
Hierarchical algorithms: agglomerative,
divisive.
Gaussian mixture models.
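As a bare-bones sketch of the partitioning family, plain k-means with Euclidean distance; the fixed iteration count, the absence of empty-cluster handling, and the 2-D data are all simplifying assumptions for illustration.

import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: alternate assignment and center update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Two well-separated groups in 2-D (e.g., age vs. weight).
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
labels, centers = kmeans(X, k=2)
print(labels)
print(centers)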
Outlier Detection
Objects that behave differently from, or are
inconsistent with, the majority of the data are called
outliers.
Outliers arise due to mechanical faults, human error,
instrument error, fraudulent behavior, changes in the
system, etc. They can represent some kind of
fraudulent activity.
The goal of outlier detection is to find the
instances that do not have a normal behavior.
Approaches to outlier detection:
based on statistics,
based on distance,
based on local density.
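A minimal sketch of the first two approaches on a small one-dimensional sample; the data and thresholds are illustrative only.

import numpy as np

x = np.array([10.1, 9.8, 10.4, 10.0, 9.9, 25.0, 10.2])

# Statistics-based: flag values whose z-score exceeds a cutoff
# (a common choice is 2 or 3 standard deviations).
z = (x - x.mean()) / x.std()
print("z-score outliers:", x[np.abs(z) > 2])

# Distance-based: flag points whose nearest neighbor is farther
# away than a threshold.
d = np.abs(x[:, None] - x[None, :])
np.fill_diagonal(d, np.inf)
print("distance outliers:", x[d.min(axis=1) > 5.0])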
Association Rules
[Table: five market-basket transactions (TID 1-5) with their item sets]
Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Association Rules [2]
The rules (X --> Y) must satisfy a minimum support and
confidence set by the user. X is called the antecedent and Y
is called the consequent.
Support = (# records containing X and Y) / (# records)
Confidence = (# records containing X and Y) / (# records
containing X)
Example: the first rule above has support 0.6 and the second has
support 0.4; the confidence of rule 1 is 0.75 and of rule 2 is 0.67.
Applications: marketing and sales promotions.
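The item sets of the original table were not recoverable, so the sketch below uses hypothetical transactions chosen to be consistent with the support and confidence values stated above.

# Hypothetical market-basket transactions (illustrative only).
transactions = [
    {"Bread", "Milk", "Coke"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    # Fraction of transactions containing every item in the set.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Of the transactions containing X, the fraction also containing Y.
    return support(antecedent | consequent) / support(antecedent)

print(support({"Milk", "Coke"}))                 # 0.6, as stated for rule 1
print(confidence({"Milk"}, {"Coke"}))            # 0.75
print(support({"Diaper", "Milk", "Beer"}))       # 0.4, as stated for rule 2
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # 0.67 (2/3)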