Comp 6838
LECTURE 1: Introduction
Course Objectives
Understand the basic concepts needed to carry out
data mining and knowledge discovery in
databases.
Implement the most well-known data mining
algorithms on real-world datasets.
Office: M314
Office Hours: Monday 7:30-9:00am, Tuesday
7:30-8:30am, and Thursday 9:30-10:30am.
Extension x3287
E-mail: [email protected],
[email protected]
TA: Roxana Aparicio (M 309, M108)
References
Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data
Mining, Pearson Addison Wesley, 2005.
Jiawei Han, Micheline Kamber, Data Mining: Concepts and
Techniques, 2nd edition, Morgan Kaufmann, 2006.
Ian Witten and Eibe Frank, Data Mining: Practical Machine Learning
Tools and Techniques, 2nd Edition, Morgan Kaufmann, 2005.
Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of
Statistical Learning: Data Mining, Inference, and Prediction, Springer
Verlag, 2001.
Mehmed Kantardzic, Data Mining: Concepts, Models, Methods, and
Algorithms, Wiley-IEEE Press, 2002.
Michael Berry & Gordon Linoff, Mastering Data Mining, John Wiley &
Sons, 2000.
Graham Williams, Data Mining Desktop Survival Guide, on-line book
(PDF).
David J. Hand, Heikki Mannila and Padhraic Smyth, Principles of Data
Mining, MIT Press, 2000.
Software
Free:
R (cran.r-project.org). Statistically oriented.
Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ):
written in Java; a manual is available in Spanish. There is an R
interface to Weka (RWeka).
RapidMiner (YALE) ( http://rapid-i.com ). It has more
features than Weka.
Orange ( http://www.ailab.si/orange ). It requires
Python and other programs.
Software
Commercial:
Microsoft SQL Server 2008: Analysis Services. Includes 9
data mining procedures, 6 of which will be discussed in
this course.
Oracle,
Statistica Miner,
SAS Enterprise Miner,
SPSS Clementine.
XLMiner, an add-in for Excel.
There is also specialized software for performing specific data
mining tasks.
[Screenshots: the RapidMiner and Weka interfaces]
Evaluation
Homeworks (4): 40%
Partial exam: 30%
Project: 30%
Course Content
Introduction to Data Mining: 3 hrs.
Data Preprocessing: 15 hrs.
Visualization: 5 hrs.
Outlier Detection: 5 hrs.
Supervised Classification: 9 hrs.
Clustering: 7 hrs.
Motivation
The mechanisms for automatic collection of data
and the development of database technology have
made it possible for large amounts of data to be
available in databases, data warehouses, and other
repositories of information. Nowadays, there is a
need to convert this data into information and
knowledge.
"Every time the amount of data increases by a factor of
ten, we should totally rethink how we analyze it."
J. H. Friedman (1997). Data Mining and Statistics:
What's the Connection?
Size of datasets
Description     Size in bytes       Mode of storage
very small      10^2                piece of paper
small           10^4                -
medium          10^6 (megabyte)     floppy disk
large           10^9 (gigabyte)     a TV movie
massive         10^12 (terabyte)    a hard disk
super-massive   10^15 (petabyte)    -
[Figure: contrasting dataset shapes: microarray data (~10 MB) with about 100 instances and 100,000 features, versus network intrusion data (~120 MB)]
[Figure: Venn diagram placing Data Mining at the intersection of Machine Learning, Statistics, Databases, and Visualization]
Visualization, Databases
Visualization (~15% of DM)
The dataset is explored in a visual fashion.
It can be used in either the pre- or post-processing step of the
knowledge discovery process.
Relational Databases (~20% of DM)
DM Applications
Science: astronomy, bioinformatics (genomics,
proteomics, metabolomics), drug discovery.
Business: marketing, credit risk, security and fraud
detection.
Government: detection of tax cheaters, anti-terrorism.
Text Mining: discovering distinct groups of potential
buyers according to a user's text-based profile; drawing
information from different written sources (e-mails).
Web Mining: identifying groups of competitors' web
pages; e-commerce (Amazon.com).
Data Mining
[Figure: the knowledge discovery process: Databases → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Visualization]
Star plots
Chernoff faces
Parallel Coordinate
plots
Radviz
Survey plots
Star Coordinates
Unsupervised
DM
Hierarchical Clustering
Partitional Clustering
Self Organizing Maps
Association Rules
Market Basket
Supervised DM
Linear Regression
Logistic Regression
Discriminant
Analysis
Decision Trees
K-nn classifiers
SVM
MLP, RBF
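To make one of the listed techniques concrete, here is a minimal parallel-coordinates sketch; pandas and matplotlib are assumptions here (they are not among the course tools), and the tiny dataset is invented for illustration.

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Invented toy dataset: each row is an object, "group" is its class.
df = pd.DataFrame({
    "age":    [25, 30, 62, 58, 27, 60],
    "weight": [70, 80, 95, 90, 75, 98],
    "height": [175, 180, 170, 168, 178, 172],
    "group":  ["A", "A", "B", "B", "A", "B"],
})

# One polyline per record across the coordinate axes;
# records from the same group trace similar paths.
parallel_coordinates(df, "group")
plt.show()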
Regression
The value of a continuous response variable is
predicted from the values of other
variables (the predictors), assuming that there is a
functional relation among them.
Statistical models, decision trees, and neural
networks can be used.
Example: predicting a dealer's car sales from the
sellers' experience, advertising, the types
of cars, etc.
Regression [2]
Linear regression: Y = b0 + b1*X1 + ... + bp*Xp.
Non-linear regression: Y = g(X1, ..., Xp),
where g is a non-linear function. For
example, g(X1, ..., Xp) = X1*Xp*e^(X1+Xp).
Non-parametric regression: Y = g(X1, ..., Xp),
where g is estimated from the available
data.
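As a concrete illustration of the linear model above, a minimal ordinary-least-squares sketch in Python with NumPy; the data values are invented for the example.

import numpy as np

# Invented predictors (e.g., sellers' experience, advertising spend)
# and a continuous response (e.g., car sales).
X = np.array([[2.0, 10.0],
              [5.0, 25.0],
              [7.0, 30.0],
              [9.0, 45.0],
              [4.0, 20.0]])
y = np.array([30.0, 62.0, 78.0, 100.0, 50.0])

# Prepend a column of ones so the intercept b0 is estimated as well.
X1 = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimize ||X1 @ b - y||^2.
b, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("estimated b0..bp:", b)
print("prediction for a new record:", np.array([1.0, 6.0, 28.0]) @ b)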
Supervised Classification
The response variable is categorical.
Given a set of records, called the training set (each
record contains a set of attributes, of which usually the last
one is the class), a model for the class attribute as a function
of the other attributes is constructed. This model is called
the classifier.
Goal: assign previously unseen records (the test set) to a
class as accurately as possible.
Usually a given data set is divided into a training set and a
test set; the first is used to construct the model
and the second to validate it. The accuracy of
the model is measured on the test set.
It is a decision process.
Training set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set (class unknown):
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: the training set is fed to a learning algorithm to estimate a classifier (model), which is then applied to the test set]
[Figure: example decision tree: the root splits on weight > 90 kg (18800 patients vs. 1200 patients); the 1200 heavier patients are split on gender = male into a leaf of 400 patients (80% diabetic) and a leaf of 800 patients (10% diabetic); the other branches continue similarly]
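The whole train-then-predict cycle on the table above can be sketched as follows; scikit-learn and pandas are assumptions here (the course tools are Weka, R, etc.), so this is illustrative only.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The training set from the table above.
train = pd.DataFrame({
    "Refund":  ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Marital": ["Single", "Married", "Single", "Married", "Divorced",
                "Married", "Divorced", "Single", "Married", "Single"],
    "Income":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Cheat":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
# Previously unseen records (the test set); their class is unknown.
test = pd.DataFrame({
    "Refund":  ["No", "Yes", "No", "Yes", "No", "No"],
    "Marital": ["Single", "Married", "Married", "Divorced", "Single", "Married"],
    "Income":  [75, 50, 150, 90, 40, 80],
})

# One-hot encode the categorical attributes so the tree can split on them.
X_train = pd.get_dummies(train.drop(columns="Cheat"))
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

# Estimate the classifier on the training set, then assign the unseen records.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, train["Cheat"])
print(clf.predict(X_test))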
Unsupervised Classification
(Clustering)
Find groups of objects (clusters) such that the objects
within the same cluster are quite similar to each other,
whereas objects in distinct clusters are not.
A similarity measure is needed to establish whether two
objects belong to the same cluster or to distinct clusters.
Examples of similarity measures: Euclidean distance,
Manhattan distance, correlation, Gower distance, Hamming
distance, etc.
Problems: choice of the similarity measure, choice of the
number of clusters, cluster validation.
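For instance, some of the dissimilarity measures listed above can be computed directly; the two vectors are arbitrary examples.

import numpy as np

a = np.array([1.0, 4.0, 2.0])
b = np.array([3.0, 1.0, 2.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # sqrt(4 + 9 + 0) ~= 3.61
manhattan = np.sum(np.abs(a - b))          # 2 + 3 + 0 = 5
hamming   = np.sum(a != b)                 # positions that disagree: 2

print(euclidean, manhattan, hamming)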
[Figure: scatter plot of objects in a 2-D pattern space, with axes such as age and weight]
In 2- or 3-dimensional pattern space one could simply visualize
the data and leave the recognition to a human end user.
Clustering [2]
Three-dimensional clustering based on Euclidean distance.
The intra-cluster distances are minimized.
Clustering Algorithms
Partitioning algorithms: K-means, PAM,
SOM.
Hierarchical algorithms: agglomerative,
divisive.
Gaussian mixture models.
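As a bare-bones sketch of the partitioning family, plain k-means with Euclidean distance; the fixed iteration count, the absence of empty-cluster handling, and the 2-D data are all simplifying assumptions for illustration.

import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: alternate assignment and center update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Two well-separated groups in 2-D (e.g., age vs. weight).
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
labels, centers = kmeans(X, k=2)
print(labels)
print(centers)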
Outlier Detection
Objects that behave differently from, or are
inconsistent with, the majority of the data are called
outliers.
Outliers arise due to mechanical faults, human error,
instrument error, fraudulent behavior, changes in the
system, etc. They can represent some kind of
fraudulent activity.
The goal of outlier detection is to find the
instances that do not have a normal behavior.
Approaches to outlier detection:
based on statistics,
based on distance,
based on local density.
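A minimal sketch of the first two approaches on a small one-dimensional sample; the data and thresholds are illustrative only.

import numpy as np

x = np.array([10.1, 9.8, 10.4, 10.0, 9.9, 25.0, 10.2])

# Statistics-based: flag values whose z-score exceeds a cutoff
# (a common choice is 2 or 3 standard deviations).
z = (x - x.mean()) / x.std()
print("z-score outliers:", x[np.abs(z) > 2])

# Distance-based: flag points whose nearest neighbor is farther
# away than a threshold.
d = np.abs(x[:, None] - x[None, :])
np.fill_diagonal(d, np.inf)
print("distance outliers:", x[d.min(axis=1) > 5.0])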
Association Rules
[Table: five market-basket transactions (TID 1-5) with their item sets]
Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Association Rules [2]
The rules (X --> Y) must satisfy a minimum support and
confidence set by the user. X is called the antecedent and Y
is called the consequent.
Support = (# records containing X and Y) / (# records)
Confidence = (# records containing X and Y) / (# records
containing X)
Example: the first rule above has support 0.6 and the second has
support 0.4; the confidence of rule 1 is 0.75 and of rule 2 is 0.67.
Applications: marketing and sales promotions.
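The item sets of the original table were not recoverable, so the sketch below uses hypothetical transactions chosen to be consistent with the support and confidence values stated above.

# Hypothetical market-basket transactions (illustrative only).
transactions = [
    {"Bread", "Milk", "Coke"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    # Fraction of transactions containing every item in the set.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Of the transactions containing X, the fraction also containing Y.
    return support(antecedent | consequent) / support(antecedent)

print(support({"Milk", "Coke"}))                 # 0.6, as stated for rule 1
print(confidence({"Milk"}, {"Coke"}))            # 0.75
print(support({"Diaper", "Milk", "Beer"}))       # 0.4, as stated for rule 2
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # 0.67 (2/3)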