Lecture 1
Week 1: Introduction
Spring 24
Course Instructor
• Dr. Amr El-Wakeel, Assistant Professor, Lane Department of CSEE
• Research Interests: connected and autonomous vehicles (CAVs), intelligent transportation systems (ITS), Internet of Things, and healthcare informatics
e-mail: [email protected]
Office: AERB 253,
Office hours: Mondays 1-2 pm
Or by e-mail appointment
Class Overview
• Textbook
• Homeworks, exams, grading
• Course topics
Textbook
• Main Book:
– Pattern Classification by Duda, Hart, and Stork, Second Edition, ISBN: 978-0471056690
Grading
• Homework: 15%
• Course project (reports, code, and presentations): 50%
Terminology
• Pattern Recognition: “the act of taking raw data and taking an action based on the category of the pattern.”
• Common Applications: speech recognition, fingerprint identification (biometrics), anomaly detection, DNA sequence identification
• Related Terminology:
▪ Data mining: the process of finding anomalies, patterns, and correlations within large data sets to predict outcomes.
Biometric Recognition
Fingerprint Classification
Face Detection
Autonomous Systems
Medical Applications
Land Cover Classification
(from aerial or satellite images)
Knowledge Discovery Process
[Figure: KDD pipeline — Databases → Data Integration and Cleaning → Data Selection/Sampling → Preprocessing and Feature Extraction → Data Transformation → Data Mining (Machine Learning / Pattern Recognition) → Knowledge Interpretation]
Sampling Data
▪ “Big” data arises in many forms:
▪ Physical measurements: from science (physics, astronomy)
▪ Medical data: biometric sequences, detailed time series
▪ Activity data: GPS location, body-sensor activities
▪ Business data: customer behavior tracked at fine detail
▪ Common themes:
▪ Data is large, and growing
▪ There are important patterns and trends in the data
▪ We don’t fully know where to look or how to find them
Reducing the Data
▪ Although “big” data is about more than just the volume… most big data is big!
▪ It is not always possible to store the data in full
• Many applications (telecoms, ISPs, search engines, sensor data) can’t keep everything
▪ It is inconvenient to work with data in full
• Just because we can doesn’t mean we should (human behavior)
▪ It is faster to work with a compact summary
• Better to explore data on a laptop than on a cluster
Sampling the Data
▪ Sampling has intuitive semantics
• We obtain a smaller data set with the same structure
▪ Estimating on a sample is often straightforward
• Run the same analysis on the sample that you would on the full data
• Some rescaling/reweighting may be necessary
▪ Sampling is general and agnostic to the analysis to be done
• Though sampling can be tuned to optimize some criteria
▪ Sampling is (usually) easy to understand
• So prevalent that we have an intuition about sampling
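The rescaling point above can be sketched in a short example: estimate the sum over a full data set from a small uniform sample, rescaled by the sampling fraction. The data and sample size are made up for illustration.

```python
import random

# Hypothetical "full data": per-record values from a large log.
random.seed(0)
full_data = [random.randint(1, 100) for _ in range(100_000)]

def estimate_total(data, sample_size):
    """Estimate sum(data) from a uniform random sample.

    Each sampled value is implicitly reweighted by N / n, so the
    rescaled sample sum is an unbiased estimate of the full sum.
    """
    sample = random.sample(data, sample_size)
    scale = len(data) / sample_size
    return scale * sum(sample)

true_total = sum(full_data)
approx_total = estimate_total(full_data, 1_000)
# A 1% sample's rescaled sum is typically within a few percent of the truth.
print(true_total, approx_total)
```

The same pattern applies to means and proportions; only the rescaling factor changes with the statistic being estimated.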
Sampling as a Mediator of Constraints
[Figure: diagram relating sampling to data characteristics (correlations)]
[Figure: nested sets — the sample is drawn from the study population, which is itself a subset of the target population]
Population Definition
▪ A population can be defined as including all people or items with the characteristic one wishes to understand.
▪ Because there is very rarely enough time or money to gather information from everyone or everything in a population, the goal becomes finding a representative sample (or subset) of that population.
Data Everywhere!
Nature of Data
A typical data set is a table of records (rows) by variables (columns):

ID          Target Var.   Var. 1   Var. 2   ...   Var. 100
1           0             63       ...
2           1             54       ...
3           0             44       ...
...         ...           ...
1,500,000   1             32       ...
Types of Problems
Learning/Modeling/Decision making
Learning Task Examples
• Classification maps data into predefined groups or classes
– Supervised learning
– Pattern recognition
– Prediction
• Regression maps a data item to a real-valued prediction variable
• Clustering groups similar data together into clusters
– Unsupervised learning
– Segmentation
– Anomaly detection
• Dimensionality Reduction transforms data from a high-dimensional space into a low-dimensional space while retaining meaningful properties of the original data
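As a toy illustration of the first task (classification), here is a minimal nearest-centroid classifier; the 2-D features and labels are invented for the example, and the centroid rule is just one simple way to learn a classifier from labeled data.

```python
# Supervised classification sketch: learn one centroid per class from
# labeled examples, then assign new points to the nearest centroid.

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def train(examples):
    """examples: list of (features, label). Returns {label: centroid}."""
    by_label = {}
    for x, y in examples:
        by_label.setdefault(y, []).append(x)
    return {label: centroid(pts) for label, pts in by_label.items()}

def predict(model, x):
    """Assign x to the class with the nearest centroid (squared distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda label: dist2(model[label]))

training_set = [((1.0, 1.2), "cat"), ((0.8, 1.0), "cat"),
                ((3.0, 3.1), "dog"), ((3.2, 2.9), "dog")]
model = train(training_set)
print(predict(model, (0.9, 1.1)))  # lands near the "cat" centroid
```

Clustering differs from this only in that the labels are not given: the groups themselves must be discovered from the data.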
Prediction Problem
Classification: Definition
Classification—A Two-Step Process
[Figure: a model is learned from a training set of labeled records (Tid, Attrib1, Attrib2, Attrib3, Class — e.g., 3: No, Small, 70K, No; 6: No, Medium, 60K, No), then applied to a test set of unlabeled records (e.g., 11: No, Small, 55K, ?; 15: No, Large, 67K, ?)]
An Example Data Set and Decision Tree
[Figure: a small data set and the decision tree induced from it — internal nodes test attribute values (e.g., size: small/med/big; sailboat: yes/no) and leaves assign the class yes or no]
Examples of Classification Tasks
Issues regarding classification
Issues (1): Data Preparation
• Data cleaning
– Preprocess data in order to reduce noise and handle missing values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Generalize and/or normalize data
Issues regarding classification
Issues (2): Evaluating Classification Methods
• Predictive accuracy
• Speed and scalability
– time to construct the model
– time to use the model
• Robustness
– handling noise and missing values
• Generalization
– ability of the model to perform well on unseen data
• Interpretability:
– understanding and insight provided by the model
• Goodness of rules
– decision tree size
– compactness of classification rules
Error Analysis for a Two-Class Problem
[Figure: 2×2 confusion matrix — actual vs. predicted Negative/Positive, giving true/false positive and true/false negative counts]
Evaluation Criteria
Accuracy = (TP + TN) / (P + N)
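These criteria follow directly from the confusion-matrix counts. A minimal sketch, with illustrative counts rather than any from the lecture:

```python
# Evaluation criteria computed from a 2x2 confusion matrix:
# TP/FP/TN/FN counts below are made up for illustration.

def metrics(tp, fp, tn, fn):
    p = tp + fn          # actual positives
    n = tn + fp          # actual negatives
    accuracy = (tp + tn) / (p + n)
    precision = tp / (tp + fp)
    recall = tp / p      # a.k.a. true-positive rate
    return accuracy, precision, recall

acc, prec, rec = metrics(tp=40, fp=10, tn=45, fn=5)
print(acc, prec, rec)  # 0.85 0.8 0.888...
```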
Multi-class classification
▪ Multi-class vs. binary classification
– one vs. all (one vs. many, one vs. rest)
– N classes ➔ train N classifiers
– each classifier uses one class for the positive examples and the remaining classes for the negative examples
▪ Combine the results: select the classifier with the highest confidence score
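The one-vs-rest scheme can be sketched as follows; here each per-class "classifier" is stood in for by a simple centroid-distance confidence score, and the data are invented for the example.

```python
# One-vs-rest sketch: N classes -> N scorers; predict the class whose
# scorer reports the highest confidence. Scoring rule is illustrative.

def make_scorer(positives):
    """Build a confidence score for one class from its positive examples."""
    cx = sum(p[0] for p in positives) / len(positives)
    cy = sum(p[1] for p in positives) / len(positives)
    def score(x):
        # Higher confidence when x is closer to this class's centroid.
        return -((x[0] - cx) ** 2 + (x[1] - cy) ** 2)
    return score

data = {"A": [(0, 0), (0, 1)], "B": [(5, 5), (5, 6)], "C": [(9, 0), (10, 0)]}
scorers = {label: make_scorer(pts) for label, pts in data.items()}

def predict(x):
    # Combine the results: pick the classifier with the highest confidence.
    return max(scorers, key=lambda label: scorers[label](x))

print(predict((1, 0)))   # A
print(predict((9, 1)))   # C
```

In practice each scorer would be a real binary classifier (e.g., logistic regression) trained on one class against the rest; only the combination rule matters here.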
Bias, variance, generalization error
▪ A model underfits the training data if it does not capture all of the structure available from the data. (b)
▪ A model overfits if it captures too many of the idiosyncrasies of the training data. (d)
▪ What does it mean to overfit or underfit?
▪ Assume we are doing regression.
▪ Suppose we have a training set
$S_{\text{train}} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$
from some distribution $D$.
▪ Define the average training error of a hypothesis $h$:
$\hat{\varepsilon}_{S_{\text{train}}}(h) = \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2$
▪ We are interested in the generalization error:
$\varepsilon(h) = \mathbb{E}_{(x,y) \sim D}\left[ (h(x) - y)^2 \right]$
▪ Both underfitting and overfitting lead to high generalization error (previous figure).
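A quick numerical illustration of the gap between training and generalization error, using a 1-nearest-neighbour regressor that memorizes its training set; the toy data are made up, drawn from a noisy linear function.

```python
import random

# Noisy samples of y = 2x + 1 over [0, 4] (illustrative toy data).
random.seed(1)
train = [(x, 2 * x + 1 + random.gauss(0, 1))
         for x in [random.uniform(0, 4) for _ in range(20)]]
test = [(x, 2 * x + 1 + random.gauss(0, 1))
        for x in [random.uniform(0, 4) for _ in range(500)]]

def mse(h, data):
    return sum((h(x) - y) ** 2 for x, y in data) / len(data)

# h1: the true linear trend (generalizes well).
h1 = lambda x: 2 * x + 1

# h2: 1-nearest-neighbour, which memorizes the training set (overfits).
def h2(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

print(mse(h2, train))                # 0.0: memorization zeroes training error
print(mse(h1, test), mse(h2, test))  # but h2's test error exceeds h1's
```

Memorizing the training set drives training error to zero while generalization error (estimated on held-out data) gets worse, which is overfitting in miniature.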
(a) Linear regression fits of a linear function to 3 different training sets randomly selected over the interval [0, 4] ➔ low variance.
(b) Linear model after parameters are averaged over 50,000 trials ➔ high bias (underestimates in the mid-range, overestimates near the ends).
(c) Linear regression fits of a fourth-order polynomial to 3 random training sets ➔ high variance.
(d) Model after averaging over 50,000 trials ➔ low bias.
▪ We can’t directly find out the generalization error.
▪ Instead, we estimate the generalization error using the test error:
$\hat{\varepsilon}_{S_{\text{test}}}(h) = \frac{1}{m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}} \left( h(x_{\text{test}}^{(i)}) - y_{\text{test}}^{(i)} \right)^2$
▪ For classification, define training error as the proportion of training examples that are misclassified:
$\hat{\varepsilon}_{S_{\text{train}}}(h) = \frac{1}{m} \sum_{i=1}^{m} I\{ h(x_{\text{train}}^{(i)}) \neq y_{\text{train}}^{(i)} \}$
where $I\{\cdot\}$ is an indicator function such that $I\{\text{true}\} = 1$ and $I\{\text{false}\} = 0$.
▪ Generalization error is defined as the probability of a new example being misclassified:
$\varepsilon(h) = P_{(x,y) \sim D}\left( h(x) \neq y \right)$
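The 0-1 training error above can be computed directly; the threshold classifier and labeled examples below are made up for illustration.

```python
# Training error for classification: the fraction of training examples
# where the indicator of h(x) != y fires (toy 1-D data, illustrative).

def training_error(h, examples):
    m = len(examples)
    return sum(1 for x, y in examples if h(x) != y) / m

# A simple threshold classifier on 1-D inputs.
h = lambda x: 1 if x >= 0.5 else 0
examples = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1), (0.4, 1)]
print(training_error(h, examples))  # 1 of 5 misclassified -> 0.2
```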
Bias and variance in practice
Regression: Least Squares Fitting
[Figure: scatter of points $(x_1, y_1), \ldots, (x_7, y_7)$ with a fitted line $y = ax + b$]
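A least-squares fit of $y = ax + b$ can be computed in closed form from the normal equations; the points below are illustrative, chosen to lie exactly on a line so the fit recovers the true coefficients.

```python
# Least-squares fit of y = a*x + b via the closed-form normal equations.

def least_squares(points):
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    # Slope and intercept minimizing the sum of squared residuals.
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Points generated exactly on y = 3x + 2, so the fit recovers a=3, b=2.
pts = [(x, 3 * x + 2) for x in range(7)]
a, b = least_squares(pts)
print(a, b)  # 3.0 2.0
```

With noisy data the same formulas give the line minimizing the sum of squared vertical distances to the points.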
Function Approximation
[Figure: position vs. time for free fall, $p = -\tfrac{1}{2} g t^2$]
Clustering
[Figure: clusters of people in a 3-D space of income, education, and age]
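A minimal k-means sketch for this kind of clustering, restricted to 1-D "income" values to stay short; the data, cluster count, and initial centers are all illustrative.

```python
# Unsupervised clustering sketch: k-means on 1-D values.

def kmeans_1d(values, centers, iters=20):
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # Update step: move each center to its cluster's mean.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

incomes = [21, 23, 25, 58, 60, 62, 95, 99]
centers, clusters = kmeans_1d(incomes, centers=[20.0, 50.0, 90.0])
print(centers)  # converges to [23.0, 60.0, 97.0]
```

The same assignment/update loop extends to higher dimensions by swapping the absolute difference for Euclidean distance and the mean for a component-wise mean.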
Recognition or Understanding?
[Figure: scatter plot of weight (kg, roughly 40-110) vs. height (cm, roughly 120-220), showing two overlapping groups, males and females]
The Design Cycle
[Figure: flowchart of the pattern recognition design cycle, from start to end]
Common Mistakes