
CS 677 Pattern Recognition

Week 1: Introduction

Dr. Amr El-Wakeel


Lane Department of Computer Science and Electrical Engineering

Spring 2024
Course Instructor
• Dr. Amr El-Wakeel, Assistant Professor, Lane Department of CSEE
• Research Interests: connected and autonomous vehicles (CAVs), intelligent transportation systems (ITS), the Internet of Things, and healthcare informatics
E-mail: [email protected]
Office: AERB 253
Office hours: Mondays 1-2 pm, or by e-mail appointment

Acknowledgment: Dr. Omid Dehzangi, for course design and material

Class Overview

• Textbook
• Homework, exams, grading
• Course topics
Textbook

• Main Book:
– Pattern Classification by Duda, Hart, and Stork, Second Edition, ISBN: 978-0471056690

• Suggested material that will help you:
– C. M. Bishop, "Pattern Recognition and Machine Learning", 2006
– Angel R. Martinez and Wendy L. Martinez, Computational Statistics Handbook with MATLAB (3rd Edition or later)
Class Overview

Homework: 15%
Course project (reports, code, and presentations): 50%
Exams (dates will be discussed in class): 35%
– Exam 1: 15%
– Exam 2: 20%
Course Topics
• Course description: The course will introduce graduate students to several topics
in machine learning and pattern recognition, including the following suggested
topics:
-Introduction to pattern recognition
-Bayesian decision theory
-Nearest-neighbor
-Linear discriminant functions
-Linear regression
-Logistic regression
-Gradient descent
-Support vector machines
-Clustering
-Feature Extraction and Reduction
-Neural Networks
-Selected topics (if time permits)

Terminology

• Pattern Recognition: “the act of taking raw data and taking an action based on the category of the pattern.”
• Common Applications: speech recognition, fingerprint identification (biometrics), anomaly detection, DNA sequence identification
• Related Terminology:
▪ Data mining: the process of finding anomalies, patterns and correlations within large
data sets to predict outcomes.

▪ Machine Learning: the ability of a machine to improve its performance based on previous results.
▪ Machine Understanding: acting on the intentions of the user
generating the data.
• Related Fields: artificial intelligence, signal processing and discipline-specific research
(e.g., target recognition, speech recognition, natural language processing).
License Plate Recognition

Biometric Recognition

Fingerprint Classification

Face Detection

Autonomous Systems

Medical Applications

Skin Cancer Detection; Breast Cancer Detection

Land Cover Classification
(from aerial or satellite images)
Knowledge Discovery Process

Databases → Data Integration → Data Cleaning → Preprocessed Data → Selection / Feature Extraction → Data Transformation → Data Mining / Machine Learning / Pattern Recognition → Knowledge Interpretation
Sampling Data
▪ “Big” data arises in many forms:
▪ Physical Measurements: from science (physics, astronomy)
▪ Medical data: biometric sequences, detailed time series
▪ Activity data: GPS location, body sensor activities
▪ Business data: customer behavior tracking at fine detail
▪ Common themes:
▪ Data is large, and growing
▪ There are important patterns
and trends in the data
▪ We don’t fully know where to look
or how to find them
Reducing the Data
▪ Although “big” data is about more than just the volume…
…most big data is big!
▪ It is not always possible to store the data in full
• Many applications (telecoms, ISPs, search engines, Sensor data)
can’t keep everything
▪ It is inconvenient to work with data in full
• Just because we can, doesn’t mean we should (Human
behavior)
▪ It is faster to work with a compact summary
• Better to explore data on a laptop
than a cluster
Sampling the Data
▪ Sampling has an intuitive semantics
• We obtain a smaller data set with the same structure
▪ Estimating on a sample is often straightforward
• Run the analysis on the sample that you would on the full
data
• Some rescaling/reweighting may be necessary
▪ Sampling is general and agnostic to the analysis to be
done
• Though sampling can be tuned to optimize some criteria
▪ Sampling is (usually) easy to understand
• So prevalent that we have an intuition about sampling
Sampling as a Mediator of Constraints

Sampling mediates between:
▪ Data Characteristics (Correlations)
▪ Resource Constraints (Bandwidth, Storage, CPU, GPU)
▪ Query Requirements (Ad Hoc, Accuracy, Aggregates, Speed)
Sampling

▪ What is your population of interest?
• To whom do you want to generalize your results?
– All doctors
– School children
– Nationality
– Women aged 15-45 years
– Other
▪ Can you sample the entire population?
[Figure: nested populations — the sample is drawn from the study population, which is drawn from the target population]
Population definition
▪ A population can be defined as including all people
or items with the characteristic one wishes to
understand.
▪ Because there is very rarely enough time or money
to gather information from everyone or everything
in a population, the goal becomes finding a
representative sample (or subset) of that
population.

Data Everywhere!

▪ Lots of data is being collected and warehoused


– Web data, e-commerce
– Sensor data
– Purchases at department/
grocery stores
– Bank/Credit Card
transactions
– Social Network
Type of Data

▪ Relational Data (Structured data:


Tables/Transaction/Legacy Data)
▪ Matrix (Biometrics)
▪ Text Data (Unstructured data: Web)
▪ XML (Semi-structured Data)
▪ Graph Data
• Social Network, Semantic Web (RDF), …
▪ Streaming Data
• You can only scan the data once
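Because streaming data can only be scanned once, sampling it requires a one-pass technique. A classic choice is reservoir sampling (Algorithm R) — not covered in these slides, but shown here as a minimal sketch:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream
    that is scanned exactly once."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = random.randrange(i + 1)     # uniform index in [0, i]
            if j < k:
                reservoir[j] = item         # replace with prob. k/(i+1)
    return reservoir

print(reservoir_sample(range(1000), 10))
```

Each item in the stream ends up in the final reservoir with equal probability k/n, without ever storing the full stream.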
What to do with these data?

▪ Aggregation and Statistics


– Data warehouse
▪ Indexing, Searching, and Querying
– Keyword based search
– Pattern matching (e.g. XML)
▪ Knowledge discovery
– Data Analytics
– Statistical Modeling
Random Sample and Statistics

▪ Population: the set or universe of all entities under study.
▪ However, looking at the entire population may not be feasible, or may be too expensive.
▪ Instead, we draw a random sample from the population and compute appropriate statistics from the sample that give estimates of the corresponding population parameters of interest.
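The idea above can be sketched in a few lines of Python (a toy illustration; the synthetic population and its parameters are made up):

```python
import random

random.seed(0)
# A made-up "population": one million values from a Gaussian
population = [random.gauss(100, 15) for _ in range(1_000_000)]

# Draw a random sample and compute a statistic from it
sample = random.sample(population, 1000)          # without replacement
sample_mean = sum(sample) / len(sample)           # sample statistic
population_mean = sum(population) / len(population)  # true parameter

print(sample_mean, population_mean)
```

The sample mean estimates the population mean at a tiny fraction of the cost of scanning the whole population.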
Ex: Time Series Analysis
• Example: Stock Market
• Predict future values
• Determine similar patterns over time
• Classify behavior

Nature of Data

Many observations on many variables. Example data file:

OBS No.   | Target Var. | Var. 1 | Var. 2 | ... | Var. 100
1         | 0           | 63     | .      | ... | .
2         | 1           | 54     | .      | ... | .
3         | 0           | 44     | .      | ... | .
...       | ...         | ...    | ...    | ... | ...
1,500,000 | 1           | 32     | .      | ... | .
Types of Problems

▪ Customer and Student Retention


▪ Detection of patient’s symptoms
▪ Credit Scoring (Auto or Home Loans)
▪ Bond Ratings
▪ Detection of Fraudulent Insurance Claims
▪ Is a Newly Introduced Product Meeting with
Consumer Acceptance or Rejection?
▪ Who is a likely Donor to your Charity?
▪ Early Detection of a Stolen or Compromised
Credit Card
Data Rich, Information Poor
▪ The Amount of Raw Data Stored in Corporate Databases is
Exploding
▪ Most of this information is recorded instantaneously and with
minimal cost
▪ Databases are measured in gigabytes and terabytes (one terabyte = one trillion bytes, equivalent to about 2 million books!)
▪ Walmart uploads 20 million point-of-sale transactions to 500 parallel processing storage devices each day.
▪ Raw data by itself, however, does not provide much information. That is where data analytics comes in!

Learning/Modeling/Decision making
Learning Task Examples
• Classification maps data into predefined groups or
classes
– Supervised learning
– Pattern recognition
– Prediction
• Regression is used to map a data item to a real valued
prediction variable
• Clustering groups similar data together into clusters
– Unsupervised learning
– Segmentation
– Anomaly detection
• Dimensionality Reduction transforms data from a high-dimensional space into a low-dimensional space, retaining meaningful properties of the original data
Prediction Problem

▪ Early detection of a stolen or compromised credit card

We are not so interested in how or why the credit card was stolen, but instead in whether recent transactions are indicative of a stolen or compromised credit card.
Prediction:
Classification vs. Regression
▪ Classification:
– predicts categorical class labels
– classifies data (constructs a model) based on the training set
and the values (class labels) in a classifying attribute and
uses it in classifying new data
▪ Regression:
– models continuous-valued functions, i.e., predicts unknown
or missing values
▪ Typical Applications
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis

Classification: Definition

▪ Given a collection of records (training set )


– Each record contains a set of attributes, one of the attributes is the
class.
▪ Find a model for class attribute as a function of the
values of other attributes.
▪ Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually,
the given data set is divided into training and test sets, with training
set used to build the model and test set used to validate it.

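A minimal sketch of the train/test methodology (not from the slides; the records are made up, and a 1-nearest-neighbor rule — covered later in the course — stands in for the model):

```python
def nn_predict(train, x):
    """Return the class label of the closest training record
    (squared Euclidean distance)."""
    nearest = min(train,
                  key=lambda r: sum((a - b) ** 2 for a, b in zip(r[0], x)))
    return nearest[1]

# (features, class) records, already divided into training and test sets
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.8), "B")]
test  = [((0.9, 1.1), "A"), ((5.1, 5.1), "B")]

# The test set, unseen during "training", measures accuracy
correct = sum(nn_predict(train, x) == label for x, label in test)
accuracy = correct / len(test)
print(accuracy)
```

The model is built from the training set only; the held-out test set is what tells us how well previously unseen records are classified.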
Classification—A Two-Step Process

▪ Model construction: describing a set of predetermined classes


– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction: training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
▪ Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set, otherwise over-fitting will
occur
Illustrating Classification Task

Training Set → Learning algorithm (Induction) → Learn Model → Model
Test Set → Apply Model (Deduction)

Training Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1   | Yes     | Large   | 125K    | No
2   | No      | Medium  | 100K    | No
3   | No      | Small   | 70K     | No
4   | Yes     | Medium  | 120K    | No
5   | No      | Large   | 95K     | Yes
6   | No      | Medium  | 60K     | No
7   | Yes     | Large   | 220K    | No
8   | No      | Small   | 85K     | Yes
9   | No      | Medium  | 75K     | No
10  | No      | Small   | 90K     | Yes

Test Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11  | No      | Small   | 55K     | ?
12  | Yes     | Medium  | 80K     | ?
13  | Yes     | Large   | 110K    | ?
14  | No      | Small   | 95K     | ?
15  | No      | Large   | 67K     | ?
An Example Data Set and Decision Tree

Data (attributes: outlook, company, sailboat; class: yes/no; rows 1-2 not shown):
3  | sunny | med | big   | yes
4  | sunny | no  | small | yes
5  | sunny | big | big   | yes
6  | rainy | no  | small | no
7  | rainy | med | small | yes
8  | rainy | big | big   | yes
9  | rainy | no  | big   | no
10 | rainy | med | big   | no

Decision tree:
outlook?
├─ sunny → yes
└─ rainy → company?
   ├─ no  → no
   ├─ big → yes
   └─ med → sailboat?
      ├─ small → yes
      └─ big   → no
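A decision tree like this one is just a cascade of attribute tests, which can be written directly as nested conditionals (a sketch; the function and attribute names are ours):

```python
def classify(outlook, company, sailboat):
    """Decision tree from the example data set as nested conditionals."""
    if outlook == "sunny":
        return "yes"
    # outlook == "rainy": test company next
    if company == "no":
        return "no"
    if company == "big":
        return "yes"
    # company == "med": test sailboat last
    return "yes" if sailboat == "small" else "no"

# Check the tree against rows 3-10 of the data set
rows = [
    ("sunny", "med", "big", "yes"), ("sunny", "no", "small", "yes"),
    ("sunny", "big", "big", "yes"), ("rainy", "no", "small", "no"),
    ("rainy", "med", "small", "yes"), ("rainy", "big", "big", "yes"),
    ("rainy", "no", "big", "no"), ("rainy", "med", "big", "no"),
]
print(all(classify(o, c, s) == label for o, c, s, label in rows))
```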
Examples of Classification Task

• Predicting tumor cells as benign or malignant

• Classifying credit card transactions


as legitimate or fraudulent

• Classifying whether a driver is stressed or not

• Categorizing news stories as finance,


weather, entertainment, sports, etc

Issues regarding classification
Issues (1): Data Preparation
• Data cleaning
– Preprocess data in order to reduce noise and handle missing values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Generalize and/or normalize data

Issues regarding classification
Issues (2): Evaluating Classification Methods
• Predictive accuracy
• Speed and scalability
– time to construct the model
– time to use the model
• Robustness
– handling noise and missing values
• Generalizability
– the concept of generalization
• Interpretability:
– understanding and insight provided by the model
• Goodness of rules
– decision tree size
– compactness of classification rules

Error Analysis for a Two-Class Problem

Confusion matrix:

                  Predicted Negative       Predicted Positive
Actual Negative   1: True Negative (TN)    4: False Positive (FP)
Actual Positive   2: False Negative (FN)   3: True Positive (TP)
Evaluation Criteria

Accuracy = (TP + TN) / (P + N)
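A minimal sketch of computing the confusion-matrix counts and accuracy from a set of predictions (the label vectors below are made up):

```python
def confusion_counts(actual, predicted):
    """Count TN, FP, FN, TP for a two-class problem with labels 0/1."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 1 and p == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp

actual    = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
predicted = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]

tn, fp, fn, tp = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + tn + fp + fn)   # (TP + TN) / (P + N)
print(tn, fp, fn, tp, accuracy)
```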
Multi-class classification
▪ Multi-class vs. binary
classification
– one vs. all (one vs. many, one
vs. rest)
– N classes ➔ train N
classifiers
– each classifier uses one class
for the positive examples and
the rest classes for the
negative examples
▪ Combine the results: select the
classifier with the highest
confidence score

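The one-vs-rest scheme can be sketched with one tiny perceptron per class, combined by highest score (an illustration on made-up toy data; the perceptron is just a convenient stand-in for any binary classifier that produces a confidence score):

```python
def train_perceptron(X, y, epochs=200):
    """Train a perceptron on labels in {+1, -1}; bias is the last weight."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            s = sum(wj * xj for wj, xj in zip(w, list(xi) + [1.0]))
            if yi * s <= 0:                       # misclassified: update
                for j, xj in enumerate(list(xi) + [1.0]):
                    w[j] += yi * xj
    return w

def score(w, x):
    return sum(wj * xj for wj, xj in zip(w, list(x) + [1.0]))

# Three well-separated clusters, class labels 0/1/2
X = [(0, 0), (1, 0), (0, 1), (10, 0), (11, 0), (10, 1),
     (0, 10), (1, 10), (0, 11)]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# N classes -> N classifiers: class k positive, the rest negative
classifiers = {k: train_perceptron(X, [1 if lbl == k else -1 for lbl in y])
               for k in set(y)}

def predict(x):
    # Combine the results: pick the classifier with the highest score
    return max(classifiers, key=lambda k: score(classifiers[k], x))

print([predict(x) for x in X])
```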
Bias, variance, generalization error
▪ A model underfits the training data, if it does not
capture all of the structure available from the data.
(b)
▪ A model overfits if it captures too many of the
idiosyncrasies of the training data. (d)

▪ What does it mean to overfit or underfit?
▪ Assume we are doing regression.
▪ Suppose we have a training set
  S_{train} = {(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})}
  drawn from some distribution D.
▪ Define the average training error of a hypothesis h:
  \epsilon_{S_{train}}(h) = \frac{1}{m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2
▪ We are interested in the generalization error:
  \epsilon(h) = E_{(x,y) \sim D}[(h(x) - y)^2]
▪ Both underfitting and overfitting lead to high generalization error. (previous figure)
(a) linear regression fits of linear function to 3 different training sets
randomly selected over the interval [0,4] ➔ low variance
(b) linear model after parameters are averaged over 50,000 trials ➔ high bias
(underestimate in the mid-range, overestimate near the ends)
(c) linear regression fits of fourth-order polynomial to 3 random training sets
➔ high variance
(d) model after averaged over 50,000 trials ➔ low bias

▪ We can’t directly measure the generalization error.
▪ Instead, we estimate it using the test error:
  \epsilon_{S_{test}}(h) = \frac{1}{m} \sum_{i=1}^{m} (h(x_{test}^{(i)}) - y_{test}^{(i)})^2
▪ A diagram (not shown) illustrates how training and test error vary as a function of model complexity.
▪ (e.g.) model complexity:
degree of polynomial;
size of decision tree (depth);
number of features

▪ For classification, define the training error as the proportion of training examples that are misclassified:
  \epsilon_{S_{train}}(h) = \frac{1}{m} \sum_{i=1}^{m} I\{h(x_{train}^{(i)}) \neq y_{train}^{(i)}\}
  where I{·} is an indicator function such that I{true} = 1, I{false} = 0.
▪ Generalization error is defined as the probability of a new example being misclassified:
  \epsilon(h) = P_{(x,y) \sim D}(h(x) \neq y)
Bias and variance in practice

▪ How to choose a model with a good tradeoff between bias and


variance:
▪ what if your learned model gives poor generalization error?
▪ collect more data? use fewer features? more features? adopt a different learning
algorithm?

▪ If your model has high bias, it is too simple.


▪ If your model has high variance, it is too complex.
▪ Compare training and test errors
▪ if they are very different, your model is likely to be high variance
▪ if they are almost the same, your model is likely to be high bias

Regression: Least Squares Fitting

▪ Given: data points and a functional form, find the constants in the function.
▪ Example: given points (x_i, y_i), find a line through them; i.e., find a and b in y = ax + b.
▪ For fitting a line, minimize \chi^2 = \sum_i (y_i - (a x_i + b))^2

[Figure: scatter of points (x_1, y_1), ..., (x_7, y_7) with the fitted line y = ax + b]
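For the line fit, minimizing \chi^2 has a closed-form solution via the normal equations, which can be sketched as follows (the sample points are made up):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b via the normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope
    b = (sy - a * sx) / n                           # intercept
    return a, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]        # points lying exactly on y = 2x + 1
a, b = fit_line(xs, ys)
print(a, b)                      # → 2.0 1.0
```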
Function Approximation

▪ You might do this because you actually care about those numbers…
– Example: measure the position of a falling object over time and fit a parabola:
  p = -(1/2) g t^2
→ Estimate g from the fit
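A sketch of estimating g from such a fit (synthetic, noise-free measurements; with the single basis function t^2, least squares gives g = -2 \sum_i p_i t_i^2 / \sum_i t_i^4):

```python
# Synthetic "measurements" of a falling object: p = -(1/2) g t^2
ts = [0.1 * i for i in range(1, 11)]           # times in seconds
g_true = 9.8
ps = [-0.5 * g_true * t ** 2 for t in ts]      # positions in meters

# Least-squares estimate of g: minimize sum (p_i - c*t_i^2)^2 with c = -g/2
g_est = -2 * sum(p * t ** 2 for p, t in zip(ps, ts)) / sum(t ** 4 for t in ts)
print(g_est)
```

With noisy measurements the same formula averages out the noise, which is the point of fitting rather than using a single measurement.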

Supervised vs. Unsupervised Learning

• Supervised learning (classification): a teacher provides a category label
– Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering): “natural groupings”
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Clustering

[Figure: clusters of people plotted by income, education, and age]
Recognition or Understanding?

• Which of these images are most scenic?
• How can we develop a system to automatically determine scenic beauty?
Features Are Confusable

• In real problems, features are confusable and represent actual variation in the data.
• Regions of overlap represent the classification error.
• Error rates can be computed with knowledge of the joint probability distributions.
• Context is used to reduce overlap (e.g., more features).
• The traditional role of the signal processing engineer has been to develop better features.
Correlation

• Degrees of difficulty: real data is often much harder.

Feature Selection - Example

No! The same separability can be achieved by projecting the patterns onto the blue axis, i.e., only a one-dimensional feature space is needed.
[Figure: scatter of weight (kg) vs. height (cm) for males and females, with the one-dimensional projection axis along which the two classes remain separable]
The Design Cycle

Start → Collect Data → Choose Features → Choose Model → Train Classifier → Evaluate Classifier → End

Key issues:
• “There is no data like more data.”
• Perceptually-meaningful features?
• How do we find the best model?
• How do we estimate parameters?
• How do we evaluate performance?

Goal of the course:
• Introduce you to mathematically rigorous ways to train and evaluate models.
Common Mistakes

• I got 100% accuracy on...


▪ Almost any algorithm works some of the time, but few real-world
problems have ever been completely solved.
▪ Training on the evaluation data is forbidden.
▪ Once you use evaluation data, you should discard it.
• My algorithm is better because...
▪ Statistical significance and experimental design play a big role in
determining the validity of a result.
▪ There is always some probability a random choice of an algorithm will
produce a better result.
• Hence, in this course, we will also learn how to evaluate algorithms.
