
Introduction to Machine Learning

Introduction
林彥宇 教授
Yen-Yu Lin, Professor
國立陽明交通大學 資訊工程學系
Computer Science, National Yang Ming Chiao Tung University

Some slides are modified from Prof. Sheng-Jyh Wang, Prof. Hwang-Tzong Chen, and Prof. Yung-Yu Chuang
Pattern recognition and machine learning

• Pattern recognition is the automated recognition of patterns and regularities in data
➢ Discover pattern regularities
➢ Take actions, such as classification or regression, with regularities

• Data: A set of hand-written digits and the class ground truth

• Computer algorithm: It extracts features from each image and analyzes the patterns and regularities in the data

• Model: Given a new hand-written digit, predict its class label

2
Pattern recognition and machine learning

• Machine learning: to design and develop algorithms that allow computers to make predictions based on empirical data
➢ Try to explore certain patterns or regularities
➢ Learn models from the given data
➢ Based on the given data, the learner produces a useful output in
new cases

• Machine learning is one approach to pattern recognition, while other approaches include hand-crafted (not learned) rules or heuristics

• Machine learning ⊂ Pattern recognition

3
Applications of machine learning

• Computer vision
• Speech recognition
• Information retrieval
• Natural language processing
• Robotics
• Bioinformatics
• Data mining
• Finance
• …

4
Problem definition of a machine learning task

• Training data
➢ A set of N training data {x1, x2, …, xN}, sometimes together with
their target vectors {t1, t2, …, tN}
• Feature extraction
➢ Original input variables are usually transformed into some new
space of variables, where the problem can be better handled
• Model learning
➢ We learn a proper model for the problem
• Generalization or testing
➢ To correctly predict new examples (testing data) that differ from
those used for training

5
Cat image classification: Training data

• Collect a set of training data with target vectors

6
Cat image classification: Feature extraction

• Feature extraction is crucial


➢ Need to take feature variations into account

(example images: viewpoint variations, illumination variations, background variations, pose variations)
7
Cat image classification: Model learning

• Based on the given training data and the extracted features, we learn a classifier

Classifier

8
Cat image classification: Testing

• Apply the learned classifier to the testing images

9
Cat image classification: Testing

• Apply the learned classifier to the testing images and make predictions

10
Regression

(diagram: input data x → Model → predicted value y, a real value; e.g., the estimated TAIEX, the Taiwan Capitalization Weighted Stock Index, on 11/5)
11
Supervised vs. Unsupervised learning

• Supervised learning: the training data comprises examples of the input vectors along with their corresponding target vectors
• Classification: assign each input vector to one of a finite number
of discrete categories
• Regression: assign each input vector to one or more continuous
variables
• Methods: linear regression, linear classification, neural
networks, support vector machine, ensemble learning,
dimensionality reduction, deep learning, …

12
Good vs. bad features for classification

(figure: scatter plots of two classes over Feature A and Feature B; with good features the two classes form well-separated clusters, while with bad features the classes overlap and cannot be separated)
13
Good vs. bad features for regression
(figure: plots of Output Value versus Feature; a good feature exhibits a clear relationship with the output value, while a bad feature shows no clear trend)
14
Supervised vs. Unsupervised learning

• Unsupervised learning: the training data consist of a set of input vectors x without any corresponding target values
• Clustering: to discover groups of similar examples within the
data
• Density estimation: to determine the distribution of data within
the input space
• Dimensionality reduction: to project the data from a high-
dimensional space down to a low-dimensional space
• Data generation: to synthesize new data with some particular conditions

15
Unsupervised learning for clustering

• Clustering: To group a set of data in such a way that data points in the same group, called a cluster, are more similar to each other than to those in other clusters

k-means clustering
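As an illustration, here is a minimal k-means (Lloyd's algorithm) sketch in NumPy; the synthetic 2-D blobs, the choice of k = 3, and the fixed iteration count are assumptions made for this example rather than part of the slides.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids with k randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Synthetic data: three 2-D blobs (illustrative only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2))
               for c in ([0, 0], [3, 3], [0, 3])])
labels, centroids = kmeans(X, k=3)
```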

16
Unsupervised learning for dimensionality reduction

• Dimensionality reduction: To project data from a high-dimensional space to a low-dimensional one

PCA: Principal component analysis
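A minimal PCA sketch via the SVD of the centered data matrix; the random 10-D data and the choice of two components are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def pca(X, n_components=2):
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions (eigenvectors of the data covariance)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Project the centered data onto the leading components
    return X_centered @ components.T, components

X = np.random.default_rng(0).normal(size=(200, 10))  # 200 points in a 10-D space
Z, components = pca(X, n_components=2)               # Z has shape (200, 2)
```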

17
Unsupervised learning for density estimation

• Density estimation: Based on the given data, estimate the underlying probability density function

kernel density estimation (KDE)
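A minimal 1-D Gaussian kernel density estimate as a sketch of the idea; the sample data, the evaluation grid, and the bandwidth value are assumptions made for this illustration.

```python
import numpy as np

def kde(x_query, samples, bandwidth=0.3):
    # Average a Gaussian kernel centered at every sample: (1/(N*h)) * sum K((x - x_n)/h)
    diffs = (x_query[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

samples = np.random.default_rng(0).normal(loc=0.0, scale=1.0, size=500)
x = np.linspace(-4, 4, 200)
density = kde(x, samples)   # estimated p(x) on the grid
```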

18
Unsupervised learning for data generation

• Given a set of natural images, we try to generate new images that look natural and photorealistic

Generative Adversarial
Networks (GAN): Given a
set of images, generate
new images from the
same distributions

19
Applications of data generation

• Face synthesis

Tero Karras et al., “Progressive Growing of GANs for Improved Quality, Stability, and Variation”
20
Polynomial curve fitting: Problem definition

• Training data (observations)


➢ 10 blue circles, each of which has
◆One-dimensional input
◆One target output
• The green curve sin(2𝜋𝑥) is the function used to generate these data; this function is unknown to the learner
• Each point is sampled from the function with additive random Gaussian noise
• Goal of curve fitting: To exploit the training data to discover the underlying function so that we can make predictions of the target value t̂ for some new input x̂
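A small sketch of generating training data of this kind: N = 10 one-dimensional inputs with targets from sin(2πx) plus Gaussian noise. The sample size, noise level, and random seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
x_train = rng.uniform(0.0, 1.0, size=N)   # one-dimensional inputs
# Targets: the underlying function plus random Gaussian noise
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=N)
```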
21
Polynomial curve fitting: Choose a fitting function

• Fit the data using a polynomial function of the form
  𝑦(𝑥, 𝒘) = 𝑤0 + 𝑤1𝑥 + 𝑤2𝑥² + ⋯ + 𝑤𝑀𝑥^𝑀 = Σ_{j=0}^{M} 𝑤𝑗 𝑥^𝑗

➢ This function is parametrized by 𝒘 = (𝑤0, 𝑤1, … , 𝑤𝑀)
➢ 𝑤0 is the bias term
➢ Its input is a data point 𝑥, while the output is the estimated target
➢ 𝑀 is the order of the polynomial function
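A small sketch of evaluating this polynomial model, assuming the Σ_j w_j x^j form above; the example weights are arbitrary. Note that np.polyval expects the highest-order coefficient first, so the weight vector is reversed.

```python
import numpy as np

def polynomial(x, w):
    # w = (w_0, w_1, ..., w_M); np.polyval wants the highest-order term first
    return np.polyval(w[::-1], x)

w = np.array([0.5, -1.0, 2.0])                  # an M = 2 example: 0.5 - x + 2x^2
print(polynomial(np.array([0.0, 1.0]), w))      # -> [0.5  1.5]
```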

22
Polynomial curve fitting: Error function

• An error function (objective function) is used to determine the parameters

• In this case, we minimize the sum-of-squares error
  𝐸(𝒘) = (1/2) Σ_{n=1}^{N} {𝑦(𝑥𝑛, 𝒘) − 𝑡𝑛}²

➢ Differentiable
➢ Closed form solution
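Because the sum-of-squares error is quadratic in w, setting its gradient to zero yields a closed-form solution. Below is a sketch that obtains it with a least-squares solve over the polynomial design matrix; the order M = 3 and the reuse of x_train and t_train from the earlier data-generation sketch are assumptions.

```python
import numpy as np

def fit_polynomial(x, t, M):
    # Design matrix with columns x^0, x^1, ..., x^M
    Phi = np.vander(x, M + 1, increasing=True)
    # The least-squares solution of Phi @ w = t minimizes the sum-of-squares error
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

w_star = fit_polynomial(x_train, t_train, M=3)
```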

23
Polynomial curve fitting: Model selection

• Models with different values of hyperparameter M

• Model selection: To choose a proper value of M


24
Polynomial curve fitting: Model selection

• Under-fitting: M = 0 or M = 1
➢ The constant or first order polynomial gives poor fit due to
insufficient flexibility
• The third order polynomial gives the best fit
• Over-fitting: M = 9
➢ All training points are perfectly fitted
➢ Poor representation of the green curve
➢ The generalization is poor
25
Polynomial curve fitting: Generalization

• Suppose we are given a set of training data and a separate set of 100 test data points
• Evaluate the generalization for each choice of M via the root-mean-square (RMS) error
  𝐸RMS = √(2𝐸(𝒘*)/𝑁)

26
Polynomial curve fitting: Generalization

• Small values of M give relatively large values of training and test errors
• When M is between 3 and 8, reasonable representations are
obtained
• For M=9, the training error goes to zero, but the test error
increases significantly

27
Polynomial curve fitting: Data size vs. Over-fitting
M=9

• Over-fitting becomes less severe as the data size increases


• In general, the number of data points should be no less than
some multiple (say 5 or 10) of the number of adaptive
parameters in the model
• Regularization is often used to control the over-fitting
phenomenon

28
Polynomial curve fitting: Regularization

• Regularization: Add a penalty term to the error function to discourage the coefficients from reaching large values
  𝐸̃(𝒘) = (1/2) Σ_{n=1}^{N} {𝑦(𝑥𝑛, 𝒘) − 𝑡𝑛}² + (𝜆/2)‖𝒘‖²
  where ‖𝒘‖² = 𝒘ᵀ𝒘 = 𝑤0² + 𝑤1² + ⋯ + 𝑤𝑀²

➢ The coefficient 𝑤0 is usually omitted from the regularizer


➢ Techniques of this kind are called shrinkage methods in the statistics literature
➢ A quadratic regularizer is called ridge regression
➢ In neural networks, this approach is known as weight decay
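A sketch of the regularized (ridge) fit under the quadratic penalty above: setting the gradient of the regularized error to zero gives w = (λI + ΦᵀΦ)⁻¹Φᵀt. The value of λ is an illustrative choice, and for simplicity this sketch also penalizes w0, which the slide notes is usually omitted.

```python
import numpy as np

def fit_polynomial_ridge(x, t, M, lam):
    Phi = np.vander(x, M + 1, increasing=True)   # design matrix
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)         # w = (lam*I + Phi^T Phi)^{-1} Phi^T t

w_ridge = fit_polynomial_ridge(x_train, t_train, M=9, lam=np.exp(-18))
```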
29
Polynomial curve fitting: Regularization

30
Probability theory

• We need to handle data uncertainties, which result from
➢ Noise in measurements
➢ The finite size of data sets

• Probability theory provides a consistent framework to manipulate uncertainties, and hence is essential to pattern recognition research

31
A toy example

• Two boxes: r (red box) and b (blue box)
• Two types of fruit: a (apple) and o (orange)
• A trial: Randomly select a box, and then randomly pick a fruit from it
• Introduce one variable B for box and one variable F for fruit
• Many trials: Repeat the process many times
• Question 1: What is the probability that an apple is picked?
➢ Marginal probability
• Question 2: Given that we have picked an orange, what is the
probability that the box we chose was the blue one?
➢ Conditional probability

32
Probability theory: A two-variable case

• Two random variables: 𝑋 and 𝑌


• Each variable has a set of discrete states
➢ 𝑋 can take any value 𝑥𝑖 where 𝑖 = 1, 2, … , 𝑀
➢ 𝑌 can take any value 𝑦𝑗 where 𝑗 = 1, 2, … , 𝐿
• 𝑁 trials where both variables 𝑋 and 𝑌 are sampled

• Some notation
➢ Let the number of trials where 𝑋 = 𝑥𝑖 and 𝑌 = 𝑦𝑗 be 𝑛𝑖𝑗
➢ Let the number of trials where 𝑋 takes value 𝑥𝑖 be 𝑐𝑖
➢ Let the number of trials where 𝑌 takes value 𝑦𝑗 be 𝑟𝑗

33
Joint, marginal, and conditional probabilities

• The probability that 𝑋 takes value 𝑥𝑖 and 𝑌 takes value 𝑦𝑗 is called the joint probability and is written as 𝑝(𝑋 = 𝑥𝑖 , 𝑌 = 𝑦𝑗 )

• It is defined by the fraction of points (trials) falling in cell 𝑖, 𝑗, namely
  𝑝(𝑋 = 𝑥𝑖 , 𝑌 = 𝑦𝑗 ) = 𝑛𝑖𝑗 / 𝑁

34
Joint, marginal, and conditional probabilities

• The probability that 𝑋 takes value 𝑥𝑖 irrespective of the value of 𝑌 is called the marginal probability and is written as 𝑝(𝑋 = 𝑥𝑖 )

• It is defined by the fraction of the number of points that fall in column 𝑖, namely 𝑝(𝑋 = 𝑥𝑖 ) = 𝑐𝑖 / 𝑁

• With the joint probability and 𝑐𝑖 = Σ𝑗 𝑛𝑖𝑗 , we obtain the sum rule
  𝑝(𝑋 = 𝑥𝑖 ) = Σ𝑗 𝑝(𝑋 = 𝑥𝑖 , 𝑌 = 𝑦𝑗 )

35
Joint, marginal, and conditional probabilities

• If we consider only those cases where 𝑋 takes value 𝑥𝑖 , the fraction of those cases where 𝑌 = 𝑦𝑗 is written as 𝑝(𝑌 = 𝑦𝑗 | 𝑋 = 𝑥𝑖 ). It is called the conditional probability

• It is defined by 𝑝(𝑌 = 𝑦𝑗 | 𝑋 = 𝑥𝑖 ) = 𝑛𝑖𝑗 / 𝑐𝑖

• Relationship among joint, marginal, and conditional probabilities (the product rule):
  𝑝(𝑋 = 𝑥𝑖 , 𝑌 = 𝑦𝑗 ) = 𝑛𝑖𝑗 / 𝑁 = (𝑛𝑖𝑗 / 𝑐𝑖 )(𝑐𝑖 / 𝑁) = 𝑝(𝑌 = 𝑦𝑗 | 𝑋 = 𝑥𝑖 ) 𝑝(𝑋 = 𝑥𝑖 )

36
Joint, marginal, and conditional probabilities

37
Bayes’ theorem

• By using the product rule and the symmetry property 𝑝(𝑋, 𝑌) = 𝑝(𝑌, 𝑋), we have Bayes' theorem
  𝑝(𝑌|𝑋) = 𝑝(𝑋|𝑌) 𝑝(𝑌) / 𝑝(𝑋)
  where the denominator can be expressed as 𝑝(𝑋) = Σ𝑌 𝑝(𝑋|𝑌) 𝑝(𝑌)
38
Probability with continuous variables

• The probability density 𝑝(𝑥) over a continuous variable 𝑥 must satisfy two conditions:
➢ Nonnegative: 𝑝(𝑥) ≥ 0, since probabilities are nonnegative
➢ Sum-to-1: ∫_{−∞}^{∞} 𝑝(𝑥) 𝑑𝑥 = 1, since the value of 𝑥 must lie somewhere on the real axis

• The cumulative distribution function defines the probability that 𝑥 lies in the interval (−∞, 𝑧) via
  𝑃(𝑧) = ∫_{−∞}^{𝑧} 𝑝(𝑥) 𝑑𝑥

39
Sum rule and product rule

• Sum rule in discrete cases: 𝑝(𝑋) = Σ𝑌 𝑝(𝑋, 𝑌)

• Sum rule in continuous cases: 𝑝(𝑥) = ∫ 𝑝(𝑥, 𝑦) 𝑑𝑦

• Product rule in discrete cases: 𝑝(𝑋, 𝑌) = 𝑝(𝑌|𝑋) 𝑝(𝑋)

• Product rule in continuous cases: 𝑝(𝑥, 𝑦) = 𝑝(𝑦|𝑥) 𝑝(𝑥)

40
Expectations and covariances

• The average value of some function 𝑓(𝑥) under a probability distribution 𝑝(𝑥) is called the expectation of 𝑓(𝑥), denoted 𝔼[𝑓]

• For a discrete distribution, the expectation of 𝑓(𝑥) is 𝔼[𝑓] = Σ𝑥 𝑝(𝑥) 𝑓(𝑥)

• For a continuous distribution, the expectation of 𝑓(𝑥) is 𝔼[𝑓] = ∫ 𝑝(𝑥) 𝑓(𝑥) 𝑑𝑥

41
Expectations and covariances

• The variance of 𝑓(𝑥) under a probability distribution 𝑝(𝑥) is
  var[𝑓] = 𝔼[(𝑓(𝑥) − 𝔼[𝑓(𝑥)])²]
• It is a measure of how much variability there is in 𝑓(𝑥) around its mean

• For two random variables 𝑥 and 𝑦, the covariance is defined by
  cov[𝑥, 𝑦] = 𝔼𝑥,𝑦[(𝑥 − 𝔼[𝑥])(𝑦 − 𝔼[𝑦])]
• It expresses the extent to which 𝑥 and 𝑦 vary together

42
Gaussian distribution

• For a single continuous variable, the Gaussian or normal distribution is defined by
  𝒩(𝑥 | 𝜇, 𝜎²) = (1/√(2𝜋𝜎²)) exp(−(𝑥 − 𝜇)²/(2𝜎²))
  which is specified by two parameters: mean 𝜇 and variance 𝜎²
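A small sketch that evaluates this density directly, as a numerical check of the formula; the parameter values are arbitrary.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    # N(x | mu, sigma^2) = exp(-(x - mu)^2 / (2*sigma^2)) / sqrt(2*pi*sigma^2)
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

print(gaussian_pdf(0.0, mu=0.0, sigma2=1.0))   # ~0.3989, the standard normal density at x = 0
```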

43
Mean and variance of a Gaussian distribution

• The average value of a random variable 𝑥 whose distribution is Gaussian: 𝔼[𝑥] = 𝜇

• The second-order moment of variable 𝑥: 𝔼[𝑥²] = 𝜇² + 𝜎²

• The variance of variable 𝑥: var[𝑥] = 𝔼[𝑥²] − 𝔼[𝑥]² = 𝜎²

44
Multivariate Gaussian

• The multivariate Gaussian distribution defined over a D-dimensional vector 𝐱 of continuous variables:
  𝒩(𝐱 | 𝝁, 𝚺) = (1/((2𝜋)^{D/2} |𝚺|^{1/2})) exp(−(1/2)(𝐱 − 𝝁)ᵀ 𝚺⁻¹ (𝐱 − 𝝁))
  where 𝝁 is the D-dimensional mean vector, the D × D matrix 𝚺 is called the covariance matrix, and |𝚺| denotes the determinant of 𝚺

45
Bayes’ theorem for polynomial curve fitting

• Recall the curve fitting problem


➢ Given a set of N observations D = {x1, x2, …, xN} and their target
values {t1, t2, …, tN}
➢ Polynomial curve fitting: Determine the values of 𝒘

• Prior probability 𝑝(𝒘): Expresses our assumptions about 𝒘 before observing any data

• Likelihood function 𝑝(𝐷|𝒘): Expresses how probable the observed data D are under 𝒘. It is evaluated after the observations D are given
46
Bayes’ theorem for polynomial curve fitting

• Bayes' theorem takes the form
  𝑝(𝒘|𝐷) = 𝑝(𝐷|𝒘) 𝑝(𝒘) / 𝑝(𝐷)
  which allows us to evaluate the uncertainty in 𝒘 after we have the observations D

• 𝑝(𝐷) is the normalization constant. Thus, we have
  𝑝(𝒘|𝐷) ∝ 𝑝(𝐷|𝒘) 𝑝(𝒘), i.e., posterior ∝ likelihood × prior
47
Determining Gaussian parameters by maximum
likelihood

• Given a set of N observations 𝐱 = (𝑥1 , … , 𝑥𝑁 )ᵀ

• Assume these observations are sampled from a Gaussian distribution with mean 𝜇 and variance 𝜎² (both unknown)

• Our goal is to determine 𝜇 and 𝜎² based on the observations

• We assume that the data are sampled independently from the same distribution, namely independent and identically distributed, or i.i.d. for short

48
Determining Gaussian parameters by maximum
likelihood
• Since the data are i.i.d., the likelihood function of the data given mean 𝜇 and variance 𝜎² is
  𝑝(𝐱 | 𝜇, 𝜎²) = Π_{n=1}^{N} 𝒩(𝑥𝑛 | 𝜇, 𝜎²)

• The log likelihood function is
  ln 𝑝(𝐱 | 𝜇, 𝜎²) = −(1/(2𝜎²)) Σ_{n=1}^{N} (𝑥𝑛 − 𝜇)² − (N/2) ln 𝜎² − (N/2) ln(2𝜋)

• Maximum likelihood solution:
  𝜇ML = (1/N) Σ_{n=1}^{N} 𝑥𝑛 ,  𝜎²ML = (1/N) Σ_{n=1}^{N} (𝑥𝑛 − 𝜇ML)²
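A sketch of these maximum-likelihood estimates on synthetic data; the true parameters, the sample size, and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)     # samples from N(mu = 2, sigma = 1.5)

mu_ml = x.mean()                                  # (1/N) * sum(x_n)
sigma2_ml = ((x - mu_ml) ** 2).mean()             # (1/N) * sum((x_n - mu_ml)^2)
print(mu_ml, sigma2_ml)                           # close to 2.0 and 1.5^2 = 2.25
```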

49
Probabilistic perspective of polynomial curve fitting

• Given N data points for regression: 𝐱 = (𝑥1 , … , 𝑥𝑁 )ᵀ and 𝐭 = (𝑡1 , … , 𝑡𝑁 )ᵀ
➢ Fit the data using a polynomial function of the form 𝑦(𝑥, 𝒘) = Σ_{j=0}^{M} 𝑤𝑗 𝑥^𝑗
➢ This function is parametrized by 𝐰

• Given the value of 𝑥, we assume the corresponding value of 𝑡 has a Gaussian distribution with a mean equal to 𝑦(𝑥, 𝒘), i.e.,
  𝑝(𝑡 | 𝑥, 𝒘, 𝛽) = 𝒩(𝑡 | 𝑦(𝑥, 𝒘), 𝛽⁻¹)
  where 𝛽⁻¹ is the variance 𝜎² (𝛽 is called the precision)

50
Probabilistic perspective of polynomial curve fitting

• The Gaussian conditional distribution for 𝑡 given 𝑥 is centered at 𝑦(𝑥, 𝒘) and has variance 𝛽⁻¹

• If the data are i.i.d., the likelihood function is
  𝑝(𝐭 | 𝐱, 𝒘, 𝛽) = Π_{n=1}^{N} 𝒩(𝑡𝑛 | 𝑦(𝑥𝑛 , 𝒘), 𝛽⁻¹)

51
Maximum likelihood solution

• The log likelihood function is
  ln 𝑝(𝐭 | 𝐱, 𝒘, 𝛽) = −(𝛽/2) Σ_{n=1}^{N} {𝑦(𝑥𝑛 , 𝒘) − 𝑡𝑛}² + (N/2) ln 𝛽 − (N/2) ln(2𝜋)

• Maximum likelihood (ML) solution for determining 𝐰 and 𝛽
➢ Compute the gradient of the log likelihood function w.r.t. 𝐰 and set it to 0. We then obtain 𝐰ML , which is equivalent to minimizing the sum-of-squares error

➢ By setting the gradient of the log likelihood function w.r.t. 𝛽 to 0, 𝛽ML is obtained by solving
  1/𝛽ML = (1/N) Σ_{n=1}^{N} {𝑦(𝑥𝑛 , 𝐰ML) − 𝑡𝑛}²

52
Maximum likelihood solution

• After determining the values of 𝐰ML and 𝛽ML , we can make predictions for a new value of 𝑥 via the predictive distribution
  𝑝(𝑡 | 𝑥, 𝐰ML , 𝛽ML) = 𝒩(𝑡 | 𝑦(𝑥, 𝐰ML), 𝛽ML⁻¹)

53
Maximum a posteriori (MAP) solution

• While the ML solution is obtained by maximizing the likelihood, the MAP solution is obtained by maximizing the posterior
• Recall that the posterior is proportional to the product of the likelihood and the prior
• Introduce a prior distribution over the curve parameters 𝐰
  𝑝(𝐰 | 𝛼) = 𝒩(𝐰 | 𝟎, 𝛼⁻¹𝐈) = (𝛼/(2𝜋))^{(M+1)/2} exp(−(𝛼/2) 𝐰ᵀ𝐰)
➢ 𝑀 is the order of the polynomial, so 𝐰 has 𝑀 + 1 elements
➢ 𝛼 is a hyperparameter
• The posterior distribution for 𝐰 is
  𝑝(𝐰 | 𝐱, 𝐭, 𝛼, 𝛽) ∝ 𝑝(𝐭 | 𝐱, 𝐰, 𝛽) 𝑝(𝐰 | 𝛼)
54
Maximum a posteriori (MAP) solution

• The MAP solution, 𝐰MAP and 𝛽MAP , is obtained by maximizing the posterior, or equivalently by minimizing
  (𝛽/2) Σ_{n=1}^{N} {𝑦(𝑥𝑛 , 𝐰) − 𝑡𝑛}² + (𝛼/2) 𝐰ᵀ𝐰
• Hence, maximizing the posterior is equivalent to minimizing the regularized sum-of-squares error with 𝜆 = 𝛼/𝛽

55
Bayesian curve fitting

• Both the ML and MAP solutions give only a point estimate of 𝐰
• In a full Bayesian approach, we integrate over all possible values of 𝐰 for regression, i.e.,
  𝑝(𝑡 | 𝑥, 𝐱, 𝐭) = ∫ 𝑝(𝑡 | 𝑥, 𝐰) 𝑝(𝐰 | 𝐱, 𝐭) 𝑑𝐰
  where 𝜙(𝑥𝑛) = (𝑥𝑛⁰, … , 𝑥𝑛ᴹ)ᵀ denotes the polynomial basis vector that appears in the resulting closed-form predictive distribution

56
Probabilistic polynomial curve fitting

• Given the assumption 𝑝(𝑡 | 𝑥, 𝐰, 𝛽) = 𝒩(𝑡 | 𝑦(𝑥, 𝐰), 𝛽⁻¹)

➢ ML solution: Find 𝐰 that maximizes the likelihood function
  𝑝(𝑡 | 𝑥, 𝐷) = 𝑝(𝑡 | 𝑥, 𝐰ML , 𝛽⁻¹)

➢ MAP solution: Find 𝐰 that maximizes the posterior probability
  𝑝(𝑡 | 𝑥, 𝐷) = 𝑝(𝑡 | 𝑥, 𝐰MAP , 𝛽⁻¹)

➢ Bayesian solution: Integrate over 𝐰
  𝑝(𝑡 | 𝑥, 𝐷) = ∫ 𝑝(𝑡 | 𝑥, 𝐰, 𝛽⁻¹) 𝑝(𝐰 | 𝐷) 𝑑𝐰

57
Model selection

• Hyperparameters, such as 𝑀 in polynomial curve fitting, control the model complexity

• Model selection: Determine the values of hyperparameters that achieve the best predictive performance on new (testing) data
• Idea: Split the training data into a training set and a validation set
➢ Training set: Used to learn the model with particular
hyperparameters values
➢ Validation set: Used to evaluate the performance of the learned
model

58
Model selection

• About the size of the validation set


➢ A large validation set: Less training data for model learning
➢ A small validation set: Less reliable performance evaluation

59
Model selection via cross validation

• S-fold cross-validation
➢ Partition training data into S equal-sized groups
➢ S-1 groups are used to train the model that is evaluated on the
remaining group
➢ Repeat the procedure for all S possible runs
➢ Average the performance
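A minimal sketch of S-fold cross-validation for choosing the polynomial order M; the candidate orders, S = 5, the RMS scoring, and the reuse of fit_polynomial and the x_train, t_train data from the earlier sketches are assumptions made for this illustration.

```python
import numpy as np

def cross_validate_order(x, t, orders, S=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), S)   # S roughly equal-sized groups
    scores = {}
    for M in orders:
        errs = []
        for s in range(S):
            val_idx = folds[s]
            train_idx = np.concatenate([folds[j] for j in range(S) if j != s])
            w = fit_polynomial(x[train_idx], t[train_idx], M)        # train on S-1 groups
            pred = np.polyval(w[::-1], x[val_idx])                   # predict on the held-out group
            errs.append(np.sqrt(np.mean((pred - t[val_idx]) ** 2)))  # RMS error on this fold
        scores[M] = np.mean(errs)                                    # average over the S runs
    return min(scores, key=scores.get), scores

best_M, scores = cross_validate_order(x_train, t_train, orders=range(10))
```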

60
Drawbacks of model selection

• If the training data are limited, a large value of S is appropriate

• At the extreme, setting S = N (the number of training data points) gives the leave-one-out technique

• Some drawbacks
➢ The number of training runs increases by a factor of S
➢ The number of hyperparameter value combinations increases exponentially with the number of hyperparameters

61
Summary

• Polynomial curve fitting for regression


➢ Fitting by minimizing the sum-of-squares error

➢ Regularization for alleviating overfitting

• Probability density
➢ Expectation, variance, and covariance
➢ Gaussian distribution

62
Summary

• Bayes’ theorem

• When applying Bayes’ theorem to polynomial curve fitting,


➢ ML solution: Find 𝐰 that maximizes the likelihood function
➢ MAP solution: Find 𝐰 that maximizes the posterior probability
➢ Bayesian solution: Integrate over 𝐰

• Model selection by cross-validation

63
References

• Sections 1.1, 1.2, 1.3, and 1.4 of the PRML textbook

64
Thank You for Your Attention!

Yen-Yu Lin (林彥宇)


Email: [email protected]
URL: https://www.cs.nycu.edu.tw/members/detail/lin

65
