
MAXIMUM LIKELIHOOD AND BAYESIAN PARAMETER ESTIMATION

Chapter 3, DHS
The Challenges we face

• We have learned about prior probabilities and class-conditional densities.
• Unfortunately, in pattern recognition applications we rarely if ever have this kind of complete knowledge about the probabilistic structure of the problem.
• In a typical case we merely have some vague, general knowledge about the situation, together with a number of design samples or training data: particular representatives of the patterns we want to classify.
• The goal is to find some way to use this information to design or train the classifier.
General Approach

• The number of available samples always seems too small, and serious problems arise when the dimensionality of the feature vector 𝒙 is large.
• If we know the number of parameters in advance and our general knowledge about the problem permits us to parameterize the conditional densities, then the severity of these problems can be reduced significantly.
• Suppose, for example, that we can reasonably assume that 𝑝(𝒙|𝜔𝑖) is a normal density with mean 𝜇𝑖 and covariance matrix Σ𝑖, although we do not know the exact values of these quantities. This knowledge simplifies the problem from one of estimating an unknown function 𝑝(𝒙|𝜔𝑖) to one of estimating the parameters 𝜇𝑖 and Σ𝑖.
• We will consider only the supervised learning case, where the true class label for each sample is known.
Estimation Process we will study

• Maximum likelihood estimation
  • Maximum likelihood and several other methods view the parameters as quantities whose values are fixed but unknown. The best estimate of their value is defined to be the one that maximizes the probability of obtaining the samples actually observed.
• Bayesian estimation
  • Bayesian estimation views the parameters as random variables having some known a priori distribution. Observation of the samples converts this to a posterior density, thereby revising our opinion about the true values of the parameters.
Maximum Likelihood Estimation

• Given is a set 𝐷 = {𝑥1, . . . , 𝑥𝑛} of independent and identically distributed (i.i.d.) samples drawn from the density 𝑝(𝑥|𝜃).
• We would like to use the training samples in 𝐷 to estimate the unknown parameter vector 𝜃.
• Define 𝐿(𝜃|𝐷), the likelihood function of 𝜃 with respect to 𝐷, as:

  𝐿(𝜃|𝐷) = 𝑝(𝐷|𝜃) = 𝑝(𝑥1, . . . , 𝑥𝑛|𝜃) = ∏𝑥𝑖∈𝐷 𝑝(𝑥𝑖|𝜃)

• The maximum likelihood estimate (MLE) of 𝜃 is, by definition, the value 𝜃̂ that maximizes 𝐿(𝜃|𝐷), and can be computed as:

  𝜃̂ = arg max𝜃 𝐿(𝜃|𝐷)

• An equivalent log-likelihood form is often easier to work with:

  𝜃̂ = arg max𝜃 log 𝐿(𝜃|𝐷) = arg max𝜃 Σ𝑖 log 𝑝(𝑥𝑖|𝜃)
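To make the arg max concrete, here is a minimal Python sketch (our own illustration, assuming NumPy and SciPy are available) that maximizes the log-likelihood of a univariate Gaussian numerically; the result essentially coincides with the sample mean and standard deviation.

```python
# Minimal sketch: numerical MLE for a univariate Gaussian by maximizing
# the log-likelihood sum_i log p(x_i | theta), with theta = (mu, log sigma).
# Illustrative only; names and values are our own, not from the slides.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.5, size=200)   # i.i.d. training samples

def neg_log_likelihood(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)                  # keep sigma > 0
    return -np.sum(norm.logpdf(D, loc=mu, scale=sigma))

theta_hat = minimize(neg_log_likelihood, x0=[0.0, 0.0]).x
mu_hat, sigma_hat = theta_hat[0], np.exp(theta_hat[1])
print(mu_hat, sigma_hat)                       # close to the sample mean and std
```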
Maximum Likelihood Estimation

• If the number of parameters is 𝑝, i.e., 𝜽 = (𝜃1, … , 𝜃𝑝), define the gradient operator as:

  ∇𝜽 = (∂/∂𝜃1, … , ∂/∂𝜃𝑝)ᵀ

• The maximum likelihood estimate of 𝜽 then satisfies the necessary condition:

  ∇𝜽 log 𝐿(𝜽|𝐷) = Σ𝑖 ∇𝜽 log 𝑝(𝑥𝑖|𝜽) = 0
Observations on MLE

• The MLE is the parameter point for which the observed sample is the most likely.
• The procedure based on partial derivatives may yield several local extrema. We should check each solution individually to identify the global optimum.
• Boundary conditions must also be checked separately for extrema.
• Invariance property: if 𝜃̂ is the MLE of 𝜃, then for any function 𝑓(𝜃), the MLE of 𝑓(𝜃) is 𝑓(𝜃̂).
The Gaussian Case: Unknown 𝜇

• Suppose that the samples are drawn from a multivariate normal population with mean 𝝁 and covariance matrix 𝚺, where only the mean is unknown.
• In this case we have 𝜽 = {𝝁}, and for a single sample

  log 𝑝(𝒙𝑘|𝝁) = −(1/2) log[(2π)^𝑑 |𝚺|] − (1/2)(𝒙𝑘 − 𝝁)ᵀ𝚺⁻¹(𝒙𝑘 − 𝝁)

• The maximum likelihood estimate 𝝁̂ for 𝝁 can then be obtained from the condition:

  Σ𝑘 𝚺⁻¹(𝒙𝑘 − 𝝁̂) = 0

• Observe that each component of the gradient must vanish, and therefore

  𝝁̂ = (1/𝑛) Σ𝑘 𝒙𝑘

  i.e., the MLE of the mean is simply the sample mean of the training samples.
The Gaussian Case: Unknown 𝜇 and Σ

• In the more general (and more typical) multivariate normal case, neither the mean 𝝁 nor the covariance matrix 𝚺 is known.
• Consider first the univariate case with 𝜃1 = 𝜇 and 𝜃2 = 𝜎², so that for a single point

  log 𝑝(𝑥𝑘|𝜽) = −(1/2) log(2π𝜃2) − (1/(2𝜃2))(𝑥𝑘 − 𝜃1)²

• Computing the partial derivatives:

  ∂ log 𝑝(𝑥𝑘|𝜽)/∂𝜃1 = (𝑥𝑘 − 𝜃1)/𝜃2
  ∂ log 𝑝(𝑥𝑘|𝜽)/∂𝜃2 = −1/(2𝜃2) + (𝑥𝑘 − 𝜃1)²/(2𝜃2²)

• The corresponding log-likelihood conditions (𝜃̂𝑖 estimates 𝜃𝑖) are:

  Σ𝑘 (𝑥𝑘 − 𝜃̂1)/𝜃̂2 = 0
  −Σ𝑘 1/𝜃̂2 + Σ𝑘 (𝑥𝑘 − 𝜃̂1)²/𝜃̂2² = 0
The Gaussian Case: Unknown 𝜇 and Σ

• Therefore, in the univariate case, we finally have the estimates:

  𝜇̂ = (1/𝑛) Σ𝑘 𝑥𝑘        𝜎̂² = (1/𝑛) Σ𝑘 (𝑥𝑘 − 𝜇̂)²

• For the multivariate case,

  𝝁̂ = (1/𝑛) Σ𝑘 𝒙𝑘   (the sample mean)
  𝚺̂ = (1/𝑛) Σ𝑘 (𝒙𝑘 − 𝝁̂)(𝒙𝑘 − 𝝁̂)ᵀ   (the arithmetic average of the 𝑛 matrices (𝒙𝑘 − 𝝁̂)(𝒙𝑘 − 𝝁̂)ᵀ)
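A minimal sketch of these multivariate estimates (our own illustration with made-up parameters): the sample mean and the biased (1/𝑛) sample covariance, checked against NumPy's equivalent.

```python
# Minimal sketch of the multivariate Gaussian MLEs: the sample mean and the
# biased (1/n) sample covariance. Names and values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(true_mu, true_cov, size=500)   # n x d samples

mu_hat = X.mean(axis=0)                                    # (1/n) sum_k x_k
diff = X - mu_hat
sigma_hat = diff.T @ diff / X.shape[0]                     # (1/n) sum_k outer products

# np.cov(..., bias=True) also divides by n and should agree with sigma_hat.
assert np.allclose(sigma_hat, np.cov(X.T, bias=True))
print(mu_hat, sigma_hat, sep="\n")
```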
Bias of Estimators
• The bias of an estimator 𝜃̂ is the difference between the expected value of 𝜃̂ and 𝜃.
• The MLE of 𝜇 is an unbiased estimator for 𝜇 because 𝐸[𝜇̂] = 𝜇.
• The MLE of 𝜎² is not an unbiased estimator for 𝜎², because the expected value over all data sets of size 𝑛 of the sample variance is not equal to the true variance:

  𝐸[(1/𝑛) Σ𝑖 (𝑥𝑖 − 𝑥̄)²] = ((𝑛 − 1)/𝑛) 𝜎² ≠ 𝜎²

• To see this, note that 𝜎² = 𝜎𝑥² = 𝐸[𝑥²] − 𝜇² and 𝐸[𝑥] = 𝐸[𝑥̄] = 𝜇. Therefore,

  𝐸[(1/𝑛) Σ𝑖 (𝑥𝑖 − 𝑥̄)²] = 𝐸[(1/𝑛)(Σ𝑖 𝑥𝑖² − 2 Σ𝑖 𝑥𝑖 𝑥̄ + Σ𝑖 𝑥̄²)] = 𝐸[(1/𝑛) Σ𝑖 𝑥𝑖² − 2𝑥̄² + 𝑥̄²]
                              = 𝐸[𝑥²] − 𝐸[𝑥̄²] = (𝜎𝑥² + 𝜇²) − (𝜎𝑥̄² + 𝜇²) = 𝜎𝑥² − 𝜎𝑥̄²

• Now, 𝜎𝑥̄² = var((1/𝑛) Σ𝑖 𝑥𝑖) = (1/𝑛²) var(Σ𝑖 𝑥𝑖) = (1/𝑛²) Σ𝑖 var(𝑥𝑖) = (1/𝑛²) · 𝑛 · var(𝑥) = 𝜎𝑥²/𝑛, since the samples are i.i.d.
• Combining, we get 𝐸[(1/𝑛) Σ𝑖 (𝑥𝑖 − 𝑥̄)²] = 𝜎𝑥² − 𝜎𝑥²/𝑛 = (𝑛 − 1)𝜎𝑥²/𝑛.
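A small simulation (our own, with illustrative values) confirms the bias numerically: averaged over many data sets of size 𝑛, the 1/𝑛 variance estimate comes out near (𝑛 − 1)𝜎²/𝑛 rather than 𝜎².

```python
# Minimal simulation of the bias of the MLE variance estimator.
import numpy as np

rng = np.random.default_rng(2)
n, sigma2, trials = 10, 4.0, 100_000
X = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
mle_var = X.var(axis=1, ddof=0)          # biased 1/n estimator per data set
print(mle_var.mean())                    # ~ (n-1)/n * sigma2 = 3.6
print((n - 1) / n * sigma2)
```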
Goodness of Fit

• To measure how well a fitted distribution resembles the sample data (goodness of fit), we can use the Kolmogorov-Smirnov test statistic.
• It is defined as the maximum absolute difference between the cumulative distribution function estimated from the sample and the one calculated from the fitted distribution.
• After estimating the parameters for different distributions, we can compute the Kolmogorov-Smirnov statistic for each distribution and choose the one with the smallest value as the best fit to our sample.
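As a sketch of this procedure (assuming SciPy is available; the candidate distributions and data below are our own illustration, not from the slides), one can compare fits by their KS statistics:

```python
# Minimal sketch: compare two candidate fits with the Kolmogorov-Smirnov
# statistic (smaller = better fit to the sample).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(5.0, 2.0, size=1000)

# Fit a normal and an exponential by maximum likelihood, then compare.
mu_hat, sigma_hat = data.mean(), data.std()
ks_norm = stats.kstest(data, "norm", args=(mu_hat, sigma_hat)).statistic
loc_hat, scale_hat = stats.expon.fit(data)
ks_expon = stats.kstest(data, "expon", args=(loc_hat, scale_hat)).statistic

print(ks_norm, ks_expon)   # the normal fit should give the smaller statistic
```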
MLE Examples

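Since the worked examples on the original slide are not reproduced here, the following is our own illustrative MLE example: for i.i.d. samples from an exponential density 𝑝(𝑥|𝜆) = 𝜆𝑒^(−𝜆𝑥), the log-likelihood is 𝑛 log 𝜆 − 𝜆 Σ𝑖 𝑥𝑖, and setting its derivative to zero gives 𝜆̂ = 𝑛 / Σ𝑖 𝑥𝑖 = 1/𝑥̄. The sketch below checks this numerically.

```python
# Illustrative MLE example (not from the original slide): exponential density
# p(x|lambda) = lambda * exp(-lambda x); the closed-form MLE is 1 / sample mean.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.exponential(scale=1 / 2.5, size=1000)      # true lambda = 2.5

def neg_log_likelihood(lam):
    return -(len(x) * np.log(lam) - lam * x.sum())

numeric = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100), method="bounded").x
closed_form = 1 / x.mean()
print(numeric, closed_form)                        # both close to 2.5
```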
Bayesian Estimation

• Although the answers we get by this method will generally be nearly identical to those obtained by maximum likelihood, there is a conceptual difference:
• In maximum likelihood methods we view the true parameter vector we seek, 𝜽, to be fixed.
• In Bayesian learning we consider 𝜽 to be a random variable, and training data allow us to convert a distribution on this variable into a posterior probability density.
Bayesian Estimation

• The computation of the posterior probabilities 𝑃(𝜔𝑖|𝒙) lies at the heart of Bayesian classification. Bayes' formula allows us to compute these probabilities from the prior probabilities 𝑃(𝜔𝑖) and the class-conditional densities 𝑝(𝒙|𝜔𝑖).
• Given the sample collection 𝐷, Bayes' formula becomes:

  𝑃(𝜔𝑖|𝒙, 𝐷) = 𝑝(𝒙|𝜔𝑖, 𝐷) 𝑃(𝜔𝑖|𝐷) / Σ𝑗 𝑝(𝒙|𝜔𝑗, 𝐷) 𝑃(𝜔𝑗|𝐷)

• We will assume that the a priori probabilities are known, or can be obtained from a trivial calculation, so that 𝑃(𝜔𝑖|𝐷) = 𝑃(𝜔𝑖).
• In fact, separating the samples by class into {𝐷1, … , 𝐷𝑐} gives 𝑐 separate problems: samples from class 𝑖 have no influence on 𝑝(𝒙|𝜔𝑗, 𝐷) for 𝑖 ≠ 𝑗.
Parameter Distribution

• We assume 𝑝(𝒙) is unknown but has a known parametric form; we express this by saying that the function 𝑝(𝒙|𝜽) is completely known.
• Any information we might have about 𝜽 prior to observing the samples is assumed to be contained in a known prior density 𝑝(𝜽).
• Observation of the samples converts this to a posterior density 𝑝(𝜽|𝐷), which, we hope, is sharply peaked about the true value of 𝜽.
MLE vs. Bayes Estimates

• Maximum likelihood estimation finds a single estimate of 𝜽 based on the samples in 𝐷, but a different sample set would give rise to a different estimate.
• The Bayes estimate takes this sampling variability into account.
• We assume that we do not know the true value of 𝜽; instead of taking a single estimate, we take a weighted average of the densities 𝑝(𝒙|𝜽), weighted by the posterior 𝑝(𝜽|𝐷):

  𝑝(𝒙|𝐷) = ∫ 𝑝(𝒙|𝜽) 𝑝(𝜽|𝐷) 𝑑𝜽
Gaussian Case

• We want to calculate the posterior density 𝑝(𝜽|𝐷) and the desired density 𝑝(𝒙|𝐷).
• Assume that 𝑝(𝒙|𝝁) = 𝑁(𝝁, 𝚺) with 𝜽 = {𝝁, 𝚺}.
• Univariate case: only 𝜇 is unknown, and its prior density 𝑝(𝜇) = 𝑁(𝜇0, 𝜎0²) is known (i.e., both 𝜇0 and 𝜎0 are given).
• Intuitively, 𝜇0 represents our best prior guess for 𝜇 and 𝜎0² represents the uncertainty in this guess.
• Suppose now that {𝑥1, … , 𝑥𝑛} are drawn independently from the resulting population. Using Bayes' formula,

  𝑝(𝜇|𝐷) = 𝑝(𝐷|𝜇) 𝑝(𝜇) / ∫ 𝑝(𝐷|𝜇) 𝑝(𝜇) 𝑑𝜇 = 𝛼 ∏𝑘 𝑝(𝑥𝑘|𝜇) 𝑝(𝜇)

  where 𝛼 is a normalization factor that does not depend on 𝜇.
Gaussian Case

• Since 𝑝(𝑥𝑘|𝜇) = 𝑁(𝜇, 𝜎²) and 𝑝(𝜇) = 𝑁(𝜇0, 𝜎0²), we compute

  𝑝(𝜇|𝐷) = 𝛼 ∏𝑘 (1/(√(2π)𝜎)) exp[−(1/2)((𝑥𝑘 − 𝜇)/𝜎)²] · (1/(√(2π)𝜎0)) exp[−(1/2)((𝜇 − 𝜇0)/𝜎0)²]
          = 𝛼′ exp[−(1/2)(Σ𝑘 ((𝜇 − 𝑥𝑘)/𝜎)² + ((𝜇 − 𝜇0)/𝜎0)²)]
          = 𝛼″ exp[−(1/2)((𝑛/𝜎² + 1/𝜎0²)𝜇² − 2((1/𝜎²) Σ𝑘 𝑥𝑘 + 𝜇0/𝜎0²)𝜇)]

• The exponent is quadratic in 𝜇, so 𝑝(𝜇|𝐷) is again a Gaussian density.
• Factors that do not depend on 𝜇 have been absorbed into the constants 𝛼, 𝛼′, 𝛼″.
Gaussian Distribution
• Write 𝑝(𝜇|𝐷) in the typical Gaussian form:

  𝑝(𝜇|𝐷) = (1/(√(2π)𝜎𝑛)) exp[−(1/2)((𝜇 − 𝜇𝑛)/𝜎𝑛)²]

• Identifying the coefficients, we have (with sample mean 𝑥̄𝑛 = (1/𝑛) Σ𝑘 𝑥𝑘):

  1/𝜎𝑛² = 𝑛/𝜎² + 1/𝜎0²        𝜇𝑛/𝜎𝑛² = (𝑛/𝜎²) 𝑥̄𝑛 + 𝜇0/𝜎0²

• Simplifying:

  𝜇𝑛 = (𝑛𝜎0²/(𝑛𝜎0² + 𝜎²)) 𝑥̄𝑛 + (𝜎²/(𝑛𝜎0² + 𝜎²)) 𝜇0        𝜎𝑛² = 𝜎0²𝜎²/(𝑛𝜎0² + 𝜎²)
Gaussian Distribution: Observations
• 𝜇0 is our best prior guess for 𝜇 and 𝜎0² is the uncertainty about this guess.
• 𝜇𝑛 is our best guess after observing 𝐷 and 𝜎𝑛² is the uncertainty about this guess.
• 𝜇𝑛 always lies between 𝑥̄𝑛 and 𝜇0, with coefficients that are non-negative and sum to one.
• If 𝜎0 ≠ 0, then 𝜇𝑛 approaches the sample mean as 𝑛 approaches infinity.
• If 𝜎0 = 0, we have a degenerate case in which our a priori certainty that 𝜇 = 𝜇0 is so strong that no number of observations can change our opinion.
• If instead 𝜎0 ≫ 𝜎, we are so uncertain about our a priori guess that we take 𝜇𝑛 = 𝑥̄𝑛, using only the samples to estimate 𝜇.
• In general, the relative balance between prior knowledge and empirical data is set by the ratio of 𝜎² to 𝜎0², which is sometimes called the dogmatism. If the dogmatism is not infinite, then after enough samples are taken the exact values assumed for 𝜇0 and 𝜎0² will be unimportant, and 𝜇𝑛 will converge to the sample mean.
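A minimal numerical sketch of these update equations (our own illustration; 𝜎 is assumed known and all values below are made up):

```python
# Univariate Bayesian update for an unknown mean with known sigma:
# mu_n and sigma_n^2 from the formulas above. Illustrative only.
import numpy as np

rng = np.random.default_rng(5)
sigma = 1.0                                  # known noise std
mu0, sigma0 = 0.0, 10.0                      # broad prior N(mu0, sigma0^2)
x = rng.normal(3.0, sigma, size=50)          # observed samples, true mu = 3

n, xbar = len(x), x.mean()
mu_n = (n * sigma0**2 * xbar + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)
sigma_n2 = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
print(mu_n, sigma_n2)                        # mu_n near 3, small posterior variance

# The predictive density p(x|D) is then N(mu_n, sigma^2 + sigma_n^2).
```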
Univariate Case: 𝑝(𝑥|𝐷)

• Having obtained the a posteriori density for the mean, 𝑝(𝜇|𝐷), we obtain the "class-conditional" density 𝑝(𝑥|𝐷):

  𝑝(𝑥|𝐷) = ∫ 𝑝(𝑥|𝜇) 𝑝(𝜇|𝐷) 𝑑𝜇 = 𝑁(𝜇𝑛, 𝜎² + 𝜎𝑛²)

  i.e., the desired density is normal with mean 𝜇𝑛 and variance 𝜎² + 𝜎𝑛².
Multivariate Case

• Here 𝚺 is known, while 𝝁 is unknown; the prior is 𝑝(𝝁) = 𝑁(𝝁0, 𝚺0), with 𝝁0 and 𝚺0 known. We have observed a set of samples {𝒙𝑖}, 𝑖 = 1, … , 𝑛.
• The posterior is

  𝑝(𝝁|𝐷) = 𝛼 ∏𝑖 𝑝(𝒙𝑖|𝝁) 𝑝(𝝁)

• Assuming the Gaussian form again, 𝑝(𝝁|𝐷) = 𝑁(𝝁𝑛, 𝚺𝑛), we can compare the above definition with this parametric form and equate coefficients.
Multivariate Case

• Equating the corresponding coefficients gives

  𝚺𝑛⁻¹ = 𝑛𝚺⁻¹ + 𝚺0⁻¹        𝚺𝑛⁻¹𝝁𝑛 = 𝑛𝚺⁻¹𝒙̄𝑛 + 𝚺0⁻¹𝝁0

• where the sample mean is 𝒙̄𝑛 = (1/𝑛) Σ𝑘 𝒙𝑘.
• Solving these equations for the unknown 𝝁𝑛 and 𝚺𝑛 yields

  𝝁𝑛 = 𝚺0 (𝚺0 + (1/𝑛)𝚺)⁻¹ 𝒙̄𝑛 + (1/𝑛)𝚺 (𝚺0 + (1/𝑛)𝚺)⁻¹ 𝝁0
  𝚺𝑛 = 𝚺0 (𝚺0 + (1/𝑛)𝚺)⁻¹ (1/𝑛)𝚺
Multivariate Case

• Given the posterior density 𝑝(𝝁|𝐷), the conditional density 𝑝(𝒙|𝐷) can be computed as

  𝑝(𝒙|𝐷) = 𝑁(𝝁𝑛, 𝚺 + 𝚺𝑛)

  which can be viewed as the density of the sum of a random vector 𝝁 with 𝑝(𝝁|𝐷) = 𝑁(𝝁𝑛, 𝚺𝑛) and an independent random vector 𝒚 with 𝑝(𝒚) = 𝑁(𝟎, 𝚺).
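The following sketch (ours; the covariances and sample values are illustrative) computes 𝝁𝑛, 𝚺𝑛 and the predictive density parameters for this multivariate case:

```python
# Multivariate Bayesian update for an unknown mean with known covariance Sigma
# and prior N(mu0, Sigma0). Illustrative names and values only.
import numpy as np

rng = np.random.default_rng(6)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])        # known noise covariance
mu0 = np.zeros(2)
Sigma0 = 10.0 * np.eye(2)                         # broad prior
X = rng.multivariate_normal([2.0, -1.0], Sigma, size=100)

n, xbar = X.shape[0], X.mean(axis=0)
A = np.linalg.inv(Sigma0 + Sigma / n)             # (Sigma0 + Sigma/n)^-1
mu_n = Sigma0 @ A @ xbar + (Sigma / n) @ A @ mu0
Sigma_n = Sigma0 @ A @ (Sigma / n)

# Predictive density: p(x|D) = N(mu_n, Sigma + Sigma_n)
print(mu_n)
print(Sigma + Sigma_n)
```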
Recursive Bayesian Learning

• Ideally, we want 𝑝(𝒙|𝐷) to converge to 𝑝(𝒙).
• Given 𝐷ⁿ = {𝑥1, … , 𝑥𝑛} (with 𝐷⁰ denoting the empty set), we have

  𝑝(𝐷ⁿ|𝜽) = 𝑝(𝑥𝑛|𝜽) 𝑝(𝐷ⁿ⁻¹|𝜽)

• The corresponding posterior density satisfies the recursion:

  𝑝(𝜽|𝐷ⁿ) = 𝑝(𝑥𝑛|𝜽) 𝑝(𝜽|𝐷ⁿ⁻¹) / ∫ 𝑝(𝑥𝑛|𝜽) 𝑝(𝜽|𝐷ⁿ⁻¹) 𝑑𝜽

• So the iterative process produces the sequence of densities 𝑝(𝜽|𝐷⁰) = 𝑝(𝜽), 𝑝(𝜽|𝑥1), … , 𝑝(𝜽|𝑥1, … , 𝑥𝑛).
• This is our first example of an incremental, or on-line, Bayesian learning method, where learning goes on as the data are collected.
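A minimal sketch of this recursion for the univariate Gaussian mean (our own illustration; 𝜎 known), where each step treats the previous posterior as the prior for the next sample:

```python
# Recursive Bayesian learning for the univariate Gaussian mean (sigma known).
# The final posterior matches the batch formulas for mu_n and sigma_n^2.
import numpy as np

rng = np.random.default_rng(7)
sigma = 1.0
mu_post, var_post = 0.0, 10.0**2          # start from the prior N(mu0, sigma0^2)
x = rng.normal(3.0, sigma, size=50)

for xi in x:                              # one sample at a time
    var_new = 1.0 / (1.0 / var_post + 1.0 / sigma**2)
    mu_post = var_new * (mu_post / var_post + xi / sigma**2)
    var_post = var_new

print(mu_post, var_post)                  # same as the batch mu_n, sigma_n^2
```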
Sufficient Statistics

• In principle, this recursion requires that we preserve all the training points in 𝐷ⁿ⁻¹ in order to calculate 𝑝(𝜽|𝐷ⁿ), but for some distributions just a few parameters associated with 𝑝(𝜽|𝐷ⁿ⁻¹) contain all the information needed. Such parameters are the sufficient statistics of those distributions.
• When the recursion can be carried out with sufficient statistics alone, it is sometimes called true recursive Bayesian learning.
When do Maximum Likelihood and Bayes methods differ?

• In virtually every case, maximum likelihood and Bayes solutions are equivalent in the asymptotic limit of infinite training data.
• However, since practical pattern recognition problems invariably have a limited set of training data, it is natural to ask when maximum likelihood and Bayes solutions may be expected to differ, and then which we should prefer.
Observation

• Criteria that influence our choice:
• Computational complexity: maximum likelihood methods are often to be preferred, since they require merely differential calculus techniques or a gradient search for 𝜃̂, rather than the possibly complex multidimensional integration needed in Bayesian estimation.
• Interpretability: the maximum likelihood solution is easier to interpret and understand, since it returns the single best model from the set the designer provided (and presumably understands). In contrast, Bayesian methods give a weighted average of models (parameters), often leading to solutions more complicated and harder to understand than those provided by the designer.
Problems of Dimensionality
• In practical multicategory applications, it is not at all unusual to encounter problems involving fifty or a hundred features, particularly if the features are binary valued.
• Two issues to deal with:
  • how classification accuracy depends upon the dimensionality (and the amount of training data);
  • the computational complexity of designing the classifier.
• Different sources of classification error:
  • Bayes error: due to overlapping class-conditional densities (related to the features used).
  • Model error: due to an incorrect model.
  • Estimation error: due to estimation from a finite sample (can be reduced by increasing the amount of training data).
Bayes Error

• For a two-class problem with equal priors and 𝑝(𝒙|𝜔𝑗) = 𝑁(𝝁𝑗, 𝚺), 𝑗 = 1, 2, the Bayes error is

  𝑃(𝑒) = (1/√(2π)) ∫ from 𝑟/2 to ∞ of 𝑒^(−𝑢²/2) 𝑑𝑢

  where 𝑟 is the Mahalanobis distance between the class means, computed as

  𝑟² = (𝝁1 − 𝝁2)ᵀ 𝚺⁻¹ (𝝁1 − 𝝁2)

• Thus, the probability of error decreases as 𝑟 increases, approaching zero as 𝑟 approaches infinity.
• In the conditionally independent case, 𝚺 = diag(𝜎1², … , 𝜎𝑑²) and

  𝑟² = Σ𝑖 ((𝜇𝑖1 − 𝜇𝑖2)/𝜎𝑖)²

• The most useful features are the ones for which the difference between the means is large relative to the standard deviations. However, no feature is useless if its means for the two classes differ.
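A minimal numerical sketch of the error formula above (assuming SciPy; the means and the shared covariance below are our own illustrative values):

```python
# Bayes error for two equal-prior Gaussian classes with a shared covariance,
# via the Mahalanobis distance r between the means:
# P(e) = integral from r/2 to infinity of the standard normal density.
import numpy as np
from scipy.stats import norm

mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.5]])        # shared covariance (assumed)

diff = mu1 - mu2
r = np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)   # Mahalanobis distance
bayes_error = norm.sf(r / 2)                      # upper tail of N(0, 1)
print(r, bayes_error)
```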
Model Error

• Exponential algorithms are generally so complex that for problems of reasonable size we avoid them altogether, and resign ourselves to approximate solutions that can be found by polynomially complex algorithms.
Estimation Error
Overfitting: it frequently happens that the number of available samples is inadequate, and the question of how to proceed arises. Possible remedies:
• Reduce the dimensionality, either by redesigning the feature extractor, by selecting an appropriate subset of the existing features, or by combining the existing features in some way.
• Assume that all 𝑐 classes share the same covariance matrix, and pool the available data.
• Look for a better estimate for 𝚺. If any reasonable a priori estimate 𝚺0 is available, a Bayesian or pseudo-Bayesian estimate of the form 𝜆𝚺0 + (1 − 𝜆)𝚺 might be employed (see the sketch after this list). If 𝚺0 is diagonal, this diminishes the troublesome effects of "accidental" correlations.
• Remove chance correlations heuristically by thresholding the sample covariance matrix. For example, one might assume that all covariances for which the magnitude of the correlation coefficient is not near unity are actually zero. An extreme of this approach is to assume statistical independence, thereby making all the off-diagonal elements zero regardless of empirical evidence to the contrary (an 𝑂(𝑛𝑑) calculation).
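A minimal sketch of the pseudo-Bayesian shrinkage estimate mentioned above (𝜆, the diagonal prior 𝚺0, and the data are our own illustrative choices):

```python
# Shrinkage ("pseudo-Bayesian") covariance estimate:
# lambda * Sigma0 + (1 - lambda) * Sigma_hat, with a diagonal prior Sigma0.
import numpy as np

rng = np.random.default_rng(8)
X = rng.multivariate_normal(np.zeros(5), np.eye(5), size=20)   # few samples, d = 5

sigma_hat = np.cov(X.T, bias=True)                 # noisy MLE covariance
sigma0 = np.diag(np.diag(sigma_hat))               # diagonal prior estimate
lam = 0.5                                          # shrinkage weight in [0, 1]
sigma_shrunk = lam * sigma0 + (1 - lam) * sigma_hat

# Shrinking toward a diagonal matrix damps the accidental off-diagonal
# correlations that appear when n is small relative to d.
print(np.round(sigma_shrunk, 2))
```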
Overfitting Issue
In fitting the points in Fig. 3.4, then, we might consider beginning with a high-order polynomial (e.g., 10th order), and successively smoothing or simplifying our model by eliminating the highest-order terms. While this would in virtually all cases lead to greater error on the "training data," we might expect the generalization to improve.
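Since Fig. 3.4 is not reproduced here, the sketch below uses synthetic data of our own to illustrate the point: lowering the polynomial order typically raises the training error slightly while improving the fit to held-out points.

```python
# Overfitting sketch on synthetic data (not the points of Fig. 3.4):
# compare training and held-out error as the polynomial order decreases.
import numpy as np

rng = np.random.default_rng(9)
x_train = np.linspace(0, 1, 12)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=x_train.size)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

for degree in (10, 6, 3):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_err, 4), round(test_err, 4))
```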
