Maximum Likelihood and Bayesian Parameter Estimation: Chapter 3, DHS
The Challenges we face
General Approach
• The number of available samples always seems too small, and serious problems arise
when the dimensionality of the feature vector 𝒙 is large.
• If we know the number of parameters in advance and our general knowledge about the problem permits us to parameterize the conditional densities, then the severity of these problems can be reduced significantly.
• Suppose, for example, that we can reasonably assume that 𝑝(𝑥|𝜔𝑖 ) is a normal
density with mean 𝜇𝑖 and covariance matrix Σ𝑖 , although we do not know the exact
values of these quantities. This knowledge simplifies the problem from one of
estimating an unknown function 𝑝(𝑥|𝜔𝑖 ) to one of estimating the parameters 𝜇𝑖 and
Σ𝑖 .
• We will consider only the supervised learning case where the true class label for each
sample is known.
Estimation Process we will study
Maximum Likelihood Estimation
• We are given a set 𝐷 = {𝑥1, . . . , 𝑥𝑛 } of independent and identically distributed (i.i.d.) samples drawn from the density 𝑝(𝑥|𝜃).
• We would like to use the training samples in 𝐷 to estimate the unknown parameter vector 𝜃.
• Define the likelihood function 𝐿(𝜃|𝐷) of 𝜃 with respect to 𝐷 as:
$$L(\theta \mid D) = p(D \mid \theta) = p(x_1, \dots, x_n \mid \theta) = \prod_{x_i \in D} p(x_i \mid \theta)$$
• The maximum likelihood estimate (MLE) of 𝜃 is, by definition, the value 𝜃̂ that maximizes 𝐿(𝜃|𝐷) and can be computed as:
$$\hat{\theta} = \arg\max_{\theta} L(\theta \mid D)$$
• Equivalently, and usually more conveniently, we can maximize the log-likelihood:
$$\hat{\theta} = \arg\max_{\theta} \log L(\theta \mid D) = \arg\max_{\theta} \sum_{i} \log p(x_i \mid \theta)$$
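To make the definition concrete, here is a minimal numerical sketch (the exponential model, sample size, and variable names are illustrative assumptions, not from the slides): it maximizes the log-likelihood of an assumed exponential density over a set of i.i.d. samples.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative data: i.i.d. samples from an exponential density p(x|theta) = theta * exp(-theta * x)
rng = np.random.default_rng(0)
D = rng.exponential(scale=1 / 2.5, size=500)  # true theta = 2.5 (illustrative)

def neg_log_likelihood(theta):
    # -log L(theta|D) = -sum_i log p(x_i|theta), with log p(x|theta) = log(theta) - theta * x
    return -np.sum(np.log(theta) - theta * D)

# theta_hat = arg max log L(theta|D), found by minimizing the negative log-likelihood
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100), method="bounded")
theta_hat = result.x
print(theta_hat)  # should be close to the closed-form MLE, 1 / mean(D)
```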
Maximum Likelihood Estimation
• If the number of parameters is p, i.e., 𝜽 = (𝜃1, . . . , 𝜃𝑝 ), define the gradient operator as:
$$\nabla_{\theta} = \left[\frac{\partial}{\partial \theta_1}, \dots, \frac{\partial}{\partial \theta_p}\right]^{T}$$
• The maximum likelihood estimate of 𝜽 then satisfies the necessary condition:
$$\nabla_{\theta} \log L(\theta \mid D) = \sum_{i} \nabla_{\theta} \log p(x_i \mid \theta) = 0$$
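As a worked instance of this condition, the standard textbook case of a univariate normal density with known variance 𝜎² and unknown mean reduces to the sample mean:

```latex
% Necessary condition \nabla_\theta \log L(\theta|D) = 0 for p(x|\mu) = N(\mu, \sigma^2), \sigma^2 known
\[
\log L(\mu \mid D) = \sum_{k=1}^{n} \log p(x_k \mid \mu)
                   = -\frac{n}{2}\log(2\pi\sigma^2)
                     - \frac{1}{2\sigma^2}\sum_{k=1}^{n}(x_k - \mu)^2
\]
\[
\frac{\partial}{\partial \mu}\log L(\mu \mid D)
  = \frac{1}{\sigma^2}\sum_{k=1}^{n}(x_k - \mu) = 0
  \quad\Longrightarrow\quad
  \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k .
\]
```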
Observations on MLE
• The MLE is the parameter point for which the observed sample is the most
likely.
• The procedure with partial derivatives may result in several local extrema. We should check each solution individually to identify the global optimum.
• Boundary conditions must also be checked separately for extrema.
• Invariance property: if 𝜃̂ is the MLE of 𝜃, then for any function 𝑓(𝜃), the MLE of 𝑓(𝜃) is 𝑓(𝜃̂).
The Gaussian Case: Unknown 𝜇
• Suppose that the samples are drawn from a multivariate normal population with mean 𝜇 and
covariance matrix Σ
• In this case, we have 𝜽 = {𝝁}, and the maximum likelihood estimate is the sample mean:
$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k$$
• In the more general (and more typical) multivariate normal case, neither the mean 𝝁 nor the covariance matrix 𝚺 is known.
• Consider first the univariate case with 𝜃1 = 𝜇 and 𝜃2 = 𝜎 2. The maximum likelihood estimates are:
$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\sigma}^{2} = \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat{\mu})^{2}$$
The Gaussian Case: Unknown 𝜇 and Σ
• For the multivariate case, the maximum likelihood estimates of 𝝁 and 𝚺 are:
$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^{T}$$
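A minimal sketch of these closed-form estimates in Python (the synthetic data and variable names are illustrative):

```python
import numpy as np

def gaussian_mle(X):
    """Maximum likelihood estimates for a multivariate normal.

    X: (n, d) array of i.i.d. samples.
    Returns the sample mean and the (biased, 1/n) ML covariance estimate.
    """
    n = X.shape[0]
    mu_hat = X.mean(axis=0)                 # mu_hat = (1/n) sum_k x_k
    centered = X - mu_hat
    sigma_hat = centered.T @ centered / n   # Sigma_hat = (1/n) sum_k (x_k - mu_hat)(x_k - mu_hat)^T
    return mu_hat, sigma_hat

# Example usage on synthetic data
rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 2.0], cov=[[1.0, 0.3], [0.3, 2.0]], size=1000)
print(gaussian_mle(X))
```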
Bias of Estimators
• Bias of an estimator 𝜃መ is the difference between the expected value of 𝜃መ and 𝜃.
• The MLE of 𝜇 is an unbiased estimator for 𝜇 because 𝐸[𝜇̂] = 𝜇.
• The MLE of 𝜎 2 is not an unbiased estimator for 𝜎 2, because the expected value over all data sets of size 𝑛 of the sample variance is not equal to the true variance:
$$E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2}\right] = \frac{n-1}{n}\,\sigma^{2} \neq \sigma^{2}$$
• To measure how well a fitted distribution resembles the sample data (goodness-of-fit), we can
use the Kolmogorov-Smirnov test statistic.
• It is defined as the maximum value of the absolute difference between the cumulative distribution function estimated from the sample and the one calculated from the fitted distribution.
• After estimating the parameters for different distributions, we can compute the Kolmogorov-
Smirnov statistic for each distribution and choose the one with the smallest value as the best fit
to our sample.
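For instance, `scipy.stats.kstest` computes this statistic directly; the sketch below fits two candidate distributions by maximum likelihood and compares their K-S statistics (the data and the candidate set are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=2.0, size=400)   # illustrative sample

candidates = {"norm": stats.norm, "expon": stats.expon}

for name, dist in candidates.items():
    params = dist.fit(data)                       # MLE fit of the candidate distribution
    res = stats.kstest(data, name, args=params)   # K-S statistic against the fitted CDF
    print(name, res.statistic, res.pvalue)        # smaller statistic => better fit
```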
MLE Examples
Bayesian Estimation
• Although the answers we get by this method will generally be nearly identical to those
obtained by maximum likelihood, there is a conceptual difference:
• In maximum likelihood methods we view the true parameter vector we seek, θ, to be fixed.
• In Bayesian learning we consider θ to be a random variable, and training data allows
us to convert a distribution on this variable into a posterior probability density.
Bayesian Estimation
• The computation of the posterior probabilities 𝑃(𝜔𝑖 |𝒙) lies at the heart of Bayesian classification.
Bayes’ formula allows us to compute these probabilities from the prior probabilities 𝑃(𝜔𝑖 ) and the
class-conditional densities 𝑝(𝒙|𝜔𝑖 ).
• Given the sample collection 𝐷, the Bayes formula becomes:
$$P(\omega_i \mid \mathbf{x}, D) = \frac{p(\mathbf{x} \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, D)\, P(\omega_j \mid D)}$$
• We will assume that the a priori probabilities are known or can be obtained from a trivial calculation, so that 𝑃(𝜔𝑖 |𝐷) = 𝑃(𝜔𝑖 ).
• In fact, we can separate the training samples by class into 𝑐 subsets {𝐷1, … , 𝐷𝑐 } and treat each class independently, because samples from class 𝑖 have no influence on 𝑝(𝒙|𝜔𝑗 , 𝐷) for 𝑖 ≠ 𝑗.
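A minimal numerical sketch of this per-class computation, assuming the class-conditional densities have already been estimated from the subsets 𝐷𝑖 (the Gaussian parameters and priors below are placeholders):

```python
import numpy as np
from scipy.stats import norm

# Assumed, illustrative class-conditional densities p(x|omega_i, D_i) and priors P(omega_i)
class_conditionals = [norm(loc=0.0, scale=1.0), norm(loc=3.0, scale=1.5)]
priors = np.array([0.6, 0.4])

def posterior(x):
    # Bayes formula: P(omega_i|x, D) is proportional to p(x|omega_i, D_i) * P(omega_i)
    likelihoods = np.array([d.pdf(x) for d in class_conditionals])
    unnormalized = likelihoods * priors
    return unnormalized / unnormalized.sum()

print(posterior(1.2))   # posterior probabilities for each class at x = 1.2
```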
Parameter Distribution
• Although the desired density 𝑝(𝒙) is unknown, we assume it has a known parametric form; that is, the function 𝑝(𝒙|𝜽) is completely known once 𝜽 is specified.
• Any information we might have about 𝜽 prior to observing the samples is assumed to be contained in a known prior density 𝑝(𝜽).
• Observation of the samples converts this to a posterior density 𝑝(𝜽|𝐷), which, we
hope, is sharply peaked about the true value of 𝜽.
MLE vs. Bayes Estimates
Gaussian Case
• We want to calculate the posterior density 𝑝(𝜽|𝐷) and, from it, the desired density 𝑝(𝒙|𝐷).
• Assume that 𝑝(𝒙|𝜽) is normal, 𝑁(𝝁, 𝚺), with 𝜽 = [𝝁, 𝚺].
• Univariate case: 𝑝(𝑥|𝜇) ∼ 𝑁(𝜇, 𝜎²) with 𝜎² known; only 𝜇 is unknown, and its prior 𝑝(𝜇) ∼ 𝑁(𝜇0 , 𝜎0²) is known (both 𝜇0 and 𝜎0² are given).
• Intuitively, 𝜇0 represents our best prior guess for 𝜇 and 𝜎0² represents the uncertainty in this guess.
• Suppose now that {𝑥1, … , 𝑥𝑛 } are independently drawn from the resulting population. Using the Bayes formula,
$$p(\mu \mid D) = \frac{p(D \mid \mu)\, p(\mu)}{\int p(D \mid \mu)\, p(\mu)\, d\mu} = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu),$$
where 𝛼 is a normalization factor that does not depend on 𝜇.
Gaussian Case
• Since 𝑝(𝑥𝑘 |𝜇) ∼ 𝑁(𝜇, 𝜎²) and 𝑝(𝜇) ∼ 𝑁(𝜇0 , 𝜎0²), we have
$$p(\mu \mid D) = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu).$$
• Collecting terms in the exponent, we compute that 𝑝(𝜇|𝐷) is again a normal density, 𝑁(𝜇𝑛 , 𝜎𝑛²).
• Identifying the coefficients, with sample mean $\bar{x}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$, and simplifying:
$$\mu_n = \left(\frac{n\sigma_0^{2}}{n\sigma_0^{2} + \sigma^{2}}\right)\bar{x}_n + \frac{\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}\,\mu_0, \qquad \sigma_n^{2} = \frac{\sigma_0^{2}\,\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}$$
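These closed-form expressions are easy to check numerically; a minimal sketch, assuming a known 𝜎² and illustrative prior values:

```python
import numpy as np

def gaussian_mean_posterior(x, mu0, sigma0_sq, sigma_sq):
    """Posterior N(mu_n, sigma_n^2) over the mean, for known variance sigma_sq
    and prior N(mu0, sigma0_sq)."""
    n = len(x)
    x_bar = np.mean(x)
    mu_n = (n * sigma0_sq / (n * sigma0_sq + sigma_sq)) * x_bar \
         + (sigma_sq / (n * sigma0_sq + sigma_sq)) * mu0
    sigma_n_sq = sigma0_sq * sigma_sq / (n * sigma0_sq + sigma_sq)
    return mu_n, sigma_n_sq

# Illustrative data with sigma^2 = 1 assumed known
rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.0, size=50)
print(gaussian_mean_posterior(x, mu0=0.0, sigma0_sq=4.0, sigma_sq=1.0))
```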
Gaussian Distribution: Observations
• 𝜇0 is our best prior guess and 𝜎0² is the uncertainty about this guess.
• 𝜇𝑛 is our best guess after observing D and 𝜎𝑛2 is the uncertainty about this guess.
• 𝜇𝑛 always lies between the sample mean 𝑥̄𝑛 and 𝜇0 , with coefficients that are non-negative and sum to one.
• If 𝜎0 ≠ 0, then 𝜇𝑛 approaches the sample mean as 𝑛 approaches infinity.
• If 𝜎0 = 0, we have a degenerate case in which our a priori certainty that 𝜇 = 𝜇0 is so strong that
no number of observations can change our opinion
• If instead 𝜎0 ≫ 𝜎, we are so uncertain about our a priori guess that we take 𝜇𝑛 ≈ 𝑥̄𝑛 , using only the samples to estimate 𝜇.
• In general, the relative balance between prior knowledge and empirical data is set by the ratio of
𝜎 2 to 𝜎02, which is sometimes called the dogmatism. If the dogmatism is not infinite, after enough
samples are taken the exact values assumed for 𝜇0 and 𝜎02 will be unimportant, and 𝜇𝑛 will
converge to the sample mean.
Univariate Case: 𝑝(𝑥|𝐷)
• Having obtained the a posteriori density for the mean, 𝑝(𝜇|𝐷), we obtain the “class-conditional” density 𝑝(𝑥|𝐷):
$$p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu \;\sim\; N(\mu_n,\; \sigma^{2} + \sigma_n^{2})$$
• That is, 𝑝(𝑥|𝐷) is normal with mean 𝜇𝑛 and variance 𝜎² + 𝜎𝑛²: the known variance is increased by the remaining uncertainty about the mean.
Multivariate Case
• The analysis for the multivariate case with known 𝚺 and unknown 𝝁 parallels the univariate case. With prior 𝑝(𝝁) ∼ 𝑁(𝝁0 , 𝚺0 ), the posterior 𝑝(𝝁|𝐷) is again normal, 𝑁(𝝁𝑛 , 𝚺𝑛 ), with
$$\mu_n = \Sigma_0\!\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\bar{x}_n + \tfrac{1}{n}\Sigma\!\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\mu_0, \qquad \Sigma_n = \Sigma_0\!\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\tfrac{1}{n}\Sigma$$
• The resulting density for 𝒙 is again normal: 𝑝(𝒙|𝐷) ∼ 𝑁(𝝁𝑛 , 𝚺 + 𝚺𝑛 ).
Recursive Bayesian Learning
• Bayesian learning can be carried out incrementally. Letting 𝐷𝑛 = {𝑥1, … , 𝑥𝑛 } denote the first 𝑛 samples, with 𝑝(𝜽|𝐷0 ) = 𝑝(𝜽),
$$p(\theta \mid D_n) = \frac{p(x_n \mid \theta)\, p(\theta \mid D_{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid D_{n-1})\, d\theta}$$
• In principle, it requires that we preserve all the training points in 𝐷𝑛−1 in order to calculate p(θ|Dn)
but for some distributions, just a few parameters associated with 𝑝(𝜽|𝐷𝑛−1) contain all the
information needed. Such parameters are the sufficient statistics of those distributions.
• This incremental approach is sometimes also called true recursive Bayesian learning.
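For the Gaussian-mean case above, the running posterior (𝜇𝑛 , 𝜎𝑛²) is exactly such a sufficient summary, so the recursion can be carried out one sample at a time without storing past data; a minimal sketch under the same known-variance assumption:

```python
def recursive_gaussian_update(mu0, sigma0_sq, sigma_sq, stream):
    """Update the posterior N(mu_n, sigma_n^2) over the mean one sample at a time.

    The current (mu_n, sigma_n_sq) summarizes D_{n-1}; no past samples are stored."""
    mu_n, sigma_n_sq = mu0, sigma0_sq
    for x in stream:
        # Treat the current posterior as the prior for the next observation
        denom = sigma_n_sq + sigma_sq
        mu_n = (sigma_n_sq * x + sigma_sq * mu_n) / denom
        sigma_n_sq = sigma_n_sq * sigma_sq / denom
    return mu_n, sigma_n_sq

# Illustrative stream of observations with known sigma^2 = 1 and prior N(0, 4)
print(recursive_gaussian_update(mu0=0.0, sigma0_sq=4.0, sigma_sq=1.0,
                                stream=[1.8, 2.4, 2.1, 1.9]))
```

Iterating this one-sample update reproduces the batch formulas for 𝜇𝑛 and 𝜎𝑛² given earlier.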
When do Maximum Likelihood and Bayes methods differ?
• In virtually every case, maximum likelihood and Bayes solutions are equivalent in the asymptotic limit of infinite training data.
• However, since practical pattern recognition problems invariably have a limited set of training data, it is natural to ask when maximum likelihood and Bayes solutions may be expected to differ, and then which we should prefer.
Observation
Problems of Dimensionality
• In practical multicategory applications, it is not at all unusual to encounter problems involving
fifty or a hundred features, particularly if the features are binary valued.
• Two issues to deal with:
• how classification accuracy depends upon the dimensionality (and amount of training data);
• the computational complexity of designing the classifier.
• Different sources of classification error:
• Bayes error: due to overlapping class-conditional densities (related to the features used).
• Model error: due to incorrect model.
• Estimation error: due to estimation from a finite sample (can be reduced by increasing the
amount of training data).
Bayes Error
• For the two-class multivariate normal case with equal covariance matrices and equal priors, the Bayes error is
$$P(e) = \frac{1}{\sqrt{2\pi}}\int_{r/2}^{\infty} e^{-u^{2}/2}\, du, \qquad r^{2} = (\mu_1 - \mu_2)^{T}\Sigma^{-1}(\mu_1 - \mu_2),$$
where 𝑟 is the Mahalanobis distance between the class means; the error decreases monotonically as 𝑟 increases.
Model Error
• Exponential algorithms are generally so complex that for reasonable size cases we avoid them
altogether and resign ourselves to approximate solutions that can be found by polynomially
complex algorithms.
Estimation Error
Overfitting: It frequently happens that the number of available samples is inadequate, and the
question of how to proceed arises.
• Reduce the dimensionality, either by redesigning the feature extractor, by selecting an appropriate subset of the existing features, or by combining the existing features in some way.
• assume that all 𝑐 classes share the same covariance matrix, and pool the available data.
• look for a better estimate for Σ. If any reasonable a priori estimate Σ0 is available, a
Bayesian or pseudo-Bayesian estimate of the form λΣ0 + (1 − λ)Σ might be employed.
If Σ0 is diagonal, this diminishes the troublesome effects of “accidental” correlations.
• remove chance correlations heuristically by thresholding the sample covariance matrix. For example, one might assume that all covariances for which the magnitude of the correlation coefficient is not near unity are actually zero. An extreme of this approach is to assume statistical independence, thereby setting all the off-diagonal elements to zero regardless of empirical evidence to the contrary (an 𝑂(𝑛𝑑) calculation).
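A small sketch of the pseudo-Bayesian shrinkage estimate 𝜆Σ0 + (1 − 𝜆)Σ with a diagonal Σ0 built from the sample variances (the value of 𝜆 and the data are illustrative assumptions):

```python
import numpy as np

def shrinkage_covariance(X, lam=0.3):
    """Shrink the sample covariance toward a diagonal prior estimate Sigma_0.

    X: (n, d) samples; lam: shrinkage weight in [0, 1] (illustrative choice)."""
    sigma_hat = np.cov(X, rowvar=False)             # sample covariance Sigma_hat
    sigma_0 = np.diag(np.diag(sigma_hat))           # diagonal prior: keep variances, drop covariances
    return lam * sigma_0 + (1.0 - lam) * sigma_hat  # lambda * Sigma_0 + (1 - lambda) * Sigma_hat

# Few samples relative to the dimension, where "accidental" correlations are troublesome
rng = np.random.default_rng(4)
X = rng.multivariate_normal(mean=np.zeros(3), cov=np.eye(3), size=20)
print(shrinkage_covariance(X))
```

With few samples the shrunken estimate is better conditioned than the raw sample covariance, which is the point of the pseudo-Bayesian form above.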
Overfitting Issue
In fitting the points in Fig. 3.4, then, we might consider beginning with a high-order polynomial (e.g., 10th order), and successively smoothing or simplifying our model by eliminating the highest-order terms. While this would in virtually all cases lead to greater error on the “training data,” we might expect the generalization to improve.
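The effect is easy to reproduce; the sketch below uses synthetic points standing in for Fig. 3.4 and compares training error with error on held-out points as the polynomial order is reduced:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 12)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)   # noisy points, stand-in for Fig. 3.4
x_test = np.linspace(0.0, 1.0, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (10, 6, 3):                                   # successively simpler models
    coeffs = np.polyfit(x, y, deg=degree)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, train_err, test_err)   # training error grows, generalization typically improves
```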