Patt. Rec. and Mach. Learning
Ch. 1: Introduction

Radu Horaud & Pierre Mahé

September 28, 2007

Radu Horaud & Pierre Mahé Patt. Rec. and Mach. Learning Ch. 1: Introduction
Chapter content

I Goals, terminology, scope of the book;


I 1.1 Example: Polynomial curve fitting;
I 1.2 Probability theory;
I 1.3 Model selection;
I 1.4 The curse of dimensionality;
I 1.5 Decision theory;
I 1.6 Information theory.

Goals

Pattern recognition: the automatic discovery of regularities in data and the use of these regularities to take actions, such as classifying the data into different categories.
Example: handwritten digit recognition. Input: a vector x of pixel values. Output: a digit from 0 to 9.
Machine learning: a large set of input vectors x1 , . . . , xN , called the training set, is used to tune the parameters of an adaptive model. The category of each input vector is expressed using a target vector t.
The result of a machine learning algorithm: a function y(x) whose output y is encoded in the same way as the target vectors.

Terminology

I training or learning phase: determine y(x) on the basis of the training data,
I test set, generalization,
I supervised learning (input/target vectors in the training data),
I classification (discrete categories) or regression (continuous variables),
I unsupervised learning (no target vectors in the training data), e.g. clustering or density estimation,
I reinforcement learning, credit assignment, exploration, exploitation.

1.1 Polynomial curve fitting

I Training set: x ≡ (x1 , . . . , xN ) and t ≡ (t1 , . . . , tN )
I Goal: predict the target t̂ for some new input x̂
I Probability theory allows us to express the uncertainty of the target.
I Decision theory allows us to make optimal predictions.

I Minimize:
\[ E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 \]
I The case of a polynomial function, linear in w:
\[ y(x_n, \mathbf{w}) = \sum_{j=0}^{M} w_j x_n^j \]

I Model selection: choosing M .


I Regularization (adding a penalty term):
\[ \tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2 \]

I This can be expressed in the Bayesian framework as maximum a posteriori (MAP) estimation with a Gaussian prior on w.
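
The two error functions above can be minimized in closed form because y is linear in w. The following is a minimal sketch, not the authors' code: the function names, the noisy sin(2πx) training set, the order M = 3 and the value of λ are all assumptions chosen for illustration, and the linear system is solved with a plain Gaussian elimination to keep the example dependency-free.

```python
import math
import random

def solve_linear(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (aug[i][n] - tail) / aug[i][i]
    return w

def fit_polynomial(xs, ts, order, lam=0.0):
    """Return w minimising 1/2 sum_n {y(x_n, w) - t_n}^2 + lam/2 ||w||^2."""
    # Setting the gradient to zero gives (Phi^T Phi + lam I) w = Phi^T t,
    # where Phi[n][j] = x_n ** j.
    phi = [[x ** j for j in range(order + 1)] for x in xs]
    A = [[sum(row[i] * row[j] for row in phi) + (lam if i == j else 0.0)
          for j in range(order + 1)] for i in range(order + 1)]
    b = [sum(row[i] * t for row, t in zip(phi, ts)) for i in range(order + 1)]
    return solve_linear(A, b)

def predict(w, x):
    """Evaluate y(x, w) = sum_j w_j x^j."""
    return sum(w_j * x ** j for j, w_j in enumerate(w))

def data_error(w, xs, ts):
    """Sum-of-squares error E(w), without the penalty term."""
    return 0.5 * sum((predict(w, x) - t) ** 2 for x, t in zip(xs, ts))

# Hypothetical training set: noisy samples of sin(2 pi x) on [0, 1].
random.seed(0)
xs = [n / 9 for n in range(10)]
ts = [math.sin(2 * math.pi * x) + random.gauss(0.0, 0.3) for x in xs]

w_ml = fit_polynomial(xs, ts, order=3)             # lam = 0: plain least squares
w_reg = fit_polynomial(xs, ts, order=3, lam=1e-2)  # penalised fit
print("unregularised:", [round(v, 3) for v in w_ml])
print("regularised:  ", [round(v, 3) for v in w_reg])
```

The penalised fit trades a slightly larger training error for a smaller coefficient norm, which is the effect the penalty term is meant to have.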

1.2 Probability theory (discrete random variables)

I Sum rule:
\[ p(X) = \sum_{Y} p(X, Y) \]
I Product rule:

p(X, Y ) = p(X|Y )p(Y ) = p(Y |X)p(X)

I Bayes:
\[ p(Y|X) = \frac{p(X|Y)\,p(Y)}{\sum_{Y} p(X|Y)\,p(Y)} \]
\[ \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{normalization}} \]
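
The three rules can be checked numerically on a small table. This is a sketch under stated assumptions: the joint distribution below is made up for illustration, and the function names are hypothetical.

```python
# Hypothetical joint distribution p(X, Y), chosen only for illustration.
joint = {
    (0, "a"): 0.1, (0, "b"): 0.3,
    (1, "a"): 0.4, (1, "b"): 0.2,
}
x_values = sorted({x for x, _ in joint})
y_values = sorted({y for _, y in joint})

def p_x(x):
    """Sum rule: p(X) = sum_Y p(X, Y)."""
    return sum(joint[(x, y)] for y in y_values)

def p_y(y):
    """Sum rule applied to the other variable."""
    return sum(joint[(x, y)] for x in x_values)

def p_x_given_y(x, y):
    """Product rule rearranged: p(X|Y) = p(X, Y) / p(Y)."""
    return joint[(x, y)] / p_y(y)

def p_y_given_x(y, x):
    """Bayes: posterior = likelihood * prior / normalization."""
    normalization = sum(p_x_given_y(x, yy) * p_y(yy) for yy in y_values)
    return p_x_given_y(x, y) * p_y(y) / normalization

print(p_y_given_x("a", 1))
```

Note that the normalization computed inside `p_y_given_x` is just p(X) again, obtained through the sum and product rules.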

1.2.1 Probability densities (continuous random variables)

I Probability that x lies in an interval:
\[ p(x \in (a, b)) = \int_a^b p(x)\,dx \]

I p(x) is called the probability density over x.


I p(x) ≥ 0, p(x ∈ (−∞, ∞)) = 1
I nonlinear change of variable x = g(y):
\[ p_y(y) = p_x(x) \left| \frac{dx}{dy} \right| = p_x(g(y))\,|g'(y)| \]
I cumulative distribution function: P (z) = p(x ∈ (−∞, z))
I sum and product rules extend to probability densities.

1.2.2 Expectations and covariances

I Expectation: the average value of some function f(x) under a probability distribution p(x);
I discrete case: $E[f] = \sum_x p(x) f(x)$
I continuous case: $E[f] = \int p(x) f(x)\,dx$
I given N points drawn from the probability distribution or probability density, the expectation can be approximated by:
\[ E[f] \approx \frac{1}{N} \sum_{n=1}^{N} f(x_n) \]
I functions of several variables: $E_x[f] = \sum_x p(x) f(x, y)$ (MODIFIED)
I conditional expectation: $E_x[f|y] = \sum_x p(x|y) f(x)$
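
The finite-sample approximation of E[f] is easy to try out. A minimal sketch, assuming f(x) = x² and a uniform density on (0, 1) as the example (neither is from the slides); the exact value in that case is the integral of x² over (0, 1), i.e. 1/3.

```python
import random

def monte_carlo_expectation(f, sampler, n_samples):
    """Approximate E[f] by (1/N) sum_n f(x_n), with x_n drawn by `sampler`."""
    return sum(f(sampler()) for _ in range(n_samples)) / n_samples

random.seed(0)
# Example: f(x) = x^2 under the uniform density on (0, 1).
estimate = monte_carlo_expectation(lambda x: x * x, random.random, 100_000)
print(estimate)
```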

I Variance of f (x): a measure of the variations of f (x) around
E[f ].
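I $\mathrm{var}[f] = E[f^2] - E[f]^2$
I $\mathrm{var}[x] = E[x^2] - E[x]^2$
I Covariance for two random variables:
$\mathrm{cov}[x, y] = E_{x,y}[xy] - E[x]E[y]$
I Two vectors of random variables:
$\mathrm{cov}[\mathbf{x}, \mathbf{y}] = E_{\mathbf{x},\mathbf{y}}[\mathbf{x}\mathbf{y}^\top] - E[\mathbf{x}]E[\mathbf{y}^\top]$
I ADDITIONAL FORMULA:
\[ E_{x,y}[f(x, y)] = \sum_x \sum_y p(x, y) f(x, y) \]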

1.2.3 Bayesian probabilities
I frequentist versus Bayesian interpretation of probabilities;
I frequentist estimator: maximum likelihood (MLE or ML);
I Bayesian estimator: MLE and maximum a posteriori (MAP);
I back to curve fitting: D = {t1 , . . . , tN } is a set of N
observations of N random variables, and w is the vector of
unknown parameters.
I Bayes' theorem reads in this case: $p(\mathbf{w}|D) = \frac{p(D|\mathbf{w})\,p(\mathbf{w})}{p(D)}$
I posterior ∝ likelihood × prior (all these quantities are parameterized by w)
I p(D|w) is the likelihood function; it expresses how probable the observed data set is for various values of w. It is not a probability distribution over w.
I The denominator:
\[ p(D) = \int \cdots \int p(D|\mathbf{w})\,p(\mathbf{w})\,d\mathbf{w} \]

1.2.4 The Gaussian distribution
I The Gaussian distribution of a single real-valued variable x:
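\[ \mathcal{N}(x|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2}(x - \mu)^2 \right\} \]
I in D dimensions: $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) : \mathbb{R}^D \to \mathbb{R}$
I $E[x] = \mu$, $\mathrm{var}[x] = \sigma^2$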
I x = (x1 , . . . , xN ) is a set of N observations of the SAME
scalar variable x
I Assume that this data set is independent and identically
distributed:
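\[ p(x_1, \dots, x_N | \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n | \mu, \sigma^2) \]
I max p is equivalent to max ln(p) or min(− ln(p))
I log-likelihood:
\[ \ln p(x_1, \dots, x_N | \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \dots \]
I maximum likelihood solution: $\mu_{ML}$ and $\sigma^2_{ML}$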
I MLE underestimates the variance: bias
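
The bias of the maximum likelihood variance can be observed by simulation. The sketch below relies on the standard closed forms (sample mean, and mean squared deviation divided by N) and on the standard result that the expected value of the ML variance is (N − 1)/N times σ²; these are not spelled out on the slide. The sample size N = 2, the number of trials and the true parameters are assumptions.

```python
import random

def gaussian_mle(xs):
    """Return (mu_ML, sigma2_ML) for i.i.d. scalar observations."""
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n  # divides by N, not N - 1
    return mu, sigma2

random.seed(0)
true_mu, true_sigma, n_points, n_trials = 0.0, 1.0, 2, 20_000
avg_sigma2 = 0.0
for _ in range(n_trials):
    sample = [random.gauss(true_mu, true_sigma) for _ in range(n_points)]
    avg_sigma2 += gaussian_mle(sample)[1] / n_trials

# With N = 2 the average estimate should be close to (N - 1)/N * sigma^2 = 0.5,
# i.e. well below the true variance of 1.
print(avg_sigma2)
```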
1.2.5 Curve fitting re-visited
I training data: x = (x1 , . . . , xN ) and t = (t1 , . . . , tN )
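I it is assumed that t is Gaussian:
\[ p(t|x, \mathbf{w}, \beta) = \mathcal{N}(t | y(x, \mathbf{w}), \beta^{-1}) \]
I recall that: $y(x, \mathbf{w}) = w_0 + w_1 x + \dots + w_M x^M$
I (joint) likelihood function:
\[ p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n | y(x_n, \mathbf{w}), \beta^{-1}) \]
I log-likelihood:
\[ \ln p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ t_n - y(x_n, \mathbf{w}) \}^2 + \frac{N}{2} \ln \beta - \dots \]
I $\beta = 1/\sigma^2$ is called the precision.
I The ML solution can be used as a predictive distribution:
\[ p(t|x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}(t | y(x, \mathbf{w}_{ML}), \beta_{ML}^{-1}) \]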

Introducing a prior distribution

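I The polynomial coefficients are treated as random variables with a Gaussian distribution taken over a vector of dimension M + 1:
\[ p(\mathbf{w}|\alpha) = \mathcal{N}(\mathbf{w}|\mathbf{0}, \alpha^{-1}\mathbf{I}) = \left( \frac{\alpha}{2\pi} \right)^{(M+1)/2} \exp\left\{ -\frac{\alpha}{2} \mathbf{w}^\top \mathbf{w} \right\} \]
I from Bayes we get the posterior probability:
\[ p(\mathbf{w}|\mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta)\,p(\mathbf{w}|\alpha) \]
I maximum posterior or MAP. We take the negative logarithm, we throw out constant terms and we get:
\[ \frac{\beta}{2} \sum_{n=1}^{N} \{ t_n - y(x_n, \mathbf{w}) \}^2 + \frac{\alpha}{2} \mathbf{w}^\top \mathbf{w} \]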
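
The link between this MAP objective and the regularized error of Section 1.1 can be made explicit in one step. The following worked equation is a sketch that only uses the quantities already defined above; "const" collects the terms that do not depend on w.

```latex
\begin{align*}
-\ln p(\mathbf{w}|\mathbf{x},\mathbf{t},\alpha,\beta)
  &= \frac{\beta}{2}\sum_{n=1}^{N}\{t_n - y(x_n,\mathbf{w})\}^2
     + \frac{\alpha}{2}\mathbf{w}^\top\mathbf{w} + \text{const} \\
  &= \beta\left[\frac{1}{2}\sum_{n=1}^{N}\{y(x_n,\mathbf{w}) - t_n\}^2
     + \frac{\lambda}{2}\|\mathbf{w}\|^2\right] + \text{const},
  \qquad \lambda = \frac{\alpha}{\beta}
\end{align*}
```

Since β > 0, maximizing the posterior is the same as minimizing the regularized sum-of-squares error with λ = α/β.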

1.2.6 Bayesian curve fitting

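I Apply the correct Bayes formula:
\[ p(\mathbf{w}|\mathbf{x}, \mathbf{t}, \alpha, \beta) = \frac{p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta)\,p(\mathbf{w}|\alpha)}{p(\mathbf{t}|\mathbf{x})} = \frac{p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta)\,p(\mathbf{w}|\alpha)}{\int p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta)\,p(\mathbf{w}|\alpha)\,d\mathbf{w}} \]
I Section 3.3: the posterior distribution is a Gaussian and can be evaluated analytically.
I the sum and product rules can be used to compute the predictive distribution:
\[ p(t|x, \mathbf{x}, \mathbf{t}) = \int p(t|x, \mathbf{w})\,p(\mathbf{w}|\mathbf{x}, \mathbf{t})\,d\mathbf{w} = \mathcal{N}(t | m(x), s^2(x)) \]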

1.3 Model selection

I which order of the polynomial is optimal, i.e. gives the best generalization?
I train a range of models and test them on an independent validation set
I cross-validation: split the data into S groups, train on S − 1 of them, assess the performance on the held-out group, and repeat so that each group is held out once
I Akaike information criterion: choose the model that maximizes ln p(D|wML ) − M , where M is the number of adjustable parameters
I Bayesian information criterion (BIC), section 4.4.1.
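
Cross-validation for model selection can be sketched in a few lines. Everything named here is hypothetical: the helper functions, the two candidate models (order 0 and order 1, fitted in closed form) and the noise-free linear data are assumptions chosen so that the outcome is easy to predict.

```python
def k_fold_splits(n_points, n_folds):
    """Yield (train_idx, val_idx); every point is held out exactly once."""
    indices = list(range(n_points))
    for fold in range(n_folds):
        val_idx = indices[fold::n_folds]
        held_out = set(val_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, val_idx

def cross_validate(xs, ts, fit, n_folds):
    """Mean squared error on held-out points, averaged over all folds."""
    total = 0.0
    for train_idx, val_idx in k_fold_splits(len(xs), n_folds):
        model = fit([xs[i] for i in train_idx], [ts[i] for i in train_idx])
        total += sum((model(xs[i]) - ts[i]) ** 2 for i in val_idx)
    return total / len(xs)

def fit_constant(xs, ts):
    """Order-0 model: predict the mean target."""
    mean_t = sum(ts) / len(ts)
    return lambda x: mean_t

def fit_line(xs, ts):
    """Order-1 model: closed-form least-squares line."""
    mean_x, mean_t = sum(xs) / len(xs), sum(ts) / len(ts)
    slope = (sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
             / sum((x - mean_x) ** 2 for x in xs))
    return lambda x: mean_t + slope * (x - mean_x)

# Hypothetical noise-free data on a line, so the order-1 model should win.
xs = [float(n) for n in range(10)]
ts = [2.0 * x + 1.0 for x in xs]
scores = {"constant": cross_validate(xs, ts, fit_constant, n_folds=5),
          "line": cross_validate(xs, ts, fit_line, n_folds=5)}
best = min(scores, key=scores.get)
print(best, scores)
```

The key point is that each model is always scored on points it was not trained on.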

1.4 The curse of dimensionality

I curse: malédiction, fléau ...


I polynomial fitting: replace x by a vector x of dimension D.
The number of unknowns grows like $D^M$.
I Not all the intuitions developed in spaces of low
dimensionality will generalize to spaces of many dimensions
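
Both points can be made concrete with two small formulas that are standard but not stated on the slide, so treat them as assumptions of this sketch: a polynomial of order M in D variables has as many coefficients as there are monomials of total degree at most M, namely the binomial coefficient C(D + M, M), and the fraction of a unit ball's volume lying within ε of its surface is 1 − (1 − ε)^D.

```python
from math import comb

def n_polynomial_coefficients(dim, order):
    """Number of monomials of total degree <= order in `dim` variables."""
    return comb(dim + order, order)

def shell_fraction(dim, eps):
    """Fraction of a unit ball's volume lying within eps of its surface."""
    return 1.0 - (1.0 - eps) ** dim

for dim in (1, 10, 100):
    print(dim, n_polynomial_coefficients(dim, 3), round(shell_fraction(dim, 0.1), 3))
```

For a fixed order the coefficient count grows polynomially in D, and in high dimensions almost all of the volume sits in a thin shell near the surface, which is one of the low-dimensional intuitions that fails.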

Section 1.5: Decision theory

Decision theory - introduction

I The decision problem:


I given x, predict t according to a probabilistic model p(x, t)
I For now: binary classification: t ∈ {0, 1} ⇔ {C1 , C2 }
I Important quantity: p(Ck |x)

\[ p(C_k|x) = \frac{p(x, C_k)}{p(x)} = \frac{p(x, C_k)}{\sum_{j=1}^{2} p(x, C_j)} \]
⇒ getting p(x, Ck ) is the (central!) inference problem
\[ p(C_k|x) = \frac{p(x|C_k)\,p(C_k)}{p(x)} \propto \text{likelihood} \times \text{prior} \]

I Intuition: choose k that maximizes p(Ck |x)?

Decision theory - binary classification
I Decision region: Ri = {x : pred(x) = Ci }
I Probability of misclassification:
\[ p(\text{mis}) = p(x \in R_1, C_2) + p(x \in R_2, C_1) = \int_{R_1} p(x, C_2)\,dx + \int_{R_2} p(x, C_1)\,dx \]
⇒ In order to minimize, assign x to R1 if:
\[ p(x, C_1) > p(x, C_2) \;\Leftrightarrow\; p(C_1|x)\,p(x) > p(C_2|x)\,p(x) \;\Leftrightarrow\; p(C_1|x) > p(C_2|x) \]
I Similarly, for K classes: minimize
\[ \sum_{j} \int_{R_j} \Big( \sum_{k \neq j} p(x, C_k) \Big)\,dx \]
⇒ pred(x) = argmaxk p(Ck |x)
Decision theory - loss-sensitive decision

I Cost/Loss of a decision: Lkj = cost of predicting Cj while the truth is Ck .
I Loss-sensitive decision ⇒ minimize the expected loss:
\[ E[L] = \sum_{j} \int_{R_j} \Big( \sum_{k} L_{kj}\, p(x, C_k) \Big)\,dx \]
I Solution: for each x, choose the class Cj that minimizes:
\[ \sum_{k} L_{kj}\, p(x, C_k) \propto \sum_{k} L_{kj}\, p(C_k|x) \]
⇒ straightforward when we know p(Ck |x)

Decision theory - loss-sensitive decision
I Typical example = medical diagnosis:
I Ck , k ∈ {1, 2} ⇔ {sick, healthy}
I $L = \begin{pmatrix} 0 & 100 \\ 1 & 0 \end{pmatrix}$ ⇒ strong cost of "missing" a diseased person
I Expected loss:
\[ E[L] = \int_{R_2} L_{1,2}\, p(x, C_1)\,dx + \int_{R_1} L_{2,1}\, p(x, C_2)\,dx = 100 \times \int_{R_2} p(x, C_1)\,dx + \int_{R_1} p(x, C_2)\,dx \]
I Note: minimizing the probability of misclassification:
\[ p(\text{mis}) = \int_{R_1} p(x, C_2)\,dx + \int_{R_2} p(x, C_1)\,dx \]
corresponds to minimizing the 0/1 loss: $L = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$
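
The effect of the loss matrix on the decision can be seen on a single input. A minimal sketch: the two loss matrices are the ones from the slides, while the function names and the posterior values for the example patient are assumptions.

```python
def expected_losses(posteriors, loss):
    """Return sum_k L[k][j] * p(C_k|x) for each candidate decision j."""
    n_classes = len(posteriors)
    return [sum(loss[k][j] * posteriors[k] for k in range(n_classes))
            for j in range(n_classes)]

def decide(posteriors, loss):
    """Index j of the class with the smallest expected loss."""
    risks = expected_losses(posteriors, loss)
    return min(range(len(risks)), key=risks.__getitem__)

# Loss matrices from the slides; index 0 = sick (C1), index 1 = healthy (C2).
medical_loss = [[0, 100], [1, 0]]
zero_one_loss = [[0, 1], [1, 0]]

# Hypothetical posterior for one patient: 5% sick, 95% healthy.
posteriors = [0.05, 0.95]
print(decide(posteriors, zero_one_loss))  # picks the most probable class
print(decide(posteriors, medical_loss))   # picks the cheaper mistake
```

Under the 0/1 loss the patient is declared healthy, whereas the asymmetric loss makes "sick" the cheaper decision even at a 5% posterior.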
Decision theory - the ”reject option”

I For the 0/1 loss [1], pred(x) = argmaxk p(Ck |x)
I Note: K classes ⇒ $1/K \le \max_k p(C_k|x) \le 1$
I When $\max_k p(C_k|x) \to 1/K$ the confidence in the prediction decreases.
I classes tend to become equally likely
I "Reject option": make a decision provided $\max_k p(C_k|x) > \sigma$
⇒ the value of σ controls the amount of rejection:
I σ = 1: systematic rejection
I σ < 1/K: no rejection
I Motivation: switch between automatic/human decision
I Illustration in Figure 1.26 page 42

[1] For a general loss matrix, see exercise 1.24
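
The reject rule amounts to a single comparison against the threshold σ. In this sketch the function name, the use of `None` to signal a rejection and the example posteriors are assumptions.

```python
def decide_with_reject(posteriors, threshold):
    """Return the argmax class index, or None when max_k p(C_k|x) <= threshold."""
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return best if posteriors[best] > threshold else None

# Hypothetical posteriors for a 3-class problem (K = 3).
confident = [0.90, 0.06, 0.04]
ambiguous = [0.40, 0.35, 0.25]

print(decide_with_reject(confident, 0.8))
print(decide_with_reject(ambiguous, 0.8))
```

A rejected input (`None`) would then be handed over to a human decision, in line with the motivation given on the slide.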
Decision theory - regression setting
I The regression setting: quantitative target t ∈ R
I Typical regression loss-function: $L(t, y(x)) = (y(x) - t)^2$
I the squared loss
I The decision problem = minimize the expected loss:
\[ E[L] = \int_{\mathcal{X}} \int_{\mathbb{R}} (y(x) - t)^2\, p(x, t)\,dx\,dt \]
I Solution: $y(x) = \int_{\mathbb{R}} t\, p(t|x)\,dt$
I this is known as the regression function
I intuitively appealing: conditional average of t given x
I illustration in figure 1.28, page 47

I Note: general class of loss functions $L(t, y(x)) = |y(x) - t|^q$
I q = 2 is analytically convenient (differentiable and continuous)

Decision theory - regression setting
I Derivation:
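\[ E[L] = \int_{\mathcal{X}} \int_{\mathbb{R}} (y(x) - t)^2\, p(x, t)\,dx\,dt = \int_{\mathcal{X}} \Big[ \int_{\mathbb{R}} (y(x) - t)^2\, p(t|x)\,dt \Big] p(x)\,dx \]
⇒ for each x, find y(x) that minimizes $\int_{\mathbb{R}} (y(x) - t)^2\, p(t|x)\,dt$
I Differentiating with respect to y(x) gives: $2 \int_{\mathbb{R}} (y(x) - t)\, p(t|x)\,dt$
I Setting to zero leads to:
\[ \int_{\mathbb{R}} y(x)\, p(t|x)\,dt = \int_{\mathbb{R}} t\, p(t|x)\,dt \]
and, since $\int_{\mathbb{R}} p(t|x)\,dt = 1$,
\[ y(x) = \int_{\mathbb{R}} t\, p(t|x)\,dt \]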
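
The result can be checked numerically at a single x. This sketch replaces the density p(t|x) by a small discrete distribution, which is an assumption made purely to keep the example short; the values and names are hypothetical.

```python
# Hypothetical discrete stand-in for p(t|x) at one fixed x.
t_values = [0.0, 1.0, 4.0]
t_probs = [0.2, 0.5, 0.3]

def expected_squared_loss(y):
    """sum_t (y - t)^2 p(t|x): the inner integral, for a discrete p(t|x)."""
    return sum(p * (y - t) ** 2 for t, p in zip(t_values, t_probs))

conditional_mean = sum(p * t for t, p in zip(t_values, t_probs))

# Scan a grid of candidate predictions; the best one should sit at the mean.
grid = [i / 100 for i in range(0, 401)]
best_y = min(grid, key=expected_squared_loss)
print(conditional_mean, best_y)
```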

Decision theory - inference and decision

2 (or 3) different approaches to the decision problem:


1. rely on a probabilistic model, with 2 flavours:
1.1 generative:
I use a generative model to infer p(x|Ck )
I combine with priors p(Ck ) to get p(x, Ck ) and eventually
p(Ck |x)
1.2 discriminative: infer directly p(Ck |x)
I this is sufficient for the decision problem

2. learn a discriminant function f (x)


I directly map input to class labels
I for binary classification, f (x) is typically defined as the sign
(+1/-1) of an auxiliary function

(Note: similar discussion for regression)

Decision theory - inference and decision

Pros and Cons:


I probabilistic generative models:
I pros: access to p(x) → easy detection of outliers
I i.e., low-confidence predictions
I cons: estimating the joint probability p(x, Ck ) can be computationally and data demanding
I probabilistic discriminative models:
I pros: less demanding than the generative approach
I see figure 1.27, page 44

I discriminant functions:
I pros: a single learning problem (vs inference + decision)
I cons: no access to p(Ck |x)
I ... whereas having p(Ck |x) has many advantages in practice, e.g. for rejection and model combination (see page 45)

Section 1.6: Information theory

Information theory - Entropy
I Consider a discrete random variable X
I We want to define a measure h(x) of surprise/information of
observing X = x
I Natural requirements:
I if p(x) is low (resp. high), h(x) should be high (resp. low)
I h(x) should be a monotonically decreasing function of p(x)
I if X and Y are unrelated, h(x, y) should be h(x) + h(y)
I i.e., if X and Y are independent, that is p(x, y) = p(x)p(y)

⇒ this leads to h(x) = − log p(x)


I Entropy of the variable X:
\[ H[X] = E[h(X)] = -\sum_{x} p(x) \log p(x) \]

(Convention: p log p = 0 if p = 0)
Information theory - Entropy
Some remarks:
I H[X] ≥ 0 since p ∈ [0, 1] (hence p log p ≤ 0)
I H[X] = 0 if ∃ x s.t. p(x) = 1
I Maximum entropy distribution = uniform distribution
I optimization problem: maximize $H[X] + \lambda \big( \sum_{i} p(x_i) - 1 \big)$
I differentiating w.r.t. p(xi ) shows they must be constant
I hence p(xi ) = 1/M, ∀xi ⇒ H[X] = log(M ), where M is the number of values of X
⇒ we therefore have 0 ≤ H[X] ≤ log(M )

I H[X] = lower bound on the # of bits required to (binarily) encode the values of X (using log2 in the definition of H)
I trivial code of length log2 (M ) (ex: M = 8, messages of size 3)
I no ”clever” coding scheme for uniform distributions
I for non-uniform distributions, optimal coding schemes can be
designed
I high probability values ⇒ short codes
I see illustration in page 50
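
The coding remarks can be reproduced with a few lines. This is a sketch under stated assumptions: the non-uniform distribution over 8 values and the matching code-word lengths are chosen for illustration, and the function name is hypothetical.

```python
import math

def entropy(probs, base=2.0):
    """H[X] = -sum_x p(x) log p(x), with the convention 0 log 0 = 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0.0)

uniform = [1 / 8] * 8
# Hypothetical non-uniform distribution over the same 8 values.
skewed = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 64, 1 / 64, 1 / 64, 1 / 64]
# Word lengths of a prefix code giving shorter words to more probable values.
code_lengths = [1, 2, 3, 4, 6, 6, 6, 6]

print(entropy(uniform))   # log2(8) bits
print(entropy(skewed))
print(sum(p * n for p, n in zip(skewed, code_lengths)))  # mean code length
```

For the skewed distribution the mean code length matches the entropy, which is below the 3 bits needed by the trivial fixed-length code.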
Information theory - Entropy
For continuous random variables:
I differential entropy: $H[X] = -\int p(x) \ln p(x)\,dx$
I because p(x) can be > 1, care must be taken when transposing properties of the discrete entropy
I in particular, it can be negative (if $X \sim U(0, 1/2)$: $H[X] = -\ln 2$)
I Given (µ, σ), maximum entropy distribution $p(x) = \mathcal{N}(x|\mu, \sigma^2)$
I optimization problem: maximize H[X] with µ, σ equality constraints + normalization constraint
I entropy: $H[X] = \frac{1}{2} \{ 1 + \ln(2\pi\sigma^2) \}$
I Conditional entropy of y given x:
\[ H[Y|X] = -\iint p(x, y) \ln p(y|x)\,dx\,dy \]
⇒ we easily have H[X, Y ] = H[Y |X] + H[X]
(natural interpretation with the notion of information)
Information theory - KL divergence
I Kullback-Leibler divergence between distributions p and q:
\[ \mathrm{KL}(p\|q) = -\int p(x) \ln q(x)\,dx - \Big( -\int p(x) \ln p(x)\,dx \Big) = -\int p(x) \ln \frac{q(x)}{p(x)}\,dx \]
I KL(p||q) ≠ KL(q||p) in general
I KL(p||p) = 0
I KL(p||q) ≥ 0 (next slide)

⇒ measures the difference between the "true" distribution p and the distribution q

(Information theory interpretation: amount of additional information required to encode the values of X using q(x) instead of p(x))
Information theory - KL divergence
I A function is convex iff every chord lies on or above the function
I illustration in figure 1.31, page 56
I Jensen's inequality for convex functions:
\[ E[f(x)] \ge f(E[x]) \]
(strict inequality for strictly convex functions, unless x is constant)
I When applied to KL(p||q):
\begin{align*}
\mathrm{KL}(p\|q) &= -\int p(x) \ln \frac{q(x)}{p(x)}\,dx \\
&\ge -\ln \int p(x) \times \frac{q(x)}{p(x)}\,dx \quad \text{(because $-\ln$ is convex)} \\
&= -\ln \int q(x)\,dx = -\ln 1 = 0
\end{align*}
Since − ln is strictly convex, equality requires q(x)/p(x) to be constant, i.e. p = q. Moreover, it is straightforward to see that KL(p||p) = 0
I Conclusion: KL(p||q) ≥ 0, with equality if and only if p = q
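
For discrete distributions the integral becomes a sum, and the properties listed above can be verified directly. In this sketch the two distributions are made up for illustration, and the function assumes q(x) > 0 wherever p(x) > 0.

```python
import math

def kl_divergence(p, q):
    """KL(p||q) = -sum_x p(x) ln(q(x)/p(x)) for discrete distributions."""
    # Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] = 0 contribute 0.
    return -sum(pi * math.log(qi / pi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical distributions over two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, q), kl_divergence(q, p), kl_divergence(p, p))
```

The first two values are positive and differ from each other (the divergence is not symmetric), while the third is zero.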
Information theory - KL divergence: illustration
I Data generated by an (unknown) distribution p(x)
I We want to fit a parametric probabilistic model q(x|θ) = qθ (x)
⇒ i.e., we want to minimize KL(p||qθ )
I Data available: observations (x1 , . . . , xN ):
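\begin{align*}
\mathrm{KL}(p\|q_\theta) &= -\int p(x) \ln \frac{q(x|\theta)}{p(x)}\,dx \\
&\simeq -\frac{1}{N} \sum_{i=1}^{N} \ln \frac{q(x_i|\theta)}{p(x_i)} \\
&= \frac{1}{N} \sum_{i=1}^{N} \big\{ -\ln q(x_i|\theta) + \ln p(x_i) \big\}
\end{align*}

⇒ since the term ln p(xi ) does not depend on θ, minimizing KL(p||qθ ) corresponds to maximizing $\sum_{i=1}^{N} \ln q(x_i|\theta)$ = log-likelihood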

Information theory - Mutual information
I Mutual information: I[X, Y ] = KL(p(X, Y )||p(X)p(Y ))
I Quantifies the amount of dependence between X and Y
I I[X, Y ] = 0 ⇔ p(X, Y ) = p(X)p(Y )
I We have:
\begin{align*}
I[X, Y] &= -\iint p(x, y) \ln \frac{p(x)\,p(y)}{p(x, y)}\,dx\,dy \\
&= -\iint p(x, y) \ln \frac{p(x)\,p(y)}{p(x|y)\,p(y)}\,dx\,dy \\
&= -\iint p(x, y) \ln \frac{p(x)}{p(x|y)}\,dx\,dy \\
&= -\iint p(x, y) \ln p(x)\,dx\,dy - \Big( -\iint p(x, y) \ln p(x|y)\,dx\,dy \Big) \\
&= H[X] - H[X|Y]
\end{align*}
I Conclusion: I[X, Y ] = H[X] − H[X|Y ] = H[Y ] − H[Y |X]
I I[X, Y ] = reduction of the uncertainty about X obtained by telling the value of Y (that is, 0 for independent variables)
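
The identity can be checked on small discrete tables, where the integrals become sums. The two joint tables and all function names below are assumptions made for this sketch.

```python
import math

def entropy(probs):
    """Discrete entropy in nats, with 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mutual_information(joint):
    """I[X, Y] = KL(p(x, y) || p(x) p(y)) for a joint table joint[x][y]."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return sum(pxy * math.log(pxy / (p_x[i] * p_y[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0.0)

def conditional_entropy_x_given_y(joint):
    """H[X|Y] = -sum_{x,y} p(x, y) ln p(x|y)."""
    p_y = [sum(col) for col in zip(*joint)]
    return -sum(pxy * math.log(pxy / p_y[j])
                for row in joint
                for j, pxy in enumerate(row) if pxy > 0.0)

# Hypothetical joint tables: one dependent, one built as an outer product.
dependent = [[0.4, 0.1], [0.1, 0.4]]
independent = [[0.3 * 0.4, 0.3 * 0.6], [0.7 * 0.4, 0.7 * 0.6]]

h_x = entropy([sum(row) for row in dependent])
print(mutual_information(dependent), h_x - conditional_entropy_x_given_y(dependent))
print(mutual_information(independent))
```

The two numbers on the first printed line agree, and the mutual information of the outer-product table vanishes up to rounding.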