Pattern Recognition and Machine Learning
Ch. 1: Introduction

Radu Horaud & Pierre Mahé

September 28, 2007

Radu Horaud & Pierre Mahé Patt. Rec. and Mach. Learning Ch. 1: Introduction
Chapter content

I Goals, terminology, scope of the book;

I 1.1 Example: Polynomial curve fitting;
I 1.2 Probability theory;
I 1.3 Model selection;
I 1.4 The curse of dimensionality;
I 1.5 Decision theory;
I 1.6 Information theory.

Goals

Pattern recognition: the automatic discovery of regularities in data and the use of these regularities to take actions, such as classifying the data into different categories.
Example: handwritten digit recognition. Input: a vector $\mathbf{x}$ of pixel values. Output: a digit from 0 to 9.
Machine learning: a large set of input vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$, called the training set, is used to tune the parameters of an adaptive model. The category of each input vector is expressed using a target vector $\mathbf{t}$.
The result of a machine learning algorithm: a function $y(\mathbf{x})$ whose output is encoded in the same way as the target vectors.

Terminology

I training or learning phase: determine y(x) on the basis of the training data;
I test set, generalization;
I supervised learning (input/target vectors in the training data);
I classification (discrete categories) or regression (continuous variables);
I unsupervised learning (no target vectors in the training data), for example clustering or density estimation;
I reinforcement learning, credit assignment, exploration, exploitation.

1.1 Polynomial curve fitting

I Training set: $\mathbf{x} \equiv (x_1, \ldots, x_N)$ and $\mathbf{t} \equiv (t_1, \ldots, t_N)$
I Goal: predict the target $\hat{t}$ for some new input $\hat{x}$
I Probability theory allows us to express the uncertainty of the target.
I Decision theory allows us to make optimal predictions.

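I Minimize the sum-of-squares error:
$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2$$
I The case of a polynomial function, linear in $\mathbf{w}$:
$$y(x_n, \mathbf{w}) = \sum_{j=0}^{M} w_j x_n^j$$
I Model selection: choosing $M$.
I Regularization (adding a penalty term):
$$\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$$
I This can be expressed in the Bayesian framework: minimizing $E$ is maximum likelihood under Gaussian noise, and minimizing $\tilde{E}$ is maximum a posteriori (MAP) estimation with a Gaussian prior on $\mathbf{w}$.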
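The minimization above has a closed-form solution, which can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `fit_polynomial` and `predict`, the toy data (noisy samples of sin(2πx)) and the values of `M` and `lam` are all assumptions chosen for the example.

```python
import numpy as np

def fit_polynomial(x, t, M, lam=0.0):
    """Minimize 0.5 * sum_n (y(x_n, w) - t_n)^2 + 0.5 * lam * ||w||^2 over w."""
    Phi = np.vander(x, M + 1, increasing=True)    # Phi[n, j] = x_n ** j
    A = Phi.T @ Phi + lam * np.eye(M + 1)         # normal equations plus penalty term
    return np.linalg.solve(A, Phi.T @ t)

def predict(x, w):
    """Evaluate y(x, w) = sum_j w_j x^j."""
    return np.vander(x, len(w), increasing=True) @ w

# Assumed toy data: N = 10 noisy samples of sin(2*pi*x)
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=10)

w_ml = fit_polynomial(x_train, t_train, M=3)              # lam = 0: plain least squares
w_reg = fit_polynomial(x_train, t_train, M=3, lam=1e-2)   # penalized fit
```

With `lam > 0` the coefficient vector is shrunk towards zero, which is the effect the penalty term is meant to have.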

1.2 Probability theory (discrete random variables)

I Sum rule:
$$p(X) = \sum_{Y} p(X, Y)$$
I Product rule:
$$p(X, Y) = p(X|Y)\,p(Y) = p(Y|X)\,p(X)$$
I Bayes:
$$p(Y|X) = \frac{p(X|Y)\,p(Y)}{\sum_{Y} p(X|Y)\,p(Y)}$$
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{normalization}}$$
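The three rules can be checked numerically on a small joint table. The sketch below assumes a made-up 2×2 joint distribution `p_xy`; the numbers are purely illustrative.

```python
import numpy as np

# Assumed joint distribution p(X, Y): rows index X, columns index Y
p_xy = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

p_x = p_xy.sum(axis=1)              # sum rule: p(X) = sum_Y p(X, Y)
p_y = p_xy.sum(axis=0)              # sum rule: p(Y) = sum_X p(X, Y)
p_x_given_y = p_xy / p_y            # product rule rearranged: p(X|Y) = p(X, Y) / p(Y)

# Bayes: p(Y|X) = p(X|Y) p(Y) / sum_Y p(X|Y) p(Y)
numerator = p_x_given_y * p_y
p_y_given_x = numerator / numerator.sum(axis=1, keepdims=True)
```

Each row of `p_y_given_x` is a distribution over Y for one value of X, so each row sums to one.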

1.2.1 Probability densities (continuous random variables)

I Probability that $x$ lies in an interval:
$$p(x \in (a, b)) = \int_a^b p(x)\,dx$$
I $p(x)$ is called the probability density over $x$.
I $p(x) \geq 0$, $p(x \in (-\infty, \infty)) = \int_{-\infty}^{\infty} p(x)\,dx = 1$
I nonlinear change of variable $x = g(y)$:
$$p_y(y) = p_x(x) \left| \frac{dx}{dy} \right| = p_x(g(y))\,|g'(y)|$$
I cumulative distribution function: $P(z) = p(x \in (-\infty, z)) = \int_{-\infty}^{z} p(x)\,dx$
I sum and product rules extend to probability densities.

1.2.2 Expectations and covariances

I Expectation: the average value of some function $f(x)$ under a probability distribution $p(x)$;
I discrete case: $E[f] = \sum_x p(x) f(x)$
I continuous case: $E[f] = \int p(x) f(x)\,dx$
I given $N$ points drawn from the probability distribution or probability density, the expectation can be approximated by:
$$E[f] \approx \frac{1}{N} \sum_{n=1}^{N} f(x_n)$$
I functions of several variables: $E_x[f] = \sum_x p(x) f(x, y)$ (the result is a function of $y$)
I conditional expectation: $E_x[f|y] = \sum_x p(x|y) f(x)$

I Variance of $f(x)$: a measure of the variations of $f(x)$ around $E[f]$.
I $\mathrm{var}[f] = E[f(x)^2] - E[f(x)]^2$
I $\mathrm{var}[x] = E[x^2] - E[x]^2$
I Covariance for two random variables:
$\mathrm{cov}[x, y] = E_{x,y}[xy] - E[x]E[y]$
I Two vectors of random variables:
$\mathrm{cov}[\mathbf{x}, \mathbf{y}] = E_{\mathbf{x},\mathbf{y}}[\mathbf{x}\mathbf{y}^\top] - E[\mathbf{x}]E[\mathbf{y}^\top]$
I Additional formula:
$$E_{x,y}[f(x, y)] = \sum_x \sum_y p(x, y) f(x, y)$$
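As a small numerical illustration of the discrete expectation, the variance identity and the sample-average approximation, consider the sketch below. The distribution (`values`, `probs`), the function `f` and the sample size are assumptions made for the example.

```python
import numpy as np

# Assumed discrete distribution over x in {0, 1, 2, 3}
values = np.array([0.0, 1.0, 2.0, 3.0])
probs = np.array([0.1, 0.2, 0.3, 0.4])

def f(x):
    return x ** 2

exact = np.sum(probs * f(values))                                # E[f] = sum_x p(x) f(x)
var_x = np.sum(probs * values**2) - np.sum(probs * values)**2    # var[x] = E[x^2] - E[x]^2

rng = np.random.default_rng(0)
samples = rng.choice(values, size=100_000, p=probs)
approx = f(samples).mean()                                       # E[f] ~ (1/N) sum_n f(x_n)
```

The sample average `approx` fluctuates around `exact`, with an error that shrinks as the number of samples grows.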

1.2.3 Bayesian probabilities
I frequentist versus Bayesian interpretation of probabilities;
I frequentist estimator: maximum likelihood (MLE or ML);
I Bayesian estimator: MLE and maximum a posteriori (MAP);
I back to curve fitting: $D = \{t_1, \ldots, t_N\}$ is a set of $N$ observations of $N$ random variables, and $\mathbf{w}$ is the vector of unknown parameters.
I Bayes' theorem reads in this case:
$$p(\mathbf{w}|D) = \frac{p(D|\mathbf{w})\,p(\mathbf{w})}{p(D)}$$
I posterior ∝ likelihood × prior (all these quantities are viewed as functions of $\mathbf{w}$)
I $p(D|\mathbf{w})$ is the likelihood function; it expresses how probable the observed data set is for various values of $\mathbf{w}$. It is not a probability distribution over $\mathbf{w}$.
I The denominator:
$$p(D) = \int \cdots \int p(D|\mathbf{w})\,p(\mathbf{w})\,d\mathbf{w}$$

1.2.4 The Gaussian distribution
I The Gaussian distribution of a single real-valued variable $x$:
$$\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}$$
I in $D$ dimensions: $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) : \mathbb{R}^D \to \mathbb{R}$
I $E[x] = \mu$, $\mathrm{var}[x] = \sigma^2$
I $\mathbf{x} = (x_1, \ldots, x_N)$ is a set of $N$ observations of the same scalar variable $x$
I Assume that this data set is independent and identically distributed:
$$p(x_1, \ldots, x_N | \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n | \mu, \sigma^2)$$
I max $p$ is equivalent to max $\ln p$ or min $(-\ln p)$
I log-likelihood:
$$\ln p(x_1, \ldots, x_N | \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \ldots$$
I maximum likelihood solution: $\mu_{ML}$ and $\sigma^2_{ML}$
I MLE underestimates the variance (bias): $E[\sigma^2_{ML}] = \frac{N-1}{N} \sigma^2$
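The maximum likelihood estimates are the sample mean and the sample variance computed with a 1/N factor. A minimal sketch, assuming a hypothetical helper `gaussian_mle` and a made-up data set of four observations:

```python
import numpy as np

def gaussian_mle(x):
    """Return (mu_ML, sigma2_ML) for i.i.d. scalar observations x."""
    mu_ml = x.mean()
    sigma2_ml = ((x - mu_ml) ** 2).mean()     # divides by N, hence biased
    return mu_ml, sigma2_ml

x = np.array([2.0, 4.0, 4.0, 6.0])            # assumed observations
mu_ml, sigma2_ml = gaussian_mle(x)
N = len(x)
sigma2_unbiased = N / (N - 1) * sigma2_ml     # undoes the (N-1)/N bias factor
```

The rescaled estimate agrees with `np.var(x, ddof=1)`, the usual unbiased sample variance.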

1.2.5 Curve fitting re-visited
I training data: $\mathbf{x} = (x_1, \ldots, x_N)$ and $\mathbf{t} = (t_1, \ldots, t_N)$
I it is assumed that $t$ is Gaussian:
$$p(t|x, \mathbf{w}, \beta) = \mathcal{N}(t | y(x, \mathbf{w}), \beta^{-1})$$
I recall that: $y(x, \mathbf{w}) = w_0 + w_1 x + \ldots + w_M x^M$
I (joint) likelihood function:
$$p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n | y(x_n, \mathbf{w}), \beta^{-1})$$
I log-likelihood:
$$\ln p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ t_n - y(x_n, \mathbf{w}) \}^2 + \frac{N}{2} \ln \beta - \ldots$$
I $\beta = \frac{1}{\sigma^2}$ is called the precision.
I The ML solution can be used as a predictive distribution:
$$p(t|x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}(t | y(x, \mathbf{w}_{ML}), \beta_{ML}^{-1})$$

Introducing a prior distribution

I The polynomial coefficients are treated as random variables with a Gaussian distribution taken over a vector of dimension $M + 1$:
$$p(\mathbf{w}|\alpha) = \mathcal{N}(\mathbf{w} | \mathbf{0}, \alpha^{-1}\mathbf{I}) = \left( \frac{\alpha}{2\pi} \right)^{(M+1)/2} \exp\left\{ -\frac{\alpha}{2} \mathbf{w}^\top \mathbf{w} \right\}$$
I from Bayes we get the posterior probability:
$$p(\mathbf{w}|\mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta)\,p(\mathbf{w}|\alpha)$$
I maximum posterior or MAP. We take the negative logarithm, drop the terms that do not depend on $\mathbf{w}$, and get:
$$\frac{\beta}{2} \sum_{n=1}^{N} \{ t_n - y(x_n, \mathbf{w}) \}^2 + \frac{\alpha}{2} \mathbf{w}^\top \mathbf{w}$$
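The link with the regularized error $\tilde{E}(\mathbf{w})$ of section 1.1 can be made explicit by factoring out $\beta$. The short derivation below is a sketch of that step, where "const" collects the terms that do not depend on $\mathbf{w}$:

```latex
\begin{align*}
-\ln p(\mathbf{w}|\mathbf{x},\mathbf{t},\alpha,\beta)
  &= \frac{\beta}{2}\sum_{n=1}^{N}\{t_n - y(x_n,\mathbf{w})\}^2
     + \frac{\alpha}{2}\mathbf{w}^\top\mathbf{w} + \text{const} \\
  &= \beta\left[\frac{1}{2}\sum_{n=1}^{N}\{y(x_n,\mathbf{w}) - t_n\}^2
     + \frac{\lambda}{2}\|\mathbf{w}\|^2\right] + \text{const},
  \qquad \lambda = \frac{\alpha}{\beta}.
\end{align*}
```

Since $\beta > 0$, the MAP estimate minimizes the regularized sum-of-squares error with regularization coefficient $\lambda = \alpha/\beta$.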

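1.2.6 Bayesian curve fitting

I Apply the correct (normalized) Bayes formula:
$$p(\mathbf{w}|\mathbf{x}, \mathbf{t}, \alpha, \beta) = \frac{p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta)\,p(\mathbf{w}|\alpha)}{\int p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta)\,p(\mathbf{w}|\alpha)\,d\mathbf{w}}$$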

1.3 Model selection

I which order $M$ of the polynomial gives the best generalization?
I train a range of models and test them on an independent validation set
I cross-validation: split the data into $S$ groups, train on $S - 1$ of them, assess the performance on the held-out group, and repeat for each of the $S$ choices of held-out group
I Akaike information criterion (AIC): choose the model that maximizes $\ln p(D|\mathbf{w}_{ML}) - M$, where $M$ is the number of adjustable parameters
I Bayesian information criterion (BIC), section 4.4.1.
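Cross-validation for choosing the polynomial order can be sketched as follows. This is only an illustration under assumptions: the helper `cv_error`, the number of folds `S=5`, and the synthetic data (a quadratic plus Gaussian noise) are all made up for the example.

```python
import numpy as np

def cv_error(x, t, M, S=5):
    """Mean squared validation error of a degree-M polynomial under S-fold cross-validation."""
    folds = np.array_split(np.arange(len(x)), S)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        w = np.polynomial.polynomial.polyfit(x[train], t[train], M)   # least-squares fit
        pred = np.polynomial.polynomial.polyval(x[held_out], w)
        errors.append(np.mean((pred - t[held_out]) ** 2))
    return float(np.mean(errors))

# Assumed data: a quadratic plus noise; inputs are drawn at random, so folds are not ordered in x
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=40)
t = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.1, size=x.shape)

scores = {M: cv_error(x, t, M) for M in range(0, 7)}
best_M = min(scores, key=scores.get)
```

Orders below the true one (here 2) leave a large systematic error, so their validation scores are clearly worse; higher orders differ only by small amounts that depend on the noise.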

1.4 The curse of dimensionality

I "curse" is meant in the sense of a scourge or affliction (French: malédiction, fléau)
I polynomial fitting: replace $x$ by a vector $\mathbf{x}$ of dimension $D$. The number of unknowns then grows like $D^M$.
I Not all the intuitions developed in spaces of low dimensionality will generalize to spaces of many dimensions
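To make the growth concrete: a general polynomial of order M in D variables has (D + M)! / (D! M!) independent coefficients, which behaves like D^M for large D. The helper name `num_coefficients` below is an assumption for this sketch.

```python
from math import comb

def num_coefficients(D, M):
    """Number of independent coefficients of a general order-M polynomial in D variables."""
    return comb(D + M, M)

for D in (1, 10, 100):
    print(D, num_coefficients(D, 3))   # cubic polynomial in D dimensions
```

For a cubic (M = 3), going from D = 10 to D = 100 multiplies the number of coefficients by several hundred.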

Section 1.5: Decision theory

Decision theory - introduction

I The decision problem:
I given $\mathbf{x}$, predict $t$ according to a probabilistic model $p(\mathbf{x}, t)$
I For now: binary classification: $t \in \{0, 1\} \Leftrightarrow \{C_1, C_2\}$
I Important quantity: $p(C_k|\mathbf{x})$
$$p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}, C_k)}{p(\mathbf{x})} = \frac{p(\mathbf{x}, C_k)}{\sum_{j=1}^{2} p(\mathbf{x}, C_j)}$$
⇒ getting $p(\mathbf{x}, C_k)$ is the (central!) inference problem
$$p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)\,p(C_k)}{p(\mathbf{x})} \propto \text{likelihood} \times \text{prior}$$
I Intuition: choose $k$ that maximizes $p(C_k|\mathbf{x})$?

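Decision theory - binary classification
I Decision region: $R_i = \{\mathbf{x} : \mathrm{pred}(\mathbf{x}) = C_i\}$
I Probability of misclassification:
$$p(\text{mis}) = p(\mathbf{x} \in R_1, C_2) + p(\mathbf{x} \in R_2, C_1) = \int_{R_1} p(\mathbf{x}, C_2)\,d\mathbf{x} + \int_{R_2} p(\mathbf{x}, C_1)\,d\mathbf{x}$$
⇒ To minimize it, assign $\mathbf{x}$ to $R_1$ if:
$$p(\mathbf{x}, C_1) > p(\mathbf{x}, C_2) \;\Leftrightarrow\; p(C_1|\mathbf{x})\,p(\mathbf{x}) > p(C_2|\mathbf{x})\,p(\mathbf{x}) \;\Leftrightarrow\; p(C_1|\mathbf{x}) > p(C_2|\mathbf{x})$$
I Similarly, for $K$ classes: minimize $\sum_{j} \int_{R_j} \left( \sum_{k \neq j} p(\mathbf{x}, C_k) \right) d\mathbf{x}$
⇒ $\mathrm{pred}(\mathbf{x}) = \arg\max_k p(C_k|\mathbf{x})$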
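The rule pred(x) = argmax_k p(C_k|x) can be sketched for a one-dimensional input. Everything specific here is an assumption made for illustration: Gaussian class-conditional densities with the given `means` and `variances`, equal `priors`, and the helper names.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Assumed one-dimensional model: p(x|C_k) Gaussian, with priors p(C_k)
means = np.array([0.0, 2.0])
variances = np.array([1.0, 1.0])
priors = np.array([0.5, 0.5])

def posterior(x):
    """Return p(C_k|x) for k = 1, 2 as an array of shape (len(x), 2)."""
    joint = gaussian_pdf(np.asarray(x)[:, None], means, variances) * priors   # p(x, C_k)
    return joint / joint.sum(axis=1, keepdims=True)                            # divide by p(x)

def predict(x):
    return posterior(x).argmax(axis=1) + 1    # class index k in {1, 2}

labels = predict([-1.0, 0.5, 1.5, 3.0])
```

With equal priors and equal variances the decision boundary sits halfway between the two means, at x = 1.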

Decision theory - loss-sensitive decision

I Cost/loss of a decision: $L_{kj}$ = cost of predicting $C_j$ when the true class is $C_k$.
I Loss-sensitive decision ⇒ minimize the expected loss:
$$E[L] = \sum_{j} \int_{R_j} \sum_{k} L_{kj}\,p(\mathbf{x}, C_k)\,d\mathbf{x}$$
I Solution: for each $\mathbf{x}$, choose the class $C_j$ that minimizes:
$$\sum_{k} L_{kj}\,p(\mathbf{x}, C_k) \propto \sum_{k} L_{kj}\,p(C_k|\mathbf{x})$$
⇒ straightforward when we know $p(C_k|\mathbf{x})$

Decision theory - loss-sensitive decision
I Typical example = medical diagnosis:
I $C_k$, $k \in \{1, 2\}$ ⇔ {sick, healthy}
I $L = \begin{pmatrix} 0 & 100 \\ 1 & 0 \end{pmatrix}$ ⇒ strong cost of "missing" a diseased person
I Expected loss:
$$E[L] = L_{1,2} \int_{R_2} p(\mathbf{x}, C_1)\,d\mathbf{x} + L_{2,1} \int_{R_1} p(\mathbf{x}, C_2)\,d\mathbf{x} = 100 \times \int_{R_2} p(\mathbf{x}, C_1)\,d\mathbf{x} + \int_{R_1} p(\mathbf{x}, C_2)\,d\mathbf{x}$$
I Note: minimizing the probability of misclassification:
$$p(\text{mis}) = \int_{R_1} p(\mathbf{x}, C_2)\,d\mathbf{x} + \int_{R_2} p(\mathbf{x}, C_1)\,d\mathbf{x}$$
corresponds to minimizing the 0/1 loss: $L = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$
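The effect of the asymmetric loss matrix can be seen on a single hypothetical patient. In the sketch below, the function `decide` and the posterior values are assumptions; only the loss matrix comes from the slide.

```python
import numpy as np

# Loss matrix from the medical example: rows = true class k, columns = predicted class j
# (index 0 = sick, index 1 = healthy)
L = np.array([[0.0, 100.0],
              [1.0,   0.0]])

def decide(posteriors, loss):
    """For each row p(C_k|x), return the j minimizing sum_k L_kj p(C_k|x)."""
    expected_loss = posteriors @ loss          # shape (n_inputs, n_classes)
    return expected_loss.argmin(axis=1)

# Assumed posterior: the patient is healthy with probability 0.95
post = np.array([[0.05, 0.95]])
decision_weighted = decide(post, L)                  # uses the asymmetric loss
decision_01 = decide(post, 1.0 - np.eye(2))          # 0/1 loss = most probable class
```

Under the 0/1 loss the patient is declared healthy, whereas the asymmetric loss leads to the "sick" decision because a missed disease costs 100 times more than a false alarm.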

Decision theory - the "reject option"

I For the 0/1 loss [1], $\mathrm{pred}(\mathbf{x}) = \arg\max_k p(C_k|\mathbf{x})$
I Note: $K$ classes ⇒ $1/K \leq \max_k p(C_k|\mathbf{x}) \leq 1$
I When $\max_k p(C_k|\mathbf{x}) \to 1/K$, the confidence in the prediction decreases.
I all classes tend to become equally likely
I "Reject option": make a decision only if $\max_k p(C_k|\mathbf{x}) > \sigma$
⇒ the value of $\sigma$ controls the amount of rejection:
I $\sigma = 1$: systematic rejection
I $\sigma < 1/K$: no rejection
I Motivation: switch between automatic/human decision
I Illustration in Figure 1.26 page 42

[1] For a general loss matrix, see exercise 1.24
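A minimal sketch of the reject option for the 0/1 loss follows. The function name `predict_with_reject`, the use of -1 as a "reject" label and the posterior values are all assumptions made for this illustration.

```python
import numpy as np

def predict_with_reject(posteriors, sigma):
    """Return argmax_k p(C_k|x) per row, or -1 (reject) when max_k p(C_k|x) <= sigma."""
    posteriors = np.asarray(posteriors)
    best = posteriors.argmax(axis=1)
    confident = posteriors.max(axis=1) > sigma
    return np.where(confident, best, -1)

# Assumed posteriors for K = 3 classes
post = [[0.90, 0.05, 0.05],
        [0.40, 0.35, 0.25],
        [0.34, 0.33, 0.33]]

print(predict_with_reject(post, sigma=0.5))
```

With `sigma=0.5` only the first input is classified; with `sigma=1.0` every input is rejected, and with any `sigma` below 1/3 none is.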

Decision theory - regression setting
I The regression setting: quantitative target $t \in \mathbb{R}$
I Typical regression loss function: $L(t, y(\mathbf{x})) = (y(\mathbf{x}) - t)^2$
I the squared loss
I The decision problem = minimize the expected loss:
$$E[L] = \int_{\mathcal{X}} \int_{\mathbb{R}} (y(\mathbf{x}) - t)^2\,p(\mathbf{x}, t)\,dt\,d\mathbf{x}$$
I Solution: $y(\mathbf{x}) = \int_{\mathbb{R}} t\,p(t|\mathbf{x})\,dt$
I this is known as the regression function
I intuitively appealing: conditional average of $t$ given $\mathbf{x}$
I illustration in figure 1.28, page 47
I Note: general class of loss functions $L(t, y(\mathbf{x})) = |y(\mathbf{x}) - t|^q$
I $q = 2$ is analytically convenient (differentiable and continuous)

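Decision theory - regression setting
I Derivation:
$$E[L] = \int_{\mathcal{X}} \int_{\mathbb{R}} (y(\mathbf{x}) - t)^2\,p(\mathbf{x}, t)\,dt\,d\mathbf{x} = \int_{\mathcal{X}} \left[ \int_{\mathbb{R}} (y(\mathbf{x}) - t)^2\,p(t|\mathbf{x})\,dt \right] p(\mathbf{x})\,d\mathbf{x}$$
⇒ for each $\mathbf{x}$, find $y(\mathbf{x})$ that minimizes $\int_{\mathbb{R}} (y(\mathbf{x}) - t)^2\,p(t|\mathbf{x})\,dt$
I Differentiating with respect to $y(\mathbf{x})$ gives: $2 \int_{\mathbb{R}} (y(\mathbf{x}) - t)\,p(t|\mathbf{x})\,dt$
I Setting this to zero leads to:
$$\int_{\mathbb{R}} y(\mathbf{x})\,p(t|\mathbf{x})\,dt = \int_{\mathbb{R}} t\,p(t|\mathbf{x})\,dt$$
and, since $\int_{\mathbb{R}} p(t|\mathbf{x})\,dt = 1$:
$$y(\mathbf{x}) = \int_{\mathbb{R}} t\,p(t|\mathbf{x})\,dt$$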
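The result can be checked numerically at one fixed input: over a grid of candidate predictions, the expected squared loss is smallest at the conditional mean. The discrete conditional distribution and the grid below are assumptions chosen for the sketch.

```python
import numpy as np

# Assumed conditional distribution p(t|x) at one fixed x, on a few target values
t_values = np.array([0.0, 1.0, 2.0, 5.0])
p_t_given_x = np.array([0.1, 0.4, 0.3, 0.2])

def expected_sq_loss(y):
    return np.sum(p_t_given_x * (y - t_values) ** 2)

cond_mean = np.sum(p_t_given_x * t_values)            # E[t|x]
candidates = np.linspace(-1.0, 6.0, 701)              # grid with step 0.01
best = candidates[np.argmin([expected_sq_loss(y) for y in candidates])]
```

Up to the grid resolution, `best` coincides with `cond_mean`.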

Decision theory - inference and decision

2 (or 3) different approaches to the decision problem:

1. rely on a probabilistic model, with 2 flavours:
1.1 generative:
I use a generative model to infer $p(\mathbf{x}|C_k)$
I combine with priors $p(C_k)$ to get $p(\mathbf{x}, C_k)$ and eventually $p(C_k|\mathbf{x})$
1.2 discriminative: directly infer $p(C_k|\mathbf{x})$
I this is sufficient for the decision problem

2. learn a discriminant function $f(\mathbf{x})$
I directly map inputs to class labels
I for binary classification, $f(\mathbf{x})$ is typically defined as the sign (+1/-1) of an auxiliary function

(Note: similar discussion for regression)

Decision theory - inference and decision

Pros and cons:

I probabilistic generative models:
I pros: access to $p(\mathbf{x})$ → easy detection of outliers
I i.e., low-confidence predictions
I cons: estimating the joint probability $p(\mathbf{x}, C_k)$ can be computationally demanding and data demanding
I probabilistic discriminative models:
I pros: less demanding than the generative approach
I see figure 1.27, page 44
I discriminant functions:
I pros: a single learning problem (vs inference + decision)
I cons: no access to $p(C_k|\mathbf{x})$
I ... which can have many advantages in practice, e.g. for rejection and model combination (see page 45)

Section 1.6: Information theory

Information theory - Entropy
I Consider a discrete random variable $X$
I We want to define a measure $h(x)$ of the surprise/information of observing $X = x$
I Natural requirements:
I if $p(x)$ is low (resp. high), $h(x)$ should be high (resp. low)
I $h(x)$ should be a monotonically decreasing function of $p(x)$
I if $X$ and $Y$ are unrelated, $h(x, y)$ should be $h(x) + h(y)$
I i.e., if $X$ and $Y$ are independent, that is $p(x, y) = p(x)p(y)$
⇒ this leads to $h(x) = -\log p(x)$
I Entropy of the variable $X$:
$$H[X] = E[h(X)] = -\sum_x p(x) \log p(x)$$
(Convention: $p \log p = 0$ if $p = 0$)

Information theory - Entropy
Some remarks:
I $H[X] \geq 0$ since $p \in [0, 1]$ (hence $p \log p \leq 0$)
I $H[X] = 0$ if $\exists\, x$ s.t. $p(x) = 1$
I Maximum entropy distribution = uniform distribution
I optimization problem: maximize $H[X] + \lambda \left( \sum_{x_i} p(x_i) - 1 \right)$
I differentiating w.r.t. $p(x_i)$ shows they must be constant
I hence $p(x_i) = 1/M, \forall x_i$ ⇒ $H[X] = \log(M)$, where $M$ is the number of values $X$ can take
⇒ we therefore have $0 \leq H[X] \leq \log(M)$
I $H[X]$ = lower bound on the number of bits required to (binarily) encode the values of $X$ (using $\log_2$ in the definition of $H$)
I trivial code of length $\log_2(M)$ (e.g. $M = 8$, messages of size 3)
I no "clever" coding scheme for uniform distributions
I for non-uniform distributions, optimal coding schemes can be designed
I high probability values ⇒ short codes
I see illustration on page 50
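The coding remark can be illustrated with M = 8 states. In the sketch below, the helper `entropy_bits`, the non-uniform distribution `skewed` and the code lengths are assumptions chosen so that each code length equals -log2 p(x).

```python
import numpy as np

def entropy_bits(p):
    """H[X] = -sum_x p(x) log2 p(x), with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

uniform = np.full(8, 1 / 8)
skewed = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])   # assumed non-uniform case

# Assumed code for the skewed case: shorter codewords for more probable values
code_lengths = np.array([1, 2, 3, 4, 6, 6, 6, 6])
average_length = float(np.sum(skewed * code_lengths))
```

The uniform case needs 3 bits per value, while the skewed case has an entropy of 2 bits, which the assumed code lengths attain on average.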

Information theory - Entropy
For continuous random variables:
I differential entropy: $H[X] = -\int p(x) \ln p(x)\,dx$
I because $p(x)$ can be > 1, care must be taken when transposing properties of the discrete entropy
I in particular, it can be negative (if $X \sim U(0, 1/2)$: $H[X] = -\ln 2$)
I Given $(\mu, \sigma)$, the maximum entropy distribution is $p(x) = \mathcal{N}(x|\mu, \sigma^2)$
I optimization problem: maximize $H[X]$ under equality constraints on $\mu$ and $\sigma$ plus a normalization constraint
I entropy: $H[X] = \frac{1}{2} \left\{ 1 + \ln(2\pi\sigma^2) \right\}$
I Conditional entropy of $y$ given $x$:
$$H[Y|X] = -\int\!\!\int p(x, y) \ln p(y|x)\,dx\,dy$$
⇒ it follows easily that $H[X, Y] = H[Y|X] + H[X]$
(natural interpretation with the notion of information)

Information theory - KL divergence
I Kullback-Leibler divergence between distributions $p$ and $q$:
$$\mathrm{KL}(p\|q) = -\int p(x) \ln q(x)\,dx - \left( -\int p(x) \ln p(x)\,dx \right) = -\int p(x) \ln \frac{q(x)}{p(x)}\,dx$$
I $\mathrm{KL}(p\|q) \neq \mathrm{KL}(q\|p)$ in general
I $\mathrm{KL}(p\|p) = 0$
I $\mathrm{KL}(p\|q) \geq 0$ (next slide)
⇒ measures the difference between the "true" distribution $p$ and the distribution $q$
(Information theory interpretation: amount of additional information required to encode the values of $X$ using $q(x)$ instead of $p(x)$)

Information theory - KL divergence
I A function is convex iff every chord lies on or above the function
I illustration in figure 1.31, page 56
I Jensen's inequality for convex functions:
$$E[f(x)] \geq f(E[x])$$
(strict inequality for strictly convex functions, unless $x$ is constant)
I When applied to $\mathrm{KL}(p\|q)$:
$$\mathrm{KL}(p\|q) = -\int p(x) \ln \frac{q(x)}{p(x)}\,dx \geq -\ln \int p(x) \frac{q(x)}{p(x)}\,dx = -\ln \int q(x)\,dx = -\ln 1 = 0$$
(because $-\ln$ is strictly convex, the inequality is strict unless $q(x)/p(x)$ is constant, i.e. $q = p$)
Moreover, it is straightforward to see that $\mathrm{KL}(p\|p) = 0$
I Conclusion: $\mathrm{KL}(p\|q) \geq 0$, with equality if and only if $p = q$
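For discrete distributions the integral becomes a sum, and the properties listed above are easy to check numerically. The helper `kl_divergence` and the two distributions below are assumptions made for this sketch.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p||q) = -sum_x p(x) ln(q(x)/p(x)) for discrete p, q with q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask] / p[mask])))

# Assumed distributions over three states
p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
```

Both values are positive and they differ from each other, which illustrates that the divergence is not symmetric.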

Information theory - KL divergence: illustration
I Data generated by an (unknown) distribution $p(x)$
I We want to fit a parametric probabilistic model $q(x|\theta) = q_\theta(x)$
⇒ i.e., we want to minimize $\mathrm{KL}(p\|q_\theta)$
I Data available: observations $(x_1, \ldots, x_N)$:
$$\mathrm{KL}(p\|q_\theta) = -\int p(x) \ln \frac{q(x|\theta)}{p(x)}\,dx \simeq -\frac{1}{N} \sum_{i=1}^{N} \ln \frac{q(x_i|\theta)}{p(x_i)} = \frac{1}{N} \sum_{i=1}^{N} \left\{ -\ln q(x_i|\theta) + \ln p(x_i) \right\}$$
⇒ since the second term does not depend on $\theta$, minimizing $\mathrm{KL}(p\|q_\theta)$ corresponds to maximizing $\sum_{i=1}^{N} \ln q(x_i|\theta)$ = log-likelihood

Information theory - Mutual information
I Mutual information: $I[X, Y] = \mathrm{KL}(p(X, Y)\,\|\,p(X)p(Y))$
I Quantifies the amount of dependence between $X$ and $Y$
I $I[X, Y] = 0 \Leftrightarrow p(X, Y) = p(X)p(Y)$
I We have:
$$\begin{aligned}
I[X, Y] &= -\int\!\!\int p(x, y) \ln \frac{p(x)p(y)}{p(x, y)}\,dx\,dy \\
&= -\int\!\!\int p(x, y) \ln \frac{p(x)p(y)}{p(x|y)p(y)}\,dx\,dy \\
&= -\int\!\!\int p(x, y) \ln \frac{p(x)}{p(x|y)}\,dx\,dy \\
&= -\int\!\!\int p(x, y) \ln p(x)\,dx\,dy - \left( -\int\!\!\int p(x, y) \ln p(x|y)\,dx\,dy \right) \\
&= H[X] - H[X|Y]
\end{aligned}$$
I Conclusion: $I[X, Y] = H[X] - H[X|Y] = H[Y] - H[Y|X]$
I $I[X, Y]$ = reduction of the uncertainty about $X$ obtained by telling the value of $Y$ (that is, 0 for independent variables)
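For discrete variables the mutual information can be computed directly from a joint table. The function `mutual_information` and the two example tables are assumptions made for this sketch.

```python
import numpy as np

def mutual_information(p_xy):
    """I[X, Y] = sum_{x,y} p(x, y) ln( p(x, y) / (p(x) p(y)) ), in nats."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)     # column vector p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)     # row vector p(y)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))

independent = np.outer([0.3, 0.7], [0.6, 0.4])     # p(x, y) = p(x) p(y)
dependent = np.array([[0.5, 0.0],
                      [0.0, 0.5]])                 # Y determines X exactly
```

The independent table gives zero, while in the second table knowing Y removes all uncertainty about X, so the mutual information equals H[X] = ln 2.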