414/2104: Machine Learning

Russ Salakhutdinov
Department of Computer Science
Department of Statistics
[email protected]
http://www.cs.toronto.edu/~rsalakhu/

Lecture 2, 2015
Linear Least Squares

From last class: Minimize the sum of the squares of the errors between the predictions for each data point xn and the corresponding real-valued targets tn.

Source: Wikipedia
Linear Least Squares

If the matrix in the normal equations is nonsingular, then the unique solution is given by:

Source: Wikipedia
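A minimal sketch of this closed-form solution, assuming the usual notation of a design matrix X and target vector t (these symbol names are an assumption, not necessarily the slide's own):

```python
# Linear least squares: w = (X^T X)^{-1} X^T t, valid when X^T X is nonsingular.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # N = 100 points, D = 3 inputs
true_w = np.array([1.0, -2.0, 0.5])
t = X @ true_w + 0.1 * rng.normal(size=100)    # noisy real-valued targets

w_normal_eq = np.linalg.solve(X.T @ X, X.T @ t)   # closed form via the normal equations
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)   # numerically safer equivalent
```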
Example: Polynomial Curve Fitting

Goal: Fit the data using a polynomial function of the form:

Note: the polynomial function is a nonlinear function of x, but it is a linear function of the coefficients w, hence a linear model.
• As in the least squares example, we can minimize the sum of the squares of the errors between the predictions for each data point xn and the corresponding target values tn.
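A minimal sketch of such a fit, assuming a polynomial y(x, w) = w0 + w1 x + ... + wM x^M of order M (the data and the order below are invented for illustration):

```python
# Polynomial curve fitting as a linear least-squares problem:
# the model is nonlinear in x but linear in the coefficients w.
import numpy as np

rng = np.random.default_rng(1)
N, M = 10, 3                                   # N points, polynomial order M
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=N)   # toy targets

Phi = np.vander(x, M + 1, increasing=True)     # design matrix: columns 1, x, x^2, ..., x^M
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)    # minimize the sum of squared errors
predictions = Phi @ w
```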
Parameter Estimation

• The probability of x=1 will be denoted by the parameter µ, so that:

• Suppose we observed a dataset

• It is often convenient to maximize the log of the likelihood function:

• Note that the likelihood function depends on the N observations xn only through the sum            (Sufficient Statistic)

• Setting the derivative of the log-likelihood function w.r.t. µ to zero, we obtain:
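Sketching the standard Bernoulli equations, assuming observations D = {x1, ..., xN} with xn in {0, 1}:

    p(x\mid\mu) = \mu^{x}(1-\mu)^{1-x},
    \qquad
    \ln p(\mathcal{D}\mid\mu) = \sum_{n=1}^{N}\bigl[x_n\ln\mu + (1-x_n)\ln(1-\mu)\bigr],

and setting the derivative with respect to µ to zero gives µ_ML = (1/N) Σn xn, the sample mean.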
Multinomial Variables

• If a random variable can take on K=6 states, and a particular observation of the variable corresponds to the state x3=1, then x will be represented as:

• If we denote the probability of xk=1 by the parameter µk, then the distribution over x is defined as:
• The multinomial distribution can be viewed as a generalization of the Bernoulli distribution to more than two outcomes.

• It is easy to see that the distribution is normalized:

and
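Sketching the standard 1-of-K forms for the missing expressions: with K = 6 and x3 = 1 the observation is encoded as x = (0, 0, 1, 0, 0, 0)^T, and

    p(\mathbf{x}\mid\boldsymbol{\mu}) = \prod_{k=1}^{K}\mu_k^{x_k},
    \qquad
    \sum_{\mathbf{x}} p(\mathbf{x}\mid\boldsymbol{\mu}) = \sum_{k=1}^{K}\mu_k = 1,
    \qquad
    \mathbb{E}[\mathbf{x}\mid\boldsymbol{\mu}] = \boldsymbol{\mu}.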
Maximum Likelihood Estimation

• Suppose we observed a dataset

• We can construct the likelihood function, which is a function of µ.

• Note that the likelihood function depends on the N data points only through the following K quantities:
• To find a maximum likelihood solution for µ, we need to maximize the log-likelihood, taking into account the constraint that
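A sketch of the standard argument, writing mk for the number of observations of state k: maximize

    \ln p(\mathcal{D}\mid\boldsymbol{\mu}) = \sum_{k=1}^{K} m_k \ln\mu_k
    \quad\text{subject to}\quad \sum_{k=1}^{K}\mu_k = 1;

introducing a Lagrange multiplier for the constraint gives the intuitive result µk(ML) = mk / N.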
• The normalization coefficient is the number of ways of partitioning N objects into K groups of size m1, m2, …, mK.

• Note that
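The coefficient in question is presumably the usual multinomial coefficient (a sketch of the missing expressions):

    \binom{N}{m_1\, m_2\, \cdots\, m_K} = \frac{N!}{m_1!\, m_2!\cdots m_K!},
    \qquad
    \sum_{k=1}^{K} m_k = N.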
Dirichlet Distribution

• Consider a distribution over µk, subject to the constraints:

• The Dirichlet distribution is confined to a simplex as a consequence of the constraints.
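Sketching the standard constraints and density (the αk are the usual Dirichlet parameters):

    0 \le \mu_k \le 1, \qquad \sum_{k=1}^{K}\mu_k = 1,
    \qquad
    \mathrm{Dir}(\boldsymbol{\mu}\mid\boldsymbol{\alpha}) =
    \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}
    \prod_{k=1}^{K}\mu_k^{\alpha_k-1},
    \qquad \alpha_0 = \sum_{k=1}^{K}\alpha_k.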
Dirichlet Distribution

• Plots of the Dirichlet distribution over three variables.
Gaussian Univariate Distribution

• In the case of a single variable x, the Gaussian distribution takes the form:
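The standard form of the density (a sketch, using the usual mean µ and variance σ²):

    \mathcal{N}(x\mid\mu,\sigma^2) =
    \frac{1}{(2\pi\sigma^2)^{1/2}}
    \exp\Bigl\{-\frac{1}{2\sigma^2}(x-\mu)^2\Bigr\}.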
• Let us analyze the functional dependence of the Gaussian on x through the quadratic form:
• The covariance can be expressed in terms of its eigenvectors:

• Remember:

• Hence:

• We can interpret {yi} as a new coordinate system defined by the orthonormal vectors ui that are shifted and rotated.
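A sketch of the expressions these bullets refer to, in the usual notation (eigenvectors ui and eigenvalues λi of Σ): the quadratic form is the squared Mahalanobis distance Δ² = (x − µ)ᵀΣ⁻¹(x − µ), and

    \Sigma = \sum_{i=1}^{D}\lambda_i\,\mathbf{u}_i\mathbf{u}_i^{\mathsf{T}},
    \qquad
    \Sigma^{-1} = \sum_{i=1}^{D}\frac{1}{\lambda_i}\,\mathbf{u}_i\mathbf{u}_i^{\mathsf{T}},
    \qquad
    \Delta^2 = \sum_{i=1}^{D}\frac{y_i^2}{\lambda_i},
    \quad y_i = \mathbf{u}_i^{\mathsf{T}}(\mathbf{x}-\boldsymbol{\mu}).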
Geometry of the Gaussian Distribution

• Because the parameter matrix Σ governs the covariance of x under the Gaussian distribution, it is called the covariance matrix.
Moments of the Gaussian Distribution

• Contours of constant probability density:
• In many situations, it will be more convenient to work with the precision matrix (the inverse of the covariance matrix):

• Note that Λaa is not given by the inverse of Σaa.
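Writing out the partition that these bullets assume (a sketch in standard notation):

    \mathbf{x} = \begin{pmatrix}\mathbf{x}_a\\ \mathbf{x}_b\end{pmatrix},
    \qquad
    \Sigma = \begin{pmatrix}\Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb}\end{pmatrix},
    \qquad
    \Lambda \equiv \Sigma^{-1} = \begin{pmatrix}\Lambda_{aa} & \Lambda_{ab}\\ \Lambda_{ba} & \Lambda_{bb}\end{pmatrix},

and by the block-inversion formula Λaa = (Σaa − Σab Σbb⁻¹ Σba)⁻¹, which is why Λaa is not simply Σaa⁻¹.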
Conditional Distribution

• It turns out that the conditional distribution is also a Gaussian distribution:

Linear function of xb.
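The standard result being referred to (a sketch):

    p(\mathbf{x}_a\mid\mathbf{x}_b) =
    \mathcal{N}\bigl(\mathbf{x}_a\mid\boldsymbol{\mu}_{a|b},\,\Lambda_{aa}^{-1}\bigr),
    \qquad
    \boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a - \Lambda_{aa}^{-1}\Lambda_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b)
    = \boldsymbol{\mu}_a + \Sigma_{ab}\Sigma_{bb}^{-1}(\mathbf{x}_b-\boldsymbol{\mu}_b),

so the conditional mean is a linear function of xb, while the conditional covariance Σa|b = Σaa − Σab Σbb⁻¹ Σba does not depend on xb.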
Marginal Distribution

• It turns out that the marginal distribution is also a Gaussian distribution:

• For a marginal distribution, the mean and covariance are most simply expressed in terms of the partitioned covariance matrix.
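In the same partitioned notation, the standard result is simply

    p(\mathbf{x}_a) = \int p(\mathbf{x}_a,\mathbf{x}_b)\,d\mathbf{x}_b
    = \mathcal{N}(\mathbf{x}_a\mid\boldsymbol{\mu}_a,\,\Sigma_{aa}).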
Conditional and Marginal Distributions
Maximum Likelihood Estimation

• Suppose we observed i.i.d. data

• We can construct the log-likelihood function, which is a function of µ and Σ:

• Note that the likelihood function depends on the N data points only through the following sums:            (Sufficient Statistics)
Maximum Likelihood Estimation

• To find a maximum likelihood estimate of the mean, we set the derivative of the log-likelihood function to zero:

Biased estimate
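A minimal NumPy sketch of the resulting estimators and of why the covariance estimate is biased; the dataset below is synthetic, and the standard results are µ_ML = (1/N) Σn xn and Σ_ML = (1/N) Σn (xn − µ_ML)(xn − µ_ML)ᵀ, whose expectation is ((N−1)/N) Σ:

```python
# ML estimates for a multivariate Gaussian from i.i.d. data (synthetic example).
import numpy as np

rng = np.random.default_rng(2)
N = 50
true_mu = np.array([1.0, -1.0])
true_Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=N)

mu_ml = X.mean(axis=0)                         # (1/N) sum_n x_n
centered = X - mu_ml
Sigma_ml = centered.T @ centered / N           # biased: E[Sigma_ml] = (N-1)/N * Sigma
Sigma_unbiased = centered.T @ centered / (N - 1)
```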
Infinite mixture of Gaussians, where

and where

• Properties:
Mixture of Gaussians

• When modeling real-world data, the Gaussian assumption may not be appropriate.

• Consider the following example: the Old Faithful dataset.

Component        Mixing coefficient        K=3
• Note that each Gaussian component has its own mean µk and covariance Σk. The parameters πk are called mixing coefficients.

• More generally, mixture models can comprise linear combinations of other distributions.
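A minimal sketch of evaluating a K = 3 mixture density p(x) = Σk πk N(x | µk, Σk); all parameter values below are invented for illustration:

```python
# Evaluate a mixture-of-Gaussians density at a set of 2-D points.
import numpy as np
from scipy.stats import multivariate_normal

pis = np.array([0.5, 0.3, 0.2])                 # mixing coefficients, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), np.array([[1.0, 0.8], [0.8, 1.0]])]

def mixture_density(x):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal(mu, Sigma).pdf(x)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))

print(mixture_density(np.array([[0.0, 0.0], [3.0, 3.0]])))   # density at two points
```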
Mixture of Gaussians

• Illustration of a mixture of 3 Gaussians in a 2-dimensional space:

(a) Contours of constant density of each of the mixture components, along with the mixing coefficients.

(b) Contours of the marginal probability density.
Exponential Family

where
- η is the vector of natural parameters,
- u(x) is the vector of sufficient statistics.

• The function g(η) can be interpreted as the coefficient that ensures that the distribution p(x|η) is normalized:
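The family in question has the standard form (a sketch; h(x) denotes the usual base-measure factor):

    p(\mathbf{x}\mid\boldsymbol{\eta}) =
    h(\mathbf{x})\,g(\boldsymbol{\eta})
    \exp\bigl\{\boldsymbol{\eta}^{\mathsf{T}}\mathbf{u}(\mathbf{x})\bigr\},
    \qquad
    g(\boldsymbol{\eta})\int h(\mathbf{x})
    \exp\bigl\{\boldsymbol{\eta}^{\mathsf{T}}\mathbf{u}(\mathbf{x})\bigr\}\,d\mathbf{x} = 1.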
Bernoulli Distribution

• The Bernoulli distribution is a member of the exponential family:

we see that

and so                Logistic sigmoid
Bernoulli Distribution

• The Bernoulli distribution is a member of the exponential family:

where
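Sketching the standard derivation behind these two slides:

    \mathrm{Bern}(x\mid\mu) = \mu^{x}(1-\mu)^{1-x}
    = (1-\mu)\exp\Bigl\{x\ln\frac{\mu}{1-\mu}\Bigr\},

so η = ln(µ / (1 − µ)) and hence µ = σ(η) = 1 / (1 + e^{−η}), the logistic sigmoid, with u(x) = x, h(x) = 1, and g(η) = σ(−η) = 1 − µ.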
Multinomial Distribution

• The Multinomial distribution is a member of the exponential family:

where

and

NOTE: The parameters ηk are not independent, since the corresponding µk must satisfy

• In some cases it will be convenient to remove the constraint by expressing the distribution over the M-1 parameters.
Multinomial Distribution

• The Multinomial distribution is a member of the exponential family:

• Let

and

where
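A sketch of the standard forms behind these two slides: with the redundant parameterization one can take ηk = ln µk, u(x) = x and g(η) = 1; removing the constraint Σk µk = 1 by working with M − 1 parameters gives

    \eta_k = \ln\frac{\mu_k}{1-\sum_{j=1}^{M-1}\mu_j},
    \qquad
    \mu_k = \frac{\exp(\eta_k)}{1+\sum_{j=1}^{M-1}\exp(\eta_j)},

i.e. the softmax relation between µ and η.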
Gaussian Distribution

• The Gaussian distribution can be written as:

where
ML for the Exponential Family

• Remember the Exponential Family:

• Thus
•
Thus
ML
for
the
ExponenAal
Family
•
Remember
the
ExponenAal
Family:
• Thus
•
Note
that
the
covariance
of
u(x)
can
be
expressed
in
terms
of
the
second
derivaAve
of
g(´),
and
similarly
for
the
higher
moments.
ML for the Exponential Family

• Suppose we observed i.i.d. data

• We can construct the log-likelihood function, which is a function of the natural parameter η.

Sufficient Statistic
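A sketch of the standard log-likelihood and its stationarity condition:

    \ln p(\mathcal{D}\mid\boldsymbol{\eta}) =
    \sum_{n=1}^{N}\ln h(\mathbf{x}_n) + N\ln g(\boldsymbol{\eta})
    + \boldsymbol{\eta}^{\mathsf{T}}\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n),
    \qquad
    -\nabla\ln g(\boldsymbol{\eta}_{\mathrm{ML}}) =
    \frac{1}{N}\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n),

so the maximum likelihood estimator depends on the data only through Σn u(xn), the sufficient statistic.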