lecture6
[Greg Shakhnarovich]
Bernoulli: a model for coins
A Bernoulli random variable (r.v.) X takes values in {0, 1}:

p(x \mid \theta) = \begin{cases} \theta & \text{if } x = 1 \\ 1 - \theta & \text{if } x = 0 \end{cases}
H(X) = -\sum_x p(x \mid \theta) \log p(x \mid \theta)
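The entropy formula above can be sketched numerically. This is a minimal illustration, not part of the original slides; the function name `bernoulli_entropy` is my own.

```python
import math

def bernoulli_entropy(theta):
    """Entropy H(X) = -sum_x p(x|theta) log p(x|theta) of a Bernoulli(theta) r.v."""
    if theta in (0.0, 1.0):
        return 0.0  # convention: 0 * log 0 = 0 (the limit)
    return -(theta * math.log(theta) + (1.0 - theta) * math.log(1.0 - theta))

print(bernoulli_entropy(0.5))  # fair coin: log 2, the maximum over theta
```

A biased coin (e.g. theta = 0.9) has strictly lower entropy than the fair coin, matching the intuition that its outcome is more predictable.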
g(\theta) = \frac{d}{d\theta} J(\theta) = \sum_{i=1}^{n} x_i (\pi_i - y_i) = X^T (\pi - y)
H = \frac{d}{d\theta} g(\theta) = \sum_i \pi_i (1 - \pi_i)\, x_i x_i^T = X^T \mathrm{diag}(\pi_i (1 - \pi_i))\, X

where \pi_i = \mathrm{sigm}(x_i \theta)
One can show that H is positive definite; hence the NLL is convex and
has a unique global minimum.
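A minimal NumPy sketch of the gradient and Hessian formulas above, assuming X is the n-by-d design matrix, y the 0/1 label vector, and theta the parameter vector (the helper names are my own):

```python
import numpy as np

def sigm(z):
    # logistic sigmoid
    return 1.0 / (1.0 + np.exp(-z))

def nll_grad_hess(theta, X, y):
    """Gradient and Hessian of the logistic-regression NLL:
    g = X^T (pi - y),  H = X^T diag(pi * (1 - pi)) X,  pi_i = sigm(x_i theta).
    """
    pi = sigm(X @ theta)
    g = X.T @ (pi - y)
    H = X.T @ np.diag(pi * (1.0 - pi)) @ X
    return g, H
```

Since every weight pi_i (1 - pi_i) is positive, H is positive (semi-)definite, which is the convexity claim above.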
g_k = X^T (\pi_k - y)
H_k = X^T S_k X
S_k := \mathrm{diag}(\pi_{1k}(1 - \pi_{1k}), \ldots, \pi_{nk}(1 - \pi_{nk}))
\pi_{ik} = \mathrm{sigm}(x_i \theta_k)

\theta_{k+1} = \theta_k - H_k^{-1} g_k
            = \theta_k + (X^T S_k X)^{-1} X^T (y - \pi_k)
            = (X^T S_k X)^{-1} \left[ (X^T S_k X)\, \theta_k + X^T (y - \pi_k) \right]
            = (X^T S_k X)^{-1} X^T \left[ S_k X \theta_k + y - \pi_k \right]
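The final form of the update can be sketched as an iteratively reweighted least squares (IRLS) loop. This is an illustrative implementation under my own naming; it assumes the data are not linearly separable (otherwise the MLE is at infinity and the iteration diverges):

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls(X, y, n_iter=20):
    """Newton / IRLS updates for logistic regression:
    theta_{k+1} = (X^T S_k X)^{-1} X^T [S_k X theta_k + y - pi_k].
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = sigm(X @ theta)
        S = np.diag(pi * (1.0 - pi))
        # solve the weighted normal equations rather than forming the inverse
        theta = np.linalg.solve(X.T @ S @ X, X.T @ (S @ (X @ theta) + y - pi))
    return theta
```

Solving the linear system with `np.linalg.solve` instead of computing (X^T S_k X)^{-1} explicitly is the standard numerically safer choice.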
Softmax formulation
Likelihood function
Negative log-likelihood criterion
Neural network representation of loss
Manual gradient computation
Next lecture
In the next lecture, we develop an automatic, layer-wise way of
computing all the necessary derivatives, known as back-propagation.
This is the approach used in Torch. We will review the Torch nn class.