Week 7
Mahmmoud Mahdi
Learning: Cross-Entropy Loss
Wait, where did the W’s come from?
Learning components
Maximize: the probability of the true labels in the training data, p(y|x) = ŷ^y (1 − ŷ)^(1−y)
Minimize: the cross-entropy loss, L_CE(ŷ, y) = −[ y log ŷ + (1 − y) log(1 − ŷ) ]
Let's see if this works for our sentiment example
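As a quick check, the loss should be small when the model's estimate ŷ is close to the true label and large when the model is confused. A minimal sketch (the probability estimates 0.7 and 0.3 are made-up illustrative values, not from the slides):

```python
import math

def cross_entropy_loss(y_hat, y):
    """Cross-entropy loss for a single binary example."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# True label is positive (y = 1)
print(cross_entropy_loss(0.7, 1))  # ~0.36 -- estimate close to correct: small loss
print(cross_entropy_loss(0.3, 1))  # ~1.20 -- model is confused: much bigger loss
```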
The gradient
(Figure: visualizing the gradient vector at the red point; it has two dimensions, shown in the x-y plane.)
Hyperparameters:
• Briefly, a special kind of parameter for an ML model
• Instead of being learned by the algorithm from supervision (like regular parameters), they are chosen by the algorithm designer (e.g., the learning rate η)
Stochastic Gradient Descent
An example and more details
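Before the worked example, here is a minimal sketch of the SGD loop for binary logistic regression. The function names, the epoch count, and the default learning rate of 0.1 are illustrative choices, not the slides' own pseudocode:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd(data, num_features, eta=0.1, epochs=10):
    """Stochastic gradient descent for binary logistic regression.

    data: list of (x, y) pairs, x a list of feature values, y in {0, 1}.
    Returns the learned weights w and bias b.
    """
    w = [0.0] * num_features
    b = 0.0
    for _ in range(epochs):
        random.shuffle(data)                 # visit training examples in random order
        for x, y in data:
            y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            error = y_hat - y                # (y_hat - y) term of the cross-entropy gradient
            for j in range(num_features):
                w[j] -= eta * error * x[j]   # step opposite the gradient
            b -= eta * error
    return w, b
```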
Working through an example
One step of gradient descent
A mini-sentiment example, where the true y=1 (positive)
Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
The update is θ^(t+1) = θ^(t) − η ∇L(f(x; θ), y), where the gradient of the cross-entropy loss is
∂L_CE/∂w_j = (σ(w · x + b) − y) x_j  and  ∂L_CE/∂b = σ(w · x + b) − y
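Plugging in the example numbers gives one concrete update step. The starting point (all weights and bias 0) and the learning rate η = 0.1 are assumptions for illustration, not stated above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Mini-sentiment example: true label y = 1, features x1 = 3, x2 = 2
x = [3.0, 2.0]
y = 1.0

# Assumed starting point and learning rate
w = [0.0, 0.0]
b = 0.0
eta = 0.1

y_hat = sigmoid(w[0] * x[0] + w[1] * x[1] + b)    # sigmoid(0) = 0.5
grad_w = [(y_hat - y) * xj for xj in x]           # [-1.5, -1.0]
grad_b = y_hat - y                                # -0.5

w = [wj - eta * gj for wj, gj in zip(w, grad_w)]  # [0.15, 0.10]
b = b - eta * grad_b                              # 0.05
print(w, b)
```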
Two training reviews, with features drawn from them:
+ "This movie drew me in, and it'll do the same to you."
− "I can't tell you how much I hated this movie. It sucked."

X1 = "this"
X2 = "movie"
X3 = "hated"
X4 = "drew me in"
X5 = "the same to you"
X7 = "tell you how much"

Features like X4, X5, and X7 are 4-gram features that just "memorize" the training set and might cause problems.
Overfitting
A 4-gram model on tiny data will just memorize the data
○ 100% accuracy on the training set
○ But it will be surprised by unseen 4-grams and fail to generalize to the test set
Regularization
A solution for overfitting
Add a regularization term R(θ) to the loss function (for now
written as maximizing logprob rather than minimizing loss)
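Written that way, the objective takes the following standard form (here α is the regularization weight; the notation is the usual one, not copied from the slides):

```latex
\hat{\theta} = \operatorname*{argmax}_{\theta}
  \sum_{i=1}^{m} \log P\big(y^{(i)} \mid x^{(i)}\big) \;-\; \alpha R(\theta)
```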