Week 7

The document discusses natural language processing and machine learning concepts such as cross-entropy loss, gradient descent, and regularization. It provides explanations of cross-entropy loss as a form of conditional maximum likelihood estimation. It also discusses how gradient descent can be used to minimize the loss by moving in the opposite direction of the steepest slope. Regularization techniques like L1 and L2 regularization are introduced to help reduce overfitting. Finally, it briefly introduces multinomial logistic regression for classification problems with more than two classes.


Natural Language Processing

Mahmmoud Mahdi
Learning: Cross-Entropy Loss
Wait, where did the W’s come from?

Learning components

1. A loss function:
● cross-entropy loss
2. An optimization algorithm:
● stochastic gradient descent
Intuition of negative log likelihood loss = cross-entropy loss

A case of conditional maximum likelihood estimation

We choose the parameters w, b that maximize (see the formula below)
• the log probability
• of the true y labels in the training data
• given the observations x
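
In symbols, over m training examples this objective can be written in standard conditional maximum likelihood estimation notation (a reconstruction, not reproduced from the slides):

$\hat{w}, \hat{b} = \operatorname*{arg\,max}_{w,b} \sum_{i=1}^{m} \log P\big(y^{(i)} \mid x^{(i)}\big)$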
Deriving cross-entropy loss for a single observation x

Goal: maximize the probability of the correct label, p(y|x), writing $\hat{y} = \sigma(w \cdot x + b)$ for the model's estimate.

Maximize:
$p(y|x) = \hat{y}^{\,y}\,(1-\hat{y})^{1-y}$

Now take the log of both sides (mathematically handy).

Maximize:
$\log p(y|x) = y \log \hat{y} + (1-y)\log(1-\hat{y})$

Whatever values maximize log p(y|x) will also maximize p(y|x).

Finally, flip the sign so we can minimize instead.

Minimize:
$L_{CE}(\hat{y}, y) = -\left[\,y \log \hat{y} + (1-y)\log(1-\hat{y})\,\right]$
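
To make this concrete, here is a minimal sketch of the loss in Python (not from the slides; `y_hat` stands for the model's estimate $\hat{y} = \sigma(w \cdot x + b)$):

```python
import math

def binary_cross_entropy(y_hat, y):
    """Cross-entropy loss for a single observation.
    y is the true label (0 or 1); y_hat is the model's
    estimate of P(y=1|x), i.e. sigmoid(w.x + b)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
```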
Let's see if this works for our sentiment example

We want the loss to be:
• smaller if the model estimate is close to correct
• bigger if the model is confused

Let's first suppose the true label of this review is y=1 (positive):


It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is great . Another nice
touch is the music . I was overcome with the urge to get off the couch and start
dancing . It sucked me in , and it'll do the same to you .
Let's see if this works for our sentiment example

True value is y=1. How well is our model doing? With the feature weights from our earlier sentiment example, the model's estimate is $\hat{y} = P(+|x) = \sigma(w \cdot x + b) = 0.70$.

Pretty well! What's the loss?
$L_{CE} = -\log \hat{y} = -\log(0.70) = 0.36$

Suppose the true value instead was y=0. What's the loss?
$L_{CE} = -\log(1 - \hat{y}) = -\log(0.30) = 1.2$

The loss when the model was right (if true y=1) is lower than the loss when the model was wrong (if true y=0): 0.36 < 1.2. Sure enough, loss was bigger when the model was wrong!
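
Checking these numbers with the `binary_cross_entropy` sketch from above:

```python
y_hat = 0.70                           # model's estimate of P(y=1|x)
print(binary_cross_entropy(y_hat, 1))  # ~0.36 -- true y=1, model right
print(binary_cross_entropy(y_hat, 0))  # ~1.20 -- true y=0, model wrong
```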
Stochastic Gradient Descent
Our goal: minimize the loss
Intuition of gradient descent
How do I get to the bottom of this river canyon?
• Look around me 360°
• Find the direction of steepest slope down
• Go that way
Our goal: minimize the loss

For logistic regression, the loss function is convex


• A convex function has just one minimum
• Gradient descent starting from any point is
guaranteed to find the minimum
• (Loss for neural networks is non-convex)
Let's first visualize for a single scalar w

Q: Given the current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function.

[Figure: the loss as a function of a single scalar w; at the current w the slope is negative, so we'll move positive (increase w).]
Gradients
The gradient of a function of many variables is a vector pointing in the direction of the greatest increase in the function.

Gradient Descent: Find the gradient of the loss function at the current point and move in the opposite direction.

How much do we move in that direction? That step size is controlled by the learning rate η (defined below).
Now let's consider N dimensions

● We want to know where in the N-dimensional space (of the N parameters that make up θ) we should move.
● The gradient is just such a vector; it expresses the directional components of the sharpest slope along each of the N dimensions.
Imagine 2 dimensions, w and b

[Figure: visualizing the gradient vector at the red point; it has two dimensions, shown in the x-y plane.]
The gradient

The final equation for updating θ based on the gradient is thus:

$\theta^{t+1} = \theta^t - \eta \, \nabla_{\theta} L(f(x;\theta), y)$
What are these partial derivatives for logistic regression?

The loss function:
$L_{CE}(\hat{y}, y) = -\left[\,y \log \sigma(w \cdot x + b) + (1-y)\log\big(1 - \sigma(w \cdot x + b)\big)\,\right]$

The elegant derivative of this function:
$\dfrac{\partial L_{CE}(\hat{y}, y)}{\partial w_j} = \big(\sigma(w \cdot x + b) - y\big)\, x_j = (\hat{y} - y)\, x_j$
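
As a sketch in Python (assuming plain lists for `w` and `x`; `sigmoid` is the standard logistic function):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(w, b, x, y):
    """Gradient of the cross-entropy loss for one example:
    dL/dw_j = (sigmoid(w.x + b) - y) * x_j, and dL/db is the
    same expression without the x_j factor."""
    y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    dw = [(y_hat - y) * xj for xj in x]
    db = y_hat - y
    return dw, db
```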


Hyperparameters
The learning rate η is a hyperparameter
○ too high: the learner will take big steps and overshoot
○ too low: the learner will take too long to converge

Hyperparameters:
• Briefly, a special kind of parameter for an ML model
• Instead of being learned by the algorithm from supervision (like regular parameters), they are chosen by the algorithm designer.
Stochastic Gradient Descent
An example and more details
Working through an example

One step of gradient descent
A mini-sentiment example, where the true y = 1 (positive)
Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)

Assume all 3 parameters (2 weights and 1 bias) in θ0 are zero:
w1 = w2 = b = 0
η = 0.1
Example of gradient descent: w1 = w2 = b = 0; x1 = 3; x2 = 2

The update step for θ is:
$\theta^{t+1} = \theta^t - \eta \, \nabla_{\theta} L(f(x;\theta), y)$

where
$\dfrac{\partial L_{CE}(\hat{y}, y)}{\partial w_j} = \big(\sigma(w \cdot x + b) - y\big)\, x_j$

The gradient vector has 3 dimensions:

$\nabla_{w,b} L = \begin{bmatrix} \frac{\partial L_{CE}}{\partial w_1} \\[2pt] \frac{\partial L_{CE}}{\partial w_2} \\[2pt] \frac{\partial L_{CE}}{\partial b} \end{bmatrix} = \begin{bmatrix} (\sigma(w \cdot x + b) - y)\, x_1 \\ (\sigma(w \cdot x + b) - y)\, x_2 \\ \sigma(w \cdot x + b) - y \end{bmatrix} = \begin{bmatrix} (\sigma(0) - 1) \cdot 3 \\ (\sigma(0) - 1) \cdot 2 \\ \sigma(0) - 1 \end{bmatrix} = \begin{bmatrix} -1.5 \\ -1.0 \\ -0.5 \end{bmatrix}$
Example of gradient descent

Now that we have a gradient, we compute the new parameter vector θ1 by moving θ0 in the opposite direction from the gradient (η = 0.1):

$\theta^1 = \theta^0 - \eta \, \nabla_{\theta} L = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} - 0.1 \begin{bmatrix} -1.5 \\ -1.0 \\ -0.5 \end{bmatrix} = \begin{bmatrix} 0.15 \\ 0.10 \\ 0.05 \end{bmatrix}$

Note that w1 (the positive-lexicon weight) moves up the most, since x1 had the largest value.
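
A sketch of this single update step, reusing the `gradient` function from above:

```python
w, b = [0.0, 0.0], 0.0         # theta^0: w1 = w2 = b = 0
x, y = [3.0, 2.0], 1           # x1 = 3, x2 = 2, true label positive
eta = 0.1

dw, db = gradient(w, b, x, y)  # dw = [-1.5, -1.0], db = -0.5
w = [wj - eta * dwj for wj, dwj in zip(w, dw)]
b = b - eta * db
print(w, b)                    # [0.15, 0.1] 0.05  -> theta^1
```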
Mini-batch training
Stochastic gradient descent chooses a single random example at a time.
That can result in choppy movements.
It is more common to compute the gradient over batches of training instances.
Batch training: the entire dataset
Mini-batch training: m examples (e.g., 512 or 1024)
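
A minimal sketch of a mini-batch training loop, assuming `data` is a list of `(x, y)` pairs and reusing the `gradient` function above; the batch size `m` is a hyperparameter:

```python
import random

def minibatch_sgd(data, w, b, eta=0.1, m=512, epochs=1):
    """Average per-example gradients over each mini-batch, then
    take one step. m = len(data) gives batch training; m = 1
    recovers plain stochastic gradient descent."""
    for _ in range(epochs):
        random.shuffle(data)
        for i in range(0, len(data), m):
            batch = data[i:i + m]
            grads = [gradient(w, b, x, y) for x, y in batch]
            avg_dw = [sum(dw[j] for dw, _ in grads) / len(batch)
                      for j in range(len(w))]
            avg_db = sum(db for _, db in grads) / len(batch)
            w = [wj - eta * g for wj, g in zip(w, avg_dw)]
            b -= eta * avg_db
    return w, b
```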
Regularization
Overfitting
A model that perfectly matches the training data has a problem.
It will also overfit to the data, modeling noise:
○ A random word that perfectly predicts y (it happens to occur in only one class) will get a very high weight.
○ The model will then fail to generalize to a test set that lacks this word.

A good model should be able to generalize


Overfitting

+ "This movie drew me in, and it'll do the same to you."
− "I can't tell you how much I hated this movie. It sucked."

Useful or harmless features:
X1 = "this", X2 = "movie", X3 = "hated", X4 = "drew me in"

4-gram features that just "memorize" the training set and might cause problems:
X5 = "the same to you", X7 = "tell you how much"
Overfitting
A 4-gram model on tiny data will just memorize the data
○ 100% accuracy on the training set

But it will be surprised by the novel 4-grams in the test data
○ Low accuracy on the test set

Models that are too powerful can overfit the data
● Fitting the details of the training data so exactly that the model doesn't generalize well to the test set
● How to avoid overfitting?
○ Regularization in logistic regression
○ Dropout in neural networks
Regularization
A solution for overfitting: add a regularization term R(θ) to the objective (for now written as maximizing log probability rather than minimizing loss):

$\hat{\theta} = \operatorname*{arg\,max}_{\theta} \sum_{i=1}^{m} \log P\big(y^{(i)} \mid x^{(i)}\big) - \alpha R(\theta)$

Idea: choose an R(θ) that penalizes large weights
○ Fitting the data well with lots of big weights is not as good as fitting the data a little less well with small weights.
L2 Regularization (= ridge regression)
The sum of the squares of the weights:
$R(\theta) = \|\theta\|_2^2 = \sum_{j=1}^{n} \theta_j^2$

The name comes from the (square of the) L2 norm $\|\theta\|_2$, the Euclidean distance of θ from the origin.

L2-regularized objective function:
$\hat{\theta} = \operatorname*{arg\,max}_{\theta} \left[ \sum_{i=1}^{m} \log P\big(y^{(i)} \mid x^{(i)}\big) \right] - \alpha \sum_{j=1}^{n} \theta_j^2$


L1 Regularization (= lasso regression)
The sum of the absolute values of the weights:
$R(\theta) = \|\theta\|_1 = \sum_{j=1}^{n} |\theta_j|$

Named after the L1 norm $\|\theta\|_1$: the sum of the absolute values of the weights, i.e., the Manhattan distance from the origin.

L1-regularized objective function:
$\hat{\theta} = \operatorname*{arg\,max}_{\theta} \left[ \sum_{i=1}^{m} \log P\big(y^{(i)} \mid x^{(i)}\big) \right] - \alpha \sum_{j=1}^{n} |\theta_j|$
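
A sketch of both penalties in Python; in the minimization form they are simply added to the loss, with `alpha` controlling regularization strength:

```python
def l2_penalty(theta, alpha):
    # ridge: alpha times the sum of squared weights
    return alpha * sum(t * t for t in theta)

def l1_penalty(theta, alpha):
    # lasso: alpha times the sum of absolute values of the weights
    return alpha * sum(abs(t) for t in theta)

# e.g. regularized loss for one example (minimization form):
# loss = binary_cross_entropy(y_hat, y) + l2_penalty(w, alpha)
```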


Multinomial Logistic Regression
Multinomial Logistic Regression
Often we need more than 2 classes
○ Positive/negative/neutral
○ Parts of speech (noun, verb, adjective, adverb, preposition, etc.)
○ Classifying emergency SMSs into different actionable classes

If >2 classes, we use multinomial logistic regression
= softmax regression
= multinomial logit
= (older, now-defunct names: maximum entropy modeling, or MaxEnt)

So "logistic regression" will just mean binary (2 output classes).
Multinomial Logistic Regression

The probability of everything must still sum to 1


P(positive|doc) + P(negative|doc) + P(neutral|doc) = 1

Need a generalization of the sigmoid called the softmax

● Takes a vector z = [z1, z2, ..., zk] of k arbitrary values


● Outputs a probability distribution
○ each value in the range [0,1]
○ all the values summing to 1
The softmax function

Turns a vector z = [z1, z2, ..., zk] of k arbitrary values into probabilities:

$\text{softmax}(z_i) = \dfrac{\exp(z_i)}{\sum_{j=1}^{k} \exp(z_j)} \quad 1 \le i \le k$
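
A minimal sketch in Python; subtracting `max(z)` before exponentiating is a standard numerical-stability trick that doesn't change the result:

```python
import math

def softmax(z):
    """Turn a vector of k arbitrary scores into a probability
    distribution: each value in [0,1], all summing to 1."""
    exps = [math.exp(zi - max(z)) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([0.6, 1.1, -1.5, 1.2, 3.2, -1.1]))  # sums to 1.0
```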
Softmax in multinomial logistic regression

The input is still the dot product between a weight vector w and the input vector x, but now we need separate weight vectors (and a bias) for each of the k classes:

$p(y = c \mid x) = \dfrac{\exp(w_c \cdot x + b_c)}{\sum_{j=1}^{k} \exp(w_j \cdot x + b_j)}$
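
A sketch under the assumption that the k weight vectors are stacked into a hypothetical matrix `W` (one row `W[c]` and one bias `b[c]` per class), reusing `softmax` from above:

```python
def predict_multinomial(W, b, x):
    """P(y=c|x) = softmax over the k dot products W[c].x + b[c]."""
    z = [sum(wj * xj for wj, xj in zip(W[c], x)) + b[c]
         for c in range(len(W))]
    return softmax(z)

# hypothetical 3-class (pos/neg/neutral) model over 2 features:
probs = predict_multinomial([[2.0, -1.0], [-1.5, 1.0], [0.1, 0.1]],
                            [0.0, 0.0, 0.0], [3.0, 2.0])
```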
Features in binary versus multinomial logistic regression

Binary: a positive weight pushes toward y=1, a negative weight pushes toward y=0; each feature gets a single weight (e.g., w5 = 3.0).

Multinomial: separate weights for each class.
