Linear Regression and Gradient Descent (1)
Figure 1. Car heaviness (in pounds) versus miles per gallon rating. As a car gets
heavier, its miles per gallon rating generally decreases.
• We could create our own model by drawing a best fit
line through the points:
Figure 2. A best fit line drawn through the data from the
previous figure.
Figure 4. Using the model, a 4,000-pound car has a predicted fuel efficiency of 15.6
miles per gallon.
By graphing some of these additional features, we can see that they also have a linear relationship
to the label, miles per gallon:
Figure 6. A car's displacement in cubic centimeters and its miles per gallon rating. As a car's
engine gets bigger, its miles per gallon rating generally decreases.
Figure 7. A car's acceleration and its miles per gallon rating. As a car's acceleration takes
longer, the miles per gallon rating generally increases.
Figure 8. A car's horsepower and its miles per gallon rating. As a car's horsepower
increases, the miles per gallon rating generally decreases.
LINEAR REGRESSION LOSS
• Loss is a numerical metric that describes how wrong a
model's predictions are. Loss measures the distance
between the model's predictions and the actual labels.
The goal of training a model is to minimize the loss,
reducing it to its lowest possible value.
• In the following image, you can visualize loss as arrows
drawn from the data points to the model. The arrows show
how far the model's predictions are from the actual values.
Figure 9. Loss is measured from the actual value to the predicted
value.
The functional difference between L1 loss and L2 loss (or between MAE and MSE) is squaring. When the
difference between the prediction and label is large, squaring makes the loss even larger. When the
difference is small (less than 1), squaring makes the loss even smaller.
When processing multiple examples at once, we recommend averaging the losses across all the
examples, whether using MAE or MSE.
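As a concrete sketch of the squaring difference, the following computes MAE (averaged L1 loss) and MSE (averaged L2 loss) for a small batch; the prediction and label values are illustrative, not taken from the car dataset above:

```python
# Minimal sketch: MAE (averaged L1 loss) and MSE (averaged L2 loss) for a batch.
# The prediction/label values below are illustrative, not from the car dataset.

def mae(predictions, labels):
    """Mean absolute error: average of |prediction - label|."""
    return sum(abs(p - y) for p, y in zip(predictions, labels)) / len(labels)

def mse(predictions, labels):
    """Mean squared error: average of (prediction - label) squared."""
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)

predictions = [20.0, 18.0, 30.0]  # model outputs (mpg)
labels      = [19.0, 17.5, 40.0]  # actual values; the last example is far off

print(mae(predictions, labels))  # ~3.83 -- the large error contributes 10
print(mse(predictions, labels))  # 33.75 -- squaring makes the large error dominate
```

Note how the single large error dominates the MSE but not the MAE; this asymmetry is what drives the choice of loss discussed next.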
CHOOSING A LOSS
• Deciding whether to use MAE or MSE can depend on the dataset and the way you want to handle
certain predictions. Most feature values in a dataset typically fall within a distinct range. For example,
cars normally weigh between 2,000 and 5,000 pounds and get between 8 and 50 miles per gallon. An 8,000-
pound car, or a car that gets 100 miles per gallon, is outside the typical range and would be considered
an outlier.
• An outlier can also refer to how far off a model's predictions are from the real values. For instance,
3,000 pounds is within the typical car-weight range, and 40 miles per gallon is within the typical fuel-
efficiency range. However, a 3,000-pound car that gets 40 miles per gallon would be an outlier in terms
of the model's prediction because the model would predict that a 3,000-pound car would get between
18 and 20 miles per gallon.
• When choosing the best loss function, consider how you want the model to treat outliers. For instance,
MSE moves the model more toward the outliers, while MAE doesn't. L2 loss incurs a much higher
penalty for an outlier than L1 loss. For example, the following images show a model trained using MAE
and a model trained using MSE. The red line represents a fully trained model that will be used to make
predictions. The outliers are closer to the model trained with MSE than to the model trained with MAE.
Figure 10. A model trained with MSE moves the model closer to the
outliers.
Figure 11. A model trained with MAE is farther from the
outliers.
Note the relationship between the model and the data:
• MSE. The model is closer to the outliers but farther away from most of the other data points.
• MAE. The model is farther away from the outliers but closer to most of the other data points.
GRADIENT DESCENT
• Gradient descent is a mathematical technique that iteratively
finds the weights and bias that produce the model with the
lowest loss. Gradient descent finds the best weight and bias by
repeating the following process for a number of user-defined
iterations.
• The model begins training with randomized weights and biases
near zero, and then repeats the following steps:
1. Calculate the loss with the current weight and bias.
2. Determine the direction in which to move the weights and bias to reduce loss.
3. Move the weight and bias values a small amount in the direction that reduces loss.
4. Return to step one and repeat the process until the model can't reduce the loss any further.
The diagram outlines the iterative steps gradient descent performs to find the weights and bias
that produce the model with the lowest loss.
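The steps above can also be written as a short training loop. This is a minimal sketch for a one-feature model trained with MSE; the toy data, learning rate, and iteration count are illustrative assumptions, not the values behind the figures:

```python
# Minimal gradient-descent sketch for a one-feature linear model (y = w*x + b)
# trained with MSE. Data, learning rate, and iteration count are illustrative.

xs = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]        # e.g., car weight in thousands of pounds
ys = [28.0, 25.0, 20.0, 18.0, 15.0, 12.0]  # miles per gallon

w, b = 0.0, 0.0        # start with weight and bias near zero
learning_rate = 0.05   # size of each small step
iterations = 1000      # user-defined number of iterations
n = len(xs)

for step in range(iterations):
    # 1. Calculate the loss (MSE) with the current weight and bias.
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    loss = sum(e ** 2 for e in errors) / n

    # 2. Determine the direction that reduces loss: the negative gradient.
    dw = sum(2 * e * x for e, x in zip(errors, xs)) / n
    db = sum(2 * e for e in errors) / n

    # 3. Move the weight and bias a small amount in that direction.
    w -= learning_rate * dw
    b -= learning_rate * db
    # 4. Repeat until the iteration budget is exhausted.

print(w, b, loss)  # final parameters and the last computed loss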
MODEL CONVERGENCE AND LOSS CURVES
• When training a model, you'll often look at a loss curve
to determine if the model has converged. The loss
curve shows how the loss changes as the model trains.
The following is what a typical loss curve looks like. Loss
is on the y-axis and iterations are on the x-axis:
Figure 13. Loss curve showing the model converging around the 1,000th-
iteration mark.
• You can see that loss dramatically decreases during the first few iterations,
then gradually decreases before flattening out around the 1,000th-iteration
mark. After 1,000 iterations, we can be mostly certain that the model has
converged.
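In code, "mostly certain that the model has converged" often translates into a simple check on the loss history: stop once the loss has stopped improving by more than a small tolerance. The following is a sketch with a synthetic, illustrative loss curve; the tolerance, patience, and numbers are assumptions, not values from the figures:

```python
# Sketch of a convergence check based on the loss curve: training can stop
# once the loss stops improving by more than a small tolerance.

def has_converged(loss_history, tolerance=1e-3, patience=10):
    """True when loss improved by less than `tolerance` over the
    last `patience` recorded iterations."""
    if len(loss_history) < patience + 1:
        return False
    return loss_history[-patience - 1] - loss_history[-1] < tolerance

# Synthetic loss curve: steep drop early, then a long flat tail.
losses = [100 * 0.99 ** i + 5.54 for i in range(2000)]

for i in range(len(losses)):
    if has_converged(losses[: i + 1]):
        print(f"converged around iteration {i}")  # flattens near the 900 mark here
        break
```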
• In the following figures, we draw the model at three points during the training
process: the beginning, the middle, and the end. Visualizing the model's state
at snapshots during the training process solidifies the link between updating
the weights and bias, reducing loss, and model convergence.
• In the figures, we use the derived weights and bias at a particular iteration to
represent the model. In the graph with the data points and the model
snapshot, blue loss lines from the model to the data points show the amount
of loss. The longer the lines, the more loss there is.
• In the following figure, we can see that around the second iteration the model
would not be good at making predictions because of the high amount of loss.
Figure 14. Loss curve and snapshot of the model at the beginning of the
training process.
At around the 400th iteration, we can see that gradient
descent has found the weight and bias that produce a
better model.
Figure 15. Loss curve and snapshot of model about midway through
training.
And at around the 1,000th iteration, we can see that the
model has converged, producing a model with the lowest
possible loss.
Figure 16. Loss curve and snapshot of the model near the end of the training
process.
CONVERGENCE AND CONVEX FUNCTIONS
• The loss functions for linear models always produce a
convex surface. As a result of this property, when a
linear regression model converges, we know the model
has found the weights and bias that produce the lowest
loss.
• If we graph the loss surface for a model with one
feature, we can see its convex shape. The following is
the loss surface of the miles per gallon dataset used in
the previous examples. Weight is on the x-axis, bias is
on the y-axis, and loss is on the z-axis:
Figure 17. Loss surface that shows its convex
shape.
In this example, a weight of -5.44 and bias of 35.94 produce
the lowest loss at 5.54:
Figure 18. Loss surface showing the weight and bias values that produce the lowest loss.
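The convex shape can also be seen without plotting, by evaluating the loss over a grid of weight and bias values and locating the bottom of the bowl. This sketch uses illustrative toy data (not the dataset behind the figures), so the minimum it finds won't match the -5.44 / 35.94 values above:

```python
# Sketch: evaluating MSE over a grid of (weight, bias) values reveals the
# convex, bowl-shaped loss surface. The toy data are illustrative.

xs = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]        # e.g., car weight in thousands of pounds
ys = [28.0, 25.0, 20.0, 18.0, 15.0, 12.0]  # miles per gallon

def mse(w, b):
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Coarse grid search over weight and bias; the true surface is continuous.
best = min(
    ((mse(w / 10, b / 10), w / 10, b / 10)
     for w in range(-100, 1)     # weights from -10.0 to 0.0 in steps of 0.1
     for b in range(0, 501)),    # biases from 0.0 to 50.0 in steps of 0.1
    key=lambda t: t[0],
)
print(best)  # (lowest loss, weight, bias) at the bottom of the bowl
```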
A linear model converges when it's found the minimum loss. Therefore, additional
iterations only cause gradient descent to move the weight and bias values in very small
amounts around the minimum. If we graphed the weights and bias points during gradient
descent, the points would look like a ball rolling down a hill, finally stopping at the point
where there's no more downward slope.
Figure 19. Loss graph showing gradient descent points stopping at the lowest point on the graph.
Notice that the black loss points create the exact shape of
the loss curve: a steep decline before gradually sloping
down until they've reached the lowest point on the loss
surface.
It's important to note that the model almost never finds the
exact minimum for each weight and bias, but instead finds
a value very close to it. It's also important to note that the
minimum for the weights and bias doesn't correspond to zero
loss, only to the values that produce the lowest loss for those
parameters.
Using the weight and bias values that produce the lowest
loss—in this case a weight of -5.44 and a bias of 35.94—we
can graph the model to see how well it fits the data:
Figure 20. Model graphed using the weight and bias values that produce the lowest loss.
This would be the best model for this dataset because no other weight and bias values produce
a model with lower loss.
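As a small worked example of applying those values, and assuming the weight feature is expressed in thousands of pounds (consistent with the 3,000-pound / 18-20 mpg example earlier), predictions come straight from the linear equation:

```python
# Applying the trained model: predicted mpg = weight_parameter * feature + bias.
# Assumes the car-weight feature is measured in thousands of pounds.

w, b = -5.44, 35.94  # values that produce the lowest loss, from the text above

def predict_mpg(weight_thousands_lb):
    return w * weight_thousands_lb + b

print(predict_mpg(3.0))  # ~19.6 mpg, inside the 18-20 mpg range cited earlier
```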