Loss Functions Explained
Explore the crucial role of loss functions in machine learning with our
comprehensive guide. Understand the difference between loss and cost
functions, delve into various types like MSE and MAE, and learn their
applications in ML tasks.
Nov 2023 · 20 min read
CONTENTS
Loss Functions in Brief
What is a Loss Function?
How Loss Functions Work
Types of Loss Functions
Loss Functions for Regression
Loss Functions for Classification
Choosing the Right Loss Function
Implementing Loss Functions
Conclusion
The loss function plays a crucial role in the training of machine learning models: it quantifies the error of each prediction and provides the signal by which the learning algorithm updates the model's weights and parameters.
Often, the terms loss function and cost function are used interchangeably; despite this, the two have distinct definitions: a loss function quantifies the error of a single prediction, while a cost function aggregates (typically averages) the loss over the entire training dataset.
As mentioned earlier, the loss function, also known as the error function, quantifies how closely a single prediction of the machine learning algorithm matches the actual target value. The key takeaway is that a loss function applies to a single training example; as part of the model's overall learning process, it provides the signal by which the learning algorithm updates the weights and parameters.
When exploring loss functions, machine learning algorithms, and the learning process within neural networks, the topic of Empirical Risk Minimization (ERM) comes up. ERM is an approach to selecting the optimal parameters of a machine learning algorithm by minimizing the empirical risk. The empirical risk, in this case, is the average loss measured over the training dataset.
The risk minimization component of ERM is the process by which the learning algorithm minimizes the prediction error of the machine learning algorithm on a known dataset, with the expectation that the model will achieve comparable performance and accuracy on unseen datasets or data samples drawn from a statistical distribution similar to that of the training data.
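Formally, given a loss function L, a model f, and n training samples (xᵢ, yᵢ), the empirical risk is the average loss over the training set (the notation here is assumed, chosen to match the article's other formulas):
R̂(f) = (1/n) * Σ L(f(xᵢ), yᵢ)
ERM then selects the model parameters that minimize R̂(f).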
Types of Loss Functions
Loss functions in machine learning can be categorized based on the machine
learning tasks to which they are applicable. Most loss functions apply to
regression and classification machine learning problems. The model is
expected to predict continuous output values for regression machine learning
tasks. In contrast, the model is expected to provide discrete labels
corresponding to a dataset class for classification tasks.
Below are standard loss functions and their classification into machine
learning problems they lend themselves well to. Most of these loss functions
are covered in detail later in this article.
| Loss Function | Applicability to Classification | Applicability to Regression |
| --- | --- | --- |
| Hinge Loss | ✔️ | ✖️ |
| Log Loss | ✔️ | ✖️ |
| Mean Squared Error (MSE) | ✖️ | ✔️ |
| Mean Absolute Error (MAE) | ✖️ | ✔️ |
| Huber Loss | ✖️ | ✔️ |
Loss Functions for Regression
Mean Squared Error (MSE)
The Mean Squared Error (MSE) or L2 loss is a loss function that quantifies the magnitude of the error between a machine learning algorithm's prediction and the actual output by taking the average of the squared differences between the predictions and the target values. Squaring the differences assigns a higher penalty to larger deviations from the target value. Taking the mean normalizes the total error by the number of samples in the dataset.
The mathematical equation for Mean Squared Error (MSE) or L2 Loss is:
MSE = (1/n) * Σ(yᵢ - ŷᵢ)²
Where:
- n is the number of samples in the dataset
- yᵢ is the actual target value of the i-th sample
- ŷᵢ is the model's prediction for the i-th sample
Mean Absolute Error (MAE)
Mean Absolute Error (MAE), also known as L1 Loss, is a loss function used in
regression tasks that calculates the average absolute differences between
predicted values from a machine learning model and the actual target values.
Unlike Mean Squared Error (MSE), MAE does not square the differences,
treating all errors with equal weight regardless of their magnitude.
The mathematical equation for Mean Absolute Error (MAE) or L1 Loss is:
MAE = (1/n) * Σ|yᵢ - ŷᵢ|
Where:
- n is the number of samples in the dataset
- yᵢ is the actual target value of the i-th sample
- ŷᵢ is the model's prediction for the i-th sample
MAE measures the average absolute difference between the predicted and actual values. Because it does not square the differences, MAE assigns equal weight to all errors regardless of their magnitude, which makes it inherently less sensitive to outliers than Mean Squared Error (MSE).
This means that while an outlier can significantly skew the MSE by
contributing a disproportionately large error when squared, its impact on MAE
is much more contained. An outlier's influence on the overall error metric is
minimal when using MAE as a loss function. In contrast, MSE amplifies the
effect of outliers due to the squaring of error terms, affecting the model's error
estimation more substantially.
MAE notably applies a uniform error weighting to all data points; in a scenario such as predicting delivery times, heavily penalizing outlier data points (as MSE does) could result in over-estimating or under-estimating typical delivery times.
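To make this difference concrete, here is a minimal sketch (an illustration, not from the original article) that computes MSE and MAE over the same predictions, with and without a single outlier in the targets:

def mse(actual, predicted):
    # Average of squared differences
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    # Average of absolute differences
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Identical predictions; the second target list contains one outlier
y_pred = [1.0, 2.0, 3.0, 4.0, 5.0]
y_clean = [1.1, 2.1, 2.9, 4.1, 5.0]
y_outlier = [1.1, 2.1, 2.9, 4.1, 15.0]  # last target is an outlier

print(mse(y_clean, y_pred), mae(y_clean, y_pred))      # ≈ 0.008, 0.08
print(mse(y_outlier, y_pred), mae(y_outlier, y_pred))  # ≈ 20.008, 2.08

A single outlier inflates MSE by a factor of roughly 2,500 here, while MAE grows only about 26-fold.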
Huber Loss
Huber Loss, or Smooth Mean Absolute Error, is a loss function that combines the advantageous characteristics of the Mean Absolute Error and Mean Squared Error loss functions into a single loss function. The hybrid nature of Huber Loss makes it less sensitive to outliers, just like MAE, while still penalizing small errors quadratically, similar to MSE. Huber Loss is also used in regression machine learning tasks.
The mathematical equation for Huber Loss is:
L(y, f(x)) = (1/2) * (f(x) - y)², if |f(x) - y| ≤ δ
L(y, f(x)) = δ * |f(x) - y| - (1/2) * δ², otherwise
Where:
- y is the actual target value
- f(x) is the model's predicted value
- δ (delta) is the threshold separating the quadratic and linear components
The Huber Loss function effectively combines two components for handling
errors differently, with the transition point between these components
determined by the threshold δ:
Quadratic Component for Small Errors: For errors smaller than δ, it uses the
quadratic component (1/2) * (f(x) - y)^2
Linear Component for Large Errors: For errors larger than δ, it applies the
linear component δ * |f(x) - y| - (1/2) * δ^2
Huber loss operates in two modes that are switched based on the size of the
calculated difference between the actual target value and the prediction of the
machine learning algorithm. The key term within Huber Loss is delta (δ). Delta
is a threshold that determines the numerical boundary at which the Huber
Loss utilizes the quadratic application of loss or linear calculation.
If the calculated error, the difference between the actual and predicted values, is larger than delta, Huber Loss applies the linear calculation of loss, similar to MAE. This reduced sensitivity to the error size ensures the trained model doesn't over-penalize large errors, which is especially important when the dataset contains outliers or unlikely-to-occur data samples.
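Below is a minimal plain-Python sketch of this two-mode behavior (an illustration, not from the original article; the function and variable names are assumptions):

def huber_loss(actual, predicted, delta=1.0):
    """Average Huber loss with threshold delta (δ)."""
    total = 0.0
    for a, p in zip(actual, predicted):
        error = abs(a - p)
        if error <= delta:
            # Quadratic component for small errors, as with MSE
            total += 0.5 * error ** 2
        else:
            # Linear component for large errors, as with MAE
            total += delta * error - 0.5 * delta ** 2
    return total / len(actual)

# The large error on the last sample is penalized linearly, not quadratically
print(huber_loss([1, 2, 3], [1.1, 2.0, 6.0]))  # ≈ 0.835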
Loss Functions for Classification
Binary Cross-Entropy Loss / Log Loss
The mathematical equation for Binary Cross-Entropy Loss (BCE), also known as Log Loss, is:
L(y, f(x)) = -[y * log(f(x)) + (1 - y) * log(1 - f(x))]
Where:
- y is the actual binary label (1 for the positive class, 0 for the negative class)
- f(x) is the model's predicted probability that the sample belongs to the positive class
As the negative symbol '-' in the equation indicates, BCE calculates the loss as the negative of two terms (and, for several predictions or data samples, the average of the negative of these two terms):
1. The logarithm of the model's predicted probability when the positive class is present: y * log(f(x))
2. The logarithm of one minus the predicted probability when the negative class is present: (1 - y) * log(1 - f(x))
The BCE loss function heavily penalizes inaccurate predictions, i.e., predicted probabilities that diverge significantly from the actual class and therefore have high cross-entropy. When BCE is utilized as a component within a learning algorithm, this encourages the model to refine the probabilities it predicts for each class during training.
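The following minimal sketch (an illustration, not from the original article) computes the average BCE for a batch of predictions, clipping probabilities away from 0 and 1 to avoid taking log(0):

import math

def binary_cross_entropy(actual, predicted, eps=1e-12):
    """Average BCE; eps keeps probabilities strictly inside (0, 1)."""
    total = 0.0
    for y, p in zip(actual, predicted):
        p = min(max(p, eps), 1 - eps)  # clip for numerical stability
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(actual)

# Confident, correct predictions yield low loss; less confident ones raise it
print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.7]))  # ≈ 0.228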
Hinge Loss
Hinge loss is a classification loss most commonly associated with support vector machines (SVMs). It penalizes predictions that fall on the wrong side of the decision boundary, as well as correct predictions that lie within the margin, encouraging the classifier to separate the classes with as wide a margin as possible.
The mathematical equation for Hinge Loss is:
L(y, f(x)) = max(0, 1 - y * f(x))
Where:
- y is the actual class label, encoded as -1 or +1
- f(x) is the raw (unthresholded) output of the classifier
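A minimal sketch of hinge loss (an illustration, not from the original article; labels are assumed to be encoded as -1/+1):

def hinge_loss(actual, predicted):
    """Average hinge loss over samples with -1/+1 labels."""
    return sum(max(0, 1 - y * f) for y, f in zip(actual, predicted)) / len(actual)

# The misclassified last sample (y=+1, f(x)=-0.3) contributes the largest term
print(hinge_loss([1, -1, 1], [0.8, -1.5, -0.3]))  # ≈ 0.5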
Choosing the Right Loss Function
When appropriately selected, the loss function enables the learning algorithm to converge effectively to an optimal loss during training and to generalize well to unseen data samples. An appropriately selected loss function acts as a guide, steering the learning algorithm towards accuracy and reliability, ensuring that it captures the underlying patterns in the data while avoiding overfitting or underfitting.
Several factors influence the choice of loss function:

| Factor | Description |
| --- | --- |
| Type of Learning Problem | Classification vs regression; binary vs multiclass classification. |
| Model Sensitivity to Outliers | Some loss functions are more sensitive to outliers (e.g., MSE), while others are more robust (e.g., MAE). |
| Desired Model Behavior | Influences how the model behaves, e.g., hinge loss in SVMs focuses on maximizing the margin. |
| Computational Efficiency | Some loss functions are more computationally intensive, impacting the choice based on available resources. |
| Convergence Properties | The smoothness and convexity of a loss function can affect the ease and speed of training. |
| Scale of the Task | For large-scale tasks, a loss function that scales well and can be efficiently optimized is crucial. |
Outliers are data samples that fall outside the overall statistical distribution of
a dataset; they are sometimes referred to as anomalies or irregularities. How
outliers are managed determines the performance and accuracy of the trained
machine learning model.
As mentioned earlier, outliers in a dataset affect the error values used by loss functions, to a degree that depends on the loss function chosen. This effect propagates through the learning process of the machine learning algorithm and can lead to intended or unintended behavior in the resulting model.
For example, mean squared error heavily penalizes outliers because they contribute large error terms; during training, the model's weights are adjusted to accommodate these outliers. If this isn't the intended behavior of the machine learning model, the finalized model will generalize poorly to unseen data. For scenarios where mitigating the impact of outliers is required, loss functions such as MAE and Huber loss are more applicable.
"""
Calculate the Mean Absolute Error between actual and predicted values
"""
raise ValueError("The length of actual values and predicted values must be the same")
return mae
# Example usage:
# actual values
# predicted values
# Calculate MAE
mae_value = mean_absolute_error(y_true, y_pred)
print(mae_value)
# 0.5
"""
Calculate the Mean Squared Error between actual and predicted values
"""
if len(actual) != len(predicted):
raise ValueError("The length of actual values and predicted values must be the same")
return mse
# Example usage:
# actual values
y_true = [1, 2, 3, 4, 5]
# predicted values
# Calculate MSE
print(mse_value)
# 0.015999999999999993
The use of libraries for loss function implementation
Utilizing these deep learning libraries provides advantages over pure Python
implementations, some of which are:
Ease of use
Efficiency and optimization
GPU and parallel computing support
Developer community support
Mean Absolute Error (MAE) using the scikit-learn library

from sklearn.metrics import mean_absolute_error

# actual values
y_true = [1, 2, 3, 4, 5]
# predicted values (assumed here; the article's original values were not preserved)
y_pred = [1.5, 2.5, 3.5, 4.5, 5.5]

# Calculate MAE
mae_value = mean_absolute_error(y_true, y_pred)
print(mae_value)
# 0.5
Mean Squared Error (MSE) using the scikit-learn library

from sklearn.metrics import mean_squared_error

# actual values
y_true = [1, 2, 3, 4, 5]
# predicted values (assumed here; the article's original values were not preserved)
y_pred = [0.9, 2.1, 3.2, 3.9, 5.1]

# Calculate MSE
mse_value = mean_squared_error(y_true, y_pred)
print(mse_value)
# ≈ 0.016
Conclusion
In summary, choosing the right loss function is crucial for effective machine
learning model training. This article highlighted key loss functions, their roles
in machine learning algorithms, and their suitability for different tasks. From
Mean Squared Error (MSE) to Huber Loss, each function has its unique
advantages, whether it's handling outliers or balancing bias and variance.
The decision to use custom or pre-built loss functions from libraries like Scikit-learn, TensorFlow, and PyTorch hinges on specific project needs,
computational efficiency, and user expertise. These libraries offer ease of
implementation, ongoing community support, and regular updates.