Unit-2
Shallow Neural Networks
Contents
• Neural Architectures for Binary Classification Models,
• Neural Architectures for Multiclass Models,
• Autoencoder: Basic Principles,
• Neural embedding with continuous bag of words,
• Simple neural architectures for graph embeddings
Types of Loss functions
Mean Square Error:
• Applicable to tasks: Regression, Classification
Mean Absolute Error:
• Applicable to tasks: Regression
Cross Entropy:
• Applicable to tasks: Classification
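As a concrete illustration of the three loss functions listed above, here is a minimal NumPy sketch; the function names and toy values are our own, not from any particular library:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute differences."""
    return np.mean(np.abs(y_true - y_pred))

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```

Note that cross-entropy compares a label against a predicted *probability*, which is why it pairs naturally with classification.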
How to choose the loss function?
Introduction
• Conventional machine learning often uses optimization and gradient-descent
methods for learning parameterized models.
Ex: linear regression, support vector machines, logistic regression,
dimensionality reduction, and matrix factorization.
• Neural networks are also parameterized models that are learned with continuous
optimization methods.
As the amount of data increases, neural networks have an advantage because they retain the flexibility to model more complex functions with the addition of neurons to the computational graph.
Introduction
• Two classes of models for machine learning:
1. Supervised models:
• linear models and their variants.
Ex:least-squares regression, support vector machines, and logistic regression.
• Multiclass variants of these models
2. Unsupervised models:
• dimensionality reduction and matrix factorization. Traditional methods like principal
component analysis can also be presented as simple neural network architectures.
• The neural network framework also provides a way of understanding the relationships between widely different unsupervised methods like linear dimensionality reduction, nonlinear dimensionality reduction, and sparse feature learning.
Neural Architectures for Binary Classification Models
• The corresponding neural architectures of machine learning models, such
as Least-squares regression and Classification, are minor variations of the
perceptron model.
• The main difference is in
• the choice of the activation function used in the final layer,
• the loss function used on these outputs.
• In a single-layer network with d input nodes and a single output node, the coefficients of the connections from the d input nodes to the output node are denoted by W = (w_1 . . . w_d).
Neural Architectures for Binary Classification Models
1. Revisiting Perceptron
• Let (X_i, y_i) be a training instance, in which the observed value y_i is predicted from the feature variables X_i using the following relationship:

ŷ_i = sign(W · X_i)

• Here, W is the d-dimensional coefficient vector learned by the perceptron.
• The goal of training is to ensure that the prediction ŷ_i is as close as possible to the observed value y_i.
• The gradient-descent steps of the perceptron are focused on reducing the number of misclassifications, and therefore the updates are proportional to the difference (y_i − ŷ_i) between the observed and predicted values.
• A gradient-descent update that is proportional to the difference between the observed and predicted values is naturally caused by a squared loss function such as (y_i − ŷ_i)².
Neural Architectures for Binary Classification Models
1. Revisiting Perceptron
• The perceptron uses linear activations to compute the continuous loss function, but it still uses sign activations to compute the discrete predictions for a given test instance.
• The 0/1 loss function is discrete because it takes on the value of either 0 or 1. Such a loss function is not differentiable because of its staircase-like shape.
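A minimal sketch of the perceptron update W ⇐ W + α(y_i − ŷ_i)X_i on a toy separable data set; the data and the treatment of sign(0) are illustrative assumptions:

```python
import numpy as np

def train_perceptron(X, y, alpha=1.0, epochs=20):
    """Perceptron updates: W <- W + alpha * (y_i - sign(W.X_i)) * X_i."""
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        for Xi, yi in zip(X, y):
            y_hat = np.sign(W @ Xi)
            if y_hat == 0:           # treat sign(0) as a prediction of +1
                y_hat = 1.0
            W = W + alpha * (yi - y_hat) * Xi
    return W

# Toy linearly separable data: the label is the sign of the first feature.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
W = train_perceptron(X, y)
```

Since (y_i − ŷ_i) ∈ {−2, 0, +2}, updates occur only on misclassified instances.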
Neural Architectures for Binary Classification Models
2. Least-Squares Regression
• In least-squares regression, the training data contains n different
training pairs
• (X_1, y_1) . . . (X_n, y_n), where each X_i is a d-dimensional representation of a data point, and each y_i is a real-valued target.
• The fact that the target is real-valued is important.
• Least-squares regression is the oldest of all learning problems
• The gradient-descent methods proposed by Tikhonov and Arsenin in
the 1970s are very closely related to the gradient-descent updates of
Rosenblatt for the perceptron algorithm.
Least-Squares Regression
• In least-squares regression, the target variable is related to the feature
variables using the following relationship:
ŷ_i = W · X_i

• The bias is missing in this relationship: one can assume that one of the features in the training data has a constant value of 1, and the coefficient of this dummy feature is the bias.
In neural networks, the bias is often represented using a bias neuron with
a constant output of 1.
Least-Squares Regression
• The error of the prediction, ŷ_i, is given by e_i = (y_i − ŷ_i).
• Here, W = (w_1 . . . w_d) is a d-dimensional coefficient vector that needs to be learned so as to minimize the total squared error on the training data, which is:

E = Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i − W · X_i)²

• The portion of the loss that is specific to the ith training instance is given by the following:

L_i = e_i² = (y_i − W · X_i)²

• This loss can be simulated with the use of an architecture similar to the perceptron, except that the squared loss is paired with the identity activation function.
Least-Squares Regression
As in the perceptron algorithm, the stochastic gradient-descent steps are determined by computing the gradient of e_i² with respect to W, when the training pair (X_i, y_i) is presented to the neural network. This gradient can be computed as follows:

∂(e_i²)/∂W = −2 e_i X_i

Absorbing the factor of 2 into the learning rate α, one can rewrite the update as follows:

W ⇐ W + α (y_i − ŷ_i) X_i

Note that the update above looks identical in form to the perceptron update; the difference is that the prediction ŷ_i is real-valued here.
Least-Squares Regression
• It is possible to modify the gradient-descent updates of least-squares regression to incorporate forgetting factors.
• Adding regularization is equivalent to penalizing the loss function of least-squares classification with an additional term proportional to λ · ||W||², where λ > 0 is the regularization parameter. With regularization, the update can be written as follows:

W ⇐ W(1 − α·λ) + α (y_i − ŷ_i) X_i

In least-squares regression, the prediction ŷ_i is a real value obtained without the application of the sign function.
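The regularized stochastic gradient-descent loop can be sketched as follows; the toy data and hyper-parameter choices are illustrative assumptions:

```python
import numpy as np

def least_squares_sgd(X, y, alpha=0.01, lam=0.0, epochs=200):
    """SGD for least-squares regression:
    W <- W*(1 - alpha*lam) + alpha*(y_i - W.X_i)*X_i."""
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        for Xi, yi in zip(X, y):
            error = yi - W @ Xi                          # real-valued error e_i
            W = W * (1 - alpha * lam) + alpha * error * Xi
    return W

# Noise-free data generated by y = 2x, so SGD should recover W close to [2].
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
W = least_squares_sgd(X, y)
```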
Least-Squares Regression
• Both the perceptron and least-squares regression have the same goal of
minimizing the prediction error.
• However, since the loss function in classification is inherently discrete, the
perceptron algorithm uses a smooth approximation of the desired goal.
• The gradient-descent update in least-squares regression is very similar to that in the perceptron, with the main difference being that real-valued errors are used in regression rather than discrete errors drawn from {−2, +2}.
• However, the least-squares classification method does not yield the same result as the perceptron algorithm, because the real-valued training errors (y_i − W · X_i) in least-squares classification are computed differently from the integer errors (y_i − sign(W · X_i)) in the perceptron.
• This direct application of least-squares regression to binary targets is referred
to as Widrow-Hoff learning.
Least-Squares Regression-Widrow-Hoff Learning
• The Widrow-Hoff learning rule was proposed in 1960.
• It is a direct application of least-squares regression to binary targets.
• The sign function is applied to the real-valued prediction of unseen test
instances to convert them to binary predictions.
• Error of training instances is computed directly using real-valued
predictions (unlike the perceptron).
• Therefore, it is also referred to as least-squares classification or the linear
least-squares method.
• Remarkably, a seemingly unrelated method proposed in 1936, known as
the Fisher discriminant, also reduces to Widrow-Hoff learning in the special
case of binary targets.
Least-Squares Regression-Widrow-Hoff Learning
The neural architecture for classification with the Widrow-Hoff
method
Least-Squares Regression-Widrow-Hoff Learning
• The Widrow-Hoff learning rule is also referred to as Adaline, which is short for adaptive linear
neuron. It is also referred to as the delta rule.
• The loss function of the Widrow-Hoff method can be rewritten slightly from least-squares regression because of its binary responses:

L_i = (y_i − ŷ_i)² = (1 − y_i·ŷ_i)²   [using y_i² = 1 for y_i ∈ {−1, +1}]

The gradient-descent updates of least-squares regression can be rewritten slightly for Widrow-Hoff learning because of the binary response variables:

W ⇐ W(1 − α·λ) + α (y_i − ŷ_i) X_i = W(1 − α·λ) + α y_i (1 − y_i·ŷ_i) X_i

The second form of the update is helpful in relating it to the perceptron and SVM updates, which are also expressed in terms of the product y_i·ŷ_i.
Least-Squares Regression-Closed Form Solutions
• The special case of least-squares regression and classification is solvable in closed form (without gradient descent) by using the pseudo-inverse of the n × d training data matrix D, whose rows are X_1 . . . X_n.
• Let the n-dimensional column vector of dependent variables be denoted by y = [y_1 . . . y_n]ᵀ. The pseudo-inverse of the matrix D is defined as follows:

D⁺ = (Dᵀ D)⁻¹ Dᵀ

Then, the row vector W is defined by the following relationship:

Wᵀ = D⁺ y

If regularization is incorporated, the coefficient vector W is given by the following:

Wᵀ = (Dᵀ D + λ·I)⁻¹ Dᵀ y

Here, λ > 0 is the regularization parameter.
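The closed-form solution can be verified directly with NumPy; this is a small sketch and the matrix values are arbitrary:

```python
import numpy as np

def closed_form_least_squares(D, y, lam=0.0):
    """Solve W^T = (D^T D + lam*I)^(-1) D^T y.
    With lam=0 this reduces to the pseudo-inverse solution."""
    d = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(d), D.T @ y)

# Consistent system: y = D @ [1, 2], so the unregularized solution is exact.
D = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
W = closed_form_least_squares(D, y)
```

Setting lam > 0 trades a small amount of training error for smaller coefficients.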
Neural Architectures for binary classification-
Logistic Regression
• Logistic regression is a probabilistic model that classifies the
instances in terms of probabilities.
• Because the classification is probabilistic, a natural approach for
optimizing the parameters is to ensure that the predicted probability
of the observed class for each training instance is as large as
possible.
Logistic Regression
• This goal is achieved by using the notion of maximum likelihood
estimation in order to learn the parameters of the model.
• The likelihood of the training data is defined as the product of the probabilities of the observed labels of each training instance.
• By using the negative logarithm of this value, one obtains a loss
function in minimization form.
• The output node uses the negative log-likelihood as a loss
function.
Logistic Regression
The output layer can be formulated with the sigmoid activation function:

ŷ_i = 1 / (1 + exp(−W · X_i))
Logistic Regression
Let the loss for the ith training instance be denoted by L_i, the negative logarithm of the probability of the observed class:

L_i = −log(|y_i/2 − 0.5 + ŷ_i|)

Then, the gradient of L_i with respect to the weights in W can be computed as follows:

∂L_i/∂W = −(y_i X_i) / (1 + exp(y_i·(W · X_i)))

Therefore, the gradient-descent updates of logistic regression are given by the following (including regularization):

W ⇐ W(1 − α·λ) + α y_i X_i / (1 + exp(y_i·(W · X_i)))
• Just as the perceptron and the Widrow-Hoff algorithms use the magnitudes of the mistakes to make updates, the logistic regression method uses the probabilities of the mistakes to make updates.
• This is a natural extension of the probabilistic nature of the loss function to the update.
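A minimal sketch of this update, in which the coefficient of each step is the probability of a mistake, 1/(1 + exp(y_i·(W·X_i))); the toy data and hyper-parameters are illustrative assumptions:

```python
import numpy as np

def logistic_sgd(X, y, alpha=0.1, lam=0.0, epochs=200):
    """Labels y_i in {-1, +1}; the update weight is the probability of a mistake."""
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        for Xi, yi in zip(X, y):
            p_mistake = 1.0 / (1.0 + np.exp(yi * (W @ Xi)))
            W = W * (1 - alpha * lam) + alpha * yi * p_mistake * Xi
    return W

# Toy separable data: the label is the sign of the first feature.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
W = logistic_sgd(X, y)
```

Note that even correctly classified points produce (small) updates, since their mistake probability never reaches exactly zero.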
Logistic regression-Alternative Choices of Activation and Loss
• It is possible to implement the same model by using different choices of activation and loss in
the output node as long as they combine to yield the same result.
• Instead of using sigmoid activation to create the output ŷ_i ∈ (0, 1), it is also possible to use identity activation to create the output ŷ_i ∈ (−∞, +∞), and then apply the following loss function:

L_i = log(1 + exp(−y_i · ŷ_i))

For the final prediction on a test instance, the sign function can be applied to ŷ_i, which is equivalent to predicting the class whose probability is greater than 0.5.
One desirable property of using the identity activation to define ŷ_i is that it is consistent with how the loss functions of other models like the perceptron and Widrow-Hoff learning are defined.
The loss function of the above equation contains the product of y_i and ŷ_i, as in the other models. This makes it possible to directly compare the loss functions of various models.
Neural Architectures for binary classification-
Support Vector Machines
• The loss function in support vector machines is closely related to that in logistic regression.
• Instead of using a smooth loss function, the hinge-loss is used
• Consider the training data set of n instances denoted by (X_1, y_1), (X_2, y_2), . . . (X_n, y_n).
• The neural architecture of the support-vector machine is identical to that of least-squares classification (Widrow-Hoff).
• The main difference is in the choice of loss function. As in the case of least-squares classification, the prediction ŷ_i for the training point X_i is obtained by applying the identity activation function on W · X_i. Here, W = (w_1, . . . w_d) contains the vector of d weights for the d different inputs into the single-layer network.
• The output of the neural network is ŷ_i = W · X_i for computing the loss function.
• A test instance is predicted by applying the sign function to the output.
• The loss function L_i for the ith training instance in the support-vector machine is defined as follows:

L_i = max{0, 1 − y_i · ŷ_i}

This loss is referred to as the hinge-loss.
Support Vector Machines
• The overall idea behind this loss function is that a positive training instance is only penalized when its prediction falls below +1, and a negative training instance is only penalized when its prediction rises above −1.
• In both cases, the penalty is linear and abruptly flattens out at the aforementioned thresholds.
• It is helpful to compare this loss function with the Widrow-Hoff loss value of (1 − y_i·ŷ_i)², in which predictions are penalized simply for being different from the target values.
• This difference is an important advantage of the support vector machine over the Widrow-Hoff loss function.
Support Vector Machines
• To understand the difference in loss functions between the perceptron, Widrow-Hoff, logistic regression, and the support vector machine, the loss for a single positive training instance at different values of ŷ_i = W · X_i is shown in the Figure below.
The loss functions of different variants of the perceptron. Key observations: (i) the SVM loss is shifted from the perceptron (surrogate) loss by exactly one unit to the right; (ii) the logistic loss is a smooth variant of the SVM loss; (iii) the Widrow-Hoff/Fisher loss is the only case in which points are increasingly penalized for classifying points "too correctly" (i.e., increasing W · X beyond +1 for X in the positive class). Repairing the Widrow-Hoff loss function by setting it to 0 for W · X > 1 yields the quadratic-loss SVM.
Support Vector Machines
• In the case of the support-vector machine, the hinge-loss function flattens out beyond the threshold points.
• Only misclassified points, or points that are too close to the decision boundary W · X = 0, are penalized.
• The perceptron criterion is identical in shape to the hinge loss, except that it is shifted by one unit to the left.
• The Widrow-Hoff method is the only case in which a positive training point is penalized for having too large a positive value of W · X_i.
• In other words, the Widrow-Hoff method penalizes points for being properly classified in a very strong way.
• This is a potential problem with the Widrow-Hoff objective function, in which well-separated points cause problems in training.
Support Vector Machines
ŷ_i = W · X_i

• The stochastic gradient-descent method computes the partial derivative of the point-wise loss function L_i with respect to the elements in W. The gradient is computed as follows:

∂L_i/∂W = −y_i X_i   [if y_i·ŷ_i < 1, and 0 otherwise]

• The stochastic gradient method samples a point and checks whether y_i·ŷ_i < 1. If this is the case, an update is performed that is proportional to y_i X_i:

W ⇐ W(1 − α·λ) + α y_i X_i · I(y_i·ŷ_i < 1)

Here, I(·) ∈ {0, 1} is the indicator function that takes on the value of 1 when the condition in its argument is satisfied. This approach is the simplest version of the primal update for SVMs.
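The primal update can be sketched as follows; the toy data and hyper-parameter values are illustrative assumptions:

```python
import numpy as np

def svm_primal_sgd(X, y, alpha=0.01, lam=0.1, epochs=300):
    """W <- W*(1 - alpha*lam) + alpha*y_i*X_i whenever y_i*(W.X_i) < 1."""
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        for Xi, yi in zip(X, y):
            violates_margin = yi * (W @ Xi) < 1.0    # indicator I(y_i*yhat_i < 1)
            W = W * (1 - alpha * lam) + alpha * yi * Xi * float(violates_margin)
    return W

# Toy separable data: the label is the sign of the first feature.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
W = svm_primal_sgd(X, y)
```

Unlike the perceptron, correctly classified points inside the margin still trigger updates.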
Support Vector Machines
• This update is identical to that of a (regularized) perceptron, except that the condition for making the update in the perceptron is y_i·ŷ_i < 0.
• A perceptron makes the update only when a point is misclassified, whereas the support vector
machine also makes updates for points that are classified correctly, albeit not very confidently.
• This neat relationship is because the loss function of the perceptron criterion is shifted from the
hinge-loss in the SVM.
• To emphasize the similarities and differences in the loss functions used by the different methods, we tabulate the loss functions below (with ŷ_i = W · X_i):
• Perceptron (smoothed surrogate): L_i = max{0, −y_i·ŷ_i}
• Widrow-Hoff / Fisher: L_i = (y_i − ŷ_i)² = (1 − y_i·ŷ_i)²
• Logistic regression: L_i = log(1 + exp(−y_i·ŷ_i))
• Support vector machine (hinge loss): L_i = max{0, 1 − y_i·ŷ_i}
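The losses for a single positive training instance (y_i = +1) as functions of the prediction v = W·X_i can be written down directly; the small sketch below reproduces the observation that only the Widrow-Hoff loss penalizes "too correct" predictions:

```python
import numpy as np

# Losses for a single positive instance (y = +1) at prediction v = W . X
def perceptron_loss(v):
    return max(0.0, -v)                     # smoothed perceptron surrogate

def hinge_loss(v):
    return max(0.0, 1.0 - v)                # support vector machine

def logistic_loss(v):
    return float(np.log(1.0 + np.exp(-v)))  # smooth variant of the hinge

def widrow_hoff_loss(v):
    return (1.0 - v) ** 2                   # also penalizes v > 1
```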
Neural Architectures for Multiclass Models
Binary Classes versus Multiple Classes
• All the models discussed so far discuss only the binary class setting in
which the class label is drawn from {−1,+1}.
• Many natural applications contain multiple classes without a natural
ordering among them:
– Predicting the category of an image (e.g., truck, carrot).
– Language models: Predict the next word in a sentence.
• Models like logistic regression are naturally designed to predict two
classes.
Neural Architectures for Multiclass Models- Multiclass Perceptron
• Consider a setting with k different classes. Each training instance (X_i, c(i)) contains a d-dimensional feature vector X_i and the index c(i) ∈ {1 . . . k} of its observed class.
• In such a case, we would like to find k different linear separators W_1 . . . W_k simultaneously, so that the value of W_{c(i)} · X_i is larger than W_r · X_i for each r ≠ c(i). This is because one always predicts a data instance X_i to the class r with the largest value of W_r · X_i.
• The loss function for the ith training instance in the case of the multiclass perceptron is defined as follows:

L_i = max_{r ≠ c(i)} max(W_r · X_i − W_{c(i)} · X_i, 0)
Neural Architectures for Multiclass Models- Multiclass Perceptron
class 2 is assumed to be the ground-truth class
Neural Architectures for Multiclass Models- Multiclass
Perceptron
• As in all neural network models, one can use gradient-descent in
order to determine the updates.
• For a correctly classified instance, the gradient is always 0, and there are no updates.
• For a misclassified instance, the gradients are as follows:

∂L_i/∂W_r = −X_i   [if r = c(i)]
∂L_i/∂W_r = X_i    [if r = argmax_{j ≠ c(i)} W_j · X_i, the most misclassified prediction]
∂L_i/∂W_r = 0      [otherwise]
Neural Architectures for Multiclass Models- Multiclass
Perceptron
• The stochastic gradient-descent method is applied as follows.
• Each training instance is fed into the network. If the correct class r = c(i) receives the largest output W_r · X_i, then no update needs to be executed. Otherwise, the following update is made to each separator W_r for learning rate α > 0:

W_{c(i)} ⇐ W_{c(i)} + α X_i
W_{r*} ⇐ W_{r*} − α X_i   [where r* is the incorrect class with the largest output]
W_r ⇐ W_r                  [for all other separators]
Neural Architectures for Multiclass Models- Multiclass Perceptron
• At most two of the separators are updated at a given time.
• In the special case that k = 2, these gradient updates reduce to the perceptron, because both the separators W_1 and W_2 will be related as W_1 = −W_2 if the descent is started at W_1 = W_2 = 0.
• Another quirk that is specific to the unregularized perceptron is that it is possible to use a learning rate of α = 1 without affecting the learning, because the value of α only has the effect of scaling the weights when starting with W_r = 0.
• This property is, however, not true for other linear models in which the
value of α does affect the learning.
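The multiclass perceptron update can be sketched as follows; the toy one-hot data set and class indexing from 0 are illustrative assumptions:

```python
import numpy as np

def multiclass_perceptron(X, c, k, alpha=1.0, epochs=20):
    """On a mistake, raise the true class separator and lower the winning one."""
    W = np.zeros((k, X.shape[1]))
    for _ in range(epochs):
        for Xi, ci in zip(X, c):
            r_star = int(np.argmax(W @ Xi))   # predicted class
            if r_star != ci:                  # misclassified: update two separators
                W[ci] += alpha * Xi
                W[r_star] -= alpha * Xi
    return W

# Toy 3-class data: each class sits on its own coordinate axis.
X = np.eye(3)
c = [0, 1, 2]
W = multiclass_perceptron(X, c, k=3)
```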
Neural Architectures for Multiclass Models- Weston-Watkins SVM
• The Weston-Watkins SVM differs from the multiclass perceptron in two ways:
1. The multiclass perceptron only updates the linear separator of a class that is
predicted most incorrectly, along with the linear separator of the true class.
The Weston-Watkins SVM updates the separator of any class that is predicted
more favorably than the true class.
2. Not only does the Weston-Watkins SVM update the separator in the case of
misclassification,
• It updates the separators in cases where an incorrect class gets a prediction
that is “uncomfortably close” to the true class. This is based on the notion of
margin.
Neural Architectures for Multiclass Models- Weston-Watkins
SVM
• As in the case of the multiclass perceptron, it is assumed that the ith training instance is denoted by (X_i, c(i)), where X_i contains the d-dimensional feature variables, and c(i) contains the class index drawn from {1, . . . , k}.
• One wants to learn the d-dimensional coefficients W_1 . . . W_k of the k linear separators, so that the class index r with the largest value of W_r · X_i is predicted to be the correct class c(i).
• The loss function L_i for the ith training instance (X_i, c(i)) in the Weston-Watkins SVM is as follows:

L_i = Σ_{r ≠ c(i)} max(W_r · X_i − W_{c(i)} · X_i + 1, 0)
Neural Architectures for Multiclass Models- Weston-
Watkins SVM
Comparison of the objective function of the Weston-Watkins SVM and
multiclass perceptron:
• First, for each class r ≠ c(i), if the prediction W_r · X_i lags behind that of the true class by less than a margin amount of 1, then a loss is incurred for that class.
• The losses over all such classes r ≠ c(i) are added, rather than taking the maximum of the losses.
• In order to determine the gradient-descent updates, one can find the gradient of the loss function with respect to each W_r.
• In the event that the loss function L_i is 0, the gradient of the loss function is 0 as well.
• Therefore, no update is required when the training instance is classified correctly with a sufficient margin with respect to the second-best class.
Neural Architectures for Multiclass Models- Weston-
Watkins SVM
• If the loss function is non-zero, we have either a misclassified or a "barely correct" prediction, in which the best and second-best class predictions are not sufficiently separated.
• In such cases, the gradient of the loss is non-zero. The loss function is created by adding up the contributions of the (k − 1) separators belonging to the incorrect classes.
• Let δ(r, X_i) be a 0/1 indicator function, which is 1 when the rth class separator contributes positively to the loss function.
• In such a case, the gradient of the loss function is as follows:

∂L_i/∂W_r = −X_i · Σ_{j ≠ c(i)} δ(j, X_i)   [if r = c(i)]
∂L_i/∂W_r = X_i · δ(r, X_i)                  [if r ≠ c(i)]
Neural Architectures for Multiclass Models- Weston-
Watkins SVM
• This results in the following stochastic gradient-descent step for the rth separator W_r at learning rate α:

W_r ⇐ W_r(1 − α·λ) + α X_i · Σ_{j ≠ c(i)} δ(j, X_i)   [if r = c(i)]
W_r ⇐ W_r(1 − α·λ) − α X_i · δ(r, X_i)                 [if r ≠ c(i)]

• For training instances X_i in which the loss L_i is zero, the above update can be shown to simplify to a regularization update of each hyperplane W_r:

W_r ⇐ W_r(1 − α·λ)

The regularization uses the parameter λ > 0. Regularization is considered essential to the proper functioning of a support vector machine.
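A minimal sketch of this update, in which every margin-violating class is pushed down and the true class is pushed up by the number of violations; the toy data, hyper-parameters, and 0-based class indices are illustrative assumptions:

```python
import numpy as np

def weston_watkins_sgd(X, c, k, alpha=0.1, lam=0.01, epochs=100):
    """Update every separator whose score comes within margin 1 of the true class."""
    W = np.zeros((k, X.shape[1]))
    for _ in range(epochs):
        for Xi, ci in zip(X, c):
            scores = W @ Xi
            delta = scores > scores[ci] - 1.0    # classes violating the margin
            delta[ci] = False                    # the true class is excluded
            W *= (1 - alpha * lam)               # regularization shrink on every step
            W[ci] += alpha * delta.sum() * Xi    # push the true class up
            W[delta] -= alpha * Xi               # push offending classes down
    return W

# Toy 3-class data: each class sits on its own coordinate axis.
X = np.eye(3)
c = [0, 1, 2]
W = weston_watkins_sgd(X, c, k=3)
```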
Neural Architectures for Multiclass Models-
Multinomial Logistic Regression (Softmax Classifier)
• Multinomial logistic regression can be considered the multi-way generalization of logistic
regression, just as the Weston-Watkins SVM is the multiway generalization of the binary
SVM.
• Multinomial logistic regression uses negative log-likelihood loss, and is therefore a probabilistic
model.
• In the multiclass perceptron, it is assumed that the input to the model is a training data set containing pairs of the form (X_i, c(i)), where c(i) ∈ {1 . . . k} is the index of the class of the d-dimensional data point X_i.
• In the previous two models, the class r with the largest value of W_r · X_i is predicted to be the label of the data point X_i.
• In this case, there is an additional probabilistic interpretation of W_r · X_i in terms of the posterior probability P(r|X_i) that the data point X_i takes on the label r. This estimation can be naturally accomplished with the softmax activation function:

P(r|X_i) = exp(W_r · X_i) / Σ_{j=1}^{k} exp(W_j · X_i)
Neural Architectures for Multiclass Models- Multinomial
Logistic Regression (Softmax Classifier)
• The model predicts the class membership in terms of probabilities. The loss function L_i for the ith training instance is defined by the cross-entropy, which is the negative logarithm of the probability of the true class:

L_i = −log(P(c(i)|X_i))

• The neural architecture of the softmax classifier is illustrated in the Figure below.
Neural Architectures for Multiclass Models- Multinomial
Logistic Regression (Softmax Classifier)
• The cross-entropy loss may be expressed in terms of either the input features or the softmax pre-activation values v_r = W_r · X_i as follows:

L_i = −v_{c(i)} + log[ Σ_{j=1}^{k} exp(v_j) ]

• Therefore, the partial derivative of L_i with respect to v_r can be computed as follows:

∂L_i/∂v_r = P(r|X_i) − 1   [if r = c(i)]
∂L_i/∂v_r = P(r|X_i)        [if r ≠ c(i)]
Neural Architectures for Multiclass Models- Multinomial
Logistic Regression (Softmax Classifier)
• The gradient of the loss of the ith training instance with respect to the separator of the rth class is computed by using the chain rule of differential calculus in terms of its pre-activation value v_r = W_r · X_i:

∂L_i/∂W_r = (∂L_i/∂v_r) · (∂v_r/∂W_r) = (∂L_i/∂v_r) · X_i

• In the above simplification, we used the fact that v_q has a zero gradient with respect to W_r for q ≠ r.
• The value of ∂L_i/∂v_r can be substituted to obtain the following result:

∂L_i/∂W_r = −X_i · [1 − P(r|X_i)]   [if r = c(i)]
∂L_i/∂W_r = X_i · P(r|X_i)           [if r ≠ c(i)]
Neural Architectures for Multiclass Models- Multinomial
Logistic Regression (Softmax Classifier)
• Each of the terms [1 − P(r|X_i)] and P(r|X_i) is the probability of making a mistake, for an instance with label c(i), with respect to the predictions for the rth class.
• After including a regularization impact similar to that of the other models, the separator for the rth class is updated as follows:

W_r ⇐ W_r(1 − α·λ) + α X_i · [1 − P(r|X_i)]   [if r = c(i)]
W_r ⇐ W_r(1 − α·λ) − α X_i · P(r|X_i)          [if r ≠ c(i)]
Here, α is the learning rate, and λ is the regularization parameter. The softmax classifier updates all the k
separators for each training instance, unlike the multiclass perceptron and the Weston-Watkins SVM,
each of which updates only a small subset of separators (or no separator) for each training instance. This
is a consequence of probabilistic modeling, in which correctness is defined in a soft way
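The softmax classifier's update of all k separators per instance can be sketched as follows; the toy data, hyper-parameters, and 0-based class indices are illustrative assumptions:

```python
import numpy as np

def softmax_sgd(X, c, k, alpha=0.1, lam=0.01, epochs=100):
    """W[r] <- W[r]*(1 - alpha*lam) + alpha*X_i*([r == c(i)] - P(r|X_i))."""
    W = np.zeros((k, X.shape[1]))
    for _ in range(epochs):
        for Xi, ci in zip(X, c):
            v = W @ Xi
            p = np.exp(v - v.max())              # subtract max for numerical stability
            p /= p.sum()                          # softmax probabilities P(r|X_i)
            target = np.zeros(k)
            target[ci] = 1.0
            # Every row of W is updated, unlike the perceptron/Weston-Watkins SVM.
            W = W * (1 - alpha * lam) + alpha * np.outer(target - p, Xi)
    return W

# Toy 3-class data: each class sits on its own coordinate axis.
X = np.eye(3)
c = [0, 1, 2]
W = softmax_sgd(X, c, k=3)
```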
Neural Architectures for Multiclass Models- Hierarchical
Softmax for Many Classes
• Consider a classification problem in which we have an extremely large number of
classes.
• In such a case, learning becomes too slow, because of the large number of
separators that need to be updated for each training instance.
• This situation can occur in applications like text mining, where the prediction is a
target word. Predicting target words is particularly common in neural language
models, which try to predict the next word given the immediate history of previous
words.
• The cardinality of the number of classes will typically be larger than 10^5 in such cases.
• Hierarchical softmax is a way of improving learning efficiency by decomposing the classification problem hierarchically.
• The idea is to group the classes hierarchically into a binary tree-like structure, and then perform log2(k) binary classifications from the root to the leaf for k-way classification.
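The efficiency gain is easy to quantify: flat softmax touches all k separators per instance, while the hierarchical variant makes only ⌈log2(k)⌉ binary decisions along a root-to-leaf path. A small sketch, where the complete-binary-tree encoding of class indices is an illustrative assumption:

```python
import math

def tree_depth(k):
    """Number of binary decisions from root to leaf for k-way classification."""
    return math.ceil(math.log2(k))

def leaf_path(class_index, k):
    """Left/right (0/1) decisions to reach a class leaf in a complete binary tree."""
    depth = tree_depth(k)
    return [int(b) for b in format(class_index, "b").zfill(depth)]
```

For k = 100,000 classes this means roughly 17 binary classifications per instance instead of 100,000 separator updates.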
Autoencoders
Unsupervised Learning
• The models we have discussed so far use training pairs of the form (X, y), in which the feature variables X and target y are clearly separated.
– The target variable y provides the supervision for the learning process.
• What happens when we do not have a target variable?
– We want to capture a model of the training data without the
guidance of the target.
– This is an unsupervised learning problem.
Autoencoders
• Autoencoders represent a fundamental architecture that is used for
various types of unsupervised learning, including matrix factorization,
principal component analysis, and dimensionality reduction.
• Natural architectural variations of the autoencoder can also be used for
matrix factorization of incomplete data to create recommender systems.
• Some recent feature engineering methods in the natural language
domain like word2vec can also be viewed as variations of autoencoders,
which perform nonlinear matrix factorizations of word-context matrices.
Autoencoder: Basic Principles
• The basic idea of an autoencoder is to have an output layer with the same
dimensionality as the inputs.
• The idea is to try to reconstruct each dimension exactly by passing it through the
network.
• An autoencoder replicates the data from the input to the output, and is therefore
sometimes referred to as a replicator neural network.
Defining the Input and Output of an Autoencoder:
• It is common (but not necessary) for an M-layer autoencoder to have a symmetric
architecture between the input and output,
§ the number of units in the kth layer is the same as that in the (M − k + 1)th layer.
• The value of M is often odd, as a result of which the (M + 1)/2th layer is often the most
constricted layer.
§ The (non-computational) input layer is counted as the first layer
§ The minimum number of layers in an autoencoder would be three,
corresponding to the input layer, the constricted layer, and the output layer.
The basic schematic of the autoencoder
• The reduced representation of the data is also sometimes referred to as the code, and
• the number of units in this layer is the dimensionality of the reduction.
• The initial part of the neural architecture before the bottleneck is referred to as the
encoder (because it creates a reduced code), and
• the final part of the architecture is referred to as the decoder (because it reconstructs
from the code).
Auto Encoders-examples
• The essential principle of an autoencoder is to compress and effectively represent input data without specific labels.
• The encoder transforms the input data into a reduced-
dimensional representation, which is often referred to as
“latent space” or “encoding”.
• From that representation, a decoder rebuilds the initial input.
Autoencoder: Basic Principles--Architecture of Auto
Encoders
An autoencoder consist of three layers:
• Encoder
• Input layer take raw input data
• The hidden layers progressively reduce the dimensionality of the
input
• capture important features and patterns
• The compressed representation is a distorted (lossy) version of the original input
• Code/Bottleneck/latent space
• The Code layer is the final hidden layer
• dimensionality is significantly reduced
• This layer represents the compressed encoding of the input data.
• Decoder
• takes the encoded representation and expands it back to the
dimensionality of the original input.
• Increase the dimensionality and aim to reconstruct the original input.
• The output layer produces the reconstructed output, which ideally
should be as close as possible to the input data.
• The decoded image is a lossy reconstruction of the original image
Auto Encoders : Basic Principles
Auto Encoders
• The loss function used during training is
§ typically a reconstruction loss,
§ measures the difference between the input
and the reconstructed output
§ EX: mean squared error (MSE) for continuous
data or binary cross-entropy for binary
data.
• During training, the auto encoder
§ learns to minimize the reconstruction loss,
§ forcing the network to capture the most
important features of the input data in the
bottleneck layer.
Undercomplete and Overcomplete Autoencoders
• There are several ways to design autoencoders so that they copy the input only approximately.
• The system then learns useful properties of the data.
Undercomplete:
• The number of units in each middle layer is typically fewer than that in the input (or output).
• The final layer can no longer reconstruct the data exactly.
• This type of reconstruction is inherently lossy.
Overcomplete:
• The number of units in the hidden layers is equal to or larger than that in the input/output layers.
• There are infinitely many hidden representations with zero error.
• The middle layers often do not learn the identity function.
• Such autoencoders must be regularized.
Autoencoder with a Single Hidden Layer
Autoencoder for matrix factorization:
• The simplest version of an autoencoder, which is used for matrix factorization.
• This autoencoder only has a single hidden layer of k ≪ d units between the input and
output layers of d units each.
• Assume that we have an n × d matrix denoted by D, which we would like to factorize into an n × k matrix U and a d × k matrix V:

D ≈ U Vᵀ

• Here, k is the rank of the factorization. The matrix U contains the reduced representation of the data, and the matrix V contains the basis vectors.
• Matrix factorization is one of the most widely studied problems in unsupervised learning,
§ It is used for dimensionality reduction, clustering, and predictive modeling in recommender systems.
Autoencoder with a Single Hidden Layer
• In traditional machine learning, this problem is solved by minimizing the Frobenius norm of the residual matrix denoted by (D − U Vᵀ).
• The squared Frobenius norm of a matrix is the sum of the squares of the entries in the matrix.
• The objective function of the optimization problem is as follows:

Minimize J = ||D − U Vᵀ||²_F

• Here, the notation || · ||_F indicates the Frobenius norm.
• The parameter matrices U and V need to be learned in order to optimize the aforementioned error.
• The particular solution in which the columns of V are orthonormal is referred to as truncated singular value decomposition.
• The vector of values in a particular layer of the network can be obtained by multiplying the vector of values in the previous layer by the matrix of weights connecting the two layers (with linear activation).
• The hidden layer contains k units. The rows of D are input into the autoencoder, and the k-dimensional rows of U are the activations of the hidden layer.
• Since the activations of the hidden layer are the rows of U, and the k × d matrix of decoder weights is Vᵀ, it follows that the reconstructed output contains the rows of U Vᵀ.
• The autoencoder minimizes the sum-of-squared differences between the input and the output, which is equivalent to minimizing ||D − U Vᵀ||²_F.
• Therefore, the same problem is being solved as in singular value decomposition.

[Figure: A basic autoencoder with a single hidden layer of k units (neural architecture for SVD).]
Autoencoder with a Single Hidden Layer
• One can use this approach to provide the reduced representation of
out-of sample instances that were not included in the original matrix
D.
• One simply has to feed these out-of-sample rows as the input, and
the activations of the hidden layer will provide the reduced
representation.
• Producing reduced representations of out-of-sample instances is
particularly useful for nonlinear dimensionality-reduction methods, as it
is more difficult for traditional machine learning methods to fold in
new instances.
Encoder Weights
• The encoder weights are contained in the k × d matrix denoted by W.
• The autoencoder creates the reconstructed representation D W^T V^T of the
original data matrix.
• It tries to minimize ||D W^T V^T − D||_F^2.
• The optimal solution to this problem is obtained when the matrix W contains the
pseudo-inverse of V, which is defined as follows:

W = (V^T V)^(-1) V^T

• By the definition of the pseudo-inverse, it follows that W V = I and V^T W^T = I,
where I is a k×k identity matrix. Post-multiplying D ≈ U V^T with W^T, we
obtain the following:

U ≈ D W^T
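A small NumPy sketch of the pseudo-inverse relationship (matrix sizes are illustrative; V is taken from the SVD solution, so its columns happen to be orthonormal, but the formula only requires full column rank):

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.standard_normal((40, 8))
k = 3

# Take U and V from the truncated SVD solution.
Q, s, Pt = np.linalg.svd(D, full_matrices=False)
V = Pt[:k, :].T                      # d x k basis vectors
U = Q[:, :k] * s[:k]                 # n x k reduced representation

# Encoder weights: the pseudo-inverse of V.
W = np.linalg.inv(V.T @ V) @ V.T     # k x d

print(np.allclose(W @ V, np.eye(k)))   # True: WV = I
print(np.allclose(D @ W.T, U))         # True: hidden activations recover U
```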
Connections with Singular Value Decomposition
• The single-layer autoencoder architecture is closely connected with
singular value decomposition (SVD).
• Singular value decomposition finds a factorization U V^T in which the
columns of V are orthonormal.
• The loss function of this neural network is identical to that of singular
value decomposition
• A solution V in which the columns of V are orthonormal will always be
one of the possible optima obtained by training the neural network
Sharing Weights in Encoder and Decoder
• A common practice in autoencoder construction is to
share some of the weights between the encoder and the decoder.
This is also referred to as tying the weights.
• The autoencoder has an inherently symmetric structure, in which the
weights of the encoder and decoder are forced to be the same in
symmetrically matching layers.
• In the shallow case, the encoder and decoder weights are shared by
using the following relationship:

W = V^T
Sharing Weights in Encoder and Decoder
• The tying of the weights
effectively means that V^T acts as the
pseudo-inverse of V.
• We then have V^T V = I, and therefore
the columns of V are mutually
orthonormal.
• As a result, by tying the weights,
it is now possible to exactly
simulate SVD.

Fig. Basic autoencoder with a single layer; note the tied weights.
Sharing Weights in Encoder and Decoder
• Advantages:
• Reduces the number of parameters by a factor of 2.
• This is beneficial from the point of view of reducing
overfitting.
• Another benefit of tying the weight matrices in the encoder and
the decoder is that it automatically normalizes the columns of V
to similar values.
• This is also useful from the perspective of providing better
normalization of the embedded representation.
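A minimal sketch of training a tied-weight linear autoencoder, with the gradient of ||D − D V V^T||_F^2 derived by hand (data sizes, learning rate, and iteration count are illustrative; a real implementation would use an autodiff framework):

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.standard_normal((30, 8))
k = 3

# Tied weights: the encoder is V^T and the decoder is V, so the only
# parameter is the d x k matrix V. Reconstruction is D V V^T.
V = 0.1 * rng.standard_normal((8, k))

def loss(V):
    E = D - D @ V @ V.T
    return np.sum(E ** 2)

lr = 1e-4
history = [loss(V)]
for _ in range(300):
    E = D - D @ V @ V.T
    # Hand-derived gradient of the tied squared-error loss.
    grad = -2.0 * (D.T @ E @ V + E.T @ D @ V)
    V -= lr * grad
    history.append(loss(V))

print(history[-1] < history[0])  # True: reconstruction error decreases
```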
Auto encoders--Nonlinear Activations
• The real power of autoencoders is realized when nonlinear activations are used.
• One can also use a sigmoid function in the final layer to predict a binary
output.
• This sigmoid layer is combined with the negative log loss.
• Therefore, for a binary matrix B = [b_ij], the model assumes the
following:

B ~ sigmoid(U V^T)

• Here, the sigmoid is applied element-wise, and each entry b_ij is generated
from a Bernoulli distribution whose parameter is the (i, j)th entry of
sigmoid(U V^T). This is a form of logistic matrix factorization.
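The resulting negative log loss (binary cross entropy) can be computed as in the following sketch (B, U, and V are illustrative random values):

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.integers(0, 2, size=(20, 6)).astype(float)  # binary data matrix
U = 0.1 * rng.standard_normal((20, 4))
V = 0.1 * rng.standard_normal((6, 4))

# Element-wise sigmoid of UV^T gives the Bernoulli parameters.
P = 1.0 / (1.0 + np.exp(-(U @ V.T)))

# Negative log loss summed over all entries of the binary matrix.
loss = -np.sum(B * np.log(P) + (1.0 - B) * np.log(1.0 - P))
print(loss > 0)  # True
```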
Auto encoders--Nonlinear Activations
• If the hidden layer uses a nonlinear activation Φ but the output layer is
linear, the overall factorization is still of the following form:

D ≈ Φ(D W^T) V^T

• We can write U′ = D W^T, which is a linear projection of the original
matrix D. Then, the factorization can be written as follows:

D ≈ Φ(U′) V^T

Here, U′ is a linear projection of D.
• Multiple hidden layers are used to learn more complex forms of nonlinear
dimensionality reduction. Nonlinearity can also be combined in the hidden layers
and in the output layer.
• Nonlinear dimensionality reduction methods can map the data into much lower
dimensional spaces
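A forward-pass sketch of this factorization, using a sigmoid hidden layer and a linear output (all matrices are illustrative, untrained values):

```python
import numpy as np

rng = np.random.default_rng(4)
D = rng.standard_normal((20, 6))
k = 3
W = 0.1 * rng.standard_normal((k, 6))   # encoder weights (k x d)
V = 0.1 * rng.standard_normal((6, k))   # decoder weights (d x k)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

U_prime = D @ W.T          # linear projection of D
U = sigmoid(U_prime)       # nonlinear hidden activations
D_hat = U @ V.T            # linear output layer: D ≈ sigmoid(D W^T) V^T

print(D_hat.shape)  # (20, 6)
```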
• Nonlinear dimensionality-reduction methods often require deeper
networks due to the more complex transformations possible with
the combination of nonlinear units.
Deep Autoencoders
• The real power of autoencoders in the neural network domain is realized
when deeper variants are used. For example, consider an autoencoder with
three hidden layers.

Fig. An example of an autoencoder with three hidden layers. Combining
nonlinear activations with multiple hidden layers increases the
representation power of the network.
Deep Autoencoders
• One can increase the number of intermediate layers in order to
further increase the representation power of the neural network.
• It is essential for some of the layers of the deep autoencoder to use a
nonlinear activation function to increase its representation power
• Deep networks with multiple layers provide an extraordinary amount
of representation power.
• The multiple layers of this network provide hierarchically reduced
representations of the data.
• For some data domains like images, hierarchically reduced
representations are particularly natural.
Most autoencoders can be extended to have more than one hidden layer
Sparse Auto Encoder
• A sparse autoencoder contains more hidden units than the input, but only a few
are allowed to be active at once. This property is called the sparsity of the network.
• The sparsity can be controlled by
§ either manually zeroing the required hidden units,
§ tuning the activation functions or
§ by adding a loss term to the cost function.
• Advantages
§ filtering out noise and irrelevant features during the encoding process.
§ learn important and meaningful features due to their emphasis on sparse activations.
• Disadvantages
§ Different inputs should result in the activation of different nodes of the network.
§ sparsity constraint increases computational complexity.
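One common way to impose the sparsity constraint is to add an L1 penalty on the hidden activations to the reconstruction loss, as in this sketch (sizes, weights, and penalty strength are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(6)
W = 0.1 * rng.standard_normal((12, 6))   # overcomplete: 12 hidden > 6 input
V = 0.1 * rng.standard_normal((6, 12))

relu = lambda z: np.maximum(z, 0.0)

h = relu(W @ x)          # hidden activations
x_hat = V @ h            # reconstruction
lam = 0.1                # strength of the sparsity penalty

# Reconstruction error plus an L1 term that pushes activations toward zero.
loss = np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(h))
print(loss >= 0)  # True
```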
Denoising Auto encoder
• Injecting random noise into input data to improve robustness
Denoising Auto encoder
Fig. An example of denoising an image with a denoising autoencoder
• Denoising auto encoder
§ works on a partially corrupted input
§ trains to recover the original undistorted image.
§ An effective way to constrain the network from simply copying the input and
thus learn the underlying structure and important features of the data.
Autoencoder Applications
• Dimensionality reduction ⇒ Use activations of constricted hidden
layer
• Sparse feature learning ⇒ Use activations of constrained/regularized
hidden layer
• Outlier detection: Find data points with larger reconstruction error
– Related to denoising applications
• Generative models with probabilistic hidden layers (variational
autoencoders)
• Representation learning ⇒ Pretraining
Application to Outlier Detection
• Dimensionality reduction is closely related to outlier detection, because
outlier points are hard to encode and decode without losing substantial
information.
• It is a well-known fact that if a matrix D is factorized as D ≈ D′ = U V^T, then
the low-rank matrix D′ is a de-noised representative of the data.
• After all, the compressed representation U captures only the regularities in
the data, and is unable to capture the unusual variations in specific points.
• As a result, reconstruction to D′ misses all these unusual variations.
• The absolute values of the entries of (D −D′) represent the outlier scores of
the matrix entries.
Application to Outlier Detection
• Therefore, one can use this approach to find outlier entries, or add
the squared scores of the entries in each row of D to find the outlier
score of that row.
• Therefore, one can identify outlier data points. Furthermore, by
adding the squared scores in each column of D, one can find outlier
features.
• This is useful for applications like feature selection in clustering,
where a feature with a large outlier score can be removed because it
adds noise to the clustering.
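The row-wise scoring described above can be sketched in NumPy, using a truncated SVD reconstruction as D′ (the planted outlier row and data sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
# Rank-2 inlier data: rows lie in a 2-dimensional subspace.
D = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 10))
D[7] = 3.0 * rng.standard_normal(10)   # plant one off-subspace outlier row

k = 2
Q, s, Pt = np.linalg.svd(D, full_matrices=False)
D_prime = (Q[:, :k] * s[:k]) @ Pt[:k, :]   # rank-k reconstruction U V^T

# Row-wise outlier scores: sum of squared entries of (D - D_prime) per row.
scores = np.sum((D - D_prime) ** 2, axis=1)
print(int(np.argmax(scores)))  # index of the planted outlier row
```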