
Q learning:

Reinforcement learning requires a machine learning model to learn from the problem and come up with the optimal solution by itself. This means it can arrive at fast and sometimes unexpected solutions that the programmer might not even have thought of.

Consider the image below. You can see a dog in a room that has to perform an
action, which is fetching. The dog is the agent; the room is the environment it
has to work in, and the action to be performed is fetching.

Figure 1: Agent, Action, and Environment

If the correct action is performed, we will reward the agent. If it performs the
wrong action, we will not give it any reward or give it a negative reward, like a
scolding.

Figure 2: Agent performing an action


What Is Q-Learning?

Q-Learning is a reinforcement learning algorithm that finds the next best action, given a current state. During learning it may try actions at random to explore, but its aim is to learn action values that maximize the cumulative reward.

Figure 3: Components of Q-Learning

Q-learning is a model-free, off-policy reinforcement learning algorithm that finds the best course of action given the current state of the agent. Depending on where the agent is in the environment, it decides the next action to be taken.

Off-policy means that the algorithm learns the value of the optimal policy independently of the policy the agent actually follows while exploring: the update always assumes the best action is taken in the next state, even if the behaviour policy sometimes chooses differently (for example, to explore).

Model-free means that the agent does not build or use a model of the environment's transition dynamics or reward function. Instead, it learns directly from trial and error, using the rewards it receives from the environment.

An example of Q-learning is an Advertisement recommendation system. In a normal ad recommendation system, the ads you get are based on your previous
purchases or websites you may have visited. If you’ve bought a TV, you will
get recommended TVs of different brands.
Figure 4: Ad Recommendation System

Using Q-learning, we can optimize the ad recommendation system to recommend products that are frequently bought together. The reward will be given if the user clicks on the suggested product.

Figure 5: Ad Recommendation System with Q-Learning

Important Terms in Q-Learning

1. States: The State, S, represents the current position of an agent in an environment.
2. Action: The Action, A, is the step taken by the agent when it is
in a particular state.

3. Rewards: For every action, the agent will get a positive or negative reward.

4. Episodes: An episode ends when the agent reaches a terminating state and can take no further action.

5. Q-Values: Q(S, A) measures how good it is to take Action A in a particular state S.

6. Temporal Difference: The update rule used to estimate the Q-value, using the reward just received and the value of the next state and action together with the current estimate.

What Is The Bellman Equation?

The Bellman Equation is used to determine the value of a particular state, i.e., how good it is to be in that state (or to take a given action from it). The optimal state is the one with the highest value.

The equation is given below. It combines the reward associated with the current state and action with the maximum expected future reward, weighted by a discount rate that determines how important future rewards are relative to the current one. The learning rate determines how quickly the model updates its estimates.
Figure 6: Bellman Equation
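Since the figure is not reproduced here, the update rule that Q-learning derives from the Bellman Equation can be written in its standard form as:

Q(S, A) <- Q(S, A) + α [ R + γ · max over A' of Q(S', A') − Q(S, A) ]

where α is the learning rate, γ is the discount rate, R is the reward received, and S' is the state reached after taking action A in state S.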

How to Make a Q-Table?

While running our algorithm, we will come across various solutions and the
agent will take multiple paths. How do we find out the best among them? This
is done by tabulating our findings in a table called a Q-Table.

A Q-Table helps us to find the best action for each state in the environment. We
use the Bellman Equation at each state to get the expected future state and
reward and save it in a table to compare with other states.

Let us create a Q-table for an agent that has to learn to run, fetch, and sit on command. The steps taken to construct a Q-table are:

Step 1: Create an initial Q-Table with all values initialized to 0

When we initially start, the values of all states and rewards will be 0. Consider the Q-Table shown below, which shows a dog simulator learning to perform actions:

Figure 7: Initial Q-Table (for the images, please refer to simplelearn.com)

Step 2: Choose an action and perform it. Update values in the table
This is the starting point. We have performed no other action as of yet. Let us
say that we want the agent to sit initially, which it does. The table will change
to:

Figure 8: Q-Table after performing an action

Step 3: Get the value of the reward and calculate the Q-value using the Bellman Equation

For the action performed, we need to record the actual reward received and calculate the Q(S, A) value.

Figure 9: Updating Q-Table with Bellman Equation

Step 4: Continue the same until the table is filled or an episode ends

The agent continues taking actions, and for each action the reward and Q-value are calculated and the table is updated.
Figure 10: Final Q-Table at end of an episode
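To make the four steps concrete, here is a minimal tabular Q-learning sketch in Python for a hypothetical dog simulator. The states, actions, rewards and the step() function are illustrative assumptions, not part of the original tutorial.

 Python3

# Minimal tabular Q-learning sketch for a hypothetical "dog simulator".
import random

states = ["start", "ball_visible", "ball_fetched"]   # hypothetical states
actions = ["run", "fetch", "sit"]                    # commands from the text

# Step 1: initial Q-table with all values set to 0
Q = {s: {a: 0.0 for a in actions} for s in states}

alpha, gamma, epsilon = 0.1, 0.9, 0.2                # learning rate, discount, exploration rate

def step(state, action):
    """Toy environment: returns (next_state, reward, done). Purely illustrative."""
    if state == "start" and action == "run":
        return "ball_visible", 1.0, False
    if state == "ball_visible" and action == "fetch":
        return "ball_fetched", 10.0, True
    return state, -1.0, False                        # wrong action: negative reward

for episode in range(500):
    state, done = "start", False
    while not done:
        # Step 2: choose an action (epsilon-greedy) and perform it
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(Q[state], key=Q[state].get)
        next_state, reward, done = step(state, action)
        # Step 3: update the Q-value with the Bellman equation
        best_next = max(Q[next_state].values())
        Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
        state = next_state                           # Step 4: continue until the episode ends

print(Q)  # the filled Q-table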

DEEP Q NETWORK (DQN):

What is DQN in reinforcement learning?


 Deep Q-Network (DQN) is a deep reinforcement learning algorithm that
uses a neural network to approximate the Q-value function in a
reinforcement learning environment. The Q-value function is a measure
of the expected return for taking a particular action in a particular state,
and is used to guide the agent's actions in the environment.

 In a DQN, the neural network takes in the current state of the environment as input, and outputs a vector of Q-values for each possible
action. The agent then selects the action with the highest Q-value and
performs that action in the environment. The neural network is trained
using a variant of the Q-learning algorithm, where the target Q-values are
computed using a Bellman equation and the neural network weights are
updated using stochastic gradient descent.

 One of the main advantages of using a neural network to approximate the
Q-value function is that it can handle high-dimensional state spaces, such
as images or audio data. By learning a compact and informative
representation of the state space, the neural network can generalize across
different states and actions, and can learn to make optimal decisions in
complex environments.

 However, DQN training can be challenging due to issues such as instability and overestimation of Q-values. To address these issues,
several variants of DQN have been proposed, including Double DQN,
Dueling DQN, and Rainbow DQN. These variants introduce
modifications to the original DQN algorithm, such as target network
updates, prioritized experience replay, or distributional Q-learning, in
order to improve performance and stability.

 Overall, DQN is a powerful and widely used algorithm in the field of deep reinforcement learning, and has been applied to a variety of
applications, such as game playing, robotics, and autonomous driving.

 [06:18, 24/03/2023] Jeffy: Deep Q-Network (DQN) is a deep


reinforcement learning algorithm that uses a neural network to
approximate the Q-value function in a reinforcement learning
environment. The Q-value function is a measure of the expected return
for taking a particular action in a particular state, and is used to guide the
agent's actions in the environment.

 In a DQN, the neural network takes in the current state of the
environment as input, and outputs a vector of Q-values for each possible
action. The agent then selects the action with the highest Q-value and
performs that action in the environment. The neural network is trained
using a variant of the Q-learning algorithm, where the target Q-values are
computed using a Bellman equation and the neural network weights are
updated using stochastic gradient descent.

 One of the main advantages of using a neural network to approximate the
Q-value function is that it can handle high-dimensional state spaces, such
as images or audio data. By learning a compact and informative
representation of the state space, the neural network can generalize across
different states and actions, and can learn to make optimal decisions in
complex environments.

 However, DQN training can be challenging due to issues such as
instability and overestimation of Q-values. To address these issues,
several variants of DQN have been proposed, including Double DQN,
Dueling DQN, and Rainbow DQN. These variants introduce
modifications to the original DQN algorithm, such as target network
updates, prioritized experience replay, or distributional Q-learning, in
order to improve performance and stability.

 Overall, DQN is a powerful and widely used algorithm in the field of
deep reinforcement learning, and has been applied to a variety of
applications, such as game playing, robotics, and autonomous driving.
 [06:20, 24/03/2023] Jeffy: Experience Replay: One of the key features of
DQN is the use of experience replay, where the agent stores transitions
(state, action, reward, next state) in a replay buffer and samples a batch of
transitions randomly for training the neural network. Experience replay
helps to reduce the correlation between subsequent samples, making the
training more efficient and stable.

 Target Network: DQN also employs a separate target network that is used
to compute the target Q-values for training the neural network. The target
network is a copy of the main network, but its weights are updated less
frequently (e.g., every few thousand steps) to stabilize the training and
reduce overestimation of Q-values.

 Exploration vs Exploitation: In order to balance exploration and
exploitation, DQN often uses an epsilon-greedy policy, where the agent
selects the action with the highest Q-value with probability (1-epsilon),
and a random action with probability epsilon. The value of epsilon is
gradually reduced over time to encourage the agent to explore less as it
becomes more confident in its decisions.

 Reward Shaping: Reward shaping is a technique that can be used to
adjust the rewards received by the agent in order to provide additional
information and guidance during training. For example, a negative reward
could be given when the agent loses a life in a game, or a higher reward
could be given for reaching a certain goal state.

 Applications: DQN has been applied to a variety of tasks, including game
playing (e.g., Atari games), robotics, and autonomous driving. DQN has
also been extended to handle continuous action spaces using techniques
such as Deep Deterministic Policy Gradient (DDPG) and Twin Delayed
DDPG (TD3).

 Overall, DQN is a powerful and flexible algorithm for solving
reinforcement learning problems with high-dimensional state spaces.
While it can be challenging to train and requires careful hyperparameter
tuning, DQN has been shown to achieve impressive results in a variety of
domains, and continues to be an active area of research in the field of
deep reinforcement learning.
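As referenced above, here is a minimal Python (Keras) sketch that combines the pieces described in this list: a Q-network, a target network, an experience replay buffer, and epsilon-greedy action selection. The state size, action count and hyperparameters are illustrative assumptions, and the loop that fills the buffer from an environment is omitted.

 Python3

# Minimal DQN component sketch (Keras); sizes and hyperparameters are illustrative.
import random
from collections import deque

import numpy as np
import tensorflow as tf

STATE_DIM, NUM_ACTIONS = 4, 2
GAMMA, EPSILON = 0.99, 0.1

def build_q_network():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(STATE_DIM,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(NUM_ACTIONS, activation="linear"),  # one Q-value per action
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

q_net = build_q_network()
target_net = build_q_network()
target_net.set_weights(q_net.get_weights())          # target network starts as a copy

# Transitions are appended elsewhere as (state, action, reward, next_state, done) tuples
replay_buffer = deque(maxlen=10_000)

def select_action(state):
    """Epsilon-greedy action selection."""
    if random.random() < EPSILON:
        return random.randrange(NUM_ACTIONS)
    q_values = q_net.predict(state[None, :], verbose=0)[0]
    return int(np.argmax(q_values))

def train_step(batch_size=32):
    """One DQN update from a random minibatch of stored transitions."""
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
    targets = q_net.predict(states, verbose=0)
    next_q = target_net.predict(next_states, verbose=0).max(axis=1)
    targets[np.arange(batch_size), actions] = rewards + GAMMA * next_q * (1.0 - dones)
    q_net.train_on_batch(states, targets)             # gradient step toward Bellman targets

# Periodically (e.g. every few thousand steps): target_net.set_weights(q_net.get_weights())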

POLICY GRADIENT METHODS:

Policy gradient methods are a class of algorithms used in reinforcement learning that aim to learn a policy, or a mapping from states to actions, by optimizing a
performance objective. Deep reinforcement learning, a subfield of machine
learning that combines reinforcement learning with deep neural networks, often
employs policy gradient methods to solve complex problems such as playing
video games, robotic control, and natural language processing.

Policy gradient methods optimize the policy by iteratively adjusting its parameters to increase the expected cumulative reward. They operate by
estimating the gradient of the expected cumulative reward with respect to the
policy parameters and then updating the policy in the direction of the gradient.

One of the most popular policy gradient methods is the REINFORCE algorithm,
which uses the likelihood ratio trick to estimate the gradient of the expected
cumulative reward. Another popular method is the actor-critic algorithm, which
combines a policy network (the actor) with a value function network (the critic)
to estimate the gradient more efficiently.

More recently, researchers have proposed advanced policy gradient methods that
incorporate techniques such as trust region optimization, natural gradient, and
importance sampling to improve the stability and convergence of the algorithms.
Overall, policy gradient methods have proven to be effective in solving complex
reinforcement learning problems, and they continue to be an active area of
research in deep learning.

Here are some popular policy gradient methods in deep reinforcement learning:

REINFORCE: This is a simple but effective policy gradient algorithm that uses
Monte Carlo estimation to estimate the expected reward for each action taken in
a given state, and then uses these estimates to update the policy parameters in the
direction of the estimated gradient.
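A minimal sketch of the REINFORCE update for a tabular softmax policy in Python (NumPy). The recorded episode data is an illustrative placeholder; in practice it would come from rolling out the current policy in an environment.

 Python3

# REINFORCE update sketch for a tabular softmax policy (NumPy).
import numpy as np

num_states, num_actions = 3, 2
theta = np.zeros((num_states, num_actions))      # policy parameters (preferences)
alpha, gamma = 0.1, 0.99

def policy(state):
    """Softmax action probabilities for the tabular policy."""
    prefs = theta[state]
    exp = np.exp(prefs - prefs.max())
    return exp / exp.sum()

# One recorded episode: states visited, actions taken, rewards received (placeholders)
states  = [0, 1, 2]
actions = [1, 0, 1]
rewards = [0.0, 1.0, 5.0]

# Monte Carlo returns G_t (discounted sum of future rewards from each step)
returns, G = [], 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.insert(0, G)

# REINFORCE: theta <- theta + alpha * G_t * grad log pi(a_t | s_t)
for s, a, G in zip(states, actions, returns):
    probs = policy(s)
    grad_log_pi = -probs                         # d log pi(a|s) / d preferences
    grad_log_pi[a] += 1.0
    theta[s] += alpha * G * grad_log_pi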

Advantage Actor-Critic (A2C): This is an actor-critic algorithm that combines the benefits of both policy-based and value-based methods. The actor network
learns the policy, while the critic network learns an estimate of the value
function, which is used to estimate the advantage function. The advantage
function measures how much better or worse an action is than the average
action, and is used to improve the policy more efficiently.

Proximal Policy Optimization (PPO): This is a family of policy gradient algorithms that are designed to be more sample-efficient and stable than other
methods. PPO uses a surrogate objective function that constrains the policy
update to a "trust region" around the current policy, which prevents the update
from being too large and destabilizing the learning process.
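The clipped surrogate objective can be sketched in a few lines of Python (NumPy); the probability ratios and advantage estimates below are illustrative placeholders.

 Python3

# Sketch of PPO's clipped surrogate objective (NumPy).
import numpy as np

eps = 0.2                                   # clip range ("trust region" width)
ratio = np.array([0.8, 1.05, 1.5])          # pi_new(a|s) / pi_old(a|s) for a batch
advantage = np.array([1.0, -0.5, 2.0])      # advantage estimates for the same batch

unclipped = ratio * advantage
clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
surrogate = np.minimum(unclipped, clipped).mean()   # PPO maximizes this (or minimizes its negative)
print(surrogate)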

Trust Region Policy Optimization (TRPO): This is another algorithm that constrains the policy update to a trust region, but it uses a different optimization technique, the conjugate gradient method, to compute the policy update. TRPO is known for its stability and convergence properties, but can be computationally expensive because of the second-order (Hessian-related) computations involved.

Asynchronous Advantage Actor-Critic (A3C): This is a parallelized version of the A2C algorithm that can utilize multiple CPU cores to train multiple agents
simultaneously. This allows for faster training and better exploration of the state-
action space, but can be computationally expensive due to the need for multiple
copies of the environment and agent networks.

ACTOR CRITIC ALGORITHM

The Actor-Critic Reinforcement Learning algorithm

Actor-Critic architecture. Source:[1]


The actor-critic algorithm is a popular reinforcement learning algorithm that
combines elements of both policy-based and value-based methods. In an actor-
critic algorithm, there are two main components: an actor network and a critic
network.

The actor network learns the policy, which is a mapping from states to actions.
The policy can be either deterministic or stochastic. The actor network is
typically implemented as a neural network that takes the current state as input
and outputs the actions to be taken.

The critic network, on the other hand, learns an estimate of the value function,
which measures how good a state is in terms of expected future rewards. The
value function is used to evaluate the quality of the actions taken by the actor
network, and to provide feedback to the actor network to improve its
performance.
During training, the actor network takes actions in the environment and receives
feedback in the form of rewards and the next state. The critic network then
evaluates the quality of the actions taken by the actor network and provides
feedback to improve the actor's performance. The feedback signal can be in the
form of the advantage function, which measures how much better or worse an
action is compared to the average action.

The actor and critic networks are trained using a combination of policy gradient
and value-based methods. The policy gradient method is used to update the
actor network to maximize the expected future rewards, while the value-based
method is used to update the critic network to minimize the difference between
the predicted and actual values.

The actor-critic algorithm has been successfully applied to a wide range of reinforcement learning problems, including robotics, game playing, and natural language processing. It is particularly useful in situations where the state space is large and complex, and where the optimal policy is difficult to determine.

Advantage function: The advantage function is a measure of how much better or worse an action is compared to the average action. It is defined as the difference between the Q-value (the expected cumulative reward starting from a given state and taking a particular action) and the value function (the expected cumulative reward starting from a given state and following the policy).

TD-learning: The critic network in the actor-critic algorithm can be trained using a value-based method called TD-learning (temporal difference learning),
which involves estimating the value function based on the difference between
the predicted and actual values.
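A minimal sketch of a one-step (TD) actor-critic update in Python (NumPy), using the TD error as the advantage estimate. The single transition shown is an illustrative placeholder.

 Python3

# One-step actor-critic update sketch (tabular, NumPy).
import numpy as np

num_states, num_actions = 4, 2
V = np.zeros(num_states)                      # critic: state-value estimates
theta = np.zeros((num_states, num_actions))   # actor: softmax policy preferences
alpha_v, alpha_pi, gamma = 0.1, 0.05, 0.99

s, a, r, s_next = 0, 1, 2.0, 3                # a single observed transition (placeholder)

# Critic: TD error, which also serves as an estimate of the advantage of (s, a)
td_error = r + gamma * V[s_next] - V[s]
V[s] += alpha_v * td_error

# Actor: policy-gradient step scaled by the advantage estimate
probs = np.exp(theta[s] - theta[s].max())
probs /= probs.sum()
grad_log_pi = -probs
grad_log_pi[a] += 1.0
theta[s] += alpha_pi * td_error * grad_log_pi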

Exploration-exploitation trade-off: The actor-critic algorithm can balance the trade-off between exploration (taking new actions to learn more about the
environment) and exploitation (taking the best known action to maximize
reward) by using an exploration strategy such as epsilon-greedy or softmax.

Continuous action spaces: The actor-critic algorithm can handle continuous action spaces by using a deterministic policy (where the action is determined by
the output of the actor network) or a stochastic policy (where the action is
sampled from a probability distribution defined by the actor network).

Variants: There are many variants of the actor-critic algorithm, including asynchronous advantage actor-critic (A3C), deep deterministic policy gradient
(DDPG), and twin delayed deep deterministic policy gradient (TD3), which
introduce modifications to improve performance and stability.

Overall, the actor-critic algorithm is a powerful reinforcement learning algorithm that can learn complex policies in high-dimensional state and action
spaces. Its combination of policy-based and value-based methods enables it to
balance exploration and exploitation, and its flexibility allows it to handle a
wide range of reinforcement learning problems.

Model-free vs model-based: The actor-critic algorithm is a model-free reinforcement learning algorithm, which means that it does not require
knowledge of the transition dynamics or the reward function of the
environment. Instead, it learns the policy and value function directly from
experience through trial and error.

Batch vs online learning: The actor-critic algorithm can be used for both batch
learning (where the agent learns from a fixed dataset of experiences) and online
learning (where the agent learns from experience as it interacts with the
environment). Online learning is generally more efficient, as it allows the agent
to adapt to changes in the environment and learn from new experiences.

Convergence and stability: The actor-critic algorithm can suffer from issues of
convergence and stability, especially when dealing with high-dimensional state
and action spaces. To address these issues, various modifications have been
proposed, such as using trust region optimization, clipped surrogate objectives,
or target networks.

Applications: The actor-critic algorithm has been successfully applied to a wide range of reinforcement learning problems, including game playing, robotics,
autonomous driving, and natural language processing. It has also been used in
combination with other machine learning techniques, such as deep learning and
imitation learning, to improve its performance and efficiency.

Extensions: The actor-critic algorithm can be extended to handle more complex reinforcement learning scenarios, such as multi-agent reinforcement learning,
hierarchical reinforcement learning, or adversarial reinforcement learning.
These extensions introduce additional challenges, such as coordination,
hierarchy, or competition, that require new algorithmic approaches and
techniques.

Overall, the actor-critic algorithm is a versatile and powerful reinforcement learning algorithm that can learn complex policies in a wide range of
environments and scenarios. Its flexibility, scalability, and efficiency make it a
popular choice for many applications in artificial intelligence and robotics.
AUTOENCODING:
Autoencoders are very useful in the field of unsupervised machine
learning. You can use them to compress the data and reduce its dimensionality.

The main difference between Autoencoders and Principal Component Analysis (PCA) is that while PCA finds the directions along which you can project the data with maximum variance, Autoencoders reconstruct our original input given just a compressed version of it.

Anyone who needs the original data can reconstruct it (approximately) from the compressed data using an autoencoder.
The encoder part of the network is used for encoding and sometimes even for
data compression purposes although it is not very effective as compared to other
general compression techniques like JPEG. Encoding is achieved by the
encoder part of the network which has a decreasing number of hidden units in
each layer. Thus this part is forced to pick up only the most significant and
representative features of the data. The second half of the network performs
the Decoding function. This part has an increasing number of hidden units in
each layer and thus tries to reconstruct the original input from the encoded data.
Thus Auto-encoders are an unsupervised learning technique.
Example: See the code below; in an autoencoder the training data is fitted to itself. That is why, instead of fitting X_train to Y_train, we have used X_train in both places.

 Python3

autoencoder.fit(X_train, X_train, epochs=200)


Training of an Auto-encoder for data compression: For a data compression procedure, the most important aspect of the compression is the reliability of the reconstruction of the compressed data. This requirement dictates the structure of the Auto-encoder as a bottleneck.

Step 1: Encoding the input data. The Auto-encoder first tries to encode the data using the initialized weights and biases.

Step 2: Decoding the input data. The Auto-encoder tries to reconstruct the original input from the encoded data to test the reliability of the encoding.
Step 3: Backpropagating the error. After the reconstruction, the loss function is computed to determine the reliability of the encoding. The error generated is backpropagated.
The above-described training process is reiterated several times until an
acceptable level of reconstruction is reached.
After the training process, only the encoder part of the Auto-encoder is retained
to encode a similar type of data used in the training process. The different ways
to constrain the network are:-
 Keep small Hidden Layers: If the size of each hidden layer is
kept as small as possible, then the network will be forced to pick up
only the representative features of the data thus encoding the data.
 Regularization: In this method, a loss term is added to the cost
function which encourages the network to train in ways other than
copying the input.
 Denoising: Another way of constraining the network is to add
noise to the input and teach the network how to remove the noise from
the data.
 Tuning the Activation Functions: This method involves changing the activation functions of various nodes so that a majority of the nodes are dormant, thus effectively reducing the size of the hidden layers.
The different variations of Auto-encoders are:-
 Denoising Auto-encoder: This type of auto-encoder works on a partially corrupted input and trains to recover the original undistorted image. As mentioned above, this method is an effective way to constrain the network from simply copying the input.
 Sparse Auto-encoder: This type of auto-encoder typically contains more hidden units than the input, but only a few are allowed to be active at once. This property is called the sparsity of the network. The sparsity of the network can be controlled by either manually zeroing the required hidden units, tuning the activation functions, or by adding a loss term to the cost function.
 Variational Auto-encoder: This type of auto-encoder makes strong assumptions about the distribution of latent variables and uses the Stochastic Gradient Variational Bayes estimator in the training process. It assumes that the data is generated by a Directed Graphical Model and tries to learn an approximation q_φ(z|x) to the conditional posterior p_θ(z|x), where φ and θ are the parameters of the encoder and the decoder respectively.
Below is a basic, intuitive example of how to build the autoencoder model and fit X_train to itself.
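The following is a minimal dense autoencoder sketch in Keras that matches the autoencoder.fit(X_train, X_train, ...) call shown earlier; the input size, layer widths and placeholder data are illustrative assumptions.

 Python3

# Minimal dense autoencoder sketch (Keras); sizes and data are placeholders.
import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

input_dim, code_dim = 784, 32                 # e.g. flattened 28x28 images

inputs = Input(shape=(input_dim,))
encoded = Dense(128, activation="relu")(inputs)      # encoder: decreasing layer sizes
code = Dense(code_dim, activation="relu")(encoded)   # the "Code" (bottleneck) layer
decoded = Dense(128, activation="relu")(code)        # decoder: increasing layer sizes
outputs = Dense(input_dim, activation="sigmoid")(decoded)

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

X_train = np.random.rand(1000, input_dim)     # placeholder data for illustration
autoencoder.fit(X_train, X_train, epochs=200, batch_size=64)  # the input is also the target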

An Autoencoder consists of three layers:

1. Encoder

2. Code

3. Decoder

The Encoder layer compresses the input image into a latent space
representation. It encodes the input image as a compressed representation in a
reduced dimension.

The compressed image is a distorted version of the original image.

The Code layer represents the compressed input fed to the decoder layer.
The decoder layer decodes the encoded image back to the original dimension. The decoded image is reconstructed from the latent space representation and is a lossy reconstruction of the original image.

Convolutional autoencoding:

A convolutional autoencoder is a neural network (a special case of an unsupervised learning model) that is trained to reproduce its input image in the
output layer. An image is passed through an encoder, which is a ConvNet that
produces a low-dimensional representation of the image. The decoder, which is
another ConvNet, takes this compressed image and reconstructs the
original image.
The encoder is used to compress the data and the decoder is used to reproduce
the original image. Therefore, autoencoders may be used for data compression.
Compression logic is data-specific, meaning it is learned from data rather than
predefined compression algorithms such as JPEG, MP3, and so on. Other
applications of autoencoders can be image denoising (producing a cleaner
image from a corrupted image), dimensionality reduction, and image search:

VARIATIONAL AUTOENCODING:

A variational autoencoder (VAE) provides a probabilistic manner for describing an observation in latent space. Thus, rather than building an encoder which
outputs a single value to describe each latent state attribute, we'll formulate our
encoder to describe a probability distribution for each latent attribute.
EXPLANATION:
In neural net language, a variational autoencoder consists of an encoder, a
decoder, and a loss function.
Variational autoencoder (VAE) is a generative model that combines elements of
deep learning and Bayesian inference. It is a type of neural network that can
learn to encode high-dimensional data into a lower-dimensional representation,
and to decode the lower-dimensional representation back into the original high-
dimensional data.

The main idea behind VAEs is to learn a probabilistic model of the data, where
the lower-dimensional representation is treated as a random variable with a
known probability distribution. The encoder network maps the input data to the
parameters of this probability distribution, while the decoder network samples
from the distribution to generate new data points.

The encoder network consists of one or more layers of neural networks that map
the input data to the mean and standard deviation of the probability distribution.
The standard deviation is used to ensure that the encoded data has some
randomness and variability.

The decoder network maps the lower-dimensional representation back to the high-dimensional data. The decoder network takes as input a sample from the
probability distribution defined by the encoder network and generates a new
data point.

During training, the VAE learns to optimize the parameters of the encoder and
decoder networks to minimize the difference between the input data and the
reconstructed data. The VAE also learns to minimize the difference between the
probability distribution defined by the encoder network and a known prior
probability distribution.

The VAE is trained using a variant of stochastic gradient descent called the
reparameterization trick. The reparameterization trick involves sampling from a
standard normal distribution and then transforming the sample using the mean
and standard deviation output by the encoder network.
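A minimal sketch of the reparameterization trick in Python (NumPy); mu and log_var stand in for the encoder outputs for one batch, and the shapes are illustrative.

 Python3

# Reparameterization trick sketch (NumPy).
import numpy as np

batch_size, latent_dim = 8, 2
mu = np.zeros((batch_size, latent_dim))        # encoder output: means (placeholder)
log_var = np.zeros((batch_size, latent_dim))   # encoder output: log-variances (placeholder)

eps = np.random.standard_normal((batch_size, latent_dim))   # sample from N(0, I)
z = mu + np.exp(0.5 * log_var) * eps           # deterministic transform of mu, sigma and eps
# Because the randomness is isolated in eps, gradients can flow through mu and log_var.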

The VAE has many applications in deep learning, including image and video
generation, anomaly detection, and dimensionality reduction. It has also been
used in combination with other deep learning techniques, such as generative
adversarial networks (GANs) and adversarial autoencoders, to improve its
performance and stability.
GLOSSARY:

 Loss function: in neural net language, we think of loss functions. Training means minimizing these loss functions. But
in variational inference, we maximize the ELBO (which is not a
loss function). This leads to awkwardness like
calling optimizer.minimize(-elbo) as optimizers in neural net
frameworks only support minimization.
 Encoder: in the neural net world, the encoder is a neural network that outputs a representation z of the data x. In probability model terms, the inference network parametrizes the approximate posterior of the latent variables z. The inference network outputs parameters to the distribution q(z|x).
 Decoder: in deep learning, the decoder is a neural net that learns to reconstruct the data x given a representation z. In terms of probability models, the likelihood of the data x given latent variables z is parametrized by a generative network. The generative network outputs parameters to the likelihood distribution p(x|z).
 Local latent variables: these are the z_i for each datapoint x_i. There are no global latent variables. Because there are only local latent variables, we can easily decompose the ELBO into terms L_i that depend only on a single datapoint x_i. This enables stochastic gradient descent.
 Inference: in neural nets, inference usually means
prediction of latent representations given new, never-before-seen
datapoints. In probability models, inference refers to inferring the
values of latent variables given observed data.
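For reference, the ELBO mentioned in the glossary takes the standard form

ELBO(x) = E_{q(z|x)}[ log p(x|z) ] − KL( q(z|x) || p(z) ),

i.e., a reconstruction term plus a regularization term that keeps the approximate posterior q(z|x) close to the prior p(z). Maximizing the ELBO is equivalent to minimizing its negative as a loss.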

GENERATIVE ADVERSARIAL NETWORKS:

A Generative Adversarial Network (GAN) is a deep learning architecture that consists of two neural networks competing against each other in a zero-sum
game framework. The goal of GANs is to generate new, synthetic data that
resembles some known data distribution.

1.Components:

 Generator network: creates synthetic data


 Discriminator network: evaluates the synthetic data and tries to
determine if it’s real or fake

2.Training:

 The generator network produces synthetic data and the discriminator network evaluates it.
 The generator is trained to fool the discriminator and the discriminator is trained to correctly identify real and fake data.
 This process continues until the generator produces data that is
indistinguishable from real data.

3.Applications:

 Image synthesis
 Text-to-Image synthesis
 Image-to-Image translation
 Anomaly detection
 Data augmentation

4.Limitations:
 Training can be unstable and prone to mode collapse, where the
generator produces limited variations of synthetic data.
 GANs can be difficult to train and require a lot of computational
resources.
 GANs can generate unrealistic or irrelevant synthetic data if the
generator and discriminator are not properly trained.
Generative Adversarial Networks (GANs) are a powerful class of neural networks that are used for unsupervised learning. They were developed and introduced by Ian J. Goodfellow in 2014. GANs are basically made up of a system of two competing neural network models which compete with each other and are able to analyze, capture and copy the variations within a dataset.

Why were GANs developed in the first place? It has been noticed that most mainstream neural nets can be easily fooled into misclassifying things by adding only a small amount of noise to the original data. Surprisingly, the model after adding noise often has higher confidence in the wrong prediction than it had when it predicted correctly. One reason for this vulnerability is that most machine learning models learn from a limited amount of data, which is a huge drawback, as it makes them prone to overfitting. Also, the learned mapping between the input and the output is almost linear; although the boundaries of separation between the various classes may appear smooth, even a small change to a point in the feature space can lead to misclassification of the data.

How do GANs work? Generative Adversarial Networks (GANs) can be broken down into three parts:
 Generative: to learn a generative model, which describes how data is generated in terms of a probabilistic model.
 Adversarial: the training of the model is done in an adversarial setting.
 Networks: deep neural networks are used as the artificial intelligence (AI) algorithms for training.

In GANs, there is a generator and a discriminator. The Generator generates fake samples of data (an image, audio, and so on) and tries to fool the Discriminator. The Discriminator, on the other hand, tries to distinguish between the real and fake samples. The Generator and the Discriminator are both neural networks, and they run in competition with each other during the training phase. These steps are repeated many times, and with each repetition the Generator and Discriminator get better at their respective jobs. The working is usually visualized with a diagram of the Generator and Discriminator in a feedback loop (figure not reproduced here); a code sketch of the same training loop follows.
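In place of the figure, here is a minimal GAN training-loop sketch in Python (Keras). The network sizes, the placeholder "real" data and the hyperparameters are illustrative assumptions.

 Python3

# Minimal GAN training-loop sketch (Keras); sizes, data and settings are placeholders.
import numpy as np
from tensorflow.keras import layers, models

latent_dim, data_dim = 16, 64

generator = models.Sequential([
    layers.Dense(32, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(data_dim, activation="tanh"),            # produces a fake sample
])
discriminator = models.Sequential([
    layers.Dense(32, activation="relu", input_shape=(data_dim,)),
    layers.Dense(1, activation="sigmoid"),                 # real (1) vs fake (0)
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Combined model used to train the generator; the discriminator is frozen here.
# (Classic Keras pattern: the trainable flag is captured when each model is compiled.)
discriminator.trainable = False
gan = models.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

real_data = np.random.rand(1000, data_dim) * 2 - 1         # placeholder "real" dataset
batch = 32

for step in range(1000):
    # 1. Train the discriminator on a batch of real and fake samples
    noise = np.random.normal(size=(batch, latent_dim))
    fake = generator.predict(noise, verbose=0)
    real = real_data[np.random.randint(0, len(real_data), batch)]
    discriminator.train_on_batch(real, np.ones((batch, 1)))
    discriminator.train_on_batch(fake, np.zeros((batch, 1)))
    # 2. Train the generator to fool the discriminator (labels say "real")
    noise = np.random.normal(size=(batch, latent_dim))
    gan.train_on_batch(noise, np.ones((batch, 1)))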

Advantages of Generative Adversarial Networks (GANs):

1. Synthetic data generation: GANs can generate new, synthetic data that resembles some known data distribution, which can be useful for data augmentation, anomaly detection, or creative applications.
2. High-quality results: GANs can produce high-quality, photorealistic results in image synthesis, video synthesis, music synthesis, and other tasks.
3. Unsupervised learning: GANs can be trained without labeled data, making them suitable for unsupervised learning tasks, where labeled data is scarce or difficult to obtain.
4. Versatility: GANs can be applied to a wide range of tasks, including image synthesis, text-to-image synthesis, image-to-image translation, anomaly detection, data augmentation, and others.

Disadvantages of Generative Adversarial Networks (GANs):

1. Training instability: GANs can be difficult to train, with the risk of instability, mode collapse, or failure to converge.
2. Computational cost: GANs can require a lot of computational resources and can be slow to train, especially for high-resolution images or large datasets.
3. Overfitting: GANs can overfit to the training data, producing synthetic data that is too similar to the training data and lacking diversity.
4. Bias and fairness: GANs can reflect the biases and unfairness present in the training data, leading to discriminatory or biased synthetic data.
5. Interpretability and accountability: GANs can be opaque and difficult to interpret or explain, making it challenging to ensure accountability, transparency, or fairness in their applications.

AUTO ENCODERS FOR FEATURE EXTRACTION:


Autoencoder is a type of neural network architecture that can be used for feature
extraction in machine learning tasks. An autoencoder consists of two parts: an
encoder and a decoder. The encoder takes an input data point and maps it to a
lower-dimensional representation, while the decoder maps the lower-
dimensional representation back to the original input data.

Autoencoders can be trained to learn a compact and useful representation of the input data by minimizing the difference between the original input data and the
reconstructed data. This process of learning a compressed representation of the
data is called feature extraction.

Once an autoencoder is trained, the encoder can be used to extract features from
new input data points. The lower-dimensional representation generated by the
encoder can be used as a new set of features for downstream machine learning
tasks, such as classification or clustering.
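A short Python (Keras) sketch of using the encoder half of a trained autoencoder as a feature extractor; the sizes and the random placeholder data are illustrative assumptions.

 Python3

# Using a trained autoencoder's encoder as a feature extractor (Keras sketch).
import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(784,))
code = Dense(32, activation="relu", name="code")(Dense(128, activation="relu")(inputs))
outputs = Dense(784, activation="sigmoid")(Dense(128, activation="relu")(code))

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
X = np.random.rand(1000, 784)                 # placeholder data
autoencoder.fit(X, X, epochs=5, batch_size=64)   # train on the data itself

# The encoder shares the trained layers up to the bottleneck
encoder = Model(inputs, autoencoder.get_layer("code").output)
features = encoder.predict(X[:10])            # lower-dimensional features for downstream models
print(features.shape)                         # (10, 32)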

One advantage of using autoencoders for feature extraction is that they can learn
nonlinear relationships in the data, which traditional linear methods may not be
able to capture. Autoencoders can also be used to remove noise or redundancy
from the data, which can improve the performance of downstream machine
learning tasks.

In addition to traditional autoencoders, there are many variants of the autoencoder architecture that can be used for feature extraction, including
denoising autoencoders, variational autoencoders, and convolutional
autoencoders. These variants can be used in different types of data, such as
images, text, or time-series data.

Overall, autoencoders are a powerful tool for feature extraction in machine learning. They can learn compact and useful representations of the input data, which can be used to improve the performance of downstream machine learning tasks.

Here are some additional details on using autoencoders for feature extraction:

Unsupervised learning: Autoencoders are a type of unsupervised learning method, which means that they do not require labeled data to learn useful
representations. Instead, they can learn to extract features from raw data without
any prior knowledge of the data.

Nonlinear transformations: Autoencoders can learn nonlinear transformations of the input data, which can be useful for capturing complex patterns and
relationships in the data. The nonlinear transformations are learned through the
hidden layers of the autoencoder, which can have multiple layers of nonlinear
activation functions.

Dimensionality reduction: Autoencoders can be used for dimensionality reduction, which is the process of reducing the number of features or variables
in the data. By learning a compressed representation of the input data,
autoencoders can be used to reduce the dimensionality of the data while
retaining most of the information.

Transfer learning: Autoencoders can be used for transfer learning, which is the
process of reusing learned features from one task to another. By pretraining an
autoencoder on a large dataset, the learned features can be used to initialize the
weights of a neural network for a new task, which can improve the performance
of the network.

Variants of autoencoders: There are many variants of the autoencoder architecture that can be used for feature extraction. For example, denoising
autoencoders can be used to learn features that are robust to noise in the data,
while convolutional autoencoders can be used for feature extraction in image
data.
Hyperparameter tuning: The performance of an autoencoder for feature
extraction depends on many hyperparameters, such as the number of hidden
layers, the size of the hidden layers, the learning rate, and the activation
functions. Hyperparameter tuning can be used to optimize the performance of
the autoencoder for a specific task.

Overall, autoencoders are a powerful tool for feature extraction in machine learning. They can learn nonlinear transformations of the input data, reduce the
dimensionality of the data, and be used for transfer learning. By using
autoencoders for feature extraction, the performance of downstream machine
learning tasks can be improved.
Convolutional autoencoding:
Convolutional autoencoders are a type of autoencoder architecture that is
specifically designed for feature extraction in image data. They use
convolutional layers for the encoder and decoder instead of fully connected
layers, which can capture spatial features in the image data.

In a convolutional autoencoder, the encoder consists of one or more convolutional layers followed by one or more fully connected layers. The
convolutional layers are used to extract features from the input image data by
sliding a small filter over the image and computing a dot product between the
filter weights and the local image patches. The resulting feature maps are then
downsampled to reduce the dimensionality of the data.

The decoder in a convolutional autoencoder consists of one or more fully connected layers followed by one or more transpose convolutional layers. The
transpose convolutional layers are used to reconstruct the original image from
the lower-dimensional representation generated by the encoder.

The loss function used to train a convolutional autoencoder is typically the mean squared error between the original image and the reconstructed image. By
minimizing this loss function, the convolutional autoencoder can learn to extract
useful features from the image data and reconstruct the original image with
minimal loss of information.
Convolutional autoencoders can be used for a variety of image processing tasks,
such as denoising, image inpainting, and image generation. By using
convolutional autoencoders for feature extraction, the performance of
downstream image processing tasks can be improved by learning a more
compact and useful representation of the input image data.
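A minimal convolutional autoencoder sketch in Keras for 28x28 grayscale images. It uses the common pooling/upsampling variant rather than the fully connected and transpose-convolution layers described above; the architecture and sizes are illustrative assumptions.

 Python3

# Minimal convolutional autoencoder sketch (Keras); architecture is illustrative.
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
from tensorflow.keras.models import Model

inputs = Input(shape=(28, 28, 1))

# Encoder: convolutions extract spatial features, pooling downsamples them
x = Conv2D(16, (3, 3), activation="relu", padding="same")(inputs)
x = MaxPooling2D((2, 2), padding="same")(x)                  # 14x14
x = Conv2D(8, (3, 3), activation="relu", padding="same")(x)
encoded = MaxPooling2D((2, 2), padding="same")(x)            # 7x7 bottleneck

# Decoder: upsampling and convolutions reconstruct the original resolution
x = Conv2D(8, (3, 3), activation="relu", padding="same")(encoded)
x = UpSampling2D((2, 2))(x)                                  # 14x14
x = Conv2D(16, (3, 3), activation="relu", padding="same")(x)
x = UpSampling2D((2, 2))(x)                                  # 28x28
decoded = Conv2D(1, (3, 3), activation="sigmoid", padding="same")(x)

conv_autoencoder = Model(inputs, decoded)
conv_autoencoder.compile(optimizer="adam", loss="mse")       # pixel-wise reconstruction loss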
Encoder architecture: The encoder in a convolutional autoencoder consists of
one or more convolutional layers, followed by one or more fully connected
layers. The convolutional layers extract features from the input image data,
while the fully connected layers compress the extracted features into a lower-
dimensional representation.

Decoder architecture: The decoder in a convolutional autoencoder consists of one or more fully connected layers, followed by one or more transpose
convolutional layers. The fully connected layers expand the compressed
representation generated by the encoder, while the transpose convolutional
layers reconstruct the original image from the expanded representation.

Spatial information preservation: Convolutional layers in the encoder help to preserve spatial information in the input image data, which is important for
tasks like object recognition and segmentation. The transpose convolutional
layers in the decoder reconstruct the original image with the preserved spatial
information.

Pooling layers: Pooling layers are often used in convolutional autoencoders to downsample the feature maps generated by the convolutional layers in the
encoder. This helps to reduce the dimensionality of the data and improve
computational efficiency.

Loss function: The loss function used to train a convolutional autoencoder is typically the mean squared error between the original image and the
reconstructed image. However, other loss functions like binary cross-entropy
can be used for specific tasks like image segmentation or binary image
classification.

Pre-training: Convolutional autoencoders can be pre-trained on large datasets like ImageNet to learn useful features from the input image data. The learned
features can then be fine-tuned for specific image processing tasks like image
classification or object detection.

Applications: Convolutional autoencoders are used in a variety of applications, including image denoising, image inpainting, image compression, and
generative modeling.

Overall, convolutional autoencoders are a powerful tool for feature extraction in image data. They help to preserve spatial information in the input image data
and can be used for a variety of image processing tasks. By using convolutional
autoencoders for feature extraction, the performance of downstream image
processing tasks can be improved by learning a more compact and useful
representation of the input image data.
DENOISING AUTOENCODER:

Autoencoders are Neural Networks which are commonly used for feature selection and extraction. However, when there are more nodes in the hidden layer than there are inputs, the network risks learning the so-called "Identity Function", also called the "Null Function", meaning that the output simply equals the input, making the Autoencoder useless.

Denoising Autoencoders solve this problem by corrupting the data on purpose by randomly turning some of the input values to zero. In general, the percentage
of input nodes which are being set to zero is about 50%. Other sources suggest a
lower count, such as 30%. It depends on the amount of data and input nodes you
have.
Architecture of a DAE. Copyright by Kirill Eremenko (Deep Learning A-Z™:
Hands-On Artificial Neural Networks)

When calculating the Loss function, it is important to compare the output values
with the original input, not with the corrupted input. That way, the risk of
learning the identity function instead of extracting features is eliminated.

A great implementation has been posted by opendeep.org, where they use Theano to build a very basic Denoising Autoencoder and train it on the MNIST dataset. The OpenDeep articles are very basic and made for beginners, so even if you don't have much experience with Neural Networks, the article is worth checking out.

Original input, corrupted data and reconstructed data. Copyright by opendeep.org.

Denoising Autoencoders are an important tool for feature selection and extraction.
The structure of a DAE

First, let's do a quick recap on the high-level structure of Autoencoders. The critical components of Autoencoders are:

 Input layer — to pass input data into the network

 Hidden layer consisting of Encoder and Decoder — to process information by applying weights, biases and activation functions

 Output layer — typically matches the input neurons

Here is an illustration of the above summary:

A high-level illustration of layers within an Autoencoder Neural Network. Image by author.

The most common type of Autoencoder is an Undercomplete Autoencoder, which squeezes (encodes) data into fewer neurons (a lower dimension) while removing "unimportant" information. It achieves that by training an encoder and decoder simultaneously, so the output neurons match the inputs as closely as possible.
Here is an example of what the network diagram would look like for an
Undercomplete Autoencoder:

Undercomplete Autoencoder Neural Network. Image by author, created using AlexNail's NN-SVG tool.

Denoising Autoencoder (DAE)

The purpose of a DAE is to remove noise. You can also think of it as a customised denoising algorithm tuned to your data.

Note the emphasis on the word customised. Given that we train a DAE on a
specific set of data, it will be optimised to remove noise from similar data. For
example, if we train it to remove noise from a collection of images, it will work
well on similar images but will not be suitable for cleaning text data.
Unlike Undercomplete AE, we may use the same or higher number of neurons
within the hidden layer, making the DAE overcomplete.

The second difference comes from not using identical inputs and outputs.
Instead, the outputs are the original data (e.g., images), while the inputs contain
data with some added noise.
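A minimal denoising autoencoder training sketch in Python (Keras): roughly half of the input values are zeroed out, as described above, and the loss compares the output with the original (clean) input. The sizes and placeholder data are illustrative assumptions.

 Python3

# Denoising autoencoder sketch (Keras): corrupted inputs, clean targets.
import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

X_clean = np.random.rand(1000, 784)                     # placeholder clean data

# Masking noise: set about 50% of the input values to zero
mask = (np.random.rand(*X_clean.shape) > 0.5).astype(float)
X_noisy = X_clean * mask

inputs = Input(shape=(784,))
h = Dense(256, activation="relu")(inputs)
outputs = Dense(784, activation="sigmoid")(h)

dae = Model(inputs, outputs)
dae.compile(optimizer="adam", loss="mse")

# Key point: compare the output with the ORIGINAL input, not the corrupted one
dae.fit(X_noisy, X_clean, epochs=10, batch_size=64)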

SPARSE AUTOENCODER:

A sparse autoencoder is a type of autoencoder architecture that is designed to learn a compressed representation of the input data with a constraint on the sparsity of
the learned features. In other words, the goal of a sparse autoencoder is to learn
a representation of the input data that is as compact as possible, while also
ensuring that the learned features are sparse, i.e., only a small subset of the
features are active at any given time.

The sparsity constraint is typically enforced using a regularization term in the loss function. The regularization term encourages the model to learn a
compressed representation of the input data that has a small number of active
features. The sparsity constraint can be useful in cases where the input data has
a lot of redundant information, as it helps to identify the most important features
of the data.
Sparse autoencoders are often used for feature extraction and dimensionality
reduction tasks, where the goal is to extract the most important features from the
input data while discarding irrelevant or redundant information. They have been
used in a variety of applications, such as image recognition, speech recognition,
and natural language processing.

Overall, sparse autoencoders are a powerful tool for learning a compressed representation of the input data with a constraint on the sparsity of the learned features. They can be used for a variety of tasks, including feature extraction, dimensionality reduction, and data compression.

Regularization techniques: In order to enforce sparsity in the learned features, a
variety of regularization techniques can be used, such as L1 regularization, KL
divergence, or max pooling. L1 regularization adds a penalty term to the loss
function that encourages the model to learn a sparse representation, while KL
divergence compares the sparsity of the learned features to a desired level of
sparsity. Max pooling is another technique that can be used to enforce sparsity
by selecting the most active feature in each local region of the input.
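A minimal sparse autoencoder sketch in Keras using an L1 activity regularizer on the code layer, one of the regularization techniques mentioned above; the sizes and the penalty weight are illustrative assumptions.

 Python3

# Sparse autoencoder sketch (Keras) with an L1 activity penalty on the code layer.
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(784,))
# Overcomplete hidden layer, but the L1 penalty keeps most units inactive
code = Dense(1024, activation="relu",
             activity_regularizer=regularizers.l1(1e-5))(inputs)
outputs = Dense(784, activation="sigmoid")(code)

sparse_ae = Model(inputs, outputs)
sparse_ae.compile(optimizer="adam", loss="mse")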

Activation functions: Sparse autoencoders typically use activation functions that are more sensitive to small changes in the input, such as the rectified linear unit
(ReLU) or the sigmoid function. These activation functions can help to generate
sparse features by saturating the output of some neurons and making them
inactive.

Initialization: The weights of the sparse autoencoder should be carefully initialized to avoid vanishing or exploding gradients. Common initialization
techniques include Xavier initialization, which scales the weights based on the
number of inputs and outputs, or He initialization, which scales the weights
based on the number of inputs.

Applications: Sparse autoencoders have been used in a variety of applications, such as image recognition, speech recognition, natural language processing, and
anomaly detection. They can be used to extract the most important features
from the input data while discarding irrelevant or redundant information,
making them a powerful tool for data compression and dimensionality
reduction.
Limitations: One limitation of sparse autoencoders is that they can be difficult
to train, as the sparsity constraint can cause the model to get stuck in local
optima. Additionally, they may not perform as well as other types of
autoencoders in tasks where the input data is highly structured and contains
non-redundant information.

Overall, sparse autoencoders are a useful tool for learning a compressed representation of the input data with a constraint on the sparsity of the learned
features. By enforcing sparsity, sparse autoencoders can help to identify the
most important features of the data and discard redundant information, making
them a powerful tool for feature extraction and dimensionality reduction tasks.
