Reinforcement Learning

Reinforcement learning is a machine learning technique in which an agent learns by interacting with an environment. The agent performs actions and receives rewards or penalties in response, learning over time which actions yield the most reward. There is no labeled training data; the agent learns through trial-and-error interaction with the environment. The goal of reinforcement learning is for the agent to discover behaviors that earn the maximum reward by exploring the environment and exploiting what it has learned. Key elements include the agent's policy for choosing actions, the reward signal, and a model of the environment.


What is Reinforcement Learning?

o Reinforcement Learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and observing the results of those actions. For each good action, the agent gets positive feedback, and for each bad action, the agent gets negative feedback or a penalty.
o In Reinforcement Learning, the agent learns automatically from feedback, without any labeled data, unlike supervised learning.
o Since there is no labeled data, the agent is bound to learn from its experience alone.
o RL solves a specific type of problem where decision making is sequential, and
the goal is long-term, such as game-playing, robotics, etc.
o The agent interacts with the environment and explores it by itself. The primary goal of an agent in reinforcement learning is to improve its performance by collecting the maximum positive reward.
o The agent learns through trial and error, and based on its experience, it learns to perform the task in a better way. Hence, we can say that "Reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the environment and learns to act within it." How a robotic dog learns the movement of its limbs is an example of reinforcement learning.
o It is a core part of Artificial Intelligence, and many AI agents work on the concept of reinforcement learning. Here we do not need to pre-program the agent, as it learns from its own experience without any human intervention.
o Example: Suppose there is an AI agent present within a maze environment, and its goal is to find the diamond. The agent interacts with the environment by performing actions; based on those actions, the state of the agent changes, and it also receives a reward or penalty as feedback.
o The agent keeps doing these three things (take an action, change state or remain in the same state, and get feedback), and by doing so, it learns and explores the environment. A minimal sketch of this interaction loop is given after this list.
o The agent learns which actions lead to positive feedback or rewards and which actions lead to negative feedback or penalties. As a positive reward, the agent gets a positive point, and as a penalty, it gets a negative point.
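
To make this interaction loop concrete, here is a minimal sketch in Python. The ToyCorridorEnv class, its states, and its reward values are illustrative assumptions (not a specific library and not the article's exact maze), but the loop of taking an action, changing state, and receiving feedback is the one described above.

```python
import random

class ToyCorridorEnv:
    """A toy corridor of states 0..8 (illustrative): a fire pit at state 0, a diamond at state 8."""
    def __init__(self):
        self.state = 4  # start in the middle

    def step(self, action):
        self.state += 1 if action == "right" else -1
        if self.state == 8:
            return self.state, +1, True   # reached the diamond: positive reward, episode ends
        if self.state == 0:
            return self.state, -1, True   # fell into the fire pit: penalty, episode ends
        return self.state, 0, False       # otherwise no reward, keep interacting

env = ToyCorridorEnv()
done = False
while not done:
    action = random.choice(["left", "right"])   # a (random) policy chooses the action
    state, reward, done = env.step(action)      # the environment returns the new state and feedback
    print(f"action={action}, state={state}, reward={reward}")
```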

Terms used in Reinforcement Learning


o Agent: An entity that can perceive/explore the environment and act upon it.
o Environment: The situation in which an agent is present or by which it is surrounded. In RL, we assume a stochastic environment, which means it is random in nature.
o Action: Actions are the moves taken by an agent within the environment.
o State: The situation returned by the environment after each action taken by the agent.
o Reward: Feedback returned to the agent from the environment to evaluate the agent's action.
o Policy: A strategy applied by the agent to choose the next action based on the current state.
o Value: The expected long-term return with the discount factor, as opposed to the short-term reward.
o Q-value: Mostly similar to the value, but it takes one additional parameter, the current action (a).

Key Features of Reinforcement Learning


o In RL, the agent is not instructed about the environment or which actions need to be taken.
o It is based on a trial-and-error process.
o The agent takes the next action and changes state according to the feedback from the previous action.
o The agent may get a delayed reward.
o The environment is stochastic, and the agent needs to explore it to obtain the maximum positive reward.

Approaches to implement Reinforcement Learning

There are mainly three ways to implement reinforcement learning in ML, which are:

1. Value-based:
The value-based approach is about finding the optimal value function, which is the maximum value at a state under any policy. Here, the agent expects the long-term return at any state s under policy π.
2. Policy-based:
The policy-based approach is to find the optimal policy for the maximum future rewards without using the value function. In this approach, the agent tries to apply a policy such that the action performed at each step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
o Deterministic: The same action is produced by the policy (π) at any state.
o Stochastic: In this policy, probability determines the action produced.
3. Model-based: In the model-based approach, a virtual model is created for
the environment, and the agent explores that environment to learn it. There is
no particular solution or algorithm for this approach because the model
representation is different for each environment.

Elements of Reinforcement Learning


There are four main elements of Reinforcement Learning, which are given below:

1. Policy
2. Reward Signal
3. Value Function
4. Model of the environment

1) Policy: A policy can be defined as the way an agent behaves at a given time. It maps the perceived states of the environment to the actions to be taken in those states. A policy is the core element of RL, as it alone can define the behavior of the agent. In some cases, it may be a simple function or a lookup table, whereas in other cases it may involve general computation such as a search process. A policy can be deterministic or stochastic:



For deterministic policy: a = π(s)
For stochastic policy: π(a | s) = P[A_t = a | S_t = s]
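
As an illustrative sketch (the state names, actions, and probabilities below are assumptions, not from the text), a deterministic policy can be stored as a lookup table from states to actions, while a stochastic policy stores a probability distribution over actions for each state:

```python
import random

# Deterministic policy: a = π(s), a simple lookup table (toy states/actions assumed).
deterministic_policy = {"s1": "right", "s2": "up", "s3": "right"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic policy: π(a | s) = P[A_t = a | S_t = s], a distribution over actions per state.
stochastic_policy = {
    "s1": {"right": 0.8, "up": 0.2},
    "s2": {"right": 0.5, "up": 0.5},
}

def act_stochastic(state):
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("s1"))  # always "right"
print(act_stochastic("s1"))     # "right" about 80% of the time
```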

2) Reward Signal: The goal of reinforcement learning is defined by the reward signal. At each state, the environment sends an immediate signal to the learning agent; this signal is known as the reward signal. These rewards are given according to the good and bad actions taken by the agent. The agent's main objective is to maximize the total reward it collects for good actions. The reward signal can change the policy; for example, if an action selected by the agent leads to a low reward, then the policy may change to select other actions in the future.

3) Value Function: The value function gives information about how good a situation or action is and how much reward an agent can expect. A reward indicates the immediate desirability of each action, whereas the value function specifies which states and actions are good in the long run. The value function depends on the reward because, without reward, there could be no value. The goal of estimating values is to achieve more reward.

4) Model: The last element of reinforcement learning is the model, which mimics the behavior of the environment. With the help of the model, one can make inferences about how the environment will behave. For example, given a state and an action, the model can predict the next state and reward.

The model is used for planning, which means it provides a way to choose a course of action by considering possible future situations before actually experiencing them. Approaches that solve RL problems with the help of a model are termed model-based approaches, whereas an approach without a model is called a model-free approach.
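
A minimal sketch of such a model might look like the following; the tabular, deterministic mapping and the state names are illustrative assumptions:

```python
# A tabular, deterministic environment model (illustrative assumption):
# it maps (state, action) -> (next_state, reward).
model = {
    ("s1", "right"): ("s2", 0),
    ("s2", "right"): ("s3", 0),
    ("s3", "right"): ("s4", +1),   # s4 holds the diamond in the maze example
}

def predict(state, action):
    """Use the model for planning: infer the next state and reward without acting."""
    return model[(state, action)]

print(predict("s3", "right"))  # ('s4', 1)
```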

How does Reinforcement Learning Work?


To understand the working process of the RL, we need to consider two main things:

o Environment: It can be anything such as a room, maze, football ground, etc.


o Agent: An intelligent agent, such as an AI robot.
Let's take an example of a maze environment that the agent needs to explore.
Consider the below image:

In the above image, the agent is at the very first block of the maze. The maze consists of an S6 block, which is a wall, an S8 block, which is a fire pit, and an S4 block, which holds the diamond.

The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward. It can take four actions: move up, move down, move left, and move right.

The agent can take any path to reach the final point, but it needs to do so in as few steps as possible. Suppose the agent follows the path S9-S5-S1-S2-S3; then it will get the +1 reward.

The agent will try to remember the preceding steps it has taken to reach the final step. To memorize the steps, it assigns a value of 1 to each previous step. Consider the below step:
Now, the agent has successfully stored the previous steps by assigning the value 1 to each previous block. But what will the agent do if it starts from a block that has blocks of value 1 on both sides? Consider the below diagram:
It will be a difficult decision for the agent whether to go up or down, as each block has the same value. So, the above approach is not suitable for the agent to reach the destination. Hence, to solve this problem, we will use the Bellman equation, which is the main concept behind reinforcement learning.

The Bellman Equation


The Bellman equation was introduced by the mathematician Richard Ernest Bellman in the year 1953, and hence it is called the Bellman equation. It is associated with dynamic programming and is used to calculate the value of a decision problem at a certain point by including the values of successor states.

It is a way of calculating value functions in dynamic programming, and it leads to modern reinforcement learning.

The key elements used in the Bellman equation are:

o The action performed by the agent, referred to as "a"
o The state reached by performing the action, "s"
o The reward/feedback obtained for each good and bad action, "R"
o The discount factor, gamma "γ"

The Bellman equation can be written as:

V(s) = max_a [R(s,a) + γV(s')]

Where,

V(s) = the value calculated at a particular state.

R(s,a) = the reward obtained at state s by performing action a.

γ = the discount factor.

V(s') = the value of the next state.

In the above equation, we take the maximum over the possible actions because the agent always tries to find the optimal solution.

So now, using the Bellman equation, we will find the value at each state of the given environment. We will start from the block that is next to the target block.

For the 1st block:

V(s3) = max [R(s,a) + γV(s')], where V(s') = 0 because there is no further state to move to.

V(s3) = max[R(s,a)] => V(s3) = max[1] => V(s3) = 1.

For the 2nd block:

V(s2) = max [R(s,a) + γV(s')], where γ = 0.9 (say), V(s') = 1, and R(s,a) = 0, because there is no reward at this state.

V(s2) = max[0.9(1)] => V(s2) = max[0.9] => V(s2) = 0.9

For the 3rd block:

V(s1) = max [R(s,a) + γV(s')], where γ = 0.9, V(s') = 0.9, and R(s,a) = 0, because there is no reward at this state either.

V(s1) = max[0.9(0.9)] => V(s1) = max[0.81] => V(s1) = 0.81

For the 4th block:

V(s5) = max [R(s,a) + γV(s')], where γ = 0.9, V(s') = 0.81, and R(s,a) = 0, because there is no reward at this state either.

V(s5) = max[0.9(0.81)] => V(s5) = max[0.729] => V(s5) ≈ 0.73

For the 5th block:

V(s9) = max [R(s,a) + γV(s')], where γ = 0.9, V(s') = 0.73, and R(s,a) = 0, because there is no reward at this state either.

V(s9) = max[0.9(0.73)] => V(s9) = max[0.657] => V(s9) ≈ 0.66

Consider the below image:


Now, we will move further to the 6th block, and here the agent may change its route because it always tries to find the optimal path. So now, let's consider the block next to the fire pit.

Here the agent has three options for its move: if it moves to the blue box, it will feel a bump; if it moves to the fire pit, it will get a -1 reward. But here we are considering only positive rewards, so it will move upwards only. The complete block values can be calculated using the same formula. Consider the below image:
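
The backward computation above can also be written as a short sketch. This is an illustrative reconstruction under the assumptions used in the walkthrough (γ = 0.9, a +1 reward only on the step that reaches the diamond, and deterministic moves along the path S9-S5-S1-S2-S3):

```python
# Illustrative reconstruction of the walkthrough: values propagate backwards from the goal.
gamma = 0.9
path = ["s9", "s5", "s1", "s2", "s3"]        # path assumed in the text; s3 is next to the diamond (s4)

values = {}
next_value = 0.0
for i, state in enumerate(reversed(path)):
    reward = 1.0 if i == 0 else 0.0           # +1 only when moving from s3 into the diamond block
    values[state] = reward + gamma * next_value
    next_value = values[state]

print(values)
# {'s3': 1.0, 's2': 0.9, 's1': 0.81, 's5': 0.729, 's9': 0.6561}  (0.73 and 0.66 when rounded)
```
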
Types of Reinforcement learning
There are mainly two types of reinforcement learning, which are:

o Positive Reinforcement
o Negative Reinforcement

Positive Reinforcement:

Positive reinforcement means adding something to increase the tendency that the expected behavior will occur again. It has a positive impact on the agent's behavior and increases the strength of that behavior.

This type of reinforcement can sustain changes for a long time, but too much positive reinforcement may lead to an overload of states, which can diminish the results.

Negative Reinforcement:
Negative reinforcement is the opposite of positive reinforcement: it increases the tendency that a specific behavior will occur again by avoiding a negative condition.

It can be more effective than positive reinforcement, depending on the situation and behavior, but it provides reinforcement only up to the minimum required behavior.

How to represent the agent state?


We can represent the agent state using the Markov state, which contains all the required information from the history. A state S_t is a Markov state if it satisfies the following condition:

P[S_t+1 | S_t] = P[S_t+1 | S_1, ..., S_t]

The Markov state follows the Markov property, which says that the future is independent of the past given the present. Here we assume RL works in fully observable environments, where the agent can observe the environment and act in the new state. The complete process is known as the Markov Decision Process, which is explained below:

Markov Decision Process


The Markov Decision Process, or MDP, is used to formalize reinforcement learning problems. If the environment is completely observable, then its dynamics can be modeled as a Markov process. In an MDP, the agent constantly interacts with the environment and performs actions; at each action, the environment responds and generates a new state.
An MDP is used to describe the environment for RL, and almost all RL problems can be formalized using an MDP.

An MDP consists of a tuple of four elements (S, A, P_a, R_a):

o A set of finite states S
o A set of finite actions A
o A reward R_a received after transitioning from state s to state s' due to action a
o A transition probability P_a

An MDP uses the Markov property, and to better understand the MDP, we need to learn about it.

Markov Property:
It says that "if the agent is present in the current state s1, performs an action a1, and moves to the state s2, then the state transition from s1 to s2 depends only on the current state and action; it does not depend on past actions, rewards, or states."

Or, in other words, as per the Markov property, the current state transition does not depend on any past action or state. Hence, an MDP is an RL problem that satisfies the Markov property. For example, in a game of Chess, the players only focus on the current board position and do not need to remember past actions or states.

Finite MDP:

A finite MDP is one in which the states, rewards, and actions are all finite. In RL, we mostly consider only finite MDPs.

Markov Process:
A Markov process is a memoryless process with a sequence of random states S1, S2, ..., St that satisfies the Markov property. A Markov process is also known as a Markov chain, which is a tuple (S, P) of a state space S and a transition function P. These two components (S and P) define the dynamics of the system.
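
As an illustrative sketch (the states and transition probabilities are assumptions, not from the text), a Markov chain (S, P) can be represented as a transition table and sampled one step at a time, using only the current state:

```python
import random

# A tiny Markov chain (S, P); state names and probabilities are illustrative assumptions.
P = {
    "s1": {"s1": 0.1, "s2": 0.9},
    "s2": {"s1": 0.3, "s3": 0.7},
    "s3": {"s3": 1.0},              # s3 is absorbing in this toy chain
}

def step(state):
    """Sample the next state using only the current state (the Markov property)."""
    next_states, probs = zip(*P[state].items())
    return random.choices(next_states, weights=probs, k=1)[0]

state = "s1"
for _ in range(5):
    state = step(state)
    print(state)
```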

Reinforcement Learning Algorithms


Reinforcement learning algorithms are mainly used in AI applications and gaming applications. The most commonly used algorithms are:

o Q-Learning:
o Q-learning is an off-policy RL algorithm used for temporal difference learning. Temporal difference learning methods are a way of comparing temporally successive predictions.
o It learns the value function Q(s, a), which tells how good it is to take action "a" at a particular state "s".
o The below flowchart explains the working of Q-learning:
o State Action Reward State Action (SARSA):
o SARSA stands for State Action Reward State Action; it is an on-policy temporal difference learning method. The on-policy control method selects the action for each state while learning, using a specific policy.
o The goal of SARSA is to calculate Qπ(s, a) for the selected current policy π and all pairs (s, a).
o The main difference between the Q-learning and SARSA algorithms is that, unlike Q-learning, the maximum Q-value of the next state is not required for updating the Q-value in the table (a sketch contrasting the two update rules follows this list).
o In SARSA, the new action and reward are selected using the same policy that determined the original action.
o SARSA is so named because it uses the quintuple Q(s, a, r, s', a'), where:
s: original state
a: original action
r: reward observed while following the states
s' and a': new state-action pair

o Deep Q Neural Network (DQN):
o As the name suggests, DQN is Q-learning using neural networks.
o For an environment with a big state space, it is a challenging and complex task to define and update a Q-table.
o To solve such an issue, we can use the DQN algorithm, where, instead of defining a Q-table, a neural network approximates the Q-values for each action and state.
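
As a hedged sketch of the two update rules discussed above (the learning rate alpha, the discount factor value, and the dictionary-based Q-table are illustrative assumptions, not a prescribed implementation):

```python
# Illustrative temporal-difference updates; Q is a dict mapping (state, action) -> value.
alpha, gamma = 0.1, 0.9   # learning rate and discount factor (assumed values)

def q_learning_update(Q, s, a, r, s_next, actions):
    # Off-policy: bootstrap from the maximum Q-value of the next state.
    target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy: bootstrap from the action actually chosen by the current policy.
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```

The only difference between the two is the bootstrap target: Q-learning uses the maximum Q-value over the next state's actions, while SARSA uses the Q-value of the action the policy actually chose.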

Now, let us look at Q-learning in more detail.

Q-Learning Explanation:

o Q-learning is a popular model-free reinforcement learning algorithm based on the Bellman equation.
o The main objective of Q-learning is to learn a policy that can inform the agent which actions should be taken to maximize the reward under which circumstances.
o It is an off-policy RL algorithm that attempts to find the best action to take in the current state.
o The goal of the agent in Q-learning is to maximize the value of Q.
o The value of Q can be derived from the Bellman equation. Consider the Bellman equation given below:

In the equation, we have various components, including the reward, the discount factor (γ), the transition probability, and the next state s'. But no Q-value appears yet, so first consider the below image:

In the above image, we can see there is an agent who has three value options: V(s1), V(s2), and V(s3). As this is an MDP, the agent only cares about the current state and the future state. The agent can go in any direction (up, left, or right), so it needs to decide where to go for the optimal path. Here the agent will make a move on a probability basis and change its state. But if we want some exact moves, we need to make some changes in terms of the Q-value. Consider the below image:
Q represents the quality of the actions at each state. So instead of using a value at each state, we use a pair of state and action, i.e., Q(s, a). The Q-value specifies which action is more lucrative than others, and according to the best Q-value, the agent takes its next move. The Bellman equation can be used to derive the Q-value.

To perform an action, the agent will get a reward R(s, a), and it will also end up in a certain state, so the Q-value equation is:

Q(s, a) = R(s, a) + γ max [Q(s', a')]

Hence, we can say that V(s) = max [Q(s, a)].

The above formula is used to estimate the Q-values in Q-learning.

What is 'Q' in Q-learning?

The Q stands for quality in Q-learning, which means it specifies the quality of an
action taken by the agent.

Q-table:
A Q-table or matrix is created while performing Q-learning. The table is indexed by state-action pairs [s, a], and its values are initialized to zero. After each action, the table is updated, and the Q-values are stored within it.

The RL agent uses this Q-table as a reference table to select the best action based on the Q-values. A minimal sketch of building and using such a table is given below.
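
Here is a minimal, self-contained sketch of such a Q-table and the greedy action selection described above. The state names, action names, and the ε-greedy exploration parameter are illustrative assumptions:

```python
import random

states = ["s1", "s2", "s3"]                    # illustrative state and action names
actions = ["up", "down", "left", "right"]

# Q-table indexed by [s, a], initialized to zero.
Q = {(s, a): 0.0 for s in states for a in actions}

def best_action(state):
    """Pick the action with the highest Q-value for this state (greedy selection)."""
    return max(actions, key=lambda a: Q[(state, a)])

def epsilon_greedy(state, epsilon=0.1):
    """Mostly exploit the table, but explore a random action with probability epsilon."""
    return random.choice(actions) if random.random() < epsilon else best_action(state)

# After each action the table would be updated with the temporal-difference rule
# sketched earlier, e.g. Q[(s, a)] += alpha * (r + gamma * max_a' Q[(s', a')] - Q[(s, a)]).
print(epsilon_greedy("s1"))
```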

Difference between Reinforcement Learning and Supervised Learning

Reinforcement Learning and Supervised Learning are both part of machine learning, but the two types of learning are quite different from each other. An RL agent interacts with the environment, explores it, takes actions, and gets rewarded, whereas supervised learning algorithms learn from a labeled dataset and, on the basis of that training, predict the output.

The difference table between RL and Supervised learning is given below:

Reinforcement Learning | Supervised Learning

RL works by interacting with the environment. | Supervised learning works on an existing dataset.

The RL algorithm works the way the human brain works when making decisions. | Supervised learning works the way a human learns things under the supervision of a guide.

No labeled dataset is present. | A labeled dataset is present.

No previous training is provided to the learning agent. | Training is provided to the algorithm so that it can predict the output.

RL helps to make decisions sequentially. | In supervised learning, a decision is made when the input is given.

Reinforcement Learning Applications


1. Robotics:
a. RL is used in robot navigation, robo-soccer, walking, juggling, etc.
2. Control:
a. RL can be used for adaptive control, such as factory processes, admission control in telecommunications, and helicopter piloting.
3. Game Playing:
a. RL can be used in game playing, such as tic-tac-toe, chess, etc.
4. Chemistry:
a. RL can be used for optimizing chemical reactions.
5. Business:
a. RL is now used for business strategy planning.
6. Manufacturing:
a. In various automobile manufacturing companies, robots use deep reinforcement learning to pick goods and put them into containers.
7. Finance Sector:
a. RL is currently used in the finance sector for evaluating trading strategies.

What is a Genetic Algorithm?


Before understanding the genetic algorithm, let's first go over some basic terminology:

o Population: The population is the subset of all possible or probable solutions that can solve the given problem.
o Chromosome: A chromosome is one of the solutions in the population for the given problem, and a collection of genes makes up a chromosome.
o Gene: A gene is one element of a chromosome, i.e., a chromosome is divided into different genes.
o Allele: An allele is the value given to a gene within a particular chromosome.
o Fitness Function: The fitness function is used to determine an individual's fitness level in the population, i.e., the ability of an individual to compete with other individuals. In every iteration, individuals are evaluated based on their fitness function.
o Genetic Operators: In a genetic algorithm, the best individuals mate to produce offspring better than their parents. Genetic operators play a role in changing the genetic composition of the next generation.
o Selection: After calculating the fitness of every individual in the population, a selection process is used to determine which of the individuals in the population will get to reproduce and produce the offspring that will form the next generation.

The types of selection methods available are:

o Roulette wheel selection
o Tournament selection
o Rank-based selection

So, now we can define a genetic algorithm as a heuristic search algorithm used to solve optimization problems. It is a subset of evolutionary algorithms used in computing. A genetic algorithm uses the concepts of genetics and natural selection to solve optimization problems.

How Does a Genetic Algorithm Work?

The genetic algorithm works on an evolutionary generational cycle to generate high-quality solutions. It uses different operations that either enhance or replace the population in order to produce an improved, fitter solution.

It basically involves five phases to solve complex optimization problems, which are given below:

o Initialization
o Fitness Assignment
o Selection
o Reproduction
o Termination

1. Initialization
The process of a genetic algorithm starts by generating a set of individuals, which is called the population. Each individual is a solution to the given problem. An individual is characterized by a set of parameters called genes. Genes are joined into a string to form a chromosome, which encodes the solution to the problem. One of the most popular techniques for initialization is the use of random binary strings.

2. Fitness Assignment
The fitness function is used to determine how fit an individual is, i.e., the ability of an individual to compete with other individuals. In every iteration, individuals are evaluated based on their fitness function. The fitness function assigns a fitness score to each individual, and this score determines the probability of being selected for reproduction. The higher the fitness score, the greater the chance of being selected for reproduction.

3. Selection
The selection phase involves selecting individuals for the reproduction of offspring. The selected individuals are then arranged in pairs of two to enhance reproduction, and these individuals transfer their genes to the next generation.

There are three types of selection methods available (a roulette wheel selection sketch is given after this list):

o Roulette wheel selection
o Tournament selection
o Rank-based selection
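
A hedged sketch of roulette wheel (fitness-proportionate) selection follows; the population of binary chromosomes and the fitness scores are illustrative assumptions:

```python
import random

# Roulette wheel (fitness-proportionate) selection: an individual's chance of being
# picked is proportional to its share of the total fitness.
population = ["01101", "11000", "01000", "10011"]   # illustrative binary chromosomes
fitness = [5.0, 2.0, 1.0, 8.0]                      # illustrative fitness scores

def roulette_wheel_select(population, fitness):
    return random.choices(population, weights=fitness, k=1)[0]

parent = roulette_wheel_select(population, fitness)
print(parent)   # "10011" is the most likely pick (8 out of 16 of the probability mass)
```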

4. Reproduction
After the selection process, the creation of children occurs in the reproduction step. In this step, the genetic algorithm uses two variation operators that are applied to the parent population. The two operators involved in the reproduction phase are given below (a short sketch of both follows at the end of this section):

o Crossover: Crossover plays the most significant role in the reproduction phase of the genetic algorithm. In this process, a crossover point is selected at random within the genes. Then the crossover operator swaps the genetic information of two parents from the current generation to produce a new individual representing the offspring.

The genes of the parents are exchanged among themselves until the crossover point is reached. The newly generated offspring are added to the population. This process is also called recombination. Types of crossover available:
o One-point crossover
o Two-point crossover
o Uniform crossover
o Inheritable Algorithms crossover
o Mutation: The mutation operator inserts random genes into the offspring (new child) to maintain the diversity of the population. It can be done by flipping some bits in the chromosome.
Mutation helps in solving the issue of premature convergence and enhances diversification. The below image shows the mutation process.
Types of mutation available:
o Flip-bit mutation
o Gaussian mutation
o Exchange/Swap mutation
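
A hedged sketch of one-point crossover and flip-bit mutation on binary-string chromosomes follows; the parent strings and the mutation rate are illustrative assumptions:

```python
import random

def one_point_crossover(parent1, parent2):
    """Swap the tails of two binary-string parents at a random crossover point."""
    point = random.randint(1, len(parent1) - 1)
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]

def flip_bit_mutation(chromosome, rate=0.1):
    """Flip each bit with a small probability to keep diversity in the population."""
    return "".join(("1" if bit == "0" else "0") if random.random() < rate else bit
                   for bit in chromosome)

child1, child2 = one_point_crossover("11111", "00000")
print(child1, child2)                 # e.g. "11000" and "00111"
print(flip_bit_mutation(child1))      # occasionally flips a bit
```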

5. Termination
After the reproduction phase, a stopping criterion is applied as the basis for termination. The algorithm terminates once a threshold fitness solution is reached, and it returns the best solution in the population as the final solution.

General Workflow of a Simple Genetic Algorithm
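
The workflow can be summarized in a short, hedged sketch that strings the five phases together. The optimization problem here (maximize the number of 1-bits in a binary string, often called OneMax), the use of tournament selection, and all parameter values are illustrative assumptions rather than a prescribed implementation:

```python
import random

# A minimal genetic algorithm for an illustrative problem: maximize the number of
# 1-bits in a binary string (OneMax). Parameters below are assumed, not prescribed.
GENES, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.02

def fitness(chromosome):                      # fitness assignment
    return sum(chromosome)

def select(population):                       # tournament selection of size 2
    return max(random.sample(population, 2), key=fitness)

def crossover(p1, p2):                        # one-point crossover
    point = random.randint(1, GENES - 1)
    return p1[:point] + p2[point:]

def mutate(chromosome):                       # flip-bit mutation
    return [1 - g if random.random() < MUTATION_RATE else g for g in chromosome]

# Initialization: random binary chromosomes.
population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):                  # termination after a fixed number of generations
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print("best fitness:", fitness(best), "chromosome:", best)
```
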
Advantages of Genetic Algorithms
o Genetic algorithms have strong parallel capabilities.
o They help in optimizing various problems such as discrete functions, multi-objective problems, and continuous functions.
o They provide a solution to a problem that improves over time.
o A genetic algorithm does not need derivative information.

Limitations of Genetic Algorithms


o Genetic algorithms are not efficient for solving simple problems.
o They do not guarantee the quality of the final solution to a problem.
o Repetitive calculation of fitness values may create computational challenges.

Difference between Genetic Algorithms and Traditional Algorithms
o A search space is the set of all possible solutions to a problem. A traditional algorithm maintains only one set of solutions, whereas a genetic algorithm can use several sets of solutions in the search space.
o Traditional algorithms need more information in order to perform a search, whereas genetic algorithms need only an objective function to calculate the fitness of an individual.
o Traditional algorithms cannot work in parallel, whereas genetic algorithms can (calculating the fitness of the individuals is independent).
o One big difference is that rather than operating directly on candidate solutions, genetic algorithms operate on their representations (or encodings), frequently referred to as chromosomes.
o In other words, one of the big differences between a traditional algorithm and a genetic algorithm is that the latter does not operate directly on candidate solutions.
o Traditional algorithms can only generate one result in the end, whereas genetic algorithms can generate multiple optimal results from different generations.
o A traditional algorithm is not necessarily likely to generate an optimal result; genetic algorithms do not guarantee a globally optimal result either, but there is a good chance of obtaining an optimal result for a problem because they use genetic operators such as crossover and mutation.
o Traditional algorithms are deterministic in nature, whereas genetic algorithms are probabilistic and stochastic in nature.
