Reinforcement Learning
1. Value-based:
The value-based approach aims to find the optimal value function, i.e., the maximum value achievable at a state under any policy. The agent then expects the long-term return of each state s under policy π.
2. Policy-based:
The policy-based approach tries to find the optimal policy for maximum future reward without using a value function. The agent applies a policy such that the action performed at each step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
o Deterministic: The same action is produced by the policy (π) at any
state.
o Stochastic: The policy defines a probability distribution over actions at each state, and the action taken is sampled from it.
3. Model-based: In the model-based approach, a virtual model is created for
the environment, and the agent explores that environment to learn it. There is
no particular solution or algorithm for this approach because the model
representation is different for each environment.
There are four main elements of reinforcement learning:
1. Policy
2. Reward Signal
3. Value Function
4. Model of the environment
1) Policy: A policy defines the way an agent behaves at a given time. It maps the perceived states of the environment to the actions to be taken in those states. The policy is the core element of RL, as it alone can define the behavior of the agent. In some cases it may be a simple function or a lookup table, whereas in other cases it may involve general computation such as a search process. The policy can be deterministic or stochastic.
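2) Reward Signal: The reward signal defines the goal of the reinforcement learning problem. At each step, the environment sends an immediate reward to the agent for the action it has taken. The agent's only objective is to maximize the total reward it collects over time, so the reward signal tells the agent which actions are good and which are bad.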
3) Value Function: The value function indicates how good a situation or action is and how much reward the agent can expect from it. Whereas the reward signal gives the immediate feedback for each good or bad action, the value function specifies which states and actions are good in the long run. The value function depends on the reward, because without reward there could be no value, and the purpose of estimating values is to obtain more reward.
4) Model: The last element of reinforcement learning is the model, which mimics the behavior of the environment. With the help of the model, one can make inferences about how the environment will behave. For example, given a state and an action, the model can predict the next state and the reward.
The model is used for planning: it provides a way to decide on a course of action by considering possible future situations before actually experiencing them. Approaches that solve RL problems with the help of a model are termed model-based approaches, whereas approaches that work without a model are called model-free approaches.
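As a rough illustration of this idea, the sketch below uses a hypothetical tabular model that simply records observed (state, action) → (next state, reward) transitions and replays them on request; the state names and rewards are assumptions borrowed from the maze example discussed next.

```python
# Minimal sketch of a tabular environment model (hypothetical example).
# The model stores observed transitions and can then "predict" the next
# state and reward for a (state, action) pair, which is what planning needs.

class TabularModel:
    def __init__(self):
        self.transitions = {}  # (state, action) -> (next_state, reward)

    def update(self, state, action, next_state, reward):
        # Record what the environment actually did.
        self.transitions[(state, action)] = (next_state, reward)

    def predict(self, state, action):
        # Return the remembered outcome, or None if never observed.
        return self.transitions.get((state, action))


model = TabularModel()
model.update("s3", "right", "s4", +1)  # reaching the diamond gives +1 (assumed move)
model.update("s2", "right", "s3", 0)   # ordinary move with no reward (assumed move)
print(model.predict("s3", "right"))    # -> ('s4', 1)
```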
In the above image, the agent starts at the very first block of the maze. The maze contains an S6 block, which is a wall, an S8 block, which is a fire pit, and an S4 block, which is a diamond. The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward. It can take four actions: move up, move down, move left, and move right.
The agent can take any path to reach the final point, but it needs to do so in the fewest possible steps. Suppose the agent follows the path S9-S5-S1-S2-S3; then it will get the +1 reward point.
The agent will try to remember the preceding steps it has taken to reach the final step. To memorize the steps, it assigns a value of 1 to each previous step. Consider the below step:
Now the agent has successfully stored the previous steps by assigning the value 1 to each previous block. But what will the agent do if it starts from a block that has blocks of value 1 on both sides? Consider the below diagram:
It will be a difficult situation for the agent to decide whether it should go up or down, as each block has the same value. So the above approach is not suitable for the agent to reach the destination. Hence, to solve the problem, we will use the Bellman equation, which is the main concept behind reinforcement learning.
V(s) = max [R(s,a) + γV(s')]
Where,
V(s) = value of the current state s (the maximum is taken over all possible actions a)
R(s,a) = reward obtained for taking action a in state s
V(s') = value of the next state
γ = Discount factor
In the above equation, we take the maximum over all possible actions because the agent always tries to find the optimal solution.
So now, using the Bellman equation, we will find the value of each state of the given environment. We will start from the block that is next to the target block.
V(s3) = max [R(s,a) + γV(s')]; here V(s') = 0 because there is no further state to move to, and R(s,a) = 1 for reaching the diamond, so V(s3) = 1.
V(s2) = max [R(s,a) + γV(s')]; here γ = 0.9 (say), V(s') = 1, and R(s,a) = 0 because there is no reward at this state, so V(s2) = 0.9.
V(s1) = max [R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.9, and R(s,a) = 0 because there is no reward at this state either, so V(s1) = 0.81.
V(s5) = max [R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.81, and R(s,a) = 0 because there is no reward at this state either, so V(s5) = 0.73.
V(s9) = max [R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.73, and R(s,a) = 0 because there is no reward at this state either, so V(s9) = 0.66.
Now the agent has three options for its move: if it moves to the blue box, it will feel a bump; if it moves to the fire pit, it will get the -1 reward. Since we are considering only positive rewards here, it will move upwards only. The values of the remaining blocks are calculated using the same formula. Consider the below image:
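As a minimal sketch of how these values could be computed programmatically, the snippet below repeatedly applies the Bellman update to a hand-coded transition table covering only the path S9-S5-S1-S2-S3-S4; the transition structure is an assumption made for illustration.

```python
# Minimal sketch: Bellman-style value computation for the maze path
# S9 -> S5 -> S1 -> S2 -> S3 -> S4 (diamond). Transitions and rewards
# are assumed for illustration; only this single path is modelled.

GAMMA = 0.9

# state -> list of (action, next_state, reward) choices along the path
transitions = {
    "s9": [("up", "s5", 0)],
    "s5": [("up", "s1", 0)],
    "s1": [("right", "s2", 0)],
    "s2": [("right", "s3", 0)],
    "s3": [("right", "s4", 1)],  # reaching the diamond gives +1
    "s4": [],                    # terminal state
}

V = {s: 0.0 for s in transitions}

# Sweep until the values stop changing (value iteration on this tiny chain).
for _ in range(10):
    for s, choices in transitions.items():
        if choices:
            V[s] = max(r + GAMMA * V[s2] for _, s2, r in choices)

print({s: round(v, 2) for s, v in V.items()})
# -> approx: s9=0.66, s5=0.73, s1=0.81, s2=0.9, s3=1.0, s4=0.0
```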
Types of Reinforcement learning
There are mainly two types of reinforcement learning, which are:
o Positive Reinforcement
o Negative Reinforcement
Positive Reinforcement:
Positive reinforcement means adding something to increase the tendency that the expected behavior will occur again; it has a positive impact on the agent's behavior.
This type of reinforcement can sustain changes for a long time, but too much positive reinforcement may lead to an overload of states, which can diminish the results.
Negative Reinforcement:
Negative reinforcement is the opposite of positive reinforcement: it increases the tendency that the specific behavior will occur again by removing or avoiding a negative condition.
It can be more effective than positive reinforcement depending on the situation and behavior, but it provides only enough reinforcement to meet the minimum required behavior.
A Markov state follows the Markov property, which says that the future is independent of the past and depends only on the present. RL works in fully observable environments, where the agent can observe the environment and act to reach the new state. The complete process is known as the Markov Decision Process (MDP), which is explained below.
MDP uses the Markov property, so to better understand MDP, we first need to learn about it.
Markov Property:
It says that "if the agent is present in the current state s1, performs an action a1, and moves to the state s2, then the state transition from s1 to s2 depends only on the current state and action; it does not depend on past actions, rewards, or states."
In other words, as per the Markov property, the current state transition does not depend on any past action or state. Hence, an MDP is an RL problem that satisfies the Markov property. For example, in a game of Chess, the players only need to focus on the current board state and do not need to remember past actions or states.
Finite MDP:
A finite MDP is one in which the sets of states, rewards, and actions are all finite. In RL, we consider only finite MDPs.
Markov Process:
A Markov process is a memoryless process with a sequence of random states S1, S2, ..., St that satisfies the Markov property. A Markov process is also known as a Markov chain, which is a tuple (S, P), where S is the set of states and P is the state transition probability function. These two components (S and P) define the dynamics of the system.
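As a small illustration of a Markov chain (S, P), the sketch below samples a state sequence from a hypothetical two-state chain; the state names and transition probabilities are assumptions chosen only for the example.

```python
import random

# Minimal sketch of a Markov chain (S, P): S is the set of states and
# P gives, for each state, a probability distribution over next states.
# The states and probabilities below are assumed for illustration.
S = ["sunny", "rainy"]
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def sample_next(state):
    # The next state depends only on the current state (Markov property).
    states, probs = zip(*P[state].items())
    return random.choices(states, weights=probs, k=1)[0]

state = "sunny"
sequence = [state]
for _ in range(9):
    state = sample_next(state)
    sequence.append(state)
print(sequence)
```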
o Q-Learning:
o Q-learning is an off-policy RL algorithm used for temporal difference learning. Temporal difference learning methods are ways of comparing temporally successive predictions.
o It learns the value function Q(s, a), which tells how good it is to take action "a" in a particular state "s".
o The below flowchart explains the working of Q-learning:
o State Action Reward State Action (SARSA):
o SARSA stands for State Action Reward State Action; it is an on-policy temporal difference learning method. The on-policy control method selects the action for each state while learning, using a specific policy.
o The goal of SARSA is to calculate Qπ(s, a) for the currently selected policy π and all state-action pairs (s, a).
o The main difference between the Q-learning and SARSA algorithms is that, unlike Q-learning, SARSA does not require the maximum reward of the next state to update the Q-value in the table (see the sketch after this list).
o In SARSA, the new action and reward are selected using the same policy that determined the original action.
o SARSA is named after the quintuple (s, a, r, s', a') that it uses, where:
s: original state
a: original action
r: reward observed while following the states
s' and a': new state-action pair.
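To make this difference concrete, here is a rough sketch of the two update rules; the dictionary-based Q-table, learning rate, and discount factor are illustrative assumptions, not a complete implementation.

```python
# Sketch of the Q-learning (off-policy) and SARSA (on-policy) updates.
# ALPHA (learning rate) and GAMMA (discount factor) are assumed values.
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9
ACTIONS = ["up", "down", "left", "right"]
Q = defaultdict(float)  # (state, action) -> Q-value, initialized to 0

def q_learning_update(s, a, r, s_next):
    # Off-policy: uses the maximum Q-value of the next state,
    # regardless of which action the agent will actually take there.
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: uses the Q-value of the action actually chosen by the
    # same policy in the next state (the quintuple s, a, r, s', a').
    Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s_next, a_next)] - Q[(s, a)])

# Hypothetical transitions from the maze example:
q_learning_update("s3", "right", 1, "s4")       # s3 --right--> s4, reward +1
sarsa_update("s2", "right", 0, "s3", "right")   # s2 --right--> s3, reward 0
print(dict(Q))
```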
Q-Learning Explanation:
In the above image, we can see that the agent has three value options, V(s1), V(s2), and V(s3). Since this is an MDP, the agent only cares about the current state and the future state. The agent can go in any direction (up, left, or right), so it needs to decide where to go for the optimal path. Here the agent will make a move on a probability basis and change its state. But if we want exact moves, we need to make some changes in terms of the Q-value. Consider the below image:
Q represents the quality of the actions at each state. So instead of using a value at each state, we will use a pair of state and action, i.e., Q(s, a). The Q-value specifies which action is more lucrative than the others, and according to the best Q-value, the agent takes its next move. The Bellman equation can be used for deriving the Q-value.
To perform any action, the agent will get a reward R(s, a), and it will also end up in a certain state, so the Q-value equation will be:
Q(s, a) = R(s, a) + γ max Q(s', a'), where the maximum is taken over the actions a' available in the next state s'.
The Q stands for quality in Q-learning, which means it specifies the quality of an
action taken by the agent.
Q-table:
A Q-table or matrix is created while performing Q-learning. The table is indexed by state-action pairs, i.e., [s, a], and all values are initialized to zero. After each action, the table is updated and the Q-values are stored in it.
The RL agent uses this Q-table as a reference table to select the best action based on the Q-values.
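The snippet below sketches such a Q-table for the maze states; the state list, zero initialization, and greedy lookup are illustrative assumptions rather than a full agent.

```python
import numpy as np

# Sketch of a Q-table for the maze: rows are states, columns are actions,
# all values initialized to zero. State/action names are assumed for illustration.
states = [f"s{i}" for i in range(1, 13)]   # s1 .. s12
actions = ["up", "down", "left", "right"]
q_table = np.zeros((len(states), len(actions)))

def best_action(state):
    # The agent uses the table as a reference: pick the action with the
    # highest stored Q-value for this state (ties default to the first action).
    row = states.index(state)
    return actions[int(np.argmax(q_table[row]))]

# After updates fill the table, lookups become meaningful; with all zeros
# the first action is returned.
print(best_action("s9"))
```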
Difference between Reinforcement Learning and Supervised Learning

Reinforcement Learning | Supervised Learning
RL works by interacting with the environment. | Supervised learning works on an existing dataset.
The RL algorithm works like the human brain works when making decisions. | Supervised learning works the way a human learns things under the supervision of a guide.
No previous training is provided to the learning agent. | Training is provided to the algorithm so that it can predict the output.
RL helps to take decisions sequentially. | In supervised learning, decisions are made when the input is given.
After calculating the fitness of every individual in the population, a selection process is used to determine which individuals in the population will get to reproduce and create the offspring that will form the next generation.
So, we can now define a genetic algorithm as a heuristic search algorithm for solving optimization problems. It is a subset of evolutionary algorithms, which are used in computing. A genetic algorithm uses the concepts of genetics and natural selection to solve optimization problems.
It basically involves five phases to solve the complex optimization problems, which
are given as below:
o Initialization
o Fitness Assignment
o Selection
o Reproduction
o Termination
1. Initialization
The process of a genetic algorithm starts by generating a set of individuals, called the population. Each individual is a solution to the given problem. An individual is characterized by a set of parameters called genes. Genes are joined into a string to form a chromosome, which represents the solution to the problem. One of the most popular techniques for initialization is to use random binary strings.
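A minimal sketch of this initialization step, assuming random binary strings and an illustrative population size and chromosome length:

```python
import random

# Sketch: initialize a population of random binary-string chromosomes.
# Population size and chromosome length are assumed for illustration.
POP_SIZE, CHROMOSOME_LENGTH = 6, 8

def random_chromosome(length):
    # Each gene is a random bit; the string of genes is one chromosome.
    return [random.randint(0, 1) for _ in range(length)]

population = [random_chromosome(CHROMOSOME_LENGTH) for _ in range(POP_SIZE)]
for individual in population:
    print(individual)
```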
2. Fitness Assignment
The fitness function is used to determine how fit an individual is, i.e., its ability to compete with other individuals. In every iteration, individuals are evaluated based on their fitness function. The fitness function assigns a fitness score to each individual, and this score determines the probability of being selected for reproduction. The higher the fitness score, the higher the chance of being selected for reproduction.
3. Selection
The selection phase involves selecting individuals for the reproduction of offspring. The selected individuals are arranged in pairs, and these individuals transfer their genes to the next generation.
4. Reproduction
After the selection process, the creation of a child occurs in the reproduction step. In
this step, the genetic algorithm uses two variation operators that are applied to the
parent population. The two operators involved in the reproduction phase are given
below:
o Crossover: A crossover point is chosen within the genes, and the genes of the parents are exchanged among themselves up to the crossover point. These newly generated offspring are added to the population. This process is also called recombination or crossover. The types of crossover available are:
o One-point crossover
o Two-point crossover
o Uniform crossover
o Inheritable Algorithms crossover
o Mutation
The mutation operator inserts random genes in the offspring (new child) to
maintain the diversity in the population. It can be done by flipping some bits
in the chromosomes.
Mutation helps in solving the issue of premature convergence and enhances
diversification. The below image shows the mutation process:
The types of mutation available are:
o Flip bit mutation
o Gaussian mutation
o Exchange/Swap mutation
5. Termination
After the reproduction phase, a stopping criterion is applied as the basis for termination. The algorithm terminates once the threshold fitness is reached, and it returns the best individual in the final population as the final solution.
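Putting the five phases together, here is a compact sketch of a genetic algorithm on the classic toy problem of maximizing the number of 1-bits in a binary string; the fitness function, tournament selection, one-point crossover, bit-flip mutation, and all numeric parameters are illustrative assumptions.

```python
import random

# Sketch of a genetic algorithm: initialization, fitness assignment,
# selection, reproduction (crossover + mutation), and termination.
# Toy problem (assumed): maximize the number of 1-bits in a binary string.
POP_SIZE, LENGTH, MUTATION_RATE, MAX_GENERATIONS = 20, 16, 0.05, 100

def fitness(chromosome):
    # Fitness score: count of 1-bits (higher is fitter for this toy problem).
    return sum(chromosome)

def select(population):
    # Simple tournament selection: the fitter of two random individuals wins.
    a, b = random.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    # One-point crossover: exchange genes after a random crossover point.
    point = random.randint(1, LENGTH - 1)
    return p1[:point] + p2[point:]

def mutate(chromosome):
    # Bit-flip mutation keeps diversity in the population.
    return [1 - g if random.random() < MUTATION_RATE else g for g in chromosome]

# Initialization: random binary strings.
population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]

for generation in range(MAX_GENERATIONS):
    best = max(population, key=fitness)
    if fitness(best) == LENGTH:        # termination: threshold fitness reached
        break
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)    # best solution in the final population
print("generations:", generation, "best:", best, "fitness:", fitness(best))
```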