Unit V Reinforcement Learning and Genetic Algorithm
Introduction
Reinforcement learning is an area of Machine Learning. It is about taking suitable
action to maximize reward in a particular situation. It is employed by various software
and machines to find the best possible behavior or path it should take in a specific
situation. Reinforcement learning differs from supervised learning: in supervised learning the training data comes with the answer key, so the model is trained on the correct answers themselves, whereas in reinforcement learning there is no answer key and the reinforcement agent decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its own experience.
Introduction (continued)
Reinforcement Learning (RL) is the science of decision making. It is about learning the
optimal behavior in an environment to obtain maximum reward. In RL, data is accumulated by the learning system itself through trial and error; it is not supplied as part of the input, as it would be in supervised or unsupervised machine learning.
Reinforcement learning uses algorithms that learn from outcomes and decide which action
to take next. After each action, the algorithm receives feedback that helps it determine
whether the choice it made was correct, neutral or incorrect. It is a good technique to use
for automated systems that have to make a lot of small decisions without human guidance.
Reinforcement learning is an autonomous, self-teaching system that essentially learns by
trial and error. It performs actions with the aim of maximizing rewards, or in other words, it
is learning by doing in order to achieve the best outcomes.
Example:
•Input: The input should be an initial state from which the model will start
•Output: There are many possible outputs as there are a variety of solutions to a
particular problem
•Training: The training is based upon the input; the model will return a state, and the user will decide to reward or punish the model based on its output.
•The model keeps learning continuously.
•The best solution is decided based on the maximum reward.
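A minimal sketch of this reward-driven training loop, assuming a hypothetical environment object with reset() and step() methods (the names are illustrative, not from any particular library):

def run_episode(env, policy, max_steps=100):
    """One episode: the agent acts, the environment returns a reward or punishment."""
    state = env.reset()                        # input: an initial state to start from
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                 # the agent decides what to do
        state, reward, done = env.step(action) # feedback: reward or punishment
        total_reward += reward                 # the best solution maximizes this sum
        if done:
            break
    return total_reward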
Difference between Reinforcement learning and Supervised learning:
In reinforcement learning, decisions are dependent, so we give labels to sequences of dependent decisions. In supervised learning, the decisions are independent of each other, so labels are given to each decision.
Reinforcement learning is typically applied in one of three settings:
1. A model of the environment is known, but an analytic solution is not available;
2. Only a simulation model of the environment is given (the subject of simulation-based optimization);
3. The only way to collect information about the environment is to interact with it.
Advantages and Disadvantages of Reinforcement
Learning
Reinforcement Learning :
Reinforcement Learning is a type of Machine Learning. It allows machines and software agents to automatically determine
the ideal behavior within a specific context, in order to maximize its performance. Simple reward feedback is required for the
agent to learn its behavior; this is known as the reinforcement signal.
There are many different algorithms that tackle this issue. As a matter of fact, Reinforcement Learning is defined by a
specific type of problem, and all its solutions are classed as Reinforcement Learning algorithms. In the problem, an agent is
supposed to decide the best action to select based on its current state. When this step is repeated, the problem is known as
a Markov Decision Process.
A Markov Decision Process (MDP) model contains:
•A set of possible world states S
•A set of possible actions A
•A real-valued reward function R(s)
•A transition model T(s, a, s') describing the effect of each action in each state
What is a Reward?
A Reward is a real-valued reward function. R(S) indicates the reward for simply being in the state S. R(S, a) indicates the reward for being in state S and taking an action 'a'. R(S, a, S') indicates the reward for being in state S, taking an action 'a' and ending up in state S'.
What is a Policy?
A Policy is a solution to the Markov Decision
Process. A policy is a mapping from S to a. It
indicates the action ‘a’ to be taken while in state
S.
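A small sketch of how a reward function and a policy can be represented in code, assuming the 3*4 grid world example that follows; the state coordinates and reward values are illustrative:

# States are (column, row) cells of the 3*4 grid world described next.
DIAMOND, FIRE = (4, 3), (4, 2)

def R(s, a=None, s_next=None):
    """Covers R(S), R(S, a) and R(S, a, S'): the extra arguments are optional."""
    target = s_next if s_next is not None else s
    if target == DIAMOND:
        return +1.0
    if target == FIRE:
        return -1.0
    return -0.04          # small per-step reward (illustrative value)

# A policy maps each state S to the action 'a' to take in that state.
policy = {
    (1, 1): "UP",
    (1, 2): "UP",
    (1, 3): "RIGHT",
    # ... one entry per remaining non-terminal state
}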
Let us take the example of a grid world:
An agent lives in the grid. This example is a 3*4 grid. The grid has a START state (grid no. 1,1). The purpose of the agent is to wander around the grid to finally reach the Blue Diamond (grid no. 4,3). Under all circumstances, the agent should avoid the Fire grid (orange color, grid no. 4,2). Also, grid no. 2,2 is a blocked grid; it acts as a wall, hence the agent cannot enter it.
The agent can take any one of these actions: UP, DOWN, LEFT, RIGHT.
Walls block the agent's path, i.e., if there is a wall in the direction the agent would have taken, the agent stays in the same place. So, for example, if the agent says LEFT in the START grid it would stay put in the START grid.
First Aim: To find the shortest sequence getting from START to the Diamond. Two such sequences can be found:
•RIGHT RIGHT UP UP RIGHT
•UP UP RIGHT RIGHT RIGHT
Let us take the second one (UP UP RIGHT RIGHT RIGHT) for the subsequent discussion.
The move is now noisy. 80% of the time the intended action works correctly. 20% of the time the action the agent takes causes it to move at right angles. For example, if the agent says UP, the probability of going UP is 0.8, whereas the probability of going LEFT is 0.1 and the probability of going RIGHT is 0.1 (since LEFT and RIGHT are at right angles to UP).
The agent receives a reward at each time step:
•A small reward each step (it can be negative, in which case it can also be termed a punishment; in the above example, entering the Fire grid has a reward of -1).
•Big rewards come at the end (good or bad).
•The goal is to maximize the sum of rewards.
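A minimal sketch of this grid world's noisy dynamics and rewards, assuming the 3*4 grid, the 0.8/0.1/0.1 action noise and the terminal rewards described above; the small per-step reward value is illustrative:

import random

ACTIONS = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
# Perpendicular ("right angle") slips for each intended action.
SLIPS = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
         "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}
BLOCKED, DIAMOND, FIRE = (2, 2), (4, 3), (4, 2)

def step(state, action):
    """Apply one noisy move: 0.8 intended, 0.1 each right-angle slip."""
    roll = random.random()
    if roll < 0.8:
        chosen = action
    elif roll < 0.9:
        chosen = SLIPS[action][0]
    else:
        chosen = SLIPS[action][1]
    dx, dy = ACTIONS[chosen]
    nxt = (state[0] + dx, state[1] + dy)
    # Walls and the blocked cell keep the agent in place.
    if nxt == BLOCKED or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        nxt = state
    if nxt == DIAMOND:
        return nxt, +1.0, True     # big good reward at the end
    if nxt == FIRE:
        return nxt, -1.0, True     # big bad reward at the end
    return nxt, -0.04, False       # small per-step reward (illustrative)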
Q-Learning
Q-learning is a kind of reinforcement learning algorithm that enables machines to
discover via trial and error the best behaviors to adopt in a given environment. The
quality value, also known as the Q-value or quality, is an estimate of the expected
reward for doing a certain action in a specific condition and is the “Q” in Q-learning.
Finding the course of action that maximizes the long-term benefit is the aim of Q-learning. The Q-learning algorithm starts with a table of Q-values for each state-action combination. These values are initially set at random or to zero. The agent then investigates the surroundings, taking actions and earning rewards.
A mathematical formula that considers the present Q-value, the reward received, and
the anticipated value of the following state-action combination is used to update the Q-
values based on these rewards.
As the agent continues to investigate the environment, the Q-values, which reflect the ideal action to perform in each state, converge to their optimal values. The agent can then make choices that optimize its long-term value, even in complicated contexts with a wide range of alternative behaviors.
Why do we need Q-Learning?
In the field of Machine Learning, machines may learn the best course of action in
challenging circumstances with the help of the potential technique known as Q-learning.
But why do we actually require Q-learning? There are various factors that make Q-
learning crucial:
First, Q-learning lets computers learn from new settings and adapt to them without explicit programming. In traditional programming, explicit instructions would need to be written for every circumstance the computer might encounter. Q-learning makes the computer more versatile and adaptive to new scenarios because it allows it to learn independently via trial and error.
Last but not least, Q-learning has the power to change a wide range of industries,
including manufacturing, transportation, and healthcare. Automating various operations
using Q-learning may boost productivity and cut costs by allowing robots to learn and
adapt on their own to make work more swift and seamless.
How Q-Learning Works?
Q-learning is a form of reinforcement learning algorithm that enables an agent to discover the best course of action by
maximizing a reward signal. Here’s how it functions:
•Q-values: The algorithm creates a table of Q-values, which indicate the anticipated reward for doing a certain action in
a specific condition. These Q-values are first chosen at random.
•State: The agent keeps track of the environment’s condition, which reveals details about the scenario as it stands.
•Action: Depending on the situation, the agent decides which action to take. This can be accomplished via an
exploration strategy that chooses a random action with some probability or a straightforward greedy policy that
chooses the action with the greatest Q-value for the current state.
•Reward: The agent receives a reward for the action it took in the current state.
•Update Q-value: Using the Bellman equation, the agent changes the Q-value for the current state-action pair.
According to this equation, the immediate reward received plus the discounted expected future reward, which is
calculated using the Q-values for the following state-action pairs, equals the expected Q-value for a state-action pair.
•Repeat: As the agent accumulates experience with the environment, it repeats processes 2 through 5, progressively
updating the Q-values. The objective is to discover the best course of action or the one that maximizes the predicted
cumulative benefit over time.
•Converge: The agent learns the best behaviors to perform in each state as it explores more of the environment,
causing the Q-values to converge to the ideal values.
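A minimal sketch of the Q-table and the exploration strategy described in the "Action" step above, assuming a small set of discrete actions; the epsilon value is illustrative:

import random
from collections import defaultdict

# Q-table: Q[(state, action)] -> expected reward, initialised to zero.
Q = defaultdict(float)
ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]

def choose_action(state, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon, otherwise be greedy."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)                      # exploration: random action
    return max(ACTIONS, key=lambda a: Q[(state, a)])       # greedy: highest Q-value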
Bellman Equation in Q-Learning
A key idea in reinforcement learning, including Q-learning, is the Bellman equation. Based on the rewards received and the anticipated Q-values for the subsequent state-action pairs, the Bellman equation is employed in Q-learning to update the Q-values for state-action pairs. The Q-learning update derived from the Bellman equation is:
Q(s, a) ← Q(s, a) + α [ R(s, a) + γ · max_a' Q(s', a') − Q(s, a) ]
where α is the learning rate, γ is the discount factor, s' is the next state, and max_a' Q(s', a') is the best Q-value attainable from that next state.
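A minimal code sketch of this update, assuming a dictionary-based Q-table and illustrative values for the learning rate and discount factor:

from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9     # learning rate and discount factor (illustrative values)
ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]
Q = defaultdict(float)      # Q-table initialised to zero

def q_update(s, a, r, s_next):
    """Bellman update: move Q(s, a) towards r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

# Example: one update after observing reward -0.04 for moving UP from (1, 1) to (1, 2).
q_update((1, 1), "UP", -0.04, (1, 2))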
Genetic Algorithm
The genetic algorithm works on the evolutionary generational cycle to generate high-quality solutions. These algorithms use different operations that either enhance or replace the population to give an improved, fitter solution.
It basically involves five phases to solve complex optimization problems, which are given below:
•Initialization
•Fitness Assignment
•Selection
•Reproduction
•Termination
1. Initialization
The process of a genetic algorithm starts by generating a set of individuals, called the population. Here each individual is a solution for the given problem. An individual is characterized by a set of parameters called genes. Genes are combined into a string to form a chromosome, which represents a solution to the problem. One of the most popular techniques for initialization is the use of random binary strings.
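A minimal sketch of this initialization step, assuming binary chromosomes; the population size and chromosome length are illustrative:

import random

def init_population(pop_size=20, chrom_len=10):
    """Create the initial population as random binary strings (chromosomes)."""
    return [[random.randint(0, 1) for _ in range(chrom_len)] for _ in range(pop_size)]

population = init_population()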
2. Fitness Assignment
The fitness function is used to determine how fit an individual is, i.e., the ability of an individual to compete with other individuals. In every iteration, individuals are evaluated based on their fitness function. The fitness function provides a fitness score to each individual. This score further determines the probability of being selected for reproduction. The higher the fitness score, the more chances of getting selected for reproduction.
3. Selection
The selection phase involves the selection of individuals for the reproduction of offspring. All the selected individuals are then arranged in pairs of two for reproduction. Then these individuals transfer their genes to the next generation.
There are three types of selection methods available, which are:
•Roulette wheel selection
•Tournament selection
•Rank-based selection
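A minimal sketch of fitness assignment and roulette wheel selection, assuming binary chromosomes and an illustrative "one-max" fitness (the count of 1-bits):

import random

def fitness(chromosome):
    """Illustrative fitness score: number of 1-bits (the 'one-max' problem)."""
    return sum(chromosome)

def roulette_wheel_select(population):
    """Pick one parent with probability proportional to its fitness score."""
    scores = [fitness(ind) for ind in population]
    total = sum(scores)
    if total == 0:                       # avoid division by zero early on
        return random.choice(population)
    pick = random.uniform(0, total)
    running = 0
    for ind, score in zip(population, scores):
        running += score
        if running >= pick:
            return ind
    return population[-1]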
4. Reproduction
After the selection process, the creation of a child occurs in the reproduction step. In this step, the genetic algorithm uses two variation operators that
are applied to the parent population. The two operators involved in the reproduction phase are given below:
•Crossover: Crossover plays the most significant role in the reproduction phase of the genetic algorithm. In this process, a crossover point is selected at random within the genes. Then the crossover operator swaps the genetic information of two parents from the current generation to produce a new individual representing the offspring. The genes of the parents are exchanged among themselves until the crossover point is reached. These newly generated offspring are added to the population. Types of crossover styles available:
•One-point crossover
•Two-point crossover
•Livery crossover
•Inheritable Algorithms crossover
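A minimal sketch of one-point and two-point crossover on binary chromosomes, with the cut points chosen at random:

import random

def one_point_crossover(parent1, parent2):
    """Swap genes after a random crossover point to produce two offspring."""
    point = random.randint(1, len(parent1) - 1)
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def two_point_crossover(parent1, parent2):
    """Swap the segment between two random cut points."""
    a, b = sorted(random.sample(range(1, len(parent1)), 2))
    child1 = parent1[:a] + parent2[a:b] + parent1[b:]
    child2 = parent2[:a] + parent1[a:b] + parent2[b:]
    return child1, child2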
Mutation
The mutation operator inserts random genes in the offspring (new child) to maintain the diversity in the
population. It can be done by flipping some bits in the chromosomes.
Mutation helps in solving the issue of premature convergence and enhances diversification.
Types of mutation styles available:
•Flip bit mutation
•Gaussian mutation
•Exchange/Swap mutation
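A minimal sketch of flip bit and exchange/swap mutation on binary chromosomes; the per-gene mutation rate is illustrative:

import random

def flip_bit_mutation(chromosome, rate=0.05):
    """Flip each bit independently with a small probability to maintain diversity."""
    return [1 - g if random.random() < rate else g for g in chromosome]

def swap_mutation(chromosome):
    """Exchange the values at two randomly chosen gene positions."""
    mutated = list(chromosome)
    i, j = random.sample(range(len(mutated)), 2)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated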
Termination
After the reproduction phase, a stopping criterion is applied as the basis for termination. The algorithm terminates once the threshold fitness solution is reached, and the best solution in the population is identified as the final solution.
General Workflow of a Simple Genetic Algorithm
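A minimal end-to-end sketch of this workflow, combining the phases above; the population size, number of generations, one-max fitness and tournament selection are illustrative assumptions:

import random

POP_SIZE, CHROM_LEN, GENERATIONS = 20, 10, 50
MUTATION_RATE = 0.05

def fitness(chrom):                      # illustrative one-max fitness
    return sum(chrom)

def select(population):                  # tournament selection of size 2
    a, b = random.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):                   # one-point crossover
    point = random.randint(1, CHROM_LEN - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mutate(chrom):                       # flip-bit mutation
    return [1 - g if random.random() < MUTATION_RATE else g for g in chrom]

# 1. Initialization: random binary chromosomes
population = [[random.randint(0, 1) for _ in range(CHROM_LEN)] for _ in range(POP_SIZE)]

for gen in range(GENERATIONS):
    # 2-3. Fitness assignment and selection, 4. Reproduction (crossover + mutation)
    next_gen = []
    while len(next_gen) < POP_SIZE:
        c1, c2 = crossover(select(population), select(population))
        next_gen += [mutate(c1), mutate(c2)]
    population = next_gen[:POP_SIZE]
    # 5. Termination: stop once the threshold fitness is reached
    if max(fitness(ind) for ind in population) == CHROM_LEN:
        break

best = max(population, key=fitness)
print("Best solution:", best, "fitness:", fitness(best))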
Advantages of Genetic Algorithm
•Genetic algorithms have excellent parallel capabilities.
•It helps in optimizing various problems such as discrete functions, multi-objective problems,
and continuous functions.
•It provides a solution for a problem that improves over time.
•A genetic algorithm does not need derivative information.
Limitations of Genetic Algorithms
•Genetic algorithms are not efficient algorithms for solving simple problems.
•They do not guarantee the quality of the final solution to a problem.
•Repeated calculation of fitness values may create computational challenges.
Difference between Genetic Algorithms and Traditional
Algorithms
•A search space is the set of all possible solutions to the problem. In the traditional algorithm, only one set of solutions
is maintained, whereas, in a genetic algorithm, several sets of solutions in search space can be used.
•Traditional algorithms need more information in order to perform a search, whereas genetic algorithms need only one
objective function to calculate the fitness of an individual.
•Traditional algorithms cannot work in parallel, whereas genetic algorithms can work in parallel (calculating the fitness of each individual is an independent computation).
•One big difference is that, rather than operating directly on candidate solutions, genetic algorithms operate on their representations (or encodings), often referred to as chromosomes.
•Traditional Algorithms can only generate one result in the end, whereas Genetic Algorithms can generate multiple
optimal results from different generations.
•Traditional algorithms are not especially likely to generate optimal results. Genetic algorithms do not guarantee a globally optimal result either, but there is a good chance of obtaining the optimal result for a problem because they use genetic operators such as Crossover and Mutation.
•Traditional algorithms are deterministic in nature, whereas Genetic algorithms are probabilistic and stochastic in
nature.