Reinforcement learning allows machines to learn optimal behavior through trial-and-error interactions with an environment. Three key elements in reinforcement learning problems are: states, actions, and rewards. Many algorithms have been developed to solve reinforcement learning problems modeled as Markov decision processes. Value iteration and policy iteration are two classic methods for solving MDPs using the Bellman equation to find the optimal value function and policy. Recent advances in deep reinforcement learning combine neural networks with reinforcement learning algorithms to achieve superhuman performance on complex problems.


MACHINE LEARNING

UNIT-4
Reinforcement Learning and Control
Reinforcement learning (RL) offers powerful algorithms to search for optimal controllers of
systems with nonlinear, possibly stochastic dynamics that are unknown or highly uncertain. This
review mainly covers artificial-intelligence approaches to RL, from the viewpoint of the control
engineer. We explain how approximate representations of the solution make RL feasible for
problems with continuous states and control actions. Stability is a central concern in control, and
we argue that while the control-theoretic RL subfield called adaptive dynamic programming is
dedicated to it, stability of RL largely remains an open question. We also cover in detail the case
where deep neural networks are used for approximation, leading to the field of deep RL, which
has shown great success in recent years. With the control practitioner in mind, we outline
opportunities and pitfalls of deep RL; and we close the survey with an outlook that – among
other things – points out some avenues for bridging the gap between control and artificial-
intelligence RL techniques.

MDPs (Markov Decision Processes)


Reinforcement Learning is a type of Machine Learning. It allows machines and software agents to
automatically determine the ideal behavior within a specific context in order to maximize their
performance. Simple reward feedback is required for the agent to learn its behavior; this is known as the
reinforcement signal.
There are many different algorithms that tackle this problem. In fact, Reinforcement Learning is
defined by a specific type of problem, and all of its solutions are classed as Reinforcement Learning
algorithms. In this problem, an agent must decide on the best action to take based on its current
state. When this step is repeated, the problem is known as a Markov Decision Process.
A Markov Decision Process (MDP) model contains:
• A set of possible world states S.
• A set of models (transition models).
• A set of possible actions A.
• A real-valued reward function R(s, a).
• A policy, which is the solution of the Markov Decision Process.
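
As a rough illustration, these components can be written down directly in Python. The following is a minimal sketch with made-up names; the states, actions and numbers are placeholders for illustration, not part of the original notes.

# A tiny MDP written out explicitly: states, actions, transition model,
# rewards and a (deterministic) policy. All values here are illustrative.
S = ["s0", "s1", "s2"]                 # set of states
A = ["left", "right"]                  # set of actions

# Transition model T(s, a) -> {s': P(s' | s, a)}
T = {
    ("s0", "right"): {"s1": 0.8, "s0": 0.2},
    ("s0", "left"):  {"s0": 1.0},
    ("s1", "right"): {"s2": 0.9, "s1": 0.1},
    ("s1", "left"):  {"s0": 1.0},
    ("s2", "right"): {"s2": 1.0},
    ("s2", "left"):  {"s1": 1.0},
}

# Reward function R(s, a)
R = {(s, a): 0.0 for s in S for a in A}
R[("s1", "right")] = 1.0               # reaching s2 is rewarded

# A policy maps each state to an action
policy = {"s0": "right", "s1": "right", "s2": "left"}
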
What is a State?
A State is a token that represents a situation the agent can be in; S is the set of all such states.

What is a Model?
A Model (sometimes called a Transition Model) gives an action’s effect in a state. In particular, T(S, a, S’)
defines a transition T where being in state S and taking action ‘a’ takes us to state S’ (S and S’ may be the
same). For stochastic actions (noisy, non-deterministic) we also define a probability P(S’|S, a), which
represents the probability of reaching state S’ if action ‘a’ is taken in state S. Note that the Markov property
states that the effect of an action taken in a state depends only on that state and not on the prior history.

What are Actions?
A is the set of all possible actions. A(s) defines the set of actions that can be taken in state S.

What is a Reward?
A Reward is a real-valued reward function. R(S) indicates the reward for simply being in state S.
R(S, a) indicates the reward for being in state S and taking action ‘a’. R(S, a, S’) indicates the reward
for being in state S, taking action ‘a’ and ending up in state S’.

What is a Policy?
A Policy is a solution to the Markov Decision Process. A policy is a mapping from S to A. It indicates the
action ‘a’ to be taken while in state S.
Let us take the example of a grid world:

An agent lives in the grid. The above example is a 3×4 grid. The grid has a START state (grid no 1,1). The
purpose of the agent is to wander around the grid and finally reach the Blue Diamond (grid no 4,3). Under
all circumstances, the agent should avoid the Fire grid (orange color, grid no 4,2). Also, grid no 2,2 is a
blocked grid; it acts like a wall, hence the agent cannot enter it.
The agent can take any one of these actions: UP, DOWN, LEFT, RIGHT.
Walls block the agent’s path, i.e., if there is a wall in the direction the agent would have taken, the agent
stays in the same place. So, for example, if the agent chooses LEFT in the START grid, it stays put in
the START grid.
First Aim: To find the shortest sequence getting from START to the Diamond. Two such sequences can
be found:

• RIGHT RIGHT UP UP RIGHT


• UP UP RIGHT RIGHT RIGHT
Let us take the second one (UP UP RIGHT RIGHT RIGHT) for the subsequent discussion.
The moves are now noisy. 80% of the time the intended action works correctly. 20% of the time the action
the agent takes causes it to move at right angles. For example, if the agent chooses UP, the probability of going
UP is 0.8, whereas the probability of going LEFT is 0.1 and the probability of going RIGHT is 0.1 (since
LEFT and RIGHT are at right angles to UP).
The agent receives rewards at each time step:
• A small reward at each step (this can be negative, in which case it can also be termed a punishment; in the above
example, entering the Fire grid has a reward of -1).
• Big rewards come at the end (good or bad).
• The goal is to maximize the sum of rewards.
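
To make the 80/10/10 noise model concrete, here is a small, hypothetical Python helper that returns the transition probabilities for an intended move in this grid world; the coordinate convention and wall handling are assumptions made for illustration.

# Noisy grid-world transitions: the intended move succeeds with probability 0.8,
# and the two perpendicular moves each happen with probability 0.1.
# Moves that hit the outer wall or the blocked cell (2, 2) leave the agent in place.
MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
PERPENDICULAR = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
                 "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}
BLOCKED, WIDTH, HEIGHT = (2, 2), 4, 3

def step(state, move):
    """Apply a single (deterministic) move, staying put if a wall is hit."""
    x, y = state
    dx, dy = MOVES[move]
    nxt = (x + dx, y + dy)
    if nxt == BLOCKED or not (1 <= nxt[0] <= WIDTH and 1 <= nxt[1] <= HEIGHT):
        return state
    return nxt

def transition_probs(state, action):
    """Return {next_state: probability} for the noisy action."""
    probs = {}
    for move, p in [(action, 0.8),
                    (PERPENDICULAR[action][0], 0.1),
                    (PERPENDICULAR[action][1], 0.1)]:
        nxt = step(state, move)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

# Example: from the START cell (1, 1), choosing UP goes up with probability 0.8,
# bumps into the left wall (stays put) with 0.1, and moves right with 0.1.
print(transition_probs((1, 1), "UP"))   # {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}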

Bellman equations
If you have read anything related to reinforcement learning, you must have encountered the Bellman equation
somewhere. The Bellman equation is the basic building block for solving reinforcement learning and is omnipresent in
RL. It helps us to solve MDPs; to solve an MDP means finding the optimal policy and value function.

The optimal value function V*(s) is the one that yields the maximum value.

The value of a given state equals the maximum over actions (the action which maximizes the value) of the reward
for taking that action in the given state, plus the discount factor multiplied by the next state’s value from the
Bellman equation.

Bellman equation for a deterministic environment:

V(s) = maxₐ ( R(s,a) + γ·V(s’) )

Let's understand this equation. V(s) is the value of being in a certain state. V(s’) is the value of being in
the next state that we end up in after taking action a. R(s, a) is the reward we get after taking action a
in state s. Since we can take different actions, we take the maximum, because our agent wants to be in the
optimal state. γ is the discount factor, as discussed earlier. This is the Bellman equation for a deterministic
environment. It is slightly different for a non-deterministic (stochastic) environment.

Bellman equation for a stochastic environment:

V(s) = maxₐ ( R(s,a) + γ · Σₛ’ P(s,a,s’)·V(s’) )

In a stochastic environment, when we take an action it is not certain that we will end up in a particular
next state; instead, there is a probability of ending up in each possible state. P(s, a, s’) is the probability of ending up in
state s’ from s by taking action a. The sum runs over all possible future states. For example, if by taking
an action we can end up in three states s₁, s₂ and s₃ from state s with probabilities of 0.2, 0.2 and 0.6, the
Bellman equation will be:
V(s) = maxₐ( R(s,a) + γ·(0.2·V(s₁) + 0.2·V(s₂) + 0.6·V(s₃)) )

We can solve the Bellman equation using a special technique called dynamic programming.
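
As a quick numeric check of the stochastic backup in the example above, the value for a single action can be computed in a few lines. The reward, discount factor and next-state values below are made-up numbers for illustration.

# One stochastic Bellman backup for a single action, matching
# V(s) = max_a ( R(s,a) + gamma * sum_s' P(s,a,s') * V(s') ).
gamma = 0.9
R_sa = 1.0                                  # hypothetical reward R(s, a)
P = [0.2, 0.2, 0.6]                         # P(s, a, s1), P(s, a, s2), P(s, a, s3)
V_next = [4.0, 2.0, 5.0]                    # hypothetical V(s1), V(s2), V(s3)

backup = R_sa + gamma * sum(p * v for p, v in zip(P, V_next))
print(backup)                               # 1.0 + 0.9 * (0.8 + 0.4 + 3.0) = 4.78
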
Value iteration and policy iteration
Both value iteration and policy iteration work around the Bellman equations, where we find the optimal
utility.

The Bellman equation is given as:

V*(s) = maxₐ ( R(s,a) + γ · Σₛ’ P(s,a,s’)·V*(s’) )

Now, both value iteration and policy iteration compute the same thing (all optimal values), i.e.,
they work with Bellman updates.

How, then, are they different?

1. Value iteration is simpler, but it is computationally heavy.
2. Policy iteration is more involved, but it is computationally cheaper than value iteration.
3. Value iteration includes: finding the optimal value function + one policy extraction.
There is no need to repeat the two, because once the value function is optimal, the
policy extracted from it is also optimal (i.e. converged).
4. Policy iteration includes: policy evaluation + policy improvement, and the two are
repeated iteratively until the policy converges.

Value Iteration :

We start with a random value function and then find a new (improved) value function in an
iterative process, until we reach the optimal value function.

Value iteration computation of the Bellman equation:

Vₖ₊₁(s) = maxₐ ( R(s,a) + γ · Σₛ’ P(s,a,s’)·Vₖ(s’) )


Problems with Value Iteration:

1. It’s slow – O(S^2·A) per iteration.
2. The “max” at each state rarely changes.
3. The policy often converges long before the values.
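
A minimal value iteration implementation is sketched below. It assumes the dictionary-style S, A, T, R representation used in the earlier MDP sketch and is illustrative rather than definitive.

def value_iteration(S, A, T, R, gamma=0.9, theta=1e-6):
    """Iterate V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s,a,s') * V(s') ]
    until the largest change falls below theta, then extract the greedy policy."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            q_values = [R[(s, a)] + gamma * sum(p * V[s2]
                        for s2, p in T[(s, a)].items()) for a in A]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # One final policy extraction once the values have converged.
    policy = {s: max(A, key=lambda a: R[(s, a)] + gamma *
                     sum(p * V[s2] for s2, p in T[(s, a)].items())) for s in S}
    return V, policy
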
Policy Iteration :

Policy iteration computation of the Bellman equation:

1. Evaluation: for the fixed current policy π, find the values with policy evaluation:
Vπ(s) = R(s,π(s)) + γ · Σₛ’ P(s,π(s),s’)·Vπ(s’)

2. Improvement: for fixed values, get a better policy using policy extraction (one-step
look-ahead):
π(s) = argmaxₐ ( R(s,a) + γ · Σₛ’ P(s,a,s’)·Vπ(s’) )

Now, to overcome the problems of value iteration, we start policy iteration with a fixed policy
(don’t worry about whether it is optimal or not).

Since we have fixed our policy, the max over actions used in value iteration is no longer needed,
so the complexity of policy evaluation is O(S^2) per iteration. NOTE: we have dropped the factor of A
(the number of actions) here.

Now, for policy iteration:

1. We perform policy evaluation: i.e. we calculate utilities for the fixed policy
(not optimal utilities!) until convergence.
2. We improve the policy: i.e. we update the policy using a one-step look-ahead with the resulting
converged (but not optimal!) utilities as future values.
3. Repeat steps 1 and 2 until the policy converges.
This convergence of policy iteration is typically much faster than that of value iteration.
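
The corresponding policy iteration loop, again assuming the same hypothetical S, A, T, R representation, might look like this:

def policy_iteration(S, A, T, R, gamma=0.9, theta=1e-6):
    """Alternate policy evaluation (fixed policy, no max over actions)
    and policy improvement (one-step look-ahead) until the policy is stable."""
    policy = {s: A[0] for s in S}          # arbitrary initial policy
    V = {s: 0.0 for s in S}
    while True:
        # 1. Policy evaluation: utilities of the *current* policy, until convergence.
        while True:
            delta = 0.0
            for s in S:
                a = policy[s]
                v = R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # 2. Policy improvement: greedy one-step look-ahead on the evaluated values.
        stable = True
        for s in S:
            best = max(A, key=lambda a: R[(s, a)] + gamma *
                       sum(p * V[s2] for s2, p in T[(s, a)].items()))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:                          # 3. Stop once the policy no longer changes.
            return V, policy
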
CONCLUSION :

In value iteration:

• Every iteration updates both the values and (implicitly) the policy
• We don’t track the policy, but taking the max over actions implicitly recomputes it
In policy iteration:

• We do several passes that update utilities with fixed policy (each pass is fast because
we consider only one action, not all of them)
• After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
• The new policy will be better (or we’re done)
Both are dynamic programming methods for solving MDPs.

Linear quadratic regulation (LQR)


The theory of optimal control is concerned with operating a dynamic system at minimum cost. The
case where the system dynamics are described by a set of linear differential equations and the cost
is described by a quadratic function is called the LQ problem. One of the main results in the theory is
that the solution is provided by the linear–quadratic regulator (LQR), a feedback controller whose
equations are given below. The LQR is an important part of the solution to the LQG (linear–
quadratic–Gaussian) problem. Like the LQR problem itself, the LQG problem is one of the most
fundamental problems in control theory.
The settings of a (regulating) controller governing either a machine or process (like an airplane or
chemical reactor) are found by using a mathematical algorithm that minimizes a cost function with
weighting factors supplied by a human (engineer). The cost function is often defined as a sum of the
deviations of key measurements, like altitude or process temperature, from their desired values. The
algorithm thus finds those controller settings that minimize undesired deviations. The magnitude of
the control action itself may also be included in the cost function.
The LQR algorithm reduces the amount of work done by the control systems engineer to optimize
the controller. However, the engineer still needs to specify the cost function parameters, and
compare the results with the specified design goals. Often this means that controller construction will
be an iterative process in which the engineer judges the "optimal" controllers produced through
simulation and then adjusts the parameters to produce a controller more consistent with design
goals.
The LQR algorithm is essentially an automated way of finding an appropriate state-feedback
controller. As such, it is not uncommon for control engineers to prefer alternative methods, like full
state feedback, also known as pole placement, in which there is a clearer relationship between
controller parameters and controller behavior. Difficulty in finding the right weighting factors limits the
application of LQR-based controller synthesis.
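
For a discrete-time system x_{k+1} = A·x_k + B·u_k with cost Σ (xᵀQx + uᵀRu), the LQR gain can be computed by solving the discrete algebraic Riccati equation. The sketch below uses SciPy and a made-up double-integrator model purely for illustration; the weighting matrices are arbitrary choices, not values from the text.

import numpy as np
from scipy.linalg import solve_discrete_are

# Hypothetical discrete-time double integrator (position, velocity), dt = 0.1.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.diag([1.0, 0.1])          # state weighting chosen by the engineer
R = np.array([[0.01]])           # control-effort weighting

# Solve the discrete algebraic Riccati equation and form the feedback gain
# K such that u = -K x minimizes the quadratic cost.
P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# Closed-loop simulation from an initial offset.
x = np.array([[1.0], [0.0]])
for _ in range(50):
    u = -K @ x
    x = A @ x + B @ u
print(K, x.ravel())              # gain and the (nearly regulated) final state
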

LQG (Linear Quadratic Gaussian control)


In control theory, the linear–quadratic–Gaussian (LQG) control problem is one of the most
fundamental optimal control problems. It concerns linear systems driven by additive white Gaussian
noise. The problem is to determine an output feedback law that is optimal in the sense of minimizing
the expected value of a quadratic cost criterion. Output measurements are assumed to be corrupted
by Gaussian noise and the initial state, likewise, is assumed to be a Gaussian random vector.
Under these assumptions an optimal control scheme within the class of linear control laws can be
derived by a completion-of-squares argument.[1] This control law, which is known as
the LQG controller, is unique and it is simply a combination of a Kalman filter (a linear–quadratic
state estimator (LQE)) together with a linear–quadratic regulator (LQR). The separation
principle states that the state estimator and the state feedback can be designed independently. LQG
control applies to both linear time-invariant systems as well as linear time-varying systems, and
constitutes a linear dynamic feedback control law that is easily computed and implemented: the LQG
controller itself is a dynamic system like the system it controls. Both systems have the same state
dimension.
A deeper statement of the separation principle is that the LQG controller is still optimal in a wider
class of possibly nonlinear controllers. That is, utilizing a nonlinear control scheme will not improve
the expected value of the cost functional. This version of the separation principle is a special case of
the separation principle of stochastic control which states that even when the process and output
noise sources are possibly non-Gaussian martingales, as long as the system dynamics are linear,
the optimal control separates into an optimal state estimator (which may no longer be a Kalman
filter) and an LQR regulator.
In the classical LQG setting, implementation of the LQG controller may be problematic when the
dimension of the system state is large. The reduced-order LQG problem (fixed-order LQG
problem) overcomes this by fixing a priori the number of states of the LQG controller. This problem is
more difficult to solve because it is no longer separable. Also, the solution is no longer unique.
Despite these facts numerical algorithms are available to solve the associated optimal projection
equations which constitute necessary and sufficient conditions for a locally optimal reduced-order
LQG controller.
LQG optimality does not automatically ensure good robustness properties. The robust stability of the
closed loop system must be checked separately after the LQG controller has been designed. To
promote robustness some of the system parameters may be assumed stochastic instead of
deterministic. The associated more difficult control problem leads to a similar optimal controller of
which only the controller parameters are different.
It is possible to compute the expected value of the cost function for the optimal gains, as well as any
other set of stable gains.
Finally, the LQG controller is also used to control perturbed non-linear systems.
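
Because the separation principle lets the estimator and the regulator be designed independently, a discrete-time LQG controller can be sketched as an LQR gain combined with a steady-state Kalman filter. The following is a rough illustration only; the matrices W and V are assumed process- and measurement-noise covariances, and the function names are hypothetical.

import numpy as np
from scipy.linalg import solve_discrete_are

def lqg_gains(A, B, C, Q, R, W, V):
    """Return (K, L): LQR state-feedback gain and steady-state Kalman gain."""
    # Regulator: u = -K x_hat
    P = solve_discrete_are(A, B, Q, R)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    # Estimator: steady-state Kalman filter gain (dual Riccati equation)
    S = solve_discrete_are(A.T, C.T, W, V)
    L = S @ C.T @ np.linalg.inv(C @ S @ C.T + V)
    return K, L

def lqg_step(x_hat, y, u_prev, A, B, C, K, L):
    """One step of the dynamic LQG controller: predict, correct, then feed back."""
    x_pred = A @ x_hat + B @ u_prev           # predict with the model
    x_hat = x_pred + L @ (y - C @ x_pred)     # correct with the measurement
    u = -K @ x_hat                            # LQR feedback on the state estimate
    return x_hat, u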

Q-learning
Q-learning is a model-free reinforcement learning algorithm to learn a policy telling an agent what
action to take under what circumstances. It does not require a model (hence the connotation "model-
free") of the environment, and it can handle problems with stochastic transitions and rewards,
without requiring adaptations.
For any finite Markov decision process (FMDP), Q-learning finds an optimal policy in the sense of
maximizing the expected value of the total reward over any and all successive steps, starting from
the current state. Q-learning can identify an optimal action-selection policy for any given FMDP,
given infinite exploration time and a partly-random policy.[1] "Q" names the function that returns the
reward used to provide the reinforcement and can be said to stand for the "quality" of an action taken
in a given state.
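
The tabular Q-learning update itself is short. Here is a hedged sketch assuming an (older) gym-style environment interface with discrete states and actions; the interface and hyperparameters are assumptions for illustration, not part of the original notes.

import random
from collections import defaultdict

def q_learning(env, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.
    Update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(lambda: [0.0] * env.action_space.n)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection (partly random policy)
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = max(range(env.action_space.n), key=lambda a: Q[state][a])
            next_state, reward, done, _ = env.step(action)
            # model-free update toward the bootstrapped target
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
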
Value function approximation
In general, a function approximation problem asks us to select a function among a well-defined
class that closely matches ("approximates") a target function in a task-specific way. The need
for function approximations arises in many branches of applied mathematics, and computer
science in particular.
One can distinguish two major classes of function approximation problems:
First, for known target functions approximation theory is the branch of numerical analysis that
investigates how certain known functions (for example, special functions) can be approximated by a
specific class of functions (for example, polynomials or rational functions) that often have desirable
properties (inexpensive computation, continuity, integral and limit values, etc.).
Second, the target function, call it g, may be unknown; instead of an explicit formula, only a set of
points of the form (x, g(x)) is provided. Depending on the structure of the domain and codomain of g,
several techniques for approximating g may be applicable. For example, if g is an operation on
the real numbers, techniques of interpolation, extrapolation, regression analysis, and curve
fitting can be used. If the codomain (range or target set) of g is a finite set, one is dealing with
a classification problem instead.
To some extent, the different problems (regression, classification, fitness approximation) have
received a unified treatment in statistical learning theory, where they are viewed as supervised
learning problems.
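
In the RL setting, the second (unknown-target) case corresponds to approximating the value function from sampled transitions. One common, simple choice is a linear approximator V(s) ≈ wᵀφ(s) trained by semi-gradient TD(0); the sketch below is illustrative, with a made-up feature map and toy data.

import numpy as np

def semi_gradient_td0(transitions, feature_fn, n_features, alpha=0.01, gamma=0.99):
    """Fit V(s) ~= w . phi(s) from sampled (s, r, s', done) transitions
    using the semi-gradient TD(0) update."""
    w = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        phi = feature_fn(s)
        v_next = 0.0 if done else w @ feature_fn(s_next)
        td_error = r + gamma * v_next - w @ phi
        w += alpha * td_error * phi          # move w toward the TD target
    return w

# Hypothetical usage with a 1-D state and polynomial features:
feature_fn = lambda s: np.array([1.0, s, s * s])
transitions = [(0.0, 0.0, 0.1, False), (0.1, 1.0, 0.2, True)]  # toy data
w = semi_gradient_td0(transitions, feature_fn, n_features=3)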

Policy Search
Reinforce
REINFORCE is a Monte-Carlo variant of policy gradients (Monte-Carlo: taking random
samples). The agent collects a trajectory τ of one episode using its current policy, and uses it to
update the policy parameters. Since one full trajectory must be completed before an update can be
made, REINFORCE is an on-policy method that updates only once per episode.

The flow of the REINFORCE algorithm is:


1. Perform a trajectory roll-out using the current policy
2. Store log probabilities (of policy) and reward values at each step
3. Calculate discounted cumulative future reward at each step
4. Compute policy gradient and update policy parameter
5. Repeat 1–4
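
To complement the listed steps, here is a hedged PyTorch-style sketch of one REINFORCE update. The environment interface (older gym-style step/reset), the policy network (assumed to output action probabilities) and the hyperparameters are assumptions for illustration.

import torch

def reinforce_update(policy_net, optimizer, env, gamma=0.99):
    """One REINFORCE iteration: roll out a trajectory with the current policy,
    compute discounted returns, and take a policy-gradient step."""
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:                                       # 1. trajectory roll-out
        probs = policy_net(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))           # 2. store log-probabilities
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)

    returns, g = [], 0.0
    for r in reversed(rewards):                           # 3. discounted future reward
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    loss = -(torch.stack(log_probs) * returns).sum()      # 4. policy-gradient loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                      # 5. repeat from the caller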

POMDPs (Partially Observable Markov Decision Processes)


A partially observable Markov decision process (POMDP) is a generalization of a Markov
decision process (MDP). A POMDP models an agent decision process in which it is assumed that
the system dynamics are determined by an MDP, but the agent cannot directly observe the
underlying state. Instead, it must maintain a probability distribution over the set of possible states,
based on a set of observations and observation probabilities, and the underlying MDP.
The POMDP framework is general enough to model a variety of real-world sequential decision
processes. Applications include robot navigation problems, machine maintenance, and planning
under uncertainty in general. The general framework of Markov decision processes with imperfect
information was described by Karl Johan Åström in 1965 in the case of a discrete state space, and it
was further studied in the operations research community where the acronym POMDP was coined. It
was later adapted for problems in artificial intelligence and automated planning by Leslie P.
Kaelbling and Michael L. Littman.
An exact solution to a POMDP yields the optimal action for each possible belief over the world
states. The optimal action maximizes (or minimizes) the expected reward (or cost) of the agent over
a possibly infinite horizon. The sequence of optimal actions is known as the optimal policy of the
agent for interacting with its environment.
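
The belief maintenance mentioned above follows a simple Bayesian update: after taking action a and observing o, the new belief is b’(s’) ∝ O(o | s’, a) · Σₛ T(s’ | s, a) · b(s). Below is a small illustrative sketch using dictionary-based model names that are hypothetical, not from the original text.

def belief_update(belief, action, observation, T, O):
    """Bayes filter over hidden states:
    b'(s') proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
    new_belief = {}
    for s_next in belief:
        predicted = sum(T[(s, action)].get(s_next, 0.0) * b for s, b in belief.items())
        new_belief[s_next] = O[(s_next, action)].get(observation, 0.0) * predicted
    total = sum(new_belief.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: p / total for s, p in new_belief.items()}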
