UNIT-4
Reinforcement Learning and Control
Reinforcement learning (RL) offers powerful algorithms to search for optimal controllers of
systems with nonlinear, possibly stochastic dynamics that are unknown or highly uncertain. This
review mainly covers artificial-intelligence approaches to RL, from the viewpoint of the control
engineer. We explain how approximate representations of the solution make RL feasible for
problems with continuous states and control actions. Stability is a central concern in control, and
we argue that while the control-theoretic RL subfield called adaptive dynamic programming is
dedicated to it, stability of RL largely remains an open question. We also cover in detail the case
where deep neural networks are used for approximation, leading to the field of deep RL, which
has shown great success in recent years. With the control practitioner in mind, we outline
opportunities and pitfalls of deep RL; and we close the survey with an outlook that – among
other things – points out some avenues for bridging the gap between control and artificial-
intelligence RL techniques.
What is a Model?
A Model (sometimes called a Transition Model) gives an action's effect in a state. In particular, T(S, a, S')
defines a transition T where being in state S and taking action 'a' takes us to state S' (S and S' may be the
same). For stochastic (noisy, non-deterministic) actions we also define a probability P(S'|S, a), which
represents the probability of reaching state S' if action 'a' is taken in state S. Note that the Markov property
states that the effect of an action taken in a state depends only on that state and not on the prior history.
What are Actions?
A is the set of all possible actions. A(s) defines the set of actions that can be taken in state S.
What is a Reward?
A Reward is a real-valued reward function. R(s) indicates the reward for simply being in the state S.
R(S,a) indicates the reward for being in a state S and taking an action ‘a’. R(S,a,S’) indicates the reward
for being in a state S, taking an action ‘a’ and ending up in a state S’.
What is a Policy?
A Policy is a solution to the Markov Decision Process. A policy is a mapping from states to actions: it
indicates the action 'a' to be taken while in state S.
Let us take the example of a grid world:
An agent lives in the grid. The example above is a 3*4 grid. The grid has a START state (grid no 1,1). The
purpose of the agent is to wander around the grid and finally reach the Blue Diamond (grid no 4,3). Under
all circumstances, the agent should avoid the Fire grid (orange color, grid no 4,2). Also, grid no 2,2 is a
blocked grid: it acts like a wall, hence the agent cannot enter it.
The agent can take any one of these actions: UP, DOWN, LEFT, RIGHT
Walls block the agent's path, i.e., if there is a wall in the direction the agent would have moved, the agent
stays in the same place. So, for example, if the agent says LEFT in the START grid, it would stay put in
the START grid.
First Aim: To find the shortest sequence getting from START to the Diamond. Two such sequences can
be found:
UP, UP, RIGHT, RIGHT, RIGHT and RIGHT, RIGHT, UP, UP, RIGHT
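To make these definitions concrete, here is a minimal Python sketch of the 3*4 grid world as an MDP. The layout (START, Diamond, Fire, blocked cell) follows the example above; the +1/-1 reward values and the helper names are assumptions made for the sketch only.

# Minimal sketch of the 3*4 grid world MDP; reward values (+1 Diamond, -1 Fire) are assumed.
ROWS, COLS = 3, 4
BLOCKED = (2, 2)                                # wall cell, written as (column, row)
TERMINALS = {(4, 3): +1.0,                      # Blue Diamond
             (4, 2): -1.0}                      # Fire
ACTIONS = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}

def states():
    """All grid cells the agent may occupy (the blocked cell is excluded)."""
    return [(c, r) for c in range(1, COLS + 1)
                   for r in range(1, ROWS + 1) if (c, r) != BLOCKED]

def transition(s, a):
    """Deterministic transition model T(S, a, S'): walls keep the agent in place."""
    c, r = s
    dc, dr = ACTIONS[a]
    s2 = (c + dc, r + dr)
    if s2 == BLOCKED or not (1 <= s2[0] <= COLS and 1 <= s2[1] <= ROWS):
        return s                                # bumped into a wall or the edge: stay put
    return s2

def reward(s, a, s2):
    """R(S, a, S'): +1 for reaching the Diamond, -1 for the Fire, 0 otherwise (assumed)."""
    return TERMINALS.get(s2, 0.0)

# Example: the agent says LEFT in the START grid (1, 1) and stays put.
print(transition((1, 1), "LEFT"))               # -> (1, 1)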
Bellman equations
If you have read anything related to reinforcement learning, you must have encountered the Bellman
equation somewhere. The Bellman equation is the basic building block for solving reinforcement learning
problems and is omnipresent in RL. It helps us solve an MDP; to solve means finding the optimal policy
and value functions.
The optimal value function V*(S) is the one that yields the maximum value.
The value of a given state is equal to the maximum over actions of the reward of that action in the given
state plus the discount factor multiplied by the value of the next state, which gives the Bellman equation:
V(s) = maxₐ( R(s,a) + γ*V(s') )
Let's understand this equation. V(s) is the value of being in a certain state, and V(s') is the value of the
next state we end up in after taking action a. R(s, a) is the reward we get after taking action a in state s.
Since we can take different actions, we use the maximum, because our agent wants to be in the optimal
state. γ is the discount factor, as discussed earlier. This is the Bellman equation for a deterministic
environment; it will be slightly different for a non-deterministic or stochastic environment.
In a stochastic environment, when we take an action it is not guaranteed that we will end up in a particular
next state; instead, there is a probability of ending up in each possible state. P(s, a, s') is the probability of
ending up in state s' from s by taking action a, and the Bellman equation sums over all possible next
states:
V(s) = maxₐ( R(s,a) + γ*Σₛ' P(s,a,s')*V(s') )
For example, if by taking an action we can end up in 3 states s₁, s₂ and s₃ from state s with probabilities
0.2, 0.2 and 0.6, the Bellman equation will be:
V(s) = maxₐ( R(s,a) + γ*(0.2*V(s₁) + 0.2*V(s₂) + 0.6*V(s₃)) )
We can solve the Bellman equation using a special technique called dynamic programming.
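As a quick numerical illustration of this backup, the short sketch below evaluates the stochastic Bellman equation for a single state with two actions; the rewards, the current values of the next states, and the discount factor are made-up numbers, and only the 0.2/0.2/0.6 probabilities come from the example above.

gamma = 0.9                                     # discount factor (assumed for illustration)
# Action a1 leads to s1, s2, s3 with probabilities 0.2, 0.2, 0.6 (as in the example above);
# action a2 is a made-up deterministic alternative.
P = {"a1": [(0.2, "s1"), (0.2, "s2"), (0.6, "s3")],
     "a2": [(1.0, "s1")]}
R = {"a1": 1.0, "a2": 0.5}                      # R(s, a), made-up rewards
V = {"s1": 2.0, "s2": 0.0, "s3": 5.0}           # current estimates of V(s')

# V(s) = max_a ( R(s,a) + gamma * sum_{s'} P(s,a,s') * V(s') )
values = {a: R[a] + gamma * sum(p * V[s2] for p, s2 in P[a]) for a in P}
print(values)                                   # a1 -> 1.0 + 0.9*3.4 = 4.06, a2 -> 2.3
print(max(values.values()))                     # the backed-up value V(s), about 4.06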
Value iteration and policy iteration
Both value iteration and policy iteration work around the Bellman equations to find the optimal utility.
They compute the same thing (all optimal values), i.e., they both work with Bellman updates.
Value Iteration :
We start with a random value function and then find a new (improved) value function in an iterative
process, until reaching the optimal value function. Each iteration sweeps over all states and applies the
Bellman update over all actions:
V(s) ← maxₐ( R(s,a) + γ*Σₛ' P(s,a,s')*V(s') )
Because every state considers every possible action, one sweep of value iteration costs O(S^2 A).
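As an illustration, here is a minimal Python sketch of the value-iteration loop. The mdp object, with states(), actions(s) and transitions(s, a) returning (probability, next_state, reward) triples, is a hypothetical interface assumed for the sketch, and gamma and theta are illustrative parameters.

def value_iteration(mdp, gamma=0.9, theta=1e-6):
    """Repeat Bellman updates over all states until the values stop changing (sketch)."""
    V = {s: 0.0 for s in mdp.states()}            # start from an arbitrary value function
    while True:
        delta = 0.0
        for s in mdp.states():
            if not mdp.actions(s):                # terminal state: nothing to update
                continue
            best = max(sum(p * (r + gamma * V[s2])
                           for p, s2, r in mdp.transitions(s, a))
                       for a in mdp.actions(s))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                         # (approximately) converged
            return V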
Policy Iteration :
To overcome the problems of value iteration, policy iteration starts with a fixed policy (don't worry
whether it is optimal or not). Since the policy is fixed, the maximization over all actions that value
iteration performs is no longer necessary during evaluation, i.e., we have relaxed A (the actions), and the
cost drops to O(S^2) per iteration. Policy iteration then alternates the following steps:
1. Policy evaluation: we calculate utilities for the fixed policy (not optimal utilities!) until
convergence.
2. Policy improvement: we update the policy using one-step look-ahead, with the resulting
converged (but not optimal!) utilities as future values.
3. Repeat steps 1 and 2 until the policy converges.
In practice, this convergence of policy iteration is typically much faster than value iteration.
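For comparison, here is a compact sketch of the policy-iteration loop, using the same hypothetical mdp interface as the value-iteration sketch above (states(), actions(s), and transitions(s, a) returning (probability, next_state, reward) triples); it is illustrative rather than a definitive implementation.

def policy_iteration(mdp, gamma=0.9, theta=1e-6):
    """Alternate policy evaluation and improvement until the policy is stable (sketch)."""
    pi = {s: next(iter(mdp.actions(s))) for s in mdp.states()}   # arbitrary starting policy
    V = {s: 0.0 for s in mdp.states()}
    while True:
        # 1. Policy evaluation: utilities of the fixed policy pi (no max over actions).
        while True:
            delta = 0.0
            for s in mdp.states():
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in mdp.transitions(s, pi[s]))
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # 2. Policy improvement: one-step look-ahead with the converged utilities.
        stable = True
        for s in mdp.states():
            best_a = max(mdp.actions(s),
                         key=lambda a: sum(p * (r + gamma * V[s2])
                                           for p, s2, r in mdp.transitions(s, a)))
            if best_a != pi[s]:
                pi[s], stable = best_a, False
        if stable:                                               # 3. repeat until convergence
            return pi, V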
CONCLUSION :
In value iteration:
• Every iteration updates both the values and (implicitly) the policy
• We don’t track the policy, but taking the max over actions implicitly recomputes it
In policy iteration:
• We do several passes that update utilities with fixed policy (each pass is fast because
we consider only one action, not all of them)
• After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
• The new policy will be better (or we’re done)
Both are dynamic programs for solving MDPs.
Q-learning
Q-learning is a model-free reinforcement learning algorithm to learn a policy telling an agent what
action to take under what circumstances. It does not require a model (hence the connotation "model-
free") of the environment, and it can handle problems with stochastic transitions and rewards,
without requiring adaptations.
For any finite Markov decision process (FMDP), Q-learning finds an optimal policy in the sense of
maximizing the expected value of the total reward over any and all successive steps, starting from
the current state. Q-learning can identify an optimal action-selection policy for any given FMDP,
given infinite exploration time and a partly-random policy.[1] "Q" names the function that returns the
reward used to provide the reinforcement and can be said to stand for the "quality" of an action taken
in a given state.
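In update form, Q-learning performs Q(s,a) ← Q(s,a) + α*( r + γ*maxₐ' Q(s',a') − Q(s,a) ) after every transition. Below is a minimal tabular sketch with epsilon-greedy exploration; the env object (reset() returning a state, step(action) returning (next_state, reward, done)) is an assumed gym-style interface, and all hyperparameters are illustrative.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""
    Q = defaultdict(float)                       # Q[(state, action)], initialised to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection: mostly exploit, sometimes explore
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            # model-free update: bootstrap from the best action in the next state
            if done:
                target = r                       # no future value from a terminal state
            else:
                target = r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q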
Value function approximation
In general, a function approximation problem asks us to select a function among a well-defined
class that closely matches ("approximates") a target function in a task-specific way. The need
for function approximations arises in many branches of applied mathematics, and computer
science in particular.
One can distinguish two major classes of function approximation problems:
First, for known target functions approximation theory is the branch of numerical analysis that
investigates how certain known functions (for example, special functions) can be approximated by a
specific class of functions (for example, polynomials or rational functions) that often have desirable
properties (inexpensive computation, continuity, integral and limit values, etc.).
Second, the target function, call it g, may be unknown; instead of an explicit formula, only a set of
points of the form (x, g(x)) is provided. Depending on the structure of the domain and codomain of g,
several techniques for approximating g may be applicable. For example, if g is an operation on
the real numbers, techniques of interpolation, extrapolation, regression analysis, and curve
fitting can be used. If the codomain (range or target set) of g is a finite set, one is dealing with
a classification problem instead.
To some extent, the different problems (regression, classification, fitness approximation) have
received a unified treatment in statistical learning theory, where they are viewed as supervised
learning problems.
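In reinforcement learning, value function approximation replaces the table of state values with a parametric function, for example a linear combination of state features fitted from sampled points (s, g(s)). The sketch below fits such a linear approximator by least squares with numpy; the feature map and the sample data are purely illustrative assumptions.

import numpy as np

def features(s):
    """Hypothetical feature map phi(s) for a one-dimensional state s."""
    return np.array([1.0, s, s ** 2])           # bias, linear and quadratic features

# Sampled points (s, g(s)) of an unknown target value function g (made-up data).
samples = [(0.0, 0.1), (1.0, 0.9), (2.0, 3.8), (3.0, 9.2)]

Phi = np.array([features(s) for s, _ in samples])
y = np.array([v for _, v in samples])

# Least-squares fit of weights w so that V_hat(s) = w . phi(s) approximates g(s).
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

def V_hat(s):
    return w @ features(s)

print(V_hat(1.5))                               # approximate value of an unseen state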
Policy Search
REINFORCE
REINFORCE is a Monte-Carlo variant of policy gradients (Monte-Carlo: taking random samples). The
agent collects a trajectory τ of one episode using its current policy and uses it to update the policy
parameters. Since one full trajectory must be completed before an update can be made, REINFORCE is
an episodic, on-policy method: the samples used for the update are generated by the very policy being
improved.
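The following numpy sketch shows one REINFORCE update for a linear softmax policy over discrete actions: collect a full episode with the current policy, compute the Monte-Carlo returns Gₜ, and move the parameters along Gₜ*∇ log π(aₜ|sₜ). The env interface, the feature map phi, and the step sizes are all assumptions made for illustration.

import numpy as np

def softmax_policy(theta, phi_s):
    """pi(a|s) for a linear softmax policy with preferences theta[a] . phi(s)."""
    prefs = theta @ phi_s
    prefs -= prefs.max()                         # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def reinforce_episode(env, phi, theta, alpha=0.01, gamma=0.99):
    """Collect one trajectory with the current policy, then apply the Monte-Carlo update (sketch)."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:                              # roll out one full episode (on-policy)
        probs = softmax_policy(theta, phi(s))
        a = np.random.choice(len(probs), p=probs)
        s2, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s2
    # Monte-Carlo returns G_t, computed backwards over the whole trajectory.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    # Policy-gradient update: theta += alpha * G_t * grad log pi(a_t | s_t).
    for s, a, G in zip(states, actions, returns):
        probs = softmax_policy(theta, phi(s))
        grad_log = -np.outer(probs, phi(s))      # -pi(b|s) * phi(s) for every action b ...
        grad_log[a] += phi(s)                    # ... plus phi(s) for the action actually taken
        theta += alpha * G * grad_log
    return theta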