
Department of Information Technology

Department of Computer Science


Artificial Intelligence (PE 511 IT)
V SEM

Faculty Name: MOHAMMED IRSHAD


UNIT-4
Markov Decision Process
MDP Formulation:
Reinforcement Learning is a type of Machine Learning. It
allows machines and software agents to automatically determine
the ideal behavior within a specific context, in order to maximize
their performance. Simple reward feedback is required for the agent
to learn its behavior; this is known as the reinforcement signal.
There are many different algorithms that tackle this issue. In fact,
Reinforcement Learning is defined by a specific type of problem, and
all its solutions are classed as Reinforcement Learning algorithms.
In this problem, an agent must decide the best action to select based
on its current state. When this step is repeated, the problem is known
as a Markov Decision Process.
Typical Reinforcement Learning cycle
Markov Decision Process
MDP Formulation:
Agent: Software programs that make intelligent decisions; they are the
learners in RL. These agents interact with the environment through
actions and receive rewards based on their actions.
Environment: The representation of the problem to be solved. It can be
a real-world environment or a simulated environment with which our
agent will interact.
State: The situation of the agent at a specific time step in the
environment. Whenever the agent performs an action, the environment
gives the agent a reward and a new state, which the agent reaches by
performing that action.
 Anything that the agent cannot change arbitrarily is considered to be
part of the environment.
 In simple terms, actions can be any decisions we want the agent to
learn, and a state can be anything that is useful in choosing actions.
 We do not assume that everything in the environment is unknown to the
agent; for example, reward calculation is considered to be part of the
environment even though the agent knows a fair amount about how its
reward is calculated as a function of its actions and the states in which
they are taken.
 This is because rewards cannot be arbitrarily changed by the agent.
Sometimes, the agent might be fully aware of its environment but still
find it difficult to maximize the reward. So, we can safely say that the
agent-environment relationship represents the limit of the agent's control
and not of its knowledge. A minimal sketch of this agent-environment
loop is given below.
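A minimal, hypothetical Python sketch of the agent-environment interaction cycle described above; the toy environment, its dynamics, and its reward are invented purely for illustration.

import random

class RandomWalkEnv:
    # A toy environment: states 0..4, actions -1/+1, reward +1 on reaching state 4.
    def __init__(self):
        self.state = 2

    def step(self, action):
        # The environment decides the next state and the reward;
        # the agent cannot change these rules arbitrarily.
        self.state = max(0, min(4, self.state + action))
        reward = 1 if self.state == 4 else 0
        done = self.state in (0, 4)
        return self.state, reward, done

env = RandomWalkEnv()
state, done, total_reward = env.state, False, 0
while not done:
    action = random.choice([-1, +1])         # agent selects an action
    state, reward, done = env.step(action)   # environment returns a reward and a new state
    total_reward += reward
print("return:", total_reward)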
The Markov Property
 Transition : Moving from one state to another is called
Transition.
 Transition Probability: The probability that the agent will
move from one state to another is called transition
probability.
 The Markov Property states that:
“The future is independent of the past given the present.”

Mathematically we can express this statement as:

P[ S[t+1] | S[t] ] = P[ S[t+1] | S[1], S[2], …, S[t] ]
The Markov Property
 S[t] denotes the current state of the agent and S[t+1]
denotes the next state.
 What this equation means is that the transition from state
S[t] to S[t+1] is entirely independent of the past. So,
the RHS of the equation means the same as the LHS if the
system has the Markov Property.
 Intuitively, this means that our current state already captures
the information of all the past states.
State Transition Probability
 As we now know about transition probability, we can define the state
transition probability as follows:
 For a Markov state s = S[t] and its successor state s' = S[t+1], the
state transition probability is given by

P[ss'] = P[ S[t+1] = s' | S[t] = s ]
State Transition Probability


We can arrange the state transition probabilities into a state transition
probability matrix P:

State Transition Probability Matrix

      | P[11]  P[12]  …  P[1n] |
P  =  | P[21]  P[22]  …  P[2n] |
      |   ⋮      ⋮          ⋮   |
      | P[n1]  P[n2]  …  P[nn] |

where P[ij] = P[ S[t+1] = j | S[t] = i ].

Each row of the matrix gives the probabilities of moving from one
original (starting) state to every successor state, so the sum of each
row is equal to 1.
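A minimal numerical sketch in Python with NumPy; the three states and the probabilities below are illustrative assumptions, not values from the slides.

import numpy as np

# Hypothetical 3-state chain (states 0, 1, 2).
# Entry P[i, j] = P[ S[t+1] = j | S[t] = i ].
P = np.array([
    [0.2, 0.6, 0.2],
    [0.1, 0.6, 0.3],
    [0.7, 0.1, 0.2],
])

# Each row is a probability distribution over successor states,
# so every row must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# One-step evolution of a state distribution: starting in state 0
# with certainty, the distribution after one transition is d0 @ P.
d0 = np.array([1.0, 0.0, 0.0])
print(d0 @ P)   # -> [0.2 0.6 0.2]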
Markov Process or Markov Chains
 A Markov Process is a memoryless random process, i.e. a
sequence of random states S[1], S[2], …, S[n] with the Markov
Property. So it is basically a sequence of states with the Markov
Property. It can be defined using a set of states (S) and a transition
probability matrix (P). The dynamics of the environment can be
fully defined using the states (S) and the transition probability
matrix (P).
Markov Process or Markov
Chains
But what does "random process" mean?

The edges of the chain diagram denote transition probabilities. From this
chain let us take some samples. Suppose that we are Sleeping; according
to the probability distribution there is a 0.6 chance that we will Run,
a 0.2 chance that we will Sleep more, and again a 0.2 chance that we
will eat Ice-cream.
Markov Process or Markov
Chains
 Similarly, we can think of other sequences that we can
sample from this chain.
 Some samples from the chain :
Sleep — Run — Ice-cream — Sleep
Sleep — Ice-cream — Ice-cream — Run
 In the above two sequences, what we see is that we get a random
set of states (e.g. Sleep, Ice-cream, Sleep) every time we run the
chain. It should now be clear why a Markov process is called a random
sequence of states. A small sampling sketch is given below.
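A minimal Python sketch of sampling from this chain. Only the probabilities out of Sleep (0.6 Run, 0.2 Sleep, 0.2 Ice-cream) come from the discussion above; the other rows are assumptions made purely so the example runs.

import random

P = {
    "Sleep":     {"Run": 0.6, "Sleep": 0.2, "Ice-cream": 0.2},  # from the slides
    "Run":       {"Run": 0.5, "Sleep": 0.3, "Ice-cream": 0.2},  # assumed
    "Ice-cream": {"Run": 0.2, "Sleep": 0.6, "Ice-cream": 0.2},  # assumed
}

def sample_chain(start, length):
    # Generate one random sequence of states from the chain.
    state, states = start, [start]
    for _ in range(length - 1):
        successors = list(P[state])
        weights = list(P[state].values())
        state = random.choices(successors, weights=weights)[0]
        states.append(state)
    return states

print(sample_chain("Sleep", 4))   # e.g. ['Sleep', 'Run', 'Ice-cream', 'Sleep']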
 Rewards are the numerical values that the agent receives on
performing some action at some state(s) in the environment.
The numerical value can be positive or negative based on the
actions of the agent.
 In Reinforcement Learning, we care about maximizing the
cumulative reward (all the rewards the agent receives from the
environment) rather than only the reward the agent receives from the
current state (also called the immediate reward). This total sum of
rewards the agent receives from the environment is
called the return.
 We can define the return as:

G[t] = R[t+1] + R[t+2] + R[t+3] + …

i.e. the sum of the rewards the agent collects from time step t onwards.
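A tiny Python sketch computing a return from a list of rewards; the reward values are made up, and the optional discount factor anticipates the γ used later in value iteration.

def compute_return(rewards, gamma=1.0):
    # rewards = [R[t+1], R[t+2], ...]; gamma = 1.0 gives the plain
    # cumulative sum above, gamma < 1 gives a discounted return.
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(compute_return([0, 0, 1, 5]))        # 6.0
print(compute_return([0, 0, 1, 5], 0.9))   # 4.455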
Markov Reward Process
 So far we have seen how a Markov chain defines the dynamics
of an environment using a set of states (S) and a transition
probability matrix (P). But we know that Reinforcement
Learning is all about the goal of maximizing the reward. So let us add
rewards to our Markov chain. This gives us a Markov Reward
Process.
 Markov Reward Process: As the name suggests, MRPs are
Markov chains with a value judgement. Basically, we get a
value (reward) from every state our agent is in.
 Mathematically, we define Markov Reward Process as :
Markov Reward Process
 Mathematically, we define the Markov Reward Process as:

R[s] = E[ R[t+1] | S[t] = s ]

 What this equation means is how much reward (R[s]) we expect to get
from a particular state S[t] = s.
 This tells us the immediate reward from that particular state our
agent is in. The goal is to maximize these rewards from each
state our agent is in. In simple terms, maximizing the
cumulative reward we get from each state.
Markov Decision Process
 A sequential decision problem for a fully observable, stochastic
environment with a Markovian transition model and additive
rewards is called a Markov decision process.
 It consists of a set of states (with an initial state s0);
a set ACTIONS(s) of actions in each state;
a transition model P(s' | s, a);
and a reward function R(s).
A policy is a solution to the Markov Decision Process. A policy is a
mapping from states to actions: it indicates the action ‘a’ to be taken while in
state ‘s’.
It is traditional to denote a policy by π, and π(s) is the action
recommended by the policy π for state s. A small data-structure sketch
of these components is given below.
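A minimal, hypothetical Python sketch that simply names the MDP components listed above as a data structure; the field names and the sample policy are illustrative only.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                                     # S
    actions: Callable[[str], List[str]]                   # ACTIONS(s)
    transition: Dict[Tuple[str, str], Dict[str, float]]   # P(s' | s, a)
    reward: Dict[str, float]                              # R(s)
    initial_state: str                                    # s0

# A policy is simply a mapping from states to actions.
policy: Dict[str, str] = {"(1,1)": "UP", "(1,2)": "UP"}   # illustrative only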
Markov Decision Process
Let us take the example of a grid world:
MDP: Grid World Example
 An agent lives in the grid. The above example is a 4×3 grid.
The grid has a START state (grid 1,1). The purpose of the
agent is to wander around the grid to finally reach the FINAL
state (grid 4,3). Under all circumstances, the agent should
avoid the Fire state (grid 4,2). Also, grid 2,2 is a blocked
grid; it acts like a wall, hence the agent cannot enter it.
 The agent can take any one of these actions: UP, DOWN,
LEFT, RIGHT
 Walls block the agent's path, i.e., if there is a wall in the direction
the agent would have moved, the agent stays in the same place.
So, for example, if the agent chooses LEFT in the START grid, it
stays put in the START grid.
MDP: Grid World Example
 First Aim: To find the shortest sequence getting from START to
the Diamond. Two such sequences can be found:
RIGHT RIGHT UP UP RIGHT
UP UP RIGHT RIGHT RIGHT
 Let us take the second one (UP UP RIGHT RIGHT RIGHT) for
the subsequent discussion.
 The moves are now noisy. 80% of the time the intended action
works correctly. 20% of the time the action the agent takes causes it
to move at right angles to the intended direction.
 For example, if the agent chooses UP, the probability of going UP is
0.8, whereas the probability of going LEFT is 0.1 and the probability
of going RIGHT is 0.1 (since LEFT and RIGHT are at right angles
to UP). A sketch of this noisy move model is given below.
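A small Python sketch of the noisy move model just described (0.8 intended, 0.1 to each side); the helper names are hypothetical.

import random

RIGHT_ANGLES = {
    "UP":    ["LEFT", "RIGHT"],
    "DOWN":  ["LEFT", "RIGHT"],
    "LEFT":  ["UP", "DOWN"],
    "RIGHT": ["UP", "DOWN"],
}

def noisy_action(intended):
    # Return the direction the agent actually moves in: the intended
    # direction with probability 0.8, or one of the two directions at
    # right angles with probability 0.1 each.
    side_a, side_b = RIGHT_ANGLES[intended]
    return random.choices([intended, side_a, side_b],
                          weights=[0.8, 0.1, 0.1])[0]

print(noisy_action("UP"))   # usually 'UP', occasionally 'LEFT' or 'RIGHT'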
MDP: Grid World Example
The agent receives a reward at each time step:
A small reward each step (it can be negative, in which case it can also
be termed a punishment; in the above example, entering the Fire state has
a reward of -1).
Big rewards come at the end (good or bad).
The goal is to maximize the sum of rewards.
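A minimal Python sketch of such a reward function for this grid world; the per-step reward of -0.04 is an assumed textbook-style value, not taken from the slides.

def R(state, step_reward=-0.04):
    # +1 for the goal (4,3), -1 for the Fire state (4,2),
    # and a small (negative) reward for every other state.
    if state == (4, 3):
        return +1.0
    if state == (4, 2):
        return -1.0
    return step_reward

episode = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)]
print(sum(R(s) for s in episode))   # 0.8 = five small penalties plus the big reward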
MDP: Implementing Optimal
Policy in Grid World Example
 Each time a given policy is executed starting from the
initial state, the stochastic nature of the environment may
lead to a different environment history.
 The quality of a policy is therefore measured by the
expected utility of the possible environment histories
generated by that policy.
 An optimal policy is a policy that yields the highest
expected utility. We use π∗ to denote an optimal policy.
Given π∗, the agent decides what to do by consulting its
current percept, which tells it the current state s, and then
executing the action π∗(s).
MDP: Implementing Optimal
Policy in Grid World Example
 The policy recommends taking the long way round, rather than
taking the shortcut and thereby risking entering (4,2).
 The balance of risk and reward changes depending on the value of
R(s) for the nonterminal states.
 Figure 17.2(b) shows optimal policies for four different ranges of
R(s).
 When R(s) ≤ −1.6284, life is so painful that the agent heads
straight for the nearest exit, even if the exit is worth –1.
 When −0.4278 ≤ R(s) ≤ −0.0850, life is quite unpleasant; the agent
takes the shortest route to the +1 state and is willing to risk falling
into the –1 state by accident. In particular, the agent takes the
shortcut from (3,1).
MDP: Implementing Optimal
Policy in Grid World Example
 When life is only slightly dreary (−0.0221 < R(s) < 0), the optimal
policy takes no risks at all. In (4,1) and (3,2), the agent heads
directly away from the –1 state so that it cannot fall in by accident,
even though this means banging its head against the wall quite a
few times.
 Finally, if R(s) > 0, then life is positively enjoyable and the agent
avoids both exits.
 The careful balancing of risk and reward is a characteristic of MDPs
that does not arise in deterministic search problems; moreover, it is
a characteristic of many real-world decision problems.
 For this reason, MDPs have been studied in several fields, including
AI, operations research, economics, and control theory.
Utility Theory and utility functions
 Decision theory, in its simplest form, deals with choosing among
actions based on the desirability of their immediate outcomes.
 The agent may not know the current state, so we define RESULT(a)
as a random variable whose values are the possible outcome
states. The probability of outcome s', given evidence
observations e, is written
P(RESULT(a) = s' | a, e)
where the a on the right-hand side of the conditioning bar
stands for the event that action a is executed.
 The agent’s preferences are captured by a utility function, U(s),
which assigns a single number to express the desirability of a
state.
 The expected utility of an action given the evidence, EU(a|e),
is just the average utility value of the outcomes, weighted by
the probability that the outcome occurs:
EU(a|e) = ∑s' P(RESULT(a) = s' | a, e) U(s')
The principle of maximum expected utility (MEU) says that a
rational agent should choose the action that maximizes the
agent’s expected utility:
action = argmax_a EU(a|e)
In a sense, the MEU principle could be seen as defining all of AI.
All an intelligent agent has to do is calculate the various
quantities, maximize utility over its actions, and away it goes.
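A minimal Python sketch of the MEU calculation: the actions, outcome probabilities, and utilities below are invented numbers used only to show the argmax over expected utilities.

outcome_probs = {
    # P(RESULT(a) = s' | a, e) for each action a (assumed values)
    "go_left":  {"safe": 0.9, "crash": 0.1},
    "go_right": {"safe": 0.6, "crash": 0.4},
}
utility = {"safe": 10.0, "crash": -100.0}   # U(s'), assumed values

def expected_utility(action):
    return sum(p * utility[s] for s, p in outcome_probs[action].items())

best_action = max(outcome_probs, key=expected_utility)
print(best_action, expected_utility(best_action))   # go_left -1.0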
Basis of Utility Theory
 Intuitively, the principle of Maximum Expected Utility
(MEU) seems like a reasonable way to make decisions, but it
is by no means obvious that it is the only rational way.

 Why should maximizing the average utility be so special?


 What’s wrong with an agent that maximizes the weighted
sum of the cubes of the possible utilities, or tries to minimize
the worst possible loss?
 Could an agent act rationally just by expressing preferences
between states, without giving them numeric values?
 Finally, why should a utility function with the required
properties exist at all?
Constraints on rational preferences
 These questions can be answered by writing down some constraints
on the preferences that a rational agent should have and then
showing that the MEU principle can be derived from the constraints
A ≻ B    the agent prefers A over B.
A ∼ B    the agent is indifferent between A and B.
A ≿ B    the agent prefers A over B or is indifferent between them.
We can think of the set of outcomes for each action as a lottery—think
of each action as a ticket. A lottery L with possible outcomes
S1,...,Sn that occur with probabilities p1,...,pn is written
L = [p1, S1; p2, S2; ... pn, Sn] .
Constraints on rational preferences
 In general, each outcome Si of a lottery can be either an atomic
state or another lottery. The primary issue for utility theory is to
understand how preferences between complex lotteries are
related to preferences between the underlying states in those
lotteries.
To address this issue we list six constraints that we require any
reasonable preference relation to obey:
 Orderability: Given any two lotteries, a rational agent must
either prefer one to the other or else rate the two as equally
preferable. That is, the agent cannot avoid deciding.
Exactly one of (A ≻ B), (B ≻ A), or (A ∼ B) holds.
Constraints on rational preferences
 Transitivity: Given any three lotteries, if an agent prefers A to B and
prefers B to C, then the agent must prefer A to C.
(A ≻ B) ∧ (B ≻ C) ⇒ (A ≻ C)
 Continuity: If some lottery B is between A and C in preference, then
there is some probability p for which the rational agent will be indifferent
between getting B for sure and the lottery that yields A with probability p
and C with probability 1 − p.
A ≻ B ≻ C ⇒ ∃ p [p, A; 1 − p, C] ∼ B .
 Substitutability: If an agent is indifferent between two lotteries A and B,
then the agent is indifferent between two more complex lotteries that are
the same except that B is substituted for A in one of them. This holds
regardless of the probabilities and the other outcome(s) in the lotteries.
A ∼ B ⇒ [p, A; 1 − p, C] ∼ [p, B; 1 − p, C] .
This also holds if we substitute ≻ for ∼ in this axiom.
Constraints on rational preferences
 Monotonicity: Suppose two lotteries have the same two possible
outcomes, A and B. If an agent prefers A to B, then the agent must
prefer the lottery that has a higher probability for A (and vice versa).
A ≻ B ⇒ (p > q ⇔ [p, A; 1 − p, B] ≻ [q, A; 1 − q, B])

 Decomposability: Compound lotteries can be reduced to simpler
ones using the laws of probability. This has been called the “no fun
in gambling” rule because it says that two consecutive lotteries can
be compressed into a single equivalent lottery, as shown in the sketch
below.
[p, A; 1 − p, [q, B; 1 − q, C]] ∼ [p, A; (1 − p)q, B; (1 − p)(1 − q), C] .
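A small Python sketch of this flattening: the probabilities along each branch of a compound lottery are multiplied out to produce an equivalent simple lottery (the example lottery is hypothetical).

def flatten(lottery):
    # A lottery is a list of (probability, outcome) pairs; an outcome may
    # itself be a lottery (a list), in which case it is expanded recursively.
    simple = []
    for p, outcome in lottery:
        if isinstance(outcome, list):                  # nested lottery
            for q, sub_outcome in flatten(outcome):
                simple.append((p * q, sub_outcome))
        else:
            simple.append((p, outcome))
    return simple

compound = [(0.5, "A"), (0.5, [(0.4, "B"), (0.6, "C")])]
print(flatten(compound))   # [(0.5, 'A'), (0.2, 'B'), (0.3, 'C')]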
Expected Utilities
Value Iteration
Calculation of the Bellman Equation:

V(s) = max_a [ R(s, a) + γ ∑s' T(s, a, s') V(s') ]
Value Iteration
 In this algorithm, the optimal policy (i.e., Optimal action
for a given state) is obtained by choosing the action that
maximizes the optimal state value function for the given
state.
 In value iteration, we start with a random value function
and then find a new (improved) value function in an iterative
process, until we reach the optimal value function; we then
derive the optimal policy from that optimal value function.
 Since we find the optimal state value function using an
iterative algorithm, it is called value iteration.
 Value iteration is a method of computing an optimal MDP
policy and its value.
Value Iteration
 Algorithm:

Purpose: This algorithm computes an optimal Markov Decision Process policy and its value.
Step 1: [Initialize the value function with zeros or random values for all states]
    set V(s) to zero for all states s.
Step 2: [Find a new (improved) value function in an iterative process until reaching the
optimal value function]
    repeat
        for all s ∈ S
            for all a ∈ A
                Q(s, a) = R(s, a) + γ ∑s' ∈ S T(s, a, s') V(s')
            V(s) = max_a Q(s, a)
    until V(s) converges
Step 3: [Calculate the optimal policy from the optimal value function]
    for all s ∈ S
        π(s) = argmax_a [ R(s, a) + γ ∑s' ∈ S T(s, a, s') V(s') ]

(Here γ is the discount factor and T(s, a, s') is the probability of reaching s' from s by taking action a.)
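A compact Python sketch of this algorithm. The MDP here is given as plain dictionaries, and the tiny 2-state example at the bottom is hypothetical, included only to exercise the code.

def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
    # T[(s, a)] is a dict {s2: probability of reaching s2 from s via a};
    # R[(s, a)] is the immediate reward for taking a in s.
    V = {s: 0.0 for s in states}                                   # Step 1
    while True:                                                    # Step 2
        delta = 0.0
        for s in states:
            q = [R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                 for a in actions]
            new_v = max(q)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:          # stop when V has (approximately) converged
            break
    policy = {}                                                    # Step 3
    for s in states:
        policy[s] = max(actions,
                        key=lambda a: R[(s, a)] + gamma *
                        sum(p * V[s2] for s2, p in T[(s, a)].items()))
    return V, policy

# Hypothetical 2-state, 2-action MDP.
states, actions = ["s0", "s1"], ["stay", "go"]
T = {("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s0": 1.0}}
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "go"): 0.0}
V, policy = value_iteration(states, actions, T, R)
print(policy)   # {'s0': 'go', 's1': 'stay'}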
Value Iteration
 Value iteration algorithm keeps improving the value
function at each iteration until the value function
converges.
 Value iteration computes the optimal state value function
by iteratively improving the estimate of V(s) using the Bellman
equation.
 The algorithm initializes V(s) to arbitrary random values or
to zeros.
 It repeatedly updates the V(s) values until they converge.
 Value Iteration is guaranteed to converge to the optimal
values.
Value Iteration
 Value iteration algorithm keeps improving the value
function at each iteration until the value function
converges.
 But as we know, the main goal of an agent is to find an optimal
policy.
 With the value iteration algorithm, the optimal policy often
converges before the value function does, so value iteration may take
more iterations than necessary to find the optimal policy.
 So we can use another dynamic programming method, the Policy
Iteration method, to find the optimal policy.
