
Department of Information Technology

Department of Computer Science


Artificial Intelligence (PE 511 IT)
V SEM

Faculty Name: MOHAMMED IRSHAD


UNIT-4
Markov Decision Process
MDP Formulation:
Reinforcement Learning is a type of Machine Learning. It
allows machines and software agents to automatically determine
the ideal behavior within a specific context, in order to maximize
their performance. Simple reward feedback is required for the agent
to learn its behavior; this is known as the reinforcement signal.
There are many different algorithms that tackle this issue. In fact,
Reinforcement Learning is defined by a specific type of problem, and
all its solutions are classed as Reinforcement Learning algorithms.
In this problem, an agent must decide the best action to select based
on its current state. When this step is repeated, the problem is known
as a Markov Decision Process.
Typical Reinforcement Learning cycle
Markov Decision Process
MDP Formulation:
Agent: Software programs that make intelligent decisions; they are the
learners in RL. These agents interact with the environment through
actions and receive rewards based on their actions.
Environment: The representation of the problem to be solved. It can be
a real-world environment or a simulated environment with which our
agent will interact.
State: The situation of the agent at a specific time step in the
environment. Whenever the agent performs an action, the environment
gives the agent a reward and a new state, which the agent reaches by
performing that action.
 Anything that the agent cannot change arbitrarily is considered to be
part of the environment.
 In simple terms, actions can be any decisions we want the agent to
learn, and a state can be anything that is useful in choosing actions.
 We do not assume that everything in the environment is unknown to the
agent; for example, reward calculation is considered to be part of the
environment even though the agent knows a fair amount about how its
reward is calculated as a function of its actions and the states in which
they are taken.
 This is because rewards cannot be arbitrarily changed by the agent.
Sometimes, the agent might be fully aware of its environment but still
find it difficult to maximize the reward. So, we can safely say that the
agent-environment relationship represents the limit of the agent's control
and not of its knowledge. A minimal sketch of this agent-environment
loop is given below.
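A minimal, hypothetical Python sketch of the agent-environment interaction cycle described above; the toy environment, its dynamics, and its reward are invented purely for illustration.

import random

class RandomWalkEnv:
    # A toy environment: states 0..4, actions -1/+1, reward +1 on reaching state 4.
    def __init__(self):
        self.state = 2

    def step(self, action):
        # The environment decides the next state and the reward;
        # the agent cannot change these rules arbitrarily.
        self.state = max(0, min(4, self.state + action))
        reward = 1 if self.state == 4 else 0
        done = self.state in (0, 4)
        return self.state, reward, done

env = RandomWalkEnv()
state, done, total_reward = env.state, False, 0
while not done:
    action = random.choice([-1, +1])         # agent selects an action
    state, reward, done = env.step(action)   # environment returns a reward and a new state
    total_reward += reward
print("return:", total_reward)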
The Markov Property
 Transition : Moving from one state to another is called
Transition.
 Transition Probability: The probability that the agent will
move from one state to another is called transition
probability.
 The Markov Property states that:
“The future is independent of the past given the present.”

Mathematically we can express this statement as:

P[ S[t+1] | S[t] ] = P[ S[t+1] | S[1], S[2], …, S[t] ]
The Markov Property
 S[t] denotes the current state of the agent and S[t+1]
denotes the next state.
 What this equation means is that the transition from state
S[t] to S[t+1] is entirely independent of the past. So,
the RHS of the equation means the same as the LHS if the
system has the Markov Property.
 Intuitively, this means that our current state already captures
the information of all the past states.
State Transition Probability
 As we now know about transition probability, we can define the state
transition probability as follows:
 For a Markov state s = S[t] and its successor state s' = S[t+1], the
state transition probability is given by

P[ss'] = P[ S[t+1] = s' | S[t] = s ]
State Transition Probability


We can arrange the state transition probabilities into a state transition
probability matrix P:

State Transition Probability Matrix

      | P[11]  P[12]  …  P[1n] |
P  =  | P[21]  P[22]  …  P[2n] |
      |   ⋮      ⋮          ⋮   |
      | P[n1]  P[n2]  …  P[nn] |

where P[ij] = P[ S[t+1] = j | S[t] = i ].

Each row of the matrix gives the probabilities of moving from one
original (starting) state to every successor state, so the sum of each
row is equal to 1.
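A minimal numerical sketch in Python with NumPy; the three states and the probabilities below are illustrative assumptions, not values from the slides.

import numpy as np

# Hypothetical 3-state chain (states 0, 1, 2).
# Entry P[i, j] = P[ S[t+1] = j | S[t] = i ].
P = np.array([
    [0.2, 0.6, 0.2],
    [0.1, 0.6, 0.3],
    [0.7, 0.1, 0.2],
])

# Each row is a probability distribution over successor states,
# so every row must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# One-step evolution of a state distribution: starting in state 0
# with certainty, the distribution after one transition is d0 @ P.
d0 = np.array([1.0, 0.0, 0.0])
print(d0 @ P)   # -> [0.2 0.6 0.2]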
Markov Process or Markov Chains
 A Markov Process is a memoryless random process, i.e. a
sequence of random states S[1], S[2], …, S[n] with the Markov
Property. So it is basically a sequence of states with the Markov
Property. It can be defined using a set of states (S) and a transition
probability matrix (P). The dynamics of the environment can be
fully defined using the states (S) and the transition probability
matrix (P).
Markov Process or Markov
Chains
But what does "random process" mean?

The edges of the chain diagram denote transition probabilities. From this
chain let us take some samples. Suppose that we are Sleeping; according
to the probability distribution there is a 0.6 chance that we will Run,
a 0.2 chance that we will Sleep more, and again a 0.2 chance that we
will eat Ice-cream.
Markov Process or Markov
Chains
 Similarly, we can think of other sequences that we can
sample from this chain.
 Some samples from the chain :
Sleep — Run — Ice-cream — Sleep
Sleep — Ice-cream — Ice-cream — Run
 In the above two sequences, what we see is that we get a random
set of states (e.g. Sleep, Ice-cream, Sleep) every time we run the
chain. It should now be clear why a Markov process is called a random
sequence of states. A small sampling sketch is given below.
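A minimal Python sketch of sampling from this chain. Only the probabilities out of Sleep (0.6 Run, 0.2 Sleep, 0.2 Ice-cream) come from the discussion above; the other rows are assumptions made purely so the example runs.

import random

P = {
    "Sleep":     {"Run": 0.6, "Sleep": 0.2, "Ice-cream": 0.2},  # from the slides
    "Run":       {"Run": 0.5, "Sleep": 0.3, "Ice-cream": 0.2},  # assumed
    "Ice-cream": {"Run": 0.2, "Sleep": 0.6, "Ice-cream": 0.2},  # assumed
}

def sample_chain(start, length):
    # Generate one random sequence of states from the chain.
    state, states = start, [start]
    for _ in range(length - 1):
        successors = list(P[state])
        weights = list(P[state].values())
        state = random.choices(successors, weights=weights)[0]
        states.append(state)
    return states

print(sample_chain("Sleep", 4))   # e.g. ['Sleep', 'Run', 'Ice-cream', 'Sleep']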
 Rewards are the numerical values that the agent receives on
performing some action at some state(s) in the environment.
The numerical value can be positive or negative based on the
actions of the agent.
 In Reinforcement Learning, we care about maximizing the
cumulative reward (all the rewards the agent receives from the
environment) rather than only the reward the agent receives from the
current state (also called the immediate reward). This total sum of
rewards the agent receives from the environment is
called the return.
 We can define the return as:

G[t] = R[t+1] + R[t+2] + R[t+3] + …

i.e. the sum of the rewards the agent collects from time step t onwards.
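A tiny Python sketch computing a return from a list of rewards; the reward values are made up, and the optional discount factor anticipates the γ used later in value iteration.

def compute_return(rewards, gamma=1.0):
    # rewards = [R[t+1], R[t+2], ...]; gamma = 1.0 gives the plain
    # cumulative sum above, gamma < 1 gives a discounted return.
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(compute_return([0, 0, 1, 5]))        # 6.0
print(compute_return([0, 0, 1, 5], 0.9))   # 4.455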
Markov Reward Process
 So far we have seen how a Markov chain defines the dynamics
of an environment using a set of states (S) and a transition
probability matrix (P). But we know that Reinforcement
Learning is all about the goal of maximizing the reward. So let us add
rewards to our Markov chain. This gives us a Markov Reward
Process.
 Markov Reward Process: As the name suggests, MRPs are
Markov chains with a value judgement. Basically, we get a
value (reward) from every state our agent is in.
 Mathematically, we define Markov Reward Process as :
Markov Reward Process
 Mathematically, we define the Markov Reward Process as:

R[s] = E[ R[t+1] | S[t] = s ]

 What this equation means is how much reward (R[s]) we expect to get
from a particular state S[t] = s.
 This tells us the immediate reward from that particular state our
agent is in. The goal is to maximize these rewards from each
state our agent is in. In simple terms, maximizing the
cumulative reward we get from each state.
Markov Decision Process
 A sequential decision problem for a fully observable, stochastic
environment with a Markovian transition model and additive
rewards is called a Markov decision process.
 It consists of a set of states (with an initial state s0);
a set ACTIONS(s) of actions in each state;
a transition model P(s' | s, a);
and a reward function R(s).
A policy is a solution to the Markov Decision Process. A policy is a
mapping from states to actions: it indicates the action ‘a’ to be taken while in
state ‘s’.
It is traditional to denote a policy by π, and π(s) is the action
recommended by the policy π for state s. A small data-structure sketch
of these components is given below.
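A minimal, hypothetical Python sketch that simply names the MDP components listed above as a data structure; the field names and the sample policy are illustrative only.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                                     # S
    actions: Callable[[str], List[str]]                   # ACTIONS(s)
    transition: Dict[Tuple[str, str], Dict[str, float]]   # P(s' | s, a)
    reward: Dict[str, float]                              # R(s)
    initial_state: str                                    # s0

# A policy is simply a mapping from states to actions.
policy: Dict[str, str] = {"(1,1)": "UP", "(1,2)": "UP"}   # illustrative only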
Markov Decision Process
Let us take the example of a grid world:
MDP: Grid World Example
 An agent lives in the grid. The above example is a 4×3 grid.
The grid has a START state (grid 1,1). The purpose of the
agent is to wander around the grid to finally reach the FINAL
state (grid 4,3). Under all circumstances, the agent should
avoid the Fire state (grid 4,2). Also, grid 2,2 is a blocked
grid; it acts like a wall, hence the agent cannot enter it.
 The agent can take any one of these actions: UP, DOWN,
LEFT, RIGHT
 Walls block the agent's path, i.e., if there is a wall in the direction
the agent would have moved, the agent stays in the same place.
So, for example, if the agent chooses LEFT in the START grid, it
stays put in the START grid.
MDP: Grid World Example
 First Aim: To find the shortest sequence getting from START to
the Diamond. Two such sequences can be found:
RIGHT RIGHT UP UP RIGHT
UP UP RIGHT RIGHT RIGHT
 Let us take the second one (UP UP RIGHT RIGHT RIGHT) for
the subsequent discussion.
 The moves are now noisy. 80% of the time the intended action
works correctly. 20% of the time the action the agent takes causes it
to move at right angles to the intended direction.
 For example, if the agent chooses UP, the probability of going UP is
0.8, whereas the probability of going LEFT is 0.1 and the probability
of going RIGHT is 0.1 (since LEFT and RIGHT are at right angles
to UP). A sketch of this noisy move model is given below.
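A small Python sketch of the noisy move model just described (0.8 intended, 0.1 to each side); the helper names are hypothetical.

import random

RIGHT_ANGLES = {
    "UP":    ["LEFT", "RIGHT"],
    "DOWN":  ["LEFT", "RIGHT"],
    "LEFT":  ["UP", "DOWN"],
    "RIGHT": ["UP", "DOWN"],
}

def noisy_action(intended):
    # Return the direction the agent actually moves in: the intended
    # direction with probability 0.8, or one of the two directions at
    # right angles with probability 0.1 each.
    side_a, side_b = RIGHT_ANGLES[intended]
    return random.choices([intended, side_a, side_b],
                          weights=[0.8, 0.1, 0.1])[0]

print(noisy_action("UP"))   # usually 'UP', occasionally 'LEFT' or 'RIGHT'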
MDP: Grid World Example
The agent receives a reward at each time step:
A small reward each step (it can be negative, in which case it can also
be termed a punishment; in the above example, entering the Fire state has
a reward of -1).
Big rewards come at the end (good or bad).
The goal is to maximize the sum of rewards.
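A minimal Python sketch of such a reward function for this grid world; the per-step reward of -0.04 is an assumed textbook-style value, not taken from the slides.

def R(state, step_reward=-0.04):
    # +1 for the goal (4,3), -1 for the Fire state (4,2),
    # and a small (negative) reward for every other state.
    if state == (4, 3):
        return +1.0
    if state == (4, 2):
        return -1.0
    return step_reward

episode = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)]
print(sum(R(s) for s in episode))   # 0.8 = five small penalties plus the big reward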
MDP: Implementing Optimal
Policy in Grid World Example
 Each time a given policy is executed starting from the
initial state, the stochastic nature of the environment may
lead to a different environment history.
 The quality of a policy is therefore measured by the
expected utility of the possible environment histories
generated by that policy.
 An optimal policy is a policy that yields the highest
expected utility. We use π∗ to denote an optimal policy.
Given π∗, the agent decides what to do by consulting its
current percept, which tells it the current state s, and then
executing the action π∗(s).
MDP: Implementing Optimal
Policy in Grid World Example
 The policy recommends taking the long way round, rather than
taking the shortcut and thereby risking entering (4,2).
 The balance of risk and reward changes depending on the value of
R(s) for the nonterminal states.
 Figure 17.2(b) shows optimal policies for four different ranges of
R(s).
 When R(s) ≤ −1.6284, life is so painful that the agent heads
straight for the nearest exit, even if the exit is worth –1.
 When −0.4278 ≤ R(s) ≤ −0.0850, life is quite unpleasant; the agent
takes the shortest route to the +1 state and is willing to risk falling
into the –1 state by accident. In particular, the agent takes the
shortcut from (3,1).
MDP: Implementing Optimal
Policy in Grid World Example
 When life is only slightly dreary (−0.0221 < R(s) < 0), the optimal
policy takes no risks at all. In (4,1) and (3,2), the agent heads
directly away from the –1 state so that it cannot fall in by accident,
even though this means banging its head against the wall quite a
few times.
 Finally, if R(s) > 0, then life is positively enjoyable and the agent
avoids both exits.
 The careful balancing of risk and reward is a characteristic of MDPs
that does not arise in deterministic search problems; moreover, it is
a characteristic of many real-world decision problems.
 For this reason, MDPs have been studied in several fields, including
AI, operations research, economics, and control theory.
Utility Theory and utility functions
 Decision theory, in its simplest form, deals with choosing among
actions based on the desirability of their immediate outcomes.
 The agent may not know the current state, so we define RESULT(a)
as a random variable whose values are the possible outcome
states. The probability of outcome s', given evidence
observations e, is written
P(RESULT(a) = s' | a, e)
where the a on the right-hand side of the conditioning bar
stands for the event that action a is executed.
 The agent’s preferences are captured by a utility function, U(s),
which assigns a single number to express the desirability of a
state.
 The expected utility of an action given the evidence, EU(a|e),
is just the average utility value of the outcomes, weighted by
the probability that the outcome occurs:
EU(a|e) = ∑s' P(RESULT(a) = s' | a, e) U(s')
The principle of maximum expected utility (MEU) says that a
rational agent should choose the action that maximizes the
agent’s expected utility:
action = argmax_a EU(a|e)
In a sense, the MEU principle could be seen as defining all of AI.
All an intelligent agent has to do is calculate the various
quantities, maximize utility over its actions, and away it goes.
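A minimal Python sketch of the MEU calculation: the actions, outcome probabilities, and utilities below are invented numbers used only to show the argmax over expected utilities.

outcome_probs = {
    # P(RESULT(a) = s' | a, e) for each action a (assumed values)
    "go_left":  {"safe": 0.9, "crash": 0.1},
    "go_right": {"safe": 0.6, "crash": 0.4},
}
utility = {"safe": 10.0, "crash": -100.0}   # U(s'), assumed values

def expected_utility(action):
    return sum(p * utility[s] for s, p in outcome_probs[action].items())

best_action = max(outcome_probs, key=expected_utility)
print(best_action, expected_utility(best_action))   # go_left -1.0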
Basis of Utility Theory
 Intuitively, the principle of Maximum Expected Utility
(MEU) seems like a reasonable way to make decisions, but it
is by no means obvious that it is the only rational way.

 Why should maximizing the average utility be so special?


 What’s wrong with an agent that maximizes the weighted
sum of the cubes of the possible utilities, or tries to minimize
the worst possible loss?
 Could an agent act rationally just by expressing preferences
between states, without giving them numeric values?
 Finally, why should a utility function with the required
properties exist at all?
Constraints on rational preferences
 These questions can be answered by writing down some constraints
on the preferences that a rational agent should have and then
showing that the MEU principle can be derived from the constraints
A ≻ B    the agent prefers A over B.
A ∼ B    the agent is indifferent between A and B.
A ≿ B    the agent prefers A over B or is indifferent between them.
We can think of the set of outcomes for each action as a lottery—think
of each action as a ticket. A lottery L with possible outcomes
S1,...,Sn that occur with probabilities p1,...,pn is written
L = [p1, S1; p2, S2; ... pn, Sn] .
Constraints on rational preferences
 In general, each outcome Si of a lottery can be either an atomic
state or another lottery. The primary issue for utility theory is to
understand how preferences between complex lotteries are
related to preferences between the underlying states in those
lotteries.
To address this issue we list six constraints that we require any
reasonable preference relation to obey:
 Orderability: Given any two lotteries, a rational agent must
either prefer one to the other or else rate the two as equally
preferable. That is, the agent cannot avoid deciding.
Exactly one of (A ≻ B), (B ≻ A), or (A ∼ B) holds.
Constraints on rational preferences
 Transitivity: Given any three lotteries, if an agent prefers A to B and
prefers B to C, then the agent must prefer A to C.
(A ≻ B) ∧ (B ≻ C) ⇒ (A ≻ C)
 Continuity: If some lottery B is between A and C in preference, then
there is some probability p for which the rational agent will be indifferent
between getting B for sure and the lottery that yields A with probability p
and C with probability 1 − p.
A ≻ B ≻ C ⇒ ∃ p [p, A; 1 − p, C] ∼ B .
 Substitutability: If an agent is indifferent between two lotteries A and B,
then the agent is indifferent between two more complex lotteries that are
the same except that B is substituted for A in one of them. This holds
regardless of the probabilities and the other outcome(s) in the lotteries.
A ∼ B ⇒ [p, A; 1 − p, C] ∼ [p, B; 1 − p, C] .
This also holds if we substitute ≻ for ∼ in this axiom.
Constraints on rational preferences
 Monotonicity: Suppose two lotteries have the same two possible
outcomes, A and B. If an agent prefers A to B, then the agent must
prefer the lottery that has a higher probability for A (and vice versa).
A ≻ B ⇒ (p > q ⇔ [p, A; 1 − p, B] ≻ [q, A; 1 − q, B])

 Decomposability: Compound lotteries can be reduced to simpler
ones using the laws of probability. This has been called the “no fun
in gambling” rule because it says that two consecutive lotteries can
be compressed into a single equivalent lottery, as shown in the sketch
below.
[p, A; 1 − p, [q, B; 1 − q, C]] ∼ [p, A; (1 − p)q, B; (1 − p)(1 − q), C] .
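A small Python sketch of this flattening: the probabilities along each branch of a compound lottery are multiplied out to produce an equivalent simple lottery (the example lottery is hypothetical).

def flatten(lottery):
    # A lottery is a list of (probability, outcome) pairs; an outcome may
    # itself be a lottery (a list), in which case it is expanded recursively.
    simple = []
    for p, outcome in lottery:
        if isinstance(outcome, list):                  # nested lottery
            for q, sub_outcome in flatten(outcome):
                simple.append((p * q, sub_outcome))
        else:
            simple.append((p, outcome))
    return simple

compound = [(0.5, "A"), (0.5, [(0.4, "B"), (0.6, "C")])]
print(flatten(compound))   # [(0.5, 'A'), (0.2, 'B'), (0.3, 'C')]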
Expected Utilities
Value Iteration
Calculation of the Bellman Equation:

V(s) = max_a [ R(s, a) + γ ∑s' T(s, a, s') V(s') ]
Value Iteration
 In this algorithm, the optimal policy (i.e., Optimal action
for a given state) is obtained by choosing the action that
maximizes the optimal state value function for the given
state.
 In value iteration, we start with a random value function
and then find a new (improved) value function in an iterative
process, until we reach the optimal value function; we then
derive the optimal policy from that optimal value function.
 Since we find the optimal state value function using an
iterative algorithm, it is called value iteration.
 Value iteration is a method of computing an optimal MDP
policy and its value.
Value Iteration
 Algorithm:

Purpose: This algorithm computes an optimal Markov Decision Process policy and its value.
Step 1: [Initialize the value function with zeros or random values for all states]
    set V(s) to zero for all states s.
Step 2: [Find a new (improved) value function in an iterative process until reaching the
optimal value function]
    repeat
        for all s ∈ S
            for all a ∈ A
                Q(s, a) = R(s, a) + γ ∑s' ∈ S T(s, a, s') V(s')
            V(s) = max_a Q(s, a)
    until V(s) converges
Step 3: [Calculate the optimal policy from the optimal value function]
    for all s ∈ S
        π(s) = argmax_a [ R(s, a) + γ ∑s' ∈ S T(s, a, s') V(s') ]

(Here γ is the discount factor and T(s, a, s') is the probability of reaching s' from s by taking action a.)
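A compact Python sketch of this algorithm. The MDP here is given as plain dictionaries, and the tiny 2-state example at the bottom is hypothetical, included only to exercise the code.

def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
    # T[(s, a)] is a dict {s2: probability of reaching s2 from s via a};
    # R[(s, a)] is the immediate reward for taking a in s.
    V = {s: 0.0 for s in states}                                   # Step 1
    while True:                                                    # Step 2
        delta = 0.0
        for s in states:
            q = [R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                 for a in actions]
            new_v = max(q)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:          # stop when V has (approximately) converged
            break
    policy = {}                                                    # Step 3
    for s in states:
        policy[s] = max(actions,
                        key=lambda a: R[(s, a)] + gamma *
                        sum(p * V[s2] for s2, p in T[(s, a)].items()))
    return V, policy

# Hypothetical 2-state, 2-action MDP.
states, actions = ["s0", "s1"], ["stay", "go"]
T = {("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 1.0},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s0": 1.0}}
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "go"): 0.0}
V, policy = value_iteration(states, actions, T, R)
print(policy)   # {'s0': 'go', 's1': 'stay'}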
Value Iteration
 Value iteration algorithm keeps improving the value
function at each iteration until the value function
converges.
 Value iteration computes the optimal state value function
by iteratively improving the estimate of V(s) using the Bellman
equation.
 The algorithm initializes V(s) to arbitrary random values or
to zeros.
 It repeatedly updates the V(s) values until they converge.
 Value Iteration is guaranteed to converge to the optimal
values.
Value Iteration
 Value iteration algorithm keeps improving the value
function at each iteration until the value function
converges.
 But as we know, the main goal of an agent is to find an optimal
policy.
 With the value iteration algorithm, the optimal policy often
converges before the value function does, so value iteration may take
more iterations than necessary to find the optimal policy.
 So we can use another dynamic programming method, the Policy
Iteration method, to find the optimal policy.
