AI (IT) UNIT-4
Purpose: This algorithm computes an optimal Markov Decision Process (MDP) policy and its value function.
Step 1: [Initialize the value function to zero or to random values for all states]
set V(s) to zero for all states s ∈ S.
Step 2: [Find a new (improved) value function iteratively until reaching the optimal value function]
Repeat
    for all s ∈ S
        for all a ∈ A
            Q(s,a) = R(s,a) + γ Σ_{s'∈S} T(s,a,s') V(s')
        V(s) = max_a Q(s,a)
until the V(s) values converge
(Here γ is the discount factor and T(s,a,s') is the probability of reaching state s' from s under action a.)
Step 3: [Calculate the optimal policy from the optimal value function]
for all s ∈ S
    π(s) = argmax_a [ R(s,a) + γ Σ_{s'∈S} T(s,a,s') V(s') ]
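The three steps above can be written as a minimal Python sketch, assuming a small discrete MDP whose rewards R, transition probabilities T, and discount factor gamma are stored in nested dictionaries; this representation and the convergence threshold theta are illustrative choices, not part of the notes.

def value_iteration(states, actions, R, T, gamma=0.9, theta=1e-6):
    # R[s][a]     : immediate reward for taking action a in state s
    # T[s][a][s2] : probability of moving from s to s2 under action a
    # gamma       : discount factor (gamma in the notes)
    # theta       : convergence threshold on the value update

    # Step 1: initialize V(s) to zero for all states
    V = {s: 0.0 for s in states}

    # Step 2: repeatedly improve V(s) until it converges
    while True:
        delta = 0.0
        for s in states:
            # Q(s,a) = R(s,a) + gamma * sum over s' of T(s,a,s') * V(s')
            Q = {a: R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in states)
                 for a in actions}
            new_v = max(Q.values())          # V(s) = max_a Q(s,a)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:                    # stop once updates are tiny
            break

    # Step 3: extract the optimal policy from the optimal value function
    pi = {s: max(actions,
                 key=lambda a: R[s][a] + gamma * sum(T[s][a][s2] * V[s2]
                                                     for s2 in states))
          for s in states}
    return V, pi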
Value Iteration
The value iteration algorithm keeps improving the value function at each iteration until the value function converges.
Value iteration computes the optimal state value function by iteratively improving the estimate of V(s) using the Bellman equation.
The algorithm initializes V(s) to arbitrary random values or to zeros.
It repeatedly updates the V(s) values until they converge.
Value iteration is guaranteed to converge to the optimal values.
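To see the convergence in practice, the value_iteration sketch above can be run on a hypothetical two-state, two-action MDP; all of the numbers below are made up purely for illustration.

# Hypothetical toy MDP: 'stay' keeps the current state, 'go' flips it.
states, actions = ['s0', 's1'], ['stay', 'go']
R = {'s0': {'stay': 0.0, 'go': 1.0},
     's1': {'stay': 2.0, 'go': 0.0}}
T = {'s0': {'stay': {'s0': 1.0, 's1': 0.0}, 'go': {'s0': 0.0, 's1': 1.0}},
     's1': {'stay': {'s0': 0.0, 's1': 1.0}, 'go': {'s0': 1.0, 's1': 0.0}}}

V, pi = value_iteration(states, actions, R, T, gamma=0.9)
print(V)   # converged optimal values, roughly {'s0': 19.0, 's1': 20.0}
print(pi)  # optimal policy: {'s0': 'go', 's1': 'stay'}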
But as we know, the main goal of an agent is to find the optimal policy.
With value iteration, the optimal policy can converge before the value function does, so value iteration may run more iterations than are actually needed just to find the optimal policy.
So we can use another dynamic programming method, policy iteration, to find the optimal policy, as sketched below.
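Here is a minimal policy iteration sketch, reusing the same hypothetical R, T, states, and actions representation as the value iteration code above; it alternates policy evaluation and greedy policy improvement until the policy stops changing.

def policy_iteration(states, actions, R, T, gamma=0.9, theta=1e-6):
    # Start from an arbitrary policy and zero values.
    pi = {s: actions[0] for s in states}
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: compute V for the current fixed policy pi.
        while True:
            delta = 0.0
            for s in states:
                a = pi[s]
                new_v = R[s][a] + gamma * sum(T[s][a][s2] * V[s2]
                                              for s2 in states)
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < theta:
                break
        # Policy improvement: make pi greedy with respect to V.
        stable = True
        for s in states:
            best = max(actions,
                       key=lambda a: R[s][a] + gamma * sum(T[s][a][s2] * V[s2]
                                                           for s2 in states))
            if best != pi[s]:
                pi[s] = best
                stable = False
        if stable:   # policy no longer changes, so it is optimal
            return V, pi

Because the policy is compared directly between iterations, the loop stops as soon as the policy is stable, even if the values V would keep changing slightly, which is exactly the advantage over value iteration described above.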