
Markov Decision Process

A Markov decision process (MDP) is a mathematical framework for decision-making problems in which the outcomes are partly random and partly under the control of the decision maker.

An MDP is used to describe the environment in reinforcement learning (RL), and almost every RL problem can be formalized as an MDP.

An MDP is defined by a tuple of four elements (S, A, Pa, Ra), illustrated by the code sketch after this list:


o A finite set of states S

o A finite set of actions A

o Ra(s, s'): the reward received after transitioning from state s to state s' due to action a

o Pa(s, s'): the probability that action a taken in state s leads to state s'
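
To make the tuple concrete, here is a minimal Python sketch of a toy two-state MDP. The states, actions, probabilities, and rewards below are invented purely for illustration and are not taken from the text above.

    # S: a finite set of states; A: a finite set of actions
    states = ["s0", "s1"]
    actions = ["stay", "move"]

    # P[(s, a)] maps each possible next state s' to Pa(s, s')
    P = {
        ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
        ("s0", "move"): {"s0": 0.2, "s1": 0.8},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "move"): {"s0": 0.7, "s1": 0.3},
    }

    # R[(s, a, s')] is Ra(s, s'): the reward for reaching s' from s via action a
    R = {
        ("s0", "move", "s1"): 1.0,
        ("s1", "move", "s0"): -1.0,
    }

    def reward(s, a, s_next):
        # Transitions not listed above default to a reward of 0
        return R.get((s, a, s_next), 0.0)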

Markov Decision Process Terminology

An MDP relies on the Markov property, so to better understand MDPs we first need to understand that property.

Markov Property:
It states that if the agent is in the current state s1, performs an action a1, and moves to the state s2, then the transition from s1 to s2 depends only on the current state and action; it does not depend on past actions, rewards, or states.
In other words, as per the Markov property, the current state transition does not depend on any past action or state. Hence, an MDP is an RL problem that satisfies the Markov property. For example, in a game of Chess, the players only need to consider the current board position and do not need to remember past actions or states.

Finite MDP:
An MDP is a finite MDP when its sets of states, actions, and rewards are all finite. In RL, we consider only finite MDPs.

Markov Process:
A Markov process is a memoryless process with a sequence of random states S1, S2, ..., St that satisfies the Markov property. A Markov process is also known as a Markov chain, which is a tuple (S, P) of a state set S and a transition function P. These two components (S and P) define the dynamics of the system.
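
As a small illustration, here is a hedged Python sketch of a two-state Markov chain (S, P). The state names and transition probabilities are made up; the point is that the next state is sampled using only the current state, which is exactly the Markov property.

    import random

    S = ["sunny", "rainy"]                      # state set S (example states)
    P = {                                       # transition function: P[s][s'] = P(s' | s)
        "sunny": {"sunny": 0.8, "rainy": 0.2},
        "rainy": {"sunny": 0.4, "rainy": 0.6},
    }

    def sample_chain(start, steps):
        # Sample a state sequence S1, S2, ..., St; each step looks only at the current state
        state, path = start, [start]
        for _ in range(steps):
            next_states = list(P[state].keys())
            weights = list(P[state].values())
            state = random.choices(next_states, weights=weights)[0]
            path.append(state)
        return path

    print(sample_chain("sunny", 5))             # e.g. ['sunny', 'sunny', 'rainy', ...]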

The difference between Q-learning and SARSA

Reinforcement learning algorithms are used mainly in AI and gaming applications. The most commonly used algorithms are:

o Q-Learning:
o Q-learning is an off-policy RL algorithm that uses temporal difference (TD) learning. Temporal difference learning methods work by comparing temporally successive predictions.

o It learns the value function Q(s, a), which measures how good it is to take action "a" in a particular state "s".

o The Q-learning update rule is contrasted with SARSA's in the code sketch after this list.


o State-Action-Reward-State-Action (SARSA):
o SARSA stands for State-Action-Reward-State-Action; it is an on-policy temporal difference learning method. An on-policy control method selects the action for each state while learning, using a specific policy.

o The goal of SARSA is to calculate Qπ(s, a) for the currently selected policy π and all state-action pairs (s, a).

o The main difference between the Q-learning and SARSA algorithms is that, unlike Q-learning, SARSA does not use the maximum Q-value of the next state when updating the Q-value in the table; it uses the value of the action actually taken next (see the sketch after this list).

o In SARSA, the new action and reward are selected using the same policy that determined the original action.

o SARSA is so named because it uses the quintuple Q(s, a, r, s', a'), where:
s: original state
a: original action
r: reward observed after taking action a in state s
s', a': new state-action pair.
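
To make the difference concrete, here is a minimal Python sketch of the two update rules, assuming a Q-table stored as a nested dict and example values for the learning rate (alpha) and discount factor (gamma); these names and values are illustrative and not taken from the text above.

    alpha, gamma = 0.1, 0.9      # learning rate and discount factor (example values)
    q_table = {}                 # Q-table: q_table[s][a], created lazily with value 0.0

    def Q(s, a):
        return q_table.setdefault(s, {}).setdefault(a, 0.0)

    def q_learning_update(s, a, r, s_next, actions):
        # Off-policy: bootstraps from the best action in the next state (max over a')
        target = r + gamma * max(Q(s_next, a2) for a2 in actions)
        q_table[s][a] = Q(s, a) + alpha * (target - Q(s, a))

    def sarsa_update(s, a, r, s_next, a_next):
        # On-policy: bootstraps from the action a' actually chosen by the current policy,
        # i.e. it uses the full quintuple (s, a, r, s', a')
        target = r + gamma * Q(s_next, a_next)
        q_table[s][a] = Q(s, a) + alpha * (target - Q(s, a))

The only change between the two functions is the bootstrap target: Q-learning takes the maximum over the next state's actions, while SARSA uses the next action selected by the same policy that chose the original action.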
