ML Unit-5
Reinforcement Learning: overview, example: getting lost, State and Action Spaces, The
Reward Function, Discounting, Action Selection, Policy, Markov decision processes, Q-
learning, uses of Reinforcement learning,
Applications of Machine Learning in various fields: Text classification, Image
Classification, Speech Recognition.
---------------------------------------------------------------------------------------------------------------
o The agent continues doing these three things (take an action, change state or remain in the same state, and get feedback), and by doing so it learns and explores the environment.
o The agent learns which actions lead to positive feedback (rewards) and which lead to negative feedback (penalties). For a good action the agent gets a positive point as a reward, and for a bad action it gets a negative point as a penalty.
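A minimal sketch of this interaction loop, using a made-up two-state environment (the state names, actions, and reward values are assumptions for illustration, not taken from the text):

import random

# Hypothetical environment: two states and two actions, for illustration only.
STATES = ["safe", "danger"]
ACTIONS = ["move", "stay"]

def step(state, action):
    """Return (next_state, reward) for one interaction with the environment."""
    if state == "danger" and action == "move":
        return "safe", +1          # good action -> positive point (reward)
    if state == "danger" and action == "stay":
        return "danger", -1        # bad action -> negative point (penalty)
    return random.choice(STATES), 0

state = "safe"
for t in range(10):
    action = random.choice(ACTIONS)         # 1. take an action
    state, reward = step(state, action)     # 2. change state (or remain), 3. get feedback
    print(t, action, state, reward)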
--------------------------------------------------------------------------------------------------------
Example:
The problem is as follows: We have an agent and a reward, with many hurdles in between.
The agent is supposed to find the best possible path to reach the reward. The following
example illustrates the problem more clearly.
The above image shows the robot, the diamond, and the fire. The goal of the robot is to get the reward, which is the diamond, while avoiding the hurdles, which are the fire. The robot learns by trying all the possible paths and then choosing the path that leads to the reward with the fewest hurdles. Each correct step gives the robot a reward, and each wrong step subtracts from the robot's reward. The total reward is calculated when it reaches the final goal, the diamond.
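A toy version of this grid world can be sketched as follows; the cell coordinates and the +10/-10/-1 reward values are assumptions for illustration, not taken from the figure:

# Toy grid world for the robot/diamond/fire example (all values are made up).
# Reaching the diamond gives +10, stepping into fire gives -10, and every other
# move costs -1, so a shorter safe path earns a higher total reward.
DIAMOND = (0, 3)
FIRE = {(1, 3)}

def step_reward(cell):
    if cell == DIAMOND:
        return +10
    if cell in FIRE:
        return -10
    return -1

def total_reward(path):
    # Sum of the rewards collected along a candidate path of grid cells.
    return sum(step_reward(cell) for cell in path)

safe_path  = [(2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 3)]   # ends at the diamond
risky_path = [(2, 0), (2, 1), (2, 2), (2, 3), (1, 3)]           # ends in the fire
print(total_reward(safe_path), total_reward(risky_path))        # 5  -14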
-------------------------------------------------------------------------------------------
Getting Lost
Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad the
model's prediction was on a single example. If the model's prediction is perfect, the loss is
zero; otherwise, the loss is greater.
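For instance, using a squared-error loss on a single example (the numbers are made up):

# Squared-error loss on a single example (illustrative numbers only).
def squared_loss(y_true, y_pred):
    return (y_true - y_pred) ** 2

print(squared_loss(5.0, 5.0))   # perfect prediction -> loss is 0.0
print(squared_loss(5.0, 3.0))   # worse prediction   -> loss is 4.0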
----------------------------------------------------------------------------------------------------------
The state space S is the set of all states the agent can be in or transition to, and the action space A is the set of all actions the agent can take in a given environment. There are also partially observable cases, in which the agent cannot observe the complete state information of the environment.
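As a small sketch, the two sets for a hypothetical grid environment (the grid size and action names below are assumptions) can be written out explicitly:

# Hypothetical state space S and action space A for a small 3x4 grid environment.
S = {(row, col) for row in range(3) for col in range(4)}   # every cell the agent can occupy
A = {"up", "down", "left", "right"}                        # every action the agent can take

# In a partially observable case, the agent sees only an observation derived
# from the state (here, just its row), not the complete state itself.
def observe(state):
    row, col = state
    return row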
----------------------------------------------------------------------------------------------------------
Reward function:
The goal of reinforcement learning is defined by the reward signal. At each state, the environment sends an immediate signal to the learning agent, and this signal is known as the reward signal. Rewards are given according to the good and bad actions taken by the agent. The agent's main objective is to maximize the total reward it collects for good actions. The reward signal can change the policy: if an action selected by the agent leads to a low reward, then the policy may change to select other actions in the future.
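A reward function can be sketched as a simple mapping from (state, action) pairs to an immediate scalar signal; the state names, actions, and values below are invented for illustration:

# Illustrative reward function R(s, a): an immediate scalar signal returned by
# the environment for the action the agent just took in a given state.
R = {
    ("near_goal", "advance"): +10,   # good action -> high reward
    ("near_goal", "retreat"):  -1,   # bad action  -> low reward
    ("start", "advance"):       0,
}

def reward(state, action):
    return R.get((state, action), 0)

# A low reward is the signal that may push the policy to select a different
# action in this state in the future.
print(reward("near_goal", "retreat"))   # -1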
---------------------------------------------------------------------------------------------------------
MDP is used to describe the environment for RL, and almost all RL problems can be formalized using MDPs.
MDP uses the Markov property, and to better understand MDPs, we need to learn about this property.
Markov Property:
It says that "if the agent is present in the current state s1, performs an action a1, and moves to the state s2, then the state transition from s1 to s2 depends only on the current state and action, and not on past actions, rewards, or states."
Or, in other words, as per the Markov property, the current state transition does not depend on any past action or state. Hence, an MDP is an RL problem that satisfies the Markov property. For example, in a game of Chess, the players only focus on the current state and do not need to remember past actions or states.
Finite MDP:
A finite MDP is one in which the states, rewards, and actions are all finite sets. In RL, we consider only the finite MDP.
Markov Process:
A Markov process is a memoryless process with a sequence of random states S1, S2, ..., St that satisfies the Markov property. A Markov process is also known as a Markov chain, which is a tuple (S, P) of a state set S and a transition function P. These two components (S and P) can define the dynamics of the system.
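A Markov chain (S, P) can be sketched with a finite state set and a transition function; the weather states and probabilities below are invented for illustration:

import random

# Hypothetical Markov chain: state set S and transition probabilities P.
S = ["sunny", "rainy"]
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(s):
    # The distribution over the next state depends only on the current state s
    # (the Markov property), not on how the chain arrived at s.
    states, probs = zip(*P[s].items())
    return random.choices(states, weights=probs)[0]

s = "sunny"
chain = [s]
for _ in range(5):
    s = next_state(s)
    chain.append(s)
print(chain)   # one sampled sequence of states S1, S2, ..., St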
-----------------------------------------------------------------------------------------------------------
Q-Learning
o The main difference between the Q-learning and SARSA algorithms is that, unlike Q-learning, SARSA does not use the maximum Q-value of the next state when updating the Q-value in the table (the two update rules are compared in the sketch after this list).
o In SARSA, the new action and reward are selected using the same policy that determined the original action.
o SARSA is so named because it uses the quintuple (s, a, r, s', a'), where:
s: original state
a: original action
r: reward observed while following the states
s' and a': new state, action pair.
o Deep Q Neural Network (DQN):
o As the name suggests, DQN is Q-learning using neural networks.
o For an environment with a big state space, it is a challenging and complex task to define and update a Q-table.
o To solve such an issue, we can use the DQN algorithm, where, instead of defining a Q-table, a neural network approximates the Q-values for each action and state.
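The contrast between the two tabular update rules can be sketched as follows; the learning rate alpha, discount factor gamma, and the dict-of-dicts Q-table layout are assumptions, not taken from the text:

# Sketch of the tabular update rules for Q-learning vs SARSA.
# Q is assumed to be a dict of dicts: Q[state][action] -> value.
alpha, gamma = 0.1, 0.9   # assumed learning rate and discount factor

def q_learning_update(Q, s, a, r, s_next):
    # Q-learning (off-policy): the target uses the maximum Q-value of the next state.
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    # SARSA (on-policy): the target uses the Q-value of the action a' actually
    # chosen by the same policy in the next state -- the quintuple (s, a, r, s', a').
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])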
Q-Learning Explanation:
o Q-learning is a popular model-free reinforcement learning algorithm based on the Bellman equation.
o The main objective of Q-learning is to learn a policy that tells the agent which actions should be taken, and under which circumstances, to maximize the reward.
o It is an off-policy RL algorithm that attempts to find the best action to take in the current state.
o The goal of the agent in Q-learning is to maximize the value of Q.
o The Q-value can be derived from the Bellman equation. Consider the Bellman equation given below:
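A standard form of this equation, written in the usual value-function notation (the original figure is not reproduced here), is:
V(s) = max_a [ R(s, a) + γ Σ_s' P(s' | s, a) · V(s') ]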
In the equation, we have various components, including the reward, the discount factor (γ), the transition probability, and the end state s'. But no Q-value appears in it yet, so first consider the image below:
In the above image, we can see an agent that has three value options, V(s1), V(s2), and V(s3). As this is an MDP, the agent only cares about the current state and the future state. The agent can go in any direction (Up, Left, or Right), so it needs to decide where to go to follow the optimal path. Here the agent will make a move on a probability basis and change its state. But if we want certain exact moves, then we need to make some changes in terms of the Q-value. Consider the image below:
Q represents the quality of the actions at each state. So instead of using a value at each state, we will use a pair of state and action, i.e., Q(s, a). The Q-value specifies which action is more lucrative than the others, and according to the best Q-value, the agent takes its next move. The Bellman equation can be used for deriving the Q-value.
To perform any action, the agent will get a reward R(s, a), and it will also end up in a certain state, so the Q-value equation will be:
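A standard form of this Q-value equation (written here for the simple case in which taking action a in state s leads to the next state s'; the original figure is not reproduced) is:
Q(s, a) = R(s, a) + γ max_a' Q(s', a')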
The Q stands for quality in Q-learning, which means it specifies the quality of an action taken
by the agent.
Q-table:
A Q-table or matrix is created while performing Q-learning. The table is indexed by state and action pairs, i.e., [s, a], and all values are initialized to zero. After each action, the table is updated and the Q-values are stored in it.
The RL agent uses this Q-table as a reference table to select the best action based on the Q-values. A minimal sketch of such a table follows.
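A minimal sketch of such a Q-table and its update, with made-up states, actions, and hyperparameters:

# Minimal tabular Q-learning sketch (all environment details are hypothetical).
states  = ["s0", "s1", "s2"]
actions = ["left", "right"]

# Q-table initialized to zero for every [state, action] pair.
Q = {s: {a: 0.0 for a in actions} for s in states}

alpha, gamma = 0.1, 0.9   # assumed learning rate and discount factor

def update(s, a, r, s_next):
    # After each action, move the stored Q-value toward the Bellman target.
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

def best_action(s):
    # The agent consults the Q-table and picks the action with the highest Q-value.
    return max(Q[s], key=Q[s].get)

update("s0", "right", 1.0, "s1")
print(best_action("s0"))   # -> "right"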
----------------------------------------------------------------------------------------------------------