ML Unit-5

The document provides an overview of Reinforcement Learning (RL), explaining its principles, components such as state and action spaces, reward functions, and algorithms like Q-learning and SARSA. It highlights the agent's learning process through feedback and experience in various environments, as well as applications of RL in fields like robotics, natural language processing, and gaming. Additionally, it discusses the Markov Decision Process (MDP) and the significance of Q-values in optimizing agent actions.


Unit-5: Machine Learning
Neha Unnisa, Asst. Prof.

Reinforcement Learning: overview, example: getting lost, State and Action Spaces, The Reward Function, Discounting, Action Selection, Policy, Markov decision processes, Q-learning, uses of Reinforcement Learning.
Applications of Machine Learning in various fields: Text Classification, Image Classification, Speech Recognition.

---------------------------------------------------------------------------------------------------------------

What is Reinforcement Learning?

o Reinforcement Learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and seeing the results of those actions. For each good action, the agent gets positive feedback, and for each bad action, the agent gets negative feedback or a penalty.
o In Reinforcement Learning, the agent learns automatically from feedback, without any labeled data, unlike supervised learning.
o Since there is no labeled data, the agent is bound to learn from its experience only.
o RL solves a specific type of problem where decision making is sequential, and the goal
is long-term, such as game-playing, robotics, etc.
o The agent interacts with the environment and explores it by itself. The primary goal of an agent in reinforcement learning is to improve its performance by getting the maximum positive reward.
o The agent learns through trial and error, and based on that experience, it learns to perform the task in a better way. Hence, we can say that "Reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the environment and learns to act within it." How a robotic dog learns the movement of its limbs is an example of reinforcement learning.
o It is a core part of Artificial Intelligence, and all AI agents work on the concept of reinforcement learning. Here we do not need to pre-program the agent, as it learns from its own experience without any human intervention.
o Example: Suppose there is an AI agent present within a maze environment, and its goal is to find the diamond. The agent interacts with the environment by performing some actions, and based on those actions, the state of the agent changes, and it also receives a reward or penalty as feedback.

o The agent continues doing these three things (take an action, change state or remain in the same state, and get feedback), and by repeating them, it learns and explores the environment.
o The agent learns which actions lead to positive feedback or rewards and which actions lead to negative feedback or penalties. As a positive reward, the agent gets a positive point, and as a penalty, it gets a negative point.

--------------------------------------------------------------------------------------------------------

Example:
The problem is as follows: we have an agent and a reward, with many hurdles in between. The agent is supposed to find the best possible path to reach the reward. The following description explains the problem more concretely.

The environment contains a robot, a diamond, and fire. The goal of the robot is to get the reward, that is, the diamond, while avoiding the hurdles, which are the fire. The robot learns by trying all the possible paths and then choosing the path which gives it the reward with the fewest hurdles. Each right step gives the robot a reward, and each wrong step subtracts from the robot's reward. The total reward is calculated when it reaches the final reward, that is, the diamond.
-------------------------------------------------------------------------------------------

Getting Lost
Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad the
model's prediction was on a single example. If the model's prediction is perfect, the loss is
zero; otherwise, the loss is greater.

----------------------------------------------------------------------------------------------------------

State/Observation Spaces and Action Spaces

The state space S is the set of all states that the agent can transition to, and the action space A is the set of all actions the agent can carry out in a certain environment. There are also partially observable cases, where the agent is unable to observe the complete state information of the environment. A minimal sketch of both spaces for the maze example is given below.
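
The following Python sketch is illustrative only; the grid size, coordinates, and names are assumptions rather than anything fixed by these notes.

# State space: each state is a (row, column) cell of a small 3x4 grid maze.
STATES = [(row, col) for row in range(3) for col in range(4)]

# Action space: the robot can try to move in four directions from any cell.
ACTIONS = ["up", "down", "left", "right"]

print(len(STATES), "states,", len(ACTIONS), "actions")  # 12 states, 4 actions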

----------------------------------------------------------------------------------------------------------

Reward function:
The goal of reinforcement learning is defined by the reward signal. At each state, the environment sends an immediate signal to the learning agent, and this signal is known as a reward signal. These rewards are given according to the good and bad actions taken by the agent. The agent's main objective is to maximize the total reward it collects for good actions. The reward signal can also change the policy: if an action selected by the agent leads to a low reward, the policy may change to select other actions in the future.
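
As a minimal sketch, a reward function for the robot-and-diamond maze could look as follows; the diamond and fire positions and the reward magnitudes are assumptions made purely for illustration.

# Illustrative reward function for the grid-world maze described above.
DIAMOND = (0, 3)          # goal cell (assumed position)
FIRE = {(1, 1), (1, 3)}   # hurdle cells (assumed positions)

def reward(state, action, next_state):
    # Immediate reward signal sent by the environment after a transition.
    if next_state == DIAMOND:
        return 10    # reaching the goal is strongly rewarded
    if next_state in FIRE:
        return -10   # stepping into fire is penalized
    return -1        # small step cost encourages short paths

print(reward((0, 2), "right", (0, 3)))  # prints 10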

---------------------------------------------------------------------------------------------------------

Markov Decision Process

A Markov Decision Process, or MDP, is used to formalize reinforcement learning problems. If the environment is completely observable, then its dynamics can be modeled as a Markov Process. In an MDP, the agent constantly interacts with the environment and performs actions; after each action, the environment responds and generates a new state.

An MDP is used to describe the environment for RL, and almost all RL problems can be formalized using an MDP.

An MDP contains a tuple of four elements (S, A, Pa, Ra):

o A set of finite states S
o A set of finite actions A
o A transition probability Pa(s, s'): the probability that action a taken in state s leads to state s'
o A reward Ra(s, s') received after transitioning from state s to state s' due to action a

A minimal code sketch of this tuple is given below.
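
The two-state environment, its probabilities, and its rewards in this Python sketch are invented purely for illustration.

S = ["s0", "s1"]            # finite set of states
A = ["stay", "move"]        # finite set of actions

# P[(s, a)] maps each next state s' to the transition probability Pa(s, s').
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s1": 0.9, "s0": 0.1},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.9, "s1": 0.1},
}

# R[(s, a, s')] is the reward Ra received for that transition (0 if absent).
R = {("s0", "move", "s1"): 1.0}

def expected_reward(s, a):
    # Expected immediate reward of taking action a in state s.
    return sum(p * R.get((s, a, s_next), 0.0) for s_next, p in P[(s, a)].items())

print(expected_reward("s0", "move"))  # prints 0.9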

An MDP uses the Markov property, and to better understand the MDP, we need to learn about it.

Markov Property:

It says that "If the agent is present in the current state S1, performs an action a1 and move
to the state s2, then the state transition from s1 to s2 only depends on the current state and
future action and states do not depend on past actions, rewards, or states."

Or, in other words, as per Markov Property, the current state transition does not depend on any
past action or state. Hence, MDP is an RL problem that satisfies the Markov property. Such as
in a Chess game, the players only focus on the current state and do not need to remember
past actions or states.
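
In symbols (standard notation, not taken verbatim from these notes), the Markov property reads:

P(S_{t+1} = s' | S_t = s_t, A_t = a_t, S_{t-1}, A_{t-1}, ..., S_0) = P(S_{t+1} = s' | S_t = s_t, A_t = a_t)

that is, the probability of the next state given the entire history equals the probability of the next state given only the current state and action.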

Finite MDP:

A finite MDP is one in which there are finitely many states, rewards, and actions. In RL, we consider only finite MDPs.

Markov Process:

A Markov Process is a memoryless process with a sequence of random states S1, S2, ..., St that satisfies the Markov property. A Markov process is also known as a Markov chain, which is a tuple (S, P) of a state set S and a transition function P. These two components (S and P) can define the dynamics of the system.
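
A minimal sketch of a Markov chain as a tuple (S, P) and of sampling a state sequence from it; the weather states and probabilities are invented for illustration.

import random

S = ["sunny", "rainy"]                       # state set
P = {                                        # transition function P(s -> s')
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def sample_chain(start, steps):
    # Memoryless sampling: each next state depends only on the current one.
    state, path = start, [start]
    for _ in range(steps):
        next_states = list(P[state])
        weights = [P[state][s] for s in next_states]
        state = random.choices(next_states, weights=weights)[0]
        path.append(state)
    return path

print(sample_chain("sunny", 5))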

-----------------------------------------------------------------------------------------------------------

Q-Learning

o Q-learning is an off-policy RL algorithm used for temporal difference learning. Temporal difference learning methods are a way of comparing temporally successive predictions.
o It learns the action-value function Q(s, a), which measures how good it is to take action "a" in a particular state "s".
o Q-learning works as a loop: observe the current state, select an action, receive a reward and the next state, and update the corresponding Q-value.

o State-Action-Reward-State-Action (SARSA):
o SARSA stands for State-Action-Reward-State-Action, which is an on-policy temporal difference learning method. An on-policy control method selects the action for each state while learning using a specific policy.
o The goal of SARSA is to calculate Qπ(s, a) for the currently selected policy π and all pairs (s, a).

o The main difference between the Q-learning and SARSA algorithms is that, unlike Q-learning, the maximum Q-value of the next state is not required for updating the Q-value in the table (see the sketch after this list).
o In SARSA, the new action and reward are selected using the same policy that determined the original action.
o SARSA is named after the quintuple (s, a, r, s', a') it uses, where:
s: original state
a: original action
r: reward observed while following the policy
s', a': new state-action pair
o Deep Q Neural Network (DQN):
o As the name suggests, DQN is Q-learning using neural networks.
o For an environment with a big state space, it is a challenging and complex task to define and update a Q-table.
o To solve such an issue, we can use the DQN algorithm, where, instead of defining a Q-table, a neural network approximates the Q-values for each action and state.
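
The contrast between the Q-learning and SARSA update rules can be made concrete with a small sketch; the dict-based Q-table, the action list, and the hyperparameters alpha (learning rate) and gamma (discount factor) are all assumptions made for illustration.

from collections import defaultdict

Q = defaultdict(float)                     # Q[(state, action)] -> value, zero-initialized
ACTIONS = ["up", "down", "left", "right"]
alpha, gamma = 0.1, 0.9

def q_learning_update(s, a, r, s_next):
    # Off-policy: bootstrap on the best action in the next state,
    # regardless of which action the behaviour policy will actually pick.
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: bootstrap on the action a_next actually chosen by the same policy.
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])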

Now, let us look at Q-learning in more detail.

Q-Learning Explanation:
o Q-learning is a popular model-free reinforcement learning algorithm based on the Bellman equation.
o The main objective of Q-learning is to learn a policy that can inform the agent what actions should be taken, and under what circumstances, to maximize the reward.
o It is an off-policy RL algorithm that attempts to find the best action to take in the current state.
o The goal of the agent in Q-learning is to maximize the value of Q.
o The value of Q can be derived from the Bellman equation. Consider the Bellman equation given below:
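
In its standard form, the Bellman optimality equation for the state value is:

V(s) = max_a [ R(s, a) + γ Σ_{s'} P(s' | s, a) V(s') ]

where R(s, a) is the immediate reward, γ is the discount factor, and P(s' | s, a) is the probability of ending up in the next state s'.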

In the equation, we have various components, including the reward, the discount factor (γ), the transition probability, and the next states s'. But no Q-value appears yet, so first consider the following situation:

Suppose an agent has three value options, V(s1), V(s2), and V(s3). Since this is an MDP, the agent only cares about the current state and the future state. The agent can go in any direction (up, left, or right), so it needs to decide where to go for the optimal path. Here the agent will make a move on a probability basis and change the state. But if we want some exact moves, then we need to make some changes in terms of the Q-value. Consider the following:

Q represents the quality of the actions at each state. So instead of using a value at each state, we will use a pair of state and action, i.e., Q(s, a). The Q-value specifies which action is more lucrative than the others, and according to the best Q-value, the agent takes its next move. The Bellman equation can be used for deriving the Q-value.

On performing any action, the agent will get a reward R(s, a), and it will also end up in a certain state, so the Q-value equation will be:
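
In its standard form, the corresponding Q-value equation is:

Q(s, a) = R(s, a) + γ Σ_{s'} P(s' | s, a) max_{a'} Q(s', a')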

Hence, we can say that V(s) = max_a [Q(s, a)].

The above formula is used to estimate the Q-values in Q-Learning.

What is 'Q' in Q-learning?

The Q stands for quality in Q-learning, which means it specifies the quality of an action taken
by the agent.

Q-table:

A Q-table or matrix is created while performing Q-learning. The table has one entry for each state-action pair [s, a], and the values are initialized to zero. After each action, the table is updated, and the Q-values are stored within it.

The RL agent uses this Q-table as a reference table to select the best action based on the Q-values. A minimal training sketch is given below.
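
The following end-to-end Python sketch fills in and uses such a Q-table; the tiny corridor environment, reward values, and hyperparameters are all assumptions made for illustration.

import random
from collections import defaultdict

# Tiny corridor: states 0..4, goal at state 4, actions move left (-1) or right (+1).
ACTIONS = [-1, 1]
N_STATES, GOAL = 5, 4
alpha, gamma, epsilon = 0.1, 0.9, 0.1

Q = defaultdict(float)   # Q-table keyed by (state, action), initialized to zero

def choose_action(s):
    # Epsilon-greedy action selection from the Q-table.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

for episode in range(500):
    s = 0
    while s != GOAL:
        a = choose_action(s)
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 10 if s_next == GOAL else -1
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# After training, the greedy action in every non-goal state should be +1 (move right).
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)])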

----------------------------------------------------------------------------------------------------------

Applications of Reinforcement Learning


• Automated Robots: while most robots do not look the way pop culture has led us to believe, their capabilities are just as impressive.
• Natural Language Processing
• Marketing and Advertising
• Image Processing
• Recommendation Systems
• Gaming
• Energy Conservation
• Traffic Control

------------------------------------------------------------------------------------------------------
