
Unit-5

Reinforcement learning is a type of machine learning where an agent learns to take
actions in an environment to maximize a reward signal. The agent receives feedback
in the form of a reward or penalty for each action it takes, and uses this feedback to
adjust its actions in order to achieve the goal of maximizing the total reward over time.
It is often used in settings where there is no labeled data available, and the agent must
learn through trial and error.

Here are a few examples of reinforcement learning:

1. Game playing: Reinforcement learning has been used to train agents to
play games such as chess, Go, and video games. The agent learns to take
actions that maximize its score or win rate, and the game environment
provides the reward signal.
2. Robotics: Reinforcement learning has been used to train robots to
perform tasks such as grasping objects, walking, and navigating. The
robot receives a reward signal for completing the task, and learns to
adjust its actions to maximize the reward.
3. Advertising: Reinforcement learning has been used to optimize online
advertising campaigns. The agent learns to select ads that are most likely
to result in clicks or conversions, and the reward signal is based on the
actual performance of the ad.
4. Autonomous driving: Reinforcement learning has been used to train
autonomous vehicles to navigate complex environments such as city
streets. The agent receives a reward signal for safe and efficient driving,
and learns to adjust its actions based on the feedback.
5. Recommendation systems: Reinforcement learning has been used to
optimize recommendation systems, such as those used by online retailers
or streaming services. The agent learns to recommend products or
content that are most likely to be of interest to the user, and the reward
signal is based on user engagement or purchases.
Key Concepts and Terminology

Agent-environment interaction loop.

The main characters of RL are the agent and the environment. The environment is the
world that the agent lives in and interacts with. At every step of interaction, the agent
sees a (possibly partial) observation of the state of the world, and then decides on an
action to take. The environment changes when the agent acts on it, but may also change
on its own.

The agent also perceives a reward signal from the environment, a number that tells it
how good or bad the current world state is. The goal of the agent is to maximize its
cumulative reward, called return. Reinforcement learning methods are ways that the
agent can learn behaviors to achieve its goal.
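
As a rough illustration, the loop described above can be sketched in a few lines of
Python. The environment and agent interfaces used here (reset, step, act, learn) are
assumptions made for this sketch, loosely modeled on common RL toolkits, not part of
any particular library:

def run_episode(env, agent):
    # Minimal agent-environment interaction loop (illustrative sketch).
    obs = env.reset()                        # initial (possibly partial) observation
    total_return = 0.0                       # cumulative reward, i.e. the return
    done = False
    while not done:
        action = agent.act(obs)              # agent decides on an action
        obs, reward, done = env.step(action) # environment responds and may change
        agent.learn(obs, reward)             # feedback used to adjust future behavior
        total_return += reward
    return total_return                      # the quantity the agent tries to maximize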

To talk more specifically about what RL does, we need to introduce additional
terminology. We need to talk about

• states and observations,
• action spaces,
• policies,
• trajectories,
• different formulations of return,
• the RL optimization problem,
• and value functions.
States and Observations

A state is a complete description of the state of the world. There is no information about
the world which is hidden from the state. An observation is a partial description of a
state, which may omit information.

In deep RL, we almost always represent states and observations by a real-valued vector,
matrix, or higher-order tensor. For instance, a visual observation could be represented
by the RGB matrix of its pixel values; the state of a robot might be represented by its
joint angles and velocities.

When the agent is able to observe the complete state of the environment, we say that
the environment is fully observed. When the agent can only see a partial observation,
we say that the environment is partially observed.
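
For concreteness, the representations mentioned above might look like the following
small sketch (the shapes and numbers are invented for illustration):

import numpy as np

# A visual observation: an RGB image stored as a height x width x 3 array.
visual_observation = np.zeros((64, 64, 3), dtype=np.uint8)

# A robot state: joint angles and joint velocities stacked into one vector.
joint_angles = np.array([0.1, -0.4, 0.7])
joint_velocities = np.array([0.0, 0.2, -0.1])
robot_state = np.concatenate([joint_angles, joint_velocities])   # shape (6,)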

Passive reinforcement learning


Passive reinforcement learning is a type of reinforcement learning where the agent
observes the environment and learns from the experiences, but does not actively take
actions to influence the environment. In other words, the agent is only a passive
observer and does not take any action to maximize its rewards.

Passive reinforcement learning can be used to estimate the value function of a given
policy, which is the expected cumulative reward that an agent would receive when
following that policy. This information can be used to evaluate the performance of the
policy and to improve it.

Passive reinforcement learning is often used in offline settings, where the agent has
access to a pre-collected dataset of experiences. The agent can use this dataset to
learn about the environment and to improve its policy, without interacting with the
environment in real-time.
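
A minimal sketch of this idea is to average the discounted returns observed in the
pre-collected episodes, which gives a Monte Carlo estimate of the value of each state
under the fixed policy. The dataset format (a list of episodes, each a list of
(state, reward) pairs) and the discount factor are assumptions for illustration:

from collections import defaultdict

def evaluate_policy_from_dataset(episodes, gamma=0.99):
    # Every-visit Monte Carlo value estimation from logged experience.
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g accumulates the discounted return from each step on.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}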

Passive reinforcement learning can be contrasted with active reinforcement learning,
where the agent takes actions in the environment to maximize its rewards.

Passive reinforcement learning has several applications in machine learning and
artificial intelligence. Here are a few examples:

1. Evaluation of policies: Passive reinforcement learning can be used to evaluate
the performance of a given policy in a given environment. The agent observes
the environment and learns about the expected cumulative reward that would
be obtained by following the policy.
2. Offline policy improvement: Passive reinforcement learning can be used to
improve a given policy using a pre-collected dataset of experiences. The agent
learns from the dataset and tries to optimize the policy, without interacting with
the environment in real-time.
3. Model-based reinforcement learning: Passive reinforcement learning can be
used to learn a model of the environment, which can be used for planning and
decision-making. The agent observes the environment and learns a model of
the state transition dynamics and reward function.
4. Real-world applications: Passive reinforcement learning has been used in
several real-world applications such as robotics, autonomous driving, and
recommendation systems. In these applications, passive reinforcement learning
is used to improve the performance of the system by learning from pre-
collected datasets.

Overall, passive reinforcement learning is a useful technique for learning about the
environment and improving policies in a variety of settings, especially when real-time
interaction with the environment is not possible or practical.

Direct Utility Estimation


Direct utility estimation is a technique used in reinforcement learning and decision-
making that involves estimating the expected utility of taking a particular action in a
given state. The expected utility is a measure of how beneficial an action is likely to be,
given the current state and the potential future states.

Direct utility estimation can be used in situations where the environment is stochastic,
meaning that the outcomes of actions are not deterministic, and the rewards
associated with those outcomes are uncertain. In such cases, the expected utility of an
action is used to make decisions that maximize the expected cumulative reward over
time.

The process of direct utility estimation involves using statistical techniques to estimate
the value function or Q-function, which is a function that maps states and actions to
expected cumulative rewards. The value function can be estimated using a variety of
methods, including Monte Carlo sampling, temporal difference learning, and model-
based methods.
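
As a rough sketch, the Monte Carlo flavour of this estimate can be maintained
incrementally: after each observed trajectory, the estimate for a state-action pair is
nudged toward the return that actually followed it. The trajectory format and variable
names below are assumptions for illustration:

from collections import defaultdict

q_estimate = defaultdict(float)    # estimated expected utility of each (state, action)
visit_count = defaultdict(int)

def update_utility_estimates(trajectory, gamma=0.99):
    # trajectory is assumed to be a list of (state, action, reward) triples.
    g = 0.0
    for state, action, reward in reversed(trajectory):
        g = reward + gamma * g                        # return from this step onward
        visit_count[(state, action)] += 1
        n = visit_count[(state, action)]
        # Running average: move the estimate toward the sampled return.
        q_estimate[(state, action)] += (g - q_estimate[(state, action)]) / n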

Direct utility estimation is a widely used technique in reinforcement learning and
decision-making, and has been applied in many areas, including robotics, finance, and
healthcare. It is a powerful tool for making optimal decisions in situations with
uncertain outcomes, and can be used to optimize a wide range of complex systems.

Adaptive Dynamic Programming
Adaptive dynamic programming (ADP) is a type of reinforcement learning that focuses
on the development of adaptive control and decision-making systems. ADP aims to
solve complex decision-making problems by iteratively learning from data, and
adapting to changes in the environment over time.

The key idea behind ADP is to learn a control policy that can make decisions based on
the current state of the system, and optimize that policy using feedback from the
environment. ADP is often used in systems where the dynamics of the environment
are uncertain, and the system must adapt to changes in the environment over time.

ADP algorithms typically use a model-based approach to learn the dynamics of the
environment and to estimate the value function or Q-function, which is a function that
maps states and actions to expected cumulative rewards. These models are then used
to simulate the behavior of the system under different conditions, and to optimize the
control policy using iterative methods.
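
A minimal sketch of that model-based idea: count observed transitions to estimate the
dynamics and average rewards, then plan on the learned model with value-iteration
style sweeps. The experience format, state/action sets, and hyperparameters are
assumptions for illustration:

from collections import defaultdict

def learn_model(transitions):
    # transitions is assumed to be a list of (state, action, reward, next_state).
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    for s, a, r, s2 in transitions:
        counts[(s, a)][s2] += 1
        reward_sum[(s, a)] += r
    model = {}
    for sa, nexts in counts.items():
        total = sum(nexts.values())
        probs = {s2: c / total for s2, c in nexts.items()}
        model[sa] = (reward_sum[sa] / total, probs)   # (mean reward, transition probs)
    return model

def plan_with_model(model, states, actions, gamma=0.95, sweeps=100):
    # Value-iteration sweeps on the learned model (Bellman optimality backup).
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            backups = []
            for a in actions:
                if (s, a) not in model:
                    continue
                r, probs = model[(s, a)]
                backups.append(r + gamma * sum(p * v.get(s2, 0.0)
                                               for s2, p in probs.items()))
            if backups:
                v[s] = max(backups)
    return v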

ADP has been used in a wide range of applications, including control of complex
systems such as robots and autonomous vehicles, energy management systems, and
finance. ADP algorithms have also been used to develop intelligent decision-making
systems in healthcare, where they can be used to optimize patient care and treatment
plans.

Overall, ADP is a powerful approach to solving complex decision-making problems,
and has the potential to revolutionize many fields by enabling the development of
intelligent, adaptive systems that can learn from experience and adapt to changes in
the environment over time.

Here are a few examples of ADP in action:

1. Control of complex systems: ADP has been used to develop control systems for
complex processes such as chemical plants, power grids, and manufacturing
processes. In these applications, ADP is used to learn optimal control policies
that can adapt to changes in the environment and improve the performance of
the system.
2. Robotics: ADP has been used to develop control policies for robots that can
learn from experience and adapt to changes in the environment. For example,
ADP has been used to develop controllers for legged robots that can learn to
walk and navigate uneven terrain.
3. Finance: ADP has been used to develop intelligent trading systems that can
learn from market data and adapt to changes in market conditions. ADP
algorithms have also been used to optimize investment portfolios and to
develop risk management strategies.
4. Healthcare: ADP has been used to develop decision-making systems for
healthcare applications, such as personalized medicine and treatment planning.
ADP algorithms can be used to optimize treatment plans for individual patients,
taking into account their medical history, genetic information, and other factors.

Overall, ADP is a powerful technique for developing adaptive decision-making systems
that can learn from experience and adapt to changes in the environment. ADP has the
potential to revolutionize many fields by enabling the development of intelligent,
autonomous systems that can learn and adapt over time.

Temporal Difference Learning


Temporal difference (TD) learning is a type of reinforcement learning algorithm that
learns to predict the expected reward of an action based on the current state of the
environment. The TD algorithm learns to estimate the value function or Q-function,
which is a function that maps states and actions to expected cumulative rewards.

The TD algorithm works by comparing the expected reward of an action to the actual
reward received by the agent, and adjusting the estimated value of the action
accordingly. The difference between the expected and actual reward is called the TD
error, and is used to update the value function.
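
The core of that update fits in a few lines. The sketch below is the TD(0) rule for state
values; the dictionary-based value table, learning rate alpha, and discount factor gamma
are assumptions for illustration:

def td0_update(v, state, reward, next_state, alpha=0.1, gamma=0.99):
    # One temporal-difference update of a state-value table v (a dict).
    td_target = reward + gamma * v.get(next_state, 0.0)  # reward plus discounted estimate
    td_error = td_target - v.get(state, 0.0)             # the TD error
    v[state] = v.get(state, 0.0) + alpha * td_error      # adjust the estimate accordingly
    return td_error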

TD learning has several advantages over other reinforcement learning algorithms,
including its ability to learn online, i.e., in real-time, and its ability to learn from partial
information. TD learning has been applied to a wide range of applications, including
robotics, gaming, and finance.

One of the most popular TD algorithms is Q-learning, which is a model-free algorithm
that learns the optimal policy directly from the observed data, without requiring a
model of the environment. Q-learning has been used in a variety of applications,
including game playing and control of autonomous vehicles.

Overall, TD learning is a powerful tool for solving decision-making problems in
dynamic and uncertain environments, and has the potential to revolutionize many
fields by enabling the development of intelligent, adaptive systems that can learn from
experience and adapt to changes in the environment over time.

Active Reinforcement Learning
Active reinforcement learning is a type of reinforcement learning in which an agent
interacts with the environment by selecting actions to take, in order to maximize the
expected reward. In contrast to passive reinforcement learning, in which the agent
observes the environment without taking any actions, in active reinforcement learning,
the agent actively explores the environment to learn about the rewards associated with
different actions and states.

In active reinforcement learning, the agent uses a policy, which is a mapping from
states to actions, to select the next action to take. The policy can be either deterministic
or stochastic, and can be learned using a variety of techniques, such as Q-learning,
SARSA, or policy gradient methods.

One of the key challenges in active reinforcement learning is the exploration-exploitation
trade-off. In order to learn about the environment, the agent needs to
explore different actions and states, but at the same time, it also needs to exploit the
knowledge it has already gained in order to maximize the expected reward.
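
One simple and widely used way to manage this trade-off is epsilon-greedy action
selection: with a small probability the agent tries a random action (exploration),
otherwise it picks the action it currently believes is best (exploitation). The Q-table
and action list below are assumptions for illustration:

import random

def epsilon_greedy(q_table, state, actions, epsilon=0.1):
    # Explore with probability epsilon, otherwise exploit the current estimates.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))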

Active reinforcement learning has been applied to a wide range of applications,
including robotics, gaming, and control systems. One example of active reinforcement
learning in action is the development of autonomous vehicles, in which the vehicle
needs to learn how to navigate complex environments and make decisions in real-time
based on sensor data.

Overall, active reinforcement learning is a powerful technique for developing
intelligent, adaptive systems that can learn from experience and interact with the
environment in real-time, and has the potential to revolutionize many fields by
enabling the development of autonomous systems that can learn and adapt over time.

Q Learning
Q-learning is a type of model-free reinforcement learning algorithm, meaning that it
doesn't require any knowledge of the environment's dynamics or how the agent's
actions affect the environment. Instead, the algorithm tries to learn the optimal policy
by directly interacting with the environment and observing the rewards that result from
its actions.

The Q-learning algorithm involves building a Q-table, which is a lookup table that
stores the expected rewards for each state-action pair in the environment. At each
step, the agent observes the current state of the environment, selects an action based
on the maximum expected reward in that state, and receives a reward based on its
action. The Q-table is updated based on the observed reward and the transition to the
next state.

The update rule for the Q-table is based on the Bellman equation, which expresses the
expected reward of a state-action pair as the sum of the immediate reward and the
expected reward of the next state-action pair. The Q-learning algorithm updates the
Q-value of the current state-action pair by adding the difference between the observed
reward and the estimated expected reward, scaled by a learning rate.
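
Written out, one step of that update might look like the following sketch (the
dictionary-based Q-table, learning rate alpha, and discount factor gamma are
assumptions for illustration):

def q_learning_update(q_table, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    # Tabular Q-learning: move Q(s, a) toward reward + gamma * max_a' Q(s', a').
    best_next = max(q_table.get((next_state, a), 0.0) for a in actions)
    target = reward + gamma * best_next                  # Bellman target
    current = q_table.get((state, action), 0.0)
    q_table[(state, action)] = current + alpha * (target - current)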

Over time, the Q-table is updated based on the agent's experience, and the agent
gradually learns the optimal policy for the environment. Q-learning is known to
converge to the optimal policy under certain conditions, such as when the
environment is deterministic and the agent explores all possible state-action pairs.

Q-learning has been applied to a wide range of domains, including robotics, game
playing, recommendation systems, and more. Its simplicity and effectiveness make it a
popular algorithm in the field of reinforcement learning.

As noted above, in Q-learning the agent learns the optimal policy by directly
interacting with the environment and observing the rewards it receives. The underlying
decision-making problem is framed as a Markov Decision Process (MDP), a
mathematical framework used to model sequential decision-making under
uncertainty.

An MDP includes a set of states, a set of actions, a transition function, and a reward
function. The transition function specifies the probability of moving from one state to
another based on the selected action, while the reward function provides a scalar value
for each state-action pair. The goal of the agent is to learn a policy that maximizes the
expected cumulative reward over time.
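
As a toy illustration, a very small MDP can be written down explicitly as data: states,
actions, a transition function giving the probability of each next state, and a reward
function for each state-action pair. The two-state example below is invented purely to
show the shape of these components:

# A toy two-state MDP, invented for illustration.
states = ["A", "B"]
actions = ["stay", "move"]

# transition[(state, action)] maps each possible next state to its probability.
transition = {
    ("A", "stay"): {"A": 0.9, "B": 0.1},
    ("A", "move"): {"B": 1.0},
    ("B", "stay"): {"B": 1.0},
    ("B", "move"): {"A": 0.8, "B": 0.2},
}

# reward[(state, action)] gives the scalar reward for taking that action in that state.
reward = {
    ("A", "stay"): 0.0,
    ("A", "move"): 1.0,
    ("B", "stay"): 0.5,
    ("B", "move"): 0.0,
}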


Q-learning is a powerful tool for decision-making in complex and uncertain
environments, and has been applied to a wide range of domains, including robotics,
game playing, recommendation systems, and more. The mathematical concepts
underlying Q-learning make it possible to reason about decision-making in a
principled and rigorous way, and to design agents that can learn from experience and
improve their performance over time.
