Unit 5 Notes
The main characters of RL are the agent and the environment. The environment is the
world that the agent lives in and interacts with. At every step of interaction, the agent
sees a (possibly partial) observation of the state of the world, and then decides on an
action to take. The environment changes when the agent acts on it, but may also change
on its own.
The agent also perceives a reward signal from the environment, a number that tells it
how good or bad the current world state is. The goal of the agent is to maximize its
cumulative reward, called return. Reinforcement learning methods are ways that the
agent can learn behaviors to achieve its goal.
A state is a complete description of the state of the world. There is no information about
the world which is hidden from the state. An observation is a partial description of a
state, which may omit information.
In deep RL, we almost always represent states and observations by a real-valued vector,
matrix, or higher-order tensor. For instance, a visual observation could be represented
by the RGB matrix of its pixel values; the state of a robot might be represented by its
joint angles and velocities.
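As a concrete (and purely illustrative) sketch, states and observations in code are usually just arrays; the shapes below are arbitrary assumptions, not tied to any particular environment:

```python
import numpy as np

# A visual observation: height x width x 3 RGB channels (64x64 is an arbitrary choice).
image_obs = np.zeros((64, 64, 3), dtype=np.float32)

# A robot state as a flat vector of joint angles and joint velocities,
# assuming a hypothetical 7-joint arm.
joint_angles = np.zeros(7, dtype=np.float32)
joint_velocities = np.zeros(7, dtype=np.float32)
robot_state = np.concatenate([joint_angles, joint_velocities])  # shape (14,)
```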
When the agent is able to observe the complete state of the environment, we say that
the environment is fully observed. When the agent can only see a partial observation,
we say that the environment is partially observed.
Passive reinforcement learning can be used to estimate the value function of a given
policy, which is the expected cumulative reward that an agent would receive when
following that policy. This information can be used to evaluate the performance of the
policy and to improve it.
Passive reinforcement learning is often used in offline settings, where the agent has
access to a pre-collected dataset of experiences. The agent can use this dataset to
learn about the environment and to improve its policy, without interacting with the
environment in real-time.
Overall, passive reinforcement learning is a useful technique for learning about the
environment and improving policies in a variety of settings, especially when real-time
interaction with the environment is not possible or practical.
Direct utility estimation can be used in situations where the environment is stochastic,
meaning that the outcomes of actions are not deterministic, and the rewards
associated with those outcomes are uncertain. In such cases, the expected utility of an
action is used to make decisions that maximize the expected cumulative reward over
time.
Direct utility estimation treats the total reward observed from a state to the end of a
trial as a sample of that state's utility, and estimates the value function by averaging
these samples over many trials; it is essentially Monte Carlo estimation of the value
function, which maps states to expected cumulative rewards. Temporal-difference
learning and model-based methods such as ADP, discussed below, are alternative ways
to estimate the value function from the same kind of experience.
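A minimal sketch of direct utility estimation, assuming each trial is recorded as a list of (state, reward) pairs collected while following a fixed policy (the episode format and discount factor are illustrative assumptions):

```python
from collections import defaultdict

def direct_utility_estimation(episodes, gamma=0.9):
    """Estimate V(s) as the average discounted return observed from s onward.

    episodes: list of trials, each a list of (state, reward) pairs
              generated by following a fixed policy.
    """
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk the trial backwards so the return from each state
        # is accumulated in a single pass.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    # The utility estimate for each state is the sample mean of its observed returns.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```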
The key idea behind adaptive dynamic programming (ADP) is to learn a control policy
that makes decisions based on the current state of the system, and to improve that
policy using feedback from the environment. ADP is often used when the dynamics of
the environment are uncertain and the system must adapt to changes in the
environment over time.
ADP algorithms typically take a model-based approach: they learn a model of the
environment's dynamics from experience and use it to estimate the value function or
Q-function, which maps states and actions to expected cumulative rewards. The
learned model is then used to simulate the behavior of the system under different
conditions and to improve the control policy with iterative methods such as value or
policy iteration.
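A minimal sketch of this model-based idea for evaluating a fixed policy, assuming a small discrete state space and experience recorded as (state, action, reward, next_state) tuples (the data format, discount factor, and number of sweeps are illustrative assumptions):

```python
from collections import defaultdict

def adp_policy_evaluation(transitions, gamma=0.9, n_sweeps=100):
    """Learn a transition and reward model from experience, then solve for V
    by repeatedly applying the Bellman equation on the learned model."""
    counts = defaultdict(lambda: defaultdict(int))  # counts[s][s'] under the policy
    rewards = defaultdict(float)                    # running mean reward per state
    seen = defaultdict(int)
    for s, a, r, s2 in transitions:
        counts[s][s2] += 1
        seen[s] += 1
        rewards[s] += (r - rewards[s]) / seen[s]    # incremental mean

    # Estimated transition probabilities P(s' | s) under the fixed policy.
    probs = {s: {s2: c / seen[s] for s2, c in nexts.items()}
             for s, nexts in counts.items()}

    # Iterative policy evaluation on the learned model:
    # V(s) <- R(s) + gamma * sum_s' P(s'|s) * V(s')
    V = defaultdict(float)
    for _ in range(n_sweeps):
        for s in probs:
            V[s] = rewards[s] + gamma * sum(p * V[s2] for s2, p in probs[s].items())
    return dict(V)
```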
ADP has been used in a wide range of applications, including control of complex
systems such as robots and autonomous vehicles, energy management systems,
finance, and healthcare, where it can help optimize patient care and treatment plans.
Representative application areas include:
1. Control of complex systems: ADP has been used to develop control systems for
complex processes such as chemical plants, power grids, and manufacturing
processes. In these applications, ADP is used to learn optimal control policies
that can adapt to changes in the environment and improve the performance of
the system.
2. Robotics: ADP has been used to develop control policies for robots that can
learn from experience and adapt to changes in the environment. For example,
ADP has been used to develop controllers for legged robots that can learn to
walk and navigate uneven terrain.
3. Finance: ADP has been used to develop intelligent trading systems that can
learn from market data and adapt to changes in market conditions. ADP
algorithms have also been used to optimize investment portfolios and to
develop risk management strategies.
4. Healthcare: ADP has been used to develop decision-making systems for
healthcare applications, such as personalized medicine and treatment planning.
ADP algorithms can be used to optimize treatment plans for individual patients,
taking into account their medical history, genetic information, and other factors.
Temporal-difference (TD) learning works by comparing the current estimate of a
state's value with the reward actually received plus the discounted estimate of the
next state's value, and adjusting the value estimate accordingly. The difference
between these two quantities is called the TD error, and it is used to update the value
function.
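A minimal sketch of the TD(0) update for evaluating a policy, assuming a tabular value function stored in a dict and illustrative values for the learning rate and discount factor:

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update after observing the transition
    (state, reward, next_state) while following the current policy."""
    td_target = reward + gamma * V.get(next_state, 0.0)
    td_error = td_target - V.get(state, 0.0)         # the TD error
    V[state] = V.get(state, 0.0) + alpha * td_error  # move the estimate toward the target
    return td_error
```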
In active reinforcement learning, the agent must choose its own actions rather than
simply evaluate a fixed policy, so it has to balance exploring the environment with
exploiting what it has already learned. The agent uses a policy, a mapping from states
to actions, to select the next action to take. The policy can be deterministic or
stochastic, and can be learned using a variety of techniques, such as Q-learning,
SARSA, or policy gradient methods.
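One common way to balance exploration and exploitation is an epsilon-greedy policy over learned Q-values; a minimal sketch, assuming Q is a dict keyed by (state, action) and epsilon is an illustrative choice:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon; otherwise pick the
    action with the highest estimated Q-value in this state."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```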
Q-Learning
Q-learning is a type of model-free reinforcement learning algorithm, meaning that it
doesn't require any knowledge of the environment's dynamics or how the agent's
actions affect the environment. Instead, the algorithm tries to learn the optimal policy
by directly interacting with the environment and observing the rewards that result from
its actions.
The Q-learning algorithm builds a Q-table, a lookup table that stores the estimated
expected return for each state-action pair in the environment. At each step, the agent
observes the current state of the environment, selects an action (usually the one with
the highest Q-value in that state, mixed with some exploration such as epsilon-greedy),
and receives a reward for its action. The Q-table is then updated based on the observed
reward and the transition to the next state.
The update rule for the Q-table is based on the Bellman equation, which expresses the
value of a state-action pair as the immediate reward plus the discounted value of the
best action in the next state. Q-learning moves the Q-value of the current state-action
pair toward this target by adding the difference between the target and the current
estimate, scaled by a learning rate.
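A minimal sketch of the tabular Q-learning update described above, assuming Q is a dict keyed by (state, action) and the learning rate and discount factor are illustrative choices:

```python
def q_learning_update(Q, state, action, reward, next_state, next_actions,
                      alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]"""
    best_next = max((Q.get((next_state, a), 0.0) for a in next_actions), default=0.0)
    td_target = reward + gamma * best_next
    td_error = td_target - Q.get((state, action), 0.0)
    Q[(state, action)] = Q.get((state, action), 0.0) + alpha * td_error
    return Q[(state, action)]
```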
Over time, the Q-table is updated from the agent's experience, and the agent gradually
learns the optimal policy for the environment. Q-learning is known to converge to the
optimal Q-values under standard conditions: every state-action pair must be visited
infinitely often, and the learning rate must be decayed appropriately over time.
Q-learning has been applied to a wide range of domains, including robotics, game
playing, recommendation systems, and more. Its simplicity and effectiveness make it a
popular algorithm in the field of reinforcement learning.
In Q-learning, the agent tries to learn the optimal policy for an environment by directly
interacting with it and observing the rewards it receives. The problem of decision-
making in the environment is framed as a Markov Decision Process (MDP), which is a
mathematical framework used to model sequential decision-making under
uncertainty.
An MDP includes a set of states, a set of actions, a transition function, a reward
function, and usually a discount factor. The transition function specifies the probability
of moving from one state to another given the selected action, while the reward
function provides a scalar value for each state-action pair. The goal of the agent is to
learn a policy that maximizes the expected cumulative (discounted) reward over time.
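A minimal sketch of how these components might be laid out in code, using a tiny hypothetical two-state MDP as an illustration (all names and numbers are invented for the example):

```python
from dataclasses import dataclass, field

@dataclass
class MDP:
    states: list                  # set of states
    actions: list                 # set of actions
    transition: dict = field(default_factory=dict)  # (s, a) -> {s': probability}
    reward: dict = field(default_factory=dict)      # (s, a) -> scalar reward
    gamma: float = 0.9            # discount factor

# A hypothetical two-state, two-action MDP.
tiny = MDP(
    states=["A", "B"],
    actions=["stay", "move"],
    transition={
        ("A", "stay"): {"A": 1.0},
        ("A", "move"): {"B": 0.8, "A": 0.2},
        ("B", "stay"): {"B": 1.0},
        ("B", "move"): {"A": 0.8, "B": 0.2},
    },
    reward={("A", "stay"): 0.0, ("A", "move"): 1.0,
            ("B", "stay"): 0.0, ("B", "move"): 1.0},
)
```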