
CLASS OF 2022/2023

Artificial Intelligence
Lecture 05: Introduction to Reinforcement Learning

Presenter: Thomas

This lecture's summary

Our talk today: Introduction to reinforcement learning
 Introduction and examples
 Learning from experiences
 Reinforcement learning basics
 Markov Decision Processes (MDPs)
 The Bellman equation
 The Bellman equation and value iteration
 Introduction to Policy iteration
 Exploration and exploitation
 Q-learning as a model-free RL algorithm


Introduction to Reinforcement Learning

• Reinforcement learning is a type of machine learning that involves an agent learning how to
interact with an environment in order to maximize a cumulative reward signal.

• Reinforcement learning differs from other machine learning approaches, such as supervised
learning and unsupervised learning, in that it involves learning from feedback in the form of
rewards rather than from labeled data or patterns in data.

• Reinforcement learning has been successfully applied to a variety of problems, including
robotics, game playing, recommendation systems, and self-driving cars.


Reinforcement Learning in the Real World

• For example, reinforcement learning has been used to train robots to perform complex tasks
such as assembly and object manipulation.

• It has also been used to train agents to play games such as chess, Go,
and poker at a superhuman level.

• In addition, reinforcement learning has been used to personalize recommendations in online
platforms such as Netflix and Spotify, as well as to develop self-driving cars that can navigate
complex environments.


Learning from Experience

• Humans and animals are able to learn from experience through trial and
error.
They receive feedback in the form of rewards or
punishments based on their actions, which informs
their future decision-making

• Reinforcement learning is a machine learning approach that models this process.
In reinforcement learning, an agent interacts with
an environment and receives a reward signal based
on its actions. The agent's goal is to learn a
policy, or a mapping from states to actions, that
maximizes its cumulative reward over time.


Learning from Experience

• The reinforcement learning process involves three main components: the agent, the
environment, and the reward signal.
The agent takes actions in the environment based
on its current policy, and the environment
responds with a new state and a reward signal.
The agent uses this information to update its
policy, and the process repeats until the agent
learns an optimal policy.

• Reinforcement learning is particularly useful in situations where the optimal policy is not
known beforehand, and the agent must learn through trial and error. It has been successfully
applied to a wide range of problems, including robotics, game playing, and decision-making in
complex environments.

Reinforcement learning basics

• Reinforcement learning is a machine learning approach that involves an agent interacting with
an environment to maximize a cumulative reward signal. There are several basic components of
reinforcement learning that are critical to understanding how the agent learns:
 Agent: The agent is the entity that interacts
with the environment. It takes actions based on
the current state and its current policy.
 Environment: The environment is the world in
which the agent operates. It provides feedback
to the agent in the form of a new state and a
reward signal based on the agent's actions.
 State: The state is a representation of the
current situation in the environment. The
agent's actions are determined by its current
state and its policy.

Reinforcement learning basics

• Reinforcement learning is a machine learning approach that involves an agent interacting with
an environment to maximize a cumulative reward signal. There are several basic components of
reinforcement learning that are critical to understanding how the agent learns:
 Action: The action is the decision made by the
agent based on its current state and policy.
 Reward: The reward is a scalar signal that the
agent receives from the environment based on
its actions. The agent's goal is to maximize
its cumulative reward over time.
 Policy: The policy is the mapping from states
to actions. It determines the agent's behavior
in the environment.


Reinforcement learning basics

• The basic interaction between the agent and the environment in a reinforcement learning
setting is as follows:
 The agent observes the current state of the
environment.
 Based on its current policy, the agent selects
an action to take.
 The environment responds with a new state and a
reward signal.
 The agent updates its policy based on the
observed state, action, and reward, and the
process repeats. Over time, the agent learns an
optimal policy that maximizes its cumulative
reward.
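
As a rough illustration, this interaction loop can be written as a short Python sketch. The env and agent objects, and their reset/step/select_action/update methods, are hypothetical placeholders rather than anything defined in the lecture:

```python
# A minimal sketch of the agent-environment loop described above.
# `env` and `agent` are hypothetical placeholders, not from the slides.

def run_episode(env, agent, max_steps=100):
    state = env.reset()                                   # observe the initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.select_action(state)               # act according to the current policy
        next_state, reward, done = env.step(action)       # environment responds with s' and R
        agent.update(state, action, reward, next_state)   # use the feedback to improve the policy
        total_reward += reward
        state = next_state
        if done:                                          # terminal state reached
            break
    return total_reward
```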


Markov Decision Processes (MDPs)

• Markov Decision Processes (MDPs) are a key concept in reinforcement learning. They are used
to model sequential decision-making problems, where an agent takes actions in an environment
to maximize long-term rewards.
• MDPs provide a framework for analyzing and solving such problems mathematically.
• Formalization of Interaction:
 In an MDP, the agent interacts with the environment over a series of
discrete time steps.
 At each time step, the agent observes the current state of the
environment and takes an action based on its current policy.
 The environment then transitions to a new state and provides the
agent with a reward based on this transition.
 The process continues until a terminal state is reached, and the agent
aims to maximize the cumulative rewards obtained over time.

Markov Decision Processes (MDPs)

• Key Components of an MDP:


 States: A set of possible states that the environment can be in. These
states capture the relevant information about the environment for
decision making.
 Actions: A set of actions that the agent can take. The available actions
depend on the current state of the environment.
 Transition Probabilities: The probabilities of transitioning from one
state to another when taking a specific action. These probabilities
define the dynamics of the environment.
 Rewards: The immediate rewards obtained by the agent when it takes
a specific action in a particular state. The agent's objective is to
maximize the cumulative rewards over time.
 Policy: A policy determines the agent's behavior by specifying the
action to take in each state. It maps states to actions and guides the
agent's decision-making process.

Markov Decision Processes (MDPs)

• Consider a simple grid world environment represented by a 3x3 grid. The agent can navigate
through this grid by taking actions of moving up, down, left, or right. The objective of the agent
is to reach a specific goal state while avoiding obstacles.
States: In this example, the set of states S = {S1, S2, S3, S4, S5, S6, S7, S8, S9},
where each state corresponds to a specific position in the grid:
  S1 S2 S3
  S4 S5 S6
  S7 S8 S9
Actions: The set of actions A = {Up, Down, Left, Right}. The agent can choose one of these
actions at each state to move in the desired direction.
Transition Probabilities: Let's define the transition probabilities P(s'|s, a), which represent the
probability of transitioning to state s' when taking action a in state s.
For example, let's consider the transition from state S1 (top-left corner) when the agent takes
the action "Right". In this case, there are three possibilities:
If the agent successfully moves to state S2, the transition probability P(S2|S1, Right) = 1.
If the agent hits a wall and remains in state S1, the transition probability P(S1|S1, Right) = 0.
If the agent moves to state S4 (downward), the transition probability P(S4|S1, Right) = 0.
Similarly, transition probabilities for other actions and states can be defined based on the
specific dynamics of the environment.
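
To make this concrete, here is a minimal Python sketch of how the grid-world transition model could be represented. The nested-dictionary layout and the transition_prob helper are illustrative choices, and only the S1 entries from the example above are filled in:

```python
# A sketch of the 3x3 grid-world transition model P(s'|s, a) as a nested dict.
# Only a few deterministic entries are filled in, mirroring the example above;
# a full model would enumerate every (state, action) pair.

states = [f"S{i}" for i in range(1, 10)]          # S1 ... S9
actions = ["Up", "Down", "Left", "Right"]

# P[s][a] maps a successor state s' to P(s'|s, a)
P = {
    "S1": {
        "Right": {"S2": 1.0},                     # P(S2|S1, Right) = 1.0
        "Down":  {"S4": 1.0},
        "Up":    {"S1": 1.0},                     # hits the wall and stays in S1
        "Left":  {"S1": 1.0},
    },
    # ... remaining states defined analogously
}

def transition_prob(s, a, s_next):
    """Return P(s'|s, a); unspecified transitions have probability 0."""
    return P.get(s, {}).get(a, {}).get(s_next, 0.0)

print(transition_prob("S1", "Right", "S2"))       # 1.0
print(transition_prob("S1", "Right", "S4"))       # 0.0
```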

Markov Decision Processes (MDPs)

• Consider a simple grid world environment represented by a 3x3 grid. The agent can navigate
through this grid by taking actions of moving up, down, left, or right. The objective of the agent
is to reach a specific goal state while avoiding obstacles.
Rewards: Assigning rewards to states and actions is another important aspect of MDPs. Let's
define the reward function R(s, a, s'), which gives the immediate reward obtained when
transitioning from state s to s' by taking action a. In our grid world example, we can assign a
positive reward of +10 to the goal state and a negative reward of -1 to obstacle states. The rest
of the states can have a small negative reward of -0.1 to encourage the agent to reach the goal
quickly while avoiding obstacles.

Policy: The policy π(s) determines the agent's behavior by specifying the action to take in each
state. It can be deterministic or stochastic. For example, a deterministic policy could be defined
as follows:
π(S1) = Right   π(S4) = Right   π(S7) = Up
π(S2) = Down    π(S5) = Right   π(S8) = Up
π(S3) = Down    π(S6) = Right   π(S9) = -
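
A matching sketch of the reward function and this deterministic policy, again with an illustrative layout; the choice of S9 as the goal and the obstacle set are assumptions used only for the example:

```python
# A sketch of the reward function R(s, a, s') and the deterministic policy above.
# GOAL and OBSTACLES are assumptions for illustration; the slide does not name them.

GOAL = "S9"
OBSTACLES = {"S5"}

def reward(s, a, s_next):
    """R(s, a, s'): +10 for reaching the goal, -1 for obstacle states, -0.1 otherwise."""
    if s_next == GOAL:
        return 10.0
    if s_next in OBSTACLES:
        return -1.0
    return -0.1

# Deterministic policy π(s) from the slide; S9 is terminal, so no action is defined there.
policy = {
    "S1": "Right", "S2": "Down",  "S3": "Down",
    "S4": "Right", "S5": "Right", "S6": "Right",
    "S7": "Up",    "S8": "Up",    "S9": None,
}

print(policy["S1"], reward("S8", "Right", "S9"))   # Right 10.0
```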

Markov Decision Processes (MDPs)

• Consider a simple grid world environment represented by a 3x3 grid:
  S1 S2 S3
  S4 S5 S6
  S7 S8 S9
• Example: let's consider a non-deterministic transition matrix for the above world when
performing the actions "Up", "Down", "Left" and "Right" to transition from state Si to Si+1.
(The transition probability matrix itself is shown as a table on the original slide.)


Markov Decision Processes (MDPs)

• Preceding example
 In that matrix, the rows represent the current states (S1, S2, S3,
S4, S5, S6, S7, S8, S9), and the columns represent the actions
(Up, Down, Left, Right) that the agent can take in each state.
 The values in the matrix represent the probabilities of transitioning
from the current state to the new state after performing the
corresponding action.
 For example, when the agent is in state S1 and takes the action "Right", there is a
100% probability of transitioning to state S2 (indicated as 1.0 (S2)). The probabilities of
transitioning to any other state are set to 0.0, indicating that those transitions are not
possible when performing the action "Right" from state S1.
 This can be written as P(S2|S1, Right) = 1.0 (see the transition probabilities defined in
the earlier grid-world example).


The Bellman equation

• The Bellman equation is a fundamental concept in reinforcement learning that relates to the
value function.
It provides a way to express the value of a state or
state-action pair in terms of the expected
cumulative reward obtained by following a specific
policy

• The value function represents the expected cumulative reward an agent can achieve when
starting from a particular state and following a specific policy.
It quantifies the desirability or utility of being in
a given state or taking a specific action in a
state.


The Bellman equation

• The Bellman equation expresses the relationship between the value of a state or state-action
pair and the values of its successor states.
It takes into account the immediate reward obtained
from transitioning from the current state to the
successor state, as well as the expected future
rewards discounted by a factor γ (gamma).

• For a state-action pair (s, a), the Bellman equation is defined as:

V(s, a) = Σ_s' [P(s'|s, a) * (R(s, a, s') + γ * V(s'))]


The Bellman equation

• For a state-action pair (s, a), the Bellman equation is defined as:

V(s, a) = Σ_s' [P(s'|s, a) * (R(s, a, s') + γ * V(s'))]

• Where:
 V(s, a) is the value of the state-action pair (s, a).
 P(s'|s, a) is the probability of transitioning to state s' from state s when action a is taken.
 R(s, a, s') is the immediate reward obtained when transitioning from state s to state s' by
taking action a.
 γ (gamma) is the discount factor, which determines the relative importance of immediate
and future rewards.
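
As a sanity check, the equation can be evaluated directly for a single state-action pair. The probabilities, rewards, and current values below are illustrative numbers, not taken from a specific slide:

```python
# Evaluating the Bellman equation V(s, a) = Σ_s' P(s'|s, a) * (R(s, a, s') + γ * V(s'))
# for one state-action pair, using illustrative numbers.

gamma = 0.9

# Each entry: (probability P(s'|s, a), immediate reward R(s, a, s'), current value V(s'))
outcomes = [
    (0.8, -0.1, 0.0),   # intended move succeeds
    (0.1, -0.1, 0.0),   # slips sideways
    (0.1,  0.0, 0.0),   # bumps into a wall and stays put
]

value = sum(p * (r + gamma * v_next) for p, r, v_next in outcomes)
print(round(value, 4))   # -0.09 with these illustrative numbers
```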


The Bellman equation

• The Bellman equation essentially states that the value of a state-action pair is the sum of the
immediate reward and the discounted value of the successor states, weighted by their
probabilities of being reached.
It considers the expected future rewards based on the
agent's policy and accounts for the uncertainty in the
transitions.
• It is a recursive relationship that allows us to iteratively update the value
function until it converges to the optimal values. By repeatedly applying the
Bellman equation, we can compute the values of all state-action pairs and
ultimately determine the optimal policy by selecting the actions that
maximize the value function.


The Bellman equation

• Example
 Assume the agent is currently in state S1 and performs the action "Down" to transition to
state S4. We'll use the transition probabilities and rewards mentioned earlier.
 NB: the entries of this transition probability matrix (shown on the original slide) are slightly
different from the earlier one.


The Bellman equation and value iteration

Immediate Rewards:
R(s, a, s') = -0.1 for all transitions into non-goal states
R(s, a, S9) = +10 for reaching the goal state S9
Discount Factor: γ = 0.9

Now, let's compute the value of the state-action pair (S1, Down) using the
Bellman equation:

V(S1, Down) = Σ_s' [P(s'|S1, Down) * (R(S1, Down, s') + γ * V(s'))]
V(S1, Down) = 0.8 * (-0.1 + 0.9 * V(S4)) + 0.1 * (-0.1 + 0.9 * V(S1))
            + 0.1 * (-0.1 + 0.9 * V(S2))


The Bellman equation and value iteration

• In the preceding computation, V(S4), V(S1), and V(S2) represent the values of states S4, S1,
and S2, respectively.

• To compute the optimal value function and policy, we would repeat this
process for each state-action pair, updating the value function iteratively
until convergence. The optimal policy would be determined by selecting
the actions with the highest value for each state.


The Bellman equation and value iteration

• To demonstrate the value iteration algorithm until convergence is reached, let's use the same
grid world example and the Bellman equation we discussed earlier.
• We'll start with an initial estimate of the value function for each state:
1. Initialize: set V(S1) = V(S2) = ... = V(S9) = 0.
Now, we can perform the value iteration algorithm until convergence is reached.
The algorithm consists of iteratively updating the value function for each state until the values
stabilize.


The Bellman equation and value iteration

2. Set a convergence threshold (ε) to determine when to stop iterating.

3. Repeat until convergence:

For each state s in S:


Calculate the value for each possible action a:

V(s) = max_a Σ_s' [P(s'|s, a) * (R(s, a, s') + γ * V(s'))]
4. Update the above value function for the state
5. Check for convergence:
If the maximum change in any value V(s) is less than the convergence
threshold (i.e., max|V(s)- V_prev(s)|<ε), stop iterating.
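
Putting steps 1 to 5 together, a compact value iteration sketch might look as follows. It assumes states, actions, transition_prob(s, a, s') and reward(s, a, s') helpers like the illustrative ones sketched earlier:

```python
# A sketch of the value iteration loop from steps 1-5 above.
# `states`, `actions`, `transition_prob`, `reward` are the illustrative helpers from earlier.

def value_iteration(states, actions, transition_prob, reward,
                    gamma=0.9, epsilon=1e-4):
    V = {s: 0.0 for s in states}                     # Step 1: initialize V(s) = 0
    while True:                                      # Step 3: repeat until convergence
        V_new = {}
        for s in states:
            V_new[s] = max(                          # V(s) = max_a Σ_s' P * (R + γ V(s'))
                sum(transition_prob(s, a, s2) * (reward(s, a, s2) + gamma * V[s2])
                    for s2 in states)
                for a in actions
            )
        # Step 5: stop when the largest change is below the threshold ε
        if max(abs(V_new[s] - V[s]) for s in states) < epsilon:
            return V_new
        V = V_new                                    # Step 4: update the value function
```

The greedy policy can then be read off by taking, in each state, the action that maximizes the same one-step lookahead.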


The Bellman equation and value iteration

• By repeatedly updating the value function using the Bellman equation, we can compute the
optimal values for each state.
• The optimal policy can be derived by selecting the action with the highest
value for each state.


The Bellman equation and value iteration

• Example: using the initialized values V(S1) = V(S2) = ... = V(S9) = 0 from the
initialization step above.

Using the Bellman equation for S1 to S9, we update the value function for each state
(for brevity, only the action Up is shown inside the max):

Iteration 1 for S1:
V(S1) = max[Up][0.8*(R(S1,Up,S2) + γ*V(S2))
              + 0.1*(R(S1,Up,S1) + γ*V(S1))
              + 0.1*(R(S1,Up,S4) + γ*V(S4))]
      = max[Up][0.8*(-0.1 + 0.9*V(S2)) + 0.1*(0.0 + 0.9*V(S1))
              + 0.1*(-0.1 + 0.9*V(S4))]
      = max[Up][0.8*(-0.1 + 0.9*0.0) + 0.1*(0.0 + 0.9*0.0)
              + 0.1*(-0.1 + 0.9*0.0)]
      = max[Up][-0.08 + 0.0 + (-0.01)]
      = max[Up][-0.09]
      = -0.09


The Bellman equation and value iteration

• Example (continued): using the same initialized values V(S1) = V(S2) = ... = V(S9) = 0.

Iteration 1 for S2:
V(S2) = max[Up][0.1*(R(S2,Up,S1) + γ*V(S1))
              + 0.1*(R(S2,Up,S2) + γ*V(S2))
              + 0.8*(R(S2,Up,S3) + γ*V(S3))]
      = max[Up][0.1*(0.0 + 0.9*V(S1)) + 0.1*(0.0 + 0.9*V(S2))
              + 0.8*(0.0 + 0.9*V(S3))]
      = max[Up][0.1*(0.0 + 0.9*0.0) + 0.1*(0.0 + 0.9*0.0)
              + 0.8*(0.0 + 0.9*0.0)]
      = max[Up][0.0 + 0.0 + 0.0]
      = max[Up][0.0]
      = 0.0


The Bellman equation and value iteration

• Similarly, in Iteration 1 we update the value function for S3, S4, S5, S6, S7, S8, and S9
(i.e., V(S3), ..., V(S9)).

Again, using the updated value function from Iteration 1 (V(S1) = -0.09; the other values are
substituted as 0.0 below), we repeat the process:

Iteration 2 for S1:
V(S1) = max[Up][0.8*(R(S1,Up,S2) + γ*V(S2))
              + 0.1*(R(S1,Up,S1) + γ*V(S1))
              + 0.1*(R(S1,Up,S4) + γ*V(S4))]
      = max[Up][0.8*(-0.1 + 0.9*V(S2)) + 0.1*(0.0 + 0.9*V(S1))
              + 0.1*(-0.1 + 0.9*V(S4))]
      = max[Up][0.8*(-0.1 + 0.9*0.0) + 0.1*(0.0 + 0.9*(-0.09))
              + 0.1*(-0.1 + 0.9*0.0)]
      = max[Up][-0.08 + (-0.0081) + (-0.01)]
      = max[Up][-0.0981]
      = -0.0981


The Bellman equation and value iteration

Again, using the updated value function from Iteration 1, we repeat the process:

Iteration 2 for S2:
V(S2) = max[Up][0.1*(R(S2,Up,S1) + γ*V(S1))
              + 0.1*(R(S2,Up,S2) + γ*V(S2))
              + 0.8*(R(S2,Up,S3) + γ*V(S3))]
      = max[Up][0.1*(0.0 + 0.9*V(S1)) + 0.1*(0.0 + 0.9*V(S2))
              + 0.8*(0.0 + 0.9*V(S3))]
      = max[Up][0.1*(0.0 + 0.9*(-0.09)) + 0.1*(0.0 + 0.9*0.0)
              + 0.8*(0.0 + 0.9*0.0)]
      = max[Up][-0.0081 + 0.0 + 0.0]
      = max[Up][-0.0081]
      = -0.0081
Likewise, we update the value function for S3, S4, S5, S6, S7, S8, and S9 in Iteration 2.

Introduction to Policy iteration

• Policy iteration is an alternative approach to finding an optimal policy in reinforcement
learning. It involves an iterative process that combines policy evaluation and policy
improvement to converge towards the optimal policy.
• The main idea behind policy iteration is to start with an initial policy and
iteratively improve it until convergence. Here is an overview of the steps
involved:
1. Initialize
2. Policy Evaluation
3. Policy Improvement
4. Convergence Check


Introduction to Policy iteration

• Initialize: Start with an initial policy π_0.


• Policy Evaluation: Given the current policy π_i, evaluate its value function
V_i(s) for all states s. This step calculates the expected cumulative
rewards that an agent can achieve by following the policy π_i. The value
function V_i(s) represents the long-term expected reward starting from
state s and following policy π_i. This evaluation can be done using
techniques like iterative policy evaluation or solving the Bellman
equations.
• Policy Improvement: With the current value function V_i(s), update the
policy to get a new policy π_{i+1}. This step involves selecting the best
action for each state based on the value function. The new policy π_{i+1}
is determined by choosing the action that maximizes the expected
cumulative reward in each state.


Introduction to Policy iteration

• Convergence Check: Check if the new policy π_{i+1} is equal to the previous policy π_i. If the
policies are the same, the process has converged, and the optimal policy π* has been found.
Otherwise, go back to step 2 and repeat the process with the updated policy.

NB: By iteratively alternating between policy evaluation and policy improvement, policy
iteration guarantees convergence to the optimal policy. The policy evaluation step ensures that
the value function accurately represents the expected cumulative rewards of following the
current policy. The policy improvement step then updates the policy based on the improved
value function, leading to a better policy in each iteration.

• Policy iteration is a robust method for finding optimal policies, especially in environments with
complex dynamics and large state spaces. However, it can be computationally expensive due to
the repeated policy evaluation steps.

Policy iteration algorithm step-wise

Step 1: Initialization
Initialize the value function V(s) for all states to 0.

Step 2: Policy Evaluation


Calculate the new values V'(s) for each state using the Bellman expectation
equation:
For each state s:
Using the action a = π(s) prescribed by the current policy, calculate the expected
value: V'(s) = Σ_s' [P(s'|s, π(s)) * (R(s, π(s), s') + γ * V(s'))]
Repeat the policy evaluation process until the values converge:
Update V(s) to V'(s) for all states.


Policy iteration algorithm step-wise

Step 3: Policy Improvement


For each state s, calculate the new policy π'(s) by selecting the action that
maximizes the expected value:
For each state s:
For each action a:
Calculate the expected value for taking action a in state s: Q(s, a)
= Σ_s' [P(s'|s, a) * (R(s, a, s') + γ * V(s'))]
Update the policy to select the action with the highest expected
value: π'(s) = argmax[a] Q(s, a)
If the policy π'(s) is different from the current policy π(s) for any state, update
the policy π(s) to π'(s).
Step 4: Convergence Check
Repeat steps 2 and 3 until the policy π(s) and value function V(s) converge
and do not change significantly between iterations.
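
A compact sketch of these four steps, reusing the illustrative states, actions, transition_prob and reward helpers from the earlier sketches:

```python
# A sketch of policy iteration: repeated policy evaluation and greedy policy improvement.
# `states`, `actions`, `transition_prob`, `reward` are the illustrative helpers from earlier.

def policy_iteration(states, actions, transition_prob, reward,
                     gamma=0.9, eval_epsilon=1e-4):
    policy = {s: actions[0] for s in states}            # Step 1: arbitrary initial policy
    V = {s: 0.0 for s in states}

    def q_value(s, a):
        # One-step lookahead: Σ_s' P(s'|s, a) * (R(s, a, s') + γ * V(s'))
        return sum(transition_prob(s, a, s2) * (reward(s, a, s2) + gamma * V[s2])
                   for s2 in states)

    while True:
        # Step 2: policy evaluation - iterate V(s) <- Q(s, π(s)) until it stabilizes
        while True:
            V_new = {s: q_value(s, policy[s]) for s in states}
            delta = max(abs(V_new[s] - V[s]) for s in states)
            V = V_new
            if delta < eval_epsilon:
                break
        # Step 3: policy improvement - act greedily with respect to the evaluated V
        new_policy = {s: max(actions, key=lambda a: q_value(s, a)) for s in states}
        # Step 4: convergence check - stop when the policy no longer changes
        if new_policy == policy:
            return policy, V
        policy = new_policy
```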


Exploration and exploitation

• Exploration and exploitation are two essential concepts in reinforcement learning and
decision-making processes.

• The exploration-exploitation tradeoff refers to the dilemma of choosing between exploring
unknown options (exploration) and exploiting the currently known best option (exploitation).


Exploration and exploitation

• Exploration: Exploration involves taking actions that are not yet well
understood or have uncertain outcomes. It aims to gather more
information about the environment and discover potentially better
strategies or rewards. By exploring, an agent can learn about new states,
actions, and their corresponding rewards. However, exploration can be
risky and may lead to suboptimal immediate rewards.

• Exploitation: Exploitation involves taking actions that are known to yield the best current
results based on the agent's current knowledge or policy.
It aims to maximize the immediate rewards by selecting actions that have
previously shown to be successful. Exploitation exploits the existing
knowledge to make decisions that are likely to yield the highest expected
rewards. However, relying solely on exploitation may cause the agent to
miss out on better actions or strategies that it has not yet explored.


Exploration and exploitation

• To balance exploration and exploitation, a common approach is to use an epsilon-greedy
policy.
• In an epsilon-greedy policy, the agent selects the best-known action
(exploitation) most of the time but occasionally chooses a random action
(exploration) with a small probability (epsilon). This allows the agent to
exploit the currently known best action while still exploring other
possibilities.
• For example, with an epsilon value of 0.1, the agent
would select the best action with a probability of 0.9
(1 - epsilon) and choose a random action with a
probability of 0.1 (epsilon). This way, the agent can
continuously update its knowledge and potentially
discover better actions while exploiting the current
best actions most of the time.
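
A minimal epsilon-greedy selection sketch; the q_values dictionary and the epsilon value of 0.1 are illustrative:

```python
import random

# Epsilon-greedy action selection as described above: with probability epsilon explore
# (random action), otherwise exploit the best-known action.

def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:                       # explore
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)              # exploit the best-known action

q_values = {"Up": 0.2, "Down": -0.1, "Left": 0.0, "Right": 0.5}
print(epsilon_greedy(q_values))   # usually "Right", occasionally a random action
```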


Model-based and model-free

• Model-based and model-free methods are two distinct approaches in reinforcement learning
that differ in how they interact with and learn from the environment.
• Model-based methods: Model-based methods involve
building an explicit model of the environment, which
includes capturing the dynamics of the system. This
model represents the transition probabilities and
rewards associated with different states and actions.
With this model, the agent can simulate and plan its
actions ahead of time, predicting the outcomes of
different actions in different states. The agent then
uses these predictions to make informed decisions.


Model-based and model-free

• The key steps involved in model-based methods are:


a) Learning the model: The agent collects data from the environment
and learns the transition probabilities and rewards using techniques
such as direct observation, system identification, or statistical
methods.
b) Planning: The agent uses the learned model to simulate and explore potential
sequences of actions, evaluating the expected rewards for each sequence.
c) Decision-making: Based on the simulated sequences and their expected rewards,
the agent selects the best action to take in the current state.
• Example: One example of a model-based method is Monte Carlo Tree Search (MCTS),
used in game-playing systems such as AlphaGo. It builds a tree structure of possible
actions and their outcomes based on a model of the game's rules, and then uses this
tree to guide its decision-making by exploring different paths and estimating the
expected rewards.

Model-based and model-free

• Model-free methods, on the other hand, do not require an explicit model of the environment.
Instead, they directly learn from interactions with the environment by trial and error. These
methods focus on learning the optimal policy without explicitly representing the transition
probabilities and rewards.
• The key steps involved in model-free methods are:
a) Exploration: The agent interacts with the environment, taking actions
and observing the resulting states and rewards.
b) Value estimation: The agent learns to estimate the value or expected
return of different states or state-action pairs. This can be done
through techniques like temporal difference learning or Q-learning.
c) Policy improvement: Based on the value estimates, the agent
updates its policy to select actions that maximize the expected
return.


Model-based and model-free

• Example: Q-learning is a well-known model-free method.


It iteratively updates the Q-values, which represent
the expected rewards for taking a particular action in
a specific state. The agent learns the optimal policy
by repeatedly exploring the environment, updating Q-
values based on observed rewards, and improving its
decision-making over time.

• The main difference between model-based and model-free methods is the presence or
absence of an explicit model of the environment. Model-based methods rely on the learned
model to simulate and plan actions, while model-free methods directly learn from interactions
without explicitly representing the environment's dynamics.


Model-based and model-free

• Both approaches have their advantages and drawbacks. Model-based methods can plan
ahead and make more informed decisions but require accurate models, which may be
challenging to obtain. Model-free methods, on the other hand, can handle complex and
unknown environments but may require more interactions to learn the optimal policy.

• In practice, the choice between model-based and model-free methods depends on the
specific problem, the availability of a reliable model, and the trade-off between computation
time and learning performance.


Reinforcement Learning Algorithms

• Q-Learning: Q-Learning is a model-free RL algorithm that learns an action-value function,
called the Q-function, to determine the value of taking a specific action in a given state. It uses
an exploration-exploitation strategy to balance between exploring new actions and exploiting
the current knowledge.
• Deep Q-Network (DQN): DQN is an extension of Q-Learning that
utilizes deep neural networks to approximate the Q-function.
It employs a technique called experience replay to train the
network using a replay buffer, which stores past experiences
to break the correlation between consecutive samples.
• Policy Gradient Methods: Policy Gradient methods directly
optimize the policy, which is a mapping from states to
actions, by estimating the gradient of expected rewards with
respect to the policy parameters. Examples include REINFORCE,
Proximal Policy Optimization (PPO), and Trust Region Policy
Optimization (TRPO).

Reinforcement Learning Algorithms

• Actor-Critic Methods: Actor-Critic methods combine policy-based and value-based
approaches. They have an actor network that learns the policy and a critic network that
estimates the value function. The actor takes actions based on the policy, and the critic
evaluates the actions and provides feedback to update the actor. Examples include Advantage
Actor-Critic (A2C) and Asynchronous Advantage Actor-Critic (A3C).
• Proximal Policy Optimization (PPO): PPO is a state-of-the-art
policy optimization algorithm that belongs to the family of
policy gradient methods. It optimizes a surrogate objective
function, which ensures that policy updates do not deviate
too far from the previous policy distribution, preventing
large policy changes.
• Deep Deterministic Policy Gradient (DDPG): DDPG is an off-policy RL
algorithm that combines the actor-critic framework with deep
neural networks. It is designed for continuous action spaces
and uses a separate target network to stabilize learning.

Reinforcement Learning Algorithms

• Twin Delayed Deep Deterministic Policy Gradient (TD3): TD3 is an improvement over DDPG
that addresses issues of overestimation and exploration. It uses twin critics and delayed
updates to improve stability and learning efficiency.


Q-learning in RL

• Definition: Q-Learning is a model-free reinforcement learning algorithm that learns the
action-value function to determine the value of taking an action in a given state.

• Objective: Find the optimal policy by maximizing the cumulative reward.

• Q-Learning is an off-policy algorithm that learns independently of the policy being followed.
 It is considered an off-policy algorithm because it
learns independently of the policy being followed
during exploration.
 It maintains a separate policy, often called the
behavior policy, to explore and collect experiences.
However, it updates its Q-values based on the
optimal policy, known as the target policy, which is
the policy it aims to converge to.

Q-learning in RL

• Key Concepts
 State (S): Represents the current state of the environment.

 Action (A): Possible actions that the agent can take in a given state.

 Reward (R): Feedback signal indicating the desirability of an action in a specific state.

 Q-Value (Q): The value of taking an action in a given state.


Q-learning in RL

• Q-Value Function
 Q-Value Function (Q-function): Estimates the expected cumulative
reward for taking an action in a particular state and following a specific
policy thereafter.
 Formula: Q(s, a) ← Q(s, a) + α * [R + γ * max_a' Q(s', a') - Q(s, a)]
 Q(s, a): Q-value for the state-action pair (s, a).
 α (alpha): Learning rate; determines the impact of new experiences on the Q-values.
 R: Immediate reward for taking action a in state s.
 γ (gamma): Discount factor; balances the importance of immediate and future rewards.
 max_a' Q(s', a'): Maximum Q-value over all possible actions a' in the next state s'.
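
The update rule translates directly into a small helper function; the numbers in the example call are illustrative:

```python
# A direct transcription of the Q-learning update above; all arguments are plain numbers.

def q_update(q_sa, reward, max_q_next, alpha=0.5, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (R + gamma * max_a' Q(s',a') - Q(s,a))."""
    return q_sa + alpha * (reward + gamma * max_q_next - q_sa)

print(q_update(q_sa=0.0, reward=1.0, max_q_next=0.0))   # 0.5
```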

Q-learning in RL

• Exploration vs. Exploitation


 Exploration: Discovering new actions to gain information about the
environment.

 Exploitation: Taking actions based on the current knowledge to maximize rewards.

 Exploration-Exploitation Tradeoff: Balancing between exploring new actions and exploiting
current knowledge.


Q-learning in RL

• Q-Learning Algorithm Steps


1. Initialize Q(s, a) arbitrarily for all state-action pairs.
2. Choose an initial state s.
3. Repeat until convergence
a. Select an action a based on the exploration-exploitation
strategy.
b. Take action a, observe the reward R, and transition to the next
state s'.
c. Update the Q-value using the Q-Learning formula.
d. Set s = s'.
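
A sketch of the full loop with epsilon-greedy action selection. The env object with reset()/step(action) methods is a hypothetical placeholder consistent with the earlier interaction-loop sketch:

```python
import random

# A sketch of the Q-learning steps above, combined with epsilon-greedy exploration.
# `env` is a hypothetical environment with reset()/step(action), as sketched earlier.

def q_learning(env, states, actions, episodes=500,
               alpha=0.5, gamma=0.9, epsilon=0.1):
    Q = {(s, a): 0.0 for s in states for a in actions}    # Step 1: initialize Q(s, a)
    for _ in range(episodes):
        s = env.reset()                                   # Step 2: initial state
        done = False
        while not done:                                   # Step 3: repeat until episode ends
            if random.random() < epsilon:                 # 3a: epsilon-greedy action choice
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)                 # 3b: act, observe R and s'
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # 3c: update Q-value
            s = s_next                                    # 3d: move to the next state
    return Q
```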


Q-learning in RL

• Learning Algorithm Pseudocode

• Exploration-Exploitation Strategy: Epsilon-Greedy

 Epsilon-Greedy: Selects a random action with probability epsilon (ε) and otherwise chooses
the action with the highest Q-value.

 Balances between exploration (random actions) and exploitation (best Q-value actions).

Q-learning in RL

• Q-Learning Algorithm Example
 Consider a 3x3 Gridworld with the following properties:
  Start state: S(1, 1)
  Goal state: S(3, 3)
  Prohibited state: S(2, 2)
  Actions: Up, Down, Left, Right
  Rewards: +1 for reaching the goal state, -1 for entering the prohibited state,
  0 for all other transitions
 Here's a representation of the Q-Values for each state-action pair in a table
 (shown on the original slide).
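
A possible Python sketch of this 3x3 Gridworld. The reset/step interface is illustrative, and treating the prohibited state as terminal is an assumption the slides do not spell out:

```python
# A sketch of the 3x3 Gridworld from the example: start S(1,1), goal S(3,3),
# prohibited S(2,2), rewards +1 / -1 / 0. The reset/step interface is illustrative.

class GridWorld3x3:
    MOVES = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}

    def __init__(self):
        self.state = (1, 1)

    def reset(self):
        self.state = (1, 1)                        # start state S(1, 1)
        return self.state

    def step(self, action):
        dr, dc = self.MOVES[action]
        r, c = self.state
        nr, nc = r + dr, c + dc
        if not (1 <= nr <= 3 and 1 <= nc <= 3):    # off the grid: hit the wall, stay in place
            nr, nc = r, c
        self.state = (nr, nc)
        if self.state == (3, 3):                   # goal state: +1, episode ends
            return self.state, 1.0, True
        if self.state == (2, 2):                   # prohibited state: -1, treated as terminal (assumption)
            return self.state, -1.0, True
        return self.state, 0.0, False              # all other transitions: reward 0
```

An environment like this could be plugged into the q_learning sketch given earlier.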


Q-learning in RL

• Iteration 1
Current state: S(1, 1)
Select action: Right (random exploration)
Take action: Agent moves to state S(1, 2)
Observe reward: 0 (no immediate reward)
Next state: S(1, 2)
Update Q-Value:
Assume learning rate (alpha) = 0.5, discount factor (gamma) = 0.9

Q(S(1,1), Right) ← Q(S(1,1), Right) + 0.5 * [R + 0.9 * max_a Q(S(1,2), a) - Q(S(1,1), Right)]
Q(S(1,1), Right) ← 0 + 0.5 * [0 + 0.9 * max_a Q(S(1,2), a) - 0]

The Q-Value is updated accordingly.


Q-learning in RL

• Iteration 2
Current state: S(1, 2)
Select action: Right (random exploration)
Take action: Agent moves to state S(1, 3)
Observe reward: 0 (no immediate reward)
Next state: S(1, 3)
Update Q-Value:
Assume learning rate (alpha) = 0.5, discount factor (gamma) = 0.9

Q(S(1,2), Right) ← Q(S(1,2), Right) + 0.5 * [R + 0.9 * max_a Q(S(1,3), a) - Q(S(1,2), Right)]
Q(S(1,2), Right) ← 0 + 0.5 * [0 + 0.9 * max_a Q(S(1,3), a) - 0]

The Q-Value is updated accordingly.


Q-learning in RL

• Iteration 3
Current state: S(1, 3)
Select action: Down (random exploration)
Take action: Agent moves to state S(2, 3)
Observe reward: 0 (no immediate reward)
Next state: S(2, 3)
Update Q-Value:
Assume learning rate (alpha) = 0.5, discount factor (gamma) = 0.9

Q(S(1,3), Down) ← Q(S(1,3), Down) + 0.5 * [R + 0.9 * max_a Q(S(2,3), a) - Q(S(1,3), Down)]
Q(S(1,3), Down) ← 0 + 0.5 * [0 + 0.9 * max_a Q(S(2,3), a) - 0]

The Q-Value is updated accordingly.


Q-learning in RL

• Iteration 4
Current state: S(2, 3)
Select action: Up (random exploration)
Take action: Agent moves to state S(1, 3)
Observe reward: 0 (no immediate reward)
Next state: S(1, 3)
Update Q-Value:
Assume learning rate (alpha) = 0.5, discount factor (gamma) = 0.9

Q(S(2,3), Up) ← Q(S(2,3), Up) + 0.5 * [R + 0.9 * max_a Q(S(1,3), a) - Q(S(2,3), Up)]
Q(S(2,3), Up) ← 0 + 0.5 * [0 + 0.9 * max_a Q(S(1,3), a) - 0]

The Q-Value is updated accordingly.


Q-learning in RL

• Iteration 5
Current state: S(1, 3)
Select action: Right (random exploration)
Take action: Agent attempts to move right, hits the wall, and remains in state S(1, 3)
Observe reward: 0 (no immediate reward)
Next state: S(1, 3)
Update Q-Value:
Assume learning rate (alpha) = 0.5, discount factor (gamma) = 0.9

Q(S(1,3), Right) ← Q(S(1,3), Right) + 0.5 * [R + 0.9 * max_a Q(S(1,3), a) - Q(S(1,3), Right)]
Q(S(1,3), Right) ← 0 + 0.5 * [0 + 0.9 * max_a Q(S(1,3), a) - 0]

The Q-Value is updated accordingly.
Continue extending the iterations, updating Q-Values, and exploring the Gridworld
environment.