ECE-517: Reinforcement Learning in Artificial Intelligence: Lecture 2: Evaluative Feedback (Exploration vs. Exploitation)

This document summarizes key concepts from a lecture on reinforcement learning. It discusses evaluative feedback, the n-armed bandit problem, exploration vs exploitation tradeoffs, and several methods for estimating action values including sample averages, softmax selection, and tracking nonstationary rewards. The document also covers pursuit methods and associative search techniques for reinforcement learning tasks.

Dr. Itamar Arel

College of Engineering
Electrical Engineering & Computer Science Department
The University of Tennessee
Fall 2011
August 23, 2011
ECE-517: Reinforcement Learning in
Artificial Intelligence
Lecture 2: Evaluative Feedback
(Exploration vs. Exploitation)
ECE-517 - Reinforcement Learning in AI
Outline
Recap
What is evaluative feedback
N-arm Bandit Problem (test case)
Action-Value Methods
Softmax Action Selection
Incremental Implementation
Tracking nonstationary rewards
Optimistic Initial Values
Reinforcement Comparison
Pursuit methods
Associative search
Recap
RL revolves around learning from experience by
interacting with the environment
Unsupervised learning discipline
Trial-and-error based
Delayed reward main concept (value functions,
etc.)
Policy maps from situations to actions
Exploitation vs. Exploration is key challenge
We looked at the Tic-Tac-Toe example where:
$V(s) \leftarrow V(s) + \alpha\,[V(s') - V(s)]$
What is Evaluative Feedback?
RL uses training information that evaluates the actions
taken rather than instructs by giving correct actions
Necessitates trial-and-error search for good behavior
Creates need for active exploration
Pure evaluative feedback indicates how good the action
taken is, but not whether it is the best or the worst action
possible
Pure instructive feedback, on the other hand, indicates
the correct action to take, independent of the action
actually taken
Corresponds to supervised learning
e.g. artificial neural networks
n-Armed Bandit Problem
Let's look at a simple version of the n-armed bandit
problem
First step in understanding the full RL problem
Here is the problem description:
An agent repeatedly chooses one of n actions
After each step a reward value is provided,
drawn from a stationary probability
distribution that depends on the action
selected
The agent's objective is to maximize the
expected total reward over time
Each action selection is called a play or iteration
Extension of the classic slot machine (one-armed bandit)
n-Armed Bandit Problem (cont.)
Each action has a value: the expected (mean) reward given
that the action is selected
If the agent knew the value of each action, the problem
would be trivial
The agent maintains estimates of the values, and chooses the
highest
Greedy algorithm
Directly associated with (policy) exploitation
If agent chooses non-greedily we say it explores
Under uncertainty the agent must explore
A balance must be found between exploration & exploitation
Initial condition: all levers are assumed to yield reward = 0
We'll see several simple balancing methods and show that
they work much better than methods that always exploit
Action-Value Methods
We'll look at simple methods for estimating the values of
actions
Let $Q^*(a)$ denote the true (actual) value of a, and $Q_t(a)$ its
estimate at time t
The true value equals the mean reward for that action
Let's assume that by iteration (play) t, action a has been
taken $k_a$ times; hence we may use the sample average
$$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a} = \frac{1}{k_a} \sum_{i=1}^{k_a} r_i$$
The greedy policy selects the highest sample average,
i.e.
$$a_t^* = \arg\max_a Q_t(a), \qquad Q_t(a_t^*) = \max_a Q_t(a)$$
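The sample-average estimate and the greedy choice can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names and the reward histories are hypothetical, and ties are broken in favor of the lowest action index, which the slides do not specify.

```python
def sample_average(rewards, default=0.0):
    """Sample-average estimate Q_t(a) for one action; `default` is used before any play."""
    return sum(rewards) / len(rewards) if rewards else default


def greedy_action(q_values):
    """Index of the action with the highest estimate (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical reward histories for three actions (illustrative numbers only).
history = [[1.0, 0.0, 2.0], [0.5], []]
q = [sample_average(r) for r in history]
best = greedy_action(q)
```

The third action has never been played, so its estimate falls back to the initial value of 0 used in the lecture's setup.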
Action-Value Methods (cont.)
A simple alternative is to behave greedily most of the time,
but every once in a while, say with small probability ε, instead
select an action at random
This is called an ε-greedy method
We simulate the 10-arm bandit problem, where
$r_a \sim N(Q^*(a), 1)$ (noisy readings of rewards)
$Q^*(a) \sim N(0, 1)$ (actual, true mean reward for action a)
We further assume that there are 2000 machines (tasks),
each with 10 levers
The reward distributions are drawn independently for each
machine
Each iteration, choose a lever on each machine and calculate
the average reward from all 2000 machines
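The experiment described above could be simulated roughly as follows. This is a sketch under stated assumptions: the function names, the default task and play counts (smaller than the 2000 machines in the lecture, so that it runs quickly), and the tie-breaking rule are all illustrative choices. The estimate is kept as a running mean, which is algebraically the same as the sample average.

```python
import random


def run_bandit(epsilon, n_arms=10, n_plays=1000, rng=None):
    """Play one n-armed bandit task with an epsilon-greedy agent; return the reward per play."""
    rng = rng or random.Random()
    q_true = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]  # Q*(a) ~ N(0, 1)
    q_est = [0.0] * n_arms   # initial estimates: every lever assumed to pay 0
    counts = [0] * n_arms    # k_a: number of times each lever was pulled
    rewards = []
    for _ in range(n_plays):
        if rng.random() < epsilon:
            a = rng.randrange(n_arms)                        # explore
        else:
            a = max(range(n_arms), key=lambda i: q_est[i])   # exploit
        r = rng.gauss(q_true[a], 1.0)                        # r_a ~ N(Q*(a), 1)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]               # running sample average
        rewards.append(r)
    return rewards


def average_reward(epsilon, n_tasks=200, n_plays=500, seed=0):
    """Average reward per play, averaged over independently drawn tasks."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_tasks):
        total += sum(run_bandit(epsilon, n_plays=n_plays, rng=rng))
    return total / (n_tasks * n_plays)
```

Calling `average_reward(0.0)` and `average_reward(0.1)` gives a rough greedy versus ε-greedy comparison; the numbers depend on the seed and on the task and play counts.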

Side note: the optimal average reward
[Figure: expected value of the maximal reward (y-axis, 0 to 2) versus number of levers (x-axis, 2 to 20)]
$E\{\max(z_1, z_2, \ldots, z_n)\}, \quad z_i \sim N(0, 1)$
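The quantity plotted on this slide, the expected maximum of n independent standard normal draws, can be estimated by simulation. The sketch below is a plain Monte Carlo estimate; the function name, trial count, and seed are assumptions made for illustration.

```python
import random


def expected_max(n_levers, n_trials=20000, seed=0):
    """Monte Carlo estimate of E{max(z_1, ..., z_n)} with z_i ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(n_levers))
    return total / n_trials
```

For n = 10 the estimate comes out at roughly 1.5, which is the best average reward any agent could hope for on the 10-arm testbed if it always knew the best lever.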

Action-Value Methods (cont.)
The advantage of ε-greedy methods depends on
the task
If the rewards have high variance, ε-greedy has a
stronger advantage
If the rewards had zero variance, the greedy
algorithm would suffice
If the problem were nonstationary (true reward
values changing slowly over time),
ε-greedy would be a must
Q: Perhaps some better methods exist?
Softmax Action Selection
So far we assumed that while exploring (using ε-greedy)
we chose equally among the alternatives
This means we could have chosen a really bad action, as opposed
(for example) to choosing the next-best action
The obvious solution is to rank the alternatives
Generate a probability density/mass function to estimate the
rewards from each action
All actions are ranked/weighted
Typically use the Boltzmann distribution, i.e. choose action a on
iteration t with probability
$$\pi_t(a) = \Pr\{a_t = a\} = \frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$$
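A small Python sketch of Boltzmann (softmax) action selection follows. Here τ is the positive temperature parameter: a large τ makes the choice nearly uniform, while a small τ makes it nearly greedy. The function names are hypothetical, and subtracting the maximum before exponentiating is a standard numerical safeguard added here, not something taken from the slides.

```python
import math
import random


def softmax_probs(q_values, tau):
    """Boltzmann probabilities pi_t(a) proportional to exp(Q_t(a) / tau)."""
    m = max(q_values)  # subtract the max before exponentiating to avoid overflow
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_action(q_values, tau, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = softmax_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

Unlike ε-greedy, the exploration here is graded: an action with a slightly lower estimate is chosen far more often than one with a much lower estimate.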
Softmax Action Selection (cont.)
[Figure: four bar charts over action index 1 to 4, each with y-axis "Average value": the first spans 0 to 10; the other three are titled τ = 1, τ = 4, and τ = 20]
Incremental Implementation
Sample-average methods require linearly-increasing
memory (storage of reward history)
We need a more memory-efficient method
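$$\begin{aligned}
Q_{k+1} &= \frac{1}{k+1} \sum_{i=1}^{k+1} r_i \\
&= \frac{1}{k+1} \left( r_{k+1} + \sum_{i=1}^{k} r_i \right) \\
&= \frac{1}{k+1} \left( r_{k+1} + k Q_k + Q_k - Q_k \right) \\
&= \frac{1}{k+1} \left( r_{k+1} + (k+1) Q_k - Q_k \right) \\
&= Q_k + \frac{1}{k+1} \left[ r_{k+1} - Q_k \right]
\end{aligned}$$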
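The incremental rule can be checked against the batch sample average with a short Python sketch. The function name and the reward values are illustrative assumptions; only the update formula comes from the derivation above.

```python
def incremental_mean(rewards, q0=0.0):
    """Running sample average via Q_{k+1} = Q_k + (r_{k+1} - Q_k) / (k + 1)."""
    q = q0
    for k, r in enumerate(rewards):  # k rewards have been seen before this one
        q += (r - q) / (k + 1)
    return q


rewards = [1.0, 0.0, 2.0, 5.0]  # illustrative numbers only
batch = sum(rewards) / len(rewards)
online = incremental_mean(rewards)
```

Only the current estimate and the count need to be stored, so memory use no longer grows with the number of plays. Note that the first update uses a step size of 1, which overwrites the initial guess `q0` entirely.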

Recurring theme in RL
The previous result is consistent with a recurring theme in
RL which is
New_Estimate ← Old_Estimate + StepSize [Target − Old_Estimate]
The StepSize may be fixed or adaptive (in accordance
with the specific application)
Tracking a Nonstationary Problem
So far we have considered stationary problems
In reality, many problems are effectively nonstationary
A popular approach is to weigh recent rewards more
heavily than older ones
One such technique is called fixed step size
This yields a weighted average in which the weight of a reward decays exponentially with its age
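$$\begin{aligned}
Q_k &= Q_{k-1} + \alpha\,[r_k - Q_{k-1}] \\
&= \alpha r_k + (1-\alpha) Q_{k-1} \\
&= \alpha r_k + (1-\alpha)\alpha r_{k-1} + (1-\alpha)^2 Q_{k-2} \\
&= \alpha r_k + (1-\alpha)\alpha r_{k-1} + (1-\alpha)^2 \alpha r_{k-2} + \cdots + (1-\alpha)^{k-1} \alpha r_1 + (1-\alpha)^k Q_0 \\
&= (1-\alpha)^k Q_0 + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i} r_i
\end{aligned}$$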
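As a sanity check on the derivation, the recursive update and the closed-form weighted sum can be compared numerically. The sketch below assumes a constant step size 0 < α ≤ 1; the function names are hypothetical.

```python
def constant_step_estimate(rewards, alpha, q0=0.0):
    """Recursive form: Q_k = Q_{k-1} + alpha * (r_k - Q_{k-1})."""
    q = q0
    for r in rewards:
        q += alpha * (r - q)
    return q


def closed_form_estimate(rewards, alpha, q0=0.0):
    """Closed form: (1 - alpha)^k * Q_0 + sum_i alpha * (1 - alpha)^(k - i) * r_i."""
    k = len(rewards)
    weighted = sum(alpha * (1 - alpha) ** (k - i) * r
                   for i, r in enumerate(rewards, start=1))
    return (1 - alpha) ** k * q0 + weighted


def weights(k, alpha):
    """Weight on Q_0 followed by the weights on r_1, ..., r_k; they sum to 1."""
    return [(1 - alpha) ** k] + [alpha * (1 - alpha) ** (k - i) for i in range(1, k + 1)]
```

The weights sum to 1, so the estimate is a proper weighted average, and the most recent reward always carries the largest reward weight, α.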

Optimistic Initial Values
All methods discussed so far depended, to some extent, on
the initial action-value estimates, $Q_0(a)$
For sample-average methods this bias disappears when
all actions have been selected at least once
For fixed step-size methods, the bias disappears with
time (geometrically decreasing)
In the 10-arm bandit example with α = 0.1
If we were to set all initial reward guesses to +5 (instead of zero)
Exploration is guaranteed, since true values are ~N(0,1)
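The effect of optimistic initial values can be illustrated with a purely greedy agent. This is a sketch, not the lecture's experiment: the function name, the three-arm task, and the noise-free rewards in `means` are hypothetical, chosen so the behavior is easy to trace by hand.

```python
def greedy_with_optimism(reward_fn, n_arms, n_plays, q0=5.0, alpha=0.1):
    """Purely greedy agent with constant step size and initial estimates q0."""
    q = [q0] * n_arms
    chosen = []
    for _ in range(n_plays):
        a = max(range(n_arms), key=lambda i: q[i])  # always exploit
        q[a] += alpha * (reward_fn(a) - q[a])       # fixed step-size update
        chosen.append(a)
    return chosen, q


# Hypothetical noise-free task: arm 1 is the best arm.
means = [0.2, 1.0, -0.5]
optimistic, _ = greedy_with_optimism(lambda a: means[a], 3, 500)
pessimistic, _ = greedy_with_optimism(lambda a: means[a], 3, 500, q0=0.0)
```

With estimates starting at +5, every reward is disappointing, so the greedy agent keeps switching arms until the estimates settle and it ends up on arm 1. Starting from zero, the same agent locks onto arm 0 after its first positive reward and never tries the others.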

Reinforcement Comparison
An intuitive element in RL is that
actions followed by higher rewards are made more likely to recur
actions followed by lower rewards are made less likely to recur
How is the learner to know what constitutes a high or low
reward?
To make a judgment, one must compare the reward to a
reference reward, $\bar{r}_t$
Natural choice: the average of previously received rewards
These methods are called reinforcement comparison methods
The agent maintains an action preference value, $p_t(a)$, for
each action a
The preference might be used to select an action according
to a softmax relationship
$$\pi_t(a) = \frac{e^{p_t(a)}}{\sum_{b=1}^{n} e^{p_t(b)}}$$

Reinforcement Comparison (cont.)
The reinforcement comparison idea is used in updating the
action preferences
$$p_{t+1}(a_t) = p_t(a_t) + \beta\,[r_t - \bar{r}_t]$$
A high reward increases the probability of an action being
selected, and vice versa
Following the action preference update, the agent updates
the reference reward
$$\bar{r}_{t+1} = \bar{r}_t + \alpha\,[r_t - \bar{r}_t]$$
Using separate step sizes (β for the preferences, α for the reference reward) allows us to
differentiate between rates for $\bar{r}_t$ and $p_t$
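One step of reinforcement comparison might look like the following Python sketch. The function names and the default step sizes α = β = 0.1 are assumptions for illustration; the update order (preference first, then reference reward) follows the slide.

```python
import math


def softmax(prefs):
    """pi_t(a) = exp(p_t(a)) / sum_b exp(p_t(b))."""
    m = max(prefs)
    w = [math.exp(p - m) for p in prefs]
    s = sum(w)
    return [x / s for x in w]


def reinforcement_comparison_step(prefs, ref_reward, action, reward, alpha=0.1, beta=0.1):
    """One update of the chosen action's preference, then of the reference reward."""
    new_prefs = list(prefs)
    new_prefs[action] += beta * (reward - ref_reward)       # p_{t+1}(a_t)
    new_ref = ref_reward + alpha * (reward - ref_reward)    # r-bar_{t+1}
    return new_prefs, new_ref
```

A reward above the reference raises the preference for the action just taken, and hence its softmax probability; a reward below the reference lowers it. Only the played action's preference changes on each step.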
Pursuit Methods
Another class of effective learning methods is pursuit
methods
They maintain both action-value estimates and action
preferences
The preferences continually pursue the greedy actions
Letting $a^*_{t+1}$ denote the greedy action, the update rules
are:
$$\pi_{t+1}(a) = \pi_t(a) + \beta\,[1 - \pi_t(a)] \quad \text{for } a = a^*_{t+1}$$
$$\pi_{t+1}(a) = \pi_t(a) + \beta\,[0 - \pi_t(a)] \quad \text{for } a \neq a^*_{t+1}$$
The action-value estimates, $Q_{t+1}(a)$, are updated using one
of the ways described (e.g. sample averages of observed
rewards)
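The pursuit update can be written compactly by treating the greedy action as a one-hot target. This is a minimal sketch: the function name and the default β are assumptions, and the action-value estimates are taken as given rather than learned here.

```python
def pursuit_update(probs, q_values, beta=0.01):
    """Move action probabilities toward the current greedy action."""
    greedy = max(range(len(q_values)), key=lambda a: q_values[a])  # a*_{t+1}
    target = [1.0 if a == greedy else 0.0 for a in range(len(probs))]
    return [p + beta * (t - p) for p, t in zip(probs, target)]
```

If the probabilities sum to 1 before the update, they still sum to 1 afterwards, since the increase for the greedy action exactly offsets the total decrease across the others.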
Associative Search
So far we've considered nonassociative tasks in which
there was no association of actions with states
Find the single best action when task is stationary, or
Track the best action as it changes over time
However, in RL the goal is to learn a policy (i.e. state
to action mappings)
A natural extension of the n-arm bandit:
Assume you have K machines, but only one is played at a time
The agent maps the state (i.e. machine played) to the action
This would be called an associative search task
It is like the full RL problem in that it involves a policy
However, it lacks the long-term reward prospect of full RL
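An associative search agent can be sketched as one bandit learner per state. The class below is a hypothetical illustration, assuming ε-greedy selection and sample-average updates; the lecture does not prescribe a particular method for this task.

```python
import random


class AssociativeBandit:
    """One table of action-value estimates per state (here, per machine)."""

    def __init__(self, n_states, n_actions, epsilon=0.1, seed=0):
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.counts = [[0] * n_actions for _ in range(n_states)]
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def act(self, state):
        """Epsilon-greedy choice using only the estimates for this state."""
        row = self.q[state]
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(row))
        return max(range(len(row)), key=lambda a: row[a])

    def update(self, state, action, reward):
        """Sample-average update of the estimate for (state, action)."""
        self.counts[state][action] += 1
        self.q[state][action] += (reward - self.q[state][action]) / self.counts[state][action]

    def policy(self):
        """Greedy state-to-action mapping implied by the current estimates."""
        return [max(range(len(row)), key=lambda a: row[a]) for row in self.q]
```

The `policy` method returns a state-to-action mapping, which is what distinguishes this task from the nonassociative bandit. Each reward still depends only on the current state and action, so no long-term value is being estimated.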
Summary
We've looked at various action-selection schemes
Balancing exploration vs. exploitation
ε-greedy
Softmax techniques
There is no single-best solution for all problems
We'll see more of this issue later
