ECE-517: Reinforcement Learning in Artificial Intelligence: Lecture 2: Evaluative Feedback (Exploration vs. Exploitation)
\[
\Pr\{a_t = a\} = \frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}
\]
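The softmax (Gibbs/Boltzmann) selection rule, in which action a is chosen with probability proportional to exp(Q_t(a)/tau), can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the argument name `tau` for the temperature, and the example value estimates are all assumptions made for this sketch.

```python
import math
import random


def softmax_probabilities(q_values, tau):
    """Return the selection probability of each action under the softmax rule.

    q_values: current value estimates Q_t(a), one per action (hypothetical input)
    tau:      temperature, assumed to be strictly positive
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    # Subtracting the maximum leaves the probabilities unchanged
    # but avoids overflow in exp() when tau is small.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(q_values, tau, rng=random):
    """Sample one action index according to the softmax probabilities."""
    probs = softmax_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


if __name__ == "__main__":
    estimates = [1.0, 2.0, 3.0, 4.0]  # made-up value estimates for four actions
    for tau in (1.0, 4.0, 20.0):
        print(tau, softmax_probabilities(estimates, tau))
```

With a small temperature the probabilities concentrate on the highest-valued action; as the temperature grows they approach a uniform distribution over the actions.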
ECE-517 - Reinforcement Learning in AI
Softmax Action Selection (cont.)
[Figure: four bar charts over action index 1 to 4, each with the vertical axis labeled "Average value". One unlabeled panel (axis range 0 to 10) and three panels labeled t = 1 (range 0 to 1), t = 4 (range 0 to 0.7), and t = 20 (range 0 to 0.35).]
Incremental Implementation
Sample-average methods require memory that grows linearly
with the number of plays (storage of the full reward history)
We need a more memory-efficient method
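\[
\begin{aligned}
Q_{k+1} &= \frac{1}{k+1} \sum_{i=1}^{k+1} r_i \\
        &= \frac{1}{k+1} \left( r_{k+1} + \sum_{i=1}^{k} r_i \right) \\
        &= \frac{1}{k+1} \left( r_{k+1} + k Q_k \right) \\
        &= \frac{1}{k+1} \left( r_{k+1} + k Q_k + Q_k - Q_k \right) \\
        &= \frac{1}{k+1} \left( r_{k+1} + (k+1) Q_k - Q_k \right) \\
        &= Q_k + \frac{1}{k+1} \left[ r_{k+1} - Q_k \right]
\end{aligned}
\]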
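The incremental form of the sample average, Q_{k+1} = Q_k + (r_{k+1} - Q_k)/(k+1), needs only the current estimate and the play count rather than the whole reward history. A minimal sketch follows; the function name and the reward values are made up for illustration.

```python
def incremental_update(q_k, k, r_next):
    """Return Q_{k+1} from Q_k (the mean of the first k rewards) and r_{k+1}."""
    return q_k + (r_next - q_k) / (k + 1)


rewards = [1.0, 0.0, 2.0, 5.0]  # made-up rewards for a single action

q = 0.0  # Q_0; its value does not matter because the k = 0 update overwrites it
for k, r in enumerate(rewards):
    # Only q and k are kept between plays: constant memory per action.
    q = incremental_update(q, k, r)

batch_mean = sum(rewards) / len(rewards)  # the stored-history sample average
print(q, batch_mean)
```

Both quantities agree, which is the point of the derivation: the update reproduces the sample average without storing past rewards.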
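\[
\begin{aligned}
Q_k &= Q_{k-1} + \alpha \left[ r_k - Q_{k-1} \right] \\
    &= \alpha r_k + (1-\alpha) Q_{k-1} \\
    &= \alpha r_k + (1-\alpha)\alpha r_{k-1} + (1-\alpha)^2 Q_{k-2} \\
    &= \alpha r_k + (1-\alpha)\alpha r_{k-1} + (1-\alpha)^2 \alpha r_{k-2} + \dots + (1-\alpha)^{k-1} \alpha r_1 + (1-\alpha)^k Q_0 \\
    &= (1-\alpha)^k Q_0 + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i} r_i
\end{aligned}
\]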
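Replacing the 1/(k+1) step with a constant step size alpha gives the update Q_k = Q_{k-1} + alpha (r_k - Q_{k-1}), which unrolls into a weighted sum of Q_0 and all past rewards, with weights that decay geometrically with the age of each reward. The sketch below checks that the recursive update and the unrolled sum agree numerically. The function names, the step size 0.1, the initial estimate, and the rewards are assumptions chosen only for this illustration.

```python
def constant_step_update(q_prev, reward, alpha):
    """One recursive update: Q_k = Q_{k-1} + alpha * (r_k - Q_{k-1})."""
    return q_prev + alpha * (reward - q_prev)


def unrolled_estimate(q0, rewards, alpha):
    """Unrolled form: (1-alpha)^k * Q_0 + sum_i alpha * (1-alpha)^(k-i) * r_i."""
    k = len(rewards)
    total = (1 - alpha) ** k * q0
    for i, r in enumerate(rewards, start=1):
        # A reward received (k - i) plays ago is discounted by (1-alpha)^(k-i).
        total += alpha * (1 - alpha) ** (k - i) * r
    return total


alpha = 0.1                           # hypothetical constant step size
q0 = 5.0                              # hypothetical initial estimate Q_0
rewards = [1.0, 0.0, 2.0, 5.0, 3.0]   # made-up reward sequence

q = q0
for r in rewards:
    q = constant_step_update(q, r, alpha)

print(q, unrolled_estimate(q0, rewards, alpha))
```

Unlike the sample average, the initial estimate Q_0 never drops out entirely here; its weight (1-alpha)^k only shrinks toward zero as k grows.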