Unit II
Here, the mean value of the third arm is the highest, so it is better to play the third machine in order to get the maximum total reward. But the mean values (i.e. the probability distributions of rewards) of the machines are not known to us. Then how would we know the best machine to play?
Ɛ-greedy Method
The epsilon-greedy algorithm is one of the simplest strategies for solving the MAB
problem. It works as follows:
With probability epsilon, explore a random arm.
With probability 1 – epsilon, exploit the arm with the highest estimated reward.
Algorithm
1. Initialize the estimated values of all arms to zero or a small positive number.
2. For each trial:
Generate a random number between 0 and 1.
If the number is less than epsilon, select a random arm (exploration).
Otherwise, select the arm with the highest estimated reward (exploitation).
Update the estimated reward of the selected arm based on the observed
reward.
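A minimal Python sketch of this procedure is given below. The number of arms, the value of epsilon, and the Gaussian reward simulation are illustrative assumptions, not part of the algorithm description above.

```python
import random

def epsilon_greedy(pull_arm, n_arms, epsilon=0.1, n_trials=1000):
    """Ɛ-greedy strategy; pull_arm(a) returns the observed reward of arm a."""
    q = [0.0] * n_arms      # estimated reward of each arm (initialized to zero)
    n = [0] * n_arms        # number of times each arm has been selected
    total_reward = 0.0
    for _ in range(n_trials):
        if random.random() < epsilon:
            a = random.randrange(n_arms)                  # exploration: random arm
        else:
            a = max(range(n_arms), key=lambda i: q[i])    # exploitation: best estimate
        r = pull_arm(a)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]   # incremental update of the estimated reward
        total_reward += r
    return q, total_reward

# Illustrative 3-armed bandit with assumed Gaussian reward means.
means = [1.0, 1.5, 2.0]
q, total = epsilon_greedy(lambda a: random.gauss(means[a], 1.0), n_arms=3)
```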
More precisely, the Ɛ-greedy method either takes the best (greedy) action or chooses a random action from the set of actions. In the long run it generally produces a higher total reward than the greedy method.
The following graphs show the comparison between the greedy and Ɛ-greedy methods. The first graph shows the following.
At the beginning the greedy method improves slightly faster, but after that it levels off.
The greedy method performs worse in the long run because it gets stuck in a sub-optimal action.
In the long run, the Ɛ-greedy method performs better than the greedy method.
The second graph shows that the greedy method finds the optimal action (or machine) in only about 1/3 of the tasks, whereas the Ɛ-greedy method finds the optimal action (or machine) in about 2/3 of the tasks.
Applications of N-armed Bandit problems
Design of experiments (Clinical Trials)
Online ad placement
Web page personalization
Games
Network (packet routing)
With a constant step size α, the estimate after n rewards can be written as a weighted combination of the initial estimate Q1 and the past rewards:
Qn+1 = (1 - α)^n Q1 + Σ_{i=1}^{n} α (1 - α)^(n-i) Ri
In fact, the weight given to each reward Ri decays exponentially into the past. This is sometimes called an exponential or recency-weighted average.
Two conditions should be met:
The step size must be large enough to overcome the initial conditions.
The step size must be small enough to ensure convergence.
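These two requirements are usually written in the standard stochastic-approximation form shown below; the notation αn(a) for the step size used at the n-th selection of action 'a' is the usual textbook convention and is stated here as an assumption.

\sum_{n=1}^{\infty} \alpha_n(a) = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \alpha_n^{2}(a) < \infty

The first sum being infinite corresponds to the steps being large enough to overcome the initial conditions; the second sum being finite corresponds to the steps eventually becoming small enough to ensure convergence.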
From this we think that machine B is good, and the rest of the trials will be played with B in the hope of receiving the maximum total reward. In fact, machine C may be better. Here, the initialization does not allow us to play the machines enough times to find the best one.
Optimistic initialization, on the other hand, allows us to play the machines more times to find the best one. In optimistic initialization, the Q values are initialized to a large value, say 10. This makes us play all the machines many times, until the average Q value of every machine comes close to 10. Hence, it leads to finding the best machine correctly.
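The effect can be seen in the sketch below, where the only thing that changes between the two runs is the initial Q value; the value 10 and the reward distributions are illustrative assumptions.

```python
import random

def greedy_bandit(means, q_init, n_trials=1000):
    """Pure greedy selection; only the initial Q values differ between runs."""
    n_arms = len(means)
    q = [float(q_init)] * n_arms   # q_init = 0 -> ordinary, q_init = 10 -> optimistic
    n = [0] * n_arms
    for _ in range(n_trials):
        a = max(range(n_arms), key=lambda i: q[i])   # always pick the highest estimate
        r = random.gauss(means[a], 1.0)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]
    return q

means = [1.0, 1.5, 2.0]                   # the third machine is actually the best
print(greedy_bandit(means, q_init=0))     # may get stuck on an early lucky arm
print(greedy_bandit(means, q_init=10))    # optimistic start forces trying every arm
```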
At = argmax_a [ Qt(a) + c √( ln t / Nt(a) ) ]
where t is the time step,
Qt(a) is the estimated reward of action 'a' at time step t,
c is a parameter used to control exploration,
ln t is the natural logarithm of t, and
Nt(a) denotes the number of times action 'a' was selected prior to t.
Select the arm with the highest UCB value.
Update the estimated reward of the selected arm based on the observed
reward.
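A minimal Python sketch of UCB selection is shown below; the reward simulation and the value c = 2 are illustrative assumptions.

```python
import math
import random

def ucb(pull_arm, n_arms, c=2.0, n_trials=1000):
    """UCB selection: At = argmax_a [ Qt(a) + c * sqrt(ln t / Nt(a)) ]."""
    q = [0.0] * n_arms
    n = [0] * n_arms
    for t in range(1, n_trials + 1):
        # An arm that has never been tried gets an infinite upper bound,
        # so every arm is selected at least once.
        ucb_values = [
            q[a] + c * math.sqrt(math.log(t) / n[a]) if n[a] > 0 else float("inf")
            for a in range(n_arms)
        ]
        a = max(range(n_arms), key=lambda i: ucb_values[i])  # highest UCB value
        r = pull_arm(a)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]   # update the estimate of the selected arm
    return q

means = [1.0, 1.5, 2.0]
q = ucb(lambda a: random.gauss(means[a], 1.0), n_arms=3)
```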
The diagram shows that the gradient bandit algorithm performs better with a baseline than without one.
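As a hedged sketch of the standard gradient bandit update (softmax action preferences, with the running average reward used as the baseline; the step size and reward model below are illustrative assumptions):

```python
import math
import random

def gradient_bandit(pull_arm, n_arms, alpha=0.1, n_trials=1000, use_baseline=True):
    """Preferences H are pushed toward actions whose reward beats the baseline."""
    h = [0.0] * n_arms          # action preferences
    avg_reward = 0.0            # running average reward, used as the baseline
    for t in range(1, n_trials + 1):
        exp_h = [math.exp(x) for x in h]
        total = sum(exp_h)
        pi = [x / total for x in exp_h]              # softmax action probabilities
        a = random.choices(range(n_arms), weights=pi)[0]
        r = pull_arm(a)
        baseline = avg_reward if use_baseline else 0.0
        for i in range(n_arms):
            if i == a:
                h[i] += alpha * (r - baseline) * (1 - pi[i])
            else:
                h[i] -= alpha * (r - baseline) * pi[i]
        avg_reward += (r - avg_reward) / t           # incremental average of rewards
    return h
```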
The above figure shows two cases. The first one is a normal bandit: whenever we want to play, machine B is given, as it is the best overall.
The second one shows a contextual bandit, where the appropriate machine is given depending on the context. For example:
if the context is shoes, then machine C is given
if the context is medicines, then machine C is given
if the context is chips, then machine B is given
if the context is diapers, then machine A is given
Such bandits are called contextual bandits.
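A contextual bandit can be sketched as one value table per context; the context names below come from the example above, while the Ɛ-greedy choice inside each context is an illustrative assumption.

```python
import random

class ContextualBandit:
    """Keeps a separate Ɛ-greedy estimate Q[context][arm] for every context."""

    def __init__(self, contexts, n_arms, epsilon=0.1):
        self.q = {c: [0.0] * n_arms for c in contexts}
        self.n = {c: [0] * n_arms for c in contexts}
        self.epsilon = epsilon
        self.n_arms = n_arms

    def select(self, context):
        if random.random() < self.epsilon:
            return random.randrange(self.n_arms)                         # explore
        return max(range(self.n_arms), key=lambda a: self.q[context][a])  # exploit

    def update(self, context, arm, reward):
        self.n[context][arm] += 1
        self.q[context][arm] += (reward - self.q[context][arm]) / self.n[context][arm]

# Machines A, B, C are mapped to arm indices 0, 1, 2.
bandit = ContextualBandit(["shoes", "medicines", "chips", "diapers"], n_arms=3)
arm = bandit.select("shoes")
bandit.update("shoes", arm, reward=1.0)   # reward observed from the chosen machine
```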
2.9 Summary
The following diagram shows the summary of all bandit algorithms.