An Introduction To Stochastic Approximation
Richard Combes
October 11, 2013
\[
g(x_{n+1}) = g(x_n) + \frac{g'(x_n)}{n}\,(g^* - g(x_n)).
\]
By the fundamental theorem of calculus: $(1/n)|g^* - g(x_n)| \le (g'/n)|x^* - x_n|$ (with $g'$ an upper bound on the derivative of $g$). So for $n \ge g'$, we have either $x_n \le x_{n+1} \le x^*$ or $x_n \ge x_{n+1} \ge x^*$. In both cases, $n \mapsto |g(x_n) - g^*|$ is decreasing for large $n$.
It is also noted that:
\[
x_{n+1} - x_n = \frac{1}{n}\,(g^* - g(x_n)),
\]
so that $x_n$ appears as a discretization (with discretization steps $\{1/n\}$) of the following ordinary differential equation (o.d.e.):
\[
\dot{x} = g^* - g(x).
\]
This analogy will be made precise in the next subsection.
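To make the discretization picture concrete, here is a minimal numerical sketch of the recursion; the increasing function $g(x) = \tanh(x)$ and the target $g^* = 1/2$ are illustrative assumptions, not taken from the text:

```python
import numpy as np

# A minimal sketch of the recursion x_{n+1} = x_n + (1/n)(g* - g(x_n)).
# The iterates follow the o.d.e. xdot = g* - g(x) on the time scale t(n) ~ log n.

def g(x):
    return np.tanh(x)

g_star = 0.5
x = 3.0                          # initial iterate
for n in range(1, 10**5):
    x += (g_star - g(x)) / n     # discretization step 1/n

print(x, np.arctanh(g_star))     # x approaches x* solving g(x*) = g*
```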
Iterative updates: Once again, since their updates are Markovian, stochastic approximation schemes are good models for collective learning phenomena in which a set of agents interact repeatedly and update their behavior depending on their most recent observations. This is why results on learning schemes in game theory rely heavily on stochastic approximation arguments.
Function minimization: the Kiefer-Wolfowitz scheme [4] seeks a minimizer of a function $f$ using finite differences as gradient estimates:
\[
x_{n+1} = x_n - \epsilon_n\,\frac{f(x_n + \delta_n) - f(x_n - \delta_n)}{2\delta_n}.
\]
The associated o.d.e. is $\dot{x} = -\nabla f(x)$, which admits the Liapunov function $V(x) = f(x) - f(x^*)$. With the proper step sizes (say $\epsilon_n = n^{-1}$, $\delta_n = n^{-1/3}$) it can be proven that the method converges to the minimizer: $x_n \to_{n\to\infty} x^*$ almost surely.
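The following sketch simulates this scheme with the step sizes suggested above; the quadratic objective and the Gaussian observation noise are illustrative assumptions, not taken from the text:

```python
import numpy as np

# A sketch of the Kiefer-Wolfowitz iteration with eps_n = 1/n, delta_n = n^(-1/3).

rng = np.random.default_rng(0)

def noisy_f(x):
    return (x - 2.0) ** 2 + rng.normal(scale=0.1)   # f(x) = (x - 2)^2 observed with noise

x = 0.0
for n in range(1, 10**5):
    eps, delta = 1.0 / n, n ** (-1.0 / 3.0)
    grad = (noisy_f(x + delta) - noisy_f(x - delta)) / (2.0 * delta)
    x -= eps * grad              # x_{n+1} = x_n - eps_n * finite-difference gradient

print(x)                          # close to the minimizer x* = 2
```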
Then it can be proven that the behavior of $\{x_n\}$ is still described by the o.d.e. $\dot{x} = h(x)$. Namely, the asymptotic behavior of $\{x_n\}$ is the same as in the case where all its components are updated simultaneously. This is described for instance in [1, Chapter 7].
reward $A^k_{a_1,a_2}$, where $A^1$, $A^2$ are two $A \times A$ matrices with real entries. Define the empirical distribution of actions of player $k$ at time $n$ by:
\[
p^k(a, n) = \frac{1}{n} \sum_{t=1}^{n} \mathbf{1}\{a^k_t = a\}.
\]
A natural learning scheme for agent $k$ is to assume that at time $n+1$, agent $k'$ will choose an action whose probability distribution is equal to $p^{k'}(\cdot, n)$, and to play the best action. Namely, agent $k$ assumes that $\mathbb{P}[a^{k'}_{n+1} = a] = p^{k'}(a, n)$, and chooses the action maximizing his expected payoff given that assumption.
We define
\[
g^k(\cdot, p') = \operatorname*{arg\,max}_{p \in P} \sum_{1 \le a \le A} \sum_{1 \le a' \le A} p(a)\, A^k_{a,a'}\, p'(a'),
\]
with $P$ the set of probability distributions on $\{1, \dots, A\}$. $g^k(\cdot, p')$ is the probability distribution of the action of $k$ maximizing the expected payoff, knowing that player $k'$ will play an action distributed as $p'$.
The empirical probabilities can be written recursively as:
\[
p^k(a, n+1) = p^k(a, n) + \frac{1}{n+1}\left(\mathbf{1}\{a^k_{n+1} = a\} - p^k(a, n)\right).
\]
Using the fact that $\mathbb{E}[\mathbf{1}\{a^k_{n+1} = a\}\,|\,\mathcal{F}_n] = g^k(a, p^{k'}(\cdot, n))$, we recognize that the probabilities $p$ are updated according to a stochastic approximation scheme with $\epsilon_n = 1/(n+1)$, and the corresponding o.d.e. is
\[
\dot{p} = g(p) - p.
\]
It is noted that such an o.d.e. may have complicated dynamics and might not admit a Liapunov function without further assumptions on the structure of the game (the matrices $A^1$ and $A^2$).
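As a concrete illustration of the scheme, the following sketch runs fictitious play on a $2 \times 2$ zero-sum game; the matching-pennies payoff matrices are an assumption of ours, not taken from the text. In this game the empirical frequencies are known to approach the mixed equilibrium $(1/2, 1/2)$.

```python
import numpy as np

# Fictitious play: each player best-responds to the opponent's empirical
# action distribution p^{k'}(., n).

A1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])        # payoff of player 1, indexed [a1, a2]
A2 = -A1.T                          # payoff of player 2, indexed [a2, a1]

counts = [np.ones(2), np.ones(2)]   # action counts (ones break initial ties)
for n in range(10**4):
    p1, p2 = counts[0] / counts[0].sum(), counts[1] / counts[1].sum()
    a1 = int(np.argmax(A1 @ p2))    # best response of player 1 to p^2(., n)
    a2 = int(np.argmax(A2 @ p1))    # best response of player 2 to p^1(., n)
    counts[0][a1] += 1
    counts[1][a2] += 1

print(counts[0] / counts[0].sum(), counts[1] / counts[1].sum())
```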
2.1 Assumptions
We denote by Fn the σ-algebra generated by (x0 , M0 , . . . , xn , Mn ). Namely Fn contains all
the information about the history of the algorithm up to time n.
We introduce the following assumptions:
(A1) (Lipschitz continuity of h) There exists $L \ge 0$ such that for all $x, y \in \mathbb{R}^d$: $\|h(x) - h(y)\| \le L\|x - y\|$.
(A2) (Diminishing step sizes) We have that $\sum_{n \ge 0} \epsilon_n = \infty$ and $\sum_{n \ge 0} \epsilon_n^2 < \infty$.
(A3) (Martingale difference noise) There exists $K \ge 0$ such that for all $n$ we have that $\mathbb{E}[M_{n+1} | \mathcal{F}_n] = 0$ and $\mathbb{E}[\|M_{n+1}\|^2 | \mathcal{F}_n] \le K(1 + \|x_n\|)$.
(A4) (Boundedness of the iterates) We have that $\sup_{n \ge 0} \|x_n\| < \infty$ almost surely.
(A5) (Liapunov function) There exists a positive, radially unbounded, continuously differentiable function $V : \mathbb{R}^d \to \mathbb{R}$ such that for all $x \in \mathbb{R}^d$, $\langle \nabla V(x), h(x) \rangle \le 0$, with strict inequality if $V(x) \ne 0$.
(A1) is necessary to ensure that the o.d.e. has a unique solution given an initial condition, and that the value of the solution after a given amount of time depends continuously on the initial condition. (A2) is necessary for almost sure convergence, and holds in particular for $\epsilon_n = 1/n$. (A3) is required to control the random fluctuations of $x_n$ around the solution of the o.d.e. (using the martingale convergence theorem), and holds in particular if $\{M_n\}_{n\in\mathbb{N}}$ is independent with bounded variance. (A4) is essential, and can (in some cases) be difficult to prove. We will discuss how to ensure that (A4) holds in later sections. (A5) ensures that all solutions of the o.d.e. converge to the set of zeros of $V$, and that this set is stable (in the sense of Liapunov). Merely assuming that all solutions of the o.d.e. converge to a single point does not guarantee convergence of the corresponding stochastic approximation.
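As a simple point of reference (an illustration of ours, not from the text), the linear field $h(x) = -x$ satisfies these assumptions:
\[
\|h(x) - h(y)\| = \|x - y\| \quad \text{(A1 with } L = 1\text{)}, \qquad
V(x) = \tfrac{1}{2}\|x\|^2, \quad \langle \nabla V(x), h(x) \rangle = -\|x\|^2 \le 0,
\]
with strict inequality whenever $V(x) \ne 0$, so (A5) holds; (A3) holds for instance when $\{M_n\}$ is i.i.d. centered Gaussian noise.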
The proof of theorem 1 is based on an intermediate result stating that the sequence $\{x_n\}$ (suitably interpolated) remains arbitrarily close to the solution of the o.d.e. We define $\Phi_t(x)$ the value at $t$ of the unique solution to the o.d.e. starting at $x$ at time $0$. $\Phi$ is uniquely defined because of (A1) and the Picard–Lindelöf theorem. We define $t(n) = \sum_{k=0}^{n-1} \epsilon_k$, and $x(t)$ the interpolated version of $\{x_n\}_{n\in\mathbb{N}}$. Namely, for all $n$, $x(t(n)) = x_n$, and $x$ is piecewise linear. We define $x^n(t) = \Phi_{t - t(n)}(x_n)$, the o.d.e. trajectory started at $x_n$ at time $t(n)$.
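Before the proof, here is a numerical sketch of this tracking phenomenon under illustrative assumptions of ours: $h(x) = -x$ (so $\Phi_s(x) = x e^{-s}$), $\epsilon_k = 1/(k+1)$, and i.i.d. standard Gaussian noise.

```python
import numpy as np

# On a window [t(n), t(n)+T], the interpolated iterates stay close to the
# o.d.e. trajectory started at x_n; the gap shrinks as n grows.

rng = np.random.default_rng(1)

def eps(k):
    return 1.0 / (k + 1)

x, n = 2.0, 10**4
for k in range(n):                   # run the scheme up to a large index n
    x += eps(k) * (-x + rng.normal())

x_n, s, gap, k = x, 0.0, 0.0, n      # track Phi over a window of length T = 1
while s < 1.0:
    x += eps(k) * (-x + rng.normal())
    s += eps(k)
    gap = max(gap, abs(x - x_n * np.exp(-s)))
    k += 1

print(gap)                            # small, and decreases for larger n
```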
Proof of lemma 1: Since the result holds almost surely, we consider a fixed sample path throughout the proof. Define $m = \inf\{k : t(k) > t(n) + T\}$, so that we can prove the result for $T = t(m) - t(n)$ and consider the time interval $[t(n), t(m)]$. Consider $n \le k \le m$; we are going to start by bounding the difference between $x$ and $x^n$ at time instants $t \in \{t(n), \dots, t(m)\}$, that is $\sup_{n \le k \le m} \|x_k - x^n(t(k))\|$.
We start by re-writing the definitions of $x_k$ and $x^n(t(k))$:
\[
x_k = x_n + \sum_{u=n}^{k-1} \epsilon_u h(x_u) + \sum_{u=n}^{k-1} \epsilon_u M_{u+1},
\qquad
x^n(t(k)) = x_n + \sum_{u=n}^{k-1} \int_{t(u)}^{t(u+1)} h(x^n(v))\,dv,
\]
where we recall that $\int_{t(u)}^{t(u+1)} dv = \epsilon_u$.
Our goal is to upper bound the following difference, decomposed into 3 terms:
\[
C_k = \|x^n(t(k)) - x_k\| \le A_k + \sum_{u=n}^{k-1} B_u + L \sum_{u=n}^{k-1} \epsilon_u C_u, \qquad (1)
\]
with:
\[
A_k = \Big\|\sum_{u=n}^{k-1} \epsilon_u M_{u+1}\Big\|, \qquad
B_u = \int_{t(u)}^{t(u+1)} \|h(x^n(v)) - h(x^n(t(u)))\|\,dv.
\]
From (A3), $\mathbb{E}[\|M_{n+1}\|^2 | \mathcal{F}_n] \le K(1 + \sup_k \|x_k\|) < \infty$. Therefore the sequence $\{S_n\}$, with $S_n = \sum_{u=0}^{n-1} \epsilon_u M_{u+1}$, is a square integrable martingale:
\[
\sum_{n \ge 0} \mathbb{E}[\|S_{n+1} - S_n\|^2 | \mathcal{F}_n] \le K(1 + \sup_n \|x_n\|) \sum_{n \ge 0} \epsilon_n^2 < \infty.
\]
Using the martingale convergence theorem (lemma 4), we have that $S_n$ converges almost surely to a finite value $S_\infty$. This implies that:
\[
\sup_{k \ge n} A_k = \sup_{k \ge n} \|S_k - S_n\| \to_{n \to \infty} 0 \ \text{a.s.}
\]
Therefore, until the end of the proof we choose $n$ large enough so that $A_k \le \delta/2$ for all $k \ge n$, with $\delta > 0$ arbitrarily small.
The discretization term: maximal slope of $x^n$.
In order to upper bound $B_u$, we prove that for $t \in [t(u), t(u+1)]$, $x^n(t)$ can be approximated by $x^n(t(u))$ (up to a term proportional to $\epsilon_u$). To do so we have to bound the maximal slope of $t \mapsto x^n(t)$ on $[t(n), t(m)]$. We know that $\|x^n(t(n))\| = \|x_n\| \le \sup_{n \in \mathbb{N}} \|x_n\|$, which is finite by (A4). Using the fact that $h$ is Lipschitz and applying Gronwall's inequality (lemma 2), there exists a constant $K_T > 0$ such that:
\[
\sup_{t \in [t(n), t(m)]} \|h(x^n(t))\| \le K_T.
\]
We have used the fact that $h$ is Lipschitz so it grows at most linearly: for all $x$, $\|h(x) - h(0)\| \le L\|x\|$, so that $\|h(x)\| \le \|h(0)\| + L\|x\|$. Therefore, by the fundamental theorem of calculus, for $t \in [t(u), t(u+1)]$:
\[
\|x^n(t) - x^n(t(u))\| \le \int_{t(u)}^{t(u+1)} \|h(x^n(v))\|\,dv \le \epsilon_u K_T.
\]
Plugging the bounds on $A_k$ and $B_u$ into (1) and applying the discrete Gronwall inequality (lemma 3), we obtain for $n$ large enough:
\[
\sup_{n \le k \le m} C_k \le \delta e^{LT}.
\]
Applying the fundamental theorem of calculus twice, $x^n(t)$ can be written:
\[
x^n(t) = x^n(t(k)) + \int_{t(k)}^{t} h(x^n(v))\,dv = x^n(t(k+1)) - \int_{t}^{t(k+1)} h(x^n(v))\,dv.
\]
Therefore the error due to linear interpolation can be upper bounded as follows: for $t \in [t(k), t(k+1)]$, $x(t)$ lies between $x_k$ and $x_{k+1}$, while $x^n(t)$ is within $\epsilon_k K_T$ of both $x^n(t(k))$ and $x^n(t(k+1))$, so that
\[
\sup_{t \in [t(n), t(m)]} \|x(t) - x^n(t)\| \le \sup_{n \le k \le m} C_k + K_T \sup_{k \ge n} \epsilon_k \le \delta e^{LT} + K_T \sup_{k \ge n} \epsilon_k,
\]
which can be made arbitrarily small by choosing $n$ large. This concludes the proof of lemma 1.
3 Appendix
3.1 Ordinary differential equations
We state here two basic results on o.d.e.s used in the proof of the main theorem.
Lemma 2 (Gronwall's inequality, continuous case). Consider $K, L \ge 0$ and a continuous function $u : [0, T] \to \mathbb{R}^+$ such that for all $t \in [0, T]$:
\[
u(t) \le K + L \int_0^t u(s)\,ds.
\]
Then for all $t \in [0, T]$: $u(t) \le K e^{Lt}$.
Lemma 3 (Gronwall’s inequality, discrete case). Consider K ≥ 0 and positive sequences
{xn } , {n } such that for all 0 ≤ n ≤ N :
n
X
xn+1 ≤ K + n xn .
u=0
PN
Then we have the upper bound: sup0≤n≤N xn ≤ Ke n=0 n .
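For completeness, the discrete case follows from a standard induction argument (a sketch of ours, not spelled out in the text):
\[
x_{n+1} \le K \prod_{u=0}^{n} (1 + \epsilon_u) \le K \exp\Big(\sum_{u=0}^{n} \epsilon_u\Big),
\]
where the first inequality is proven by induction on $n$ and the second uses $1 + y \le e^y$.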
3.2 Martingales
We state the martingale convergence theorem, which is required to control the random fluctuations of the stochastic approximation in the proof of the main theorem.
Consider a sequence of $\sigma$-fields $\mathcal{F} = (\mathcal{F}_n)_{n\in\mathbb{N}}$, and $\{M_n\}_{n\in\mathbb{N}}$ a sequence of random variables in $\mathbb{R}^d$. We say that $\{M_n\}_{n\in\mathbb{N}}$ is an $\mathcal{F}$-martingale if $M_n$ is $\mathcal{F}_n$-measurable and $\mathbb{E}[M_{n+1} | \mathcal{F}_n] = M_n$. The following theorem (due to Doob) states that if the sum of squared increments of a martingale is finite (in expectation), then this martingale has a finite limit a.s.

Lemma 4 (Martingale convergence theorem). Let $\{M_n\}_{n\in\mathbb{N}}$ be a square integrable $\mathcal{F}$-martingale. If
\[
\sum_{n \ge 0} \mathbb{E}[\|M_{n+1} - M_n\|^2 | \mathcal{F}_n] < \infty \ \text{a.s.},
\]
then there exists a random variable $M_\infty \in \mathbb{R}^d$ such that $\|M_\infty\| < \infty$ a.s. and $M_n \to_{n\to\infty} M_\infty$ a.s.
References
[1] Vivek S. Borkar. Stochastic approximation: a dynamical systems viewpoint. Cambridge
University Press, 2008.
[2] G. W. Brown. Iterative solutions of games by fictitious play. Activity Analysis of Production and Allocation, 1951.
[3] Drew Fudenberg and David K. Levine. The Theory of Learning in Games. The MIT Press, 1998.
[4] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
[5] L. Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22(4):551–575, 1977.
[6] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals
of Mathematical Statistics, 22(3):400–407, September 1951.