0% found this document useful (0 votes)
9 views

Stochastic Online Opitmization Using Kalman Recursion - Paper

The document discusses an algorithm called the Extended Kalman Filter (EKF) for stochastic online optimization. It provides theoretical guarantees for the EKF in terms of bounds on the cumulative excess risk under assumptions of generalized linear models and local convergence. The EKF is presented as a parameter-free online procedure that can optimally solve some unconstrained optimization problems.

Uploaded by

visava789
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views

Stochastic Online Opitmization Using Kalman Recursion - Paper

The document discusses an algorithm called the Extended Kalman Filter (EKF) for stochastic online optimization. It provides theoretical guarantees for the EKF in terms of bounds on the cumulative excess risk under assumptions of generalized linear models and local convergence. The EKF is presented as a parameter-free online procedure that can optimally solve some unconstrained optimization problems.

Uploaded by

visava789
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 55

Journal of Machine Learning Research 22 (2021) 1-55 Submitted 6/20; Revised 8/21; Published 9/21

Stochastic Online Optimization using Kalman Recursion

Joseph de Vilmarest joseph.de [email protected]


Laboratoire de Probabilités, Statistique et Modélisation, Sorbonne Université, CNRS
4 place Jussieu, 75005 Paris, France
and Électricité de France R&D
Olivier Wintenberger [email protected]
Laboratoire de Probabilités, Statistique et Modélisation, Sorbonne Université, CNRS
4 place Jussieu, 75005 Paris, France

Editor: Tong Zhang

Abstract
We study the Extended Kalman Filter in constant dynamics, offering a bayesian perspective
of stochastic optimization. For generalized linear models, we obtain high probability bounds
on the cumulative excess risk in an unconstrained setting, under the assumption that the
algorithm reaches a local phase. In order to avoid any projection step we propose a two-
phase analysis. First, for linear and logistic regressions, we prove that the algorithm enters
a local phase where the estimate stays in a small region around the optimum. We provide
explicit bounds with high probability on this convergence time, slightly modifying the
Extended Kalman Filter in the logistic setting. Second, for generalized linear regressions,
we provide a martingale analysis of the excess risk in the local phase, improving existing
ones in bounded stochastic optimization. The algorithm appears as a parameter-free online
procedure that optimally solves some unconstrained optimization problems.
Keywords: extended kalman filter, online learning, stochastic optimization

1. Introduction
The optimization of convex functions is a long-standing problem with many applications. In
supervised machine learning it frequently arises in the form of the prediction of an observa-
tion yt ∈ R given explanatory variables Xt ∈ Rd . The aim is to minimize a cost depending
on the prediction and the observation. We focus in this article on linear predictors, hence
the loss function is of the form `(yt , θ> Xt ).
Two important settings have emerged in order to analyse learning algorithms. In the
online setting (Xt , yt ) may be set by an adversary. The assumption required is bound-
edness and the goal is to upper estimate the regret (cumulative excess loss compared to
the optimum). In the stochastic setting with independent identically distributed (i.i.d.)
(X
Pnt , yt ), we define the risk L(θ) = E[`(y, θ> X)]. We focus on the cumulative excess risk
? ?
t=1 L(θ̂t ) − L(θ ) where θ minimizes the risk. We obtain bounds holding with high prob-
ability simultaneously for any horizon, that is, we control the whole trajectory of the risk.
Furthermore, our bounds on the cumulative risk all lead to similar ones on the excess risk
at any step for the averaged version of the algorithm.
Due to its low computational cost the Stochastic Gradient Descent (SGD) of Robbins
and Monro (1951) has been widely used, along with its equivalent in the online setting,

c 2021 Joseph de Vilmarest and Olivier Wintenberger.


License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided
at http://jmlr.org/papers/v22/20-618.html.
de Vilmarest and Wintenberger

the online gradient descent (Zinkevich, 2003) and a simple variant where the iterates are
averaged (Ruppert, 1988; Polyak and Juditsky, 1992). More recently Bach and Moulines
(2013) provided a sharp bound in expectation on the excess risk for a two-step procedure
that has been extended to the average of SGD with a constant step size (Bach, 2014).
Second-order methods based on stochastic versions of Newton-Raphson algorithm have been
developed in order to converge faster in iterations, although with a bigger computational
cost per iteration (Hazan et al., 2007).
In order to obtain a parameter-free second-order algorithm we apply a bayesian perspec-
tive, seeing the loss as a negative log-likelihood and approximating the maximum-likelihood
estimator at each step. We get a state-space model interpretation of the optimization prob-
lem: in a well-specified setting the space equation is yt ∼ pθt (· | Xt ) ∝ exp(−`(·, θt> Xt ))
with θt ∈ Rd and the state equation defines the dynamics of the state θt . The stochastic con-
vex optimization setting corresponds to a degenerate constant state-space model θt = θt−1
called static. As usual in State-Space models, the optimization is realized with the Kalman
recursion (Kalman and Bucy, 1961) for the quadratic loss and the Extended Kalman Filter
(EKF) (Fahrmeir, 1992) in a more general case. A correspondence has recently been made
by Ollivier (2018) between the static EKF and the online natural gradient (Amari, 1998).
This motivates a risk analysis in order to enrich the link between Kalman filtering and
the optimization community. We may see the static EKF as the online approximation of
bayesian model averaging, and similarly to its analysis derived by Kakade and Ng (2005)
our analysis is robust to misspecification, that is we don’t assume the data to be generated
by the probabilistic model.
The static EKF is very close to the Online Newton Step (ONS) introduced by Hazan
et al. (2007) as both are second-order online algorithms and our results are of the same
flavor as those obtained on the ONS (Mahdavi et al., 2015). However the ONS requires the
knowledge of the region in which the optimization is realized. It is involved in the choice of
the gradient step size and a projection step is done to ensure that the search stays in the
chosen region. On the other hand the static EKF yields two advantages at the cost of being
less generic.
First, there is no costly projection step and each recursive update runs in O(d2 ) op-
erations, where d is the dimension of the features Xt . Therefore, our comparison of the
static EKF with the ONS provides a lead to the open question of Koren (2013). Indeed, the
problem of the ONS pointed out by Koren (2013) is to control the cost of the projection step
and the question is whether it is possible to perform better than the ONS in the stochastic
exp-concave setting. We don’t answer the open question in the general setting. However,
we suggest a general way to get rid of the projection by dividing the analysis between a
convergence proof of the algorithm to the optimum and a second phase where the estimate
stays in a small region around the optimum where no projection is required.
Second, the algorithm is (nearly) parameter-free. We believe that bayesian statistics is
the reasonable approach in order to obtain parameter-free online algorithms in the uncon-
strained setting. Parameter-free is not exactly correct as there are initialization parameters,
which we see as a smoothed version of the hard constraint imposed by bounded algorithm,
but they have no impact on the leading terms of our bounds. Static Kalman filter co-
incides with the ridge forecaster and similarly the static EKF may be seen as the online
approximation of a regularized empirical risk minimizer.

2
Stochastic Online Optimization using Kalman Recursion

1.1 Related Work


Theoretical guarantees for online and stochastic algorithms are multi-criteria and of various
natures. The comparison of upper-bounds or computational complexity highly depends on
the values of d the dimension of the explanatory vectors and n the time horizon, leading to
different views on whether the dependence to d or n is the most important. The nature of
the guarantee obviously depends on the objective pursued.
In the advesarial setting, the learner suffers a loss `t (θ̂t ) depending on its estimate θ̂t at
each time step t. It is natural to minimize the cumulative loss, or equivalently the regret
n
X n
X
`t (θ̂t ) − `t (θ? ) ,
t=1 t=1

where θ? reaches the minimum value of the cumulative loss and thus highly depends on
(`t )1≤t≤n . Under an assumption of bounded gradients, Zinkevich (2003) proved that a first-

order online gradient descent yields a regret bound in O( n). The Online Newton Step
(ONS) is a second-order online gradient descent that has been designed to obtain a regret
bound in O(ln n) (Hazan et al., 2007) under the assumption that the losses are exp-concave.
The improved guarantee comes at a cost of O(d2 ) operations per step instead of O(d), along
with a projection at each step whose cost depends on the data.
In the stochastic setting where the losses (`t ) are assumed i.i.d., the aim is to minimize
the risk L(θ) = E[`(θ)]. A natural candidate is the Empirical Risk Minimizer (ERM), whose
asymptotics are well understood (see for example Murata and Amari (1999)). Assuming the
∂2
existence of θ? minimizing the risk and that the Hessian matrix H ? = ∂θ ?
2 L(θ ) is positive

definite, the ERM θ̂n satisfies


tr(G? H ?−1 ) h∂ ∂ i
E[L(θ̂n )] − L(θ? ) = + o(1/n) , G? = E `(y, θ?> X) `(y, θ?> X)> .
2n ∂θ ∂θ
Although in the well-specified setting the identity tr(G? H ?−1 ) = d holds, in the misspec-
ified case there is no general estimate for tr(G? H ?−1 ). Recently a non-asymptotic bound
L(θ̂n ) − L(θ? ) = O(tr(G? H ?−1 ) ln δ −1 /n) holding with probability 1 − δ has been shown
by Ostrovskii and Bach (2021) on the ERM. However the ERM is defined only implicitly
and may have important computational cost, hence recursive algorithms based on gradient
descent have been studied under different sets of assumptions to bound tr(G? H ?−1 ).
The most simple is Stochastic Gradient Descent (SGD), where each step is in the op-
posite direction of the gradient. This algorithm has been widely used with various step
sizes. Sharp results have been obtained by Bach (2014) for a constant gradient step size

C/ n with fixed horizon n. Under the assumption that gradients are bounded by R
we have tr(G? H ?−1 ) ≤ R2 /µ where µ is the minimal eigenvalue of H ? . The fast rate
E[L(θn )] − L(θ? ) = O(R2 /(µn)) is obtained by Bach (2014) for the averaged estimate θn of
SGD. In the same article the author also derives a bound with high probability but with a

slower rate: it degrades into L(θn ) − L(θ? ) = O(log δ −1 / n) with probability 1 − δ. Finally,
in the quadratic setting a fast rate L(θn ) − L(θ? ) = O(1/(nδ α )) is achieved with probability
1 − δ for a defined α > 0 (Bach and Moulines, 2013).
To obtain fast rates with high probability beyond the quadratic setting, it seems nec-
essary to use second-order information as in the ONS (Mahdavi et al., 2015). Under the

3
de Vilmarest and Wintenberger

ERM Averaged SGD ONS This article


Regret O(ln n)
Excess risk in expectation O( n1 ) O( n1 )
−1 −1
Excess risk w.h.p. O( ln δn ) O( ln√δn )
Cumulative excess risk w.h.p. O(ln n + ln δ −1 ) O(ln n + ln δ −1 + S(δ))
Cost per iteration Implicit O(d) O(d2 ) + Tproj O(d2 )

Table 1: Summary of existing results along with the static EKF for which we prove the
bound for the cumulative excess risk. We focus on the dependence on n, and δ for
the bounds holding with probability 1 − δ (w.h.p.). S(δ) is the cumulative excess
risk of the convergence phase.

assumption that the loss is α-exp-concave, tr(G? H ?−1 ) ≤ d/α and for the averaged version
of the ONS the rate L(θn ) − L(θ? ) = O(d(ln n + ln δ −1 )/(αn)) with probability 1 − δ is
obtained. From our perspective, the result is stronger than what is claimed by Mahdavi
et al. (2015): the bound obtained is
n
X
L(θ̂n ) − L(θ? ) = O(ln n + ln δ −1 ) , (1)
t=1

holding simultaneously for any n with probability 1 − δ. Note that although averaging this
bound with Jensen’s inequality leads to a sub-optimal bound on the excess risk of the last
averaged estimate, it is conversely not possible to obtain Equation (1) from
L(θ̂n ) − L(θ? ) = O(ln δ −1 /n) ,
holding with probability 1 − δ.

1.2 Contributions
Our first contribution is a local analysis of the static EKF under assumptions defined in
Section 2, and provided that consecutive steps stay in a small ball around the optimum
θ? . We derive local bounds on the cumulative risk with high probability from a martingale
analysis. Our analysis of Section 3 is similar to the one of Mahdavi et al. (2015) and we
slightly refine their constants as a by-product.
We then show that the convergence property crucial in our analysis is reachable. To that
end we focus on linear regression and logistic regression as these two well-known problems
are challenging in the unconstrained setting. In linear regression, the gradient of the loss is
not bounded globally. In logistic regression, the loss is strictly convex, but neither strongly
convex nor exp-concave in the unconstrained setting. In Section 4, we develop a global
bound in the logistic setting on a slight modification of the algorithm introduced by Bercu
et al. (2020). We prove that this modified algorithm converges and stays into a local region
around θ? after a finite number of steps. Moreover we show that it coincides with the static
EKF and thus our local analysis applies. In Section 5, we apply our local analysis to the
quadratic setting. We rely on Hsu et al. (2012) to obtain the convergence of the algorithm
after exhibiting the correspondence between Kalman filter in constant dynamics and the
ridge forecaster, and we therefore obtain similarly a global bound.

4
Stochastic Online Optimization using Kalman Recursion

Finally, we demonstrate numerically the competitiveness of the static EKF for logistic
regression in Section 6.

2. Definitions and Assumptions


We consider loss functions that may be written as the negative log-likelihood of a generalized
linear model (McCullagh and Nelder, 1989). Formally, the loss is defined as `(y, θ> X) =
− log pθ (y | X) where θ ∈ Rd , (X, y) ∈ X × Y for some X ⊂ Rd and Y ⊂ R and pθ is of the
form  >
y θ X − b(θ> X)

pθ (y | X) = h(y) exp , (2)
a
where a is a constant and h and b are one-dimensional functions on which a few assumptions
are required (Assumption 3). This includes logistic and quadratic regressions, see Sections
4 and 5. We display the static EKF in Algorithm 1 in this setting (see Appendix A for a
derivation relying on Durbin and Koopman, 2012). In the quadratic setting, noting that the
EKF estimate θ̂t does not depend on the (unknown) variance σ 2 , we consider the quadratic
loss `(y, ŷ) = (y − ŷ)2 /2 by convention.

Algorithm 1: Static Extended Kalman Filter for Generalized Linear Model

1. Initialization: P1 is any positive definite matrix, θ̂1 is any initial parameter in Rd .

2. Iteration: at each time step t = 1, 2, . . .


Pt Xt Xt> Pt b00 (θ̂t> Xt )
(a) Update Pt+1 = Pt − α
1+Xt> Pt Xt αt t
with αt = a .
(yt −b0 (θ̂t> Xt ))Xt
(b) Update θ̂t+1 = θ̂t + Pt+1 a .

Due to matrix-vector and vector-vector multiplications, Algorithm 1 has a running-time


complexity of O(d2 ) at each iteration and thus O(nd2 ) for n iterations.
Note that although we need the loss function to be derived from a likelihood of the form
(2), we do not need the data to be generated under this process. We need two standard
hypotheses on the data. The first one is the i.i.d. assumption and bounded random design
(all along the paper k.k is the Euclidean norm):

Assumption 1 The observations (Xt , yt )t are i.i.d. copies of the pair (X, y) ∈ X × Y and
there exists DX such that kXt k ≤ DX almost surely (a.s.).

Working under Assumption 1, we define the risk function L(θ) = E `(y, θ> X) . Note that
 

in Section 3 we don’t need E[XX > ] invertible, but we will make such an assumption in
Sections 4 and 5 to prove the convergence of the algorithm in the logistic and quadratic
settings, respectively. In order to work on a well-defined optimization problem we assume
there exists a minimum:

Assumption 2 There exists θ? ∈ Rd such that L(θ? ) = inf L(θ).


θ∈Rd

5
de Vilmarest and Wintenberger

We treat two different settings requiring different assumptions, summarized in Assump-


tion 3 and 4 respectively. First, motivated by logistic regression we define:
Assumption 3 There exists (κε )ε>0 , (hε )ε>0 and ρε −−−→ 1 such that for any ε > 0 and
ε→0
any θ, θ0 ∈ Rd satisfying max(kθ − θ? k, kθ0 − θ? k) ≤ ε, we have
• `0 (y, θ> X)2 ≤ κε `00 (y, θ> X) a.s.

• `00 (y, θ> X) ≤ hε a.s.

• `00 (y, θ> X) ≥ ρε `00 (y, θ0> X) a.s.


Here `0 and `00 are the first and second derivatives of ` with respect to the second variable.
Assumption 3 requires local exp-concavity (around θ? ) along with some regularity on `00
(`00 continuous and `00 (y, θ?> X) ≥ γ > 0 a.s. is sufficient). That setting implies Y bounded,
because `0 depends on y whereas `00 doesn’t. In logistic regression, Y = {−1, +1} and
?
Assumption 3 is satisfied for κε = eDX (kθ k+ε) , hε = 41 , ρε = e−εDX .
Second, we consider the quadratic loss, corresponding to a gaussian model. In order to
include the well-specified setting and to bound G? = E[(y − θ?> X)2 XX > ], we assume y
sub-gaussian conditionally to X and not too far away from the model:
Assumption 4 The distribution of (X, y) ∈ X × Y satisfies
σ 2 s2
• There exists σ 2 > 0 such that for any s ∈ R, E es(y−E[y|X]) | X ≤ e 2 a.s.,
 

• There exists Dapp ≥ 0 such that |E[y | X] − θ?> X| ≤ Dapp a.s.


Both conditions of Assumption 4 hold with Y = R and Dapp = 0 for the well-specified
gaussian linear model with random bounded design. The second condition of Assumption
4 is satisfied for Dapp > 0 in misspecified sub-gaussian linear model with a.s. bounded
approximation error.

3. The Algorithm Around the Optimum


In this section, we analyse the cumulative risk under a strong convergence assumption.
Precisely we define

τ (ε) = min{k ∈ N | ∀t > k, kθ̂t − θ? k ≤ ε} ,

where (θ̂t )t are the estimates of the static EKF, and with the convention min ∅ = +∞. We
assume a bound on τ (ε) holding with high probability:
Assumption 5 For any δ, ε > 0, there exists T (ε, δ) ∈ N such that

P τ (ε) ≤ T (ε, δ) ≥ 1 − δ .

Assumption 5 states that with high probability there exists a convergence time after which
the algorithm stays trapped in a local region around the optimum. Sections 4 and 5 are
devoted to define explicitly such a convergence time for a modified EKF in the logistic
setting and for the true EKF in the quadratic setting.

6
Stochastic Online Optimization using Kalman Recursion

We present our result in the bounded and sub-gaussian settings. The results and their
proofs are very similar, but two crucial steps are different. First, Assumption 3 yields a
bound on the gradient holding almost surely. We relax the boundedness condition for the
quadratic loss with a sub-gaussian hypothesis, requiring a specific analysis. Second, our
analysis is based on a second-order expansion. The quadratic loss is equal to its second-
order Taylor expansion but we need Assumption 5 along with the third point of Assumption
3 otherwise.
The following theorem is our result in the bounded setting.
Theorem 1 Starting the static EKF from any θ̂1 ∈ Rd , P1  0, if Assumptions 1, 2, 3, 5
are satisfied and if ρε > 0.95, for any δ > 0, it holds for any n ≥ 1 simultaneously
T (ε,δ)+n  2 
X 5 hε λmax (P1 )DX  
L(θ̂t ) − L(θ? ) ≤ dκε ln 1 + n + 5λmax PT−1
(ε,δ)+1 ε
2
2 d
t=T (ε,δ)+1

+ 30 2κε + hε ε2 DX
2
ln δ −1 ,


with probability at least 1 − 3δ.


The constant 0.95 may be chosen arbitrarily close to 0.5 with growing constants in the bound
on the cumulative risk. There is a hidden trade-off in ε: on the one hand, the smaller ε
the better our upper-bound, but on the other hand T (ε, δ) increases when ε decreases, and
thus our bound applies after a bigger convergence time.
For the quadratic loss, we obtain the following result under the sub-gaussian hypothesis.
Theorem 2 In the quadratic setting, starting the static EKF from any θ̂1 ∈ Rd , P1  0, if
Assumptions 1, 2, 4 and 5 are satisfied, for any δ > 0 and any ε > 0, it holds for any n ≥ 1
simultaneously
T (ε,δ)+n  2 
X
? 15 2 2 2 2
 λmax (P1 )DX
L(θ̂t ) − L(θ ) ≤ d 8σ + Dapp + ε DX ln 1 + n
2 d
t=T (ε,δ)+1
   
λmax (P1 )DX2  
+ 5λmax PT−1 (ε,δ)+1 ε 2
+ 115 σ 2
4 + + D 2
app + 2ε 2 2
D X ln δ
−1
,
4
with probability at least 1 − 5δ.
We observe a similar trade-off in ε as in Theorem 1. Up to numerical constants, the tight
constant d(σ 2 +Dapp
2 ) (see for instance Hsu et al., 2012) is achieved by choosing ε arbitrarily

small, at the cost of a loose control of the T (ε, δ) first steps.


Both results follow from a regret analysis close to the one on the ONS (see Section 3.1),
and on a control on the martingales stated below:
Lemma 3 Let k ≥ 0 and (∆Nt )t>k be any martingale difference adapted to the filtration
(Ft )t≥k such that for any t > k, E[∆Nt2 | Ft−1 ] < ∞. For any δ, λ > 0, we have the
simultaneous property
k+n
X  ln δ −1

λ 2 2
∆Nt − ((∆Nt ) + E[(∆Nt ) | Ft−1 ]) ≤ , n ≥ 1,
2 λ
t=k+1

with probability at least 1 − δ.

7
de Vilmarest and Wintenberger

Algorithm 2: Recursive updates of the ONS and the static EKF


Online Newton Step Static Extended Kalman Filter
−1
Pt+1 = Pt−1 + `0 (yt , wt> Xt )2 Xt Xt> , −1
Pt+1 = Pt−1 + `00 (yt , θ̂t> Xt )Xt Xt> ,
∇t = `0 (yt , wt> Xt )Xt , ∇t = `0 (yt , θ̂t> Xt )Xt ,
−1
Pt+1  
Y 1
wt+1 = wt − Pt+1 ∇t , θ̂t+1 = θ̂t − Pt+1 ∇t ,
γ
K

−1
Pt+1
Q
where is the projection on K for the norm k.kP −1 .
t+1
K

This result proved in Appendix B.1 is a corollary of a martingale inequality from Bercu and
Touati (2008) and a stopping time construction (Freedman, 1975).
We detail the key ideas of the proofs in the rest of the Section, and we defer to Appendix
B the proof of the intermediate results along with the detailed proof of Theorems 1 and 2.
Specifically, we display in Section 3.1 the parallel with the ONS, where we compare with
the existing result on the cumulative risk, and a similar analysis yields an adversarial bound
on a second-order expansion of the loss. In Section 3.2 we compare the excess risk with
its second-order expansion thanks to Assumption 5, and we use a martingale analysis to
obtain a bound on the cumulative excess risk.

3.1 Comparison with Online Newton Step and Adversarial Analysis


We display the parallel between the ONS and the static EKF in Algorithm 2 through their
recursive updates. We observe that the square of the first derivative of ` is replaced with
the second derivative. Thus tPt−1 in the static EKF is an estimate of the Hessian H ? which
is the optimal preconditioning matrix as shown in Corollary 3 of Murata and Amari (1999).
Then the recursion on the parameter (wt and θ̂t ) has two differences: there is a gradient
step size 1/γ in the ONS absent in the static EKF, and after the gradient step the ONS
applies a projection. Lemma 3 yields the following refinement on the bound of Mahdavi
et al. (2015) obtained on the cumulative excess risk of the ONS:

Corollary 4 Assume the search region K has diameter D and the gradients are bounded
by R. Let (wt )t be the ONS estimates starting from w1 ∈ K, P1 = λI and using a step-size
γ = 12 min( 4RD
1
, α) with α the exp-concavity constant of ` on K. Then for any δ > 0, it
holds for any n ≥ 1 simultaneously
n
nR2 12 4γR2 D2
   
X 3 λγ 2
?
L(wt ) − L(θ ) ≤ d ln 1 + + D + + ln δ −1 ,
2γ λd 6 γ 3
t=1

with probability at least 1 − 2δ.

For the sake of consistency, we display Corollary 4 as a bound on the cumulative excess risk,
whereas Theorem 3 of Mahdavi et al. (2015) is a bound on the excess risk of the averaged

8
Stochastic Online Optimization using Kalman Recursion

ONS. The latter follows directly from an application of Jensen’s inequality. The proof of
Corollary 4 consists in replacing Theorem 4 of Mahdavi et al. (2015) with Lemma 3. We
obtain similar constants in Theorem 1 and in Corollary 4, as κε is the inverse of the exp-
concavity constant α. The use of second-order methods with well-tuned preconditioning is
crucial in order to replace the leading constant R2 /µ obtained for first-order methods by
d/α (µ is the minimum eigenvalue of the hessian H ? ).
Our results on the static EKF are less general than the ones obtained on the ONS as
a control of the convergence time τ (ε) ≤ T (ε, δ) is required with high probability. On the
other hand the results obtained on the ONS require the knowledge of the exp-concavity
constant α whereas the static EKF is parameter-free. That is why we argue that the static
EKF provides an optimal way to tune the step size and the preconditioning matrix. Indeed,
as ε is a parameter of the EKF analysis but not of the algorithm, we can improve the
leading constant κε on a local region arbitrarily small around θ? , at a cost of a loose control
of the T (ε, δ) first steps. In the ONS the choice of a diameter D > kθ? k makes the gradient
step-size sub-optimal and impacts the leading constant.
Once the parallel between the ONS and the static EKF has been displayed (Algorithm 2),
it is natural to adopt an approach similar to the one in Hazan et al. (2007). The cornerstone
of our local analysis is the derivation of an adversarial bound on the second-order Taylor
expansion of `, from the recursive update formulae.
Lemma 5 For any sequence (Xt , yt )t , starting from any θ̂1 ∈ Rd , P1  0, it holds for any
θ? ∈ Rd and n ∈ N that
n  > 
X
0 > ? 1 ? >

00 > >

?
` (yt , θ̂t Xt )Xt (θ̂t − θ ) − (θ̂t − θ ) ` (yt , θ̂t Xt )Xt Xt (θ̂t − θ )
2
t=1
n
1X > kθ̂1 − θ? k2
≤ Xt Pt+1 Xt `0 (yt , θ̂t> Xt )2 + .
2 λmin (P1 )
t=1

We cannot compare the excess loss with the second-order Taylor expansion in general,
and it is natural to use a step size parameter. In Hazan et al. (2007), the regret analysis of
the ONS is based on a very similar bound on
 > γ  
`0 (yt , wt> Xt )Xt (wt − θ? ) − (wt − θ? )> `0 (yt , wt> Xt )2 Xt Xt> (wt − θ? ) ,
2
where γ is a step size parameter. Then the regret bound follows from the exp-concavity
property, bounding the excess loss `(yt , wt> Xt ) − `(yt , θ?> Xt ) with the previous quantity
for a specific γ. The dependence of γ on the exp-concavity constant and the bound on the
gradients require that the algorithm stays in a bounded region around the optimum θ? , and
a projection on this region is used, potentially at each step.
We follow a very different approach, to stay parameter-free, unconstrained and to avoid
any additional cost in the leading constant. In the stochastic setting, we observe that we can
upper-bound the excess risk with a second-order expansion, up to a multiplicative factor.

3.2 From Adversarial to Stochastic: the Cumulative Risk


In order to compare the excess risk with a second-order expansion, we compare the first-
order term with the second-order one.

9
de Vilmarest and Wintenberger

Proposition 6 If Assumptions 1, 2 and 3 are satisfied, for any θ ∈ Rd , it holds


∂L > ∂2L
(θ − θ? ) ≥ ρkθ−θ? k (θ − θ? )> (θ − θ? ) .
∂θ θ ∂θ2 θ
This result leads immediately to the following proposition, using the first-order convexity
property of L.
Proposition 7 If Assumptions 1, 2 and 3 are satisfied, for any θ ∈ Rd , 0 < c < ρkθ−θ? k ,
it holds
ρkθ−θ? k 2
 
? ∂L > ? ? >∂ L ?
L(θ) − L(θ ) ≤ (θ − θ ) − c(θ − θ ) (θ − θ ) .
ρkθ−θ? k − c ∂θ θ ∂θ2 θ

Lemma 5 motivates the use of c > 21 , thus we need at least ρkθ−θ? k > 12 . In the quadratic
setting, it holds as an equality with ρ = 1 because the second derivative of the quadratic
loss is constant. In the bounded setting we need to control the second derivative in a small
range, and we can achieve that only locally, therefore we separate the condition ρkθ−θ? k > 12
between the third point of Assumption 3 and Assumption 5.
Then we are left to obtain a bound on the cumulative risk from Lemma 5. In order
to compare the derivatives of the risk and the losses, we need to control the martingale
difference adapted to the natural filtration (Ft ) and defined as
 >
∂L
∆Mt = − ∇t (θ̂t − θ? ), where ∇t = `0 (yt , θ̂t> Xt )Xt .
∂θ θ̂t
We thus apply Lemma 3 to this martingale difference.
Lemma 8 Starting the static EKF from any θ̂1 ∈ Rd , P1  0, if Assumptions 1 and 2 are
satisfied, for any k ≥ 0 and δ, λ > 0, it holds simultaneously
k+n
ln δ −1
 
X  3  
∆Mt − λ(θ̂t − θ? )> ∇t ∇>
t + E ∇ ∇
t t
>
| Ft−1 ( θ̂ t − θ ?
) ≤ , n ≥ 1,
2 λ
t=k+1

with probability at least 1 − δ.


The proof of Theorems 1 and 2 deferred to Appendix B builds on the above results.
Summing Lemma 5 and 8, we obtain for any δ, λ > 0 the simultaneous bound
T (ε,δ)+n  
X ∂L > 1
(2) 3  
? >
?
(θ̂t − θ ) − (θ̂t − θ ) ∇t + λ∇t ∇>
t
> ?
+ λE ∇t ∇t | Ft−1 (θ̂t − θ )
∂θ θ̂t 2 2
t=T (ε,δ)+1
T (ε,δ)+n
1 X kθ̂1 − θ? k2 ln δ −1
≤ Xt> Pt+1 Xt `0 (yt , θ̂t> Xt )2 + + , n ≥ 1,
2 λmin (PT (ε,δ)+1 ) λ
t=T (ε,δ)+1

(2)
with probability at least 1 − δ, where we define ∇t = `00 (yt , θ̂t> Xt )Xt Xt> for any t. In the
last equation, we control (see Appendix B.4 and B.5) the quadratic term in θ̂t − θ? on the
2
left hand-side in terms of (θ̂t −θ? )> ∂∂θL2 |θ̂t (θ̂t −θ? ) in order to lower-bound the left expression
proportionally to the cumulative excess risk using Proposition 7 for well chosen λ.

10
Stochastic Online Optimization using Kalman Recursion

Algorithm 3: Truncated Extended Kalman Filter for Logistic Regression

1. Initialization: P1 is any positive definite matrix, θ̂1 is any initial parameter in Rd .

2. Iteration: at each time step t = 1, 2, . . .


 
Pt Xt Xt> Pt 1 1
(a) Update Pt+1 = Pt − α,
1+Xt> Pt Xt αt t
with αt = max ,
tβ (1+eθ̂t> Xt )(1+e−θ̂t> Xt )
.
yt Xt
(b) Update θ̂t+1 = θ̂t + Pt+1 >X .
1+eyt θ̂t t

4. Logistic Setting
Logistic regression is a widely used statistical model in classification. The prediction of a
binary random variable y ∈ Y = {−1, 1} consists in modelling L(y | X) with
>
!
1 yθ> X − (2 ln(1 + eθ X ) − θ> X)
pθ (y | X) = = exp .
1 + e−yθ> X 2
>X
In the GLM notations, it yields a = 2 and b(θ> X) = 2 ln(1 + eθ ) − θ> X.

4.1 Results for the Truncated Algorithm


In order to prove the convergence of the algorithm needed in the local phase, we follow a trick
introduced by Bercu et al. (2020) consisting in changing slightly the update on Pt . Indeed,
when the authors tried to prove the asymptotic convergence of the static EKF (which
they named stochastic Newton step) using Robbins-Siegmund Theorem, they needed the
convergence of t λmax (Pt )2 . This seems very likely to hold as we have intuitively Pt ∝ 1/t.
P
However, in order to obtain λmax (Pt ) = O(1/t), one needs to lower-bound αt , that is, to
upper-bound |θ̂t> Xt |, and that is impossible in the global logistic setting. Therefore, the
idea is to force a lower-bound on αt in its definition. We thus define, for some 0 < β < 1/2,
!
1 1
αt = max β , , t ≥ 1.
t (1 + eθ̂t> Xt )(1 + e−θ̂t> Xt )

This modification yields Algorithm 3, where we keep the notations θ̂t , Pt , τ (ε) with some
abuse in the rest of this section. We impose a decreasing threshold on αt (β > 0) and we
prove that the recursion coincides with Algorithm 1 after some steps. The sensitivity of
the algorithm to β is discussed at the end of Section 4.2. Also, note that the threshold
could be c/tβ , c > 0, as in Bercu et al. (2020). We consider 1/tβ for clarity. We control the
convergence time τ (ε) of this version of the EKF:
Proposition 9 Starting Algorithm 3 from θ̂1 = 0 and any P1  0, if Assumptions 1 and 2
are satisfied and E[XX > ] is invertible, for any ε, δ > 0, it holds τ (ε) ≤ T (ε, δ) along with
1
∀t > T (ε, δ), αt = > >X
,
(1 + eθ̂t Xt )(1 + e−θ̂t t )

11
de Vilmarest and Wintenberger

with probability at least 1 − δ, where T (ε, δ) ∈ N is defined in Corollary 13.

Besides the convergence of the truncated EKF, the proposition states that the truncated
recursions coincide with the static EKF ones after the first T (ε, δ) steps. Thus we can apply
our analysis of Section 3. We state the global result for ε = 1/(20DX ):

Theorem 10 Under the assumptions of Proposition 9, for any δ > 0, it holds for any
n ≥ 1 simultaneously

n 2 
λmax (P1−1 )

X
DX kθ? k λmax (P1 )DX ?
?
L(θ̂t ) − L(θ ) ≤ 3de ln 1 + n + 2 + 64eDX kθ k ln δ −1
4d 75DX
t=1
 1  1   1 2 λ (P )D2
max 1
+T ,δ + DX kθ̂1 − θ? k + T ,δ X
,
20DX 300 20DX 2

with probability at least 1 − 4δ, where T (1/(20DX ), δ) is defined in Corollary 13.

4.2 Explicit Definition of T (ε, δ) in Proposition 9


It is proved that kθ̂n − θ? k2 = O (ln n/n) almost surely (Bercu et al., 2020, Theorem 4.2).
We don’t obtain a non-asymptotic version of this rate of convergence, but the aim of this
paragraph is to prove Proposition 9 for an explicit value of T (ε, δ) for any δ, ε > 0.
The objective of the truncation introduced in the algorithm is to improve the control
on Pt . We state that fact formally with a concentration result relying on Tropp (2012). We
define Λmin the smallest eigenvalue of E[XX > ].

Proposition 11 Under the assumptions of Proposition 9, for any δ > 0, it holds simulta-
neously

20DX4 
625dDX 8 1/(1−β)
4
∀t > 2 ln 4 , λmax (Pt ) ≤ ,
Λmin Λmin δ Λmin t1−β
with probability at least 1 − δ.

This proposition justifies the choice β < 1/2 in the introduction of the truncated algo-
rithm to satisfy the condition t λmax (Pt )2 < +∞ with high probability. Motivated by
P
Proposition 11, we define, for C > 0, the event
∞ 
\ C 
AC := λmax (Pt ) ≤ .
t1−β
t=1

To obtain a control on Pt holding for any t, we use the relation λmax (Pt ) ≤ λmax (P1 ) holding
almost surely. We thus define
 
20DX 4  625dD8  
4 X
Cδ = max , λmax (P1 ) ln ,
Λmin Λ2min Λ4min δ

and we obtain P (ACδ ) ≥ 1 − δ. We obtain the following theorem under that condition.

12
Stochastic Online Optimization using Kalman Recursion

Theorem
 12 Under the assumptions of Proposition 9, we have for any δ, ε > 0 and t ≥
8 C 2 (1+eDX (kθ ? k+ε) )3
28 DX
exp δ
Λ3 (1−2β)3/2 ε2
,
min

√ Λ6min (1 − 2β)ε4
 
? 2
P(kθ̂t − θ k > ε | ACδ ) ≤ ( t + 1) exp − 12 C 2 (1 + eDX (kθ? k+ε) )6
ln(t)
216 DX δ
Λ2min (1 − 2β)ε4 √
 
1−2β
+ t exp − 11 4 2 ( t − 1) .
2 DX Cδ (1 + eDX (kθ? k+ε) )2

The beginning of our convergence proof starts similarly as the analysis of Bercu et al.
(2020): we obtain a recursive inequality ensuring that (L(θ̂t ))t is decreasing in expectation.
However, in order to obtain a non-asymptotic result we cannot apply Robbins-Siegmund
Theorem. Instead we use the fact that the variations of the algorithm θ̂t are slow thanks
to the control on Pt . Thus, if the algorithm was far from the optimum, the last estimates
were far too which contradicts the decrease in expectation of the risk. Consequently, we
look at the last k ≤ t such that kθ̂k − θ? k < ε/2, if it exists. We decompose the probability
of being
√ outside the local region in two scenarii, yielding the two terms in Theorem 12. If
k < t, the recursive decrease in expectation√makes it unlikely that the estimate stays far
from the optimum for a long period. If k > t, the control on Pt allows a control on the
probability that the algorithm moves fast, in t − k steps, away from the optimum.
The following corollary explicitly defines a guarantee for the convergence time.

Corollary 13 Proposition 9 holds with for any ε, δ > 0

12 C 2 (1 + eDX (kθ? k+ε) )6 


 3 · 215 DX
!
 1/β
(kθ? k+ε) δ/2
T (ε, δ) = max 2(1 + eDX ) , exp , 6δ −1 .
Λ6min (1 − 2β)3/2 ε4

This definition of T (ε, δ) allows a discussion on the dependence of the bound Theorem 10
to the different parameters. Note that the choice ε = 1/(20DX ) in Theorem 10 is artificially
made for simplifying constants since the bound actually holds for any ε > 0 simultaneously.
The truncation has introduced an extra parameter 0 < β < 1/2 that does not impact the
leading term in Theorem 10. However, it impacts the first step control in an intricate way.
On the one hand, when β is close to 0, the algorithm is slow to coincide with the true EKF
as T (ε, δ) = eO(1)/β . On the other hand, the larger β, the larger our control on λmax (Pt )
3/2
and thus we get T (ε, δ) = eO(1)/(1−2β) . Practical considerations show that the truncation
is artificial and can even deteriorate the performence of the EKF, see Section 6. Thus Bercu
et al. (2020) suggest to choose β = 0.49.
The dependence on δ is even more complex. The third constraint on T (ε, δ) is O(δ −1 )
which should not be sharp. To improve this lousy dependence in the bound, one needs a
better control of Pt . It would follow from a specific analysis of the O(ln δ −1 ) first recursions
in order to ”initialize” the control on Pt . However the objective of Corollary 13 was to prove
Proposition 9 and not to get an optimal value of T (ε, δ). A refinement of our convergence
analysis following from a tighter control on Pt of the EKF than the one provided by Tropp
(2012) is a very important and challenging open question.

13
de Vilmarest and Wintenberger

5. Quadratic Setting
We obtain a global result for the quadratic loss where Algorithm 1 becomes the standard
Kalman filter (recall that we take σ 2 = 1, that is `(y, ŷ) = (y − ŷ)2 /2 and a = 1, b0 (θ̂t> Xt ) =
θ̂t> Xt , αt = 1).
The parallel with the ridge forecaster was evoked by Diderrich (1985), and it is crucial
that the static Kalman filter is the ridge regression estimator for a decaying regularization
parameter. It highlights that the static EKF may be seen as an approximation of the
regularized ERM:

Proposition 14 In the quadratic setting, for any sequence (Xt , yt ), starting from any θ̂1 ∈
Rd and P1  0, the static EKF satisfies the optimisation problem
t−1
!
1X 1
θ̂t = arg min (ys − θ> Xs )2 + (θ − θ̂1 )> P1−1 (θ − θ̂1 ) , t ≥ 1.
θ∈Rd 2 2
s=1

Note that the static Kalman filter provides automatically a right choice of the ridge
regularization parameter. This equivalence yields a logarithmic regret bound for the Kalman
filter (Theorem 11.7 of Cesa-Bianchi and Lugosi, 2006). It follows from Lemma 5 as the
quadratic loss coincides with its second-order Taylor expansion. The leading term of the
bound is d ln n maxt (yt − θ̂t> Xt )2 , thus yt − θ̂t> Xt needs to be bounded.
As the static Kalman filter estimator is exactly the ridge forecaster, we can also use
the regularized empirical risk minimization properties to control T (ε, δ). In particular, we
apply the ridge analysis of Hsu et al. (2012), and we check Assumption 5:

Proposition 15 Starting from any θ̂1 ∈ Rd and P1  0, if Assumptions 1, 2 and 4 hold and
if E[XX > ] is invertible then Assumption 5 holds for T (ε, δ) defined explicitly in Appendix
D, Corollary 26.

Up to universal constants, defining Λmin as the smallest eigenvalue of E[XX > ], we get

ε−1 kθ̂1 − θ? k2 D2 √
2
T (ε, δ) . h + X (1 + Dapp ) ln δ −1 + σ 2 d
Λmin p1 Λmin
!!
kθ̂1 − θ? k
 
DX ? 2 −1
+ √ (Dapp + DX kθ k) + √ + σ ln δ ,
Λmin p1

with h(x) = x ln x. We obtain a much less dramatic dependence in ε than in the logistic
setting. However we could not avoid a Λ−1 min factor in the definition of T (ε, δ). It is not
surprising since the convergence phase relies deeply on the behavior of Pt .
As for the logistic setting, we split the cumulative risk into two sums. The sum of the
first terms is roughly bounded by a worst case analysis, and the sum of the last terms is
estimated thanks to our local analysis (Theorem 2). However, as the loss and its gradient
are not bounded we cannot obtain a similar almost sure upper-bound on the convergence
phase. The sub-gaussian assumption provides a high probability bound instead.

14
Stochastic Online Optimization using Kalman Recursion

Theorem 16 Under the assumptions of Proposition 15, for any ε, δ > 0, it holds simulta-
neously
n  2 
X 15 λmax (P1 )DX
L(θ̂t ) − L(θ? ) ≤ d 8σ 2 + Dapp
2
+ ε2 D X
2
+ 5λmax (P1−1 )ε2

ln 1 + n
2 d
t=1

λmax (P1 )DX 2 
2
+ 115 σ (4 + ) + Dapp + 2ε DX ln δ −1
2 2 2
4
 
+ DX 2
5ε2 + 2(kθ̂1 − θ? k2 + 3λmax (P1 )DX σ ln δ −1 )2 T (ε, δ)
2λmax (P1 )2 DX
4 (3σ + D
app )
2
+ T (ε, δ)3 , n ≥ 1,
3
with probability at least 1 − 6δ.

Note that the dependence of the cumulative excess risk of the convergence phase in
terms of δ is O(log(δ −1 )3 ).

6. Experiments
We experiment the static EKF for logistic regression. Precisely, we compare the following
sequential algorithms that we all initialize at 0:

• The static EKF and the truncated version (Algorithm 3). We take the default value
P1 = Id along with the value β = 0.49 suggested by Bercu et al. (2020). Note that a
threshold 10−10 /t0.49 as recommended by Bercu et al. (2020) would coincide with the
static EKF.

• The ONS and the averaged version. The convex region of search is a ball centered
in 0 and of radius Dθ = 1.1kθ? k, a setting where we have good knowledge of θ? .
We consider two choices of the exp-concavity constant on which the ONS crucially
relies to define the gradient step size. First, we use the only available bound e−Dθ DX .
Second, in the settings where the step size is so small that the ONS doesn’t move, we
use the exp-concavity constant κ0 at θ? . This yields a bigger step size, though the
exp-concavity is not satisfied on the region of search.

• Two Averaged Stochastic Gradient Descent as described


√ by Bach (2014). First we
test the choice of the gradient step√size γ = 1/(2DX2 N ) denoted by ASGD and a
second version with γ = kθ? k/(DX N ) denoted by ASGD oracle. Note that these
algorithms are with fixed horizon, thus at each step t, we have to re-run the whole
procedure.

6.1 Synthetic Data


We first consider well-specified data generated by the process of Bercu et al. (2020). The
explanatory variables X = (1, Z > )> are of dimension d = 11 where Z is a random vector√
composed of 10 independent components uniformly generated in [0, 1], thus DX = d.
With this distribution for X we define three synthetic settings that we evaluate:

15
de Vilmarest and Wintenberger

Figure 1: Density of the Bernoulli parameter on 107 samples: on the left and on the middle
?>
density of (1 + e−θ X )−1 for the two well-specified settings (left, the ordinate is
>
in log scale), and on the right density of (1 + e−θj Xt )−1 with j ∈ {1, 2} uniformly
at random for the misspecified setting. On the right we observe the two modes
> >
E[(1 + e−θ1 Xt )−1 ] ≈ 0.28 and E[(1 + e−θ2 Xt )−1 ] ≈ 0.79.

• Well-specified 1: we define θ? = (−9, 0, 3, −9, 4, −9, 15, 0, −7, 1, 0)> , and at each it-
?>
eration t, the variable yt ∈ {−1, 1} is a Bernoulli variable of parameter (1+e−θ Xt )−1 .
• Well-specified 2: in the first well-specified setting the Bernoulli parameter is mostly
distributed around 0 and 1 (see Figure 1), thus we try a less discriminated setting
1
with θ? = 10 (−9, 0, 3, −9, 4, −9, 15, 0, −7, 1, 0)> .
• Misspecified: In order to demonstrate the robustness of the EKF we test the algo-
rithms in a misspecified setting switching randomly between two well-specified logistic
processes. We define θ1 = 101
(−9, 0, 3, −9, 4, −9, 15, 0, −7, 1, 0)> and θ2 where we have
only changed the first coefficient from −9/10 to 15/10. Then yt is a Bernoulli random
> >
variable whose parameter is either (1 + e−θ1 Xt )−1 or (1 + e−θ2 Xt )−1 uniformly at
random. We checked that Assumption 2 is still satisfied.
We evaluate the different algorithms with the mean squared error E[kθ̂t − θ? k2 ] that we
approximate by its empirical version on 100 samples. We display the results in Figure 2.

6.2 Real Data Sets


To illustrate better the robustness to misspecification, we run the same procedures on real
data sets:
• Forest cover-type (Blackard and Dean, 1999): the feature vector is of dimension
d = 54, and as it is a multi-class task (7 classes) we focus on classifying 2 versus
all others. There are n = 581012 instances and we randomly split in two halves for
training and testing.
• Adult income (Kohavi, 1996): the objective is to predict whether a person’s annual
income is smaller or bigger than 50K. There are 14 explanatory variables, and we

16
Stochastic Online Optimization using Kalman Recursion

Figure 2: Mean squared error in log-log scale for the three synthetic settings. For the first
well-specified setting (left) the √ONS is applied using the exp-concavity constant
κ0 ≈ 1.7 · 10−15 instead of e−Dθ d to accelerate the algorithm, and both the ONS
and its averaged

version still don’t move. In the other two (middle and right)
we use e −D θ d for the ONS. We observe that the EKF and the truncated version
coincide in the two last settings.

obtain d = 98 once categorical variables are transformed into binary variables. We use
the canonical split between training (32561 instances) and testing (16281 instances).
For each data set, we standardize X such that each feature ranges from 0 to 1. At each step
we sample within the training set (with replacement). We evaluate through an empirical
version of E[L(θ̂n )] − L(θ? ) estimated on 100 samples and where L is estimated on the test
set, see Figure 3.

6.3 Summary
Our experiments show the superiority of the EKF for logistic regression compared to the
ONS or to averaged SGD in all the settings we tested. We display in Table 2 a few indicators
of the data sets. In particular, it is interesting that the static EKF works well even in a
setting where the Hessian matrix H ? is singular.
It appears clear that low exp-concavity constants are responsible of the poor perfor-
mances of the ONS. One may tune the gradient step size at the cost of losing the exp-
concavity property and thus the regret guarantee of (Hazan et al., 2007) or its analogous
for the cumulative risk (Mahdavi et al., 2015). Averaging is crucial for the ONS, whereas it
is useless for the static EKF. Indeed we chose not to plot the averaged version of the EKF
for clarity, but the EKF performs better than its averaged version.
It is important to note that in the first synthetic setting the truncation deteriorates
the performance of the EKF, as well as in the adult income data set to a lesser extent,
whereas the results are the same in the other settings. Bercu et al. (2020) argue that the
truncation is artificially introduced for the convergence property, thus they use the threshold
10−10 /t0.49 instead of 1/t0.49 and the truncated version almost coincides with the true EKF.
We confirm here that the truncation may be damaging if the threshold is set too high and

17
de Vilmarest and Wintenberger

Figure 3: Excess test risk for forest cover type (left) and adult income (right). As the
ONS doesn’t move when applied with the exp-concavity constant e−Dθ DX we use
instead the exp-concavity constant at θ? : κ0 ≈ 1.4 · 10−3 for forest cover type
and κ0 ≈ 5.5 · 10−6 for adult income. The EKF and the truncated version almost
coincide for both data sets.

Setting d λmax (H ? )/µ tr(G? H ?−1 ) R2 /µ deDθ DX dκ0


Synthetic well-specified 1 11 6.9 · 102 1.7 · 102 1.0 · 105 9.2 · 1037 6.4 · 1015
Synthetic well-specified 2 11 1.5 · 102 7.1 · 101 2.5 · 103 5.4 · 104 3.3 · 102
Synthetic misspecified 3 11 1.5 · 102 7.1 · 101 2.0 · 103 3.6 · 103 7.4 · 101
Forest cover type 54 ∞ ∞ ∞ 9.2 · 1032 3.8 · 104
Adult income 98 2.5 · 107 7.2 · 105 5.3 · 108 1.8 · 1062 1.8 · 107

Table 2: For the different experimental settings we display the dimension d and the condi-
tion number of the Hessian at θ? (λmax (H ? ) and µ are the maximal and minimal
eigenvalues of H ? ). We present the value of tr(G? H ?−1 ) which is bounded either
by R2 /µ, or by deDθ DX because e−Dθ DX bounds the exp-concavity constant on
the centered ball of radius Dθ . We add to the table dκ0 ≤ deDθ DX where κ0 is the
inverse of the exp-concavity constant of the loss at θ? .

18
Stochastic Online Optimization using Kalman Recursion

we recommend to use the EKF in practice, or equivalently the truncated version with the
low threshold suggested by Bercu et al. (2020).

7. Conclusion
We studied an efficient way to tackle some unconstrained optimization problems, in which we
get rid of the projection step of bounded algorithm such as the ONS. We presented a bayesian
approach where we transformed the loss into a negative log-likelihood. We used the Kalman
recursion to provide a parameter free approximation of the maximum-likelihood estimator.
We demonstrated the optimality of the local phase for locally exp-concave losses which can
be expressed as GLM log-likelihoods. We proved the finiteness of the convergence phase
in logistic and quadratic regressions. We illustrated our theoretical results with numerical
experiments for logistic regression. It would be interesting to generalize our results to a
larger class of optimization problems.
Finally, this article aimed at strengthening the bridge between Kalman recursion and
the optimization community. Therefore we made the i.i.d. assumption, standard in the
stochastic optimization literature and we focus on the static EKF. It may lead the way to
a risk analysis of the general EKF in non i.i.d. state-space models.

19
de Vilmarest and Wintenberger

The Appendix follows the structure of the article:


• Appendix A presents the EKF for generalized linear models.
• Appendix B contains the proofs of Section 3. Precisely, Lemma 3 is proved in Section
B.1, the intermediate results of Sections 3.1 and 3.2 are proved in Sections B.2 and
B.3, then Theorem 1 is proved in Section B.4 and Theorem 2 in Section B.5.
• Appendix C contains the proofs of Section 4. We derive the global bound (Theorem
10) in Section C.1, then we obtain the concentration result on Pt in Section C.2, and
finally we prove the convergence of the truncated algorithm in Section C.3.
• Appendix D contains the proofs of Section 5. We prove Theorem 16 in Section D.1
and then in Section D.2 we prove the convergence of the algorithm, and we define an
explicit value of T (ε, δ) satisfying Assumption 5.

Appendix A. Derivation of the Static EKF for Generalized Linear Models


As in Section 10.2 of Durbin and Koopman (2012) we consider the following state-space
model:
yt = Zt (θt ) + εt ,
θt+1 = Tt (θt ) + ηt .
where εt and ηt are independent with mean zero and variances ht (θt ), Qt (θt ). The state-
space version of equation (2) is
 >
yt θt Xt − b(θt> Xt )

p(yt | Xt ) = h(yt ) exp .
a
The preceding equation matches the space equation form with Zt (θt ) = b0 (θt> Xt ) and
ht (θt ) = ab00 (θt> Xt ). Thus we can write the EKF as follows (see Equation 10.4 of Durbin
and Koopman, 2012): denoting by Ṫt the derivative of Tt ,
vt = yt − b0 (θ̂t> Xt ) , Ft = Xt> Pt Xt b00 (θ̂t> Xt )2 + ab00 (θ̂t> Xt ) ,
θ̂t|t = θ̂t + Pt Xt b00 (θ̂t> Xt )Ft−1 vt , Pt|t = Pt − Pt Xt Ft−1 Xt> Pt b00 (θ̂t> Xt )2 ,
>
θ̂t+1 = Tt (θ̂t|t ) , Pt+1 = Ṫt Pt|t Ṫt + Qt (θ̂t|t ) .

We focus on the static setting where the state equation becomes θt+1 = θt , thus we have
θ̂t+1 = θ̂t|t and Pt+1 = Pt|t . We rewrite the update on Pt as follows:
Pt Xt Xt> Pt b00 (θ̂t> Xt )/a
Pt+1 = Pt − .
Xt> Pt Xt b00 (θ̂t> Xt )/a + 1
Moreover we have Pt+1 Xt = Pt Xt Ft−1 ab00 (θ̂t> Xt ) thus we can rewrite the update on θ̂t as
follows:
Pt+1 Xt (yt − b0 (θ̂t> Xt )
θ̂t+1 = θ̂t + .
a
This yields Algorithm 1.

20
Stochastic Online Optimization using Kalman Recursion

Appendix B. Proofs of Section 3


B.1 Proof of Lemma 3
We prove the following Lemma inspired by the stopping time technique of Freedman (1975)
from which we derive Lemma 3. We give a general form useful in several proofs.
Lemma 17 Let (Fn ) be a filtration, and we consider a sequence of events (An ) that is
adapted to (Fn ). Let (Vn ) be a sequence of random variables adapted to (Fn ) satisfying
V0 = 1, Vn ≥ 0 almost surely for any n, and

E[Vn | Fn−1 , An−1 ] ≤ Vn−1 , n ≥ 1.

Then for any δ > 0, it holds


∞ ∞ ! ∞
[  [ !
[
−1
P Vn > δ ∪ An ≤δ+P An .
n=1 n=0 n=0

An important particular case is when (Vn ) is a super-martingale adapted to the filtration


(Fn ) satisfying V0 = 1 and Vn ≥ 0 almost surely: then we have simultaneously Vn ≤ δ −1
for n ≥ 1 with probability larger than 1 − δ.
Proof We define
[k
Vn > δ −1 ∪ An−1 .

Ek =
n=1

As (Ek ) is increasing, we have, for any k ≥ 1,


k
X 
P(Ek ) = P En ∩ En−1
n=1
Xk k
X
P Vn > δ −1 ∩ En−1 ∩ An−1 .
 
= P An−1 ∩ En−1 +
n=1 n=1

First, we have
k k−1
!
X  [
P An−1 ∩ En−1 ≤ P An .
n=1 n=0

Second, we apply Markov’s inequality:


k k  
X
−1
 X Vn
P Vn > δ ∩ En−1 ∩ An−1 ≤ E −1 1En ∩En−1 ∩An−1
δ
n=1 n=1
Xk h i
=δ E Vn (1En−1 ∩An−1 − 1En )
n=1
k  h i
X  
=δ E Vn 1En−1 ∩An−1 − E Vn 1En .
n=1

21
de Vilmarest and Wintenberger


The second line is obtained since En ⊂ En−1 ∩ An−1 . According to the tower property
and the super-martingale assumption,
h i h i
E Vn 1En−1 ∩An−1 = E E[Vn | Fn−1 , An−1 ]1En−1 ∩An−1
h i
≤ E E[Vn | Fn−1 , An−1 ]1En−1
h i
≤ E Vn−1 1En−1 .

Therefore, a telescopic argument along with V0 = 1 and Vk 1Ek ≥ 0 yields

k
X
P Vn > δ −1 ∩ En−1 ∩ An−1 ≤ δ .

n=1

Finally, for any k ≥ 1, we obtain

k−1
!
[
P (Ek ) ≤ P An +δ
n=0

and the desired result follows by letting k → ∞.

Proof of Lemma 3. Let λ > 0. For any n ≥ 1, we define

k+n !
λ2
X 
Vn = exp λ∆Nt − ((∆Nt )2 + E[(∆Nt )2 | Ft−1 ]) .
2
t=k+1

Lemma B.1 of Bercu and Touati (2008) states that (Vn ) is a super-martingale adapted to
the filtration (Fk+n ). Moreover V0 = 1 and for any n, it holds Vn ≥ 0 almost surely. There-
fore we can apply Lemma 17.

B.2 Proof of Lemma 5


(yt −b0 (θ̂t> Xt ))Xt
Proof of Lemma 5. We start from the update formula θ̂t+1 = θ̂t + Pt+1 a
yielding

−1 −1 (yt − b0 (θ̂t> Xt ))Xt>


(θ̂t+1 − θ? )> Pt+1 (θ̂t+1 − θ? ) = (θ̂t − θ? )> Pt+1 (θ̂t − θ? ) + 2 (θ̂t − θ? )
a
!2
yt − b0 (θ̂t> Xt )
+ Xt> Pt+1 Xt .
a

22
Stochastic Online Optimization using Kalman Recursion

With a summation argument, re-arranging terms, we obtain:


n
!
X (b0 (θ̂t> Xt ) − yt )Xt> 1 −1
(θ̂t − θ? ) − (θ̂t − θ? )> (Pt+1 − Pt−1 )(θ̂t − θ? )
a 2
t=1
n
!2
1X > yt − b0 (θ̂t> Xt )
= Xt Pt+1 Xt
2 a
t=1
n
1 X 
+ (θ̂t − θ? )> Pt−1 (θ̂t − θ? ) − (θ̂t+1 − θ? )> Pt+1
−1
(θ̂t+1 − θ? ) .
2
t=1

−1
We bound the telescopic sum: as Pn+1 < 0, we have
n 
X 
(θ̂t − θ? )> Pt−1 (θ̂t − θ? ) − (θ̂t+1 − θ? )> Pt+1
−1
(θ̂t+1 − θ? )
t=1
kθ̂1 − θ? k2
≤ (θ̂1 − θ? )> P1−1 (θ̂1 − θ? ) ≤ .
λmin (P1 )
The result follows from the identities

(b0 (θ̂t> Xt ) − yt )Xt −1


= `0 (yt , θ̂t> Xt )Xt , Pt+1 − Pt−1 = `00 (yt , θ̂t> Xt )Xt Xt> .
a

B.3 Proofs of Section 3.2


Proof of Proposition 6. The first-order condition satisfied by θ? is

(y − b0 (θ?> X))X
 
E − = 0,
a

yielding E [yX] = E[b0 (θ?> X)X]. Therefore


>
(b0 (θ> X) − y)X

1 h i
E (θ − θ? )> E X(b0 (θ> X) − b0 (θ?> X)) .
(θ − θ? ) =
a a

Considering the function f : λ → (θ − θ? )> E Xb0 (θ> X + λ(θ − θ? )> X)) , we know there
 

exists λ ∈ [0, 1] such that f 0 (λ) = f (1) − f (0). This translates into
∂L > 1 h   i
(θ − θ? ) = (θ − θ? )> E Xb00 θ> X + λ(θ? − θ)> X (θ − θ? )> X .
∂θ θ a
Then we use Assumption 3:

b00 θ> X + λ(θ? − θ)> X `00 yt , θ> X + λ(θ? − θ)> X


 
= ≥ ρkθ−θ? k ,
b00 (θ> X) `00 (yt , θ> X)

23
de Vilmarest and Wintenberger

yielding
∂L > h i
(θ − θ? ) ≥ ρkθ−θ? k (θ − θ? )> E `00 (y, θ> X)XX > (θ − θ? )
∂θ θ
∂2L
= ρkθ−θ? k (θ − θ? )> (θ − θ? ) .
∂θ2 θ

>
∂L
Proof of Proposition 7. We first recall that L(θ)−L(θ? ) ≤ ∂θ (θ−θ? ), then Proposition
θ
6 yields
∂L > ∂2L c ∂L >
(θ − θ? ) − c(θ − θ? )> 2
(θ − θ? ) ≥ (1 − ) (θ − θ? ) ,
∂θ θ ∂θ θ ρkθ−θ? k ∂θ θ

and the result follows.

Proof of Lemma 8. We first develop (∆Mt )2 :


 2
(∆Mt )2 = (E [∇t | Ft−1 ] − ∇t )> (θ̂t − θ? )

= (θ̂t − θ ) E[∇t | Ft−1 ]E[∇t | Ft−1 ]> + ∇t ∇>
? >
t

− ∇t E[∇t | Ft−1 ]> − E[∇t | Ft−1 ]∇> ?
t (θ̂t − θ )
 
≤ 2(θ̂t − θ? )> E[∇t | Ft−1 ]E[∇t | Ft−1 ]> + ∇t ∇> ?
t (θ̂t − θ )
 
≤ 2(θ̂t − θ? )> E[∇t ∇>t | F t−1 ] + ∇ > ?
t t (θ̂t − θ ) .

d > > > >
h because if U, V ∈ R , it holds −U V − V U i4 U U + V V . The last
The third line holds
one comes from E (∇t − E[∇t | Ft−1 ])(∇t − E[∇t | Ft−1 ])> | Ft−1 < 0.
Also, we have the relation
E[(∆Mt )2 | Ft−1 ] ≤ (θ̂t − θ? )> E[∇t ∇> ?
t | Ft−1 ](θ̂t − θ ) .

It yields
 
(∆Mt )2 + E[(∆Mt )2 | Ft−1 ] ≤ (θ̂t − θ? )> 3E[∇t ∇>
t | Ft−1 ] + 2∇ ∇
t t
>
(θ̂t − θ? ) ,

and the result follows from Lemma 3.

We derive the following Lemma in order to control the right-hand side of Lemma 5, in
both settings.
Lemma 18 Assume the second point of Assumption 3 holds. For any k, n ≥ 1, if kθ̂t −
θ? k2 ≤ ε for any k < t ≤ k + n then we have
k+n  2 
X
−1 −1
 hε λmax (Pk+1 )DX
Tr Pt+1 (Pt+1 − Pt ) ≤ d ln 1 + n .
d
t=k+1

24
Stochastic Online Optimization using Kalman Recursion

Proof We apply Lemma 11.11 of Cesa-Bianchi and Lugosi (2006):

k+n k+n
!
X
−1
X det(Pt−1 )
− Pt−1 ) =

Tr Pt+1 (Pt+1 1− −1
t=k+1 t=k+1
det(Pt+1 )
k+n
!
−1
X det(Pt+1 )
≤ ln −1
t=k+1
det(Pt )
−1
!
det(Pk+n+1 )
= ln −1
det(Pk+1 )
k+n
!
1/2 1/2
X
≤ ln det I + `00 (yt , θ̂t> Xt )(Pk+1 Xt )(Pk+1 Xt )>
t=k+1
d
X
= ln(1 + λi ) ,
i=1

k+n
1/2 1/2
`00 (yt , θ̂t> Xt )(Pk+1 Xt )(Pk+1 Xt )> . Therefore we
P
where λ1 , ..., λd are the eigenvalues of
t=k+1
have
k+n d
!
X
−1 1X
Pt−1 )

Tr Pt+1 (Pt+1 − ≤ d ln 1 + λi
d
t=k+1 i=1
 
1 2
≤ d ln 1 + nhε λmax (Pk+1 )DX .
d

B.4 Bounded Setting (Assumption 3)


Proof of Theorem 1. Let δ > 0. On the one hand, we sum Lemma 5 and 8. We obtain,
for any λ > 0,

T (ε,δ)+n
X  1
E[∇t | Ft−1 ]> (θ̂t − θ? ) − Qt
2
t=T (ε,δ)+1
 3   
− λ(θ̂t − θ? )> ∇t ∇> t + E ∇ ∇
t t
>
| F t−1 (θ̂ t − θ ?
)
2
T (ε,δ)+n
1 X kθ̂1 − θ? k2 ln δ −1
≤ Xt> Pt+1 Xt `0 (yt , θ̂t> Xt )2 + + , n ≥ 1, (3)
2 λmin (PT (ε,δ)+1 ) λ
t=T (ε,δ)+1

 
with probability at least 1−δ, where we define Qt = (θ̂t −θ? )> `00 (yt , θ̂t> Xt )Xt Xt> (θ̂t −θ? )
for any t.

25
de Vilmarest and Wintenberger

On the other hand, thanks to Assumption 3, we can apply Proposition 7 with c = 0.75
to obtain, for any t ≥ 1,

kθ̂t − θ? k ≤ ε
ρε  ∂L > ∂2L 
=⇒ L(θ̂t ) − L(θ? ) ≤ (θ̂t − θ? ) − 0.75(θ̂t − θ? )> 2 (θ̂t − θ? ) ,
ρε − 0.75 ∂θ θ̂t ∂θ θ̂t
 
=⇒ L(θ̂t ) − L(θ? ) ≤ 5 E[∇t | Ft−1 ]> (θ̂t − θ? ) − 0.75E[Qt | Ft−1 ] , (4)

because ρε > 0.95.


In order to bridge the gap between Equations (3) and (4), we need to control the
quadratic terms of Equation (3) with E[Qt | Ft−1 ]. First, for any t, if kθ̂t − θ? k ≤ ε, we
have Qt ∈ [0, hε ε2 DX
2 ], and we apply Lemma A.3 of Cesa-Bianchi and Lugosi (2006) to the

random variable h ε21D2 Qt ∈ [0, 1]: for any s > 0,


ε X

es − 1
   
s ?
E exp 2 Qt − 2 E [Qt | Ft−1 ] | Ft−1 , k θ̂ t − θ k ≤ ε ≤ 1.
hε ε2 DX hε ε2 DX

We fix s = 0.1 and we define


 
T (ε,δ)+n   
X 0.1 0.1 1
Vn = exp  2 Qt − (e − 1)E 2 Qt | Ft−1
.
hε ε2 DX hε ε2 DX
t=T (ε,δ)+1

The sequence (Vn ) is adapted to (FT (ε,δ)+n ), almost surely we have V0 = 1 and Vn ≥ 0.
Finally,
h i
E Vn | FT (ε,δ)+n−1 , kθ̂T (ε,δ)+n − θ? k ≤ ε ≤ Vn−1 ,

and (kθ̂T (ε,δ)+n − θ? k ≤ ε) belongs to FT (ε,δ)+n−1 . We apply Lemma 17:


∞ ∞ ! ∞
[  [ !
[
P Vn > δ −1 ∪ (kθ̂T (ε,δ)+n − θ? k > ε) ≤ δ+P (kθ̂T (ε,δ)+n − θ? k > ε) .
n=1 n=1 n=1


We define Aεk = (kθ̂n − θ? k ≤ ε) for any k. The last inequality is equivalent to
T
n=k+1

∞ 
[ T (ε,δ)+n T (ε,δ)+n  
X X
0.1 2 2 −1
P Qt > 10(e − 1) E [Qt | Ft−1 ] + 10hε ε DX ln δ ∩ AεT (ε,δ)
n=1 t=T (ε,δ)+1 t=T (ε,δ)+1

≤ δ. (5)

We then bound the two quadratic terms coming from Lemma 8: using Assumption 3
we have the implications

kθ̂t − θ? k ≤ ε =⇒ (θ̂t − θ? )> ∇t ∇> ?


t (θ̂t − θ ) ≤ κε Qt ,
kθ̂t − θ? k ≤ ε =⇒ (θ̂t − θ? )> E ∇t ∇> ?
 
t | Ft−1 (θ̂t − θ ) ≤ κε E [Qt | Ft−1 ] .

26
Stochastic Online Optimization using Kalman Recursion

Therefore, we get from (5)

∞  T (ε,δ)+n 1
[ X
P Qt + λ(θ̂t − θ? )> ∇t ∇> ?
t (θ̂t − θ )
2
n=1 t=T (ε,δ)+1
3 
+ λ(θ̂t − θ? )> E ∇t ∇> ?
 
t | F t−1 (θ̂ t − θ ) >
2
 1 3  T (ε,δ)+n
X
0.1
10(e − 1)( + λκε ) + λκε E [Qt | Ft−1 ]
2 2
t=T (ε,δ)+1
 !
1
+ 10( + λκε )hε ε2 DX2
ln δ −1 ∩ AεT (ε,δ) ≤ δ .
2

0.75−5(e0.1 −1)
We set λ = (10(e0.1 −1)+ 23 )κε
, so that

1 3
10(e0.1 − 1)( + λκε ) + λκε = 0.75 ,
2 2
1 1 0.75 − 5(e0.1 − 1)
+ λκε = + ≈ 0.59 ≤ 0.6 ,
2 2 10(e0.1 − 1) + 32

and consequently

∞ T (ε,δ)+n  
[ X
P E[∇t | Ft−1 ]> (θ̂t − θ? ) − 0.75E[Qt | Ft−1 ] > 6hε ε2 DX
2
ln δ −1
n=1 t=T (ε,δ)+1
T (ε,δ)+n 
X 1
+ E[∇t | Ft−1 ]> (θ̂t − θ? ) − Qt
2
t=T (ε,δ)+1
! !
 3  
− λ(θ̂t − θ? )> ∇t ∇> > ?
t + E ∇t ∇t | Ft−1 (θ̂t − θ ) ∩ AεT (ε,δ) ≤ δ.
2

We plug Equation (4) in the last inequality:

∞ T (ε,δ)+n
[ X
P (L(θ̂t ) − L(θ? )) > 30hε ε2 DX
2
ln δ −1
n=1 t=T (ε,δ)+1
T (ε,δ)+n 
X 1
+5 E[∇t | Ft−1 ]> (θ̂t − θ? ) − Qt
2
t=T (ε,δ)+1
  ! !
3
− λ(θ̂t − θ? )> ∇t ∇> >
(θ̂t − θ? ) ∩ AεT (ε,δ) ≤ δ .
 
t + E ∇t ∇t | Ft−1
2

27
de Vilmarest and Wintenberger

1 (10(e0.1 −1)+ 32 )κε


We then use Equation (3) with λ = 0.75−5(e0.1 −1)
≈ 11.4κε ≤ 12κε . It yields

∞ T (ε,δ)+n T (ε,δ)+n
[ X 5 X
P ?
(L(θ̂t ) − L(θ )) > Xt> Pt+1 Xt `0 (yt , θ̂t> Xt )2
2
n=1 t=T (ε,δ)+1 t=T (ε,δ)+1
! !
5kθ̂1 − θ? k2
+ + 30(2κε + hε ε2 DX
2
) ln δ −1 ∩ AεT (ε,δ) ≤ 2δ .
λmin (PT (ε,δ)+1 )

Thanks to Assumption 3, we have


−1
Xt> Pt+1 Xt `0 (yt , θ̂t> Xt )2 ≤ κε Tr Pt+1 (Pt+1 − Pt−1 ) ,

t > T (ε, δ) ,

therefore we apply Lemma 18: for any n, it holds

T (ε,δ)+n
!
X 2
hε λmax (PT (ε,δ)+1 )DX
Xt> Pt+1 Xt `0 (yt , θ̂t> Xt )2 ≤ dκε ln 1 + n .
d
t=T (ε,δ)+1

As PT (ε,δ)+1 4 P1 , we obtain

∞ T (ε,δ)+n  2 
[ X
? 5 hε λmax (P1 )DX
P (L(θ̂t ) − L(θ )) > dκε ln 1 + n
2 d
n=1 t=T (ε,δ)+1
! !
5kθ̂1 − θ? k2 2 2 −1 ε
+ + 30(2κε + hε ε DX ) ln δ ∩ AT (ε,δ) ≤ 2δ .
λmin (PT (ε,δ)+1 )

To conclude, we use Assumption 5.

B.5 Quadratic Setting (Assumption 4)


We recall two definitions introduced in the previous subsection:

\
Aεk = (kθ̂n − θ? k ≤ ε), k ≥ 1,
n=k+1

Qt = (θ̂t − θ ) Xt Xt> (θ̂t − θ? ),


? >
t ≥ 1.

The sub-gaussian hypothesis requires a different treatment of several steps in the proof. In
the following proofs, we use a consequence of the first points of Assumption 4. We apply
Lemma 1.4 of Rigollet and Hütter (2015): for any X ∈ Rd ,

E[(y − E[y | X])2i | X] ≤ 2i(2σ 2 )i Γ(i) = 2(2σ 2 )i i!, i ∈ N? . (6)

First, we control the quadratic terms in ∇t = −(yt − θ̂t> Xt )Xt in the following lemma.

28
Stochastic Online Optimization using Kalman Recursion

Lemma 19 1. For any k ∈ N and δ > 0, we have



[ k+n
X
P (θ̂t − θ? )> ∇t ∇> ?
t (θ̂t − θ )
n=1 t=k+1

 k+n
! !
X
> 3 8σ + 2 2
Dapp +ε 2 2
DX Qt + 12ε 2
DX σ ln δ −1
2 2
∩ Aεk ≤ δ.
t=k+1

2. For any t, it holds almost surely


 
(θ̂t − θ? )> E[∇t ∇>
t | Ft−1 ](θ̂ t − θ ?
) ≤ 3 σ 2
+ D 2
app + kθ̂ t − θ ? 2 2
k D X E[Qt | Ft−1 ] .

Proof

1. We recall that for any a, b, c, we have (a + b + c)2 ≤ 3(a2 + b2 + c2 ). Thus

(θ̂t − θ? )> ∇t ∇> ? >


t (θ̂t − θ ) = Qt (yt − θ̂t Xt )
2
 
≤ 3Qt (yt − E[yt | Xt ])2 + (E[yt | Xt ] − θ?> Xt )2 + ((θ? − θ̂t )> Xt )2
 
≤ 3Qt (yt − E[yt | Xt ])2 + Dapp 2
+ kθ̂t − θ? k2 DX
2
. (7)

To obtain the last inequality, we use the second point of Assumption 4 to bound the
middle term. Then we use Taylor series for the exponential, and we apply Equation
(6). For any t and any µ satisfying 0 < µ ≤ 4Q1t σ2 , we have

X µi Qit E[(yt − E yt | Xt ])2i | Xt


 
E exp µQt (yt − E[yt | Xt ])2 | Ft−1 , Xt = 1 +
  
i!
i≥1
X µi Qi i!(2σ 2 )i
t
≤1+2
i!
i≥1
X i
≤1+2 2µQt σ 2
i≥1
1
≤ 1 + 8µQt σ 2 , 2µQt σ 2 ≤
2
≤ exp 8µQt σ 2 .


Therefore, for any t,


   
1 2 2
 ?
E exp
4ε2 DX2 σ 2 Qt (yt − E[yt | Xt ]) − 8σ | Ft−1 , Xt , kθ̂t − θ k ≤ ε ≤ 1 .

We define the random variable


k+n
!
1 X
Qt (yt − E[yt | Xt ])2 − 8σ 2 ,

Vn = exp 2 σ2 n ∈ N.
4ε2 DX
t=k+1

29
de Vilmarest and Wintenberger

(Vn )n is adapted to the filtration (σ(X1 , y1 , ..., Xk+n , yk+n , Xk+n+1 )n , moreover V0 = 1
and Vn ≥ 0 almost surely, and
h i
E Vn | X1 , y1 , ..., Xk+n−1 , yk+n−1 , Xk+n , kθ̂k+n − θ? k ≤ ε ≤ Vn−1 .

Therefore we apply Lemma 17: for any δ > 0,



!
[
−1
P (Vn > δ )∩ Aεk ≤ δ,
n=1

which is equivalent to
∞ k+n k+n
! !
[ X X
P Qt (yt − E[yt | Xt ])2 > 8σ 2 Qt + 4ε2 DX σ ln δ −1
2 2
∩ Aεk ≤ δ.
n=1 t=k+1 t=k+1

Substituting in Equation (7), we obtain the desired result.

2. We apply the same decomposition as for Equation (7): for any t,

(θ̂t − θ? )> E[∇t ∇> ?


t | Ft−1 ](θ̂t − θ )
   
? > > 2 2 ? 2 2
≤ 3(θ̂t − θ ) E Xt Xt (yt − E[yt | Xt ]) + Dapp + kθ − θ̂t k DX | Ft−1 (θ̂t − θ? ) .

Assumption 4 implies that for any Xt , E[(yt − E[yt | Xt ])2 | Xt ] ≤ σ 2 . Thus, the tower
property yields

(θ̂t − θ? )> E[∇t ∇> ?


t | Ft−1 ](θ̂t − θ )
 
≤ 3 σ 2 + Dapp2
+ kθ̂t − θ? k2 DX
2
(θ̂t − θ? )> E[Xt Xt> | Ft−1 ](θ̂t − θ? ) .

Second, we bound the right-hand side of Lemma 5, that is the objective of the following
lemma.

Lemma 20 Let k ∈ N. For any δ > 0, we have


[ k+n
X
P Xt> Pt+1 Xt (yt − θ̂t> Xt )2 > 12λmax (P1 )DX σ ln δ −1
2 2

n=1 t=k+1
! !
 2 
λmax (Pk+1 )DX
2 2 2 2 ε

+ 3 8σ + Dapp +ε DX d ln 1 + n ∩ Ak ≤ δ .
d

Proof We apply a similar analysis as in the proof of Lemma 19 in order to use the sub-
gaussian assumption, and then we apply the telescopic argument as in the bounded setting.

30
Stochastic Online Optimization using Kalman Recursion

We decompose yt − θ̂t> Xt :

Xt> Pt+1 Xt (yt − θ̂t> Xt )2


 
≤ 3Xt> Pt+1 Xt (yt − E[yt | Xt ])2 + (E[yt | Xt ] − b0 (θ?> Xt ))2 + ((θ? − θ̂t )> Xt )2
 
≤ 3Xt> Pt+1 Xt (yt − E[yt | Xt ])2 + Dapp
2
+ kθ̂t − θ? k2 DX 2
. (8)

To control (yt − E[yt | Xt ])2 Xt> Pt+1 Xt , we use its positivity along with Equation (6).
Precisely, for any t and any µ > 0 satisfying 0 < µ ≤ 4X > P 1 X σ2 , we have
t t+1 t

h  i 
E exp µ(yt − E[yt | Xt ])2 Xt> Pt+1 Xt | Ft−1 , Xt
X µi (Xt> Pt+1 Xt )i E (yt − E[yt | Xt ])2i | Xt
 
=1+
i!
i≥1
X µi (X > Pt+1 Xt )i i!(2σ 2 )i
t
≤1+2
i!
i≥1
X i
=1+2 2µXt> Pt+1 Xt σ 2
i≥1
1
≤ 1 + 8µXt> Pt+1 Xt σ 2 , 0 < 2µXt> Pt+1 Xt σ 2 ≤
  2
> 2
≤ exp 8µXt Pt+1 Xt σ .

1
We apply the previous bound with a uniform µ = 2 σ2 .
4λmax (P1 )DX
As λmax (Pt+1 ) ≤ λmax (P1 )
1
for any t, we get µ ≤ 4Xt> Pt+1 Xt σ 2
. Thus, we define

k+n
!
1 X
(yt − E[yt | Xt ])2 − 8σ 2
Xt> Pt+1 Xt

Vn = exp 2 σ2 , n ∈ N.
4λmax (P1 )DX
t=k+1

(Vn ) is a super-martingale adapted to the filtration (σ(X1 , y1 , ..., Xk+n−1 , yk+n−1 , Xk+n ))n
satisfying almost surely V0 = 1, Vn ≥ 0, thus we apply Lemma 17:

!
[
P (Vn > δ −1 ) ≤ δ ,
n=1

or equivalently


[ k+n
X
P Xt> Pt+1 Xt (yt − E[yt | Xt ])2
n=1 t=k+1
k+n
!!
X
> 8σ 2
Xt> Pt+1 Xt + 4λmax (P1 )DX σ ln δ −1
2 2
≤ δ.
t=k+1

31
de Vilmarest and Wintenberger

Combining it with Equation (8), we get


∞ k+n
[ X  k+n
X
P Xt> Pt+1 Xt (yt − θ̂t> Xt )2 > 3 8σ + 2 2
Dapp +ε 2 2
DX Xt> Pt+1 Xt
n=1 t=k+1 t=k+1
! !
+ 12λmax (P1 )DX σ ln δ −1
2 2
∩ Aεk ≤ δ.

Then we apply Lemma 18: the second point of Assumption 3 holds with hε = 1, thus
k+n  2 
X
−1 −1
 λmax (Pk+1 )DX
Tr Pt+1 (Pt+1 − Pt ) ≤ d ln 1 + n , n ≥ 1.
d
t=k+1
−1
We conclude with Xt> Pt+1 Xt = Tr(Pt+1 (Pt+1 − Pt−1 )).

We sum up our findings and we prove the result for the quadratic loss. The structure
of the proof is the same as the one of Theorem 1.
Proof of Theorem 2. On the one hand, we sum Lemma 5 and Lemma 8: for any λ, δ > 0
T (ε,δ)+n  
X
> ? 1 ? >

> 3
 >
 ?
E[∇t | Ft−1 ] (θ̂t −θ )− Qt −λ(θ̂t −θ ) ∇t ∇t + E ∇t ∇t | Ft−1 (θ̂t −θ )
2 2
t=T (ε,δ)+1
T (ε,δ)+n
1 X kθ̂T (ε,δ)+1 − θ? k2 ln δ −1
≤ Xt> Pt+1 Xt (yt − θ̂t> Xt )2 + + , n ≥ 1 , (9)
2 λmin (PT (ε,δ)+1 ) λ
t=T (ε,δ)+1

with probability at least 1 − δ. On the other hand, we have


T (ε,δ)+n T (ε,δ)+n
X 1 X  
?
(L(θ̂t ) − L(θ )) ≤ E[∇t | Ft−1 ]> (θ̂t − θ? ) − 0.8E[Qt | Ft−1 ] .
1 − 0.8
t=T (ε,δ)+1 t=T (ε,δ)+1
(10)
We aim to relate Equations (9) and (10) as in the proof of Theorem 1. To that end, we
apply Lemma 19:
∞ T (ε,δ)+n    
[ X 1 ? > > 3  >
 ?
P Qt + λ(θ̂t − θ ) ∇t ∇t + E ∇t ∇t | Ft−1 (θ̂t − θ )
2 2
n=1 t=T (ε,δ)+1

1  T (ε,δ)+n
X
> + 3λ(8σ 2 + Dapp
2
+ ε2 D X
2
) Qt
2
t=T (ε,δ)+1
T (ε,δ)+n
!
9   X
+ λ σ 2 + Dapp
2
+ ε2 D X
2
E [Qt | Ft−1 ] + 12λε2 DX σ ln δ −1
2 2
2
t=T (ε,δ)+1
!
∩ AεT (ε,δ) ≤ δ.

32
Stochastic Online Optimization using Kalman Recursion

As in the proof of Theorem 1 we apply Lemma A.3 of (Cesa-Bianchi and Lugosi, 2006)
and Lemma 17: for any δ > 0,

∞  T (ε,δ)+n T (ε,δ)+n 
[ X X
0.1 2 2 −1
P Qt > 10(e − 1) E[Qt | Ft−1 ] + 10ε DX ln δ
n=1 t=T (ε,δ)+1 t=T (ε,δ)+1
!
∩ AεT (ε,δ) ≤ δ.

We combine the last two inequalities:

∞ T (ε,δ)+n    
[ X1 ? > > 3  >
 ?
P Qt + λ(θ̂t − θ ) ∇t ∇t + E ∇t ∇t | Ft−1 (θ̂t − θ )
2 2
n=1 t=T (ε,δ)+1
 1  9 
0.1 2 2 2 2 2 2 2 2
> 10(e − 1) + 3λ(8σ + Dapp + ε DX ) + λ(σ + Dapp + ε DX )
2 2
T (ε,δ)+n
X
E [Qt | Ft−1 ]
t=T (ε,δ)+1
  !
2 2 1
 
2 2 2 2 2 2 2 −1
+ 10ε DX + 3λ(8σ + Dapp + ε DX ) + 12λε DX σ ln δ
2
!
∩ AεT (ε,δ) ≤ 2δ . (11)

We set

 −1
0.1
 0.1 2 2 2 2 9 2 2 2 2
λ = 0.8 − 5(e − 1) 30(e − 1)(8σ + Dapp + ε DX ) + (σ + Dapp + ε DX )
2

in order to obtain

1  9
10(e0.1 − 1) + 3λ(8σ 2 + Dapp2
+ ε2 DX2
) + λ(σ 2 + Dapp
2
+ ε2 D X
2
) = 0.8 ,
2 2
1 1
2 2 2 2 <λ< 2 2 2 ,
109σ + 28Dapp + 28ε DX 108σ + 27Dapp + 27ε2 DX
2 1
 
10ε2 DX + 3λ(8σ 2 + Dapp
2
+ ε 2 DX
2
) + 12λDX2 2 2
ε σ ≤ 8ε2 DX2
2
1
≤ 28(4σ 2 + Dapp
2
+ ε2 D X2
).
λ

33
de Vilmarest and Wintenberger

Combining Equations (9), (10) and (11), we obtain


∞ T (ε,δ)+n
[ X
P 0.2 (L(θ̂t ) − L(θ? ))
n=1 t=T (ε,δ)+1
T (ε,δ)+n
1 X ε2
> Xt> Pt+1 Xt (yt − θ̂t> Xt )2 +
2 λmin (PT (ε,δ)+1 )
t=T (ε,δ)+1
! !
+ 28(4σ + 2 2
Dapprox +ε 2 2
DX ) ln δ −1 + 8ε 2 2
DX ln δ −1
∩ AεT (ε,δ) ≤ 3δ .

Finally, we apply Lemma 20 with PT (ε,δ)+1 4 P1 and we use Assumption 5: it holds


simultaneously
T (ε,δ)+n  2 
X
? 3  λmax (P1 )DX
8σ 2 + Dapp
2
+ ε2 D X
2

L(θ̂t ) − L(θ ) ≤ 5 d ln 1 + n
2 d
t=T (ε,δ)+1
 
+ λmax PT−1 2 2 2 2 2
(ε,δ)+1 ε + 28(4σ + Dapprox + ε DX ) ln δ
−1


2 2 −1 2 2 −1
+ 8ε DX ln δ + 6λmax (P1 )DX σ ln δ , n ≥ 1,

with probability at least 1 − 5δ. To conclude, we write


28(4σ 2 + Dapprox
2
+ ε2 D X
2
) + 8ε2 DX2
+ 6λmax (P1 )DX2 2
σ

λmax (P1 )DX2 
≤ 28 σ 2 (4 + 2
) + Dapp + 2ε2 DX2
.
4

Appendix C. Proofs of Section 4


C.1 Proof of Theorem 10
? k+ε) 1
Proof of Theorem 10. We check Assumption 3 with κε = eDX (kθ , hε = 4 and
ρε = e−εDX > 0.95. We can thus apply Theorem 1 with
T (ε,δ)
1 X
λmax (PT−1
(ε,δ)+1 ) ≤ λmax (P1−1 ) + kXt k2 ,
4
t=1
5κε ?
 ε2 D X2 
?
< 3eDX kθ k , 30 2κε + < 64eDX kθ k , 5ε2 DX
2
≤ 1/75 .
2 4
We then control the first terms. To that end, we use a rough bound at any time t ≥ 1:
 >
? yX
L(θ̂t ) − L(θ ) ≤ E >X
| θ̂t (θ̂t − θ? )
1+e y θ̂ t

≤ DX kθ̂t − θ? k
≤ DX (kθ̂1 − θ? k + (t − 1)λmax (P1 )DX ) ,

34
Stochastic Online Optimization using Kalman Recursion

because for any s ≥ 1, we have Ps 4 P1 and therefore kθ̂s+1 − θ̂s k ≤ λmax (P1 )DX . Summing
1
from 1 to T ( 20D X
, δ) yields the result.

C.2 Concentration of Pt

We prove a concentration result based on Tropp (2012), which will be used on the inverse
of Pt .

Lemma 21 If Assumption 1 is satisfied, then for any 0 ≤ β < 1 and t ≥ 41/(1−β) , it holds

t−1
! !
Xs X > Λmin t1−β 2
 
1−β Λmin
X
s
P λmin < ≤ d exp −t 4 .
sβ 4(1 − β) 10DX
s=1

Proof We wish to center the matrices Xs Xs> by subtracting their (common) expected
value. We use that if A and B are symmetric, λmin (A − B) ≤ λmin (A) − λmin (B). Indeed,
denoting by v any eigenvector of A associated with its smallest eigenvalue,

x> (A − B)x
λmin (A − B) = min
x kxk2
v > (A − B)v

kvk2
v > Bv
= λmin (A) −
kvk2
x> Bx
≤ λmin (A) − min
x kxk2
= λmin (A) − λmin (B) .

We obtain:

t−1 t−1 ! t−1 t−1 


! !
Xs X > Xs Xs> Xs X > Xs Xs>
X X  X X
s s
λmin − E ≤ λmin − λmin E
sβ sβ sβ sβ
s=1 s=1 s=1 s=1
t−1 t−1
!
X Xs Xs> X 1
= λmin − Λmin
sβ sβ
s=1 s=1
t−1
!
X Xs Xs> t1−β − 1
≤ λmin − Λmin .
sβ 1−β
s=1

35
de Vilmarest and Wintenberger

Therefore, we obtain
t−1
! !
X Xs X >sΛmin (t1−β − 2)
P λmin <
sβ 2(1 − β)
s=1
t−1  ! !
Xs Xs> Xs Xs> Λmin (t1−β − 2) t1−β − 1
X 
≤ P λmin −E < − Λmin
sβ sβ 2(1 − β) 1−β
s=1
t−1  
! !
Xs Xs> Xs Xs> Λmin t1−β
X  
= P λmax E β
− β
> .
s s 2(1 − β)
s=1

We check the assumptions of Theorem 1.4 of Tropp (2012):


> >
h i
• Obviously E XssX
β
s
− XssX
β
s
is centered,
> Xs Xs> >
 h i   h i
• λmax E XssX
β
s
− sβ
≤ λmax E XssX
β
s 2 almost surely.
≤ DX
 h 2   2 
> 4 4
Xs Xs> Xs Xs>
i
DX DX
As 0 4 E E XssX
β
s
− sβ
4E sβ
4 s2β
I 4 sβ
I, we get

t−1
"  2 # t−1
!
Xs Xs> Xs Xs> D4 1−β
  
4 t
X X
X
04 E E − 4 I4 DX I.
sβ sβ sβ 1−β
s=1 s=1

Therefore we can apply Theorem 1.4 of Tropp (2012):


t−1   ! !
Xs Xs> Xs Xs> Λmin t1−β
X 
P λmax E − >
sβ sβ 2(1 − β)
s=1
!
Λ2min t2(1−β) /(8(1 − β)2 )
≤ d exp − 4 1−β 2 Λ 1−β /(6(1 − β))
DX t /(1 − β) + DX min t

Λ2 1/(1 − β)2
 
= d exp −t1−β min4 1/(1 − β) + Λ 2
8DX min /(6DX (1 − β))
2
 −1 !
Λ Λ min (1 − β)
= d exp −t1−β min4 1−β+ 2 .
8DX 6DX

2 ≤ 1 and β ≥ 0, we obtain 8(1 − β + Λmin (1−β) ) ≤ 8(1 + 1/6) = 28/3 ≤ 10,


Using Λmin /DX 6DX2

therefore
t−1
! !
Xs Xs> Λmin (t1−β − 2) 2
 
1−β Λmin
X
P λmin < ≤ d exp −t 4 .
sβ 2(1 − β) 10DX
s=1

The result follows from 12 t1−β − 2 > 0 for t ≥ 41/(1−β) .

We can now do a union bound to obtain Proposition 11.

36
Stochastic Online Optimization using Kalman Recursion

Proof of Proposition 11. We first move our problem to the setting of Lemma 21:
t−1
!−1
X
λmax (Pt ) = λmin P1−1 + Xs Xs> αs
s=1
t−1
!−1
X Xs X >
≤ λmin P1−1 + s
,

s=1

because αs ≥ 1/sβ . Therefore, for t ≥ 8 ≥ 41/(1−β) ,



t−1
!−1 
>
 
4 X Xs Xs 4
P λmax (Pt ) > 1−β
≤ P λmin P1−1 + β
> 
Λmin t s Λmin t1−β
s=1
t−1
! !
X X > Λ t1−β
s s min
X
= P λmin P1−1 + <
sβ 4
s=1
t−1
! !
X Xs Xs> Λmin t1−β
≤ P λmin <
sβ 4
s=1
Λ2
 
≤ d exp −t1−β min4 ,
10DX
where we applied Lemma 21 to obtain the last line. We take a union bound to obtain, for
any k ≥ 7,
2
  X  
4 1−β Λmin
P ∃t > k, λmax (Pt ) > ≤ d exp −t 4
Λmin t1−β 10DX
t>k
2
 
1−β Λmin
X
≤d exp −bt c 4
10DX
t>k
Λ2min X
X  
=d exp −m 4 1bt1−β c=m
m≥1
10DX
t>k
P
We bound 1btc=m : for any m
t>k

bt1−β c = m =⇒ m1/(1−β) ≤ t < (m + 1)1/(1−β) ,


then using ex ≤ 1 + 2x for any 0 ≤ x ≤ 1, we have
(m + 1)1/(1−β) = m1/(1−β) (1 + 1/m)1/(1−β)
= m1/(1−β) exp(ln(1 + 1/m)/(1 − β))
≤ m1/(1−β) exp(1/(m(1 − β)))
≤ m1/(1−β) (1 + 2/(m(1 − β))) ,
as long as m ≥ 2 ≥ 1/(1 − β). Therefore
(m + 1)1/(1−β) − m1/(1−β) + 1 ≤ 2m1/(1−β)−1 /(1 − β) + 1 ≤ 4m + 1 ≤ 4(m + 1) ,

37
de Vilmarest and Wintenberger

and that is true for m = 1 too. Hence


Λ2
   
4 X
P ∃t > k, λmax (Pt ) > ≤ 4d (m + 1) exp −m min4
Λmin t1−β 10DX
m≥bk1−β c
 1−β c
Λ2min bk Λ2min
  
exp − 10D 4
 exp − 10D 4

X 1−β X
= 4d 
Λ2min
 bk c+1+ 
Λ2min

1 − exp − 10D 4 1 − exp − 10D 4
 2 X X
Λmin k1−β
exp 10D 4

1
 
Λ2min
X 1−β
≤ 4d 
Λ2min
 k + 
Λ2min
 exp − 4 ,
1 − exp − 10D 1 − exp − 10DX
4 10D4
X X

1−β c+1
rbk
rm+1 =
P
where the second line is obtained deriving both sides of 1−r with
m≥bk1−β c
respect to r. Also, as 1 − e−x ≥ xe−x for any x ∈ R, we get
 
4
P ∃t > k, λmax (Pt ) >
Λmin t1−β
4 4 k1−β
Λ2min
 2 
Λ2min
  
10DX 1−β 10DX Λmin
≤ 4d 2 exp 2 4 (k + 2 exp 4 ) exp − 4 .
Λmin 10DX Λmin 10DX 10DX
Furthermore, as xe−x ≤ e−1 for any x ≥ 0, we get for any k ≥ 7:

10DX4  2   2

1−β Λmin 1−β Λmin
k + 2 exp 4 exp −k 4
Λmin 10DX 20DX
4 e−1 4
Λ2min Λ2min
   
20DX 10DX
≤ exp exp
Λ2min Λ2min 10DX 4 20DX 4

4 e−1
Λ2min
  
20DX 1
= 2 exp exp 4 .
Λmin 2 10DX
Combining the last two inequalities, we obtain
 
4
P ∃t > k, λmax (Pt ) >
Λmin t1−β
8 e−1
Λ2min
  2   2

800DX 1 Λmin 1−β Λmin
≤d exp 2 4 + 2 exp 10D 4 exp −k
Λ4min 10DX X 20DX4

625D8 Λ2
 
≤ d 4 X exp −k 1−β min4 ,
Λmin 20DX
2 and consequently
and the result follows. The last line comes from Λmin ≤ DX
Λ2min
  2 
−1 1 Λmin 0.1
800e exp 2 4 + exp 4 ≤ 800e−1+0.2+0.5e ≈ 624.7 ≤ 625 .
10DX 2 10DX
The condition k ≥ 7 is not necessary because

20DX 4 
625dDX 8 1/(1−β)
ln ≥ 20 ln(625δ −1 ) ,
Λ2min Λ4min δ

38
Stochastic Online Optimization using Kalman Recursion

and either δ ≥ 1 and the result is trivial, either δ < 1 and 20 ln(625δ −1 ) ≥ 128.

C.3 Convergence of the Truncated Algorithm


In order to prove Theorem 12, we state and prove an intermediate lemma.

Lemma 22 Let θ ∈ Rd .

1. For any η > 0, we have

∂L
L(θ) − L(θ? ) > η =⇒ ≥ Dη
∂θ θ

Λmin η
where Dη = √
√ .
DX (kθ ? k+ 8η/D 2 )
2DX (1+e X )

2. For any ε > 0, we have

Λmin
kθ − θ? k > ε =⇒ L(θ) − L(θ? ) > ε2 .
4(1 + eDX (kθ? k+ε) )

Proof Both points derive from a second-order identity, turned in an upper-bound in the
one case and in a lower-bound in the other. Using ∂L ?
∂θ (θ ) = 0, there exists 0 ≤ λ ≤ 1 such
that
 
? 1 ? > 1 >
L(θ) = L(θ ) + (θ − θ ) E XX (θ − θ? ) .
2 (1 + e(λθ+(1−λ)θ? )> X )(1 + e−(λθ+(1−λ)θ? )> X )

1. We first have
DX 2
kθ − θ? k2 .
L(θ) − L(θ? ) ≤
8
q
Assume L(θ) − L(θ? ) > η. Then kθ − θ? k ≥ 8η/DX 2 . Also, using the Taylor

expansion of θ? around some θ0 ∈ Rd , we get

∂L > 1 h i
L(θ? ) ≥ L(θ0 )+ (θ? −θ0 )+ (θ0 −θ? )> E XX > (θ0 −θ? ) ,
∂θ θ0 4(1 + eDX (kθ? k+kθ0 −θ? k) )

and that yields

∂L > Λmin
(θ0 − θ? ) ≥ L(θ0 ) − L(θ? ) + kθ0 − θ? k2 .
∂θ θ0 4(1 + e (kθ? k+kθ0 −θ? k) )
DX

Therefore, as L(θ0 ) − L(θ? ) ≥ 0,

∂L Λmin
≥ kθ0 − θtrue k .
∂θ θ0 4(1 + e (kθ? k+kθ0 −θ? k) )
DX

39
de Vilmarest and Wintenberger

Finally, as L is convex of minimum θ? ,

∂L ∂L
≥ min

∂θ θ ?
kθ0 −θ k= 2
8η/DX ∂θ θ0

Λmin
q
≥ √ 2
8η/DX
? k+ 2 )
4(1 + eDX (kθ 8η/DX
)
Λmin √
≥√ √ 2 )
η.
(kθ? k+
2DX (1 + eDX 8η/DX
)

2. On the other hand we have


Λmin
L(θ) ≥ L(θ? ) + kθ − θ ? k2 .
4(1 + e (kθ? k+kθ−θ? k) )
DX

Thus, as L is convex of minimum θ? , if kθ − θ? k > ε it holds

Λmin
L(θ) − L(θ? ) > min L(θ0 ) − L(θ? ) ≥ ε2 .
kθ0 −θ? k=ε 4(1 + eDX (kθ? k+ε) )

Proof of Theorem 12. We prove the convergence of (L(θ̂t ))t to L(θ? ) and then the
convergence of (θ̂t )t to θ? follows. The convergence of (L(θ̂t ))t comes from the first point of
Lemma 22. The link between the two convergences is stated in the second point.
To study the evolution of L(θ̂t ) we first apply a second-order Taylor expansion: for any
t ≥ 1 there exists 0 ≤ αt ≤ 1 such that

∂L > 1 ∂2L
L(θ̂t+1 ) = L(θ̂t ) + (θ̂t+1 − θ̂t ) + (θ̂t+1 − θ̂t )> 2 (θ̂t+1 − θ̂t ) . (12)
∂θ θ̂t 2 ∂θ θ̂t +αt (θ̂t+1 −θ̂t )

2
We have ∂∂θL2 4 14 E[XX > ], therefore, using the update formula on θ̂, the second-order
term is bounded with

∂2L 1 E[XX > ]


(θ̂t+1 − θ̂t )> (θ̂t+1 − θ̂t ) ≤ Xt> Pt+1
>
Pt+1 Xt
∂θ2 θ̂t +αt (θ̂t+1 −θ̂t ) (1 + e )2yt θ̂t> Xt 4
1 4 1 4
≤ DX λmax (Pt+1 )2 ≤ DX λmax (Pt )2 .
4 4
The first-order term is controlled using the definition of the algorithm:

Pt Xt Xt> Pt
 
yt Xt
θ̂t+1 − θ̂t = Pt − >
αt >
,
1 + Xt Pt Xt αt 1 + eyt θ̂t Xt
and as αt ≤ 1,
Pt Xt Xt> Pt yt Xt 3
−αt >
≤ DX λmax (Pt )2 .
1 + Xt Pt Xt αt 1 + eyt θ̂t> Xt

40
Stochastic Online Optimization using Kalman Recursion

∂L
Also, ∂θ ≤ DX . Substituting our findings in Equation (12), we obtain
∂L > yt Xt 4
L(θ̂t+1 ) ≤ L(θ̂t ) + Pt >
+ 2DX λmax (Pt )2 . (13)
∂θ θ̂t 1+ eyt θ̂t Xt
We define
 
∂L > yt Xt ∂L > yt Xt
Mt = Pt >X
−E Pt | X1 , y1 , ..., Xt−1 , yt−1
∂θ θ̂t 1 + eyt θ̂t t ∂θ θ̂t 1 + eyt θ̂t> Xt
∂L > yt Xt ∂L > ∂L
= Pt >X
+ Pt .
∂θ θ̂t 1 + eyt θ̂t t ∂θ θ̂t ∂θ θ̂t
Hence we have
> 2 2
∂L yt Xt ∂L 1 ∂L
Pt >X
≤ Mt − λmin (Pt ) ≤ Mt − 2 ,
∂θ θ̂t 1 + eyt θ̂t t ∂θ θ̂t tDX ∂θ θ̂t

because Ps < sDI 2 . Combining it with Equation (13) and summing consecutive terms, we
X
obtain, for any k < t,
t−1
!
2
X 1 ∂L 4 2
L(θ̂t ) − L(θ̂k ) ≤ Ms − 2 + 2DX λmax (Ps ) . (14)
sDX ∂θ θ̂s
s=k

We recall that there exists Cδ such that P(ACδ ) ≥ 1 − δ where


∞ 
\ Cδ 
ACδ := λmax (Pt ) ≤ .
t1−β
t=1

On the previous inequality, we see that the right-hand side is the sum of a martingale and
a term which is negative for s large enough, under the event ACδ .
We are then interested in P((L(θ̂t ) − L(θ? ) > η) | ACδ ) for some η > 0. For 0 ≤ k ≤ t,
we define Bk,t the event (∀k < s < t, L(θ̂s ) − L(θ? ) > η/2). Then we use the law of total
probability:

P(L(θ̂t ) − L(θ? ) > η | ACδ ) (15)


 
≤ P (L(θ̂t ) − L(θ? ) > η) ∩ B0,t | ACδ
t−1 
X η 
P (L(θ̂t ) − L(θ? ) > η) ∩ L(θ̂k ) − L(θ? ) ≤
+ ∩ Bk,t | ACδ (16)
2
k=1
 
≤ P (L(θ̂t ) − L(θ? ) > η) ∩ B0,t | ACδ
t−1 
X η 
+ P L(θ̂t ) − L(θ̂k ) > ∩ Bk,t | ACδ .
2
k=1

Lemma 22 yields
η ∂L
L(θ̂s ) − L(θ? ) > =⇒ ≥ Dη .
2 ∂θ θ̂s

41
de Vilmarest and Wintenberger

We combine the last equation, along with Equation (14) and the definition of ACδ to
get, for any 1 ≤ k < t,
t−1
!
  X 
P (L(θ̂t ) − L(θ̂k ) > η/2) ∩ Bk,t | ACδ ≤ P Ms > f (k, t) ∩ Bk,t | ACδ
s=k
t−1
!
X
≤P Ms > f (k, t) | ACδ ,
s=k

η Dη2 t−1
P 1 4 C2
t−1
P 1
where f (k, t) = 2 + 2
DX s − 2DX δ s2(1−β)
for any 1 ≤ k < t.
s=k s=k
Similarly, we get
t−1
!
  X
P (L(θ̂t ) − L(θ? ) > η) ∩ B0,t | AC ≤ P Ms > f0 (t) | AC ,
s=1

Dη2 t−1
P 1 t−1
1
with f0 (t) = η − (L(θ̂1 ) − L(θ? )) + 4 C2
P
2
DX s − 2DX δ s2(1−β)
for any t ≥ 1.
s=1 s=1
We have E[Ms | X1 , y1 , ..., Xs−1 , ys−1 ] = 0, and almost surely |Ms | ≤ 2DX2 λ
max (Ps ). We
can therefore apply Azuma-Hoeffding inequality: for t, k such that f (k, t) > 0,
t−1
! !
1−2β
2 (1 − 2β) max 1/2, (k − 1)
X
P Ms > f (k, t) | ACδ ≤ exp −f (k, t) 4 C2 ,
8DX δ
s=k
+∞
P 1 1
because s2(1−β)
≤ (1−2β) max(1/2,(k−1)1−2β )
. Similarly, for t such that f0 (t) > 0,
s=k

t−1
!  
2 1 − 2β
X
P Ms > f0 (t) | ACδ ≤ exp −f0 (t) 4 C2 .
s=1
16DX δ

We need to control f (k, t), f0 (t). We see that for t large enough, when k is small
D2
compared to t, f (k, t) is driven by D2η ln(t) and when k ≈ t, f (k, t) is driven by η/2. The
X
following Lemma formally states these approximations as lower-bounds. We prove it right
after the end of this proof.
6 C2 
 4 2  1 2
16DX
!
δ
2 8D C 1−2β
Lemma 23 For t ≥ max e Dη (1−2β) , 1 + η(1−2β) X δ
, it holds

Dη2 √
f (k, t) ≥ 2 ln(t), 1≤k< t,
4DX
η √
f (k, t) ≥ , t ≤ k < t.
4
2 4 C2
 
2DX 4DX
2 L(θ̂1 )−L(θ? )+ 1−2β δ

Similarly, for t ≥ e , we have
Dη2
f0 (t) ≥ 2 ln(t) .
2DX

42
Stochastic Online Optimization using Kalman Recursion

Dη4 (1−2β) η 2 (1−2β)


Then, defining C1 = 256DX 8 C2 and C2 = 128DX 4 C2 , we finally get for t large enough:
δ δ
 
P (L(θ̂t ) − L(θ? ) > η) ∩ B0,t | ACδ ≤ exp −4C1 ln(t)2 ,

 η 
P (L(θ̂t ) − L(θ? ) > η) ∩ (L(θ̂k ) − L(θ? ) ≤ ) ∩ Bk,t | ACδ
2
(
2
 √
exp − C1 ln(t) if 1 ≤ k < t ,
≤ √
exp − C2 (k − 1)1−2β

if t ≤ k < t .

Substituting in Equation (16) yields:

P(L(θ̂t ) − L(θ? ) > η | AC )



d te−1 t−1  
X X
2
exp −C1 ln(t)2 + exp −C2 (k − 1)1−2β
 
≤ exp −4C1 ln(t) +

k=1 k=d te
√  √ 
≤ ( t + 1) exp −C1 ln(t)2 + t exp −C2 ( t − 1)1−2β .


Λmin ε2
Finally, Point 2 of Lemma 22 allows to obtain the result: defining η = ?
4(1+eDX (kθ k+ε) )
,
we obtain

P(kθ̂t − θ? k > ε | ACδ ) ≤ P(L(θ̂t ) − L(θ? ) > η | ACδ )


√  √ 
≤ ( t + 1) exp −C1 ln(t)2 + t exp −C2 ( t − 1)1−2β .


In order to obtain the constants involved in the Theorem, we write


q
ε2
Λmin 4(1+eΛDmin ?
X (kθ k+ε) )

Λmin
3/2
ε
Dη =   ≥ ? k+ε) ,
D (kθ
r
? Λmin ε2 1+e X 4DX
2DX (1 + exp DX (kθ k + D2 (1+eDX (kθ? k+ε) ) ) )
X

Λ6min (1 − 2β)ε4
C1 ≥ 12 C 2 (1 + eDX (kθ? k+ε) )6
,
216 DX δ
Λ2min (1 − 2β)ε4
C2 ≥ 4 C 2 (1 + eDX (kθ? k+ε) )2
,
211 DX δ

and the conditions of Lemma 23 become


?
!
8 C 2 (1 + eDX (kθ k+ε) )3
28 DX δ
t ≥ exp ,
Λ3min (1 − 2β)ε2
 ! 1 2
?
4 C 2 (1
32DX δ + eDX (kθ k+ε) ) 1−2β

t ≥ 1 +  ,
(1 − 2β)Λmin ε2
? k+ε)
!
4 (1 + eDX (kθ 4 C2 
)3

32DX 4D
t ≥ exp L(θ̂1 ) − L(θ? ) + X δ
.
Λ3min ε2 1 − 2β

43
de Vilmarest and Wintenberger

We would like to obtain a single condition on t, thus we write


 ! 1 2
32D 4 C 2 (1 + eDX (kθ? k+ε) ) 1−2β
1 + X δ 
(1 − 2β)Λmin ε2
  ! 1 
32D 4 C 2 (1 + eDX (kθ? k+ε) ) 1−2β
X δ
= exp 2 ln 1 + 
(1 − 2β)Λmin ε2
!!
32DX 4 C 2 (1 + eDX (kθ? k+ε) )
2 δ
≤ exp ln 1 +
1 − 2β (1 − 2β)Λmin ε2
 s 
32DX4 C 2 (1 + eDX (kθ? k+ε) )
2 δ
≤ exp  
1 − 2β (1 − 2β)Λmin ε2
!
28 DX 8 C 2 (1 + eDX (kθ? k+ε) )3
δ
≤ exp ,
Λ3min (1 − 2β)3/2 ε2

The third line is obtained with the inequality ln(1 + x) ≤ x for any x > 0. Obviously, as
0 < 1 − 2β < 1, the first threshold on t is bounded by:
! !
8 C 2 (1 + eDX (kθ? k+ε) )3
28 DX 2 8 D 8 C 2 (1 + eDX (kθ? k+ε) )3
δ X δ
exp ≤ exp .
Λ3min (1 − 2β)ε2 Λ3min (1 − 2β)3/2 ε2
2
4DX
2 C ≥
To handle the third one, we use DX ≥ 4 and as θ̂1 = 0 we obtain L(θ̂1 ) − L(θ? ) ≤
δ Λmin
4 C2
4DX
ln 2 ≤ 1−2β
δ
, hence
? k+ε)
!
4 (1 + eDX (kθ 4 C2 
)3

32DX ? 4DX δ
exp L(θ̂1 ) − L(θ ) +
Λ3min ε2 1 − 2β
!
8 C 2 (1 + eDX (kθ? k+ε) )3
28 DX δ
≤ exp 3
.
Λmin (1 − 2β)3/2 ε2

Proof of Lemma 23. We recall that for any k ≥ 1,


t−1 t−1
X 1 X 1 1 1
≥ ln t − ln k , ≤ .
s s2(1−β) 1 − 2β max(1/2, (k − 1)1−2β )
s=k s=k

Therefore:
η Dη2 2DX4 C2
δ 1
f (k, t) ≥ + 2 (ln t − ln k) − ,
2 DX 1 − 2β max(1/2, (k − 1)1−2β )
Dη2 4DX4 C2
f0 (t) ≥ η − (L(θ̂1 ) − L(θ? ) + 2 ln t − δ
.
DX 1 − 2β

44
Stochastic Online Optimization using Kalman Recursion

√ 1
• For any 1 ≤ k < t, it holds ln k ≤ 2 ln t, and we have

Dη2 4DX4 C2
δ
f (k, t) ≥ 2 ln(t) − .
2DX 1 − 2β

16DX 6 C2
δ
2 (1−2β) Dη2
Taking t ≥ e Dη
yields f (k, t) ≥ 4DX2 ln(t).


• For t ≥ 2 and any k ≥ t, we have

2DX4 C2 4 C2
2DX
η δ η δ
f (k, t) ≥ − ≥ − √ .
2 (1 − 2β)(k − 1)1−2β 2 (1 − 2β)( t − 1)1−2β

 1 2
4 C2
 
8DX 1−2β
Then if t ≥ 1+ δ
η(1−2β) , we get f (k, t) ≥ η4 .

Dη2 4 C2
4DX
• Last point comes from f0 (t) ≥ 2
DX
ln t − (L(θ̂1 ) − L(θ? ) − 1−2β .
δ

 8 C 2 (1+eDX (kθ ? k+ε) 3 


28 DX δ/2
)
Proof of Corollary 13. We apply Theorem 12: for any t ≥ exp Λ3min (1−2β)3/2 ε2
,

√  √ 
P(kθ̂t − θ? k > ε | ACδ/2 ) ≤ ( t + 1) exp −C1 ln(t)2 + t exp −C2 ( t − 1)1−2β ,


where

Λ6min (1 − 2β)ε4 Λ2min (1 − 2β)ε4


C1 = 12 C 2 (1 + eDX (kθ? k+ε) )6
, C2 = 4 C 2 (1 + eDX (kθ? k+ε) )2
.
216 DX δ/2 211 DX δ/2

 8 C 2 (1+eDX (kθ ? k+ε) 3 


28 DX δ/2
)
We use a union bound: for any T ≥ exp Λ3min (1−2β)3/2 ε2
,


!
[ X√
?
( t + 1) exp −C1 ln(t)2

P (kθ̂t − θ k > ε) | ACδ/2 ≤
t=T +1 t>T
X  √ 
+ t exp −C2 ( t − 1)1−2β .
t>T

3
• If T ≥ e 2C1 , we have
X√  X√ 1
( t + 1) exp −C1 ln(t)2 ≤ ( t + 1) 5/2 ≤ 2/T .
t>T t>T
t

45
de Vilmarest and Wintenberger

√ 
12
4/(1−2β)
• For t ≥ 4, 1 − 1/ t ≥ 1/2, then for t ≥ C2 (1−2β) ,


 
3

1−2β
 C2 (1−2β)/2
t exp −C2 ( t − 1) ≤ exp 3 ln(t) − t
2
    
12 12 6 12
≤ exp ln −
1 − 2β C2 (1 − 2β) 1 − 2β C2 (1 − 2β)
≤ 1,

because for any x > 0, we have ln x ≤ x/2.

 4/(1−2β)
12
Thus for T ≥ C2 (1−2β)

X  √ 
t exp −C2 ( t − 1)1−2β ≤ 1/T .
t>T

Finally, for T satisfying the previous conditions as well as T ≥ 6δ −1 , we obtain


!
[
?
P (kθ̂t − θ k > ε) | ACδ/2 ≤ 3/T ≤ δ/2 .
t=T +1

We now compare the constants involved. As long as εDX ≤ 1, we have

? k+ε) ? k+ε)
28 DX
8 C 2 (1 + eDX (kθ )3 3 · 215 DX
12 C 2 (1 + eDX (kθ )6
! !
δ/2 δ/2
exp ≤ exp .
Λ3min (1 − 2β)3/2 ε2 Λ6min (1 − 2β)3/2 ε4

Furthermore, as 1 − 2β ≤ 1, we have

? k+ε)
3 · 215 DX
12 C 2 (1 + eDX (kθ )6
  !
3 δ/2
exp = exp
2C1 Λ6min (1 − 2β)ε4
? k+ε)
3 · 215 DX
12 C 2 (1 + eDX (kθ )6
!
δ/2
≤ exp .
Λ6min (1 − 2β)3/2 ε4

46
Stochastic Online Optimization using Kalman Recursion

Finally,
 4/(1−2β)  
12 4 12
= exp ln
C2 (1 − 2β) 1 − 2β C2 (1 − 2β)
? k+ε)
12 · 211 DX
4 C 2 (1 + eDX (kθ )2
!
4 δ/2
= exp ln
1 − 2β Λ2min (1 − 2β)2 ε4
12 · 211 DX 4 C 2 (1 + eDX (kθ? k+ε) )2
!
8 δ/2
= exp ln
1 − 2β Λ2min (1 − 2β)ε4
 s 
3 · 2 13 D 4 C 2 (1 + eDX (kθ? k+ε) )2
8 X δ/2
≤ exp  
1 − 2β Λ2min (1 − 2β)ε4
√ 9 2 ?
!
62 DX Cδ/2 (1 + eDX (kθ k+ε) )
= exp
Λmin (1 − 2β)3/2 ε2
3 · 215 DX12 C 2 (1 + eDX (kθ? k+ε) )6
!
δ/2
≤ exp .
Λ6min (1 − 2β)3/2 ε4

Appendix D. Proofs of Section 5


Proof of Proposition 14. The first order condition of the optimum yields

t−1 t−1
X 1 X
arg min (ys − θ> Xs )2 + (θ − θ̂1 )> P1−1 (θ − θ̂1 ) = θ̂1 + Pt (ys − θ̂1> Xs )Xs .
θ∈Rd 2
s=1 s=1

Therefore we prove recursively that θ̂t − θ̂1 = Pt t−1 >


P
s=1 (ys − θ̂1 Xs )Xs . It is clearly true at
t = 1. Assuming it is true for some t ≥ 1, we use the update formula

θ̂t+1 − θ̂1 = (I − Pt+1 Xt Xt> )(θ̂t − θ̂1 ) + Pt+1 yt Xt − Pt+1 Xt Xt> θ̂1
t−1
X
= (I − Pt+1 Xt Xt> )Pt (ys − θ̂1> Xs )Xs + Pt+1 (yt − θ̂1> Xt )Xt .
s=1

We conclude with the following identity:

Pt Xt Xt> Pt Xt Xt> Pt Pt Xt Xt> Pt


(I − Pt+1 Xt Xt> )Pt = Pt − Pt Xt Xt> Pt + = P t − = Pt+1 .
Xt> Pt Xt + 1 Xt> Pt Xt + 1

47
de Vilmarest and Wintenberger

D.1 Proof of Theorem 16


We first prove a result controlling the first estimates of the algorithm.
Lemma 24 Provided that Assumptions 1, 2 and 4 are satisfied, starting from any θ̂1 ∈ Rd
and P1  0, for any δ > 0, it holds simultaneously
kθ̂t − θ? k ≤ kθ̂1 − θ? k + λmax (P1 )DX (3σ + Dapprox )(t − 1) + 3σ ln δ −1 ,

t ≥ 1,
with probability at least 1 − δ.
Proof From Proposition 14, we obtain, for any t ≥ 1, θ̂t − θ̂1 = Pt t−1 >
P
s=1 (ys − θ̂1 Xs )Xs .
Consequently,
t−1 t−1
!
X X
θ̂t − θ? = Pt (ys − θ̂1> Xs )Xs − Pt P1−1 + Xs Xs> (θ? − θ̂1 )
s=1 s=1
t−1
X
= Pt (ys − θ?> Xs )Xs + Pt P1−1 (θ̂1 − θ? ) ,
s=1

and using Pt P1−1 4 I, we obtain


t−1
X
kθ̂t − θ? k ≤ kθ̂1 − θ? k + λmax (Pt )DX |ys − θ?> Xs |
s=1
t−1
X
≤ kθ̂1 − θ? k + λmax (P1 )DX (|ys − E[ys | Xs ]| + Dapp ) . (17)
s=1

We apply Lemma 1.4 of Rigollet and Hütter (2015) in the second line of the following:
for any µ such that 0 < µ < 2√12σ ,
X µi E[|yt − E[yt | Xt ]|i ]
E [exp(µ|yt − E[yt | Xt ]|)] = 1 +
i!
i≥1
X µi (2σ 2 )i/2 iΓ(i/2)
≤1+
i!
k≥1
X √ i
≤1+ 2µσ , because Γ(i/2) ≤ Γ(i) = (i − 1)!
i≥1
√ √ 1
≤ 1 + 2 2µσ, because 0 < 2µσ ≤
 √  2
≤ exp 2 2µσ .
t √
  
1 P
Therefore exp 2√2σ (|ys − E[ys | Xs ]| − 2 2σ) is a super-martingale to which we
s=1 t
can apply Lemma 17. We obtain, for any δ > 0,
t−1
X √ √
|yt − E[yt | Xt ]| ≤ 2 2(t − 1)σ + 2 2σ ln δ −1 , t ≥ 1,
s=1

48
Stochastic Online Optimization using Kalman Recursion


with probability at least 1 − δ. The result follows from Equation (17) and 2 2 ≤ 3.

Proof of Theorem 16. We first apply Theorem 2: with probability at least 1 − 5δ, it
holds simultaneously for all n ≥ T (ε, δ)
n  2 
X
? 15 2 2 2 2
 λmax (P1 )DX
L(θ̂t ) − L(θ ) ≤ d 8σ + Dapp + ε DX ln 1 + (n − T (ε, δ))
2 d
t=T (ε,δ)+1
 
+ 5λmax PT−1 (ε,δ)+1 ε
2


λmax (P1 )DX2 
2
+ 115 σ (4 + ) + Dapp + 2ε DX ln δ −1 .
2 2 2
4
 
Moreover, λmax PT−1 −1
(ε,δ)+1 ≤ λmax (P1 ) + T (ε, δ)DX .
2

Then we derive a bound on the first T (ε, δ) terms. For any t ≥ 1, we have L(θ̂t )−L(θ? ) ≤
2 kθ̂
DX − θ? k2 , thus, using (a + b)2 ≤ 2(a2 + b2 ) and applying Lemma 24 we obtain the
t
simultaneous property
L(θ̂t ) − L(θ? ) ≤ 2DX
2
(kθ̂1 − θ? k + 3λmax (P1 )DX σ ln δ −1 )2
+ 2λmax (P1 )2 DX
4
(3σ + Dapp )2 (t − 1)2 , t ≥ 1,
with probability at least 1 − δ. A summation argument yields, for any δ > 0,
T (ε,δ)
X
L(θ̂t ) − L(θ? ) ≤ 2DX
2
(kθ̂1 − θ? k + 3λmax (P1 )DX σ ln δ −1 )2 T (ε, δ)
t=1
(T (ε, δ) − 1)T (ε, δ)(2T (ε, δ) − 1)
+ λmax (P1 )2 DX
4
(3σ + Dapp )2 ,
3
with probability at least 1 − δ.

D.2 Definition of T (ε, δ)


We now focus on the definition of T (ε, δ). We first transcript the result of Hsu et al. (2012)
to our notations in the following lemma.
Lemma 25 Provided that Assumptions 1, 2 and 4 are satisfied, starting from any θ̂1 ∈ Rd
DX2
and P1 = p1 I, p1 > 0, we have, for any 0 < δ < e−2.6 and t ≥ 6 Λmin (ln d + ln δ −1 ),
√ !
3 kθ̂ − θ ? k2 D 2
4(1 + 8 ln δ −1 ) 3σ 2 (d/0.035 + ln δ −1 )
1
kθ̂t+1 − θ? k2Σ ≤ 2
+ X Dapp +
t 2p1 Λmin 0.072 0.07

12 kθ̂1 − θ? k2 DX2 √
+ (1 + 8 ln δ −1 )
0.072 t2 p1 Λmin
!
kθ̂1 − θ? k 2
 
DX ? −1 2
+ √ (Dapp + DX kθ k) + √ (ln δ ) ,
Λmin 2p1
with probability at least 1 − 4δ.

49
de Vilmarest and Wintenberger

Proof We first observe that


t t
1X 1X
arg min (ys − w> Xs )2 + λkw − β̂1 k2 = arg min (ys − β̂1> Xs − w> Xs )2 + λkwk2 ,
w∈Rd t w∈Rd t
s=1 s=1

therefore we apply the ridge analysis of Hsu et al. (2012) to (Xs , ys − β̂1> Xs ). We note that
(ys − β̂1> Xs ) has the same variance proxy and the same approximation error, therefore it
only amounts to translate the optimal w, that is denoted by β.
For any λ > 0, we observe that
DX
d2,λ ≤ d1,λ ≤ d , ρλ ≤ p , bλ ≤ ρλ (Dapp + DX kβ − β̂1 k) .
d1,λ Λmin

Therefore we can apply Theorem 16 of Hsu et al. (2012): for 0 < δ < e−2.6 and t ≥
6 √DX
Λmin
(ln(d)+ln δ −1 ), it holds that kβ̂t+1,λ −βk2Σ = 3(kβλ −βk2Σ +εbs +εvr ) with probability
1 − 4δ, with
DX2 DX2
| X] − β > X)2 ] + (1 + Λmin )kβλ − βk2Σ √

4 Λmin E[(E[y
εbs ≤ (1 + 8 ln δ −1 )
0.072 t
( √D
Λ
X
(Dapp + DX kβ − β̂1 k) + kβλ − βkΣ )2 
min −1 2
+ (ln δ ) ,
t2
r
DX4

√ +1 4 2
Λmin d
1 DX 1
δf ≤ √ √ (1 + 8 ln δ −1 ) + ln δ −1 ,
t Λmin t 3
p
σ 2 d(1 + δf ) 2σ 2 d(1 + δf ) ln δ −1 2σ 2 ln δ −1
εvr ≤ + + .
0.072 t 0.073/2 t 0.07t
Moreover E[(E[y | X]−β > X)2 ] ≤ Dapp
2 and Λmin ≤ DX 2 , hence, using kβ −βk ≤ λkβ −
λ Σ
β̂1 k we transfer the result in our KF notations, that is, θ̂t = β̂t,p−1 /2(t−1) , β̂1 = θ̂1 , β = θ? .
1
We obtain, for any 0 < δ < e−2.6 and t ≥ 6 √D
Λ
X
(ln(d) + ln δ −1 ),
min

2 2
DX 2 DX kθ̂1 −θ? k2
4  Λmin Dapp + Λmin p1 t

εbs ≤ (1 + 8 ln δ −1 )
0.072 t
kθ̂1 −θ? k 2
( √D
Λ
X
(Dapp + DX kθ? k) + √
2p1 t
) 
+ min
(ln δ −1 )2 ,
t2 r
DX4

√ +1 4 2
Λmin d
1 DX 1
δf ≤ √ √ (1 + 8 ln δ −1 ) + ln δ −1 ,
t Λmin t 3
p
σ 2 d(1 + δf ) 2σ 2 d(1 + δf ) ln δ −1 2σ 2 ln δ −1
εvr ≤ + + ,
0.072 t 0.073/2 t ! 0.07t
kθ̂1 − θ? k2
kθ̂t+1 − θ? k2Σ ≤ 3 + εbs + εvr ,
2p1 t

50
Stochastic Online Optimization using Kalman Recursion

DX2
with probability at least 1 − 4δ. For t ≥ Λmin ln δ −1 , as ln δ −1 ≥ 1, we get
√ √

r
1 14 1 1+ 8 2 2
δf ≤ √ (1 + 8 ln δ −1 ) + +1≤ √ + ≈ 1.9 ≤ 2 .
6 ln δ −1 63 d 6 9
√ a+b
Thus, as ab ≤ 2 for any a, b > 0, we have
r !
σ2 3d 3d ln δ −1
εvr ≤ +2 + 2 ln δ −1
0.07t 0.07 0.07
σ2
 
6d −1
≤ + 3 ln δ
0.07t 0.07
3σ 2 (d/0.035 + ln δ −1 )
≤ .
0.07t
It yields the result.

Lemma 25 allows the definition of an explicit value for T (ε, δ), as displayed in the
following Corollary.
Corollary 26 Assumption 5 is satisfied for T (ε, δ) = max(T1 (δ), T2 (ε, δ), T3 (ε, δ)) where
we define

DX2 2 2 
−1 48DX 24DX
T1 (δ) = max 12 (ln d + ln δ ), ln ,
Λmin Λmin Λmin
√ !
24ε−1 kθ̂1 − θ? k2 DX2
4(1 + 8 ln δ −1 ) 3σ 2 (d/0.035 + ln δ −1 )
T2 (ε, δ) = + D2 +
Λmin 2p1 Λmin app 0.072 0.07
√ !
12ε−1 kθ̂1 − θ? k2 DX2
4(1 + 8 ln δ −1 ) 3σ 2 (d/0.035 + ln δ −1 )
ln + D2 + ,
Λmin 2p1 Λmin app 0.072 0.07
s
96ε−1 kθ̂1 − θ? k2 DX 2 √
T3 (ε, δ) = (1 + 8 ln δ −1 )
0.072 Λmin p1 Λmin
!1/2
kθ̂1 − θ? k 2
 
DX ? −1 2
+ √ (Dapp + DX kθ k) + √ (ln δ )
Λmin 2p1
96ε−1 kθ̂1 − θ? k2 D2 √
ln (1 + X )(1 + 8 ln δ −1 )
0.072 Λmin 2p1 Λmin
!
?k 2
 
DX kθ̂ 1 − θ
+ √ (Dapp + DX kθ? k) + √ (ln δ −1 )2 .
Λmin 2p1

We recall that for any η ≤ 1, we have lnt t ≤ η for t ≥ 2η −1 ln(η −1 ), and we use it in the
following proof.
Proof of Corollary 26. We define δt = δ/t2 for any t ≥ 1. In order to apply Lemma
D2 D2
25 with a union bound, we need t ≥ 6 Λmin
X
(ln d + ln δt−1 ). If t ≥ 12 Λmin
X
(ln d + ln δ −1 ) and

51
de Vilmarest and Wintenberger

2
48DX 2
24DX
t≥ Λmin ln Λmin , we obtain

t t√
t≥ + t
2 2
D2 12DX 2 √
≥ 6 X (ln d + ln δ −1 ) + ln t, as ln t ≤ t
Λmin Λmin
DX2
=6 (ln d + ln δt−1 ) .
Λmin
2 48D2 24D2
 
DX
Therefore, we define T1 (δ) = max 12 Λmin (ln d + ln δ −1 ), ΛminX ln ΛminX , and we apply
Lemma 25. We get the simultaneous property
 q 
−1 −1
3 kθ̂1 − θ k? 2 D 2 4(1 + 8 ln δt ) 2
3σ (d/0.035 + ln δt ) 
kθ̂t+1 − θ? k2Σ ≤  + X Dapp 2
2
+
t 2p1 Λmin 0.07 0.07
2
12 kθ̂1 − θ? k2 DX
q
+ (1 + 8 ln δt−1 )
0.072 t2 p1 Λmin
!
?k 2
 
DX k θ̂ 1 − θ
+ √ (Dapp + DX kθ? k) + √ (ln δt−1 )2
Λmin 2p1

t−2 ≥ 1 − δ because T1 (δ) > 4.


P
for all t ≥ T1 (δ), with probability at least 1 − 4δ
t≥T1 (δ)
Thus, as ln t ≥ 1 for t ≥ T1 (δ) and kθ̂t+1 − θ k2Σ ≥ Λmin kθ̂t+1
? − θ? k2 , we obtain
√ !
6 ln t kθ̂1 − θ? k2 DX2
4(1 + 8 ln δ −1 ) 3σ 2 (d/0.035 + ln δ −1 )
kθ̂t+1 − θ? k ≤ + D2 +
Λmin t 2p1 Λmin app 0.072 0.07

48(ln t)2 kθ̂1 − θ? k2 DX2 √


+ (1 + 8 ln δ −1 )
0.072 Λmin t2 p1 Λmin
!
kθ̂1 − θ? k 2
 
DX ? −1 2
+ √ (Dapp + DX kθ k) + √ (ln δ )
Λmin 2p1

for all t ≥ T1 (δ, with probability at least 1 − δ. Finally, both terms of the last inequality
are bounded by ε/2.

From Corollary 26, we obtain the asymptotic rate by comparing T2 (δ) and T3 (δ). We
write T2 (δ) = 2A2 (δ) ln A2 (δ), T3 (δ) = 2A3 (δ) ln A3 (δ) with
!
ε−1 kθ̂1 − θ? k2 DX2 √
A2 (δ) . + D2 ln δ −1 + σ 2 (d + ln δ −1 )
Λmin p1 Λmin app
v !
u ε−1 kθ̂1 − θ? k2 D2 √
u
? kθ̂1 − θ? k 2
app + DX kθ k)
 D (D
X −1 X −1 2
A3 (δ) . t ln δ + √ + √ (ln δ ) .
Λmin p1 Λmin Λmin p1

52
Stochastic Online Optimization using Kalman Recursion

√ √ √
where
√ the symbol . means less than up to universal constants. As a + b . a + b and
ab . a + b, we obtain
s s
2 √
ε−1 kθ̂1 − θ? k2 DX
A3 (δ) . ln δ −1
Λmin p1 Λmin
!
?k
 
DX k θ̂ 1 − θ
+ √ (Dapp + DX kθ? k) + √ ln δ −1
Λmin p1
s
ε−1 kθ̂1 − θ? k2 D2 √
. + X ln δ −1
Λmin p1 Λmin
!
?k
 
DX k θ̂ 1 − θ
+ √ (Dapp + DX kθ? k) + √ ln δ −1 .
Λmin p1

ε−1
Thus, as long as Λmin ≤ 1, we get

ε−1 kθ̂1 − θ? k2 D2 √
2
A2 (δ), A3 (δ) . + X (1 + Dapp ) ln δ −1 + σ 2 d
Λmin p1 Λmin
!
kθ̂1 − θ? k
 
DX ? 2 −1
+ √ (Dapp + DX kθ k) + √ + σ ln δ .
Λmin p1

References
Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10
(2):251–276, 1998.
Francis Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity
for logistic regression. Journal of Machine Learning Research, 15(1):595–627, 2014.
Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation
with convergence rate o(1/n). In Advances in Neural Information Processing Systems,
pages 773–781, 2013.
Bernard Bercu and Abderrahmen Touati. Exponential inequalities for self-normalized mar-
tingales with applications. The Annals of Applied Probability, 18(5):1848–1869, 2008.
Bernard Bercu, Antoine Godichon, and Bruno Portier. An efficient stochastic newton al-
gorithm for parameter estimation in logistic regressions. SIAM Journal on Control and
Optimization, 58(1):348–367, 2020.
Jock A. Blackard and Denis J. Dean. Comparative accuracies of artificial neural networks
and discriminant analysis in predicting forest cover types from cartographic variables.
Computers and Electronics in Agriculture, 24(3):131–151, 1999.
Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge
university press, 2006.

53
de Vilmarest and Wintenberger

George T. Diderrich. The Kalman filter from the perspective of Goldberger–Theil estima-
tors. The American Statistician, 39(3):193–198, 1985.

James Durbin and Siem J. Koopman. Time Series Analysis by State Space Methods. Oxford
university press, 2012.

Ludwig Fahrmeir. Posterior mode estimation by extended Kalman filtering for multivariate
dynamic generalized linear models. Journal of the American Statistical Association, 87
(418):501–509, 1992.

David A. Freedman. On tail probabilities for martingales. the Annals of Probability, pages
100–118, 1975.

Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online
convex optimization. Machine Learning, 69(2-3):169–192, 2007.

Daniel Hsu, Sham M. Kakade, and Tong Zhang. Random design analysis of ridge regression.
In Conference on Learning Theory, pages 9–1, 2012.

Sham M. Kakade and Andrew Y. Ng. Online bounds for bayesian algorithms. In Advances
in Neural Information Processing Systems, pages 641–648, 2005.

Rudolph E. Kalman and Richard S. Bucy. New results in linear filtering and prediction
theory. Journal of Basic Engineering, 83(1):95–108, 1961.

Ron Kohavi. Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid.


In The International Conference on Knowledge Discovery and Data Mining, volume 96,
pages 202–207, 1996.

Tomer Koren. Open problem: Fast stochastic exp-concave optimization. In Conference on


Learning Theory, pages 1073–1075, 2013.

Mehrdad Mahdavi, Lijun Zhang, and Rong Jin. Lower and upper bounds on the generaliza-
tion of stochastic exponentially concave optimization. In Conference on Learning Theory,
pages 1305–1320, 2015.

Peter McCullagh and John A. Nelder. Generalized Linear Models. London Chapman and
Hall, 2nd ed edition, 1989.

Noboru Murata and Shun-ichi Amari. Statistical analysis of learning dynamics. Signal
Processing, 74(1):3–28, 1999.

Yann Ollivier. Online natural gradient as a Kalman filter. Electronic Journal of Statistics,
12(2):2930–2961, 2018.

Dmitrii M. Ostrovskii and Francis Bach. Finite-sample analysis of m-estimators using self-
concordance. Electronic Journal of Statistics, 15(1):326–391, 2021.

Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by


averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

54
Stochastic Online Optimization using Kalman Recursion

Phillippe Rigollet and Jan-Christian Hütter. High dimensional statistics. Lecture notes for
course 18S997, 2015.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of
Mathematical Statistics, pages 400–407, 1951.

David Ruppert. Efficient estimations from a slowly convergent robbins-monro process.


Technical report, Cornell University Operations Research and Industrial Engineering,
1988.

Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of
Computational Mathematics, 12(4):389–434, 2012.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent.
In International Conference on Machine Learning, pages 928–936, 2003.

55

You might also like