Stochastic Online Optimization Using Kalman Recursion
Abstract
We study the Extended Kalman Filter in constant dynamics, offering a Bayesian perspective
of stochastic optimization. For generalized linear models, we obtain high probability bounds
on the cumulative excess risk in an unconstrained setting, under the assumption that the
algorithm reaches a local phase. In order to avoid any projection step we propose a two-
phase analysis. First, for linear and logistic regressions, we prove that the algorithm enters
a local phase where the estimate stays in a small region around the optimum. We provide
explicit bounds with high probability on this convergence time, slightly modifying the
Extended Kalman Filter in the logistic setting. Second, for generalized linear regressions,
we provide a martingale analysis of the excess risk in the local phase, improving existing
ones in bounded stochastic optimization. The algorithm appears as a parameter-free online
procedure that optimally solves some unconstrained optimization problems.
Keywords: extended Kalman filter, online learning, stochastic optimization
1. Introduction
The optimization of convex functions is a long-standing problem with many applications. In
supervised machine learning it frequently arises in the form of the prediction of an observa-
tion yt ∈ R given explanatory variables Xt ∈ Rd . The aim is to minimize a cost depending
on the prediction and the observation. We focus in this article on linear predictors, hence
the loss function is of the form ℓ(y_t, θ⊤X_t).
Two important settings have emerged in order to analyse learning algorithms. In the
online setting, (X_t, y_t) may be set by an adversary. The required assumption is boundedness, and the goal is to upper bound the regret (cumulative excess loss compared to
the optimum). In the stochastic setting with independent identically distributed (i.i.d.)
(X_t, y_t), we define the risk L(θ) = E[ℓ(y, θ⊤X)]. We focus on the cumulative excess risk Σ_{t=1}^n L(θ̂_t) − L(θ*), where θ* minimizes the risk. We obtain bounds holding with high probability simultaneously for any horizon, that is, we control the whole trajectory of the risk.
Furthermore, our bounds on the cumulative risk all lead to similar ones on the excess risk
at any step for the averaged version of the algorithm.
Due to its low computational cost the Stochastic Gradient Descent (SGD) of Robbins
and Monro (1951) has been widely used, along with its equivalent in the online setting,
the online gradient descent (Zinkevich, 2003) and a simple variant where the iterates are
averaged (Ruppert, 1988; Polyak and Juditsky, 1992). More recently Bach and Moulines
(2013) provided a sharp bound in expectation on the excess risk for a two-step procedure
that has been extended to the average of SGD with a constant step size (Bach, 2014).
Second-order methods based on stochastic versions of Newton-Raphson algorithm have been
developed in order to converge faster in iterations, although with a bigger computational
cost per iteration (Hazan et al., 2007).
In order to obtain a parameter-free second-order algorithm we apply a Bayesian perspective, seeing the loss as a negative log-likelihood and approximating the maximum-likelihood
estimator at each step. We get a state-space model interpretation of the optimization prob-
lem: in a well-specified setting the space equation is y_t ∼ p_{θ_t}(· | X_t) ∝ exp(−ℓ(·, θ_t⊤X_t))
with θt ∈ Rd and the state equation defines the dynamics of the state θt . The stochastic con-
vex optimization setting corresponds to a degenerate constant state-space model θt = θt−1
called static. As usual in state-space models, the optimization is realized with the Kalman
recursion (Kalman and Bucy, 1961) for the quadratic loss and the Extended Kalman Filter
(EKF) (Fahrmeir, 1992) in a more general case. A correspondence has recently been made
by Ollivier (2018) between the static EKF and the online natural gradient (Amari, 1998).
This motivates a risk analysis in order to enrich the link between Kalman filtering and
the optimization community. We may see the static EKF as the online approximation of
Bayesian model averaging, and similarly to the analysis derived by Kakade and Ng (2005), our analysis is robust to misspecification, that is, we do not assume the data to be generated by the probabilistic model.
The static EKF is very close to the Online Newton Step (ONS) introduced by Hazan
et al. (2007) as both are second-order online algorithms and our results are of the same
flavor as those obtained on the ONS (Mahdavi et al., 2015). However the ONS requires the
knowledge of the region in which the optimization is realized. It is involved in the choice of
the gradient step size and a projection step is done to ensure that the search stays in the
chosen region. On the other hand the static EKF yields two advantages at the cost of being
less generic.
First, there is no costly projection step and each recursive update runs in O(d²) operations, where d is the dimension of the features X_t. Therefore, our comparison of the
static EKF with the ONS provides a lead to the open question of Koren (2013). Indeed, the
problem of the ONS pointed out by Koren (2013) is to control the cost of the projection step
and the question is whether it is possible to perform better than the ONS in the stochastic
exp-concave setting. We don’t answer the open question in the general setting. However,
we suggest a general way to get rid of the projection by dividing the analysis between a
convergence proof of the algorithm to the optimum and a second phase where the estimate
stays in a small region around the optimum where no projection is required.
Second, the algorithm is (nearly) parameter-free. We believe that Bayesian statistics is the reasonable approach to obtain parameter-free online algorithms in the unconstrained setting. Parameter-free is not exactly correct, as there are initialization parameters, which we see as a smoothed version of the hard constraint imposed by bounded algorithms,
but they have no impact on the leading terms of our bounds. The static Kalman filter coincides with the ridge forecaster, and similarly the static EKF may be seen as the online approximation of a regularized empirical risk minimizer.
where θ* reaches the minimum value of the cumulative loss and thus highly depends on
(ℓ_t)_{1≤t≤n}. Under an assumption of bounded gradients, Zinkevich (2003) proved that a first-order online gradient descent yields a regret bound in O(√n). The Online Newton Step
(ONS) is a second-order online gradient descent that has been designed to obtain a regret
bound in O(ln n) (Hazan et al., 2007) under the assumption that the losses are exp-concave.
The improved guarantee comes at a cost of O(d²) operations per step instead of O(d), along
with a projection at each step whose cost depends on the data.
In the stochastic setting where the losses (ℓ_t) are assumed i.i.d., the aim is to minimize the risk L(θ) = E[ℓ(θ)]. A natural candidate is the Empirical Risk Minimizer (ERM), whose asymptotics are well understood (see for example Murata and Amari (1999)). Assuming the existence of θ* minimizing the risk and that the Hessian matrix H* = ∂²L/∂θ²(θ*) is positive
de Vilmarest and Wintenberger
Table 1: Summary of existing results along with the static EKF for which we prove the
bound for the cumulative excess risk. We focus on the dependence on n and δ for
the bounds holding with probability 1 − δ (w.h.p.). S(δ) is the cumulative excess
risk of the convergence phase.
assumption that the loss is α-exp-concave, tr(G*H*⁻¹) ≤ d/α, and for the averaged version of the ONS the rate L(θ̄_n) − L(θ*) = O(d(ln n + ln δ⁻¹)/(αn)) with probability 1 − δ is obtained. From our perspective, the result is stronger than what is claimed by Mahdavi et al. (2015): the bound obtained is

    Σ_{t=1}^n L(θ̂_t) − L(θ*) = O(ln n + ln δ⁻¹) ,    (1)

holding simultaneously for any n with probability 1 − δ. Note that although averaging this bound with Jensen's inequality leads to a sub-optimal bound on the excess risk of the last averaged estimate, it is conversely not possible to obtain Equation (1) from

    L(θ̄_n) − L(θ*) = O(ln δ⁻¹ / n) ,

holding with probability 1 − δ.
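The Jensen step alluded to above can be made explicit; a sketch, writing θ̄_n for the average of the first n estimates and using only the convexity of L:

```latex
% Jensen's inequality applied to the convex risk L, with \bar\theta_n = \frac{1}{n}\sum_{t=1}^{n}\hat\theta_t:
L(\bar\theta_n) - L(\theta^\star)
  \;\le\; \frac{1}{n}\sum_{t=1}^{n}\left( L(\hat\theta_t) - L(\theta^\star) \right)
  \;=\; O\!\left(\frac{\ln n + \ln \delta^{-1}}{n}\right)
  \quad \text{with probability } 1-\delta.
```

The converse direction fails because a bound on the last averaged estimate no longer controls the whole trajectory.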
1.2 Contributions
Our first contribution is a local analysis of the static EKF under assumptions defined in
Section 2, and provided that consecutive steps stay in a small ball around the optimum θ*. We derive local bounds on the cumulative risk with high probability from a martingale
analysis. Our analysis of Section 3 is similar to the one of Mahdavi et al. (2015) and we
slightly refine their constants as a by-product.
We then show that the convergence property crucial in our analysis is reachable. To that
end we focus on linear regression and logistic regression as these two well-known problems
are challenging in the unconstrained setting. In linear regression, the gradient of the loss is
not bounded globally. In logistic regression, the loss is strictly convex, but neither strongly
convex nor exp-concave in the unconstrained setting. In Section 4, we develop a global
bound in the logistic setting on a slight modification of the algorithm introduced by Bercu
et al. (2020). We prove that this modified algorithm converges and stays in a local region around θ* after a finite number of steps. Moreover we show that it coincides with the static
EKF and thus our local analysis applies. In Section 5, we apply our local analysis to the
quadratic setting. We rely on Hsu et al. (2012) to obtain the convergence of the algorithm
after exhibiting the correspondence between the Kalman filter in constant dynamics and the ridge forecaster, and we therefore similarly obtain a global bound.
Finally, we demonstrate numerically the competitiveness of the static EKF for logistic
regression in Section 6.
Assumption 1 The observations (X_t, y_t)_t are i.i.d. copies of the pair (X, y) ∈ X × Y and there exists D_X such that ‖X_t‖ ≤ D_X almost surely (a.s.).
Working under Assumption 1, we define the risk function L(θ) = E[ℓ(y, θ⊤X)]. Note that in Section 3 we do not need E[XX⊤] to be invertible, but we will make such an assumption in Sections 4 and 5 to prove the convergence of the algorithm in the logistic and quadratic settings, respectively. In order to work on a well-defined optimization problem we assume
there exists a minimum:
where (θ̂t )t are the estimates of the static EKF, and with the convention min ∅ = +∞. We
assume a bound on τ (ε) holding with high probability:
Assumption 5 For any δ, ε > 0, there exists T (ε, δ) ∈ N such that
    P( τ(ε) ≤ T(ε, δ) ) ≥ 1 − δ .
Assumption 5 states that with high probability there exists a convergence time after which the algorithm stays trapped in a local region around the optimum. Sections 4 and 5 are devoted to defining explicitly such a convergence time for a modified EKF in the logistic setting and for the true EKF in the quadratic setting.
We present our results in the bounded and sub-Gaussian settings. The results and their proofs are very similar, but two crucial steps differ. First, Assumption 3 yields a bound on the gradient holding almost surely. We relax the boundedness condition for the quadratic loss with a sub-Gaussian hypothesis, requiring a specific analysis. Second, our analysis is based on a second-order expansion. The quadratic loss is equal to its second-order Taylor expansion, but we need Assumption 5 along with the third point of Assumption 3 otherwise.
The following theorem is our result in the bounded setting.
Theorem 1 Starting the static EKF from any θ̂₁ ∈ R^d, P₁ ≻ 0, if Assumptions 1, 2, 3, 5 are satisfied and if ρ_ε > 0.95, for any δ > 0, it holds for any n ≥ 1 simultaneously

    Σ_{t=T(ε,δ)+1}^{T(ε,δ)+n} L(θ̂_t) − L(θ*) ≤ (5/2) d κ_ε ln( 1 + n h_ε λmax(P₁) D_X² / d ) + 5 λmax(P⁻¹_{T(ε,δ)+1}) ε²
        + 30 ( 2κ_ε + h_ε ε² D_X² ) ln δ⁻¹ ,

where ∏_K^{P⁻¹_{t+1}} is the projection on K for the norm ‖·‖_{P⁻¹_{t+1}}.
This result, proved in Appendix B.1, is a corollary of a martingale inequality from Bercu and Touati (2008) and a stopping time construction (Freedman, 1975).
We detail the key ideas of the proofs in the rest of the section, and we defer to Appendix B the proofs of the intermediate results along with the detailed proofs of Theorems 1 and 2.
Specifically, we display in Section 3.1 the parallel with the ONS, where we compare with
the existing result on the cumulative risk, and a similar analysis yields an adversarial bound
on a second-order expansion of the loss. In Section 3.2 we compare the excess risk with
its second-order expansion thanks to Assumption 5, and we use a martingale analysis to
obtain a bound on the cumulative excess risk.
Corollary 4 Assume the search region K has diameter D and the gradients are bounded by R. Let (w_t)_t be the ONS estimates starting from w₁ ∈ K, P₁ = λI and using a step-size γ = (1/2) min( 1/(4RD), α ) with α the exp-concavity constant of ℓ on K. Then for any δ > 0, it holds for any n ≥ 1 simultaneously

    Σ_{t=1}^n L(w_t) − L(θ*) ≤ (3/(2γ)) d ln( 1 + nR²/(λd) ) + (λγ/6) D² + ( 12/γ + 4γR²D²/3 ) ln δ⁻¹ ,

For the sake of consistency, we display Corollary 4 as a bound on the cumulative excess risk,
whereas Theorem 3 of Mahdavi et al. (2015) is a bound on the excess risk of the averaged
ONS. The latter follows directly from an application of Jensen’s inequality. The proof of
Corollary 4 consists in replacing Theorem 4 of Mahdavi et al. (2015) with Lemma 3. We
obtain similar constants in Theorem 1 and in Corollary 4, as κε is the inverse of the exp-
concavity constant α. The use of second-order methods with well-tuned preconditioning is
crucial in order to replace the leading constant R²/µ obtained for first-order methods by d/α (µ is the minimum eigenvalue of the Hessian H*).
Our results on the static EKF are less general than the ones obtained on the ONS as
a control of the convergence time τ (ε) ≤ T (ε, δ) is required with high probability. On the
other hand the results obtained on the ONS require the knowledge of the exp-concavity
constant α whereas the static EKF is parameter-free. That is why we argue that the static
EKF provides an optimal way to tune the step size and the preconditioning matrix. Indeed,
as ε is a parameter of the EKF analysis but not of the algorithm, we can improve the
leading constant κ_ε on an arbitrarily small local region around θ*, at the cost of a loose control of the first T(ε, δ) steps. In the ONS the choice of a diameter D > ‖θ*‖ makes the gradient
step-size sub-optimal and impacts the leading constant.
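For contrast with the static EKF, a minimal sketch of an ONS-style step is given below. The function name is ours, and we use a plain Euclidean projection onto the ball of radius D as a simplified stand-in for the P⁻¹-norm projection that the ONS actually prescribes; the point is only to make the two tuning parameters γ and D visible.

```python
import numpy as np

def ons_step(w, A, grad, gamma, D):
    """One ONS-style step (sketch): rank-one preconditioner update, a
    Newton-like move of size 1/gamma, then projection onto {||w|| <= D}."""
    A = A + np.outer(grad, grad)                 # accumulate second-order information
    w = w - (1.0 / gamma) * np.linalg.solve(A, grad)
    norm = np.linalg.norm(w)
    if norm > D:                                 # simplified Euclidean projection
        w = w * (D / norm)
    return w, A
```

Both γ (tied to the exp-concavity constant) and D (the search region) must be known in advance, which is precisely what the static EKF avoids.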
Once the parallel between the ONS and the static EKF has been displayed (Algorithm 2),
it is natural to adopt an approach similar to the one in Hazan et al. (2007). The cornerstone
of our local analysis is the derivation of an adversarial bound on the second-order Taylor
expansion of `, from the recursive update formulae.
Lemma 5 For any sequence (X_t, y_t)_t, starting from any θ̂₁ ∈ R^d, P₁ ≻ 0, it holds for any θ* ∈ R^d and n ∈ N that

    Σ_{t=1}^n [ ℓ′(y_t, θ̂_t⊤X_t) X_t⊤(θ̂_t − θ*) − (1/2)(θ̂_t − θ*)⊤ ℓ″(y_t, θ̂_t⊤X_t) X_t X_t⊤ (θ̂_t − θ*) ]
        ≤ (1/2) Σ_{t=1}^n X_t⊤ P_{t+1} X_t ℓ′(y_t, θ̂_t⊤X_t)² + ‖θ̂₁ − θ*‖² / λmin(P₁) .
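Since Lemma 5 is a deterministic statement, it can be checked numerically; below is a sketch for the quadratic loss ℓ(y, p) = (y − p)²/2, where ℓ′ = −(y − p), ℓ″ = 1 and the static EKF reduces to the standard Kalman recursion. All variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200
theta_star = rng.normal(size=d)
theta, P = np.zeros(d), np.eye(d)              # theta_1, P_1

lhs = 0.0
# regularization term ||theta_1 - theta*||^2 / lambda_min(P_1)
rhs = np.dot(theta - theta_star, theta - theta_star) / np.linalg.eigvalsh(P).min()
for _ in range(n):
    X = rng.normal(size=d)
    y = X @ theta_star + 0.1 * rng.normal()
    g1 = -(y - theta @ X)                      # l'(y_t, theta_t^T X_t)
    err = X @ (theta - theta_star)
    lhs += g1 * err - 0.5 * err ** 2           # second-order expansion term (l'' = 1)
    PX = P @ X
    P = P - np.outer(PX, PX) / (1.0 + X @ PX)  # P_{t+1} update
    rhs += 0.5 * (X @ P @ X) * g1 ** 2         # X_t^T P_{t+1} X_t l'^2 term
    theta = theta - P @ X * g1                 # theta_{t+1} = theta_t - P_{t+1} grad_t
```

Running this, the accumulated left-hand side stays below the right-hand side, as the lemma guarantees for any data sequence.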
We cannot compare the excess loss with the second-order Taylor expansion in general, and it is natural to use a step size parameter. In Hazan et al. (2007), the regret analysis of the ONS is based on a very similar bound on

    ℓ′(y_t, w_t⊤X_t) X_t⊤(w_t − θ*) − (γ/2)(w_t − θ*)⊤ ℓ′(y_t, w_t⊤X_t)² X_t X_t⊤ (w_t − θ*) ,
where γ is a step size parameter. Then the regret bound follows from the exp-concavity property, bounding the excess loss ℓ(y_t, w_t⊤X_t) − ℓ(y_t, θ*⊤X_t) with the previous quantity for a specific γ. The dependence of γ on the exp-concavity constant and the bound on the gradients require that the algorithm stays in a bounded region around the optimum θ*, and a projection on this region is used, potentially at each step.
We follow a very different approach, to stay parameter-free, unconstrained and to avoid
any additional cost in the leading constant. In the stochastic setting, we observe that we can
upper-bound the excess risk with a second-order expansion, up to a multiplicative factor.
Lemma 5 motivates the use of c > 1/2, thus we need at least ρ_{‖θ−θ*‖} > 1/2. In the quadratic setting, it holds as an equality with ρ = 1 because the second derivative of the quadratic loss is constant. In the bounded setting we need to control the second derivative in a small range, and we can achieve that only locally; therefore we separate the condition ρ_{‖θ−θ*‖} > 1/2 between the third point of Assumption 3 and Assumption 5.
Then we are left to obtain a bound on the cumulative risk from Lemma 5. In order
to compare the derivatives of the risk and the losses, we need to control the martingale
difference adapted to the natural filtration (F_t) and defined as

    ∆M_t = ( ∂L/∂θ |_{θ̂_t} − ∇_t )⊤ (θ̂_t − θ*) ,  where ∇_t = ℓ′(y_t, θ̂_t⊤X_t) X_t .
We thus apply Lemma 3 to this martingale difference.
Lemma 8 Starting the static EKF from any θ̂₁ ∈ R^d, P₁ ≻ 0, if Assumptions 1 and 2 are satisfied, for any k ≥ 0 and δ, λ > 0, it holds simultaneously

    Σ_{t=k+1}^{k+n} ∆M_t − (3λ/2)(θ̂_t − θ*)⊤( ∇_t∇_t⊤ + E[∇_t∇_t⊤ | F_{t−1}] )(θ̂_t − θ*) ≤ ln δ⁻¹ / λ ,  n ≥ 1,    (2)

with probability at least 1 − δ, where we define ∇_t² = ℓ″(y_t, θ̂_t⊤X_t) X_t X_t⊤ for any t. In the last equation, we control (see Appendix B.4 and B.5) the quadratic term in θ̂_t − θ* on the left hand-side in terms of (θ̂_t − θ*)⊤ ∂²L/∂θ²|_{θ̂_t} (θ̂_t − θ*) in order to lower-bound the left expression proportionally to the cumulative excess risk using Proposition 7 for a well-chosen λ.
4. Logistic Setting
Logistic regression is a widely used statistical model in classification. The prediction of a
binary random variable y ∈ Y = {−1, 1} consists in modelling L(y | X) with

    p_θ(y | X) = 1 / (1 + e^{−yθ⊤X}) = exp( ( yθ⊤X − (2 ln(1 + e^{θ⊤X}) − θ⊤X) ) / 2 ) .

In the GLM notations, it yields a = 2 and b(θ⊤X) = 2 ln(1 + e^{θ⊤X}) − θ⊤X.
This modification yields Algorithm 3, where we keep the notations θ̂t , Pt , τ (ε) with some
abuse in the rest of this section. We impose a decreasing threshold on αt (β > 0) and we
prove that the recursion coincides with Algorithm 1 after some steps. The sensitivity of
the algorithm to β is discussed at the end of Section 4.2. Also, note that the threshold
could be c/tβ , c > 0, as in Bercu et al. (2020). We consider 1/tβ for clarity. We control the
convergence time τ (ε) of this version of the EKF:
Proposition 9 Starting Algorithm 3 from θ̂₁ = 0 and any P₁ ≻ 0, if Assumptions 1 and 2 are satisfied and E[XX⊤] is invertible, for any ε, δ > 0, it holds τ(ε) ≤ T(ε, δ) along with

    ∀t > T(ε, δ),  α_t = 1 / ( (1 + e^{θ̂_t⊤X_t})(1 + e^{−θ̂_t⊤X_t}) ) ,
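To fix ideas, one recursion of the truncated algorithm can be sketched as follows (y ∈ {−1, 1}; flooring the curvature α_t at 1/t^β is our reading of the truncation of Bercu et al. (2020), and the function names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def truncated_ekf_logistic_step(theta, P, X, y, t, beta=0.49):
    """One step of the truncated static EKF for logistic regression (sketch)."""
    z = theta @ X
    # curvature b''(z)/a = sigma(z)(1 - sigma(z)), floored at 1/t^beta (truncation)
    alpha = max(sigmoid(z) * (1.0 - sigmoid(z)), 1.0 / t ** beta)
    PX = P @ X
    P = P - alpha * np.outer(PX, PX) / (1.0 + alpha * X @ PX)
    # gradient of the logistic loss is -y sigma(-y z) X, hence the mean update
    theta = theta + P @ X * y * sigmoid(-y * z)
    return theta, P
```

Once the true curvature σ(z)(1 − σ(z)) exceeds the threshold 1/t^β, the step above is exactly the static EKF step, which is the content of Proposition 9.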
Besides the convergence of the truncated EKF, the proposition states that the truncated
recursions coincide with the static EKF ones after the first T (ε, δ) steps. Thus we can apply
our analysis of Section 3. We state the global result for ε = 1/(20DX ):
Theorem 10 Under the assumptions of Proposition 9, for any δ > 0, it holds for any n ≥ 1 simultaneously

    Σ_{t=1}^n L(θ̂_t) − L(θ*) ≤ 3d e^{D_X‖θ*‖} ln( 1 + n λmax(P₁)D_X²/(4d) ) + ( λmax(P₁⁻¹)/(75D_X²) + 64 e^{D_X‖θ*‖} ) ln δ⁻¹
        + T(1/(20D_X), δ) + (1/300) D_X ‖θ̂₁ − θ*‖² + T(1/(20D_X), δ) λmax(P₁)D_X²/2 ,
Proposition 11 Under the assumptions of Proposition 9, for any δ > 0, it holds simultaneously

    ∀t > ( (20D_X⁴/Λmin²) ln( 625 d D_X⁸ / (Λmin⁴ δ) ) )^{1/(1−β)} ,  λmax(P_t) ≤ 4 / (Λmin t^{1−β}) ,

with probability at least 1 − δ.
This proposition justifies the choice β < 1/2 in the introduction of the truncated algorithm to satisfy the condition Σ_t λmax(P_t)² < +∞ with high probability. Motivated by Proposition 11, we define, for C > 0, the event

    A_C := ∩_{t=1}^∞ { λmax(P_t) ≤ C / t^{1−β} } .
To obtain a control on P_t holding for any t, we use the relation λmax(P_t) ≤ λmax(P₁) holding almost surely. We thus define

    C_δ = max( 4/Λmin , λmax(P₁) (20D_X⁴/Λmin²) ln( 625 d D_X⁸ / (Λmin⁴ δ) ) ) ,

and we obtain P(A_{C_δ}) ≥ 1 − δ. We obtain the following theorem under that condition.
Theorem 12 Under the assumptions of Proposition 9, we have for any δ, ε > 0 and t ≥ exp( 2⁸ D_X⁸ C_δ² (1 + e^{D_X(‖θ*‖+ε)})³ / ( Λmin³ (1−2β)^{3/2} ε² ) ),

    P( ‖θ̂_t − θ*‖ > ε | A_{C_δ} ) ≤ (√t + 1) exp( − Λmin⁶ (1−2β) ε⁴ / ( 2¹⁶ D_X¹² C_δ² (1 + e^{D_X(‖θ*‖+ε)})⁶ ) · ln(t) )
        + t exp( − Λmin² (1−2β) ε⁴ / ( 2¹¹ D_X⁴ C_δ² (1 + e^{D_X(‖θ*‖+ε)})² ) · ( t^{(1−2β)/2} − 1 ) ) .
The beginning of our convergence proof is similar to the analysis of Bercu et al. (2020): we obtain a recursive inequality ensuring that (L(θ̂_t))_t is decreasing in expectation. However, in order to obtain a non-asymptotic result we cannot apply the Robbins-Siegmund theorem. Instead we use the fact that the variations of the algorithm θ̂_t are slow thanks to the control on P_t. Thus, if the algorithm were far from the optimum, the last estimates were far as well, which contradicts the decrease in expectation of the risk. Consequently, we look at the last k ≤ t such that ‖θ̂_k − θ*‖ < ε/2, if it exists. We decompose the probability of being outside the local region into two scenarios, yielding the two terms in Theorem 12. If k < √t, the recursive decrease in expectation makes it unlikely that the estimate stays far from the optimum for a long period. If k > √t, the control on P_t allows a control on the probability that the algorithm moves fast, in t − k steps, away from the optimum.
The following corollary explicitly defines a guarantee for the convergence time.
This definition of T(ε, δ) allows a discussion on the dependence of the bound of Theorem 10 on the different parameters. Note that the choice ε = 1/(20D_X) in Theorem 10 is made artificially to simplify constants, since the bound actually holds for any ε > 0 simultaneously. The truncation has introduced an extra parameter 0 < β < 1/2 that does not impact the leading term in Theorem 10. However, it impacts the first-step control in an intricate way. On the one hand, when β is close to 0, the algorithm is slow to coincide with the true EKF, as T(ε, δ) = e^{O(1)/β}. On the other hand, the larger β, the looser our control on λmax(P_t), and thus we get T(ε, δ) = e^{O(1)/(1−2β)^{3/2}}. Practical considerations show that the truncation is artificial and can even deteriorate the performance of the EKF, see Section 6. Thus Bercu et al. (2020) suggest choosing β = 0.49.
The dependence on δ is even more complex. The third constraint on T(ε, δ) is O(δ⁻¹), which should not be sharp. To improve this loose dependence in the bound, one needs a better control of P_t. It would follow from a specific analysis of the O(ln δ⁻¹) first recursions in order to "initialize" the control on P_t. However, the objective of Corollary 13 was to prove Proposition 9 and not to get an optimal value of T(ε, δ). A refinement of our convergence analysis, following from a tighter control on P_t of the EKF than the one provided by Tropp (2012), is a very important and challenging open question.
5. Quadratic Setting
We obtain a global result for the quadratic loss, where Algorithm 1 becomes the standard Kalman filter (recall that we take σ² = 1, that is ℓ(y, ŷ) = (y − ŷ)²/2 and a = 1, b′(θ̂_t⊤X_t) = θ̂_t⊤X_t, α_t = 1).
The parallel with the ridge forecaster was evoked by Diderrich (1985), and it is crucial
that the static Kalman filter is the ridge regression estimator for a decaying regularization
parameter. It highlights that the static EKF may be seen as an approximation of the
regularized ERM:
Proposition 14 In the quadratic setting, for any sequence (X_t, y_t), starting from any θ̂₁ ∈ R^d and P₁ ≻ 0, the static EKF solves the optimization problem

    θ̂_t = arg min_{θ∈R^d} ( (1/2) Σ_{s=1}^{t−1} (y_s − θ⊤X_s)² + (1/2)(θ − θ̂₁)⊤ P₁⁻¹ (θ − θ̂₁) ) ,  t ≥ 1.
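Proposition 14 can be checked numerically step by step; a sketch (variable names ours) comparing the Kalman recursion with the closed-form solution of the regularized least-squares problem:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
theta1 = rng.normal(size=d)
P1 = 0.5 * np.eye(d)
theta, P = theta1.copy(), P1.copy()
# normal equations of the ridge problem: A theta = b
A, b = np.linalg.inv(P1), np.linalg.inv(P1) @ theta1

for _ in range(30):
    X = rng.normal(size=d)
    y = rng.normal()
    # static Kalman recursion (quadratic loss, sigma^2 = 1, alpha_t = 1)
    PX = P @ X
    P = P - np.outer(PX, PX) / (1.0 + X @ PX)
    theta = theta + P @ X * (y - theta @ X)
    # ridge / regularized ERM solution over the samples seen so far
    A += np.outer(X, X)
    b += y * X
    assert np.allclose(theta, np.linalg.solve(A, b), atol=1e-8)
```

At every step the recursive estimate matches the batch ridge solution exactly, up to floating-point error, which is the content of the proposition.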
Note that the static Kalman filter automatically provides a right choice of the ridge regularization parameter. This equivalence yields a logarithmic regret bound for the Kalman filter (Theorem 11.7 of Cesa-Bianchi and Lugosi, 2006). It follows from Lemma 5, as the quadratic loss coincides with its second-order Taylor expansion. The leading term of the bound is d ln n max_t (y_t − θ̂_t⊤X_t)², thus y_t − θ̂_t⊤X_t needs to be bounded.
As the static Kalman filter estimator is exactly the ridge forecaster, we can also use
the regularized empirical risk minimization properties to control T (ε, δ). In particular, we
apply the ridge analysis of Hsu et al. (2012), and we check Assumption 5:
Proposition 15 Starting from any θ̂₁ ∈ R^d and P₁ ≻ 0, if Assumptions 1, 2 and 4 hold and if E[XX⊤] is invertible, then Assumption 5 holds for T(ε, δ) defined explicitly in Appendix D, Corollary 26.

Up to universal constants, defining Λmin as the smallest eigenvalue of E[XX⊤], we get

    T(ε, δ) ≲ h( (ε⁻¹/Λmin) ( ‖θ̂₁ − θ*‖²/p₁ + (D_X²/Λmin) ( (1 + Dapp) ln δ⁻¹ + σ²√d ) + (D_X/√Λmin)(Dapp + D_X‖θ*‖)( ‖θ̂₁ − θ*‖/√p₁ + σ )² ln δ⁻¹ ) ) ,
with h(x) = x ln x. We obtain a much less dramatic dependence on ε than in the logistic setting. However, we could not avoid a Λmin⁻¹ factor in the definition of T(ε, δ). It is not surprising since the convergence phase relies deeply on the behavior of P_t.
As for the logistic setting, we split the cumulative risk into two sums. The sum of the
first terms is roughly bounded by a worst case analysis, and the sum of the last terms is
estimated thanks to our local analysis (Theorem 2). However, as the loss and its gradient
are not bounded, we cannot obtain a similar almost sure upper bound on the convergence phase. The sub-Gaussian assumption provides a high probability bound instead.
Theorem 16 Under the assumptions of Proposition 15, for any ε, δ > 0, it holds simultaneously

    Σ_{t=1}^n L(θ̂_t) − L(θ*) ≤ (15/2) d ( 8σ² + Dapp² + ε²D_X² + 5λmax(P₁⁻¹)ε² ) ln( 1 + n λmax(P₁)D_X²/d )
        + 115 ( σ²(4 + λmax(P₁)D_X²/2) + Dapp² + 2ε²D_X² ) ln δ⁻¹
        + D_X⁴ ( 5ε² + 2( ‖θ̂₁ − θ*‖² + 3λmax(P₁)D_X² σ² ln δ⁻¹ )² ) T(ε, δ)
        + ( 2λmax(P₁)² D_X⁴ (3σ + Dapp)² / 3 ) T(ε, δ)³ ,  n ≥ 1,

with probability at least 1 − 6δ.
Note that the dependence on δ of the cumulative excess risk of the convergence phase is O((ln δ⁻¹)³).
6. Experiments
We experiment with the static EKF for logistic regression. Precisely, we compare the following sequential algorithms, all initialized at 0:
• The static EKF and the truncated version (Algorithm 3). We take the default value P₁ = I_d along with the value β = 0.49 suggested by Bercu et al. (2020). Note that with a threshold 10⁻¹⁰/t^{0.49}, as recommended by Bercu et al. (2020), the truncated version would coincide with the static EKF.
• The ONS and the averaged version. The convex region of search is a ball centered at 0 and of radius D_θ = 1.1‖θ*‖, a setting where we have good knowledge of θ*. We consider two choices of the exp-concavity constant on which the ONS crucially relies to define the gradient step size. First, we use the only available bound e^{−D_θD_X}. Second, in the settings where the step size is so small that the ONS doesn't move, we use the exp-concavity constant κ₀ at θ*. This yields a bigger step size, though the exp-concavity is not satisfied on the region of search.
Figure 1: Density of the Bernoulli parameter on 10⁷ samples: on the left and in the middle, the density of (1 + e^{−θ*⊤X})⁻¹ for the two well-specified settings (left, the ordinate is in log scale), and on the right the density of (1 + e^{−θ_j⊤X_t})⁻¹ with j ∈ {1, 2} uniformly at random for the misspecified setting. On the right we observe the two modes E[(1 + e^{−θ₁⊤X_t})⁻¹] ≈ 0.28 and E[(1 + e^{−θ₂⊤X_t})⁻¹] ≈ 0.79.
• Well-specified 1: we define θ* = (−9, 0, 3, −9, 4, −9, 15, 0, −7, 1, 0)⊤, and at each iteration t, the variable y_t ∈ {−1, 1} is a Bernoulli variable of parameter (1 + e^{−θ*⊤X_t})⁻¹.
• Well-specified 2: in the first well-specified setting the Bernoulli parameter is mostly distributed around 0 and 1 (see Figure 1), thus we try a less discriminated setting with θ* = (1/10)(−9, 0, 3, −9, 4, −9, 15, 0, −7, 1, 0)⊤.
• Misspecified: in order to demonstrate the robustness of the EKF, we test the algorithms in a misspecified setting switching randomly between two well-specified logistic processes. We define θ₁ = (1/10)(−9, 0, 3, −9, 4, −9, 15, 0, −7, 1, 0)⊤ and θ₂ where we have only changed the first coefficient from −9/10 to 15/10. Then y_t is a Bernoulli random variable whose parameter is either (1 + e^{−θ₁⊤X_t})⁻¹ or (1 + e^{−θ₂⊤X_t})⁻¹, uniformly at random. We checked that Assumption 2 is still satisfied.
We evaluate the different algorithms with the mean squared error E[‖θ̂_t − θ*‖²], which we approximate by its empirical version over 100 samples. We display the results in Figure 2.
Figure 2: Mean squared error in log-log scale for the three synthetic settings. For the first well-specified setting (left) the ONS is applied using the exp-concavity constant κ₀ ≈ 1.7 · 10⁻¹⁵ instead of e^{−D_θ√d} to accelerate the algorithm, and both the ONS and its averaged version still don't move. In the other two (middle and right) we use e^{−D_θ√d} for the ONS. We observe that the EKF and the truncated version coincide in the two last settings.
obtain d = 98 once categorical variables are transformed into binary variables. We use
the canonical split between training (32561 instances) and testing (16281 instances).
For each data set, we standardize X such that each feature ranges from 0 to 1. At each step
we sample within the training set (with replacement). We evaluate through an empirical version of E[L(θ̂_n)] − L(θ*), estimated on 100 samples and where L is estimated on the test set; see Figure 3.
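The feature scaling described above can be sketched as follows (a minimal helper, name ours; it maps each feature to the range [0, 1] as used in the experiments):

```python
import numpy as np

def minmax_scale(X):
    """Scale each column of X to [0, 1]; constant columns are left at 0."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)
```

In practice the scaling parameters would be fitted on the training split and reused on the test split.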
6.3 Summary
Our experiments show the superiority of the EKF for logistic regression compared to the
ONS or to averaged SGD in all the settings we tested. We display in Table 2 a few indicators
of the data sets. In particular, it is interesting that the static EKF works well even in a
setting where the Hessian matrix H* is singular.
It appears clearly that low exp-concavity constants are responsible for the poor performance of the ONS. One may tune the gradient step size at the cost of losing the exp-concavity property and thus the regret guarantee of Hazan et al. (2007) or its analogue for the cumulative risk (Mahdavi et al., 2015). Averaging is crucial for the ONS, whereas it is useless for the static EKF. Indeed we chose not to plot the averaged version of the EKF for clarity, but the EKF performs better than its averaged version.
It is important to note that in the first synthetic setting the truncation deteriorates
the performance of the EKF, as well as in the adult income data set to a lesser extent,
whereas the results are the same in the other settings. Bercu et al. (2020) argue that the
truncation is artificially introduced for the convergence property, thus they use the threshold 10⁻¹⁰/t^{0.49} instead of 1/t^{0.49} and the truncated version almost coincides with the true EKF.
We confirm here that the truncation may be damaging if the threshold is set too high and
Figure 3: Excess test risk for forest cover type (left) and adult income (right). As the ONS doesn't move when applied with the exp-concavity constant e^{−D_θD_X}, we use instead the exp-concavity constant at θ*: κ₀ ≈ 1.4 · 10⁻³ for forest cover type and κ₀ ≈ 5.5 · 10⁻⁶ for adult income. The EKF and the truncated version almost coincide for both data sets.
Table 2: For the different experimental settings we display the dimension d and the condition number of the Hessian at θ* (λmax(H*) and µ are the maximal and minimal eigenvalues of H*). We present the value of tr(G*H*⁻¹), which is bounded either by R²/µ or by d e^{D_θD_X}, because e^{−D_θD_X} bounds the exp-concavity constant on the centered ball of radius D_θ. We add to the table dκ₀ ≤ d e^{D_θD_X}, where κ₀ is the inverse of the exp-concavity constant of the loss at θ*.
Stochastic Online Optimization using Kalman Recursion
we recommend using the EKF in practice, or equivalently the truncated version with the low threshold suggested by Bercu et al. (2020).
7. Conclusion
We studied an efficient way to tackle some unconstrained optimization problems, in which we get rid of the projection step of bounded algorithms such as the ONS. We presented a Bayesian approach where we transformed the loss into a negative log-likelihood. We used the Kalman recursion to provide a parameter-free approximation of the maximum-likelihood estimator.
We demonstrated the optimality of the local phase for locally exp-concave losses which can
be expressed as GLM log-likelihoods. We proved the finiteness of the convergence phase
in logistic and quadratic regressions. We illustrated our theoretical results with numerical
experiments for logistic regression. It would be interesting to generalize our results to a
larger class of optimization problems.
Finally, this article aimed at strengthening the bridge between the Kalman recursion and the optimization community. Therefore we made the i.i.d. assumption, standard in the stochastic optimization literature, and we focused on the static EKF. It may lead the way to a risk analysis of the general EKF in non-i.i.d. state-space models.
We focus on the static setting where the state equation becomes θt+1 = θt, thus we have θ̂t+1 = θ̂t|t and Pt+1 = Pt|t. We rewrite the update on Pt as follows:
\[
P_{t+1} = P_t - \frac{P_t X_t X_t^\top P_t\, b''(\hat\theta_t^\top X_t)/a}{X_t^\top P_t X_t\, b''(\hat\theta_t^\top X_t)/a + 1}\,.
\]
Moreover we have \(P_{t+1} X_t = P_t X_t F_t^{-1}\, a\, b''(\hat\theta_t^\top X_t)\), thus we can rewrite the update on θ̂t as follows:
\[
\hat\theta_{t+1} = \hat\theta_t + \frac{P_{t+1} X_t \big(y_t - b'(\hat\theta_t^\top X_t)\big)}{a}\,.
\]
This yields Algorithm 1.
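As a concrete illustration, the two updates above take only a few lines of code. The following is a minimal sketch for logistic regression, assuming a = 1, b' the sigmoid, and labels y_t ∈ {0, 1}; the function name and the synthetic data below are our own illustrative choices, not the paper's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def static_ekf_logistic(X, y, theta1=None, P1=None):
    """Static EKF (constant-state Kalman recursion) for logistic regression:
    a = 1, b'(z) = sigmoid(z), b''(z) = sigmoid(z) * (1 - sigmoid(z))."""
    n, d = X.shape
    theta = np.zeros(d) if theta1 is None else np.array(theta1, dtype=float)
    P = np.eye(d) if P1 is None else np.array(P1, dtype=float)
    for t in range(n):
        x = X[t]
        p = sigmoid(theta @ x)                  # b'(theta_t^T X_t)
        h = p * (1.0 - p)                       # b''(theta_t^T X_t)
        Px = P @ x
        # P_{t+1} = P_t - P_t X_t X_t^T P_t b'' / (X_t^T P_t X_t b'' + 1)
        P = P - np.outer(Px, Px) * h / (x @ Px * h + 1.0)
        # theta_{t+1} = theta_t + P_{t+1} X_t (y_t - b'(theta_t^T X_t))
        theta = theta + (P @ x) * (y[t] - p)
    return theta

# Demo on synthetic data (illustrative choice of truth and distribution)
rng = np.random.default_rng(1)
theta_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(5000, 3))
y = (rng.uniform(size=5000) < sigmoid(X @ theta_star)).astype(float)
theta_hat = static_ekf_logistic(X, y)
```

Note that the recursion is parameter-free: no step size has to be tuned, the only inputs are the prior mean θ̂1 and prior covariance P1.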
First, we have
\[
\sum_{n=1}^{k} \mathbb{P}\big(\bar{A}_{n-1} \cap E_{n-1}\big) \le \mathbb{P}\Big(\bigcup_{n=0}^{k-1} \bar{A}_n\Big)\,.
\]
The second line is obtained since En ⊂ En−1 ∩ An−1. According to the tower property and the super-martingale assumption,
\[
\mathbb{E}\big[V_n \mathbf{1}_{E_{n-1}\cap A_{n-1}}\big] = \mathbb{E}\big[\mathbb{E}[V_n \mid \mathcal{F}_{n-1}, A_{n-1}]\,\mathbf{1}_{E_{n-1}\cap A_{n-1}}\big] \le \mathbb{E}\big[\mathbb{E}[V_n \mid \mathcal{F}_{n-1}, A_{n-1}]\,\mathbf{1}_{E_{n-1}}\big] \le \mathbb{E}\big[V_{n-1}\,\mathbf{1}_{E_{n-1}}\big]\,.
\]
\[
\sum_{n=1}^{k} \mathbb{P}\big((V_n > \delta^{-1}) \cap E_{n-1} \cap A_{n-1}\big) \le \delta\,.
\]
\[
\mathbb{P}(E_k) \le \mathbb{P}\Big(\bigcup_{n=0}^{k-1} \bar{A}_n\Big) + \delta\,.
\]
\[
V_n = \exp\bigg(\sum_{t=k+1}^{k+n} \Big(\lambda\, \Delta N_t - \frac{\lambda^2}{2}\big((\Delta N_t)^2 + \mathbb{E}[(\Delta N_t)^2 \mid \mathcal{F}_{t-1}]\big)\Big)\bigg)\,.
\]
Lemma B.1 of Bercu and Touati (2008) states that (Vn) is a super-martingale adapted to the filtration (Fk+n). Moreover V0 = 1 and for any n, it holds Vn ≥ 0 almost surely. Therefore we can apply Lemma 17.
We bound the telescopic sum: as \(P_{n+1}^{-1} \succ 0\), we have
\[
\sum_{t=1}^{n} \Big((\hat\theta_t - \theta^\star)^\top P_t^{-1} (\hat\theta_t - \theta^\star) - (\hat\theta_{t+1} - \theta^\star)^\top P_{t+1}^{-1} (\hat\theta_{t+1} - \theta^\star)\Big) \le (\hat\theta_1 - \theta^\star)^\top P_1^{-1} (\hat\theta_1 - \theta^\star) \le \frac{\|\hat\theta_1 - \theta^\star\|^2}{\lambda_{\min}(P_1)}\,.
\]
The result follows from the identities
\[
\mathbb{E}\Big[-\frac{(y - b'(\theta^{\star\top} X))\,X}{a}\Big] = 0\,,
\]
Considering the function \(f : \lambda \mapsto (\theta - \theta^\star)^\top \mathbb{E}\big[X\, b'\big(\theta^\top X + \lambda(\theta^\star - \theta)^\top X\big)\big]\), we know there exists λ ∈ [0, 1] such that f′(λ) = f(1) − f(0). This translates into
\[
\frac{\partial L}{\partial \theta}\Big|_\theta^{\!\top} (\theta - \theta^\star) = \frac{1}{a}\,(\theta - \theta^\star)^\top \mathbb{E}\Big[X\, b''\big(\theta^\top X + \lambda(\theta^\star - \theta)^\top X\big)\,(\theta - \theta^\star)^\top X\Big]\,.
\]
Then we use Assumption 3:
yielding
\[
\frac{\partial L}{\partial \theta}\Big|_\theta^{\!\top} (\theta - \theta^\star) \ge \rho_{\|\theta-\theta^\star\|}\, (\theta - \theta^\star)^\top \mathbb{E}\big[\ell''(y, \theta^\top X)\, X X^\top\big]\, (\theta - \theta^\star) = \rho_{\|\theta-\theta^\star\|}\, (\theta - \theta^\star)^\top \frac{\partial^2 L}{\partial \theta^2}\Big|_\theta (\theta - \theta^\star)\,.
\]
Proof of Proposition 7. We first recall that \(L(\theta) - L(\theta^\star) \le \frac{\partial L}{\partial \theta}\big|_\theta^{\!\top}(\theta - \theta^\star)\), then Proposition 6 yields
\[
\frac{\partial L}{\partial \theta}\Big|_\theta^{\!\top} (\theta - \theta^\star) - c\,(\theta - \theta^\star)^\top \frac{\partial^2 L}{\partial \theta^2}\Big|_\theta (\theta - \theta^\star) \ge \Big(1 - \frac{c}{\rho_{\|\theta-\theta^\star\|}}\Big) \frac{\partial L}{\partial \theta}\Big|_\theta^{\!\top} (\theta - \theta^\star)\,,
\]
It yields
\[
(\Delta M_t)^2 + \mathbb{E}[(\Delta M_t)^2 \mid \mathcal{F}_{t-1}] \le (\hat\theta_t - \theta^\star)^\top \big(3\,\mathbb{E}[\nabla_t \nabla_t^\top \mid \mathcal{F}_{t-1}] + 2\,\nabla_t \nabla_t^\top\big)\, (\hat\theta_t - \theta^\star)\,,
\]
We derive the following Lemma in order to control the right-hand side of Lemma 5, in
both settings.
Lemma 18 Assume the second point of Assumption 3 holds. For any k, n ≥ 1, if ‖θ̂t − θ?‖ ≤ ε for any k < t ≤ k + n, then we have
\[
\sum_{t=k+1}^{k+n} \mathrm{Tr}\big(P_{t+1}(P_{t+1}^{-1} - P_t^{-1})\big) \le d \ln\Big(1 + n\,\frac{h_\varepsilon\, \lambda_{\max}(P_{k+1})\, D_X^2}{d}\Big)\,.
\]
\[
\sum_{t=k+1}^{k+n} \mathrm{Tr}\big(P_{t+1}(P_{t+1}^{-1} - P_t^{-1})\big) = \sum_{t=k+1}^{k+n} \bigg(1 - \frac{\det(P_t^{-1})}{\det(P_{t+1}^{-1})}\bigg) \le \sum_{t=k+1}^{k+n} \ln\bigg(\frac{\det(P_{t+1}^{-1})}{\det(P_t^{-1})}\bigg) = \ln\bigg(\frac{\det(P_{k+n+1}^{-1})}{\det(P_{k+1}^{-1})}\bigg) \\
\le \ln\det\bigg(I + \sum_{t=k+1}^{k+n} \ell''(y_t, \hat\theta_t^\top X_t)\,(P_{k+1}^{1/2} X_t)(P_{k+1}^{1/2} X_t)^\top\bigg) = \sum_{i=1}^{d} \ln(1 + \lambda_i)\,,
\]
where λ1, ..., λd are the eigenvalues of \(\sum_{t=k+1}^{k+n} \ell''(y_t, \hat\theta_t^\top X_t)\,(P_{k+1}^{1/2} X_t)(P_{k+1}^{1/2} X_t)^\top\). Therefore we have
\[
\sum_{t=k+1}^{k+n} \mathrm{Tr}\big(P_{t+1}(P_{t+1}^{-1} - P_t^{-1})\big) \le d \ln\bigg(1 + \frac{1}{d}\sum_{i=1}^{d} \lambda_i\bigg) \le d \ln\Big(1 + \frac{1}{d}\, n\, h_\varepsilon\, \lambda_{\max}(P_{k+1})\, D_X^2\Big)\,.
\]
\[
\sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} \bigg(\mathbb{E}[\nabla_t \mid \mathcal{F}_{t-1}]^\top (\hat\theta_t - \theta^\star) - \frac{1}{2} Q_t - \lambda\, (\hat\theta_t - \theta^\star)^\top \Big(\nabla_t \nabla_t^\top + \frac{3}{2}\,\mathbb{E}\big[\nabla_t \nabla_t^\top \mid \mathcal{F}_{t-1}\big]\Big)(\hat\theta_t - \theta^\star)\bigg) \\
\le \frac{1}{2} \sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} X_t^\top P_{t+1} X_t\, \ell'(y_t, \hat\theta_t^\top X_t)^2 + \frac{\|\hat\theta_1 - \theta^\star\|^2}{\lambda_{\min}(P_{T(\varepsilon,\delta)+1})} + \frac{\ln \delta^{-1}}{\lambda}\,, \quad n \ge 1\,, \qquad (3)
\]
with probability at least 1 − δ, where we define \(Q_t = (\hat\theta_t - \theta^\star)^\top \ell''(y_t, \hat\theta_t^\top X_t)\, X_t X_t^\top (\hat\theta_t - \theta^\star)\) for any t.
On the other hand, thanks to Assumption 3, we can apply Proposition 7 with c = 0.75 to obtain, for any t ≥ 1,
\[
\|\hat\theta_t - \theta^\star\| \le \varepsilon \;\Longrightarrow\; L(\hat\theta_t) - L(\theta^\star) \le \frac{\rho_\varepsilon}{\rho_\varepsilon - 0.75}\bigg(\frac{\partial L}{\partial \theta}\Big|_{\hat\theta_t}^{\!\top}(\hat\theta_t - \theta^\star) - 0.75\,(\hat\theta_t - \theta^\star)^\top \frac{\partial^2 L}{\partial \theta^2}\Big|_{\hat\theta_t}(\hat\theta_t - \theta^\star)\bigg)\,,
\]
\[
\phantom{\|\hat\theta_t - \theta^\star\| \le \varepsilon} \;\Longrightarrow\; L(\hat\theta_t) - L(\theta^\star) \le 5\,\big(\mathbb{E}[\nabla_t \mid \mathcal{F}_{t-1}]^\top (\hat\theta_t - \theta^\star) - 0.75\,\mathbb{E}[Q_t \mid \mathcal{F}_{t-1}]\big)\,. \qquad (4)
\]
\[
\mathbb{E}\bigg[\exp\Big(\frac{s}{h_\varepsilon \varepsilon^2 D_X^2}\, Q_t - \frac{e^s - 1}{h_\varepsilon \varepsilon^2 D_X^2}\, \mathbb{E}[Q_t \mid \mathcal{F}_{t-1}]\Big) \,\Big|\, \mathcal{F}_{t-1},\, \|\hat\theta_t - \theta^\star\| \le \varepsilon\bigg] \le 1\,.
\]
The sequence (Vn) is adapted to (FT(ε,δ)+n), almost surely we have V0 = 1 and Vn ≥ 0. Finally,
\[
\mathbb{E}\big[V_n \mid \mathcal{F}_{T(\varepsilon,\delta)+n-1},\, \|\hat\theta_{T(\varepsilon,\delta)+n} - \theta^\star\| \le \varepsilon\big] \le V_{n-1}\,,
\]
We define \(A_k^\varepsilon = \bigcap_{n=k+1}^{\infty} \big(\|\hat\theta_n - \theta^\star\| \le \varepsilon\big)\) for any k. The last inequality is equivalent to
\[
\mathbb{P}\Bigg(\bigcup_{n=1}^{\infty}\bigg\{\sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} Q_t > 10(e^{0.1} - 1) \sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} \mathbb{E}[Q_t \mid \mathcal{F}_{t-1}] + 10\, h_\varepsilon \varepsilon^2 D_X^2 \ln \delta^{-1}\bigg\} \cap A^\varepsilon_{T(\varepsilon,\delta)}\Bigg) \le \delta\,. \qquad (5)
\]
We then bound the two quadratic terms coming from Lemma 8: using Assumption 3
we have the implications
\[
\mathbb{P}\Bigg(\bigcup_{n=1}^{\infty}\bigg\{\sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} \Big(\frac{1}{2} Q_t + \lambda\,(\hat\theta_t - \theta^\star)^\top \nabla_t \nabla_t^\top (\hat\theta_t - \theta^\star) + \frac{3}{2}\lambda\,(\hat\theta_t - \theta^\star)^\top \mathbb{E}\big[\nabla_t \nabla_t^\top \mid \mathcal{F}_{t-1}\big](\hat\theta_t - \theta^\star)\Big) \\
> \Big(10(e^{0.1}-1)\Big(\frac{1}{2} + \lambda\kappa_\varepsilon\Big) + \frac{3}{2}\lambda\kappa_\varepsilon\Big) \sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} \mathbb{E}[Q_t \mid \mathcal{F}_{t-1}] + 10\Big(\frac{1}{2} + \lambda\kappa_\varepsilon\Big) h_\varepsilon \varepsilon^2 D_X^2 \ln \delta^{-1}\bigg\} \cap A^\varepsilon_{T(\varepsilon,\delta)}\Bigg) \le \delta\,.
\]
We set \(\lambda = \frac{0.75 - 5(e^{0.1}-1)}{\big(10(e^{0.1}-1) + \frac{3}{2}\big)\kappa_\varepsilon}\), so that
\[
10(e^{0.1}-1)\Big(\frac{1}{2} + \lambda\kappa_\varepsilon\Big) + \frac{3}{2}\lambda\kappa_\varepsilon = 0.75\,, \qquad
\frac{1}{2} + \lambda\kappa_\varepsilon = \frac{1}{2} + \frac{0.75 - 5(e^{0.1}-1)}{10(e^{0.1}-1) + \frac{3}{2}} \approx 0.59 \le 0.6\,,
\]
and consequently
\[
\mathbb{P}\Bigg(\bigcup_{n=1}^{\infty}\bigg\{\sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} \big(\mathbb{E}[\nabla_t \mid \mathcal{F}_{t-1}]^\top (\hat\theta_t - \theta^\star) - 0.75\,\mathbb{E}[Q_t \mid \mathcal{F}_{t-1}]\big) > 6\, h_\varepsilon \varepsilon^2 D_X^2 \ln \delta^{-1} \\
+ \sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} \bigg(\mathbb{E}[\nabla_t \mid \mathcal{F}_{t-1}]^\top (\hat\theta_t - \theta^\star) - \frac{1}{2} Q_t - \lambda\,(\hat\theta_t - \theta^\star)^\top \Big(\nabla_t \nabla_t^\top + \frac{3}{2}\mathbb{E}\big[\nabla_t \nabla_t^\top \mid \mathcal{F}_{t-1}\big]\Big)(\hat\theta_t - \theta^\star)\bigg)\bigg\} \cap A^\varepsilon_{T(\varepsilon,\delta)}\Bigg) \le \delta\,.
\]
\[
\mathbb{P}\Bigg(\bigcup_{n=1}^{\infty}\bigg\{\sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} \big(L(\hat\theta_t) - L(\theta^\star)\big) > 30\, h_\varepsilon \varepsilon^2 D_X^2 \ln \delta^{-1} \\
+ 5 \sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} \bigg(\mathbb{E}[\nabla_t \mid \mathcal{F}_{t-1}]^\top (\hat\theta_t - \theta^\star) - \frac{1}{2} Q_t - \lambda\,(\hat\theta_t - \theta^\star)^\top \Big(\nabla_t \nabla_t^\top + \frac{3}{2}\mathbb{E}\big[\nabla_t \nabla_t^\top \mid \mathcal{F}_{t-1}\big]\Big)(\hat\theta_t - \theta^\star)\bigg)\bigg\} \cap A^\varepsilon_{T(\varepsilon,\delta)}\Bigg) \le \delta\,.
\]
\[
\mathbb{P}\Bigg(\bigcup_{n=1}^{\infty}\bigg\{\sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} \big(L(\hat\theta_t) - L(\theta^\star)\big) > \frac{5}{2}\sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} X_t^\top P_{t+1} X_t\, \ell'(y_t, \hat\theta_t^\top X_t)^2 \\
+ \frac{5\,\|\hat\theta_1 - \theta^\star\|^2}{\lambda_{\min}(P_{T(\varepsilon,\delta)+1})} + 30\big(2\kappa_\varepsilon + h_\varepsilon \varepsilon^2 D_X^2\big)\ln\delta^{-1}\bigg\} \cap A^\varepsilon_{T(\varepsilon,\delta)}\Bigg) \le 2\delta\,.
\]
\[
\sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} X_t^\top P_{t+1} X_t\, \ell'(y_t, \hat\theta_t^\top X_t)^2 \le d\,\kappa_\varepsilon \ln\Big(1 + n\,\frac{h_\varepsilon\,\lambda_{\max}(P_{T(\varepsilon,\delta)+1})\, D_X^2}{d}\Big)\,.
\]
As \(P_{T(\varepsilon,\delta)+1} \preceq P_1\), we obtain
\[
\mathbb{P}\Bigg(\bigcup_{n=1}^{\infty}\bigg\{\sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} \big(L(\hat\theta_t) - L(\theta^\star)\big) > \frac{5}{2}\, d\,\kappa_\varepsilon \ln\Big(1 + n\,\frac{h_\varepsilon\,\lambda_{\max}(P_1)\, D_X^2}{d}\Big) \\
+ \frac{5\,\|\hat\theta_1 - \theta^\star\|^2}{\lambda_{\min}(P_{T(\varepsilon,\delta)+1})} + 30\big(2\kappa_\varepsilon + h_\varepsilon \varepsilon^2 D_X^2\big)\ln\delta^{-1}\bigg\} \cap A^\varepsilon_{T(\varepsilon,\delta)}\Bigg) \le 2\delta\,.
\]
The sub-gaussian hypothesis requires a different treatment of several steps in the proof. In the following proofs, we use a consequence of the first points of Assumption 4. We apply Lemma 1.4 of Rigollet and Hütter (2015): for any X ∈ Rd,
\[
\mathbb{E}\big[|y - \mathbb{E}[y \mid X]|^k \,\big|\, X\big] \le (2\sigma^2)^{k/2}\, k\, \Gamma(k/2)\,, \quad k \ge 1\,. \qquad (6)
\]
First, we control the quadratic terms in ∇t = −(yt − θ̂t> Xt )Xt in the following lemma.
Lemma 19 For any k ≥ 1,
\[
\mathbb{P}\Bigg(\bigcup_{n=1}^{\infty}\bigg\{\sum_{t=k+1}^{k+n} (\hat\theta_t - \theta^\star)^\top \nabla_t \nabla_t^\top (\hat\theta_t - \theta^\star) > 3\big(8\sigma^2 + D_{\mathrm{app}}^2 + \varepsilon^2 D_X^2\big) \sum_{t=k+1}^{k+n} Q_t + 12\,\varepsilon^2 D_X^2 \sigma^2 \ln \delta^{-1}\bigg\} \cap A_k^\varepsilon\Bigg) \le \delta\,.
\]
Proof To obtain the last inequality, we use the second point of Assumption 4 to bound the middle term. Then we use the Taylor series of the exponential, and we apply Equation (6). For any t and any µ satisfying \(0 < \mu \le \frac{1}{4\, Q_t\, \sigma^2}\), we have
(Vn)n is adapted to the filtration (σ(X1, y1, ..., Xk+n, yk+n, Xk+n+1))n, moreover V0 = 1 and Vn ≥ 0 almost surely, and
\[
\mathbb{E}\big[V_n \mid X_1, y_1, \dots, X_{k+n-1}, y_{k+n-1}, X_{k+n},\, \|\hat\theta_{k+n} - \theta^\star\| \le \varepsilon\big] \le V_{n-1}\,,
\]
which is equivalent to
\[
\mathbb{P}\Bigg(\bigcup_{n=1}^{\infty}\bigg\{\sum_{t=k+1}^{k+n} Q_t\,(y_t - \mathbb{E}[y_t \mid X_t])^2 > 8\sigma^2 \sum_{t=k+1}^{k+n} Q_t + 4\,\varepsilon^2 D_X^2 \sigma^2 \ln \delta^{-1}\bigg\} \cap A_k^\varepsilon\Bigg) \le \delta\,.
\]
Assumption 4 implies that for any Xt , E[(yt − E[yt | Xt ])2 | Xt ] ≤ σ 2 . Thus, the tower
property yields
Second, we bound the right-hand side of Lemma 5, which is the objective of the following lemma.
\[
\mathbb{P}\Bigg(\bigcup_{n=1}^{\infty}\bigg\{\sum_{t=k+1}^{k+n} X_t^\top P_{t+1} X_t\,(y_t - \hat\theta_t^\top X_t)^2 > 12\,\lambda_{\max}(P_1)\, D_X^2 \sigma^2 \ln \delta^{-1} \\
+ 3\big(8\sigma^2 + D_{\mathrm{app}}^2 + \varepsilon^2 D_X^2\big)\, d \ln\Big(1 + n\,\frac{\lambda_{\max}(P_{k+1})\, D_X^2}{d}\Big)\bigg\} \cap A_k^\varepsilon\Bigg) \le \delta\,.
\]
Proof We apply a similar analysis as in the proof of Lemma 19 in order to use the sub-
gaussian assumption, and then we apply the telescopic argument as in the bounded setting.
We decompose \(y_t - \hat\theta_t^\top X_t\):
To control \((y_t - \mathbb{E}[y_t \mid X_t])^2\, X_t^\top P_{t+1} X_t\), we use its positivity along with Equation (6). Precisely, for any t and any µ satisfying \(0 < \mu \le \frac{1}{4\, X_t^\top P_{t+1} X_t\, \sigma^2}\), we have
\[
\begin{aligned}
\mathbb{E}\big[\exp\big(\mu\,(y_t - \mathbb{E}[y_t \mid X_t])^2\, X_t^\top P_{t+1} X_t\big) \mid \mathcal{F}_{t-1}, X_t\big]
&= 1 + \sum_{i \ge 1} \frac{\mu^i (X_t^\top P_{t+1} X_t)^i\, \mathbb{E}\big[(y_t - \mathbb{E}[y_t \mid X_t])^{2i} \mid X_t\big]}{i!} \\
&\le 1 + 2\sum_{i \ge 1} \frac{\mu^i (X_t^\top P_{t+1} X_t)^i\, i!\,(2\sigma^2)^i}{i!} \\
&= 1 + 2\sum_{i \ge 1} \big(2\mu\, X_t^\top P_{t+1} X_t\, \sigma^2\big)^i \\
&\le 1 + 8\mu\, X_t^\top P_{t+1} X_t\, \sigma^2\,, \qquad 0 < 2\mu\, X_t^\top P_{t+1} X_t\, \sigma^2 \le \frac{1}{2}\,, \\
&\le \exp\big(8\mu\, X_t^\top P_{t+1} X_t\, \sigma^2\big)\,.
\end{aligned}
\]
We apply the previous bound with a uniform \(\mu = \frac{1}{4\,\lambda_{\max}(P_1)\, D_X^2\, \sigma^2}\). As λmax(Pt+1) ≤ λmax(P1) for any t, we get \(\mu \le \frac{1}{4\, X_t^\top P_{t+1} X_t\, \sigma^2}\). Thus, we define
\[
V_n = \exp\bigg(\frac{1}{4\,\lambda_{\max}(P_1)\, D_X^2\, \sigma^2} \sum_{t=k+1}^{k+n} \big((y_t - \mathbb{E}[y_t \mid X_t])^2 - 8\sigma^2\big)\, X_t^\top P_{t+1} X_t\bigg)\,, \quad n \in \mathbb{N}\,.
\]
(Vn) is a super-martingale adapted to the filtration (σ(X1, y1, ..., Xk+n−1, yk+n−1, Xk+n))n satisfying almost surely V0 = 1, Vn ≥ 0, thus we apply Lemma 17:
\[
\mathbb{P}\Big(\bigcup_{n=1}^{\infty} \big(V_n > \delta^{-1}\big)\Big) \le \delta\,,
\]
or equivalently
\[
\mathbb{P}\Bigg(\bigcup_{n=1}^{\infty}\bigg\{\sum_{t=k+1}^{k+n} X_t^\top P_{t+1} X_t\,(y_t - \mathbb{E}[y_t \mid X_t])^2 > 8\sigma^2 \sum_{t=k+1}^{k+n} X_t^\top P_{t+1} X_t + 4\,\lambda_{\max}(P_1)\, D_X^2 \sigma^2 \ln \delta^{-1}\bigg\}\Bigg) \le \delta\,.
\]
Then we apply Lemma 18: the second point of Assumption 3 holds with hε = 1, thus
\[
\sum_{t=k+1}^{k+n} \mathrm{Tr}\big(P_{t+1}(P_{t+1}^{-1} - P_t^{-1})\big) \le d \ln\Big(1 + n\,\frac{\lambda_{\max}(P_{k+1})\, D_X^2}{d}\Big)\,, \quad n \ge 1\,.
\]
We conclude with \(X_t^\top P_{t+1} X_t = \mathrm{Tr}\big(P_{t+1}(P_{t+1}^{-1} - P_t^{-1})\big)\).
We sum up our findings and prove the result for the quadratic loss. The structure of the proof is the same as that of Theorem 1.
Proof of Theorem 2. On the one hand, we sum Lemma 5 and Lemma 8: for any λ, δ > 0,
\[
\sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} \bigg(\mathbb{E}[\nabla_t \mid \mathcal{F}_{t-1}]^\top (\hat\theta_t - \theta^\star) - \frac{1}{2} Q_t - \lambda\,(\hat\theta_t - \theta^\star)^\top \Big(\nabla_t \nabla_t^\top + \frac{3}{2}\mathbb{E}\big[\nabla_t \nabla_t^\top \mid \mathcal{F}_{t-1}\big]\Big)(\hat\theta_t - \theta^\star)\bigg) \\
\le \frac{1}{2} \sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} X_t^\top P_{t+1} X_t\,(y_t - \hat\theta_t^\top X_t)^2 + \frac{\|\hat\theta_{T(\varepsilon,\delta)+1} - \theta^\star\|^2}{\lambda_{\min}(P_{T(\varepsilon,\delta)+1})} + \frac{\ln \delta^{-1}}{\lambda}\,, \quad n \ge 1\,, \qquad (9)
\]
On the other hand, Lemma 19 and Assumption 4 yield
\[
\mathbb{P}\Bigg(\bigcup_{n=1}^{\infty}\bigg\{\sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} \Big(\frac{1}{2} Q_t + \lambda\,(\hat\theta_t - \theta^\star)^\top \Big(\nabla_t \nabla_t^\top + \frac{3}{2}\mathbb{E}\big[\nabla_t \nabla_t^\top \mid \mathcal{F}_{t-1}\big]\Big)(\hat\theta_t - \theta^\star)\Big) \\
> \Big(\frac{1}{2} + 3\lambda\big(8\sigma^2 + D_{\mathrm{app}}^2 + \varepsilon^2 D_X^2\big)\Big) \sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} Q_t + \frac{9}{2}\lambda\big(\sigma^2 + D_{\mathrm{app}}^2 + \varepsilon^2 D_X^2\big) \sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} \mathbb{E}[Q_t \mid \mathcal{F}_{t-1}] \\
+ 12\lambda\,\varepsilon^2 D_X^2 \sigma^2 \ln \delta^{-1}\bigg\} \cap A^\varepsilon_{T(\varepsilon,\delta)}\Bigg) \le \delta\,.
\]
As in the proof of Theorem 1 we apply Lemma A.3 of Cesa-Bianchi and Lugosi (2006) and Lemma 17: for any δ > 0,
\[
\mathbb{P}\Bigg(\bigcup_{n=1}^{\infty}\bigg\{\sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} Q_t > 10(e^{0.1}-1) \sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} \mathbb{E}[Q_t \mid \mathcal{F}_{t-1}] + 10\,\varepsilon^2 D_X^2 \ln \delta^{-1}\bigg\} \cap A^\varepsilon_{T(\varepsilon,\delta)}\Bigg) \le \delta\,.
\]
\[
\mathbb{P}\Bigg(\bigcup_{n=1}^{\infty}\bigg\{\sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} \Big(\frac{1}{2} Q_t + \lambda\,(\hat\theta_t - \theta^\star)^\top \Big(\nabla_t \nabla_t^\top + \frac{3}{2}\mathbb{E}\big[\nabla_t \nabla_t^\top \mid \mathcal{F}_{t-1}\big]\Big)(\hat\theta_t - \theta^\star)\Big) \\
> \Big(10(e^{0.1}-1)\Big(\frac{1}{2} + 3\lambda\big(8\sigma^2 + D_{\mathrm{app}}^2 + \varepsilon^2 D_X^2\big)\Big) + \frac{9}{2}\lambda\big(\sigma^2 + D_{\mathrm{app}}^2 + \varepsilon^2 D_X^2\big)\Big) \sum_{t=T(\varepsilon,\delta)+1}^{T(\varepsilon,\delta)+n} \mathbb{E}[Q_t \mid \mathcal{F}_{t-1}] \\
+ \Big(10\,\varepsilon^2 D_X^2\Big(\frac{1}{2} + 3\lambda\big(8\sigma^2 + D_{\mathrm{app}}^2 + \varepsilon^2 D_X^2\big)\Big) + 12\lambda\,\varepsilon^2 D_X^2 \sigma^2\Big) \ln \delta^{-1}\bigg\} \cap A^\varepsilon_{T(\varepsilon,\delta)}\Bigg) \le 2\delta\,. \qquad (11)
\]
We set
\[
\lambda = \big(0.8 - 5(e^{0.1}-1)\big)\Big(30(e^{0.1}-1)\big(8\sigma^2 + D_{\mathrm{app}}^2 + \varepsilon^2 D_X^2\big) + \frac{9}{2}\big(\sigma^2 + D_{\mathrm{app}}^2 + \varepsilon^2 D_X^2\big)\Big)^{-1}
\]
in order to obtain
\[
10(e^{0.1}-1)\Big(\frac{1}{2} + 3\lambda\big(8\sigma^2 + D_{\mathrm{app}}^2 + \varepsilon^2 D_X^2\big)\Big) + \frac{9}{2}\lambda\big(\sigma^2 + D_{\mathrm{app}}^2 + \varepsilon^2 D_X^2\big) = 0.8\,,
\]
\[
\frac{1}{109\sigma^2 + 28 D_{\mathrm{app}}^2 + 28\,\varepsilon^2 D_X^2} < \lambda < \frac{1}{108\sigma^2 + 27 D_{\mathrm{app}}^2 + 27\,\varepsilon^2 D_X^2}\,,
\]
\[
10\,\varepsilon^2 D_X^2\Big(\frac{1}{2} + 3\lambda\big(8\sigma^2 + D_{\mathrm{app}}^2 + \varepsilon^2 D_X^2\big)\Big) + 12\lambda\, D_X^2 \varepsilon^2 \sigma^2 \le 8\,\varepsilon^2 D_X^2\,,
\]
\[
\frac{1}{\lambda} \le 28\big(4\sigma^2 + D_{\mathrm{app}}^2 + \varepsilon^2 D_X^2\big)\,.
\]
\[
+\, 8\,\varepsilon^2 D_X^2 \ln \delta^{-1} + 6\,\lambda_{\max}(P_1)\, D_X^2 \sigma^2 \ln \delta^{-1}\,, \quad n \ge 1\,,
\]
\[
\le D_X \|\hat\theta_t - \theta^\star\| \le D_X\big(\|\hat\theta_1 - \theta^\star\| + (t-1)\,\lambda_{\max}(P_1)\, D_X\big)\,,
\]
because for any s ≥ 1, we have \(P_s \preceq P_1\) and therefore \(\|\hat\theta_{s+1} - \hat\theta_s\| \le \lambda_{\max}(P_1)\, D_X\). Summing from 1 to \(T\big(\frac{1}{20 D_X}, \delta\big)\) yields the result.
C.2 Concentration of Pt
We prove a concentration result based on Tropp (2012), which will be used on the inverse
of Pt .
Lemma 21 If Assumption 1 is satisfied, then for any 0 ≤ β < 1 and \(t \ge 4^{1/(1-\beta)}\), it holds
\[
\mathbb{P}\Bigg(\lambda_{\min}\bigg(\sum_{s=1}^{t-1} \frac{X_s X_s^\top}{s^\beta}\bigg) < \frac{\Lambda_{\min}\, t^{1-\beta}}{4(1-\beta)}\Bigg) \le d \exp\Big(-t^{1-\beta}\,\frac{\Lambda_{\min}^2}{10\, D_X^4}\Big)\,.
\]
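The scaling in Lemma 21 can be sanity-checked by simulation: for bounded i.i.d. features, the smallest eigenvalue of the weighted design matrix comfortably exceeds Λmin t^{1−β}/(4(1−β)). The sampling distribution below (uniform on the unit sphere, so Λmin = 1/d and DX = 1) is our own illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, beta = 3, 2000, 0.3

# Bounded i.i.d. features on the unit sphere: E[X X^T] = I/d, D_X = 1
X = rng.normal(size=(t - 1, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
lam_min_true = 1.0 / d  # Lambda_min = lambda_min(E[X X^T])

# lambda_min of sum_{s=1}^{t-1} X_s X_s^T / s^beta
weights = 1.0 / np.arange(1, t) ** beta
S = (X * weights[:, None]).T @ X
lam_min = float(np.linalg.eigvalsh(S)[0])

# Lower bound appearing in Lemma 21
threshold = lam_min_true * t ** (1 - beta) / (4 * (1 - beta))
```

In this run the empirical λmin is several times larger than the bound, as expected since the lemma controls a small-probability event.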
s=1
Proof We wish to center the matrices Xs Xs> by subtracting their (common) expected
value. We use that if A and B are symmetric, λmin (A − B) ≤ λmin (A) − λmin (B). Indeed,
denoting by v any eigenvector of A associated with its smallest eigenvalue,
x> (A − B)x
λmin (A − B) = min
x kxk2
v > (A − B)v
≤
kvk2
v > Bv
= λmin (A) −
kvk2
x> Bx
≤ λmin (A) − min
x kxk2
= λmin (A) − λmin (B) .
We obtain:
Therefore, we obtain
\[
\begin{aligned}
\mathbb{P}\Bigg(\lambda_{\min}\bigg(\sum_{s=1}^{t-1} \frac{X_s X_s^\top}{s^\beta}\bigg) < \frac{\Lambda_{\min}(t^{1-\beta}-2)}{2(1-\beta)}\Bigg)
&\le \mathbb{P}\Bigg(\lambda_{\min}\bigg(\sum_{s=1}^{t-1} \Big(\frac{X_s X_s^\top}{s^\beta} - \mathbb{E}\Big[\frac{X_s X_s^\top}{s^\beta}\Big]\Big)\bigg) < \frac{\Lambda_{\min}(t^{1-\beta}-2)}{2(1-\beta)} - \Lambda_{\min}\,\frac{t^{1-\beta}-1}{1-\beta}\Bigg) \\
&= \mathbb{P}\Bigg(\lambda_{\max}\bigg(\sum_{s=1}^{t-1} \Big(\mathbb{E}\Big[\frac{X_s X_s^\top}{s^\beta}\Big] - \frac{X_s X_s^\top}{s^\beta}\Big)\bigg) > \frac{\Lambda_{\min}\, t^{1-\beta}}{2(1-\beta)}\Bigg)\,.
\end{aligned}
\]
\[
0 \preceq \sum_{s=1}^{t-1} \mathbb{E}\bigg[\Big(\frac{X_s X_s^\top}{s^\beta}\Big)^2\bigg] \preceq \sum_{s=1}^{t-1} \frac{D_X^4}{s^\beta}\, I \preceq D_X^4\, \frac{t^{1-\beta}}{1-\beta}\, I\,.
\]
\[
= d\exp\bigg(-t^{1-\beta}\,\frac{\Lambda_{\min}^2}{8\, D_X^4}\cdot\frac{1/(1-\beta)^2}{1/(1-\beta) + \Lambda_{\min}/(6\, D_X^2 (1-\beta))}\bigg)
= d\exp\bigg(-t^{1-\beta}\,\frac{\Lambda_{\min}^2}{8\, D_X^4}\Big(1-\beta + \frac{\Lambda_{\min}(1-\beta)}{6\, D_X^2}\Big)^{-1}\bigg)\,,
\]
therefore
\[
\mathbb{P}\Bigg(\lambda_{\min}\bigg(\sum_{s=1}^{t-1} \frac{X_s X_s^\top}{s^\beta}\bigg) < \frac{\Lambda_{\min}(t^{1-\beta}-2)}{2(1-\beta)}\Bigg) \le d \exp\Big(-t^{1-\beta}\,\frac{\Lambda_{\min}^2}{10\, D_X^4}\Big)\,.
\]
Proof of Proposition 11. We first move our problem to the setting of Lemma 21:
\[
\lambda_{\max}(P_t) = \lambda_{\min}\bigg(P_1^{-1} + \sum_{s=1}^{t-1} X_s X_s^\top\, \alpha_s\bigg)^{-1} \le \lambda_{\min}\bigg(P_1^{-1} + \sum_{s=1}^{t-1} \frac{X_s X_s^\top}{s^\beta}\bigg)^{-1}\,,
\]
where the second line is obtained by differentiating both sides of \(\sum_{m \ge \lfloor k^{1-\beta}\rfloor} r^{m+1} = \frac{r^{\lfloor k^{1-\beta}\rfloor + 1}}{1-r}\) with respect to r. Also, as \(1 - e^{-x} \ge x e^{-x}\) for any x ∈ R, we get
\[
\mathbb{P}\Big(\exists t > k,\ \lambda_{\max}(P_t) > \frac{4}{\Lambda_{\min}\, t^{1-\beta}}\Big) \le 4d\, \frac{10\, D_X^4}{\Lambda_{\min}^2} \exp\Big(\frac{2\,\Lambda_{\min}^2}{10\, D_X^4}\Big) \Big(k^{1-\beta} + 2\exp\Big(\frac{\Lambda_{\min}^2}{10\, D_X^4}\Big)\Big) \exp\Big(-k^{1-\beta}\,\frac{\Lambda_{\min}^2}{10\, D_X^4}\Big)\,.
\]
Furthermore, as \(x e^{-x} \le e^{-1}\) for any x ≥ 0, we get for any k ≥ 7:
\[
\frac{10\, D_X^4}{\Lambda_{\min}^2}\Big(k^{1-\beta} + 2\exp\Big(\frac{\Lambda_{\min}^2}{10\, D_X^4}\Big)\Big) \exp\Big(-k^{1-\beta}\,\frac{\Lambda_{\min}^2}{20\, D_X^4}\Big)
\le \frac{10\, D_X^4}{\Lambda_{\min}^2}\cdot\frac{20\, D_X^4}{\Lambda_{\min}^2}\, e^{-1} \exp\Big(\frac{1}{2}\exp\Big(\frac{\Lambda_{\min}^2}{10\, D_X^4}\Big)\Big)\,.
\]
Combining the last two inequalities, we obtain
\[
\mathbb{P}\Big(\exists t > k,\ \lambda_{\max}(P_t) > \frac{4}{\Lambda_{\min}\, t^{1-\beta}}\Big)
\le d\, \frac{800\, D_X^8}{\Lambda_{\min}^4}\, e^{-1} \exp\Big(\frac{2\,\Lambda_{\min}^2}{10\, D_X^4} + \frac{1}{2}\exp\Big(\frac{\Lambda_{\min}^2}{10\, D_X^4}\Big)\Big) \exp\Big(-k^{1-\beta}\,\frac{\Lambda_{\min}^2}{20\, D_X^4}\Big)
\le d\, \frac{625\, D_X^8}{\Lambda_{\min}^4} \exp\Big(-k^{1-\beta}\,\frac{\Lambda_{\min}^2}{20\, D_X^4}\Big)\,,
\]
and the result follows. The last line comes from \(\Lambda_{\min} \le D_X^2\) and consequently
\[
800\, e^{-1} \exp\Big(\frac{2\,\Lambda_{\min}^2}{10\, D_X^4} + \frac{1}{2}\exp\Big(\frac{\Lambda_{\min}^2}{10\, D_X^4}\Big)\Big) \le 800\, e^{-1+0.2+0.5 e^{0.1}} \approx 624.7 \le 625\,.
\]
The condition k ≥ 7 is not necessary because
\[
\bigg(\frac{20\, D_X^4}{\Lambda_{\min}^2} \ln\Big(\frac{625\, d\, D_X^8}{\Lambda_{\min}^4\, \delta}\Big)\bigg)^{1/(1-\beta)} \ge 20 \ln\big(625\,\delta^{-1}\big)\,,
\]
and either δ ≥ 1 and the result is trivial, or δ < 1 and \(20 \ln(625\,\delta^{-1}) \ge 128\).
Lemma 22 Let θ ∈ Rd.
\[
1.\quad L(\theta) - L(\theta^\star) > \eta \;\Longrightarrow\; \Big\|\frac{\partial L}{\partial \theta}\Big|_\theta\Big\| \ge D_\eta\,, \qquad \text{where } D_\eta = \frac{\Lambda_{\min}\sqrt{\eta}}{\sqrt{2}\, D_X \Big(1 + e^{D_X\big(\|\theta^\star\| + \sqrt{8\eta/D_X^2}\big)}\Big)}\,.
\]
\[
2.\quad \|\theta - \theta^\star\| > \varepsilon \;\Longrightarrow\; L(\theta) - L(\theta^\star) > \frac{\Lambda_{\min}}{4\big(1 + e^{D_X(\|\theta^\star\| + \varepsilon)}\big)}\, \varepsilon^2\,.
\]
Proof Both points derive from a second-order identity, turned into an upper bound in one case and into a lower bound in the other. Using \(\frac{\partial L}{\partial\theta}(\theta^\star) = 0\), there exists 0 ≤ λ ≤ 1 such that
\[
L(\theta) = L(\theta^\star) + \frac{1}{2}(\theta - \theta^\star)^\top\, \mathbb{E}\bigg[\frac{X X^\top}{\big(1 + e^{(\lambda\theta + (1-\lambda)\theta^\star)^\top X}\big)\big(1 + e^{-(\lambda\theta + (1-\lambda)\theta^\star)^\top X}\big)}\bigg]\,(\theta - \theta^\star)\,.
\]
1. We first have
\[
L(\theta) - L(\theta^\star) \le \frac{D_X^2}{8}\, \|\theta - \theta^\star\|^2\,.
\]
Assume L(θ) − L(θ?) > η. Then \(\|\theta - \theta^\star\| \ge \sqrt{8\eta/D_X^2}\). Also, using the Taylor expansion at θ0,
\[
L(\theta^\star) \ge L(\theta_0) + \frac{\partial L}{\partial \theta}\Big|_{\theta_0}^{\!\top}(\theta^\star - \theta_0) + \frac{1}{4\big(1 + e^{D_X(\|\theta^\star\| + \|\theta_0 - \theta^\star\|)}\big)}\,(\theta_0 - \theta^\star)^\top\, \mathbb{E}\big[X X^\top\big]\,(\theta_0 - \theta^\star)\,,
\]
\[
\frac{\partial L}{\partial \theta}\Big|_{\theta_0}^{\!\top}(\theta_0 - \theta^\star) \ge L(\theta_0) - L(\theta^\star) + \frac{\Lambda_{\min}}{4\big(1 + e^{D_X(\|\theta^\star\| + \|\theta_0 - \theta^\star\|)}\big)}\, \|\theta_0 - \theta^\star\|^2\,.
\]
\[
\Big\|\frac{\partial L}{\partial \theta}\Big|_{\theta_0}\Big\| \ge \frac{\Lambda_{\min}}{4\big(1 + e^{D_X(\|\theta^\star\| + \|\theta_0 - \theta^\star\|)}\big)}\, \|\theta_0 - \theta^\star\|\,.
\]
\[
\Big\|\frac{\partial L}{\partial \theta}\Big|_\theta\Big\| \ge \min_{\|\theta_0 - \theta^\star\| = \sqrt{8\eta/D_X^2}} \Big\|\frac{\partial L}{\partial \theta}\Big|_{\theta_0}\Big\|
\ge \frac{\Lambda_{\min}\sqrt{8\eta/D_X^2}}{4\Big(1 + e^{D_X\big(\|\theta^\star\| + \sqrt{8\eta/D_X^2}\big)}\Big)}
\ge \frac{\Lambda_{\min}}{\sqrt{2}\, D_X\Big(1 + e^{D_X\big(\|\theta^\star\| + \sqrt{8\eta/D_X^2}\big)}\Big)}\,\sqrt{\eta}\,.
\]
2.
\[
L(\theta) - L(\theta^\star) > \min_{\|\theta_0 - \theta^\star\| = \varepsilon} \big(L(\theta_0) - L(\theta^\star)\big) \ge \frac{\Lambda_{\min}}{4\big(1 + e^{D_X(\|\theta^\star\| + \varepsilon)}\big)}\, \varepsilon^2\,.
\]
Proof of Theorem 12. We prove the convergence of (L(θ̂t))t to L(θ?) and then the convergence of (θ̂t)t to θ? follows. The convergence of (L(θ̂t))t comes from the first point of Lemma 22. The link between the two convergences is stated in the second point.
To study the evolution of L(θ̂t) we first apply a second-order Taylor expansion: for any t ≥ 1 there exists 0 ≤ αt ≤ 1 such that
\[
L(\hat\theta_{t+1}) = L(\hat\theta_t) + \frac{\partial L}{\partial \theta}\Big|_{\hat\theta_t}^{\!\top}(\hat\theta_{t+1} - \hat\theta_t) + \frac{1}{2}(\hat\theta_{t+1} - \hat\theta_t)^\top \frac{\partial^2 L}{\partial \theta^2}\Big|_{\hat\theta_t + \alpha_t(\hat\theta_{t+1} - \hat\theta_t)}(\hat\theta_{t+1} - \hat\theta_t)\,. \qquad (12)
\]
We have \(\frac{\partial^2 L}{\partial \theta^2} \preceq \frac{1}{4}\,\mathbb{E}[X X^\top]\), therefore, using the update formula on θ̂, the second-order term is bounded with
\[
\hat\theta_{t+1} - \hat\theta_t = \bigg(P_t - \frac{P_t X_t X_t^\top P_t\, \alpha_t}{1 + X_t^\top P_t X_t\, \alpha_t}\bigg)\frac{y_t X_t}{1 + e^{y_t \hat\theta_t^\top X_t}}\,,
\]
and as αt ≤ 1,
\[
\bigg\|{-\frac{P_t X_t X_t^\top P_t\, \alpha_t}{1 + X_t^\top P_t X_t\, \alpha_t}}\cdot\frac{y_t X_t}{1 + e^{y_t \hat\theta_t^\top X_t}}\bigg\| \le D_X^3\, \lambda_{\max}(P_t)^2\,.
\]
Also, \(\big\|\frac{\partial L}{\partial \theta}\big\| \le D_X\). Substituting our findings in Equation (12), we obtain
\[
L(\hat\theta_{t+1}) \le L(\hat\theta_t) + \frac{\partial L}{\partial \theta}\Big|_{\hat\theta_t}^{\!\top} P_t\, \frac{y_t X_t}{1 + e^{y_t \hat\theta_t^\top X_t}} + 2\, D_X^4\, \lambda_{\max}(P_t)^2\,. \qquad (13)
\]
We define
\[
M_t = \frac{\partial L}{\partial \theta}\Big|_{\hat\theta_t}^{\!\top} P_t\, \frac{y_t X_t}{1 + e^{y_t \hat\theta_t^\top X_t}} - \mathbb{E}\bigg[\frac{\partial L}{\partial \theta}\Big|_{\hat\theta_t}^{\!\top} P_t\, \frac{y_t X_t}{1 + e^{y_t \hat\theta_t^\top X_t}} \,\Big|\, X_1, y_1, \dots, X_{t-1}, y_{t-1}\bigg]
= \frac{\partial L}{\partial \theta}\Big|_{\hat\theta_t}^{\!\top} P_t\, \frac{y_t X_t}{1 + e^{y_t \hat\theta_t^\top X_t}} + \frac{\partial L}{\partial \theta}\Big|_{\hat\theta_t}^{\!\top} P_t\, \frac{\partial L}{\partial \theta}\Big|_{\hat\theta_t}\,.
\]
Hence we have
\[
\frac{\partial L}{\partial \theta}\Big|_{\hat\theta_t}^{\!\top} P_t\, \frac{y_t X_t}{1 + e^{y_t \hat\theta_t^\top X_t}} \le M_t - \lambda_{\min}(P_t)\Big\|\frac{\partial L}{\partial \theta}\Big|_{\hat\theta_t}\Big\|^2 \le M_t - \frac{1}{t\, D_X^2}\Big\|\frac{\partial L}{\partial \theta}\Big|_{\hat\theta_t}\Big\|^2\,,
\]
because \(P_s \succeq \frac{I}{s\, D_X^2}\). Combining it with Equation (13) and summing consecutive terms, we obtain, for any k < t,
\[
L(\hat\theta_t) - L(\hat\theta_k) \le \sum_{s=k}^{t-1}\bigg(M_s - \frac{1}{s\, D_X^2}\Big\|\frac{\partial L}{\partial \theta}\Big|_{\hat\theta_s}\Big\|^2 + 2\, D_X^4\, \lambda_{\max}(P_s)^2\bigg)\,. \qquad (14)
\]
On the previous inequality, we see that the right-hand side is the sum of a martingale and a term which is negative for s large enough, under the event \(A^C_\delta\).
We are then interested in \(\mathbb{P}\big(L(\hat\theta_t) - L(\theta^\star) > \eta \mid A^C_\delta\big)\) for some η > 0. For 0 ≤ k ≤ t, we define Bk,t the event (∀k < s < t, L(θ̂s) − L(θ?) > η/2). Then we use the law of total probability. Lemma 22 yields
\[
L(\hat\theta_s) - L(\theta^\star) > \frac{\eta}{2} \;\Longrightarrow\; \Big\|\frac{\partial L}{\partial \theta}\Big|_{\hat\theta_s}\Big\| \ge D_\eta\,.
\]
We combine the last equation, along with Equation (14) and the definition of \(A^C_\delta\), to get, for any 1 ≤ k < t,
\[
\mathbb{P}\big(\big(L(\hat\theta_t) - L(\hat\theta_k) > \eta/2\big) \cap B_{k,t} \mid A^C_\delta\big) \le \mathbb{P}\Bigg(\bigg(\sum_{s=k}^{t-1} M_s > f(k,t)\bigg) \cap B_{k,t} \,\Big|\, A^C_\delta\Bigg) \le \mathbb{P}\Bigg(\sum_{s=k}^{t-1} M_s > f(k,t) \,\Big|\, A^C_\delta\Bigg)\,,
\]
where \(f(k,t) = \frac{\eta}{2} + \frac{D_\eta^2}{D_X^2}\sum_{s=k}^{t-1}\frac{1}{s} - 2\, D_X^4 C_\delta^2 \sum_{s=k}^{t-1}\frac{1}{s^{2(1-\beta)}}\) for any 1 ≤ k < t.
Similarly, we get
\[
\mathbb{P}\big(\big(L(\hat\theta_t) - L(\theta^\star) > \eta\big) \cap B_{0,t} \mid A^C_\delta\big) \le \mathbb{P}\Bigg(\sum_{s=1}^{t-1} M_s > f_0(t) \,\Big|\, A^C_\delta\Bigg)\,,
\]
with \(f_0(t) = \eta - \big(L(\hat\theta_1) - L(\theta^\star)\big) + \frac{D_\eta^2}{D_X^2}\sum_{s=1}^{t-1}\frac{1}{s} - 2\, D_X^4 C_\delta^2 \sum_{s=1}^{t-1}\frac{1}{s^{2(1-\beta)}}\) for any t ≥ 1.
We have \(\mathbb{E}[M_s \mid X_1, y_1, \dots, X_{s-1}, y_{s-1}] = 0\), and almost surely \(|M_s| \le 2\, D_X^2\, \lambda_{\max}(P_s)\). We can therefore apply the Azuma-Hoeffding inequality: for t, k such that f(k,t) > 0,
\[
\mathbb{P}\Bigg(\sum_{s=k}^{t-1} M_s > f(k,t) \,\Big|\, A^C_\delta\Bigg) \le \exp\bigg(-f(k,t)^2\, \frac{(1-2\beta)\max\big(1/2,\, (k-1)^{1-2\beta}\big)}{8\, D_X^4 C_\delta^2}\bigg)\,,
\]
because \(\sum_{s=k}^{+\infty}\frac{1}{s^{2(1-\beta)}} \le \frac{1}{(1-2\beta)\max(1/2,\, (k-1)^{1-2\beta})}\). Similarly, for t such that f0(t) > 0,
\[
\mathbb{P}\Bigg(\sum_{s=1}^{t-1} M_s > f_0(t) \,\Big|\, A^C_\delta\Bigg) \le \exp\bigg(-f_0(t)^2\, \frac{1-2\beta}{16\, D_X^4 C_\delta^2}\bigg)\,.
\]
We need to control f(k,t), f0(t). We see that for t large enough, when k is small compared to t, f(k,t) is driven by \(\frac{D_\eta^2}{D_X^2}\ln(t)\), and when k ≈ t, f(k,t) is driven by η/2. The following lemma formally states these approximations as lower bounds. We prove it right after the end of this proof.
Lemma 23 For \(t \ge \max\bigg(e^{\frac{16\, D_X^6 C_\delta^2}{D_\eta^2 (1-2\beta)}},\ \Big(1 + \Big(\frac{8\, D_X^4 C_\delta^2}{\eta(1-2\beta)}\Big)^{\frac{1}{1-2\beta}}\Big)^2\bigg)\), it holds
\[
f(k,t) \ge \frac{D_\eta^2}{4\, D_X^2}\ln(t)\,, \quad 1 \le k < \sqrt{t}\,, \qquad f(k,t) \ge \frac{\eta}{4}\,, \quad \sqrt{t} \le k < t\,.
\]
Similarly, for \(t \ge e^{\frac{2 D_X^2}{D_\eta^2}\big(L(\hat\theta_1) - L(\theta^\star) + \frac{4\, D_X^4 C_\delta^2}{1-2\beta}\big)}\), we have
\[
f_0(t) \ge \frac{D_\eta^2}{2\, D_X^2}\ln(t)\,.
\]
Finally, Point 2 of Lemma 22 allows to obtain the result: defining \(\eta = \frac{\Lambda_{\min}\,\varepsilon^2}{4\big(1 + e^{D_X(\|\theta^\star\|+\varepsilon)}\big)}\), we obtain
\[
C_1 \ge \frac{\Lambda_{\min}^6 (1-2\beta)\,\varepsilon^4}{2^{16}\, D_X^{12}\, C_\delta^2 \big(1 + e^{D_X(\|\theta^\star\|+\varepsilon)}\big)^6}\,, \qquad
C_2 \ge \frac{\Lambda_{\min}^2 (1-2\beta)\,\varepsilon^4}{2^{11}\, D_X^4\, C_\delta^2 \big(1 + e^{D_X(\|\theta^\star\|+\varepsilon)}\big)^2}\,,
\]
\[
t \ge 1 + \frac{\cdots}{(1-2\beta)\,\Lambda_{\min}\,\varepsilon^2}\,, \qquad
t \ge \exp\bigg(\frac{32\, D_X^4 \big(1 + e^{D_X(\|\theta^\star\|+\varepsilon)}\big)^3}{\Lambda_{\min}^3\, \varepsilon^2}\Big(L(\hat\theta_1) - L(\theta^\star) + \frac{4\, D_X^4 C_\delta^2}{1-2\beta}\Big)\bigg)\,.
\]
Therefore:
\[
f(k,t) \ge \frac{\eta}{2} + \frac{D_\eta^2}{D_X^2}\big(\ln t - \ln k\big) - \frac{2\, D_X^4 C_\delta^2}{1-2\beta}\cdot\frac{1}{\max\big(1/2,\, (k-1)^{1-2\beta}\big)}\,,
\]
\[
f_0(t) \ge \eta - \big(L(\hat\theta_1) - L(\theta^\star)\big) + \frac{D_\eta^2}{D_X^2}\ln t - \frac{4\, D_X^4 C_\delta^2}{1-2\beta}\,.
\]
• For any \(1 \le k < \sqrt{t}\), it holds \(\ln k \le \frac{1}{2}\ln t\), and we have
\[
f(k,t) \ge \frac{D_\eta^2}{2\, D_X^2}\ln(t) - \frac{4\, D_X^4 C_\delta^2}{1-2\beta}\,.
\]
Taking \(t \ge e^{\frac{16\, D_X^6 C_\delta^2}{D_\eta^2(1-2\beta)}}\) yields \(f(k,t) \ge \frac{D_\eta^2}{4\, D_X^2}\ln(t)\).
• For t ≥ 2 and any \(k \ge \sqrt{t}\), we have
\[
f(k,t) \ge \frac{\eta}{2} - \frac{2\, D_X^4 C_\delta^2}{(1-2\beta)(k-1)^{1-2\beta}} \ge \frac{\eta}{2} - \frac{2\, D_X^4 C_\delta^2}{(1-2\beta)(\sqrt{t}-1)^{1-2\beta}}\,.
\]
Then if \(t \ge \Big(1 + \Big(\frac{8\, D_X^4 C_\delta^2}{\eta(1-2\beta)}\Big)^{\frac{1}{1-2\beta}}\Big)^2\), we get \(f(k,t) \ge \frac{\eta}{4}\).
• The last point comes from \(f_0(t) \ge \frac{D_\eta^2}{D_X^2}\ln t - \big(L(\hat\theta_1) - L(\theta^\star)\big) - \frac{4\, D_X^4 C_\delta^2}{1-2\beta}\).
\[
\mathbb{P}\big(\|\hat\theta_t - \theta^\star\| > \varepsilon \mid A^C_{\delta/2}\big) \le (\sqrt{t}+1)\exp\big(-C_1 \ln(t)^2\big) + \sqrt{t}\,\exp\big(-C_2(\sqrt{t}-1)^{1-2\beta}\big)\,,
\]
where
\[
\mathbb{P}\Bigg(\bigcup_{t=T+1}^{\infty} \big(\|\hat\theta_t - \theta^\star\| > \varepsilon\big) \,\Big|\, A^C_{\delta/2}\Bigg) \le \sum_{t>T}(\sqrt{t}+1)\exp\big(-C_1 \ln(t)^2\big) + \sum_{t>T}\sqrt{t}\,\exp\big(-C_2(\sqrt{t}-1)^{1-2\beta}\big)\,.
\]
• If \(T \ge e^{\frac{3}{2 C_1}}\), we have
\[
\sum_{t>T}(\sqrt{t}+1)\exp\big(-C_1\ln(t)^2\big) \le \sum_{t>T}(\sqrt{t}+1)\frac{1}{t^{5/2}} \le 2/T\,.
\]
• For t ≥ 4, \(1 - 1/\sqrt{t} \ge 1/2\), then for \(t \ge \Big(\frac{12}{C_2(1-2\beta)}\Big)^{4/(1-2\beta)}\),
\[
\sqrt{t}\,\exp\big(-C_2(\sqrt{t}-1)^{1-2\beta}\big) \le \exp\Big(\frac{3}{2}\ln(t) - \frac{C_2}{2}\,t^{(1-2\beta)/2}\Big)
\le \exp\bigg(\frac{12}{1-2\beta}\ln\Big(\frac{12}{C_2(1-2\beta)}\Big) - \frac{6}{1-2\beta}\cdot\frac{12}{C_2(1-2\beta)}\bigg) \le 1\,,
\]
Thus for \(T \ge \Big(\frac{12}{C_2(1-2\beta)}\Big)^{4/(1-2\beta)}\),
\[
\sum_{t>T}\sqrt{t}\,\exp\big(-C_2(\sqrt{t}-1)^{1-2\beta}\big) \le 1/T\,.
\]
\[
\mathbb{P}\Bigg(\bigcup_{t=T+1}^{\infty} \big(\|\hat\theta_t - \theta^\star\| > \varepsilon\big) \,\Big|\, A^C_{\delta/2}\Bigg) \le 3/T \le \delta/2\,.
\]
\[
\exp\bigg(\frac{2^8\, D_X^8\, C_{\delta/2}^2 \big(1 + e^{D_X(\|\theta^\star\|+\varepsilon)}\big)^3}{\Lambda_{\min}^3 (1-2\beta)^{3/2}\, \varepsilon^2}\bigg) \le \exp\bigg(\frac{3 \cdot 2^{15}\, D_X^{12}\, C_{\delta/2}^2 \big(1 + e^{D_X(\|\theta^\star\|+\varepsilon)}\big)^6}{\Lambda_{\min}^6 (1-2\beta)^{3/2}\, \varepsilon^4}\bigg)\,.
\]
Furthermore, as 1 − 2β ≤ 1, we have
\[
\exp\Big(\frac{3}{2\, C_1}\Big) \le \exp\bigg(\frac{3 \cdot 2^{15}\, D_X^{12}\, C_{\delta/2}^2 \big(1 + e^{D_X(\|\theta^\star\|+\varepsilon)}\big)^6}{\Lambda_{\min}^6 (1-2\beta)\, \varepsilon^4}\bigg) \le \exp\bigg(\frac{3 \cdot 2^{15}\, D_X^{12}\, C_{\delta/2}^2 \big(1 + e^{D_X(\|\theta^\star\|+\varepsilon)}\big)^6}{\Lambda_{\min}^6 (1-2\beta)^{3/2}\, \varepsilon^4}\bigg)\,.
\]
Finally,
\[
\begin{aligned}
\Big(\frac{12}{C_2(1-2\beta)}\Big)^{4/(1-2\beta)} &= \exp\bigg(\frac{4}{1-2\beta}\ln\Big(\frac{12}{C_2(1-2\beta)}\Big)\bigg) \\
&= \exp\bigg(\frac{4}{1-2\beta}\ln\bigg(\frac{12 \cdot 2^{11}\, D_X^4\, C_{\delta/2}^2 \big(1 + e^{D_X(\|\theta^\star\|+\varepsilon)}\big)^2}{\Lambda_{\min}^2 (1-2\beta)^2\, \varepsilon^4}\bigg)\bigg) \\
&= \exp\bigg(\frac{8}{1-2\beta}\ln\sqrt{\frac{12 \cdot 2^{11}\, D_X^4\, C_{\delta/2}^2 \big(1 + e^{D_X(\|\theta^\star\|+\varepsilon)}\big)^2}{\Lambda_{\min}^2 (1-2\beta)\, \varepsilon^4}}\,\bigg) \\
&\le \exp\bigg(\frac{8}{1-2\beta}\sqrt{\frac{3 \cdot 2^{13}\, D_X^4\, C_{\delta/2}^2 \big(1 + e^{D_X(\|\theta^\star\|+\varepsilon)}\big)^2}{\Lambda_{\min}^2 (1-2\beta)\, \varepsilon^4}}\,\bigg) \\
&= \exp\bigg(\frac{\sqrt{6}\, 2^9\, D_X^2\, C_{\delta/2} \big(1 + e^{D_X(\|\theta^\star\|+\varepsilon)}\big)}{\Lambda_{\min} (1-2\beta)^{3/2}\, \varepsilon^2}\bigg) \\
&\le \exp\bigg(\frac{3 \cdot 2^{15}\, D_X^{12}\, C_{\delta/2}^2 \big(1 + e^{D_X(\|\theta^\star\|+\varepsilon)}\big)^6}{\Lambda_{\min}^6 (1-2\beta)^{3/2}\, \varepsilon^4}\bigg)\,.
\end{aligned}
\]
t−1 t−1
X 1 X
arg min (ys − θ> Xs )2 + (θ − θ̂1 )> P1−1 (θ − θ̂1 ) = θ̂1 + Pt (ys − θ̂1> Xs )Xs .
θ∈Rd 2
s=1 s=1
θ̂t+1 − θ̂1 = (I − Pt+1 Xt Xt> )(θ̂t − θ̂1 ) + Pt+1 yt Xt − Pt+1 Xt Xt> θ̂1
t−1
X
= (I − Pt+1 Xt Xt> )Pt (ys − θ̂1> Xs )Xs + Pt+1 (yt − θ̂1> Xt )Xt .
s=1
47
de Vilmarest and Wintenberger
We apply Lemma 1.4 of Rigollet and Hütter (2015) in the second line of the following:
for any µ such that 0 < µ < 2√12σ ,
X µi E[|yt − E[yt | Xt ]|i ]
E [exp(µ|yt − E[yt | Xt ]|)] = 1 +
i!
i≥1
X µi (2σ 2 )i/2 iΓ(i/2)
≤1+
i!
k≥1
X √ i
≤1+ 2µσ , because Γ(i/2) ≤ Γ(i) = (i − 1)!
i≥1
√ √ 1
≤ 1 + 2 2µσ, because 0 < 2µσ ≤
√ 2
≤ exp 2 2µσ .
t √
1 P
Therefore exp 2√2σ (|ys − E[ys | Xs ]| − 2 2σ) is a super-martingale to which we
s=1 t
can apply Lemma 17. We obtain, for any δ > 0,
t−1
X √ √
|yt − E[yt | Xt ]| ≤ 2 2(t − 1)σ + 2 2σ ln δ −1 , t ≥ 1,
s=1
48
Stochastic Online Optimization using Kalman Recursion
√
with probability at least 1 − δ. The result follows from Equation (17) and 2 2 ≤ 3.
Proof of Theorem 16. We first apply Theorem 2: with probability at least 1 − 5δ, it
holds simultaneously for all n ≥ T (ε, δ)
n 2
X
? 15 2 2 2 2
λmax (P1 )DX
L(θ̂t ) − L(θ ) ≤ d 8σ + Dapp + ε DX ln 1 + (n − T (ε, δ))
2 d
t=T (ε,δ)+1
+ 5λmax PT−1 (ε,δ)+1 ε
2
λmax (P1 )DX2
2
+ 115 σ (4 + ) + Dapp + 2ε DX ln δ −1 .
2 2 2
4
Moreover, λmax PT−1 −1
(ε,δ)+1 ≤ λmax (P1 ) + T (ε, δ)DX .
2
Then we derive a bound on the first T (ε, δ) terms. For any t ≥ 1, we have L(θ̂t )−L(θ? ) ≤
2 kθ̂
DX − θ? k2 , thus, using (a + b)2 ≤ 2(a2 + b2 ) and applying Lemma 24 we obtain the
t
simultaneous property
L(θ̂t ) − L(θ? ) ≤ 2DX
2
(kθ̂1 − θ? k + 3λmax (P1 )DX σ ln δ −1 )2
+ 2λmax (P1 )2 DX
4
(3σ + Dapp )2 (t − 1)2 , t ≥ 1,
with probability at least 1 − δ. A summation argument yields, for any δ > 0,
T (ε,δ)
X
L(θ̂t ) − L(θ? ) ≤ 2DX
2
(kθ̂1 − θ? k + 3λmax (P1 )DX σ ln δ −1 )2 T (ε, δ)
t=1
(T (ε, δ) − 1)T (ε, δ)(2T (ε, δ) − 1)
+ λmax (P1 )2 DX
4
(3σ + Dapp )2 ,
3
with probability at least 1 − δ.
12 kθ̂1 − θ? k2 DX2 √
+ (1 + 8 ln δ −1 )
0.072 t2 p1 Λmin
!
kθ̂1 − θ? k 2
DX ? −1 2
+ √ (Dapp + DX kθ k) + √ (ln δ ) ,
Λmin 2p1
with probability at least 1 − 4δ.
49
de Vilmarest and Wintenberger
therefore we apply the ridge analysis of Hsu et al. (2012) to (Xs , ys − β̂1> Xs ). We note that
(ys − β̂1> Xs ) has the same variance proxy and the same approximation error, therefore it
only amounts to translate the optimal w, that is denoted by β.
For any λ > 0, we observe that
DX
d2,λ ≤ d1,λ ≤ d , ρλ ≤ p , bλ ≤ ρλ (Dapp + DX kβ − β̂1 k) .
d1,λ Λmin
Therefore we can apply Theorem 16 of Hsu et al. (2012): for 0 < δ < e−2.6 and t ≥
6 √DX
Λmin
(ln(d)+ln δ −1 ), it holds that kβ̂t+1,λ −βk2Σ = 3(kβλ −βk2Σ +εbs +εvr ) with probability
1 − 4δ, with
DX2 DX2
| X] − β > X)2 ] + (1 + Λmin )kβλ − βk2Σ √
4 Λmin E[(E[y
εbs ≤ (1 + 8 ln δ −1 )
0.072 t
( √D
Λ
X
(Dapp + DX kβ − β̂1 k) + kβλ − βkΣ )2
min −1 2
+ (ln δ ) ,
t2
r
DX4
√ +1 4 2
Λmin d
1 DX 1
δf ≤ √ √ (1 + 8 ln δ −1 ) + ln δ −1 ,
t Λmin t 3
p
σ 2 d(1 + δf ) 2σ 2 d(1 + δf ) ln δ −1 2σ 2 ln δ −1
εvr ≤ + + .
0.072 t 0.073/2 t 0.07t
Moreover E[(E[y | X]−β > X)2 ] ≤ Dapp
2 and Λmin ≤ DX 2 , hence, using kβ −βk ≤ λkβ −
λ Σ
β̂1 k we transfer the result in our KF notations, that is, θ̂t = β̂t,p−1 /2(t−1) , β̂1 = θ̂1 , β = θ? .
1
We obtain, for any 0 < δ < e−2.6 and t ≥ 6 √D
Λ
X
(ln(d) + ln δ −1 ),
min
2 2
DX 2 DX kθ̂1 −θ? k2
4 Λmin Dapp + Λmin p1 t
√
εbs ≤ (1 + 8 ln δ −1 )
0.072 t
kθ̂1 −θ? k 2
( √D
Λ
X
(Dapp + DX kθ? k) + √
2p1 t
)
+ min
(ln δ −1 )2 ,
t2 r
DX4
√ +1 4 2
Λmin d
1 DX 1
δf ≤ √ √ (1 + 8 ln δ −1 ) + ln δ −1 ,
t Λmin t 3
p
σ 2 d(1 + δf ) 2σ 2 d(1 + δf ) ln δ −1 2σ 2 ln δ −1
εvr ≤ + + ,
0.072 t 0.073/2 t ! 0.07t
kθ̂1 − θ? k2
kθ̂t+1 − θ? k2Σ ≤ 3 + εbs + εvr ,
2p1 t
50
Stochastic Online Optimization using Kalman Recursion
DX2
with probability at least 1 − 4δ. For t ≥ Λmin ln δ −1 , as ln δ −1 ≥ 1, we get
√ √
√
r
1 14 1 1+ 8 2 2
δf ≤ √ (1 + 8 ln δ −1 ) + +1≤ √ + ≈ 1.9 ≤ 2 .
6 ln δ −1 63 d 6 9
√ a+b
Thus, as ab ≤ 2 for any a, b > 0, we have
r !
σ2 3d 3d ln δ −1
εvr ≤ +2 + 2 ln δ −1
0.07t 0.07 0.07
σ2
6d −1
≤ + 3 ln δ
0.07t 0.07
3σ 2 (d/0.035 + ln δ −1 )
≤ .
0.07t
It yields the result.
Lemma 25 allows the definition of an explicit value for T (ε, δ), as displayed in the
following Corollary.
Corollary 26 Assumption 5 is satisfied for T (ε, δ) = max(T1 (δ), T2 (ε, δ), T3 (ε, δ)) where
we define
DX2 2 2
−1 48DX 24DX
T1 (δ) = max 12 (ln d + ln δ ), ln ,
Λmin Λmin Λmin
√ !
24ε−1 kθ̂1 − θ? k2 DX2
4(1 + 8 ln δ −1 ) 3σ 2 (d/0.035 + ln δ −1 )
T2 (ε, δ) = + D2 +
Λmin 2p1 Λmin app 0.072 0.07
√ !
12ε−1 kθ̂1 − θ? k2 DX2
4(1 + 8 ln δ −1 ) 3σ 2 (d/0.035 + ln δ −1 )
ln + D2 + ,
Λmin 2p1 Λmin app 0.072 0.07
s
96ε−1 kθ̂1 − θ? k2 DX 2 √
T3 (ε, δ) = (1 + 8 ln δ −1 )
0.072 Λmin p1 Λmin
!1/2
kθ̂1 − θ? k 2
DX ? −1 2
+ √ (Dapp + DX kθ k) + √ (ln δ )
Λmin 2p1
96ε−1 kθ̂1 − θ? k2 D2 √
ln (1 + X )(1 + 8 ln δ −1 )
0.072 Λmin 2p1 Λmin
!
?k 2
DX kθ̂ 1 − θ
+ √ (Dapp + DX kθ? k) + √ (ln δ −1 )2 .
Λmin 2p1
We recall that for any η ≤ 1, we have lnt t ≤ η for t ≥ 2η −1 ln(η −1 ), and we use it in the
following proof.
Proof of Corollary 26. We define δt = δ/t2 for any t ≥ 1. In order to apply Lemma
D2 D2
25 with a union bound, we need t ≥ 6 Λmin
X
(ln d + ln δt−1 ). If t ≥ 12 Λmin
X
(ln d + ln δ −1 ) and
51
de Vilmarest and Wintenberger
2
48DX 2
24DX
t≥ Λmin ln Λmin , we obtain
√
t t√
t≥ + t
2 2
D2 12DX 2 √
≥ 6 X (ln d + ln δ −1 ) + ln t, as ln t ≤ t
Λmin Λmin
DX2
=6 (ln d + ln δt−1 ) .
Λmin
2 48D2 24D2
DX
Therefore, we define T1 (δ) = max 12 Λmin (ln d + ln δ −1 ), ΛminX ln ΛminX , and we apply
Lemma 25. We get the simultaneous property
q
−1 −1
3 kθ̂1 − θ k? 2 D 2 4(1 + 8 ln δt ) 2
3σ (d/0.035 + ln δt )
kθ̂t+1 − θ? k2Σ ≤ + X Dapp 2
2
+
t 2p1 Λmin 0.07 0.07
2
12 kθ̂1 − θ? k2 DX
q
+ (1 + 8 ln δt−1 )
0.072 t2 p1 Λmin
!
?k 2
DX k θ̂ 1 − θ
+ √ (Dapp + DX kθ? k) + √ (ln δt−1 )2
Λmin 2p1
for all t ≥ T1 (δ, with probability at least 1 − δ. Finally, both terms of the last inequality
are bounded by ε/2.
From Corollary 26, we obtain the asymptotic rate by comparing T2 (δ) and T3 (δ). We
write T2 (δ) = 2A2 (δ) ln A2 (δ), T3 (δ) = 2A3 (δ) ln A3 (δ) with
!
ε−1 kθ̂1 − θ? k2 DX2 √
A2 (δ) . + D2 ln δ −1 + σ 2 (d + ln δ −1 )
Λmin p1 Λmin app
v !
u ε−1 kθ̂1 − θ? k2 D2 √
u
? kθ̂1 − θ? k 2
app + DX kθ k)
D (D
X −1 X −1 2
A3 (δ) . t ln δ + √ + √ (ln δ ) .
Λmin p1 Λmin Λmin p1
52
Stochastic Online Optimization using Kalman Recursion
√ √ √
where
√ the symbol . means less than up to universal constants. As a + b . a + b and
ab . a + b, we obtain
s s
2 √
ε−1 kθ̂1 − θ? k2 DX
A3 (δ) . ln δ −1
Λmin p1 Λmin
!
?k
DX k θ̂ 1 − θ
+ √ (Dapp + DX kθ? k) + √ ln δ −1
Λmin p1
s
ε−1 kθ̂1 − θ? k2 D2 √
. + X ln δ −1
Λmin p1 Λmin
!
?k
DX k θ̂ 1 − θ
+ √ (Dapp + DX kθ? k) + √ ln δ −1 .
Λmin p1
ε−1
Thus, as long as Λmin ≤ 1, we get
ε−1 kθ̂1 − θ? k2 D2 √
2
A2 (δ), A3 (δ) . + X (1 + Dapp ) ln δ −1 + σ 2 d
Λmin p1 Λmin
!
kθ̂1 − θ? k
DX ? 2 −1
+ √ (Dapp + DX kθ k) + √ + σ ln δ .
Λmin p1
References
Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10
(2):251–276, 1998.
Francis Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity
for logistic regression. Journal of Machine Learning Research, 15(1):595–627, 2014.
Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation
with convergence rate o(1/n). In Advances in Neural Information Processing Systems,
pages 773–781, 2013.
Bernard Bercu and Abderrahmen Touati. Exponential inequalities for self-normalized mar-
tingales with applications. The Annals of Applied Probability, 18(5):1848–1869, 2008.
Bernard Bercu, Antoine Godichon, and Bruno Portier. An efficient stochastic newton al-
gorithm for parameter estimation in logistic regressions. SIAM Journal on Control and
Optimization, 58(1):348–367, 2020.
Jock A. Blackard and Denis J. Dean. Comparative accuracies of artificial neural networks
and discriminant analysis in predicting forest cover types from cartographic variables.
Computers and Electronics in Agriculture, 24(3):131–151, 1999.
Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge
university press, 2006.
53
de Vilmarest and Wintenberger
George T. Diderrich. The Kalman filter from the perspective of Goldberger–Theil estima-
tors. The American Statistician, 39(3):193–198, 1985.
James Durbin and Siem J. Koopman. Time Series Analysis by State Space Methods. Oxford
university press, 2012.
Ludwig Fahrmeir. Posterior mode estimation by extended Kalman filtering for multivariate
dynamic generalized linear models. Journal of the American Statistical Association, 87
(418):501–509, 1992.
David A. Freedman. On tail probabilities for martingales. the Annals of Probability, pages
100–118, 1975.
Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online
convex optimization. Machine Learning, 69(2-3):169–192, 2007.
Daniel Hsu, Sham M. Kakade, and Tong Zhang. Random design analysis of ridge regression.
In Conference on Learning Theory, pages 9–1, 2012.
Sham M. Kakade and Andrew Y. Ng. Online bounds for bayesian algorithms. In Advances
in Neural Information Processing Systems, pages 641–648, 2005.
Rudolph E. Kalman and Richard S. Bucy. New results in linear filtering and prediction
theory. Journal of Basic Engineering, 83(1):95–108, 1961.
Mehrdad Mahdavi, Lijun Zhang, and Rong Jin. Lower and upper bounds on the generaliza-
tion of stochastic exponentially concave optimization. In Conference on Learning Theory,
pages 1305–1320, 2015.
Peter McCullagh and John A. Nelder. Generalized Linear Models. London Chapman and
Hall, 2nd ed edition, 1989.
Noboru Murata and Shun-ichi Amari. Statistical analysis of learning dynamics. Signal
Processing, 74(1):3–28, 1999.
Yann Ollivier. Online natural gradient as a Kalman filter. Electronic Journal of Statistics,
12(2):2930–2961, 2018.
Dmitrii M. Ostrovskii and Francis Bach. Finite-sample analysis of m-estimators using self-
concordance. Electronic Journal of Statistics, 15(1):326–391, 2021.
54
Stochastic Online Optimization using Kalman Recursion
Phillippe Rigollet and Jan-Christian Hütter. High dimensional statistics. Lecture notes for
course 18S997, 2015.
Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of
Mathematical Statistics, pages 400–407, 1951.
Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of
Computational Mathematics, 12(4):389–434, 2012.
Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent.
In International Conference on Machine Learning, pages 928–936, 2003.
55