A note on the accelerated proximal gradient method for nonconvex optimization
Volume 34 (2018), No. 3, pp. 449–457. Print ISSN 1584-2851; Online ISSN 1843-4401. DOI: https://doi.org/10.37193/CJM.2018.03.22
ABSTRACT. We improve a recent accelerated proximal gradient (APG) method in [Li, Q., Zhou, Y., Liang,
Y. and Varshney, P. K., Convergence analysis of proximal gradient with momentum for nonconvex optimization, in
Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017]
for nonconvex optimization by allowing variable stepsizes. We prove the convergence of the APG method
for a composite nonconvex optimization problem under the assumption that the composite objective function
satisfies the Kurdyka-Łojasiewicz property.
1. INTRODUCTION
In this paper we consider a composite optimization problem of the form
(1.1)    min_{x∈H} F (x) := f (x) + g(x),
2. PRELIMINARIES
Let d ≥ 1 be a given integer and consider the Euclidean d-space Rd with inner product ⟨·, ·⟩ and norm ∥ · ∥ (i.e., the ℓ2-norm ∥ · ∥₂). By Γ(Rd ) we denote the family of all functions f : Rd → (−∞, ∞] =: R̄ which are proper and lower semicontinuous (lsc).
2.1. Subdifferential of Nonconvex Functions.
Definition 2.1. Let f ∈ Γ(Rd ) and x ∈ dom(f ) be given. We say that x∗ ∈ Rd is a Fréchet subgradient of f at x if
    lim inf_{z→x} [ f (z) − f (x) − ⟨x∗ , z − x⟩ ] / ∥z − x∥ ≥ 0.
The set of Fréchet subgradients of f at x, denoted ∂̂f (x), is called the Fréchet subdifferential of f at x.
The (Mordukhovich) limiting subdifferential (or simply, subdifferential) of f at x, denoted ∂f (x), is defined as
    ∂f (x) = {x∗ ∈ Rd : ∃ x∗k → x∗ with x∗k ∈ ∂̂f (xk ) and xk →_f x}.
Here “→_f” means f-convergence, that is, xk →_f x if and only if xk → x and f (xk ) → f (x).
Definition 2.2. We say that a point x is a critical point of f if 0 ∈ ∂f (x). The lazy slope of
f at a point x is defined as
|∂f (x)| := inf{∥z∥ : z ∈ ∂f (x)} = dist(0, ∂f (x)).
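For instance, for f (x) = |x| on R (a standard illustration, not drawn from the paper), ∂f (0) = [−1, 1], hence |∂f (0)| = dist(0, [−1, 1]) = 0 and 0 is a critical point of f ; for x ≠ 0, ∂f (x) = {sign(x)} and |∂f (x)| = 1.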
Proposition 2.1. [7] Let f, g ∈ Γ(Rd ) and x ∈ Rd be given.
(i) We have ∂̂f (x) ⊂ ∂f (x). Moreover, ∂̂f (x) is convex and ∂f (x) is closed (not necessarily convex). If f is convex, then both sets reduce to the subdifferential in the sense of convex analysis.
(ii) If the sequences (xk ) and (yk ) are such that xk →_f x, yk → y, and yk ∈ ∂f (xk ) for all k, then y ∈ ∂f (x).
(iii) Fermat’s rule remains true: if x is a local minimizer of f , then x is a critical point (or stationary point) of f , that is, 0 ∈ ∂f (x).
(iv) If g is continuously differentiable, then ∂(f + g)(x) = ∂f (x) + ∇g(x).
(v) ∂f is closed in the sense that if {(xk , yk )} is a sequence in the graph of ∂f , G(∂f ) := {(z, w) : z ∈ dom(∂f ), w ∈ ∂f (z)}, such that xk →_f x and yk → y, it follows that (x, y) ∈ G(∂f ).
(vi) If xk →_f x and lim inf_{k→∞} |∂f (xk )| = 0, then x is a critical point of f .
2.2. Kurdyka-Łojasiewicz Property. The Kurdyka-Łojasiewicz property [8, 9] plays a central part in the theory of nonconvex optimization.
Definition 2.3. [2] We say that a function f ∈ Γ(Rd ) satisfies the Kurdyka-Łojasiewicz
property (KŁP) at x∗ ∈ dom(∂f ) if there exist η ∈ (0, ∞], a neighborhood U of x∗ , and a
continuous concave function φ : [0, η) → R+ such that
(i) φ(0) = 0,
(ii) φ ∈ C 1 (0, η),
(iii) φ′ (t) > 0 for all t ∈ (0, η),
(iv) the Kurdyka-Łojasiewicz inequality
(2.2)    φ′ (f (x) − f (x∗ )) |∂f (x)| ≥ 1
holds for all x ∈ U ∩ {x : f (x∗ ) < f (x) < f (x∗ ) + η}.
A function f satisfying the KŁP at every point of dom(∂f ) is called a KL function.
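For orientation (a standard illustration from the KŁ literature, e.g. [4], rather than from this excerpt): if the desingularizing function has the power form φ(t) = c t^{1−θ} with c > 0 and θ ∈ [0, 1), then φ′ (t) = c(1 − θ) t^{−θ} and (2.2) reads
    c(1 − θ) (f (x) − f (x∗ ))^{−θ} |∂f (x)| ≥ 1,   that is,   (f (x) − f (x∗ ))^{θ} ≤ c(1 − θ) |∂f (x)|,
which is the classical Łojasiewicz gradient inequality.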
Lemma 2.3. Let f, g ∈ Γ(Rd ) and set F = f + g. Let λ > 0 be given. Assume that the gradient
∇f of f is L-Lipschitz continuous. Then, for any u ∈ Rd and setting
û := proxλg (u − λ∇f (u)),
we have
(2.8)    F (û) ≤ F (u) − (1/2)(1/λ − L) ∥û − u∥².
Proof. We have
û = arg min_z { g(z) + (1/(2λ)) ∥z − u + λ∇f (u)∥² }
  = arg min_z { g(z) + (1/(2λ)) ∥z − u∥² + ⟨z − u, ∇f (u)⟩ },
the two objectives differing only by the constant (λ/2)∥∇f (u)∥².
Since û is a minimizer of the function
(2.9)    z ↦ ψ(z) := g(z) + (1/(2λ)) ∥z − u∥² + ⟨z − u, ∇f (u)⟩,
it turns out that
(2.10)    g(û) + (1/(2λ)) ∥û − u∥² + ⟨û − u, ∇f (u)⟩ ≤ g(u).
On the other hand, since ∇f is L-Lipschitz, we can use Lemma 2.2 to get the inequality:
(2.11)    f (û) ≤ f (u) + ⟨∇f (u), û − u⟩ + (L/2) ∥û − u∥².
Adding up (2.10) and (2.11) immediately yields (2.8). □
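As a quick numerical check of (2.8) (a minimal sketch, not part of the paper; it assumes f (x) = (1/2)∥Ax − b∥², so that ∇f is L-Lipschitz with L = ∥AᵀA∥₂, and g = μ∥ · ∥₁, whose proximal map is soft-thresholding):

import numpy as np

# Minimal sketch under illustrative assumptions: f(x) = 0.5*||Ax - b||^2,
# g(x) = mu*||x||_1, and prox of lam*g = soft-thresholding at level lam*mu.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
mu = 0.1

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad_f = lambda x: A.T @ (A @ x - b)
F = lambda x: f(x) + mu * np.linalg.norm(x, 1)

L = np.linalg.norm(A.T @ A, 2)   # Lipschitz constant of grad f (spectral norm)
lam = 0.9 / L                    # any stepsize 0 < lam < 1/L

def prox_g(z, t):
    # proximal map of t*g for g = mu*||.||_1 (soft-thresholding)
    return np.sign(z) * np.maximum(np.abs(z) - t * mu, 0.0)

u = rng.standard_normal(10)
u_hat = prox_g(u - lam * grad_f(u), lam)   # u_hat = prox_{lam g}(u - lam*grad f(u))

# sufficient decrease (2.8): F(u_hat) <= F(u) - 0.5*(1/lam - L)*||u_hat - u||^2
lhs = F(u_hat)
rhs = F(u) - 0.5 * (1.0 / lam - L) * np.linalg.norm(u_hat - u) ** 2
print(lhs <= rhs + 1e-10)        # expected output: True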
2.4. Convergence Lemma. We also need the following basic result regarding convergence of nonnegative series.
Lemma 2.4. Let {ak } be a sequence of nonnegative real numbers such that
(2.12)    ak+1 ≤ γak + bk ,   k ≥ 0,
where γ ∈ [0, 1) and bk ≥ 0 satisfy ∑_{k=0}^{∞} bk < ∞. Then ∑_{k=0}^{∞} ak < ∞.
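The standard argument behind Lemma 2.4 (sketched here for completeness; the proof is not reproduced in this excerpt) is to unroll the recursion (2.12), which gives ak ≤ γ^k a0 + ∑_{j=0}^{k−1} γ^{k−1−j} bj for every k ≥ 1, and then to sum over k, exchanging the order of summation in the double sum:
    ∑_{k=0}^{∞} ak ≤ a0/(1 − γ) + (1/(1 − γ)) ∑_{j=0}^{∞} bj < ∞.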
3. MAIN RESULTS
Consider the following composite optimization problem
(3.13)    min_{x∈Rd} F (x) := f (x) + g(x),
where f, g ∈ Γ(Rd ).
The following accelerated proximal gradient (APG) algorithm was introduced by Li et al. [10, Algorithm 3].
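The statement of Algorithm 2 falls on a page not included in this excerpt. Purely as an illustrative sketch (the exact update order of Algorithm 2 is an assumption here), the loop below performs a monotone proximal gradient iteration with extrapolation vk = xk + βk (xk − xk−1 ), βk = 1/(k + 2), and variable stepsizes λk ∈ (0, 1/L), consistent with the quantities xk , vk , yk+1 , λk used in Section 3 below:

import numpy as np

# Illustrative sketch ONLY: Algorithm 2 is not reproduced in this excerpt.
# Monotone proximal gradient iteration with extrapolation, beta_k = 1/(k+2)
# and a variable stepsize lam(k) in (0, 1/L).
def apg_sketch(grad_f, prox_g, F, x0, lam, n_iter=200):
    # grad_f(x): gradient of f; prox_g(z, t): proximal map of t*g;
    # F(x): objective f + g; lam(k): stepsize sequence.
    x_prev, x = x0.copy(), x0.copy()
    for k in range(n_iter):
        beta = 1.0 / (k + 2)
        v = x + beta * (x - x_prev)            # extrapolated point v_k
        y = x if F(x) <= F(v) else v           # monotone choice of y_{k+1}
        t = lam(k)
        x_next = prox_g(y - t * grad_f(y), t)  # proximal gradient step
        x_prev, x = x, x_next
    return x

With grad_f, prox_g, F, and L as in the sketch following Lemma 2.3, a call such as apg_sketch(grad_f, prox_g, F, np.zeros(10), lam=lambda k: 0.9 / L) produces iterates whose successive differences ∥xk+1 − xk ∥ shrink, in line with (3.14) and (3.17).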
The main results in this paper show that the conclusions of Theorem 3.1 remain true
for variable stepsizes, and moreover, convergence of the trajectories is guaranteed if, in
addition, the composite function F satisfies the Kurdyka-Łojasiewicz property.
Theorem 3.2. Consider a sequence {xk } generated by Algorithm 2. Assume the conditions (A1)
and (A2) of Theorem 3.1 hold, and in addition, the stepsize sequence {λk } satisfies the property:
0 < a ≤ λk ≤ b < 1/L for all k. Then the following conclusions hold.
(i) {xk } is a bounded sequence;
(ii) The set Ω of limit points of {xk } forms a compact set, on which the objective function F is
constant;
(iii) All elements of Ω are critical points of F .
Moreover, if, in addition, F = f + g is a KL function, then {xk } converges to a critical point of F
with finite length, that is,
(3.14)    ∑_{k=0}^{∞} ∥xk+1 − xk ∥ < ∞.
In particular,
(3.17)    lim_{k→∞} ∥xk − yk ∥ = 0.
Here x̄ ∈ Ω and xki → x̄ as above. Since F (x̄) = f (x̄) + g(x̄) and since f is continuous, all
we need to prove is that
lim_{i→∞} g(xki ) = g(x̄).
Note that yk+1 equals xk if F (xk ) ≤ F (vk ), and equals vk = xk + βk (xk − xk−1 ) if F (xk ) > F (vk ). In either case we have ∥xk − xk+1 ∥ ≤ ∥yk+1 − xk+1 ∥ + βk ∥xk − xk−1 ∥ (in the former case yk+1 = xk , so the second term is not even needed), which yields the estimate on the partial sums:
    ∑_{i=1}^{k} ∥xi − xi+1 ∥ ≤ ∑_{i=1}^{k} ∥yi+1 − xi+1 ∥ + ∑_{i=1}^{k} βi ∥xi − xi−1 ∥.
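The passage to the next display rests on the index shift
    ∑_{i=1}^{k} βi ∥xi − xi−1 ∥ = β1 ∥x1 − x0 ∥ + ∑_{i=1}^{k−1} βi+1 ∥xi+1 − xi ∥,
after which the shifted sum is moved to the left-hand side and the remaining nonnegative term ∥xk − xk+1 ∥ is dropped.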
It turns out that
    ∑_{i=1}^{k−1} (1 − βi+1 ) ∥xi − xi+1 ∥ ≤ ∑_{i=1}^{k} ∥yi+1 − xi+1 ∥ + β1 ∥x1 − x0 ∥.
Since βi+1 = 1/(i + 2), we have 1 − βi+1 = (i + 1)/(i + 2) ≥ 2/3 for i ≥ 1. Consequently, we derive from the last inequality that
    ∑_{i=1}^{∞} ∥xi − xi+1 ∥ ≤ (3/2) ( ∑_{i=1}^{∞} ∥yi+1 − xi+1 ∥ + ∥x1 − x0 ∥ ) < ∞,
and (3.14) is proved. □
Acknowledgement. We are grateful to the reviewers for their helpful suggestions and comments, which improved the presentation of this manuscript.
REFERENCES
[1] Attouch, H., Bolte, J., Redont, P. and Soubeyran, A., Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality, Math. Oper. Res., 35 (2010), No. 2, 438–457
[2] Attouch, H., Bolte, J. and Svaiter, B. F., Convergence of descent methods for semi-algebraic and tame problems:
proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods, Math. Program., 137
(2013), 91–129
[3] Beck, A. and Teboulle, M., A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J.
Imaging Sci., 2 (2009), No. 1, 183–202
[4] Bolte, J., Daniilidis, A. and Lewis, A., The Łojasiewicz inequality for nonsmooth sub-analytic functions with
applications to subgradient dynamical systems, SIAM J. Optim., 17 (2007), 1205–1223
[5] Bolte, J., Sabach, S. and Teboulle, M., Proximal alternating linearized minimization for nonconvex and nonsmooth
problems, Math. Program., Ser. A , 146 (2014), 459–494
[6] Combettes, P. L. and Wajs, V. R., Signal recovery by proximal forward-backward splitting, Multiscale Model. Simul.,
4 (2005), No. 4, 1168–1200
[7] Frankel, P., Garrigos, G. and Peypouquet, J., Splitting methods with variable metric for Kurdyka-Łojasiewicz functions and general convergence rates, J. Optim. Theory Appl., 165 (2015), No. 3, 874–900
[8] Kurdyka, K., On gradients of functions definable in o-minimal structures, Ann. Inst. Fourier (Grenoble), 48
(1998), No. 3, 769–783
[9] Łojasiewicz, S., Une propriété topologique des sous-ensembles analytiques réels, in: Les Équations aux Dérivées Partielles, Éditions du Centre National de la Recherche Scientifique, Paris, 1963, pp. 87–89
[10] Li, Q., Zhou, Y., Liang, Y. and Varshney, P. K., Convergence analysis of proximal gradient with momentum for
nonconvex optimization, in Proceedings of the 34th International Conference on Machine Learning, Sydney,
Australia, PMLR 70, 2017
[11] Nesterov, Y. E., Introductory Lectures on Convex Optimization: A Basic Course, Kluwer Academic Publishers, Massachusetts, 2004
[12] Rockafellar, R. T. and Wets, R. J.-B., Variational Analysis, Grundlehren der Mathematischen Wissenschaften,
vol. 317, Springer, Berlin, 1998
DEPARTMENT OF MATHEMATICS
SCHOOL OF SCIENCE
HANGZHOU DIANZI UNIVERSITY
HANGZHOU 310018, CHINA
Email address: [email protected]
Email address: [email protected]