1 Introduction
The purpose of this paper is to study the convergence properties of a variant of the proximal
forward-backward splitting method for solving the following optimization problem:
min f (x) + g(x) s.t. x ∈ H, (1)
where H is a nontrivial real Hilbert space, and f : H → R̄ := R ∪ {+∞} and g : H → R̄
are two proper lower semicontinuous and convex functions. We are interested in the case
where both functions f and g are nondifferentiable, and when the domain of f contains
the domain of g. The solution set of this problem will be denoted by S∗ , which is a closed
and convex subset of the domain of g. Problem (1) has recently received much attention
from the optimization community due to its broad applications to several different areas
such as control, signal processing, system identification, machine learning and restoration
of images; see, for instance, [18, 19, 24, 32] and the references therein.
A special case of problem (1) is the nonsmooth constrained optimization problem, taking
g = δC where δC is the indicator function of a nonempty closed and convex set C in H,
defined by δC (y) := 0, if y ∈ C and +∞, otherwise. Then, problem (1) reduces to the
constrained minimization problem
min f (x) s.t. x ∈ C. (2)
Another important case of problem (1), which has attracted much interest in signal denoising
and data mining, is the following optimization problem with ℓ1-regularization

min f(x) + λ‖x‖_1  s.t. x ∈ H,  (3)

where λ > 0 and the norm ‖·‖_1 is used to induce sparsity in the solutions. Moreover,
problem (3) covers the important and well-studied ℓ1-regularized least squares minimization
problem, when H = R^n and f(x) = ‖Ax − b‖_2², where A ∈ R^{m×n}, m << n, and
b ∈ R^m, which is just a convex approximation of the very famous ℓ0-minimization problem;
see [12]. Recently, this problem has become popular in signal processing and statistical
inference; see, for instance, [23, 43].
We focus here on the so-called proximal forward-backward splitting iteration [32], which
contains a forward gradient step of f (an explicit step) followed by a backward proximal
step of g (an implicit step). The main idea of our approach consists of replacing, in the
forward step of the proximal forward-backward splitting iteration, the gradient of f by a
subgradient of f (note that here f is assumed nondifferentiable in general). In the particular
case that g is the indicator function, the proposed iteration reduces to the classical projected
subgradient iteration.
To describe and motivate our iteration, we first recall the definition of the so-called proximal
operator prox_g : H → H associated to a proper lower semicontinuous convex function g,
where prox_g(z), for z ∈ H, is the unique solution of the following strongly convex
optimization problem

min g(y) + (1/2)‖y − z‖²  s.t. y ∈ H.  (4)

Note that the norm ‖·‖ is induced by the inner product ⟨·, ·⟩ of H, i.e., ‖x‖ := √⟨x, x⟩
for all x ∈ H. The proximal operator prox_g is well-defined and has many attractive prop-
erties, e.g., it is continuous and firmly nonexpansive, i.e., for all x, y ∈ H, ‖prox_g(x) −
prox_g(y)‖² ≤ ‖x − y‖² − ‖[x − prox_g(x)] − [y − prox_g(y)]‖². This nice property can be
used to construct algorithms to solve optimization problems [39]; for other properties and
algebraic rules see [3, 18, 19]. If g = δ_C is the indicator function, the orthogonal projection
onto C, P_C(x) := {y ∈ C : ‖x − y‖ = dist(x, C)}, is the same as prox_{δ_C}(x) for all x ∈ H
[2]. For an exhaustive discussion about the evaluation of the proximity operator of a wide
variety of functions see Section 6 of [32]. Now, let us recall the definition of the subdifferential
operator ∂g : H ⇒ H, given by ∂g(x) := {w ∈ H : g(y) ≥ g(x) + ⟨w, y − x⟩, ∀ y ∈ H}.
We also recall the relation of the proximal operator prox_{αg} with the subdifferential oper-
ator ∂g, i.e., prox_{αg} = (Id + α∂g)^{−1}, and, as a direct consequence of the first-order
optimality condition of (4), we have the following useful inclusion:

(z − prox_{αg}(z))/α ∈ ∂g(prox_{αg}(z)),  (5)

for any z ∈ H and α > 0. The iteration proposed in this paper, called Proximal Subgradient
Splitting Method, is motivated by the well-known fact that x ∈ S∗ if and only if there exists
u ∈ ∂f(x) such that x = prox_{αg}(x − αu). Thus, the iteration generalizes the proximal
forward-backward splitting iteration for the differentiable case, as a fixed-point iteration of
the above equation, which is defined as follows: starting at x^0 belonging to the domain of
g, set

x^{k+1} = prox_{α_k g}(x^k − α_k u^k),  (6)

where u^k ∈ ∂f(x^k) and the stepsize α_k is positive for all k ∈ N. Iteration (6) recovers the
classical subgradient iteration [38], when g = 0, and the proximal point iteration [39], when
f = 0. Moreover, it covers important situations in which f is nondifferentiable, and it can
also be seen as a forward-backward Euler discretization of the subgradient flow differential
inclusion

ẋ(t) ∈ −∂[f(x(t)) + g(x(t))],

with variable x : R_+ → H; see [32]. Actually, if the derivative on the left side is replaced by
the divided difference (x^{k+1} − x^k)/α_k, then the discretization obtained is (x^k − x^{k+1})/α_k ∈
∂f(x^k) + ∂g(x^{k+1}), which is the proximal subgradient iteration (6).
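To make the forward-backward structure of (6) concrete, the following minimal Python sketch, written for the ℓ1-regularized model (3) with H = R^n, performs one iteration: an explicit subgradient step on f followed by the proximal (soft-thresholding) step on g = λ‖·‖_1. The soft-thresholding formula is the standard proximal operator of a multiple of ‖·‖_1; the data A, b, the parameter λ, the subgradient oracle and the stepsize are hypothetical placeholders, not objects from the paper.

```python
import numpy as np

def prox_l1(z, tau):
    """Proximal operator of tau*||.||_1 at z (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def pss_step(x, subgrad_f, lam, alpha):
    """One proximal subgradient splitting step (6) for min f(x) + lam*||x||_1:
    forward (explicit) subgradient step on f, then backward (implicit) prox step on g."""
    u = subgrad_f(x)                       # u^k in the subdifferential of f at x^k
    return prox_l1(x - alpha * u, alpha * lam)

# Illustrative usage with the nonsmooth choice f(x) = ||Ax - b||_1,
# for which A^T sign(Ax - b) is a subgradient.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 50)), rng.standard_normal(20)
subgrad_f = lambda x: A.T @ np.sign(A @ x - b)
x = np.zeros(50)
x = pss_step(x, subgrad_f, lam=0.1, alpha=0.01)
```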
The nondifferentiability of the function f has a direct impact on the computational effort,
and the importance of such problems, when f is nonsmooth, is underlined by the fact that
they occur frequently in applications. Nondifferentiability arises, for instance, in the problem of
minimizing the total variation of a signal over a convex set, in the problem of minimizing
the sum of two set-distance functions, in problems involving maxima of convex functions,
in Dantzig selector-type problems, in the non-Gaussian image denoising problem and in
Tikhonov regularization problems with L1 norms, among others; see, for instance, [4, 13, 17,
26]. The iteration of the proximal subgradient splitting method, proposed in (6), can be
applied in these important instances, extending the classical subgradient iteration to more
general problems such as (3). In problem (1), f is usually assumed to be differentiable, as in [35],
which is not necessarily the case in this work. Moreover, the convergence of iteration (6)
to a solution of (1) has been established in the literature when the gradient of f is globally
Lipschitz continuous and the stepsizes α_k, k ∈ N, are chosen very small, i.e., for
all k, α_k is less than some constant related to the Lipschitz constant of the gradient of
f; see, for instance, [19]. Recently, for the case where f is continuously differentiable but the Lipschitz
constant is not available, it has been shown that the steplengths can be chosen using backtracking procedures; see
[6, 10, 32, 35].
It is important to mention that the forward-backward iteration also finds applications in
solving more general problems, like the variational inequality and inclusion problems; see,
for instance, [9, 11, 14, 15, 42] and the references therein. On the other hand, the standard
convergence analysis of this iteration, for solving these general problems, requires at least
a co-coercivity assumption of the operator and the stepsizes to lie within a suitable inter-
val; see, for instance, Theorem 25.8 of [3]. Note that co-coercive operators are monotone
and Lipschitz continuous, but the converse does not hold in general; see [44]. However, for
gradients of lower semicontinuous, proper and convex functions, co-coercivity is equivalent
to the global Lipschitz continuity assumption. This nice and surprising fact, which
is strongly used in the convergence analysis of the proximal forward-backward method
for problem (1), when f is differentiable, is known as the Baillon-Haddad Theorem; see
Corollary 18.16 of [3].
The main aim of this work is to remove the differentiability assumption on f in the
forward-backward splitting method, extending the classical projected subgradient method
and containing, as a particular case, a new proximal subgradient iteration for more general
problems.
This work is organized as follows. The next subsection provides our notations and
assumptions, and some preliminaries results that will be used in the remainder of this
paper. The proximal subgradient splitting method and its weak convergence are analyzed by
choosing different stepsizes in Section 2. Finally, Section 3 gives some concluding remarks.
In this section, we present our assumptions, classical definitions and some results needed
for the convergence analysis of the proposed method.
We start by recalling some definitions and notation used in this paper, which are standard
and follow from [3, 32]. Throughout this paper, we write p := q to indicate that p is
defined to be equal to q. We write N for the nonnegative integers {0, 1, 2, . . .} and recall
that the extended-real number system is R̄ := R ∪ {+∞}. The closed ball centered at x ∈ H
with radius γ > 0 will be denoted by B[x; γ], i.e., B[x; γ] := {y ∈ H : ‖y − x‖ ≤ γ}. The
domain of any function h : H → R̄, denoted by dom(h), is defined as dom(h) := {x ∈ H :
h(x) < +∞}. The optimal value of problem (1) will be denoted by s∗ := inf{(f + g)(x) :
x ∈ H}, noting that when S∗ ≠ ∅, s∗ = min{(f + g)(x) : x ∈ H} = (f + g)(x∗) for any
x∗ ∈ S∗. Finally, ℓ¹(N) denotes the set of summable sequences in [0, +∞).
Throughout this paper we assume the following:
A1. ∂f is bounded on bounded subsets of the domain of g, i.e., there exists ζ = ζ(V) > 0
such that ∂f(x) ⊆ B[0; ζ] for all x ∈ V, where V is any bounded and closed subset of
dom(g).
A2. ∂g has bounded elements on the domain of g, i.e., there exists ρ ≥ 0 such that
∂g(x) ∩ B[0; ρ] ≠ ∅ for all x ∈ dom(g).
In connection with Assumption A1, we recall that ∂f is locally bounded on its open
domain. In finite-dimensional spaces, this result implies that A1 always holds when dom(f) is
open. A widely used sufficient condition for A1 is the Lipschitz continuity of f on dom(g).
Furthermore, the boundedness of the subgradients is crucial for the convergence analysis of
many classical subgradient methods in Hilbert spaces and has been widely considered in
the literature; see, for instance, [1, 8, 9, 38].
Regarding Assumption A2, we emphasize that it holds trivially for important instances
of problem (1), e.g., problems (2) and (3), because ∂δ_C(x) = N_C(x) and ∂‖x‖_1 = {u ∈ H :
‖u‖_∞ ≤ 1, ⟨u, x⟩ = ‖x‖_1}, respectively; it also holds when dom(g) is a bounded set or when
H is a finite-dimensional space. Note that Assumption A2 allows instances where ∂g(x) is an
unbounded set, as is the case when g is the indicator function. It is an existence
condition, which is in general weaker than A1.
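As a hedged illustration of Assumption A2 for problem (3) in the finite-dimensional case H = R^n (an assumption of this snippet only), the scaled sign vector is always a subgradient of g(x) = λ‖x‖_1 with norm at most λ√n, so ρ = λ√n works as the bound in A2:

```python
import numpy as np

def bounded_subgradient_l1(x, lam):
    """Return an element of the subdifferential of g(x) = lam*||x||_1 at x (H = R^n).

    Each coordinate of sign(x) lies in [-1, 1] and <sign(x), x> = ||x||_1,
    so lam*sign(x) is a subgradient whose Euclidean norm is at most lam*sqrt(n)."""
    return lam * np.sign(x)

x = np.array([1.5, 0.0, -2.0])
w = bounded_subgradient_l1(x, lam=0.5)
assert np.linalg.norm(w) <= 0.5 * np.sqrt(x.size)  # rho = lam*sqrt(n) suffices in A2
```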
Let us end the section by recalling the well-known concepts of quasi-Fejér and
Fejér convergence.
The definition originates in [22] and has been elaborated further in [16]. In the following we present two
well-known facts for quasi-Fejér convergent sequences.
Proof Item (a) follows from Proposition 3.3(i) of [16], and Item (b) follows from Theorem
3.8 of [16].
In this section we propose the proximal subgradient splitting method, extending the classical
subgradient iteration. We prove that the sequence of points generated by the proposed
method converges weakly to a solution of (1) using different strategies for choosing the
stepsizes. Moreover, we provide a complexity analysis for the generated sequence.
The method is formally stated as follows.

PSS Method. Take x^0 ∈ dom(g). Given x^k, choose u^k ∈ ∂f(x^k) and a stepsize α_k > 0, and compute x^{k+1} = prox_{α_k g}(x^k − α_k u^k); if x^{k+1} = x^k, stop.

If PSS Method stops at step k, then x^k = prox_{α_k g}(x^k − α_k u^k) with u^k ∈ ∂f(x^k), implying
that x^k is a solution of problem (1). From now on, we therefore assume that PSS Method gener-
ates an infinite sequence (x^k)_{k∈N}. Moreover, it follows directly from (6) that the sequence
(x^k)_{k∈N} belongs to dom(g).
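As an illustration of the statement above, here is a minimal Python sketch of the PSS loop; the oracles subgrad_f and prox_g, the stepsize rule and the tolerance-based stopping test are assumptions of this sketch, not part of the paper's formal description.

```python
import numpy as np

def pss_method(x0, subgrad_f, prox_g, stepsize, max_iter=1000, tol=1e-12):
    """Proximal subgradient splitting iteration (6): x^{k+1} = prox_{a_k g}(x^k - a_k u^k).

    subgrad_f(x): returns some u in the subdifferential of f at x.
    prox_g(z, a): returns prox_{a g}(z).
    stepsize(k, u): returns the positive stepsize a_k (constant, exogenous, Polyak-type, ...).
    """
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        u = subgrad_f(x)
        a = stepsize(k, u)
        x_new = prox_g(x - a * u, a)
        if np.linalg.norm(x_new - x) <= tol:   # fixed point: x solves problem (1)
            return x_new
        x = x_new
    return x
```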
Before the formal analysis of the convergence properties of PSS Method, we discuss
the necessity of taking a (forward) subgradient step with respect to f instead of another
(backward) proximal step.
Remark 2.1 Evaluating the proximal operator of f requires solving a strongly convex
minimization problem of the form (4). Thus, in the context of problem (1), we assume that it is
hard to evaluate the proximal operator of f, ruling out the use of the standard
and very powerful iteration known as the Douglas-Rachford splitting method presented in [17].
Such situations appear mainly when f has a complicated algebraic expression, so that
it may be impossible to solve subproblem (4) explicitly or efficiently. Indeed, very often
in applications the proximity operator is not available in closed form,
and ad hoc algorithms or approximation procedures have to be used to compute prox_{αf}. This
happens, for instance, when applying proximal methods to image deblurring with total variation
[5], or to structured sparsity regularization problems in machine learning and inverse
problems [31]. A classical problem of the form (1), for which a subgradient of f is easily
available but prox_f has no explicit formula, is the dual formulation of the following
constrained convex problem:

min h_0(y) subject to h_i(y) ≤ 0 (i = 1, . . . , n),  (7)
This last problem is a particular case of (1), obtained by taking g = δ_{R^n_+} + λg_0. Note that if
dom(g_0) ⊆ R^n_+ then g = λg_0.
Thus, PSS Method uses the proximal operator of g and an explicit subgradient step on f
(i.e., the proximal operator of f is never evaluated), which is, in general, much easier
to implement than the proximal operator of f + g or of f, as required by the standard proximal
point iteration or the Douglas-Rachford algorithm, respectively, for solving nonsmooth
problems such as (1); see, for instance, [17]. Furthermore, note that in our case the subgradient
iteration for the sum f + g is not possible, because the domains of f and g are not the whole
space.
In the following we prove a crucial property of the iterates generated by PSS Method.
Lemma 2.1 Let (x^k)_{k∈N} and (u^k)_{k∈N} be the sequences generated by PSS Method. Then,
for all k ∈ N, x ∈ dom(g) and w^k ∈ ∂g(x^k),

‖x^{k+1} − x‖² ≤ ‖x^k − x‖² + 2α_k[(f + g)(x) − (f + g)(x^k)] + α_k²‖u^k + w^k‖².

Proof Since x^{k+1} = prox_{α_k g}(x^k − α_k u^k), inclusion (5) with z = x^k − α_k u^k gives
w̄^{k+1} := (x^k − x^{k+1})/α_k − u^k ∈ ∂g(x^{k+1}). Then, for any x ∈ dom(g),

2⟨x^k − x^{k+1}, x^k − x⟩ = 2α_k⟨u^k, x^k − x⟩ + 2α_k⟨(x^k − x^{k+1})/α_k − u^k, x^{k+1} − x⟩ + 2α_k⟨(x^k − x^{k+1})/α_k − u^k, x^k − x^{k+1}⟩
= 2α_k⟨u^k, x^k − x⟩ + 2α_k⟨w̄^{k+1}, x^{k+1} − x⟩ + 2α_k⟨u^k, x^{k+1} − x^k⟩ + 2‖x^k − x^{k+1}‖².

Now, using again that (x^k − x^{k+1})/α_k − u^k = w̄^{k+1} ∈ ∂g(x^{k+1}) and the convexity of g and f,
the above equality leads to

2⟨x^k − x^{k+1}, x^k − x⟩ ≥ 2α_k[f(x^k) − f(x) + g(x^{k+1}) − g(x) + ⟨u^k, x^{k+1} − x^k⟩] + 2‖x^k − x^{k+1}‖²
= 2α_k[(f + g)(x^k) − (f + g)(x) + g(x^{k+1}) − g(x^k) + ⟨u^k, x^{k+1} − x^k⟩] + 2α_k²‖u^k + w̄^{k+1}‖²
≥ 2α_k[(f + g)(x^k) − (f + g)(x) + ⟨w^k + u^k, x^{k+1} − x^k⟩] + 2α_k²‖u^k + w̄^{k+1}‖²
= 2α_k[(f + g)(x^k) − (f + g)(x)] − 2α_k²⟨u^k + w^k, u^k + w̄^{k+1}⟩ + 2α_k²‖u^k + w̄^{k+1}‖²,

using that g(x^{k+1}) − g(x^k) ≥ ⟨w^k, x^{k+1} − x^k⟩ for w^k ∈ ∂g(x^k) and that x^{k+1} − x^k = −α_k(u^k + w̄^{k+1}). Hence,

‖x^{k+1} − x‖² = ‖x^k − x‖² − 2⟨x^k − x^{k+1}, x^k − x⟩ + ‖x^k − x^{k+1}‖²
≤ ‖x^k − x‖² + 2α_k[(f + g)(x) − (f + g)(x^k)] + 2α_k²⟨u^k + w^k, u^k + w̄^{k+1}⟩ − α_k²‖u^k + w̄^{k+1}‖²
= ‖x^k − x‖² + 2α_k[(f + g)(x) − (f + g)(x^k)] + α_k²‖u^k + w^k‖² − α_k²‖w^k − w̄^{k+1}‖².

The result follows since the last term is nonpositive.
Since subgradient methods, like the method proposed here, are not descent methods, it is
common to keep track of the best point found so far, i.e., the one with minimum function
value among the iterates. At each step, we set it recursively as (f + g)^0_best := (f + g)(x^0)
and

(f + g)^k_best := min{(f + g)^{k−1}_best, (f + g)(x^k)},  (9)

for all k. Since ((f + g)^k_best)_{k∈N} is a decreasing sequence, it has a limit (which can be −∞).
When the function f is differentiable and its gradient Lipschitz continuous, it is possible to
prove the complexity of the iterates generated by PSS Method; see [35]. In our setting (f
is not necessarily differentiable) we expect, of course, slower convergence.
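As a small illustration, and assuming an objective oracle F for f + g is available (a hypothetical name), the recursion (9) amounts to the following bookkeeping in Python:

```python
def update_best(F_best, F_xk):
    """Recursion (9): (f+g)^k_best = min{(f+g)^{k-1}_best, (f+g)(x^k)}."""
    return min(F_best, F_xk)

# inside the iteration loop:
#   F_best = F(x0)
#   ... for each new iterate x_k:
#   F_best = update_best(F_best, F(x_k))
```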
Next we present a convergence rate result for the sequence of best functional values
((f + g)^k_best)_{k∈N} towards min{(f + g)(x) : x ∈ H}.
Lemma 2.2 Let ((f + g)^k_best)_{k∈N} be the sequence defined by (9). If S∗ ≠ ∅ then, for all
k ∈ N,

(f + g)^k_best − min_{x∈H}(f + g)(x) ≤ ([dist(x^0, S∗)]² + C_k Σ_{i=0}^k α_i²)/(2 Σ_{i=0}^k α_i),

where C_k := max{‖u^i + w^i‖² : 0 ≤ i ≤ k} with w^i ∈ ∂g(x^i) (i = 0, . . . , k) arbitrary.
Proof Define x∗ := P_{S∗}(x^0). Note that x∗ exists because S∗ is a nonempty closed and
convex subset of H. By applying Lemma 2.1 k + 1 times, for i ∈ {0, 1, . . . , k}, at x∗ ∈ S∗, we
get

‖x^{k+1} − x∗‖² ≤ ‖x^k − x∗‖² + 2α_k[(f + g)(x∗) − (f + g)(x^k)] + α_k²‖u^k + w^k‖²
≤ ‖x^0 − x∗‖² + 2 Σ_{i=0}^k α_i[(f + g)(x∗) − (f + g)(x^i)] + Σ_{i=0}^k α_i²‖u^i + w^i‖²
≤ [dist(x^0, S∗)]² + 2[min_{x∈H}(f + g)(x) − (f + g)^k_best] Σ_{i=0}^k α_i + C_k Σ_{i=0}^k α_i²,  (10)

where (f + g)^k_best is defined by (9) and the result follows after simple algebra.
Next we establish the rate of convergence of the objective values at the ergodic
sequence (x̄^k)_{k∈N} associated with (x^k)_{k∈N}, which is defined recursively as x̄^0 = x^0 and, given σ_0 = α_0
and σ_k = σ_{k−1} + α_k,

x̄^k = (1 − α_k/σ_k) x̄^{k−1} + (α_k/σ_k) x^k.

An easy induction shows that σ_k = Σ_{i=0}^k α_i and

x̄^k = (1/σ_k) Σ_{i=0}^k α_i x^i,  (11)

for all k ∈ N.
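A minimal Python sketch of the recursive ergodic update above, assuming the iterates and stepsizes are produced one at a time; the closing check confirms that the recursion reproduces the explicit weighted average (11) without storing past iterates.

```python
import numpy as np

class ErgodicAverage:
    """Maintains x_bar^k = (1/sigma_k) * sum_{i<=k} alpha_i * x^i via the recursion
    x_bar^k = (1 - alpha_k/sigma_k) * x_bar^{k-1} + (alpha_k/sigma_k) * x^k."""

    def __init__(self):
        self.sigma = 0.0
        self.x_bar = None

    def update(self, x_k, alpha_k):
        self.sigma += alpha_k
        t = alpha_k / self.sigma
        self.x_bar = x_k.copy() if self.x_bar is None else (1.0 - t) * self.x_bar + t * x_k
        return self.x_bar

# quick consistency check against the explicit weighted average (11)
rng = np.random.default_rng(1)
xs = [rng.standard_normal(3) for _ in range(5)]
alphas = [0.5, 0.4, 0.3, 0.2, 0.1]
avg = ErgodicAverage()
for x_k, a_k in zip(xs, alphas):
    x_bar = avg.update(x_k, a_k)
direct = sum(a * x for a, x in zip(alphas, xs)) / sum(alphas)
assert np.allclose(x_bar, direct)
```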
The following result is similar to Lemma 2.2, considering now the ergodic sequence
defined by (11).
Lemma 2.3 Let (x̄^k)_{k∈N} be the ergodic sequence defined by (11). If S∗ ≠ ∅, then

(f + g)(x̄^k) − min_{x∈H}(f + g)(x) ≤ ([dist(x^0, S∗)]² + C_k Σ_{i=0}^k α_i²)/(2 Σ_{i=0}^k α_i),

where C_k = max{‖u^i + w^i‖² : 0 ≤ i ≤ k} with w^i ∈ ∂g(x^i) (i = 0, . . . , k) arbitrary.
Proof Proceeding as in the proof of Lemma 2.2 until inequality (10), and dividing by
2 Σ_{i=0}^k α_i, we get

(1/σ_k) Σ_{i=0}^k α_i[(f + g)(x^i) − min_{x∈H}(f + g)(x)] ≤ (1/(2σ_k))([dist(x^0, S∗)]² − ‖x^{k+1} − x∗‖²) + (C_k/(2σ_k)) Σ_{i=0}^k α_i²
≤ (1/(2σ_k))([dist(x^0, S∗)]² + C_k Σ_{i=0}^k α_i²),  (12)

where σ_k := Σ_{i=0}^k α_i. Using the convexity of f + g, after noting that α_i/σ_k ∈ [0, 1] for all i ∈
{0, 1, . . . , k} and Σ_{i=0}^k α_i/σ_k = 1, together with (11) in the above inequality (12), the result follows.
Next we focus on constant stepsizes, a choice motivated by our interest
in quantifying the progress of the proposed method towards an approximate solution.
Proof If we consider constant stepsizes, i.e., α_k = α for all k ∈ N, then the optimal rate
is obtained when α = dist(x^0, S∗)/(√C_k · √(k + 1)), by minimizing the right-hand side of the bounds in Lemmas 2.2
and 2.3.
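For completeness, here is a short sketch of the one-dimensional minimization behind this choice, writing d := dist(x^0, S∗) and C := C_k in the bound of Lemma 2.2 with α_i ≡ α:

```latex
\[
\phi(\alpha) := \frac{d^{2} + C\,(k+1)\,\alpha^{2}}{2\,(k+1)\,\alpha}
             = \frac{d^{2}}{2(k+1)\alpha} + \frac{C\,\alpha}{2},
\qquad
\phi'(\alpha) = 0 \;\Longrightarrow\;
\alpha^{\ast} = \frac{d}{\sqrt{C}\,\sqrt{k+1}},
\qquad
\phi(\alpha^{\ast}) = \frac{d\,\sqrt{C}}{\sqrt{k+1}} = O\!\left((k+1)^{-1/2}\right).
\]
```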
Note that under Assumption A2, C_k ≤ (max_{1≤i≤k} ‖u^i‖ + ρ)². Hence, when ∂f is bounded
on dom(g) (which, under our assumptions, occurs when dom(g) is bounded), Assumption A1 implies that C_k ≤ (ζ + ρ)² for all k ∈ N. In this case, our analysis shows that
the expected error of the iterates generated by PSS Method with constant stepsizes after
k iterations is O((k + 1)^{−1/2}). Hence, we can find an ε-solution of problem (1) within
O(ε^{−2}) iterations. Of course, this is worse than the rate O(k^{−1}) and the O(ε^{−1}) iterations of
the proximal forward-backward iteration for differentiable convex f with Lipschitz
continuous gradient; see, for instance, [35]. However, as shown in Section 3.2.1, Theorem 3.2.1 of [37], the worst-case
error after k iterations of the classical subgradient iteration for general nonsmooth problems is indeed attained at the order O((k + 1)^{−1/2}).
In this subsection we analyze the convergence of PSS Method using exogenous stepsizes,
i.e., the positive exogenous sequence of stepsizes (α_k)_{k∈N} satisfies α_k = β_k/η_k, where
η_k := max{1, ‖u^k‖} for all k, and

Σ_{k=0}^∞ β_k² < +∞ and Σ_{k=0}^∞ β_k = +∞.  (13)
Proof The result follows by noting that η_k ≥ ‖u^k‖ and η_k ≥ 1 for all k ∈ N, and taking
w^k ∈ ∂g(x^k) such that ‖w^k‖ ≤ ρ for all k ∈ N, in view of Assumption A2. Then,

‖u^k + w^k‖²/η_k² ≤ ‖u^k‖²/η_k² + 2(‖u^k‖/η_k²)‖w^k‖ + ‖w^k‖²/η_k² ≤ 1 + 2ρ + ρ².

Now, Lemma 2.1 implies the desired result.
When the solution set of problem (1) is nonempty, S_lev(x^0) ≠ ∅ because S∗ ⊆ S_lev(x^0).
Next, we prove the two main results of this subsection.
Theorem 2.6 Let (x^k)_{k∈N} be the sequence generated by PSS Method with exogenous
stepsizes. If there exists x̄ ∈ S_lev(x^0), then:
(a) (x^k)_{k∈N} is quasi-Fejér convergent to L_{f+g}(x̄);
(b) lim_{k→∞}(f + g)(x^k) = (f + g)(x̄);
(c) (x^k)_{k∈N} converges weakly to some x̃ ∈ L_{f+g}(x̄).
Proof By assumption there exists x̄ ∈ S_lev(x^0), i.e., (f + g)(x̄) ≤ (f + g)(x^k) for all k ∈ N.
(a) To show that (x^k)_{k∈N} is quasi-Fejér convergent to L_{f+g}(x̄) (which is nonempty
because x̄ ∈ L_{f+g}(x̄)), we use Corollary 2.5: for any x ∈ L_{f+g}(x̄) ⊆ dom(g), it establishes
that ‖x^{k+1} − x‖² ≤ ‖x^k − x‖² + (1 + 2ρ + ρ²)β_k² for all k ∈ N. Thus, (x^k)_{k∈N}
is quasi-Fejér convergent to L_{f+g}(x̄).
(b) The sequence (x^k)_{k∈N} is bounded by Fact 1.1a, and hence it has accumulation
points in the sense of the weak topology. To prove that

lim_{k→∞}(f + g)(x^k) = (f + g)(x̄),  (15)

we first note that, for all m ∈ N,

Σ_{k=0}^m β_k[(f + g)(x^k) − (f + g)(x̄)] ≤ (1/2)(‖x^0 − x̄‖² − ‖x^{m+1} − x̄‖²) + (1/2)(1 + 2ρ + ρ²) Σ_{k=0}^m β_k².  (16)
Then, (16) together with (13) implies that there exists a subsequence
((f + g)(x^{i_k}))_{k∈N} of ((f + g)(x^k))_{k∈N} such that

lim_{k→∞}[(f + g)(x^{i_k}) − (f + g)(x̄)] = 0.  (17)

Indeed, if (17) does not hold, then there exist σ > 0 and k̃ ∈ N such that
(f + g)(x^k) − (f + g)(x̄) ≥ σ for all k ≥ k̃, and using (16), we get

+∞ > Σ_{k=k̃}^∞ β_k[(f + g)(x^k) − (f + g)(x̄)] ≥ σ Σ_{k=k̃}^∞ β_k,

in contradiction with (13).
Applying Corollary 2.5 with x = x^k, we have ‖x^k − x^{k+1}‖ ≤ √(1 + 2ρ + ρ²)·β_k, which together with (18) implies
that

ϕ_k − ϕ_{k+1} ≤ √(1 + 2ρ + ρ²)·(ζ + ρ)β_k =: ρ̄β_k  (19)
for all k ∈ N, where ϕ_k := (f + g)(x^k) − (f + g)(x̄). From (17), there exists a subsequence (ϕ_{i_k})_{k∈N} of (ϕ_k)_{k∈N} such that
lim_{k→∞} ϕ_{i_k} = 0. If the claim given in (15) does not hold, then there exist some δ > 0
and a subsequence (ϕ_{ℓ_k})_{k∈N} of (ϕ_k)_{k∈N} such that ϕ_{ℓ_k} ≥ δ for all k ∈ N. Thus, we can
construct a third subsequence (ϕ_{j_k})_{k∈N} of (ϕ_k)_{k∈N}, where the indices j_k are chosen in
the following way:

j_0 := min{m ≥ 0 | ϕ_m ≥ δ},
j_{2k+1} := min{m ≥ j_{2k} | ϕ_m ≤ δ/2},
j_{2k+2} := min{m ≥ j_{2k+1} | ϕ_m ≥ δ},

for each k. The existence of the subsequences (ϕ_{i_k})_{k∈N} and (ϕ_{ℓ_k})_{k∈N} of (ϕ_k)_{k∈N} guaran-
tees that the subsequence (ϕ_{j_k})_{k∈N} of (ϕ_k)_{k∈N} is well-defined for all k ≥ 0. It follows
from the definition of j_k that

ϕ_m ≥ δ for j_{2k} ≤ m ≤ j_{2k+1} − 1,  (20)
ϕ_m ≤ δ/2 for j_{2k+1} ≤ m ≤ j_{2k+2} − 1,

for all k, and hence

ϕ_{j_{2k}} − ϕ_{j_{2k+1}} ≥ δ/2,  (21)
for all k ∈ N. In view of (16), and recalling that ϕ_k = (f + g)(x^k) − (f + g)(x̄) ≥ 0
for all k ∈ N,

+∞ > Σ_{k=0}^∞ β_k ϕ_k ≥ Σ_{k=0}^∞ Σ_{m=j_{2k}}^{j_{2k+1}−1} β_m ϕ_m ≥ (δ/2) Σ_{k=0}^∞ Σ_{m=j_{2k}}^{j_{2k+1}−1} β_m
= (δ/(2ρ̄)) Σ_{k=0}^∞ Σ_{m=j_{2k}}^{j_{2k+1}−1} ρ̄β_m ≥ (δ/(2ρ̄)) Σ_{k=0}^∞ Σ_{m=j_{2k}}^{j_{2k+1}−1} (ϕ_m − ϕ_{m+1}) = (δ/(2ρ̄)) Σ_{k=0}^∞ (ϕ_{j_{2k}} − ϕ_{j_{2k+1}})
≥ (δ/(2ρ̄)) Σ_{k=0}^∞ δ/2 = +∞,

where we have used (20) in the second inequality, (19) in the third inequality and
(21) in the last one. Thus, lim_{k→∞}(f + g)(x^k) = (f + g)(x̄), establishing (b).
(c) Let x̃ be a weak accumulation point of (x^k)_{k∈N}, which exists by Item (a) and
Fact 1.1a. From now on, we use (x^{i_k})_{k∈N} to denote any subsequence of (x^k)_{k∈N} that
converges weakly to x̃. Since f + g is weakly lower semicontinuous, using (15), we
get

(f + g)(x̃) ≤ lim inf_{k→∞}(f + g)(x^{i_k}) = lim_{k→∞}(f + g)(x^k) = (f + g)(x̄),

implying that x̃ ∈ L_{f+g}(x̄). The result now follows from Fact 1.1b and Item (a).
Theorem 2.7 Let (x^k)_{k∈N} be the sequence generated by PSS Method with exogenous
stepsizes. Then:
(a) lim inf_{k→∞}(f + g)(x^k) = s∗;
(b) if S∗ ≠ ∅, then lim_{k→∞}(f + g)(x^k) = min_{x∈H}(f + g)(x) and (x^k)_{k∈N} converges weakly to some x̃ ∈ S∗;
(c) if (x^k)_{k∈N} is bounded, then S∗ ≠ ∅.
Proof (a) Since (x^k)_{k∈N} ⊂ dom(g), we get s∗ ≤ lim inf_{k→∞}(f + g)(x^k). Suppose that
s∗ < lim inf_{k→∞}(f + g)(x^k). Hence, there exists x̂ such that

(f + g)(x̂) < lim inf_{k→∞}(f + g)(x^k).  (22)

It follows from (22) that there exists k̄ ∈ N such that (f + g)(x̂) ≤ (f + g)(x^k) for all
k ≥ k̄. Since k̄ is finite, we can assume without loss of generality that (f + g)(x̂) ≤
(f + g)(x^k) for all k ∈ N. Using the definition of S_lev(x^0), given in (14), we have that
x̂ ∈ S_lev(x^0). By Theorem 2.6b, lim_{k→∞}(f + g)(x^k) = (f + g)(x̂), in contradiction
with (22).
(b) Since S∗ ≠ ∅, take x∗ ∈ S∗ and note that this implies L_{f+g}(x∗) = S∗. Since
(x^k)_{k∈N} ⊂ dom(g), we get (f + g)(x∗) ≤ (f + g)(x^k) for all k ∈ N, implying that
x∗ ∈ S_lev(x^0). By applying Items (b) and (c) of Theorem 2.6 at x̄ = x∗, we get that
lim_{k→∞}(f + g)(x^k) = (f + g)(x∗) and that (x^k)_{k∈N} converges weakly to some x̃ ∈ S∗,
respectively.
(c) Assume that S∗ is empty but the sequence (x^k)_{k∈N} is bounded. Let (x^{i_k})_{k∈N} be a sub-
sequence of (x^k)_{k∈N} such that lim_{k→∞}(f + g)(x^{i_k}) = lim inf_{k→∞}(f + g)(x^k). Since
(x^{i_k})_{k∈N} is bounded, without loss of generality (i.e., refining (x^{i_k})_{k∈N} if necessary),
we may assume that (x^{i_k})_{k∈N} converges weakly to some x̄ ∈ dom(g). By the weak
lower semicontinuity of f + g on dom(g) and Item (a),

(f + g)(x̄) ≤ lim inf_{k→∞}(f + g)(x^{i_k}) = lim inf_{k→∞}(f + g)(x^k) = s∗,

implying that x̄ ∈ S∗, which contradicts the assumption that S∗ is empty.
For exogenous stepsizes, Theorem 2.7a guarantees the convergence of the functional values
to the optimal value of problem (1) in the sense that lim inf_{k→∞}(f + g)(x^k) = s∗, implying the convergence
of ((f + g)^k_best)_{k∈N}, defined in (9), to s∗. It is important to mention that in the proof
of the above two crucial results we have used an idea similar to one recently presented in [7] for a
different setting.
In the following we present a direct consequence of Lemmas 2.2 and 2.3, when the
stepsizes satisfy (13).
Corollary 2.8 Let (x̄^k)_{k∈N} be the ergodic sequence defined by (11) and (β_k)_{k∈N} as in (13). If
S∗ ≠ ∅, then, for all k ∈ N,

(f + g)^k_best − min_{x∈H}(f + g)(x) ≤ ζ·([dist(x^0, S∗)]² + (1 + 2ρ + ρ²) Σ_{i=0}^k β_i²)/(2 Σ_{i=0}^k β_i)

and

(f + g)(x̄^k) − min_{x∈H}(f + g)(x) ≤ ζ·([dist(x^0, S∗)]² + (1 + 2ρ + ρ²) Σ_{i=0}^k β_i²)/(2 Σ_{i=0}^k β_i),

where ζ > 0 and ρ ≥ 0 are as in Assumptions A1 and A2, respectively.
The above corollary shows that, if we assume existence of solutions, the expected error of
the iterates generated by PSS Method with the exogenous stepsizes (13) after k iterations
is O((Σ_{i=0}^k β_i)^{−1}). Since (β_k)_{k∈N} satisfies (13), the best performance of the iteration (in
terms of functional values) is achieved, for example, by taking β_k ≅ 1/k^r with r greater than 1/2
but close to this value, for all k.
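A hedged Python sketch of this recommendation: β_k = 1/(k + 1)^r with 1/2 < r ≤ 1 satisfies (13) (square-summable but not summable), and the resulting rule can be passed to the generic loop sketched earlier; the exponent r and the subgradient argument u are placeholders.

```python
import numpy as np

def exogenous_stepsize(r=0.51):
    """Return a stepsize rule alpha(k, u) = beta_k / max(1, ||u||) with
    beta_k = 1/(k+1)**r, which satisfies (13) whenever 1/2 < r <= 1."""
    def alpha(k, u):
        beta_k = 1.0 / (k + 1) ** r
        return beta_k / max(1.0, float(np.linalg.norm(u)))
    return alpha
```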
In this subsection we analyze the convergence of PSS Method using Polyak stepsizes. Choose
any w^k ∈ ∂g(x^k) and denote ρ_k := ‖w^k‖ for all k ∈ N. Then define, for all k ∈ N,

α_k = γ_k·[(f + g)(x^k) − s_k]/(‖u^k‖² + 2ρ_k‖u^k‖ + ρ_k²),  (24)
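A minimal sketch, assuming the target value s_k and the relaxation parameter γ_k are supplied externally, of how the stepsize (24) would be computed; current_value stands for (f + g)(x^k), and u, w are the chosen subgradients u^k ∈ ∂f(x^k) and w^k ∈ ∂g(x^k).

```python
import numpy as np

def polyak_stepsize(current_value, s_k, u, w, gamma_k=1.0):
    """Stepsize (24): gamma_k * ((f+g)(x^k) - s_k) / (||u^k||^2 + 2*rho_k*||u^k|| + rho_k^2).

    Assumes u^k and w^k are not both zero, so the denominator is positive."""
    rho_k = float(np.linalg.norm(w))
    nu = float(np.linalg.norm(u))
    return gamma_k * (current_value - s_k) / (nu ** 2 + 2.0 * rho_k * nu + rho_k ** 2)
```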
Corollary 2.9 Suppose that lim_{k→∞} s_k = s̃ ≥ s∗ and let x ∈ L_{f+g}(s̃) be arbitrary. Then,

‖x^{k+1} − x‖² ≤ ‖x^k − x‖² − γ(2 − γ)·[s_k − (f + g)(x^k)]²/(‖u^k‖² + 2ρ_k‖u^k‖ + ρ_k²),

for all k ∈ N.
Proof Take x ∈ L_{f+g}(s̃) = {x ∈ dom(g) : (f + g)(x) ≤ s̃}. Since (s_k)_{k∈N} is a monotone
decreasing sequence convergent to s̃, which is less than the function values of the iterates,

(f + g)(x^k) ≥ s_k ≥ s̃ ≥ (f + g)(x), ∀x ∈ L_{f+g}(s̃),  (26)

for all k ∈ N. Then, applying Lemma 2.1 and using (26), we get, for all k ∈ N,

‖x^{k+1} − x‖² ≤ ‖x^k − x‖² − 2γ_k·[s_k − (f + g)(x^k)][(f + g)(x) − (f + g)(x^k)]/(‖u^k‖² + 2ρ_k‖u^k‖ + ρ_k²) + γ_k²·[s_k − (f + g)(x^k)]²/(‖u^k‖² + 2ρ_k‖u^k‖ + ρ_k²)
≤ ‖x^k − x‖² − γ_k(2 − γ_k)·[s_k − (f + g)(x^k)]²/(‖u^k‖² + 2ρ_k‖u^k‖ + ρ_k²)
≤ ‖x^k − x‖² − γ(2 − γ)·[s_k − (f + g)(x^k)]²/(‖u^k‖² + 2ρ_k‖u^k‖ + ρ_k²),  (27)

where we used that x ∈ L_{f+g}(s̃), (24) and (26) in the second inequality. The result follows
from (27).
Now we prove the first main result of this subsection in the following theorem.
Theorem 2.10 Let (x k )k∈N be the sequence generated by PSS Method with αk as in (24).
If limk→∞ sk = s̃ ≥ s∗ and Lf +g (s̃) = ∅, then
(a) (x k )k∈N is Fejér convergent to Lf +g (s̃).
(b) limk→∞ (f + g)(x k ) = s̃.
(c) (x k )k∈N is weakly convergent to some x̃ ∈ Lf +g (s̃).
where the last inequality follows from Assumptions A1 and A2 (‖u^k‖ ≤ ζ and
ρ_k = ‖w^k‖ ≤ ρ for all k ∈ N). Summing (29) over k = 0 to m, we obtain

γ(2 − γ) Σ_{k=0}^m [s_k − (f + g)(x^k)]² ≤ ρ̂(‖x^0 − x‖² − ‖x^{m+1} − x‖²) ≤ ρ̂‖x^0 − x‖².

Taking limits as m goes to ∞, we get the desired result.
(c) From Item (b), if s̃ = lim_{k→∞} s_k then lim_{k→∞}(f + g)(x^k) = s̃. Let x̃ be a weak
accumulation point of (x^k)_{k∈N}, which exists by the boundedness of (x^k)_{k∈N}, a direct
consequence of Item (a). From now on, we denote by (x^{i_k})_{k∈N} any subsequence of
(x^k)_{k∈N} which converges weakly to x̃. Since f + g is weakly lower semicontinuous,
we get (f + g)(x̃) ≤ lim inf_{k→∞}(f + g)(x^{i_k}) = lim_{k→∞}(f + g)(x^k) = s̃, implying
that (f + g)(x̃) ≤ s̃ and thus x̃ ∈ L_{f+g}(s̃). The result follows from Fact 1.1b and
Item (a).
Before the analysis of the inconsistent case, in which s̃ = lim_{k→∞} s_k is strictly less than
s∗ = inf{(f + g)(x) : x ∈ H}, we present a useful corollary, a direct consequence
of Theorem 2.10, that shall be used for the analysis of this case s̃ < s∗. In the next corollary,
we show the special case when the optimal value s∗ is known and finite and the stepsize α_k
is defined by (25), i.e., for all k ∈ N,

α_k = γ_k·[(f + g)(x^k) − s∗]/(‖u^k‖² + 2ρ_k‖u^k‖ + ρ_k²),

where 0 < γ ≤ γ_k ≤ 2 − γ.
Corollary 2.11 Let (x^k)_{k∈N} be the sequence generated by PSS Method with α_k given by
(25). Assume that S∗ ≠ ∅. Then,
(a) (x^k)_{k∈N} is Fejér convergent to S∗.
(b) lim_{k→∞}(f + g)(x^k) = min_{x∈H}(f + g)(x).
(c) (x^k)_{k∈N} is weakly convergent to some x̃ ∈ S∗.
(d) lim inf_{k→∞} √(k + 1)·[(f + g)(x^k) − min_{x∈H}(f + g)(x)] = 0.
Σ_{k=k̄}^∞ [(f + g)(x^k) − min_{x∈H}(f + g)(x)]² ≥ δ² Σ_{k=k̄}^∞ 1/(k + 1) = +∞.  (30)

On the other hand, by substituting the expression for the stepsize α_k given by (25) into (29)
(with s_k = min_{x∈H}(f + g)(x) for all k ∈ N), we get, summing over k ≥ k̄,

Σ_{k=k̄}^∞ [(f + g)(x^k) − min_{x∈H}(f + g)(x)]² < +∞,

in contradiction with (30).
Lemma 2.12 Let (x^k)_{k∈N} be the sequence generated by PSS Method with α_k given by
(24). If lim_{k→∞} s_k = s̃ ≥ s∗ and L_{f+g}(s̃) ≠ ∅, then, for all k ∈ N,

(f + g)^k_best − s̃ ≤ √(D_k/(γ(2 − γ))) · dist(x^0, L_{f+g}(s̃))/√(k + 1),

where D_k := max{‖u^i‖² + 2ρ_i‖u^i‖ + ρ_i² : 1 ≤ i ≤ k} with ρ_i := ‖w^i‖ and w^i ∈ ∂g(x^i)
(i = 0, . . . , k) arbitrary. Moreover,
Proof Repeating the proof of Theorem 2.10, with x̃ := P_{L_{f+g}(s̃)}(x^0) ∈ L_{f+g}(s̃), until (28),
we obtain

(k + 1)[(f + g)^k_best − s̃]² ≤ Σ_{i=0}^k [(f + g)(x^i) − s_k]² ≤ (D_k/(γ(2 − γ)))·[dist(x^0, L_{f+g}(s̃))]²,

where D_k := max{‖u^i‖² + 2ρ_i‖u^i‖ + ρ_i² : 1 ≤ i ≤ k} with ρ_i = ‖w^i‖ and w^i ∈ ∂g(x^i)
(i = 0, . . . , k) arbitrary. After simple algebra the result follows.
Our analysis proved that the expected error of the iterates generated by PSS Method
with the Polyak stepsizes (24) after k iterations is O((k + 1)^{−1/2}), if we assume s_k ≥ s∗ for
all k ∈ N and that (D_k)_{k∈N} is bounded.
Now we are ready to prove the last main result of this subsection.
Theorem 2.13 Let (x^k)_{k∈N} be the sequence generated by PSS Method with α_k given by
(24). If S∗ ≠ ∅ and lim_{k→∞} s_k = s̃ < min_{x∈H}(f + g)(x), then

lim_{k→∞}(f + g)^k_best = lim_{k→∞} min_{0≤i≤k}(f + g)(x^i) ≤ min_{x∈H}(f + g)(x) + ((2 − γ)/γ)·[min_{x∈H}(f + g)(x) − s̃].
Proof Suppose that (f + g)(x^k) > min_{x∈H}(f + g)(x) for all k ∈ N; otherwise the result holds trivially.
It is clear that, for all k ∈ N,

α_k = γ_k·[((f + g)(x^k) − s_k)/((f + g)(x^k) − min_{x∈H}(f + g)(x))]·[((f + g)(x^k) − min_{x∈H}(f + g)(x))/(‖u^k‖² + 2ρ_k‖u^k‖ + ρ_k²)]
=: γ̃_k·[((f + g)(x^k) − min_{x∈H}(f + g)(x))/(‖u^k‖² + 2ρ_k‖u^k‖ + ρ_k²)],

where

γ ≤ γ̃_k = γ_k·((f + g)(x^k) − s_k)/((f + g)(x^k) − min_{x∈H}(f + g)(x)),

which implies that γ̃_{k̄} is greater than 2 − γ for some k̄ ∈ N. Otherwise, if

γ̃_k ≤ 2 − γ  (31)

for all k ∈ N, we can apply Corollary 2.11b to get lim_{k→∞}(f + g)(x^k) = min_{x∈H}(f +
g)(x), implying that γ̃_k goes to +∞ (note that, for all sufficiently large k, s_k < min_{x∈H}(f +
g)(x) ≤ (f + g)(x^k), because s̃ < min_{x∈H}(f + g)(x)), which contradicts (31).
Thus, there exist k̄ and arbitrary δ > 0 such that

γ̃_{k̄} = γ_{k̄}·((f + g)(x^{k̄}) − s_{k̄})/((f + g)(x^{k̄}) − min_{x∈H}(f + g)(x)) > 2 − δ.

After simple algebra, and using that s_{k̄} ≥ s̃, we get that

(f + g)(x^{k̄}) < min_{x∈H}(f + g)(x) + (γ_{k̄}/(2 − δ − γ_{k̄}))·[min_{x∈H}(f + g)(x) − s̃]
≤ min_{x∈H}(f + g)(x) + ((2 − γ)/(γ − δ))·[min_{x∈H}(f + g)(x) − s̃];

since δ > 0 was arbitrary, the result follows.
Corollary 2.14 Let (x^k)_{k∈N} be the sequence generated by PSS Method with α_k given by
(24). If S∗ ≠ ∅ and lim_{k→∞} s_k = s̃, then

lim_{k→∞}(f + g)^k_best = lim_{k→∞}(f + g)(x^k) = s̃, if s̃ ≥ min_{x∈H}(f + g)(x), and
lim_{k→∞}(f + g)^k_best ≤ min_{x∈H}(f + g)(x) + ((2 − γ)/γ)·[min_{x∈H}(f + g)(x) − s̃], if s̃ < min_{x∈H}(f + g)(x).
3 Final Remarks
In this work we dealt with the weak convergence and the complexity analysis of a new
approach, called the Proximal Subgradient Splitting (PSS) Method, for minimizing the sum
of two nonsmooth convex functions under standard assumptions (namely, Assumptions
A1 and A2). It is worth mentioning that these kinds of boundedness assumptions are needed
even for the convergence analysis of the classical subgradient iteration, and hopefully their
relaxation will be addressed in future research. We add that, in the proposed
iteration, neither of the two functions needs to be differentiable or finite on H and, therefore, a broad
class of problems can be solved. PSS Method is very useful when the proximal operator of
f is complicated to evaluate while a (sub)gradient of f is simple to compute.
As future research, we will investigate variations of our scheme for solving structured
convex optimization problems with the aim of finding new methods, like the coordinate
gradient method, which has been proposed, for instance, in [36] only for the differentiable
case. We will also look at the incremental subgradient method [28, 33] for problem (1), when
f is the sum of a large number of nonsmooth convex functions. The idea is to perform
subgradient iterations incrementally, by sequentially taking steps along the subgradients of
the component functions, followed by proximal steps. On the other hand, it is important
to mention that the main drawback of subgradient iterations is their slow rate of conver-
gence. However, subgradient methods are distinguished by their applicability, simplicity
and efficient use of memory, which is very important for large-scale problems, especially
if the required accuracy for the solution is not too high; see, for instance, [34] and the
references therein. We also intend to study fast and variable metric versions of the prox-
imal subgradient splitting method proposed here to achieve better performance, as in the
differentiable case; see [20].
Finally, we hope that this study serves as a basis for future research on other, more
efficient variants of the proximal subgradient iteration, like cutting-plane methods, ε-
subgradient methods and the proximal bundle method and its variations; see [28, 29, 40]. Moreover, in
future work we will discuss useful modifications of the proximal subgradient iteration, adding
conditional, ergodic and deflected techniques and combining the ideas presented in [21, 30].
Acknowledgments The author was partially supported by CNPq grants 303492/2013-9, 474160/2013-0
and 202677/2013-3. This work was partially completed while the author was visiting the University of British
Columbia. The author is very grateful for the warm hospitality of the Irving K. Barber School of Arts and
Sciences, Mathematics at the University of British Columbia Okanagan and particularly to Professors Heinz
H. Bauschke and Shawn Wang for the generous hospitality. The author would like to thank the anonymous
referees and associate editors, whose suggestions helped to improve the presentation of this paper.
References
1. Alber, Y.I., Iusem, A.N., Solodov, M.V.: On the projected subgradient method for nonsmooth convex
optimization in a Hilbert space. Math Program 81, 23–37 (1998)
2. Bauschke, H.H., Borwein, J.: On projection algorithms for solving convex feasibility problems. SIAM
Rev 38, 367–426 (1996)
3. Bauschke, H.H., Combettes, P.L.: Convex analysis and monotone operator theory in Hilbert spaces.
Springer, New York (2011)
4. Bauschke, H.H., Koch, V.R., Phan, H.M.: Stadium norm and Douglas-Rachford splitting: a new approach
to road design optimization. Operations Research (2016) in press
5. Beck, A., Teboulle, M.: Fast gradient-based algorithms for constrained total variation image denoising
and deblurring. IEEE Trans Image Process 18, 2419–2434 (2009)
6. Beck, A., Teboulle, M.: Gradient-Based Algorithms with Applications to Signal Recovery Problems. In:
Palomar, D., Eldar, Y. (eds.) Convex Optimization in Signal Processing and Communications, pp. 42–88.
Cambridge University Press, Cambridge (2010)
7. Bello Cruz, J.Y.: A subgradient method for vector optimization problems. SIAM J Optim 23, 2169–2182
(2013)
8. Bello Cruz, J.Y., Iusem, A.N.: A strongly convergent method for nonsmooth convex minimization in
Hilbert spaces. Numer Funct Anal Optim 32, 1009–1018 (2011)
9. Bello Cruz, J.Y., Iusem, A.N.: Convergence of direct methods for paramonotone variational inequalities.
Comput Optim Appl 46, 247–263 (2010)
10. Bello Cruz, J.Y., Nghia, T.T.A.: On the convergence of the proximal forward-backward splitting method
with linesearches. Technical report (2015). arXiv:1501.02501
11. Bot, R., Csetnek, E.R.: Forward-Backward and Tseng’s type penalty schemes for monotone inclusion
problems. Set-Valued and Variational Analysis 22, 313–331 (2014)
12. Candes, E.J., Tao, T.: Decoding by linear programming. IEEE Trans Inf Theory 51, 4203–4215 (2005)
13. Chavent, G., Kunisch, K.: Convergence of Tikhonov regularization for constrained ill-posed inverse
problems. Inverse Prob 10, 63–76 (1994)
14. Chen, G.H.-G., Rockafellar, R.T.: Convergence rates in forward-backward splitting. SIAM J Optim 7,
421–444 (1997)
15. Combettes, P.L.: Solving monotone inclusions via compositions of nonexpansive averaged operators.
Optimization 53, 475–504 (2004)
16. Combettes, P.L.: Quasi-Fejérian analysis of some optimization algorithms. In: Inherently Parallel Algorithms in Feasibility and Optimization and Their Applications. Studies in Computational Mathematics 8, pp. 115–152. North-Holland, Amsterdam (2001)
17. Combettes, P.L., Pesquet, J.-C.: A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery. IEEE Journal of Selected Topics in Signal Processing 1, 564–574 (2007)
18. Combettes, P.L., Pesquet, J.-C.: Proximal splitting methods in signal processing. In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer Optimization and Its Applications 49, pp. 185–212. Springer, New York (2011)
19. Combettes, P.L., Wajs, V.R.: Signal recovery by proximal forward-backward splitting. Multiscale Model
Simul 4, 1168–1200 (2005)
20. Combettes, P.L., Vũ, B.C.: Variable metric forward-backward splitting with applications to monotone
inclusions in duality. Optimization 63, 1289–1318 (2014)
21. D’Antonio, G., Frangioni, A.: Convergence analysis of deflected conditional approximate subgradient
methods. SIAM J Optim 20, 357–386 (2009)
22. Ermoliev, Yu.M.: On the method of generalized stochastic gradients and quasi-Fejér sequences. Cybernetics 5, 208–220 (1969)
23. Figueiredo, M., Nowak, R., Wright, S.J.: Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE J Sel Top Sign Proces 1, 586–597 (2007)
24. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images.
IEEE Trans Pattern Anal Mach Intell 6, 721–741 (1984)
25. Held, M., Wolfe, P., Crowder, H.: Validation of subgradient optimization. Math Program 6, 66–68 (1974)
26. James, G.M., Radchenko, P., Lv, J.: DASSO: connections between the Dantzig selector and lasso. J R
Stat Soc Ser B Stat Methodol 71, 127–142 (2009)
27. Kim, S., Ahn, H., Cho, S.-C.: Variable target value subgradient method. Math Program 49, 359–369
(1991)
28. Kiwiel, K.C.: Convergence of approximate and incremental subgradient methods for convex optimiza-
tion. SIAM J Optim 14, 807–840 (2006)
On Proximal Subgradient Splitting Method for Minimizing the sum...
29. Kiwiel, K.C.: The efficiency of subgradient projection methods for convex optimization, Part I: general level methods. SIAM J Control Optim 34, 660–676 (1996)
30. Larsson, T., Patriksson, M., Strömberg, A.-B.: Conditional subgradient optimization – theory and application. Eur J Oper Res 88, 382–403 (1996)
31. Mosci, S., Rosasco, L., Santoro, M., Verri, A., Villa, S., Sebag, M.: Solving structured sparsity reg-
ularization with proximal methods. In: Balczar, J., Bonchi, F., Gionis, A. (eds.) Machine Learning
and Knowledge Discovery in Databases, 6322 of Lecture Notes in Computer Science, Springer, 2010,
418–433
32. Parikh, N., Boyd, S.: Proximal algorithms. Foundations and Trends in Optimization 1, 127–239 (2014)
33. Nedic, A., Bertsekas, D.P.: Incremental subgradient methods for nondifferentiable optimization. SIAM
J Optim 12, 109–138 (2001)
34. Nesterov, Yu.: Subgradient methods for huge-scale optimization problems. Math Program 146, 275–297 (2014)
35. Nesterov, Yu.: Gradient methods for minimizing composite functions. Math Program 140, 125–161 (2013)
36. Nesterov, Yu.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J Optim 22, 341–362 (2012)
37. Nesterov, Yu.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Norwell (2004)
38. Polyak, B.T.: Minimization of unsmooth functionals. U.S.S.R. Comput Math Math Phys 9, 14–29 (1969)
39. Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J Control Optim 14,
877–898 (1976)
40. Sagastizábal, C.: Composite proximal bundle method. Math Program 140, 189–233 (2013)
41. Sherali, H.D., Choi, G., Tuncbilek, C.H.: A variable target value method for nondifferentiable optimization, 1–8 (1997)
42. Svaiter, B.F.: A class of Fejér convergent algorithms, approximate resolvents and the hybrid Proximal-
Extragradient method. J Optim Theory Appl 162, 133–153 (2014)
43. Tropp, J.: Just relax: convex programming methods for identifying sparse signals. IEEE Trans Inf Theory
51, 1030–1051 (2006)
44. Zhu, D.L., Marcotte, P.: Co-coercivity and its role in the convergence of iterative schemes for solving
variational inequalities. SIAM J Optim 6, 714–726 (1996)