Gradient Methods For Nonsmooth Problems
Abstract. The aim of this paper is to explore a peculiar regularization effect that occurs in the
sensitivity analysis of certain elliptic variational inequalities of the second kind. The effect causes
the solution operator of the variational inequality at hand to be continuously Fréchet differentiable
although the problem itself contains non-differentiable terms. Our analysis shows in particular that
standard gradient-based algorithms can be used to solve bilevel optimization and optimal control
problems that are governed by elliptic variational inequalities of the considered type - all without
regularizing the non-differentiable terms in the lower-level problem and without losing desirable
properties of the solution such as, e.g., sparsity. Our results can, for instance, be used in the optimal control
of Casson fluids and in bilevel optimization approaches for parameter learning in total variation image
denoising models.
Key words. optimal control, non-smooth optimization, bilevel optimization, elliptic variational
inequality of the second kind, Casson fluid, total variation, machine learning, parameter identification
Note that, if the operator A : Rn → Rn can be identified with the gradient field
of a convex and Fréchet differentiable function a : Rn → R, i.e., if a′(v) = A(v) ∈ Rn
holds for all v ∈ Rn , then (P) is equivalent to the bilevel minimization problem
(1.1)
\[
\begin{aligned}
&\min\; J(y, u, \alpha, \beta)\\
&\;\mathrm{s.t.}\;\; y, u \in \mathbb{R}^n,\ \alpha, \beta \in \mathbb{R}^m,\ (u, \alpha, \beta) \in U_{\mathrm{ad}},\\
&\phantom{\;\mathrm{s.t.}\;\;} y = \operatorname*{arg\,min}_{v \in \mathbb{R}^n}\; a(v) + \sum_{k=1}^{m} \omega_k \left( \alpha_k \|G_k v\| + \beta_k \|G_k v\|^{1+\gamma} \right) - \langle Bu, v \rangle.
\end{aligned}
\]
For a proof of this equivalence, we refer to [10, Section 1.2]. Since elliptic variational
inequalities involving non-conservative vector fields A : Rn → Rn appear naturally in
some applications (cf. the references in [24, Section II-2.1]), we work with the more
general formulation (P) in this paper and not with (1.1).
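To fix ideas, the lower-level objective in (1.1) can be evaluated directly from the problem data. The following Python sketch does exactly that; the argument names (a, omega, G, B, ...) mirror the notation above but are otherwise illustrative and not part of the paper.

```python
import numpy as np

def lower_level_objective(v, a, omega, alpha, beta, G, B, u, gamma):
    """Evaluate the inner objective of (1.1):
    a(v) + sum_k omega_k (alpha_k ||G_k v|| + beta_k ||G_k v||^(1+gamma)) - <Bu, v>.
    `a` is a callable and `G` a list of (d x n) arrays; all names are
    illustrative placeholders for the quantities in Assumption 1.1."""
    val = a(v) - np.dot(B @ u, v)
    for k, Gk in enumerate(G):
        nrm = np.linalg.norm(Gk @ v)
        val += omega[k] * (alpha[k] * nrm + beta[k] * nrm ** (1.0 + gamma))
    return val

# tiny example with n = 2, m = 1 and a(v) = 0.5 ||v||^2
if __name__ == "__main__":
    G = [np.array([[1.0, -1.0]])]
    print(lower_level_objective(np.array([1.0, 2.0]), lambda v: 0.5 * v @ v,
                                np.array([1.0]), np.array([0.5]), np.array([0.1]),
                                G, np.eye(2), np.zeros(2), 0.5))
```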
Optimization problems of the type (P) arise, for instance, in the optimal control
of non-Newtonian fluids, in glaciology, and in bilevel parameter learning approaches
for variational image denoising models. See, e.g., [6, 7, 8, 15, 17, 18, 20, 21, 25,
26, 28, 30, 31, 32, 33, 35, 36] and the references therein, and also the two tangible
examples in Section 2. The main difficulty in the study of the problem (P) is the non-
smoothness of the Euclidean norms present on the lower level. Because of these non-
differentiable terms, standard results and solution methods are typically inapplicable,
and the majority of authors resorts to regularization techniques to determine, e.g.,
stationary points of (P), cf. the approaches in [8, 15, 20, 28, 36].
The aim of this paper is to demonstrate that, in the situation of Assumption 1.1,
problems of the type (P) can also be solved without replacing the involved Euclidean
norms with smooth approximations. To be more precise, in what follows, we prove
the rather surprising fact that the solution operator S : (u, α, β) ↦ y associated with
the inner elliptic variational inequality in (P) is continuously Fréchet differentiable as
a function S : Rn × [0, ∞)m × (0, ∞)m → Rn (see Theorem 3.3 for the main result).
This very counterintuitive behavior makes it possible to tackle minimization problems
of the type (P) with gradient-based solution algorithms, even without regularizing the
non-smooth terms on the lower level. Avoiding such a regularization is highly desirable
in many situations as the Euclidean norms in (P) typically cause the inner solutions y
to have certain properties (sparsity etc.) that are very important from the application
point of view. Before we begin with our investigation, we give a short overview of the
content and the structure of this paper:
In Section 2, we first give two tangible examples of problems that fall under the
scope of our analysis - one arising in the optimal control of non-Newtonian fluids
and one from the field of bilevel parameter learning in variational image denoising
models. The examples found in this section illustrate that our results are not only
of academic interest but also of relevance in practice. In Section 3, we then address
the sensitivity analysis of the inner elliptic variational inequality in (P). Here, we
prove that the solution operator S : (u, α, β) ↦ y is indeed continuously Fréchet
differentiable as a function S : Rn × [0, ∞)m × (0, ∞)m → Rn and also give some
comments, e.g., on the extension of our results to the infinite-dimensional setting.
Section 4 is concerned with the consequences that the results of Section 3 have for
the study and the numerical solution of bilevel optimization problems of the form
(P). Lastly, in Section 5, we demonstrate by means of a numerical example that the
differentiability of the solution map S indeed makes it possible to solve problems of the type (P)
with standard gradient-based algorithms.
[Figure 1: sketch of the flow regions in Ω; the solid nucleus (ȳ ≡ c, ∇ȳ ≡ 0), the surrounding flow region, and the stagnation zones (ȳ ≡ 0) near the boundary.]
Fig. 1. Typical flow behavior in the situation of the two-dimensional Mosolov problem with a
constant pressure drop ū. The viscoplastic medium forms a solid nucleus in the middle of the fluid
domain that moves with a constant velocity c along the pipe axis and sticks to the boundary in those
regions of Ω where the pressure gradient ū is too low to move the fluid (so-called stagnation zones).
An analogous behavior can be observed in the case d = 1 for the flow of a Casson fluid between two
plates, cf. the numerical results in Section 5.
(2.2)
\[
\begin{aligned}
&\min\; \frac{1}{2}\|\bar y - \bar y_D\|_{L^2(\Omega)}^2 + \frac{\mu}{2}\|\bar u\|_{L^2(\Omega)}^2 + \alpha^2\\
&\;\mathrm{s.t.}\;\; \bar y \in H_0^1(\Omega),\ \bar u \in L^2(\Omega),\ \alpha \in \tilde U_{\mathrm{ad}},\\
&\phantom{\;\mathrm{s.t.}\;\;} \bar y = \operatorname*{arg\,min}_{\bar v \in H_0^1(\Omega)} \int_\Omega \frac{1}{2}\|\nabla \bar v\|^2 + \frac{4}{3}\alpha^{1/2}\|\nabla \bar v\|^{3/2} + \alpha\|\nabla \bar v\| - \bar u \bar v \,\mathrm{d}x,
\end{aligned}
\]
where Ũad is some non-empty, convex and closed subset of (0, ∞) and where µ > 0 is
a fixed Tychonoff parameter. Let us briefly check that the above problem is indeed
sensible:
Theorem 2.1 (Solvability of (2.2)). Assume that ȳD , Ω, µ and Ũad are as
before. Then, (2.2) admits at least one solution (ū∗ , α∗ ) ∈ L2 (Ω) × Ũad .
Proof. From standard arguments (as found, e.g., in [24, Lemma 4.1]), it follows
straightforwardly that the lower-level problem in (2.2) possesses a well-defined solution
operator S : L^2(Ω) × [0, ∞) → H_0^1(Ω), (ū, α) ↦ ȳ. It is further easy to check (using
the weak lower semicontinuity of convex and continuous functions) that this solution
map is weak-to-weak continuous, i.e., for every sequence {(ūi, αi)} ⊂ L^2(Ω) × [0, ∞)
with ūi ⇀ ū in L^2(Ω) and αi → α in R for i → ∞, we have S(ūi, αi) ⇀ S(ū, α)
in H_0^1(Ω). The claim now follows immediately from the direct method of the calculus of
variations and the compactness of the embedding H_0^1(Ω) ↪ L^2(Ω).
To transform (2.2) into a problem that can be solved numerically, we consider
a standard finite element discretization with piecewise linear ansatz functions. More
precisely, we assume the following:
Assumption 2.2. (Assumptions and Notation for the Discretization of (2.2))
1. T = {T_k}_{k=1}^m, m ∈ N, is a triangulation of Ω consisting of simplices T_k (see,
e.g., [9, Definition 2] for the precise definition of the term “triangulation”),
2. {x_i}_{i=1}^n, n ∈ N, are the nodes of T that are contained in Ω,
3. V_h := {v̄ ∈ C(cl(Ω)) | v̄ is affine on T_k for all k = 1, ..., m and v̄|_∂Ω = 0},
4. {ϕ_i} is the nodal basis of V_h, i.e., ϕ_i(x_i) = 1 for all i, ϕ_i(x_j) = 0 for i ≠ j.
By replacing the spaces H_0^1(Ω) and L^2(Ω) in (2.2) with V_h, we now arrive at a
finite-dimensional minimization problem of the following form:
(2.3)
\[
\begin{aligned}
&\min\; \frac{1}{2}\langle B(y - y_D), y - y_D\rangle + \frac{\mu}{2}\langle Bu, u\rangle + \alpha^2\\
&\;\mathrm{s.t.}\;\; y, u \in \mathbb{R}^n,\ \alpha \in \tilde U_{\mathrm{ad}},\\
&\phantom{\;\mathrm{s.t.}\;\;} y = \operatorname*{arg\,min}_{v \in \mathbb{R}^n}\; \frac{1}{2}\langle Av, v\rangle + \sum_{k=1}^{m} |T_k| \left( \alpha \|G_k v\| + \frac{4}{3}\alpha^{1/2}\|G_k v\|^{3/2} \right) - \langle Bu, v\rangle.
\end{aligned}
\]
Here, y, u and yD are the coordinate vectors of the discretized state, the discretized
control and the Lagrange interpolate of ȳD w.r.t. the nodal basis {ϕi }, respectively,
|Tk | is the d-dimensional volume of the simplex Tk , and Gk ∈ Rd×n is the matrix
that maps a coordinate vector v ∈ Rn to the gradient of the associated finite element
function on the cell Tk , i.e.,
\[
G_k v = \nabla\Bigg( \sum_{i=1}^{n} v_i \varphi_i \Bigg)\Bigg|_{T_k} \in \mathbb{R}^d \qquad \forall v \in \mathbb{R}^n \ \forall k = 1, ..., m.
\]
Note that (2.3) is precisely of the form (1.1) (with ωk := |Tk | and an appropriately
defined Uad ⊂ Rn × [0, ∞)m × (0, ∞)m ). This shows that, after a discretization,
the minimization problem (2.2) indeed falls under the scope of the general setting
introduced in Section 1, and that our analysis indeed allows us to study optimal control
problems for Casson fluids. We will get back to this topic in Section 5, where (2.3)
will serve as a model problem for our numerical experiments.
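For the case d = 1 that also underlies the numerical experiments in Section 5, the quantities entering (2.3) can be assembled in a few lines. The following Python sketch does this for piecewise linear elements on an equidistant partition of Ω = (0, 1); identifying A with the stiffness matrix and B with the mass matrix of the nodal basis is the author's reading of the discretization above and should be treated as an assumption of the sketch.

```python
import numpy as np

def p1_setup(N):
    """Minimal 1D sketch (Omega = (0,1), N cells of width h) of the data in (2.3):
    cell volumes |T_k|, the cellwise gradient maps G_k, the stiffness matrix A
    and the mass matrix B of the interior nodal basis functions."""
    h = 1.0 / N
    n, m = N - 1, N                            # n interior nodes, m cells
    vol = np.full(m, h)                        # |T_k|
    G = np.zeros((m, n))                       # row k realizes G_k v = (v_k - v_{k-1}) / h
    for k in range(m):
        if k - 1 >= 0:
            G[k, k - 1] = -1.0 / h
        if k < n:
            G[k, k] = 1.0 / h
    A = G.T @ (vol[:, None] * G)               # stiffness: sum_k |T_k| G_k^T G_k
    B = h / 6.0 * (4.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1))   # mass matrix
    return vol, G, A, B

if __name__ == "__main__":
    vol, G, A, B = p1_setup(8)
    print(vol.shape, G.shape, A.shape, B.shape)
```

With these matrices, the lower-level objective in (2.3) can be evaluated exactly as in the sketch given after (1.1).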
2.2. Bilevel Optimization Approaches for Parameter Learning. As a
second application example, we consider a bilevel optimization problem that has been
proposed in [28] as a framework for parameter learning in variational image denoising
models (cf. also with [6, 36]). The problem takes the form
(2.4)
\[
\begin{aligned}
&\min\; \|y - g\|^2\\
&\;\mathrm{s.t.}\;\; y \in \mathbb{R}^n,\ \vartheta \in [0, \infty)^q,\\
&\phantom{\;\mathrm{s.t.}\;\;} y = \operatorname*{arg\,min}_{v \in \mathbb{R}^n}\; \frac{1}{2}\|v - f\|^2 + \sum_{i=1}^{q} \vartheta_i \sum_{j=1}^{r} |(K_i v)_j|^p.
\end{aligned}
\]
Here, n, q and r are natural numbers, p is some exponent in [1, ∞), g ∈ Rn is the given
ground truth data, f is the noisy image, the terms Σ_{j=1}^{r} |(Ki v)j|^p, i = 1, ..., q, are so-
called analysis-based priors involving matrices Ki ∈ R^{r×n}, ϑ is the learning parameter,
and (1/2)‖v − f‖^2 and Σ_{i=1}^{q} ϑi ( Σ_{j=1}^{r} |(Ki v)j|^p ) are the fidelity and the regularization
term of the underlying denoising model, respectively (cf. the classical TV-denoising
method). For more details on the background of (2.4), we refer to [28] and the
references therein.
Suppose now that we enrich the model (2.4) by allowing the exponent p to depend
on i and by doubling the number of priors in the lower-level problem. Then, we may
choose half of the exponents p to be one and half of the exponents p to be 1 + γ for
some γ ∈ (0, 1) to arrive at a problem of the type
(2.5)
\[
\begin{aligned}
&\min\; \|y - g\|^2\\
&\;\mathrm{s.t.}\;\; y \in \mathbb{R}^n,\ \vartheta, \tilde\vartheta \in [0, \infty)^q,\\
&\phantom{\;\mathrm{s.t.}\;\;} y = \operatorname*{arg\,min}_{v \in \mathbb{R}^n}\; \frac{1}{2}\|v - f\|^2 + \sum_{i=1}^{q} \Bigg( \vartheta_i \sum_{j=1}^{r} |(K_i v)_j| + \tilde\vartheta_i \sum_{j=1}^{r} |(K_i v)_j|^{1+\gamma} \Bigg).
\end{aligned}
\]
Note that it makes sense to consider the exponent p = 1 here since this choice ensures
that the priors are sparsity promoting (due to the induced non-smoothness, cf. [41]).
If we replace the constraint on ϑ̃ in (2.5) with ϑ̃ ∈ [ε, ∞)^q for some 0 < ε ≪ 1, define the quantities Gij, αij and βij as in (2.6),
use the binomial identities, exploit that terms which depend only on f are irrelevant
in the lower-level problem, and identify f with u, then (2.5) can be recast as
\[
\begin{aligned}
&\min\; \|y - g\|^2\\
&\;\mathrm{s.t.}\;\; y, u \in \mathbb{R}^n,\ \alpha, \beta \in \mathbb{R}^{rq},\ (u, \alpha, \beta) \in U_{\mathrm{ad}},\\
&\phantom{\;\mathrm{s.t.}\;\;} y = \operatorname*{arg\,min}_{v \in \mathbb{R}^n}\; \frac{1}{2}\|v\|^2 + \sum_{i=1}^{q} \sum_{j=1}^{r} \left( \alpha_{ij} |G_{ij} v| + \beta_{ij} |G_{ij} v|^{1+\gamma} \right) - \langle u, v\rangle
\end{aligned}
\]
with an appropriately defined admissible set Uad ⊂ Rn × [0, ∞)rq × (0, ∞)rq which
ensures the equality u = f and enforces that αij and βij depend only on i (cf.
(2.6)). The above problem is again of the form (1.1) and satisfies all conditions in
Assumption 1.1. This shows that the general setting of Section 1 can also be used to
study parameter learning problems for variational image denoising models.
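The reindexing behind this reformulation can be written out explicitly. In the following sketch, the rows of the matrices Ki play the role of the maps Gij and the weights are duplicated accordingly, αij := ϑi and βij := ϑ̃i; since the definition (2.6) itself is not reproduced in this excerpt, this identification is the author's reading of it and should be treated as an assumption.

```python
import numpy as np

def flatten_priors(K_list, theta, theta_tilde):
    """Collect the rows (K_i)_j into scalar maps G_ij and duplicate the weights,
    alpha_ij := theta_i, beta_ij := theta_tilde_i (assumed reading of (2.6))."""
    G_rows, alpha, beta = [], [], []
    for Ki, ti, tti in zip(K_list, theta, theta_tilde):
        for row in Ki:                         # row = (K_i)_j, a 1 x n map
            G_rows.append(row)
            alpha.append(ti)
            beta.append(tti)
    return np.array(G_rows), np.array(alpha), np.array(beta)

if __name__ == "__main__":                     # q = 2 priors on R^3 with r = 3 rows each
    D = np.array([[1.0, -1.0, 0.0], [0.0, 1.0, -1.0], [0.0, 0.0, 1.0]])
    G, a, b = flatten_priors([np.eye(3), D], theta=[1.0, 0.5], theta_tilde=[0.1, 0.2])
    print(G.shape, a, b)
```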
3. Solvability and Sensitivity Analysis of the Inner Problem in (P).
Having demonstrated that the general framework of Section 1 indeed covers situations
that are relevant for practical applications, we now turn our attention to the inner
elliptic variational inequality in (P), i.e., to the problem
(V)
\[
\begin{aligned}
&y \in \mathbb{R}^n,\\
&\langle A(y), v - y\rangle + \sum_{k=1}^{m} \omega_k \left( \alpha_k \|G_k v\| + \beta_k \|G_k v\|^{1+\gamma} \right)
- \sum_{k=1}^{m} \omega_k \left( \alpha_k \|G_k y\| + \beta_k \|G_k y\|^{1+\gamma} \right) \geq \langle Bu, v - y\rangle \quad \forall v \in \mathbb{R}^n.
\end{aligned}
\]
k=1
Here and in what follows, we always assume that l, m, n, ω, B, γ, Gk and A satisfy the
conditions in Assumption 1.1. Let us first check that (V) is well-posed:
Proposition 3.1 (Solvability). The variational inequality (V) admits a unique
solution y ∈ Rn for all u ∈ Rn , α ∈ [0, ∞)m and β ∈ [0, ∞)m . This solution satisfies
(3.1)
\[
\langle A(y), z\rangle + \sum_{k=1}^{m} \omega_k \left( \alpha_k \,\|\cdot\|'(G_k y; G_k z) + \beta_k \,(\|\cdot\|^{1+\gamma})'(G_k y; G_k z) \right) \geq \langle Bu, z\rangle \qquad \forall z \in \mathbb{R}^n.
\]
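The directional derivatives ‖·‖′(x; z) and (‖·‖^{1+γ})′(x; z) appearing in (3.1) have simple closed forms, which the following sketch evaluates, including at the origin (where the Euclidean norm is only directionally differentiable with ‖·‖′(0; z) = ‖z‖, while ‖·‖^{1+γ} has directional derivative zero there since γ > 0); the function names are illustrative.

```python
import numpy as np

def dir_deriv_norm(x, z):
    """One-sided directional derivative of x |-> ||x||:
    <x, z>/||x|| for x != 0 and ||z|| at the origin."""
    nx = np.linalg.norm(x)
    return np.dot(x, z) / nx if nx > 0 else np.linalg.norm(z)

def dir_deriv_norm_power(x, z, gamma):
    """One-sided directional derivative of x |-> ||x||^(1+gamma), gamma in (0,1):
    (1+gamma) ||x||^(gamma-1) <x, z> for x != 0 and 0 at the origin."""
    nx = np.linalg.norm(x)
    return (1.0 + gamma) * nx ** (gamma - 1.0) * np.dot(x, z) if nx > 0 else 0.0

if __name__ == "__main__":
    x, z = np.zeros(2), np.array([1.0, 2.0])
    print(dir_deriv_norm(x, z), dir_deriv_norm_power(x, z, 0.5))
```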
Proposition 3.2 (Lipschitz Continuity of the Solution Map). For every M > 0
there exists a constant C > 0 depending only on ω, M , A, B and Gk such that the
solution map S : Rn × [0, ∞)m × [0, ∞)m → Rn, (u, α, β) ↦ y, associated with (V)
satisfies
(3.2)
\[
\|S(u_1, \alpha_1, \beta_1) - S(u_2, \alpha_2, \beta_2)\| \leq C \left( \|u_1 - u_2\| + \|\alpha_1 - \alpha_2\| + \|\beta_1 - \beta_2\| \right)
\]
for all (u1, α1, β1), (u2, α2, β2) ∈ Rn × [0, ∞)m × [0, ∞)m with ‖u1‖, ‖u2‖ ≤ M.
Proof. We proceed along the lines of [12, Theorem 2.6]: Suppose that a constant
M > 0 is given, consider some (u1 , α1 , β1 ), (u2 , α2 , β2 ) ∈ Rn × [0, ∞)m × [0, ∞)m
with ‖u1‖, ‖u2‖ ≤ M, and denote the solutions of (V) associated with the triples
(u1 , α1 , β1 ) and (u2 , α2 , β2 ) with y1 and y2 , respectively. Then, (3.1) yields that
\[
\langle A(y_1), z\rangle + \sum_{k=1}^{m} \omega_k \left( \alpha_{1,k} \,\|\cdot\|'(G_k y_1; G_k z) + \beta_{1,k} \,(\|\cdot\|^{1+\gamma})'(G_k y_1; G_k z) \right) \geq \langle Bu_1, z\rangle
\]
and
\[
\langle A(y_2), z\rangle + \sum_{k=1}^{m} \omega_k \left( \alpha_{2,k} \,\|\cdot\|'(G_k y_2; G_k z) + \beta_{2,k} \,(\|\cdot\|^{1+\gamma})'(G_k y_2; G_k z) \right) \geq \langle Bu_2, z\rangle
\]
holds for all z ∈ Rn . In particular, we may choose the vectors z = ±(y1 − y2 ) and
add the above two inequalities to obtain
(3.3)
\[
\begin{aligned}
&\langle A(y_1) - A(y_2), y_1 - y_2\rangle\\
&\quad\leq \langle B(u_1 - u_2), y_1 - y_2\rangle\\
&\qquad+ \sum_{k=1}^{m} \omega_k (\alpha_{1,k} - \alpha_{2,k}) \,\|\cdot\|'(G_k y_1; G_k(y_2 - y_1))\\
&\qquad+ \sum_{k=1}^{m} \omega_k (\beta_{1,k} - \beta_{2,k}) \,(\|\cdot\|^{1+\gamma})'(G_k y_1; G_k(y_2 - y_1))\\
&\qquad+ \sum_{k=1}^{m} \omega_k \alpha_{2,k} \left( \|\cdot\|'(G_k y_1; G_k(y_2 - y_1)) + \|\cdot\|'(G_k y_2; G_k(y_1 - y_2)) \right)\\
&\qquad+ \sum_{k=1}^{m} \omega_k \beta_{2,k} \left( (\|\cdot\|^{1+\gamma})'(G_k y_1; G_k(y_2 - y_1)) + (\|\cdot\|^{1+\gamma})'(G_k y_2; G_k(y_1 - y_2)) \right).
\end{aligned}
\]
Due to the convexity of the functions ‖·‖ and ‖·‖^{1+γ} and the non-negativity of the
vectors α, β and ω, the last two sums on the right-hand side of (3.3) are non-positive
and can be ignored (see [12, Lemma 2.3 e)]). Further, we obtain from the Lipschitz
continuity of ‖·‖ and ‖·‖^{1+γ} on bounded subsets of R^l and Proposition 3.1 that there
exists a constant C = C(ω, M, Gk) > 0 with
\[
\begin{aligned}
&\sum_{k=1}^{m} \omega_k (\alpha_{1,k} - \alpha_{2,k}) \,\|\cdot\|'(G_k y_1; G_k(y_2 - y_1))
+ \sum_{k=1}^{m} \omega_k (\beta_{1,k} - \beta_{2,k}) \,(\|\cdot\|^{1+\gamma})'(G_k y_1; G_k(y_2 - y_1))\\
&\quad\leq C \left( \|\alpha_1 - \alpha_2\| + \|\beta_1 - \beta_2\| \right) \|y_1 - y_2\|.
\end{aligned}
\]
The claim now follows immediately from (3.3), the strong monotonicity of A and the
inequality of Cauchy-Schwarz.
We are now in the position to prove the main result of this paper:
Theorem 3.3 (Continuous Fréchet Differentiability of the Solution Map). The
solution operator S : (u, α, β) ↦ y associated with the variational inequality (V) is
continuously Fréchet differentiable as a function S : Rn × [0, ∞)m × (0, ∞)m → Rn,
i.e., there exists a continuous map S′ which maps Rn × [0, ∞)m × (0, ∞)m into the
space L(Rn × Rm × Rm, Rn) of continuous and linear operators from Rn × Rm × Rm
to Rn such that, for every w ∈ Rn × [0, ∞)m × (0, ∞)m, we have
(3.4)
\[
\frac{\|S(w + h) - S(w) - S'(w)h\|}{\|h\|} \longrightarrow 0 \qquad \text{as } h \to 0,\ w + h \in \mathbb{R}^n \times [0, \infty)^m \times (0, \infty)^m.
\]
Moreover, for every triple (u, α, β) ∈ Rn × [0, ∞)m × (0, ∞)m, the Fréchet derivative
S′(u, α, β) ∈ L(Rn × Rm × Rm, Rn) is precisely the solution map (h1, h2, h3) ↦ δ of
the elliptic variational equality
(3.5)
\[
\begin{aligned}
&\delta \in W(y),\\
&\langle A'(y)\delta, z\rangle
+ \sum_{k :\, G_k y \neq 0} \omega_k \left( \alpha_k \,\|\cdot\|''(G_k y)(G_k\delta, G_k z) + \beta_k \,(\|\cdot\|^{1+\gamma})''(G_k y)(G_k\delta, G_k z) \right)\\
&\quad= \langle Bh_1, z\rangle - \sum_{k :\, G_k y \neq 0} \omega_k \left( h_{2,k} \,\|\cdot\|'(G_k y)(G_k z) + h_{3,k} \,(\|\cdot\|^{1+\gamma})'(G_k y)(G_k z) \right)
\qquad \forall z \in W(y).
\end{aligned}
\]
Here, y := S(u, α, β) is the solution of (V) associated with the triple (u, α, β), A′(y) is
the Fréchet derivative of A in y, ‖·‖′ and (‖·‖^{1+γ})′ (respectively, ‖·‖″ and (‖·‖^{1+γ})″)
are the first (respectively, second) Fréchet derivatives of the functions ‖·‖ and ‖·‖^{1+γ}
away from the origin, and W(y) is the subspace of Rn defined by
(3.6)
\[
W(y) := \{ z \in \mathbb{R}^n \mid G_k z = 0 \text{ for all } k \text{ with } G_k y = 0 \}.
\]
Note that, by direct calculation, we obtain that the variational equality (3.5) can
also be written in the following, more explicit form:
\[
\begin{aligned}
&\delta \in W(y),\\
&\langle A'(y)\delta, z\rangle
+ \sum_{k :\, G_k y \neq 0} \omega_k \alpha_k \,\frac{\|G_k y\|^2 \langle G_k\delta, G_k z\rangle - \langle G_k y, G_k\delta\rangle \langle G_k y, G_k z\rangle}{\|G_k y\|^3}\\
&\quad+ \sum_{k :\, G_k y \neq 0} \omega_k \beta_k (1+\gamma) \,\frac{\|G_k y\|^2 \langle G_k\delta, G_k z\rangle - \langle G_k y, G_k\delta\rangle \langle G_k y, G_k z\rangle}{\|G_k y\|^{3-\gamma}}\\
&\quad+ \sum_{k :\, G_k y \neq 0} \omega_k \beta_k (\gamma^2+\gamma) \,\frac{\langle G_k y, G_k\delta\rangle \langle G_k y, G_k z\rangle}{\|G_k y\|^{3-\gamma}}\\
&\quad= \langle Bh_1, z\rangle - \sum_{k :\, G_k y \neq 0} \omega_k \left( h_{2,k} \,\frac{\langle G_k y, G_k z\rangle}{\|G_k y\|} + h_{3,k}(1+\gamma) \,\frac{\langle G_k y, G_k z\rangle}{\|G_k y\|^{1-\gamma}} \right)
\qquad \forall z \in W(y).
\end{aligned}
\]
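Once y = S(u, α, β) is known, the linear system above can be assembled and solved directly for δ = S′(u, α, β)h. The following Python sketch does this under the simplifying assumption that A is linear (so that A′(y) = A); the subspace W(y) is handled through an orthonormal null-space basis, and all function and variable names are illustrative rather than taken from the paper.

```python
import numpy as np

def solve_sensitivity(A, B, G, omega, alpha, beta, gamma, y, h1, h2, h3, tol=1e-12):
    """Assemble and solve the linearized system (3.5) for
    delta = S'(u, alpha, beta)(h1, h2, h3), restricted to
    W(y) = {z : G_k z = 0 whenever G_k y = 0}.  A linear operator A
    (so that A'(y) = A) is assumed; G is a list of (d x n) arrays."""
    n = y.size
    M = np.array(A, dtype=float)
    rhs = B @ h1
    constraints = []
    for k, Gk in enumerate(G):
        gy = Gk @ y
        nrm = np.linalg.norm(gy)
        if nrm <= tol:                       # G_k y = 0: only the constraint G_k delta = 0 remains
            constraints.append(Gk)
            continue
        nk = gy / nrm
        P = Gk.T @ (np.eye(Gk.shape[0]) - np.outer(nk, nk)) @ Gk    # tangential second-derivative part
        Q = Gk.T @ np.outer(nk, nk) @ Gk                            # radial second-derivative part
        M = M + omega[k] * (alpha[k] / nrm * P
                            + beta[k] * (1 + gamma) / nrm ** (1 - gamma) * P
                            + beta[k] * (gamma ** 2 + gamma) / nrm ** (1 - gamma) * Q)
        rhs = rhs - omega[k] * (h2[k] / nrm
                                + h3[k] * (1 + gamma) * nrm ** (gamma - 1)) * (Gk.T @ gy)
    if constraints:                          # eliminate W(y) via an orthonormal null-space basis
        C = np.vstack(constraints)
        _, s, Vt = np.linalg.svd(C)
        rank = int(np.sum(s > tol * max(C.shape)))
        Z = Vt[rank:].T
    else:
        Z = np.eye(n)
    if Z.shape[1] == 0:                      # W(y) = {0}
        return np.zeros(n)
    xi = np.linalg.solve(Z.T @ M @ Z, Z.T @ rhs)
    return Z @ xi                            # delta = S'(u, alpha, beta)(h1, h2, h3)
```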
Proof of Theorem 3.3. To prove the different claims in Theorem 3.3, we proceed
in several steps:
Step 1 (Gâteaux Differentiability): We begin by showing that the solution map S
associated with (V) is Gâteaux differentiable everywhere in Rn × [0, ∞)m × (0, ∞)m .
The approach that we use in the following to establish the Gâteaux differentiability is
fairly standard and relies heavily on the Lipschitz estimate (3.2) in Proposition 3.2.
Compare, e.g., with [19, 29] in this context, and also with the more general theory for
infinite-dimensional problems in [2, 10, 14].
Suppose that an arbitrary but fixed triple w := (u, α, β) ∈ Rn × [0, ∞)m × (0, ∞)m
is given, and let h := (h1, h2, h3) be a vector such that w + t0h ∈ Rn × [0, ∞)m × (0, ∞)m
holds for some t0 > 0. Then, the convexity of the set Rn × [0, ∞)m × (0, ∞)m implies
that w + th is an element of Rn × [0, ∞)m × (0, ∞)m for all t ∈ (0, t0), and we may
define the difference quotients δt := (S(w + th) − S(w))/t ∈ Rn for all t ∈ (0, t0).
Note that we have added several terms here (e.g., the norms ‖Gk y‖), so that the
expressions on the left-hand side of (3.7) take the form of classical difference quotients.
An important observation at this point is that all of the terms in the large round
brackets on the left-hand side of (3.7) are non-negative (the second-order difference
quotients of the functions ‖·‖ and ‖·‖^{1+γ} due to convexity and the terms in the last
two lines of the left-hand side of (3.7) due to (3.1)). This allows us to deduce the
following from (3.7) when we choose z to be zero:
(3.8)
\[
0 \leq \sum_{k :\, G_k y = 0} \omega_k \beta_k \,\frac{\|G_k \delta_j\|^{1+\gamma}}{t_j^{1-\gamma}}
\]
Since the right-hand side of (3.8) remains bounded for tj ↘ 0, since γ ∈ (0, 1) and
since ωk βk > 0 for all k, the above implies that the limit δ of the difference quotients
δj is contained in the set W (y) = {z ∈ Rn | Gk z = 0 for all k with Gk y = 0}.
From (3.7), the fact that (3.1) holds with equality for all z ∈ W (y), and again the
information about the signs of the terms in (3.7), we now obtain that δj satisfies
(3.9)
\[
\begin{aligned}
&\left\langle \frac{A(y + t_j \delta_j) - A(y)}{t_j},\, z - \delta_j \right\rangle
+ \sum_{k :\, G_k y \neq 0} \omega_k \alpha_k \,\frac{1}{t_j}\left( \frac{\|G_k y + t_j G_k z\| - \|G_k y\|}{t_j} - \|\cdot\|'(G_k y)(G_k z) \right)\\
&\quad\geq \langle Bh_1, z - \delta_j\rangle\\
&\qquad+ \sum_{k=1}^{m} \omega_k h_{2,k} \left( \frac{\|G_k y + t_j G_k \delta_j\| - \|G_k y\|}{t_j} - \frac{\|G_k y + t_j G_k z\| - \|G_k y\|}{t_j} \right)\\
&\qquad+ \sum_{k=1}^{m} \omega_k h_{3,k} \left( \frac{\|G_k y + t_j G_k \delta_j\|^{1+\gamma} - \|G_k y\|^{1+\gamma}}{t_j} - \frac{\|G_k y + t_j G_k z\|^{1+\gamma} - \|G_k y\|^{1+\gamma}}{t_j} \right)
\end{aligned}
\]
for all z ∈ W(y). Note that all of the expressions in (3.9) are well-behaved for
j → ∞ (since the Euclidean norm ‖·‖ is smooth away from the origin and Hadamard
∀z ∈ W (y).
Since the Fréchet derivative A′(y) : Rn → Rn inherits the strong monotonicity of
the original operator A (see [10, Lemma 1.2.3]), the problem (3.10) can have at most
one solution δ ∈ W (y) (cf. step 3 in the proof of [10, Theorem 1.2.2]), and we may
deduce that the limit δ of the difference quotients δj is independent of the choice of the
(sub)sequence {tj } ⊂ (0, t0 ) that we started with. The latter implies, in combination
with classical contradiction arguments, that the whole family of difference quotients
{δt} converges to the unique solution δ ∈ W(y) of (3.10) for t ↘ 0, and that the
solution operator S associated with (V) is directionally differentiable in the point w
in every direction h with w + t0h ∈ Rn × [0, ∞)m × (0, ∞)m for some t0 > 0. By
choosing test vectors of the form δ + sz, z ∈ Rn, s > 0, in (3.10), by dividing by s,
by passing to the limit s ↘ 0, and by exploiting that W(y) is a subspace, we obtain
further that (3.10) can be rewritten as (3.5). Since (3.5) has a linear and continuous
solution operator (h1, h2, h3) ↦ δ, it now follows immediately that the map S is
Gâteaux differentiable in w and that the derivative S′(w) ∈ L(Rn × Rm × Rm, Rn) is
characterized by (3.5). This completes the first step of the proof.
Step 2 (Fréchet Differentiability): The Fréchet differentiability of the solution map
S on Rn ×[0, ∞)m ×(0, ∞)m follows immediately from the Gâteaux differentiability of
S, the Lipschitz estimate (3.2) and standard arguments. We include the proof for the
convenience of the reader: Suppose that there exists a w ∈ Rn × [0, ∞)m × (0, ∞)m
such that S is not Fréchet differentiable in w in the sense of (3.4). Then, there exist
an ε > 0 and sequences {hj} ⊂ Rn × Rm × Rm, {tj} ⊂ (0, ∞) such that tj ↘ 0 for
j → ∞, khj k = 1 for all j, w + tj hj ∈ Rn × [0, ∞)m × (0, ∞)m for all j, and
Step 3 (Continuity of the Fréchet Derivative): It remains to prove that the map
S′ : Rn × [0, ∞)m × (0, ∞)m → L(Rn × Rm × Rm, Rn) is continuous. To this end,
we consider an arbitrary but fixed sequence {wj} ⊂ Rn × [0, ∞)m × (0, ∞)m which
satisfies wj = (uj, αj, βj) → w = (u, α, β) for some w ∈ Rn × [0, ∞)m × (0, ∞)m. Note
that, to prove the continuity of S′, it suffices to show that S′(wj)h → S′(w)h holds
for all h ∈ Rn × Rm × Rm (since this convergence already implies S′(wj) → S′(w) in
the operator norm). So let us assume that an h = (h1, h2, h3) ∈ Rn × Rm × Rm is
given. Then, (3.5) yields that the vectors ηj := S′(wj)h satisfy
and
Here, yj is short for S(wj). From the strong monotonicity of A′(yj) (which is uniform
in j by our assumptions and [10, Lemma 1.2.3]) and the inequalities of Cauchy-Schwarz
and Young, we may now deduce that there exist constants c, C > 0 independent of j
with
(3.11)
\[
c\|\eta_j\|^2 + \sum_{k :\, G_k y_j \neq 0} \omega_k \beta_{j,k} (\gamma^2 + \gamma) \,\frac{\|G_k \eta_j\|^2}{\|G_k y_j\|^{1-\gamma}} \leq C.
\]
The above implies that the sequence {ηj } is bounded and that we may pass over to
a subsequence (still denoted by the same symbol) with ηj → η for some η. We claim
that this η satisfies η ∈ W (y), where y := S(w) is the solution associated with the
limit point w. To see this, we consider an arbitrary but fixed k ∈ {1, ..., m} with
Gk y = 0 and distinguish between two cases: If we can find a subsequence ji such that
Gk yji = 0 holds for all i, then we trivially have Gk ηji = 0 for all i (since ηji ∈ W (yji )),
and we immediately obtain from the convergence ηji → η that Gk η = 0. If, on the
other hand, we can find a subsequence ji such that Gk yji ≠ 0 holds for all i, then
(3.11) yields
and we may use the convergences yji → y and βji → β ∈ (0, ∞)m to conclude that
Gk η = 0. This shows that Gk η = 0 holds for all k with Gk y = 0 and that η is
indeed an element of W(y). Suppose now that j is so large that Gk yj ≠ 0 holds for
all k ∈ {1, ..., m} with Gk y ≠ 0 (this is the case for all large enough j due to the
convergence yj → y). Then, it clearly holds W (y) ⊂ W (yj ), and we may deduce the
∀z ∈ W (y).
If we pass to the limit j → ∞ in the above, then it follows that η solves the variational
problem (3.5) which characterizes S′(w)h. This shows that η = S′(w)h has to hold
and that S′(wj)h converges to S′(w)h for j → ∞. Using the same arguments as in
the first part of the proof, we obtain that this convergence also holds for the whole
original sequence S′(wj)h and not just for the subsequence that we have chosen after
(3.11). This proves the continuity of S′ and completes the proof.
Some remarks are in order regarding Theorem 3.3:
Remark 3.4.
1. It seems to be a common belief that minimization problems and elliptic
variational inequalities which involve non-differentiable terms necessarily also
have non-differentiable solution operators. Theorem 3.3 shows that there is,
in fact, no such automatism, and that it is perfectly possible that the solution
map of a non-smooth problem is continuously Fréchet differentiable. In the
fields of bilevel optimization and optimal control, this observation is, of course,
very valuable.
2. We would like to point out that the solution map S associated with (V) is
typically not Fréchet differentiable in points (u, α, β) with βk = 0 for some k.
Theorem 3.3 thus does not hold anymore in general when we replace the set
Rn × [0, ∞)m × (0, ∞)m with Rn × [0, ∞)m × [0, ∞)m . Similarly, we cannot
expect Fréchet differentiability anymore when the exponent γ is equal to zero
or one. The fact that S fails to be Fréchet differentiable for γ = 0 and γ = 1
but is Fréchet differentiable for every γ strictly between these values is quite counterintuitive. Note that, in
all of the above cases, the solution operator is still Hadamard directionally
differentiable in the sense of [5, Definition 2.45] as one may easily check using
the same arguments as in the first step of the proof of Theorem 3.3.
3. What we observe in Theorem 3.3 can be interpreted as a non-standard regu-
larization effect. Consider, for instance, the simple model problem
(3.12)
\[
\begin{aligned}
&\min\; \|y - y_D\|^2 + \frac{\mu}{2}\|u\|^2 + \alpha^2\\
&\;\mathrm{s.t.}\;\; y \in \mathbb{R}^n,\ u \in \mathbb{R}^n,\ \alpha \in [0, \infty),\\
&\phantom{\;\mathrm{s.t.}\;\;} y = \operatorname*{arg\,min}_{v \in \mathbb{R}^n}\; \frac{1}{2}\|v - u\|^2 + \alpha\|v\|_1,
\end{aligned}
\]
where yD ∈ Rn and µ > 0 are given, and where ‖·‖1 denotes the 1-norm
on the Euclidean space. Then, it is easy to check that the solution operator
S : Rn × [0, ∞) → Rn, (u, α) ↦ y, associated with the inner minimization
problem in (3.12) is non-smooth. In fact, in the special case n = 1, we can
derive the following closed formula for the solution map:
(3.13)
\[
S(u, \alpha) = \begin{cases} u + \alpha & \text{if } u \leq -\alpha,\\ 0 & \text{if } u \in (-\alpha, \alpha),\\ u - \alpha & \text{if } u \geq \alpha. \end{cases}
\]
Suppose now that we modify (3.12) by adding a term of the form ε‖v‖_p^p for
some p ∈ (1, 2) in the lower-level problem, where ε > 0 is an arbitrary but
fixed small number and where ‖·‖_p denotes the p-norm on Rn. Then, the
resulting bilevel minimization problem
(3.14)
\[
\begin{aligned}
&\min\; \|y - y_D\|^2 + \frac{\mu}{2}\|u\|^2 + \alpha^2\\
&\;\mathrm{s.t.}\;\; y \in \mathbb{R}^n,\ u \in \mathbb{R}^n,\ \alpha \in [0, \infty),\\
&\phantom{\;\mathrm{s.t.}\;\;} y = \operatorname*{arg\,min}_{v \in \mathbb{R}^n}\; \frac{1}{2}\|v - u\|^2 + \alpha\|v\|_1 + \varepsilon\|v\|_p^p
\end{aligned}
\]
can also be written in the form (P) (cf. the second example in Section 2), and
we obtain from Theorem 3.3 that the solution operator Sε : Rn ×[0, ∞) → Rn ,
(u, α) ↦ y, associated with the lower level of (3.14) is continuously Fréchet
differentiable. By adding the term ε‖·‖_p^p, we have thus indeed regularized
the original problem (3.12). What is appealing about the above method of
regularization is that it is “minimally invasive”. It produces an approximate
problem whose reduced objective function possesses C 1 -regularity (and which
is thus amenable to gradient-based solution algorithms, see Section 4) while
preserving the non-smooth features and, e.g., the sparsity promoting nature
on the lower level. Note that the addition of the term ε‖·‖_p^p in particular does
not change the subdifferential at zero of the objective function of the inner
problem in (3.12). We would like to emphasize at this point that the lower-
level problems in (3.12) and (3.14) can be solved easily with various standard
algorithms (e.g., semi-smooth Newton, subgradient or bundle methods). The
major difficulty in (3.12) is handling the non-smoothness of the solution map
S : (u, α) ↦ y on the upper level. In view of these facts, the regularization
effect observed above is the best that we can hope for: By adding the term
ε‖·‖_p^p, we regularize the solution operator of the inner problem in (3.12)
without regularizing the non-differentiable terms in the inner problem. This
removes the non-smoothness where it is problematic (in the solution map)
while preserving it where it can be handled (in the lower-level problem). At
least to the author’s best knowledge, similar effects have not been documented
so far in the literature (where primarily Huber-type regularizations are used
which do not preserve sparsity promoting effects, see [8, 15, 20, 28, 36]).
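A tiny numerical illustration of this remark for n = 1: the map S from (3.13) has a kink at u = ±α (its difference quotients jump from 0 to 1 there), whereas the solution map of the lower level of (3.14) smooths this jump out; its derivative tends to zero as u approaches the threshold from above, in line with the C¹-regularity asserted by Theorem 3.3. The bisection solver and the values ε = 0.1, p = 1.5 below are purely illustrative.

```python
import numpy as np

def S(u, alpha):
    """Closed-form solution map (3.13) of the unregularized problem (3.12), n = 1."""
    return np.sign(u) * max(abs(u) - alpha, 0.0)

def S_eps(u, alpha, eps=0.1, p=1.5):
    """Solution map of the lower level of (3.14) for n = 1: v = 0 if |u| <= alpha,
    otherwise the root of v + eps*p*v^(p-1) = |u| - alpha (found by bisection)."""
    if abs(u) <= alpha:
        return 0.0
    r = abs(u) - alpha
    f = lambda v: v + eps * p * v ** (p - 1.0) - r
    lo, hi = 0.0, r
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0.0 else (lo, mid)
    return np.sign(u) * 0.5 * (lo + hi)

if __name__ == "__main__":
    alpha, du = 1.0, 1e-6
    for u in (alpha - 1e-3, alpha + 1e-3):     # difference quotients around the kink u = alpha
        print((S(u + du, alpha) - S(u, alpha)) / du,
              (S_eps(u + du, alpha) - S_eps(u, alpha)) / du)
```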
4. In the context of the general sensitivity analysis for elliptic variational in-
equalities of the first and the second kind developed in [2, 10, 39], the differ-
entiability result in Theorem 3.3 can be explained as follows: The singular
curvature properties of the terms kGk (·)k1+γ at the origin enforce that the
second subderivative of the non-smooth functional in (V) is generated by a
\[
F'(u, \alpha, \beta)
= \left( (\partial_u J)(y, u, \alpha, \beta) + B^* p_1,\ (\partial_\alpha J)(y, u, \alpha, \beta) + p_2,\ (\partial_\beta J)(y, u, \alpha, \beta) + p_3 \right).
\]
\[
(p_2)_k := \begin{cases} -\,\omega_k \dfrac{\langle G_k y, G_k p_1\rangle}{\|G_k y\|} & \text{if } G_k y \neq 0,\\[1ex] 0 & \text{else,} \end{cases} \qquad k = 1, ..., m,
\]
and
W(y) denotes the space in (3.6), B* and A′(y)* are the adjoints of B and A′(y)
(w.r.t. the Euclidean scalar product), and ∂yJ, ∂uJ, ∂αJ and ∂βJ denote the partial
derivatives of the function J w.r.t. the first, the second, the third and the fourth
argument, respectively.
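The formula above can be turned into a numerical recipe once the linearized system (3.5) has been assembled (cf. the sketch following the explicit form of (3.5)). In the Python sketch below, p1 is obtained from an adjoint solve with the transposed system matrix on W(y); since the adjoint equation (4.1) and the vector p3 are not reproduced in this excerpt, their concrete forms here are the author's reading and should be checked against the full statement of the theorem.

```python
import numpy as np

def reduced_gradient(M, Z, B, G, omega, gamma, y, dJ_dy, dJ_du, dJ_dalpha, dJ_dbeta, tol=1e-12):
    """Sketch of F'(u, alpha, beta) = (dJ/du + B^T p1, dJ/dalpha + p2, dJ/dbeta + p3).
    M is the system matrix of (3.5) at y = S(u, alpha, beta), Z a column basis of
    W(y); the adjoint solve for p1 and the analogue p3 are assumptions of this
    sketch (they are elided in the excerpt)."""
    xi = np.linalg.solve(Z.T @ M.T @ Z, Z.T @ dJ_dy)     # adjoint solve on W(y)
    p1 = Z @ xi
    m = len(G)
    p2, p3 = np.zeros(m), np.zeros(m)
    for k, Gk in enumerate(G):
        gy = Gk @ y
        nrm = np.linalg.norm(gy)
        if nrm > tol:
            inner = np.dot(gy, Gk @ p1)
            p2[k] = -omega[k] * inner / nrm
            p3[k] = -omega[k] * (1 + gamma) * inner / nrm ** (1 - gamma)   # assumed analogue of p2
    return dJ_du + B.T @ p1, dJ_dalpha + p2, dJ_dbeta + p3
```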
Proof. From the chain rule, see [5, Proposition 2.47], it follows straightforwardly
that, for every point (u, α, β) ∈ Rn × [0, ∞)m × (0, ∞)m and every h = (h1, h2, h3) ∈
Rn × Rm × Rm, we have
\[
\langle F'(u, \alpha, \beta), h\rangle = \langle (\partial_y J)(y, u, \alpha, \beta), S'(u, \alpha, \beta)h\rangle + \langle (\partial_u J)(y, u, \alpha, \beta), h_1\rangle
+ \langle (\partial_\alpha J)(y, u, \alpha, \beta), h_2\rangle + \langle (\partial_\beta J)(y, u, \alpha, \beta), h_3\rangle.
\]
Further, we obtain from the variational equality (3.5) for δ := S′(u, α, β)h ∈ W(y)
and the definitions of the vectors p1, p2, and p3 that
Here, we have chosen the dimension d to be one (so that the Euclidean norms in (2.1)
are just absolute value functions), κ ∈ (0, ∞) is a given constant (a lower bound for
the Oldroyd number α) and the quantities A, B etc. are defined as in Section 2.1. To
solve the problem (5.1), we will employ a standard gradient projection method in the
spirit of [4, 23, 27], see Algorithm 5.4 below. We would like to emphasize that the
subsequent analysis should be understood as a feasibility study. With Theorems 3.3
and 4.1 at our disposal, we could also consider more complicated problems and more
sophisticated algorithms at this point (e.g., nonlinear conjugate gradient or trust-
region methods). To avoid overloading this paper, we leave a detailed discussion of the
various possible applications of Theorems 3.3 and 4.1 (e.g., in the fields of parameter
learning and identification, see Section 2.2) for future research.
Before we state the algorithm that we use for the solution of the optimization
problem (5.1), we prove some auxiliary results:
Lemma 5.1 (Multiplier System for the Lower-Level Problem in (5.1)). A vector
y ∈ Rn solves the lower-level problem in (5.1) for a given tuple (u, α) ∈ Rn × (0, ∞)
if and only if there exist multipliers λ1 , ..., λm , η1 , ..., ηm ∈ R such that
(5.2)
\[
\begin{aligned}
&Ay + \sum_{k=1}^{m} |T_k| \left( \alpha G_k^* \lambda_k + 2\alpha^{1/2} G_k^* \eta_k \right) - Bu = 0,\\
&\max\!\left( \lambda_k^2 - 1,\ |G_k y| - (G_k y)\lambda_k \right) = 0, \qquad k = 1, ..., m,\\
&\max\!\left( \eta_k^2 - |G_k y|,\ |G_k y|^{3/2} - (G_k y)\eta_k \right) = 0, \qquad k = 1, ..., m.
\end{aligned}
\]
k
Proof. From standard calculus rules for the convex subdifferential (see, e.g., [22]),
we obtain that a vector y ∈ Rn solves the lower-level problem in (5.1) if and only if
there exist λ1 , ..., λm , η1 , ..., ηm ∈ R with
(5.3)
\[
\begin{aligned}
&Ay + \sum_{k=1}^{m} |T_k| \left( \alpha G_k^* \lambda_k + 2\alpha^{1/2} G_k^* \eta_k \right) - Bu = 0,\\
&\lambda_k \in \partial|\cdot|(G_k y), \qquad \forall k = 1, ..., m,\\
&\eta_k \in \partial\!\left( \tfrac{2}{3}|\cdot|^{3/2} \right)\!(G_k y), \qquad \forall k = 1, ..., m.
\end{aligned}
\]
If we plug in explicit formulas for the convex subdifferentials of the functions | · | and
| · |3/2 in the above, then it follows straightforwardly that y and the multipliers λk
and ηk satisfy the system (5.2). This proves the first implication. If, conversely, we
start with the system (5.2), then the conditions on λk yield |λk | ≤ 1 and
and, as a consequence, ηk ∈ ∂((2/3)|·|^{3/2})(Gk y) for all k = 1, ..., m. This shows that
(5.5)
\[
\begin{aligned}
&p \in W(y),\\
&\langle Ap, z\rangle + \sum_{k :\, G_k y \neq 0} |T_k|\,\alpha^{1/2}\, \frac{(G_k p)(G_k z)}{|G_k y|^{1/2}} = \langle B(y - y_D), z\rangle \qquad \forall z \in W(y)
\end{aligned}
\]
Lemma 5.3. The discretized optimal control problem (5.1) admits at least one
solution (u∗ , α∗ ) ∈ Rn × [κ, ∞). Further, every solution (u∗ , α∗ ) of (5.1) satisfies
ζ(u∗ , α∗ ) = 0, where ζ : Rn × [κ, ∞) → [0, ∞) is the function defined by
(5.6)
\[
\zeta(u, \alpha) := \begin{cases} \|F'(u, \alpha)\| & \text{if } \alpha > \kappa,\\ \left\| \big( (\partial_u F)(u, \alpha),\ \min(0, (\partial_\alpha F)(u, \alpha)) \big) \right\| & \text{if } \alpha = \kappa. \end{cases}
\]
Proof. To show that the problem (5.1) admits a solution, we can use exactly
the same argumentation as in the proof of Theorem 2.1. It remains to prove that
ζ(u, α) = 0 is a necessary optimality condition. This, however, follows immediately
from the equivalence
for all (u, α) ∈ Rn × [κ, ∞), where N_{Rn×[κ,∞)}(u, α) denotes the normal cone to the
set Rn × [κ, ∞) at (u, α).
We are now in the position to state the algorithm that we use for the solution of
the problem (5.1):
ej := Fi − F(ui − σj gi, max(κ, αi − σj hi))
      − θσj ζi ( ‖gi‖² + min(0, hi)² + min( max(0, hi)², max(0, hi)(αi − κ)/σj ) )
if ej < 0 then
    Define σj+1 := νσj.
else
    Define τi := σj and break.
end if
end for
Define ui+1 := ui − τi gi and αi+1 := max(κ, αi − τi hi).
end for
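For reference, the following Python sketch implements the projected gradient iteration with the Armijo-type backtracking of Algorithm 5.4 together with the stationarity measure (5.6). The reduced objective F and its gradient are assumed to be supplied by the user, e.g., by a lower-level solver for (5.2) combined with the adjoint system (5.5); the toy objective in the smoke test is unrelated to (5.1) and only checks that the iteration runs.

```python
import numpy as np

def zeta(grad_u, grad_a, alpha, kappa):
    """Stationarity measure (5.6)."""
    if alpha > kappa:
        return np.sqrt(np.dot(grad_u, grad_u) + grad_a ** 2)
    return np.sqrt(np.dot(grad_u, grad_u) + min(0.0, grad_a) ** 2)

def gradient_projection(F, dF, u0, alpha0, kappa, sigma=40.0, nu=0.25, theta=0.5,
                        max_iter=200, tol=1e-9):
    """Sketch of Algorithm 5.4: projected gradient steps in (u, alpha) with an
    Armijo-type line search; F(u, alpha) and dF(u, alpha) = (grad_u, grad_alpha)
    are user-supplied callables."""
    u, alpha = np.array(u0, dtype=float), float(alpha0)
    for _ in range(max_iter):
        grad_u, grad_a = dF(u, alpha)
        z = zeta(grad_u, grad_a, alpha, kappa)
        if z <= tol:
            break
        g, h = grad_u / z, grad_a / z                     # scaled search direction
        Fi, s = F(u, alpha), sigma
        for _ in range(60):                               # backtracking until sufficient decrease
            model = (np.dot(g, g) + min(0.0, h) ** 2
                     + min(max(0.0, h) ** 2, max(0.0, h) * (alpha - kappa) / s))
            if Fi - F(u - s * g, max(kappa, alpha - s * h)) >= theta * s * z * model:
                break
            s *= nu
        u, alpha = u - s * g, max(kappa, alpha - s * h)
    return u, alpha

if __name__ == "__main__":                                # smoke test on a smooth toy objective
    F = lambda u, a: 0.5 * np.dot(u - 1.0, u - 1.0) + (a - 3.5) ** 2
    dF = lambda u, a: (u - 1.0, 2.0 * (a - 3.5))
    print(gradient_projection(F, dF, np.zeros(3), 4.0, kappa=3.0))
```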
Lemma 5.5. For every arbitrary but fixed tuple (ui , αi ) ∈ Rn ×[κ, ∞) that satisfies
ζ(ui , αi ) 6= 0, and every choice of parameters σ > 0 and ν, θ ∈ (0, 1), the Armijo-type
line-search in Algorithm 5.4 (i.e., the inner for-loop with index j) terminates after
finitely many steps.
Proof. From the Fréchet differentiability of the reduced objective function F , the
definitions ζi := ζ(ui , αi ), gi := (∂u F )(ui , αi )/ζi , hi := (∂α F )(ui , αi )/ζi , and simple
distinctions of cases, it follows straightforwardly that
holds for all s ∈ (0, ∞), where the Landau symbol refers to the limit s ↘ 0. Since
ζi 6= 0 and, as a consequence,
\[
\liminf_{s \searrow 0}\ \left( \|g_i\|^2 + \min(0, h_i)^2 + \min\!\left( \max(0, h_i)^2,\ \max(0, h_i)\,\frac{\alpha_i - \kappa}{s} \right) \right) > 0,
\]
the above yields that there exists an s0 > 0 such that, for all s ∈ (0, s0 ), we have
produces an infinite sequence {(ui , αi )}. In this situation, it follows from the sufficient
decrease condition that is used for the calculation of the step sizes τi that the sequence
{F(ui, αi)} is monotonically decreasing, and we obtain from the structure of the
objective function in (5.1) that there exists a constant C > 0 independent of i with
The above implies that the sequence {(ui , αi )} is bounded, that the function values
F (ui , αi ) converge for i → ∞, and that the sequence {(ui , αi )} possesses at least
one accumulation point. To prove that every accumulation point of the iterates is
stationary in the sense of Lemma 5.3, we argue by contradiction: Suppose that there
exists an accumulation point (u∗ , α∗ ) ∈ Rn × [κ, ∞) of the sequence {(ui , αi )} which
satisfies ζ(u∗ , α∗ ) > 0, and let (uij , αij ) be a subsequence with (uij , αij ) → (u∗ , α∗ )
for j → ∞. Then, it follows from the continuous Fréchet differentiability of F ,
the definition of the stationarity measure ζ and the boundedness of {(ui , αi )} that
the sequence {ζ(uij , αij )} ⊂ [0, ∞) is bounded, and we may assume w.l.o.g. that
ζij := ζ(uij , αij ) → ζ ∗ holds for some ζ ∗ ≥ 0. Note that we can ignore the case ζ ∗ = 0
here since this equality would imply ζ(u∗, α∗) = 0 (see (5.6), the continuity of F′ and
a simple distinction of cases). Thus, w.l.o.g. ζ ∗ > 0 and ζij ≥ ε > 0 for some constant
ε > 0. We now consider two different situations: If there exists a subsequence of ij
(still denoted by the same symbol) such that the step sizes τij ∈ (0, σ] satisfy τij → τ ∗
for some τ ∗ > 0, then we obtain from our line-search procedure, the definitions of gij
and hij , the convergence of the function values {F (ui , αi )} and the continuity of the
derivative F′ that
\[
\begin{aligned}
0 &= \lim_{j\to\infty} \left( F(u_{i_j}, \alpha_{i_j}) - F(u_{i_j+1}, \alpha_{i_j+1}) \right)\\
&\geq \lim_{j\to\infty} \theta \tau_{i_j} \zeta_{i_j} \left( \|g_{i_j}\|^2 + \min(0, h_{i_j})^2 + \min\!\left( \max(0, h_{i_j})^2,\ \max(0, h_{i_j})\,\frac{\alpha_{i_j} - \kappa}{\tau_{i_j}} \right) \right)\\
&\geq \frac{\theta \tau^*}{\zeta^*} \Bigg( \|(\partial_u F)(u^*, \alpha^*)\|^2 + \min(0, (\partial_\alpha F)(u^*, \alpha^*))^2\\
&\hspace{4em} + \min\!\left( \max(0, (\partial_\alpha F)(u^*, \alpha^*))^2,\ \max(0, (\partial_\alpha F)(u^*, \alpha^*))\,\frac{(\alpha^* - \kappa)\zeta^*}{\tau^*} \right) \Bigg)\\
&\geq 0.
\end{aligned}
\]
(5.7)
\[
\begin{aligned}
&\int_0^1 \left\langle F'\!\left( u_{i_j} - s\,\frac{\tau_{i_j}}{\nu} g_{i_j},\ \max\!\Big(\kappa,\ \alpha_{i_j} - s\,\frac{\tau_{i_j}}{\nu} h_{i_j}\Big) \right),\ \left( \frac{\tau_{i_j}}{\nu} g_{i_j},\ \alpha_{i_j} - \max\!\Big(\kappa,\ \alpha_{i_j} - \frac{\tau_{i_j}}{\nu} h_{i_j}\Big) \right) \right\rangle \mathrm{d}s\\
&\quad\leq \frac{\theta \tau_{i_j} \zeta_{i_j}}{\nu} \left( \|g_{i_j}\|^2 + \min(0, h_{i_j})^2 + \min\!\left( \max(0, h_{i_j})^2,\ \max(0, h_{i_j})\,\frac{(\alpha_{i_j} - \kappa)\nu}{\tau_{i_j}} \right) \right).
\end{aligned}
\]
If we assume that there exists a subsequence of ij (again not relabeled) such that
αij − τij hij /ν ≥ κ holds for all j, then we may divide the left- and the right-hand
side of (5.7) by τij , employ the elementary estimate min(a, b) ≤ a for all a, b ∈ R, and
pass to the limit j → ∞ (using the continuity of the derivative F′, the boundedness
of the iterates {(ui, αi)}, the definitions of gij and hij, the convergence τij ↘ 0, and
the dominated convergence theorem) to obtain
\[
\frac{1}{\zeta^* \nu}\, \|F'(u^*, \alpha^*)\|^2 \leq \frac{\theta}{\zeta^* \nu}\, \|F'(u^*, \alpha^*)\|^2.
\]
This inequality again contradicts our assumption ζ(u∗ , α∗ ) > 0. If, on the other hand,
we can find a subsequence of ij with αij − τij hij /ν ≤ κ, then, along this subsequence,
it necessarily holds
and we may assume w.l.o.g. that (αij − κ)/τij → ξ holds for some ξ ≥ 0. By dividing
by τij in (5.7) and by passing to the limit j → ∞, we now obtain analogously to the
case αij − τij hij /ν ≥ κ that
\[
\begin{aligned}
&\frac{1}{\zeta^* \nu}\, \|(\partial_u F)(u^*, \alpha^*)\|^2 + \xi\,(\partial_\alpha F)(u^*, \alpha^*)\\
&\quad\leq \frac{\theta}{\zeta^* \nu}\, \|(\partial_u F)(u^*, \alpha^*)\|^2 + \min\!\left( \frac{\theta}{\zeta^* \nu}\,(\partial_\alpha F)(u^*, \alpha^*)^2,\ \theta \xi\,(\partial_\alpha F)(u^*, \alpha^*) \right)\\
&\quad\leq \theta \left( \frac{1}{\zeta^* \nu}\, \|(\partial_u F)(u^*, \alpha^*)\|^2 + \xi\,(\partial_\alpha F)(u^*, \alpha^*) \right).
\end{aligned}
\]
The above implies ‖(∂uF)(u∗, α∗)‖² = 0 and yields, in combination with (5.8) and
the definition of ζ, that ζ(u∗ , α∗ ) = 0. This is again a contradiction. Accumulation
points with ζ(u∗ , α∗ ) > 0 thus cannot exist and the proof is complete.
The results of a numerical experiment conducted with Algorithm 5.4 can be seen
in Figure 2 below. As the plots show, the behavior of the iterates generated by our
gradient projection method accords very well with the predictions of Theorem 5.6. In
particular, we observe that the quantity ζi := ζ(ui , αi ), which measures the degree
of stationarity of the current iterate (ui , αi ), converges to zero as i tends to infinity.
This demonstrates that Theorems 3.3 and 4.1 indeed make it possible to solve bilevel
optimization problems of the type (P) with standard gradient-based algorithms. We
would like to emphasize at this point that Algorithm 5.4 solves (5.1) “as it is”, i.e., in
the presence of the absolute value functions on the lower level and without any kind of
regularization or modification of the problem or the solution procedure (in contrast to
the methods in [8, 15, 20, 28, 35, 36]). The latter implies in particular that the sparsity
promoting effects that the non-smooth terms |Gk(·)| in (5.1) have on the gradient of
the finite element functions ȳh∗ := Σj yj∗ϕj, p̄h∗ := Σj pj∗ϕj and ūh∗ := Σj uj∗ϕj associated
with a solution (u∗, α∗) ∈ Rn × [κ, ∞) of (5.1) (cf. Section 2.1) are preserved in our
approach. This can also be seen in Figure 2 where the optimal state ȳh∗ ∈ Vh has a
distinct “flat” region in the middle of the fluid domain. Recall that, in the context
of the optimal control problem (5.1), the set {∇ȳh∗ = 0} is exactly that part of the
domain Ω where the viscoplastic medium under consideration behaves like a solid (the
nucleus). Our solution method thus makes it possible to identify precisely where rigid material
behavior occurs in the fluid domain when we optimize the objective function in (5.1).
Such an identification is not possible anymore when regularization approaches are
used which necessarily remove the sparsity promoting effects from the problem. Note
that the functions p̄∗h ∈ Vh and ū∗h ∈ Vh associated with a solution (u∗ , α∗ ) of (5.1)
directly inherit the “flatness” properties of ȳh∗ due to (5.5) and since the optimality
condition ζ(u∗ , α∗ ) = 0 implies µu∗ + p∗ = 0, see (5.4).
[Figure 2: six panels (a)–(f); see the caption below. Panels (e) and (f) show the final multipliers λk∗ and the final adjoint state p̄h∗ = Σj pj∗ϕj.]
Fig. 2. Numerical results obtained for the problem (5.1) on the interval Ω = (0, 1) with an
equidistant partition T of width 1/500 and µ = 0.000025, κ = 3, σ = 40, ν = 0.25, θ = 0.5, α0 = 4
and u0 = (10, 10, ..., 10). Figures (a) and (b) show the reduction of the function value Fi and the
stationarity measure ζi during the first 200 iterations of Algorithm 5.4. The state, the control,
the multipliers and the adjoint state of the approximate solution at iteration 200 can be seen in
figures (c) to (f). The considered desired state ȳD ∈ H_0^1(Ω) is plotted as a dashed line in (c). The
multipliers λ∗k are depicted as a step function whose value on a cell Tk of T is λ∗k. As a tolerance
for the residual of the semi-smooth Newton method used for the solution of (5.2), we chose 10^{-10}.
We conclude this paper with some additional remarks on Algorithm 5.4, the
numerical results in Figure 2, and the analysis in Sections 3 and 4:
Remark 5.7.
1. As the graph in Figure 2(a) shows, in our numerical experiment, we do not
observe a significant decrease of the function value Fi anymore after approx-
imately 25 gradient steps. This number of iterations is thus sufficient if we
are primarily interested in determining a tuple (u, α) for which the value of
the reduced objective function is as small as possible. We would like to point
out that, although Fi remains nearly constant for i ≥ 25, the stationarity
measure ζi still changes in the later iterations of the algorithm. The same is
true for the quantities ui and pi . The two global maxima of the control seen
in Figure 2(d), for example, are not visible until i ≈ 60.
2. In the numerical experiment of Figure 2, the Oldroyd number α is decreased
from the initial guess α0 = 4 to the lower bound κ = 3 in the first three
iterations of Algorithm 5.4 and afterwards remains constant. This makes
sense since a low α is preferable in the situation of the discrete tracking-type
optimal control problem (5.1). (The lower the material parameter α, the
lower the yield stress and the smaller the pressure gradient that is needed to
create a desired flow profile.) If we replace the term α2 on the upper level
of (5.1) with, e.g., (α − αD )2 for some sufficiently large αD > 0, then this
behavior changes and we observe convergence to an α∗ > κ.
3. It is easy to check that the solution operator S : Rn × (0, ∞) → Rn associated
with the elliptic variational inequality on the lower level of (5.1) is constant
zero in an open neighborhood of the cone {0} × (0, ∞). (This is precisely the
set where the pressure gradient is not large enough to move the fluid under
consideration, compare also with (3.13).) Because of this behavior, the tuple
(0, κ) is always a local minimizer of (5.1) and a stationary point in the sense
of Lemma 5.3. To avoid converging to the point (0, κ), which is typically
not globally optimal and thus only of limited interest, one has to choose an
initial guess u0 that is sufficiently far away from the origin when an iterative
method analogous to Algorithm 5.4 is used for the solution of (5.1). Note
that, in the situation of Figure 2, we have indeed found a point that is better
than (0, κ) since the final iterate achieves a function value that is far smaller
than F (0, κ) ≈ 0.333446.
4. Note that the variational problems (4.1) and (5.5) can also be formulated
as quadratic minimization problems with linear equality constraints. This
makes it possible to use standard methods from quadratic programming for
the calculation of the adjoint state and the gradient of the reduced objective
function. In the numerical experiment of Figure 2, we determined p with the
Matlab routine quadprog.
5. We would like to point out that, if an iterative scheme is used for the solution
of the lower-level problem in (5.1), then the successive evaluations of the maps
S and F in the line-search procedure of Algorithm 5.4 can be performed quite
effectively since the last trial iterate can always be used as an initial guess
for the calculation of the next required state S(ui − σj gi , max(κ, αi − σj hi )).
In the situation of Figure 2, it can be observed that Algorithm 5.4 requires
an average of approximately four evaluations of the solution operator S per
gradient step over the first 200 iterations. The majority of these calculations
are needed for large i.
REFERENCES
[20] J. C. De los Reyes, C.-B. Schönlieb, and T. Valkonen, Bilevel parameter learning for
higher-order total variation regularisation models, J. Math. Imaging Vision, 57 (2017),
pp. 1–25.
[21] E. J. Dean, R. Glowinski, and G. Guidoboni, On the numerical simulation of Bingham
visco-plastic flow: Old and new results, J. Non-Newton. Fluid Mech., 142 (2007), pp. 36–
62.
[22] I. Ekeland and R. Temam, Convex Analysis and Variational Problems, North-Holland Pub-
lishing Company, 1976.
[23] W. Forst and D. Hoffman, Optimization - Theory and Practice, Springer, 2010.
[24] R. Glowinski, Lectures on Numerical Methods for Non-Linear Variational Problems, Tata
Institute of Fundamental Research, Bombay, 1980.
[25] M. Hintermüller and T. Wu, Bilevel optimization for calibrating point spread functions in
blind deconvolution, Inverse Probl. Imaging, 9 (2015), pp. 1139–1169.
[26] R. R. Huilgol and Z. You, Application of the augmented Lagrangian method to steady pipe
flows of Bingham, Casson and Herschel-Bulkley fluids, J. Non-Newton. Fluid Mech., 128
(2005), pp. 126–143.
[27] C. T. Kelley, Iterative Methods for Optimization, vol. 18 of Frontiers in Applied Mathematics,
SIAM, 1999.
[28] K. Kunisch and T. Pock, A bilevel optimization approach for parameter learning in varia-
tional models, SIAM J. Imaging Sci., 6 (2013), pp. 938–983.
[29] A. B. Levy, Implicit multifunction theorems for the sensitivity analysis of variational condi-
tions, Math. Program., 74 (1996), pp. 333–350.
[30] P. Lindqvist, Stability of solutions of div(|∇u|p−2 ∇u) = f with varying p, J. Math. Anal.
Appl., 127 (1987), pp. 93–102.
[31] P. P. Mosolov and V. P. Miasnikov, Variational methods in the theory of the fluidity of a
viscous-plastic medium, J. Appl. Math. Mech., 29 (1965), pp. 545–577.
[32] P. P. Mosolov and V. P. Miasnikov, On stagnant flow regions of a viscous-plastic medium
in pipes, PMM, 30 (1966), pp. 705–717.
[33] P. P. Mosolov and V. P. Miasnikov, On qualitative singularities of the flow of a viscoplastic
medium in pipes, J. Appl. Math. Mech., 31 (1967), pp. 609–613.
[34] J. Nocedal and S. J. Wright, Numerical Optimization, Springer, 2006.
[35] P. Ochs, R. Ranftl, T. Brox, and T. Pock, Bilevel optimization with nonsmooth lower
level problems, Scale Space and Variational Methods in Computer Vision, Lecture Notes
in Computer Science, 9087 (2015), pp. 654–665.
[36] G. Peyré and J. Fadili, Learning analysis sparsity priors, Proc. of Sampta’11, (2011).
[37] R. Pytlak, Conjugate Gradient Algorithms in Nonconvex Optimization, vol. 89 of Nonconvex
Optimization and Its Applications, Springer, 2008.
[38] A.-T. Rauls and G. Wachsmuth, Generalized derivatives for the solution operator of the
obstacle problem, preprint spp1962-057, 2018.
[39] R. T. Rockafellar and R.-B. Wets, Variational Analysis, vol. 317 of Grundlehren der Math-
ematischen Wissenschaften [Fundamental Principles of Mathematical Sciences], Springer,
Berlin, 1998.
[40] M. Ruzicka, Nichtlineare Funktionalanalysis, Springer, Berlin/Heidelberg, 2004.
[41] S. Vaiter, G. Peyre, C. Dossal, and J. Fadili, Robust sparse analysis regularization, IEEE
Trans. Inform. Theory, 59 (2013), pp. 2001–2016.