
0/1 Constrained Optimization Solving Sample Average Approximation for Chance Constrained Programming


Shenglong Zhou∗, Lili Pan†, Naihua Xiu‡, Geoffrey Ye Li§

Abstract: Sample average approximation (SAA) is a tractable approach to deal with chance
constrained programming, a challenging issue in stochastic programming. The constraint is
usually characterized by the 0/1 loss function, which results in enormous difficulties in designing
numerical algorithms. Most existing methods have been built on reformulations of SAA,
such as binary integer programming or its relaxations. However, no viable algorithms have been
developed to tackle SAA directly, not to mention theoretical guarantees. In this paper, we
investigate a novel 0/1 constrained optimization problem, which provides a new way to address
SAA. Specifically, by deriving the Bouligand tangent and Fréchet normal cones of the 0/1
constraint, we establish several optimality conditions including the one that can be equivalently
expressed by a system of equations, thereby allowing us to design a smoothing Newton type
method. We show that the proposed algorithm has a locally quadratic convergence rate and
high numerical performance.

Keywords: Sample average approximation, 0/1 constrained optimization, chance constrained programming, optimality conditions, smoothing Newton method, locally quadratic convergence

1 Introduction
Chance constrained programming (CCP) is an efficient tool for decision making in uncertain envi-
ronments to hedge risk and has been extensively studied recently [1, 13, 9, 37, 30]. Applications of
CCP include supply chain management [20], optimization of chemical processes [16], surface water
quality management [38], to name just a few. A simple CCP problem takes the form of

    min_{x∈RK} f(x),   s.t.   P{g(x, ξ) ≤ 0} ≥ 1 − α,                    (CCP)

or equivalently,

    min_{x∈RK} f(x),   s.t.   P{g(x, ξ) ≰ 0} ≤ α,

where function f : RK → R is continuously differentiable, ξ is a random vector with a probability


distribution supported on set Ξ ⊂ RD, vector-valued function g : RK × Ξ → RM is continuous, and
α is a confidence parameter chosen by the decision maker, typically near zero, e.g., α = 0.01 or
α = 0.05. Here, x ≰ 0 means that x has at least one positive entry. Set {g(x, ξ) ≤ 0} represents
the feasible region described by a group of constraints subject to uncertainty ξ. In (CCP), if
g(x, ξ) = A(ξ)x − b(ξ), where A(ξ) and b(ξ) are a matrix and a vector, then the constraint is called
a single chance constraint if M = 1 [6] and a joint chance constraint if M > 1 [26]. Some general
theory can be found in [31, 32] and the references therein.

ITP Lab, Department of EEE, Imperial College London, UK. Email: [email protected]

Department of Mathematics, Shandong University of Technology, China. Email: [email protected]

Department of Applied Mathematics, Beijing Jiaotong University, China. Email: [email protected]
§
ITP Lab, Department of EEE, Imperial College London, UK. Email: [email protected]

1
1.1 Related work

Problem (CCP) is difficult to solve numerically in general for two reasons. The first reason is that
quantity P{g(x, ξ) ≤ 0} is hard to compute for a given x as it requires multi-dimensional integration.
The second reason is that the feasible set is generally nonconvex even when g(·, ξ) is convex. Therefore,
to solve the problem, one common strategy is to make some assumptions on the distributions of
ξ. For example, for the case of the single chance constraint, the feasible set is convex provided
that α < 0.5 and ξ has a nondegenerate multivariate normal distribution [18] or (A(ξ); b(ξ)) has
a symmetric log-concave density [19]. It also can be expressed as a second-order cone constraint if
ξ has an elliptically symmetric distribution [15].
Without making assumptions on ξ, there are several approaches by sampling to approximate the
probabilistic constraint. These techniques can reformulate problem (CCP) as integer programming
or nonlinear programming. In [1], a single chance constrained problem has been reformulated as a
mixed-integer nonlinear program and the integer variables were relaxed into continuous ones. The
sample approximation [23, 1] can be reformulated as a mixed-integer program [24], which can be
solved by a branch-and-bound approach [4] or a branch-and-cut decomposition algorithm [22]. To avoid
solving mixed-integer programming, some other approximation methods have been proposed, such
as difference-of-convex (DC) functions approximation to the indicator function [17], inner-outer
approximation [13], and convex approximations [27]. In [17], a gradient-based Monte Carlo method
has been developed to solve a sequence of convex approximations which aimed at addressing the
DC approximation of problem (CCP).
As pointed out before, it is hard to solve (CCP) numerically. A popular approach is the
sample average approximation (SAA). Specifically, let ξ^1, · · · , ξ^N be N independent and identically
distributed realizations of the random vector ξ. Then, based on the model in [23, 29], SAA
takes the form of

    min_{x∈RK} f(x),   s.t.   (1/N) Σ_{n=1}^{N} ℓ_{0/1}( max_{m=1,··· ,M} (g(x, ξ^n))_m ) ≤ α,        (SAA)

where ℓ_{0/1}(t) is the 0/1 loss function [12, 21, 14] defined as

    ℓ_{0/1}(t) := { 1, if t > 0;  0, if t ≤ 0 }.                          (1.1)

The function is known as the (Heaviside) step function [28, 11, 41] named after Oliver Heaviside
(1850-1925), an English mathematician and physicist. We would like to point out that (SAA) with
α > 0 is always non-convex. Nevertheless, it has gained popularity because it requires relatively few
assumptions on the structure of (CCP) or the distribution of ξ, although it only yields statistical
bounds on solution feasibility and optimality and requires replication to do so [22].
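As a quick illustration of the reformulation (this sketch and the helper names are ours, not part of the original development), the constraint in (SAA) can be evaluated empirically by counting the fraction of samples that violate g(x, ξ^n) ≤ 0:

import numpy as np

def saa_constraint_value(g, x, samples):
    # Left-hand side of (SAA): (1/N) * sum_n ell_{0/1}( max_m (g(x, xi^n))_m ),
    # i.e. the fraction of samples with at least one positive constraint component.
    return float(np.mean([np.max(g(x, xi)) > 0 for xi in samples]))

# Toy usage with a hypothetical linear map g(x, xi) = A(xi) x - b(xi).
rng = np.random.default_rng(0)
K, M, N, alpha = 5, 3, 200, 0.05
samples = [(rng.standard_normal((M, K)), rng.standard_normal(M) + 3.0) for _ in range(N)]
g = lambda x, xi: xi[0] @ x - xi[1]
x = 0.1 * np.ones(K)
lhs = saa_constraint_value(g, x, samples)
print(lhs, lhs <= alpha)   # empirical violation probability and feasibility for (SAA)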
Therefore, there is an impressive body of work on developing numerical algorithms [29, 30, 35]
and establishing asymptotic convergence [23, 36, 37] for problem (SAA). One way to solve the
problem is to adopt the integer programming-based method [2]. When f and g are linear, a
branch-and-cut decomposition algorithm is proposed for problem (SAA) based on mixed-integer
linear optimization techniques [22]. Another method is by structuring (SAA) as nonlinear pro-
gramming. For example, the authors in [30] rewrote the chance constraint as a quantile constraint
and reformulated (CCP) through (SAA) as nonlinear programming. Then they cast a trust-region

method to tackle the problem with a joint chance constraint. In [9], (SAA) has been reformu-
lated as a cardinality-constrained nonlinear optimization problem which was solved by a sequential
method, where trial steps were computed through piecewise-quadratic penalty function models.
The proposed method has been proven to converge to a stationary point of the penalty function.
In summary, most of the aforementioned work focused on surrogates of (SAA) rather than
providing a thorough optimality analysis and developing efficient algorithms for (SAA) directly.

1.2 The main model

To simplify the constraint in (SAA), we define a measurement as

    ‖Z‖₀⁺ := Σ_{n=1}^{N} ℓ_{0/1}(Z^max_{:n}),                             (1.2)

where matrix Z ∈ RM×N and Z^max_{:n} denotes the maximum entry of the nth column of Z, namely,

    Z^max_{:n} := max_{m∈M} Z_{mn}.

One can observe that ‖Z‖₀⁺ counts the number of columns in Z with positive maximum values.
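For intuition, ‖Z‖₀⁺ can be computed in two lines of NumPy (an illustration of (1.2); the function name is ours), here applied to the 2 × 4 matrix that appears later in example (2.4):

import numpy as np

def zero_one_norm(Z):
    # ||Z||_0^+ as in (1.2): number of columns whose maximum entry is positive.
    return int(np.sum(Z.max(axis=0) > 0))

Z = np.array([[2.0, 2.0, 0.0, -1.0],
              [0.0, -1.0, -2.0, -3.0]])
print(zero_one_norm(Z))   # prints 2: only the first two columns have positive maxima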
Motivated by (SAA), in this paper, we study 0/1 (or step) constrained optimization (SCO):

    min_{x∈RK} f(x),   s.t.   ‖G(x)‖₀⁺ ≤ s,                               (SCO)

where G(x) : RK → RM×N is continuously differentiable and s ≪ N is a given positive integer.


The above problem is NP-hard due to the discrete nature of the 0/1 loss function. However, it
covers various applications:

• If G(x) = (g(x, ξ^1) · · · g(x, ξ^N)) and s = ⌈αN⌉, the smallest integer no less than αN, then
model (SCO) reduces to (SAA).

• If G(·) is a linear mapping and M = 1, then model (SCO) enables us to deal with support
vector machine [8, 40] and one-bit compressed sensing [5, 42].

1.3 Contributions

As far as we know, this is the first paper that processes (SAA) with 0/1 constraints directly and pro-
vides some theoretical guarantees as well as a viable numerical algorithm. The main contributions
of this paper are threefold.

1) A novel model with new theoretical perspectives.


Despite the difficulty stemming from the 0/1 loss function, we have succeeded in building some
theory as follows. By denoting

    S^G := { x ∈ RK : ‖G(x)‖₀⁺ ≤ s },                                     (1.3)

and

    S := { Z ∈ RM×N : ‖Z‖₀⁺ ≤ s },                                        (1.4)

we derive the projection of a point onto S as well as the Bouligand tangent and Fréchet normal
cones of S G and S, see Propositions 2.1 and 2.3. In addition, the established properties on
S G allow us to conduct extensive optimality analysis. Specifically, we introduce a KKT point,
define a τ -stationary point of (SCO), and then reveal their relationships to the local minimizers.

2) A smoothing Newton type method with locally quadratic convergence.


An advantageous property of a τ-stationary point is that it can be equivalently expressed
as a system of equations, which enables us to make full use of the smoothing Newton
method for solving (SCO), dubbed SNSCO. We then prove that the proposed algorithm has
a locally quadratic convergence rate under standard assumptions, see Theorem 4.2. The
considerable effort needed to acquire such a result indicates that the proof is non-trivial.

3) A high numerical performance.


We apply SNSCO to norm optimization problems. In the case of single
chance constraints, it outperforms other selected solvers, especially in computational speed.

1.4 Organization

This paper is organized as follows. In the next section, we calculate the projection of one point
onto set S and derive the Bouligand tangent cone and normal cones of S G and S. In Section 3,
we establish two kinds of optimality conditions of problem (SCO) based on the normal cone of
S G . They are KKT points and τ -stationary points. We then reveal their relationships to the local
minimizers. In Section 4, we equivalently rewrite the τ -stationary point as a system of τ -stationary
equations. Then we develop a smoothing Newton type method, SNSCO, to solve the equations and
establish its locally quadratic convergence property. We implement SNSCO to solve problems with
a single chance constraint and joint chance constraints and give some concluding remarks in the
last two sections.

1.5 Notation

We end this section with defining some notation employed throughout this paper. Denote two
index sets as

M := {1, 2, · · · , M } and N := {1, 2, · · · , N }.

Given a subset T ⊆ N, its cardinality and complement are |T| and T̄ := N \ T. For a scalar a ∈ R,
⌈a⌉ represents the smallest integer no less than a. For two matrices A = (Amn)M×N ∈ RM×N and
B = (Bmn )M ×N ∈ RM ×N , we denote

A+ := (max{Amn , 0})M ×N and A− := (min{Amn , 0})M ×N . (1.5)


and their inner product is ⟨A, B⟩ := Σ_{mn} A_{mn}B_{mn}. Moreover, let 0 ≥ A ⊥ B ≥ 0 stand for
0 ≥ Amn , Bmn ≥ 0, and Amn Bmn = 0 for any m ∈ M, n ∈ N . We use k · k to represent the
Euclidean norm for vectors and Frobenius norm for matrices. The neighbourhood of x ∈ RK with
a radius ε > 0 is written as

    N(x, ε) := {v ∈ RK : ‖v − x‖ < ε}.

The nth largest singular value of A ∈ RN ×N is denoted by σn (A), namely σ1 (A) ≥ σ2 (A) ≥ · · · ≥
σN (A). Particularly, we write

kAk2 := σ1 (A) and σmin (A) := σN (A).

For matrix Z ∈ RM ×N , let ZJ be the sub-part indexed by J ⊆ M × N , namely ZJ :=


{Zmn }(m,n)∈J . Particularly, let Z:T stand for the sub-matrix containing columns indexed by T ⊆ N ,
and Z:n represent the nth column of Z. In addition, we define the following useful index sets:

    Γ+ := {n ∈ N : Z^max_{:n} > 0},
    Γ0 := {n ∈ N : Z^max_{:n} = 0},                                       (1.6)
    Γ− := {n ∈ N : Z^max_{:n} < 0},
    VΓ := {(m, n) ∈ M × N : Z_{mn} = 0, ∀ n ∈ Γ},   Γ ⊆ N.

We point out that Γ+ , Γ0 , Γ− , and VΓ depend on Z, but we will drop their dependence if no
additional explanations are provided. Recalling (1.2), the above definitions indicate that

    ‖Z‖₀⁺ = |Γ+|.                                                         (1.7)

Let Z↓s be the sth largest element of {‖Z⁺_{:1}‖, ‖Z⁺_{:2}‖, · · · , ‖Z⁺_{:N}‖}. Then

    Z↓s = 0 if ‖Z‖₀⁺ < s,   and   Z↓s > 0 if ‖Z‖₀⁺ ≥ s.                   (1.8)

For G(x) : RK → RM ×N and an index set J , we denote Gmn (x) as its (m, n)th element and
∇J G(x) ∈ RK×|J | as a matrix with columns consisting of ∇Gmn (x) with (m, n) ∈ J , namely,

∇J G(x) := {∇Gmn (x) : (m, n) ∈ J } ∈ RK×|J | .

Finally, we denote an operator ◦ by

    ∇_J G(x) ◦ W_J := Σ_{(m,n)∈J} W_{mn} ∇G_{mn}(x),
    ∇²_J G(x) ◦ W_J := Σ_{(m,n)∈J} W_{mn} ∇²G_{mn}(x),

where W ∈ RM×N. Particularly,

    ∇G(x) ◦ W := Σ_{(m,n)∈M×N} W_{mn} ∇G_{mn}(x).

2 Properties of Feasible Sets S and S G


In this section, we focus on feasible set S^G in (1.3) and set S in (1.4), aiming at deriving
the projection of a point onto S and their tangent and normal cones. To this end, we define a
set used throughout the paper. Let Γ+, Γ0, and Γ− be given by (1.6). Denote

    T(Z; s) := { (Γ+ \ Γs) ∪ Γ0 :  Γs ⊆ Γ+, |Γs| = min{s, |Γ+|},  ‖Z⁺_{:i}‖ ≥ ‖Z⁺_{:j}‖ ∀ i ∈ Γs, ∀ j ∈ Γ+ \ Γs }.    (2.1)

One can observe that Γs contains |Γs| indices corresponding to the first |Γs| largest values in
{‖Z⁺_{:i}‖ : i ∈ Γ+}. Moreover, the set defined above has the following properties.

• For any T ∈ T(Z; s) and any n ∈ T,

    ‖Z⁺_{:n}‖ ∈ (0, Z↓s]  if n ∈ Γ+ \ Γs,   and   ‖Z⁺_{:n}‖ = 0 ≤ Z↓s  if n ∈ Γ0,      (2.2)

and

    T̄ = N \ T = Γs ∪ Γ−.                                                  (2.3)

• If ‖Z‖₀⁺ ≤ s, then Γs = Γ+. Hence T(Z; s) = {Γ0} and T̄ = Γ+ ∪ Γ−.

For example, consider the following case,

    Z = [ 2  2  0  −1 ;  0  −1  −2  −3 ],   M = 2,  N = 4.                (2.4)

Since Γ+ = {1, 2}, Γ0 = {3}, and Γ− = {4}, it is easy to check that ‖Z‖₀⁺ = 2, T(Z; 3) = T(Z; 2) =
{{3}} and T(Z; 1) = {{1, 3}, {2, 3}}.

2.1 Projection

For a nonempty and closed set Ω, projection ΠΩ (Z) of Z onto Ω is defined by

ΠΩ (Z) = argminW∈Ω kZ − Wk. (2.5)

It is well known that the solution set of the right hand side problem is a singleton when Ω is convex
and might have multiple elements otherwise. The following property shows that the projection
onto S has a closed form.

Proposition 2.1. Define T(Z; s) as (2.1). Then


    Π_S(Z) = { [ Z⁻_{:T} , Z_{:T̄} ] : T ∈ T(Z; s) }.                     (2.6)

The proof is quite straightforward and thus is omitted here. We provide an example to illustrate
(2.6). Again consider case (2.4). For s = 3 or 2, ΠS (Z) = {Z} due to T(Z; 3) = T(Z; 2) = {{3}}.
For s = 1, T(Z; 1) = {{1, 3}, {2, 3}} and thus
    Π_S(Z) = { [ 0  2  0  −1 ;  0  −1  −2  −3 ],  [ 2  0  0  −1 ;  0  −1  −2  −3 ] }.
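The computation in (2.6) is easy to carry out; the sketch below (ours, for illustration) returns one element of Π_S(Z) by selecting one T ∈ T(Z; s) and replacing the columns indexed by T with their negative parts. When ties occur, as for s = 1 above, T(Z; s) has several elements and only one of the corresponding projections is produced.

import numpy as np

def project_onto_S(Z, s):
    # Return one element of Pi_S(Z) following (2.6).
    col_max = Z.max(axis=0)
    gamma_plus = np.where(col_max > 0)[0]                       # Gamma_+
    gamma_zero = np.where(col_max == 0)[0]                      # Gamma_0
    norms = np.linalg.norm(np.maximum(Z[:, gamma_plus], 0.0), axis=0)
    keep = gamma_plus[np.argsort(-norms)[:min(s, gamma_plus.size)]]   # one choice of Gamma_s
    T = np.union1d(np.setdiff1d(gamma_plus, keep), gamma_zero)  # T = (Gamma_+ \ Gamma_s) U Gamma_0
    P = Z.copy()
    P[:, T.astype(int)] = np.minimum(Z[:, T.astype(int)], 0.0)  # negative part on columns in T
    return P

Z = np.array([[2.0, 2.0, 0.0, -1.0],
              [0.0, -1.0, -2.0, -3.0]])
print(project_onto_S(Z, 1))   # one of the two projections listed above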

With the help of the closed form of projection ΠS (Z), we establish the following fixed point inclusion.

Proposition 2.2. Given W ∈ RM ×N and τ > 0, a point Z ∈ RM ×N satisfies

Z ∈ ΠS (Z + τ W) (2.7)

if and only if it satisfies one of the following two conditions

• ‖Z‖₀⁺ < s and W = 0;

• ‖Z‖₀⁺ = s and

    W_{:Γ̄0} = 0,   0 ≥ Z_{:Γ0} ⊥ W_{:Γ0} ≥ 0,   τ‖W_{:n}‖ ≤ Z↓s, ∀ n ∈ Γ0.          (2.8)

Moreover, if a point Z satisfies (2.7), then T(Z + τ W; s) = {Γ0 }.

Proof. It is easy to verify the following fact: for any n ∈ N ,

    0 ≥ Z_{:n} ⊥ W_{:n} ≥ 0  ⟺  (Z_{:n} + τW_{:n})⁻ = Z_{:n}  ⟹  ‖(Z_{:n} + τW_{:n})⁺‖ = τ‖W_{:n}‖.     (2.9)

We first show ‘if’ part. Relation (2.7) is clearly true for the case of kZk+
0 < s by W = 0. For
kZk+
0 = s, we have

 
Z:Γ0 + τ W:Γ0 , Z:Γ+ + τ W:Γ+ , Z:Γ− + τ W:Γ−
(2.10)
(2.8)
 
= Z + τ W = Z:Γ0 + τ W:Γ0 , Z:Γ+ , Z:Γ− ,

where Γ0 , Γ+ , and Γ− are defined for Z as (1.6). Therefore,




 > 0, n ∈ Γ+ ,

max max
(Z:n + τ W:n ) = Z:n (2.11)

 < 0, n ∈ Γ− .

Moreover, we have

(2.8,2.9) (2.8)
∀ n ∈ Γ0 , k(Z:n + τ W:n )+ k = τ kW:n k ≤ Z↓s . (2.12)

↓ ↓
Since kZk+
0 = s, there is |Γ+ | = s. This together with (2.10) and (2.12) suffices to Zs = (Z + τ W)s ,
which by (2.11) and (2.12) implies Γ0 ∈ T(Z + τ W; s). Then it follows from Proposition 2.1 that

h i
ΠS (Z + τ W) 3 (Z:Γ0 + τ W:Γ0 )− , Z:Γ0 + τ W:Γ0

(2.8,2.9)
h i
= Z:Γ0 , Z:Γ0 = Z.

Next, we prove ‘only if’ part. We note that (2.7) means Z ∈ S, namely, kZk+
0 ≤ s. Now we claim
the conclusion by two cases.
Case 1: kZk+ +
0 < s. Condition (2.7) implies kZ + τ Wk0 < s, leading to

Z ∈ ΠS (Z + τ W) = {Z + τ W},

deriving W = 0 and hence T(Z + τ W; s) = T(Z; s) = {Γ0 } due to kZk+


0 < s.
Case 2: kZk+ 0 0 0 0
0 = s. Let Γ+ , Γs , Γ− , and Γ0 be defined as (1.6) but for Z + τ W. For any T ∈
T(Z + τ W; s), we have
(2.3)
T = Γ0s ∪ Γ0− . (2.13)

It follows from Proposition 2.1 that there is a set T ∈ T(Z + τ W; s) such that

(2.13)  
[Z:T , Z:Γ0s , Z:Γ0− ] = Z:T , Z:T = Z

(2.6)
(Z:T + τ W:T )− , Z:T + τ W:T (2.14)
 
=

(2.13)
= [(Z:T + τ W:T )− , Z:Γ0s + τ W:Γ0s , Z:Γ0− + τ W:Γ0− ],

which by (2.9) suffices to

W:T = 0, 0 ≥ Z:T ⊥ W:T ≥ 0. (2.15)

Conditions (2.14) and the definitions of Γ0s and Γ0− enable us to obtain

 ≤ 0, n ∈ T,


max
Z:n > 0, n ∈ Γ0s , (2.16)

n ∈ Γ0− .

 < 0,

Recalling Γ0 = {n ∈ N : Zmax
:n = 0}, it follows

Γ0 ⊆ T ∈ T(Z + τ W; s). (2.17)

We next prove : Γ = T \ Γ0 = ∅. In fact, (2.16) means Γ ⊆ {n ∈ N : Zmax


:n < 0}, resulting in
Z:Γ < 0 and thus W:Γ = 0 from Z:Γ ⊥ W:Γ by (2.15). Therefore, (Z + τ W)max
:n = Zmax
:n < 0 for
any n ∈ Γ. However, Γ ⊆ T ∈ T(Z + τ W; s) manifests (Z + τ W)max
:n ≥ 0 for any n ∈ Γ from (2.1).
This contradiction shows that Γ = ∅, which by (2.17) delivers Γ0 = T . Since T is arbitrarily chosen
from T(Z + τ W; s), we conclude that T(Z + τ W; s) = {Γ0 }. Overall, we obtain

= 0, n ∈ Γ0 , W = [W:Γ0 , W:Γ0 ]






Zmax > 0, n ∈ Γ0s , and (2.18)
 
:n
 = W:Γ0 , W:T



 < 0, n ∈ Γ0 ,
 (2.15)
− = [W:Γ0 , 0] .

These conditions also indicate


(2.18)
Z↓s = minn∈Γ0s kZ+
:n k
(2.14)
= minn∈Γ0s k(Z:n + τ W:n )+ k
(2.19)
≥ k(Z:j + τ W:j )+ k
(2.15,2.9)
= τ kW:j k, ∀ j ∈ Γ0 ,

where ‘≥’ is due to the definition of {Γ0 } = T(Z + τ W; s) in (2.3). Finally, (2.18), (2.15), and
(2.19) enable us to conclude (2.8).

2.2 Tangent and Normal cones

Recalling that for any nonempty closed set Ω ⊆ RK, its Bouligand tangent cone T_Ω(x) and corre-
sponding Fréchet normal cone N̂_Ω(x) at point x ∈ Ω are defined as [34]:

    T_Ω(x) := { d ∈ RK : ∃ η_ℓ ≥ 0, x^ℓ →_Ω x such that η_ℓ(x^ℓ − x) → d },           (2.20)
    N̂_Ω(x) := { u ∈ RK : ⟨u, d⟩ ≤ 0, ∀ d ∈ T_Ω(x) },                                 (2.21)

where x^ℓ → x represents lim_{ℓ→∞} x^ℓ = x and x^ℓ →_Ω x stands for x^ℓ ∈ Ω for every ℓ and x^ℓ → x. Let
∪_{n∈N} Ω_n be the union of finitely many nonempty and closed subsets Ω_n. Then by [3, Proposition
3.1], for any x ∈ ∪_{n∈N} Ω_n with N(x) := {n ∈ N : x ∈ Ω_n}, we have

    T_{∪_{n∈N} Ω_n}(x) = ∪_{n∈N(x)} T_{Ω_n}(x).                                       (2.22)

Note that set S^G can be rewritten as

    S^G = ∪_{Γ∈P(N, s)} { x ∈ RK : (G(x))^max_{:n} ≤ 0, n ∈ Γ̄ },

where P(T, t) is defined as

    P(T, t) := {Γ ⊆ T : |Γ| ≤ t}.                                                     (2.23)

In the subsequent analysis, we let


Z = G(x),

and corresponding index sets Γ+ , Γ0 , Γ− , and VΓ in (1.6) are defined for G(x). Therefore, we have

    Γ+ = {n ∈ N : (G(x))^max_{:n} > 0},
    Γ− = {n ∈ N : (G(x))^max_{:n} < 0},                                               (2.24)
    Γ0 = {n ∈ N : (G(x))^max_{:n} = 0},
    VΓ = {(m, n) ∈ M × N : G_{mn}(x) = 0, ∀ n ∈ Γ}.

Based on set P(Γ0, s − |Γ+|) and the above notation, we express the Bouligand tangent cone and
corresponding Fréchet normal cone of S^G explicitly in the following proposition.

Proposition 2.3. Suppose ∇_{VΓ0} G(x) is full column rank. Then the Bouligand tangent cone T_{S^G}(x)
and Fréchet normal cone N̂_{S^G}(x) at x ∈ S^G are given by

    T_{S^G}(x) = ∪_{Γ∈P(Γ0, s−|Γ+|)} { d ∈ RK : d⊤ ∇_{VΓ0\Γ} G(x) ≤ 0 },              (2.25)

    N̂_{S^G}(x) = { ∇_{VΓ0} G(x) ◦ W_{VΓ0} : W_{VΓ0} ≥ 0 }  if ‖G(x)‖₀⁺ = s,   and   N̂_{S^G}(x) = {0}  if ‖G(x)‖₀⁺ < s.    (2.26)


 {0}, if 0 < s.

Proof. For notational simplicity, denote P := P(Γ0 , s − |Γ+ |). For any fixed x ∈ S G , it follow from
(2.24) that
  
 > 0, n ∈ Γ+ ,

 
 

 
x ∈ SΓG K max
:= z ∈ R : (G(z)):n < 0, n ∈ Γ− , , ∀ Γ ∈ P,
  

  ≤ 0, n ∈ Γ \ Γ
 

0

and hence

SΓG .
T
x∈ Γ∈P (2.27)

SΓG ⊆ S G , which by (2.27) and (2.22) yields
S 
It is easy to see that Γ∈P

TS G (x) = T(S SΓG ) (x). (2.28)


Γ∈P

We note that the active index set of SΓG at x is

VΓ0 \Γ = {(m, n) : Gmn (x) = 0, ∀ n ∈ Γ0 \ Γ} ⊆ VΓ0 .

Since ∇VΓ0 G(x) is full column rank, so is ∇VΓ0 \Γ G(x). Then for any Γ ∈ P, the Bouligand tangent
cone of SΓG at x is
n o
TS G (x) = d ∈ RK : d> ∇VΓ0 \Γ G(x) ≤ 0 . (2.29)
Γ

This together with (2.28) and (2.22) shows (2.25).


Next, we calculate Fréchet normal cone NbS G (x) of x ∈ S G by two cases.
Case 1: kG(x)k+
0 = s. By (1.7), we have |Γ+ | = s and thus P = P(Γ0 , s − |Γ+ |) = {∅}, resulting in

(2.25)
n o
TS G (x) = d ∈ RK : d> ∇VΓ0 G(x) ≤ 0 .

Then direct verification by definition (2.21) allows us to derive (2.26).


Case 2: kG(x)k+ 0 < s. If |Γ0 | ≤ s − |Γ+ |, then Γ0 ∈ P. From (2.25), we have TS G (x) = R
K and

bS G (x) = {0}. If |Γ0 | > s − |Γ+ |, then |Γ0 | ≥ 2 due to |Γ+ | = kG(x)k+ < s. Let
then N 0

Γ0 := {t1 , t2 , · · · , t|Γ0 | }, Γ` := {t` }, ` ∈ [|Γ0 |].


S
In addition, it follows from (2.29) and (2.25) that TS G (x) = Γ∈P TS G (x), thereby TS G (x) ⊆
Γ Γ`
TS G (x) due to Γ` ∈ P. We note from (2.29) that for any d ∈ TS G (x), any vector u satisfying
Γ
hu, di ≤ 0 takes the form of

u = ∇VΓ0 \Γ G(x) ◦ WVΓ0 \Γ , WVΓ0 \Γ ≥ 0. (2.30)

To show (2.26), it suffices to show u = 0. By considering d` ∈ TS G (x), ` ∈ [|Γ0 |] and any u ∈


Γ`
bS G (x), we have d` ∈ TS G (x) and hu, d` i ≤ 0 for any ` ∈ [|Γ0 |]. This and (2.30) suffice to
N

u = ∇VΓ0 \Γ1 G(x) ◦ WV1 Γ = ∇Γ0 \Γ` G(x) ◦ WV` Γ , (2.31)


0 \Γ1 0 \Γ`

where WV1 Γ ≥ 0 and WV` Γ ≥ 0, ` = 2, 3, · · · , |Γ0 |, which results in


0 \Γ1 0 \Γ`

∇VΓ` G(x) ◦ WV1 Γ − ∇VΓ1 G(x) ◦ WV` Γ


` 1

+ ∇VΓ0 \Γ1 \Γ G(x) ◦ (W1 − W` )VΓ0 \Γ1 \Γ = 0, ` = 2, 3, · · · , |Γ0 |.


` `

The above condition and the full column rankness of ∇VΓ0 G(x) yield that,

WV1 Γ = 0, WV1 Γ = WV` Γ , WV` Γ = 0,


` 0 \Γ1 \Γ` 0 \Γ1 \Γ` 1

for each `. Now taking ` = 2, 3, · · · , |Γ0 | enables us to show that WV1 Γ = 0, which by (2.31)
0 \Γ1
proves u = 0, as desired.

Remark 2.1. In particular, the Bouligand tangent cone T_S(Z) and corresponding Fréchet normal cone
N̂_S(Z) at Z ∈ S are

    T_S(Z) = ∪_{Γ∈P(Γ0, s−|Γ+|)} { W ∈ RM×N : W_{VΓ0\Γ} ≤ 0 },                         (2.32)

    N̂_S(Z) = { W ∈ RM×N : W_{VΓ0} ≥ 0, W_{V̄Γ0} = 0 }  if ‖Z‖₀⁺ = s,   and   N̂_S(Z) = {0}  if ‖Z‖₀⁺ < s,    (2.33)

where V̄Γ0 = (M × N) \ VΓ0. It should be noted that if ∇_{VΓ0} G(x) is full column rank, together with
(2.26) and (2.33), the Fréchet normal cone of S^G at x ∈ S^G can be written as

    N̂_{S^G}(x) = { ∇G(x) ◦ W : W ∈ N̂_S(Z), Z = G(x) }.                                 (2.34)

We end this section by giving one example to illustrate the tangent and normal cones of
S = {Z ∈ R2×4 : ‖Z‖₀⁺ ≤ 2} at

    Z = [ 2  2  0  −1 ;  0  −1  −2  −3 ],   Z0 = [ 2  0  0  −1 ;  0  −1  −2  −3 ],   M = 2, N = 4.    (2.35)

Direct calculations derive that

    T_S(Z) = {W ∈ R2×4 : W13 ≤ 0},
    N̂_S(Z) = { W ∈ R2×4 : W13 ≥ 0;  Wmn = 0, ∀ (m, n) ≠ (1, 3) },
    T_S(Z0) = {W ∈ R2×4 : W12 ≤ 0 or W13 ≤ 0},
    N̂_S(Z0) = {0}.
N

3 Optimality Analysis
In this section, we aim at establishing the first order necessary or sufficient optimality conditions
of (SCO). Hereafter, we always assume its feasible set is non-empty if no additional information is
provided and let

Z∗ = G(x∗ ). (3.1)

Similar to (2.24), we always denote Γ∗+ , Γ∗0 , Γ∗− , VΓ∗ for Z∗ = G(x∗ ) and

    J∗ := VΓ∗0 = {(m, n) : Z∗_{mn} = 0, ∀ n ∈ Γ∗0}.                                    (3.2)

3.1 KKT points

We call x∗ ∈ RK a KKT point of problem (SCO) if there is a W∗ ∈ RM×N such that

    ∇f(x∗) + ∇G(x∗) ◦ W∗ = 0,
    N̂_S(Z∗) ∋ W∗,                                                                      (3.3)
    ‖Z∗‖₀⁺ ≤ s.
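For a concrete reading of (3.3), the normal-cone condition can be checked elementwise via (2.33); the sketch below is ours, with grad_G(x) assumed to return an array of shape (M, N, K) whose (m, n, :) slice is ∇G_{mn}(x), and it verifies the three conditions numerically for a candidate pair (x∗, W∗).

import numpy as np

def is_kkt_point(x, W, grad_f, G, grad_G, s, tol=1e-8):
    # Check (3.3) using the explicit normal cone (2.33).
    Z = G(x)
    stationarity = grad_f(x) + np.einsum('mn,mnk->k', W, grad_G(x))
    n_pos = int((Z.max(axis=0) > 0).sum())                     # ||Z||_0^+
    if np.linalg.norm(stationarity) > tol or n_pos > s:
        return False
    if n_pos < s:                                              # normal cone is {0}: need W = 0
        return bool(np.all(np.abs(W) < tol))
    gamma0 = np.abs(Z.max(axis=0)) < tol                       # columns with zero maximum
    V = (np.abs(Z) < tol) & gamma0[None, :]                    # index set V_{Gamma_0}
    return bool(np.all(W[V] >= -tol) and np.all(np.abs(W[~V]) < tol))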

We first build the relation between a KKT point and a local minimizer.

Theorem 3.1 (KKT points and local minimizers). The following relationships hold for (SCO).
a) A local minimizer x∗ is a KKT point if ∇J∗ G(x∗ ) is full column rank. Furthermore, if
‖Z∗‖₀⁺ < s, then W∗ = 0 and ∇f(x∗) = 0.

b) A KKT point x∗ is a local minimizer if functions f and Gmn for all m ∈ M, n ∈ N are
locally convex around x∗ .
Proof. a) From [34, Theorem 6.12], a minimizer of problem (1.3) satisfies −∇f (x∗ ) ∈ N
bS G (x∗ ).
bS G (x∗ ) in (2.34), the results hold obviously.
Together with the expression of N
b) Let (x∗ , W∗ ) be a KKT point satisfying (3.3). We prove the conclusion by two cases.
Case kZ∗ k+ ∗ ∗
0 < s. Condition (3.3) and NS (Z ) = {0} from (2.33) suffice to W = 0 and
b
∇f (x∗ ) = 0, which by the local convexity of f around x∗ yields

f (x) ≥ f (x∗ ) + h∇f (x∗ ), x − x∗ i = f (x∗ ).

This shows the local optimality of x∗ .


Case kZ∗ k+ ∗ ∗
0 = s. Consider a local region N (x , ) of x for a given sufficiently small radius  > 0
and x ∈ S G ∩ N (x∗ , ). To derive the results, we claim the following three facts.
• The local convexity of Gmn around x∗ leads to

∗ =G
Zmn − Zmn ∗ ∗ ∗ (3.4)
mn (x) − Gmn (x ) ≥ h∇Gmn (x ), x − x i,

for any (m, n) ∈ M × N .

• It follows from W∗ ∈ N
bS (Z∗ ) in (3.3), (3.2), and (2.33) that

∗ ≥ 0,
WJ ∗ = 0,
WJ Z∗J∗ = 0. (3.5)
∗ ∗

• Similar to (3.1), let Z = G(x). For any x ∈ S G ∩ N (x∗ , ), kZk+


0 ≤ s. Moreover, for a

sufficiently small , we note from the continuity of G(·) that Zmn > 0 if Zmn > 0, which
indicates kZk+ ∗ + + G ∗
0 ≥ kZ k0 = s. Overall, kZk0 = s for any x ∈ S ∩ N (x , ). This means that

Γ+ = {n ∈ N : Zmax ∗ max ∗
:n > 0} = {n ∈ N : (Z ):n > 0} = Γ+ .

Since Γ∗0 ∩ Γ∗+ = ∅ and kZk+


0 = |Γ+ | = s, the above condition indicates Z:Γ0 ≤ 0, which

combining J∗ = VΓ∗0 ⊆ M × Γ∗0 suffices to

ZJ∗ ≤ 0. (3.6)

Finally, these three facts and the convexity of f can conclude that

f (x) ≥ f (x∗ ) + h∇f (x∗ ), x − x∗ i


(3.3)
= f (x∗ ) − h∇G(x∗ ) ◦ W∗ , x − x∗ i
f (x∗ ) − (m,n)∈M×N Wmn ∗ h∇G ∗ ∗
P
= mn (x ), x − x i
(3.4)
f (x∗ ) − (m,n)∈M×N Wmn ∗ (Z ∗
P
≥ mn − Zmn )
(3.5)
f (x∗ ) − (m,n)∈J∗ Wmn∗ Z
P
= mn
(3.4,3.6)
≥ f (x∗ ),
which demonstrates the local optimality of x∗ to problem (SCO).

3.2 τ -stationary points

Our next result is about the τ -stationary point of (SCO) defined as follows: A point x∗ ∈ RK is
called a τ-stationary point of (SCO) for some τ > 0 if there is a W∗ ∈ RM×N such that

    ∇f(x∗) + ∇G(x∗) ◦ W∗ = 0,
    Π_S(Z∗ + τW∗) ∋ Z∗.                                                                 (3.7)

The following result shows that a τ-stationary point also has a close relationship with the local
minimizer of problem (SCO).

Theorem 3.2 (τ -stationary points and local minimizers). The following results hold for (SCO).

a) Suppose ∇J∗ G(x∗ ) is full column rank. A local minimizer x∗ is also a τ -stationary point

– either for any τ > 0 if ‖Z∗‖₀⁺ < s,

– or for any 0 < τ ≤ τ∗ := (Z∗)↓s / r∗ if ‖Z∗‖₀⁺ = s, where

    r∗ := max_{n∈Γ∗0} { ‖W∗_{:n}‖ : ∇f(x∗) + ∇_{J∗} G(x∗) ◦ W∗_{J∗} = 0 }.              (3.8)

b) A τ -stationary point with τ > 0 is a local minimizer if functions f and Gmn for all m ∈ M, n ∈
N are locally convex around x∗ .

Proof. a) It follows from Theorem 3.1 that a local minimizer x∗ is also a KKT point. Therefore,
we have condition (3.3). So to prove the τ -stationarity in (3.7), we only need to show

Z∗ ∈ ΠS (Z∗ + τ W∗ ) . (3.9)

If kZ∗ k+ ∗ ∗
0 < s, then (3.3) and NS (Z ) = {0} from (2.33) yield W = 0, resulting in (3.9) for any
b
τ > 0. Now consider the case of kZ∗ k+
0 = s. Under such a case, conditions (3.5) hold, which by

J∗ = VΓ∗0 ⊆ M × Γ∗0

allows us to derive that


W:Γ∗ = 0,

0 ≥ Z∗:Γ∗0 ⊥ W:Γ∗ ≥ 0.
0
(3.10)
0

Moreover, the linearly independence of ∇J∗ G(x∗ ) and

(3.5)
0 = ∇f (x∗ ) + ∇G(x∗ ) ◦ W∗ = ∇f (x∗ ) + ∇J∗ G(x∗ ) ◦ WJ
∗ ,

∗ is uniquely determined by ∇f (x∗ ) and ∇ G(x∗ ). It is clear that τ > 0 due to


means that WJ ∗ J∗ ∗
(Z∗ )↓s > 0. By 0 < τ ≤ τ∗ and

(3.8)
∗ ↓
∀ n ∈ Γ∗0 , τ kW:n
∗ k ≤ τ max
∗ i∈Γ∗0 kW:n k = τ∗ r∗ = (Z )s .

The above condition and (3.10) show that (3.9) by Proposition 2.2.
b) We only prove that a τ -stationary point is a KKT point because Theorem 3.1 b) enables us
to conclude the conclusion immediately. We note that a τ -stationary point satisfies (3.9), leading

to kZ∗ k+ ∗ ∗ ∗ +
0 ≤ s. Comparing (3.7) and (3.3), we only need to prove W ∈ NS (Z ). If kZ k0 < s,
b
bS (Z∗ ) by (2.33). If kZ∗ k+ = s, then Proposition 2.2 shows
then Proposition 2.2 shows W∗ = 0 ∈ N 0
(3.10), which by the definition of VΓ∗0 in (3.2) indicates

WV∗ Γ∗ ≥ 0, WV∗ = 0,
0 Γ∗
0

contributing to W∗ ∈ N
bS (Z∗ ) by (2.33).

4 Newton Method
In this section, we cast a Newton-type algorithm that aims to find a τ -stationary point to problem
(SCO). Hereafter, for τ > 0, we always denote

    w := (x; vec(W)) ∈ R^{K+MN},
    Z := G(x),
    Λ := Z + τW,                                                                        (4.1)
    U_T := {(m, n) : Λ_{mn} ≥ 0, n ∈ T},   where T ∈ T(Λ; s),

where (x; y) := (x⊤, y⊤)⊤ and vec(W) is the vector formed by stacking the columns of W.
Similar definitions to (4.1) are also applied into w∗ := (x∗ ; vec(W∗ )) and w` := (x` ; vec(W` )),
where the former is a τ -stationary point and the latter is the point generated by our proposed
algorithm at the ℓth step. Moreover, we denote a system of equations as

    F(w; J) := [ ∇f(x) + ∇_J G(x) ◦ W_J ;  vec(Z_J) ;  vec(W_{J̄}) ],                   (4.2)

and an associated matrix as

    ∇F_µ(w; J) := [ ∇²f(x) + ∇²_J G(x) ◦ W_J ,  ∇_J G(x),  0 ;
                    ∇_J G(x)⊤,  −µ I_{|J|},  0 ;
                    0,  0,  I_{|J̄|} ],                                                  (4.3)

where I_n is the identity matrix of order n. Particularly, let

    ∇F(w; J) := ∇F_0(w; J),                                                             (4.4)

namely the case µ = 0. Then for given J, ∇F(w; J) is the Jacobian matrix of F(w; J).
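To make (4.2) and (4.3) concrete, a possible NumPy assembly is sketched below; this is our illustration, and the callables grad_f, hess_f, G, grad_G, hess_G as well as the index conventions are assumptions. Index sets J and Jbar are passed as integer arrays of (m, n) pairs, grad_G(x) has shape (M, N, K) with slices ∇G_{mn}(x), and hess_G(x) has shape (M, N, K, K).

import numpy as np

def assemble_F(x, W, J, Jbar, grad_f, G, grad_G):
    # F(w; J) of (4.2): [grad f + sum_{(m,n) in J} W_mn grad G_mn ; vec(Z_J) ; vec(W_Jbar)].
    Z = G(x)
    gJ = grad_G(x)[J[:, 0], J[:, 1], :]                        # |J| x K, rows grad G_mn
    top = grad_f(x) + gJ.T @ W[J[:, 0], J[:, 1]]
    return np.concatenate([top, Z[J[:, 0], J[:, 1]], W[Jbar[:, 0], Jbar[:, 1]]])

def assemble_jacobian(x, W, J, Jbar, hess_f, grad_G, hess_G, mu):
    # Smoothed matrix grad F_mu(w; J) of (4.3).
    K = x.size
    gJ = grad_G(x)[J[:, 0], J[:, 1], :]
    hJ = hess_G(x)[J[:, 0], J[:, 1], :, :]
    H = hess_f(x) + np.einsum('j,jkl->kl', W[J[:, 0], J[:, 1]], hJ)
    nJ, nJbar = len(J), len(Jbar)
    A = np.zeros((K + nJ + nJbar, K + nJ + nJbar))
    A[:K, :K] = H
    A[:K, K:K + nJ] = gJ.T
    A[K:K + nJ, :K] = gJ
    A[K:K + nJ, K:K + nJ] = -mu * np.eye(nJ)
    A[K + nJ:, K + nJ:] = np.eye(nJbar)
    return A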

4.1 τ -stationary equations

To employ the Newton method, we need to convert a τ -stationary point satisfying (3.7) to a system
of equations, stated as the following theorem.

Theorem 4.1 (τ -stationary equations). A point x∗ is a τ -stationary point with τ > 0 of (SCO) if
and only if there is a W∗ ∈ RM ×N such that

    T(Λ∗; s) ∋ Γ∗0,   U_{Γ∗0} = J∗,   F(w∗; J∗) = 0,                                   (4.5)

where J∗ is defined as (3.2).

Proof. First of all, we denote

J− := (Γ∗0 × M) \ J∗ = {(m, n) : Zmn


∗ 6= 0, ∀ n ∈ Γ∗ }.
0

The definition of Γ∗0 means Z∗:Γ∗ ≤ 0, which by J∗ ⊆ M × Γ∗0 derives


0

Z∗J∗ = 0, Z∗J− < 0. (4.6)

Necessity. Since x∗ is a τ -stationary point of (SCO), there is a W∗ satisfying Z∗ ∈ ΠS (Λ∗ ) from


(3.7), which together with Proposition 2.2 suffices to T(Λ∗ ; s) = {Γ∗0 }. Therefore, Γ∗0 ∈ T(Λ∗ ; s). If
kZ∗ k+ ∗
0 < s, then W = 0 by Proposition 2.2, which immediately shows J∗ = UΓ0 and (4.5). Next,

we prove the conclusion for case kZ∗ k+


0 = s. It follows from (2.8) that


W:Γ∗ = 0,

0 ≥ Z∗:Γ∗0 ⊥ W:Γ ∗ ≥ 0,
0
(4.7)
0


To show J∗ = UΓ∗0 , we only need to show Zmn = 0 ⇔ (Z ∗ + τ W ∗ )mn ≥ 0 for any n ∈ Γ∗0 . This
is clearly true due to the second condition in (4.7). The conditions in (4.7) and (4.6) indicate
∗ = 0, thereby resulting in
WJ−
∗ ∗
WJ = WJ ∗ = 0.
∗ − ∪(M×Γ0 )

This leads to the following condition and hence displays (4.5),

(3.7)
0 = ∇f (x∗ ) + ∇G(x∗ ) ◦ W∗ = ∇f (x∗ ) + ∇J∗ G(x∗ ) ◦ WJ
∗ .

Sufficiency. We aim to prove (3.7). Condition ∇f (x∗ ) + ∇G(x∗ ) ◦ W∗ = 0 follows from the
first and third equations in (4.5) immediately. We next show Z∗ ∈ ΠS (Λ∗ ) in (3.7). Condition
J∗ = UΓ∗0 implies that

Z∗:Γ∗0 = 0, ∗
W:Γ ∗ ≥ 0,
0
(4.8)

which together with Γ∗0 ∈ T(Λ∗ ; s) and (2.6) derives


h i
(Λ∗ )−
:Γ ∗ (Λ∗
):Γ
∗ ∈ ΠS (Λ∗ ). (4.9)
0 0


We finally show the left-hand side of (4.9) is Z∗ . Since J∗ ⊆ M × Γ∗0 , we have M × Γ0 ⊆ J ∗ ,
∗ = 0 from (4.5). Hence,
thereby leading to W:Γ ∗
0

(Λ∗ ):Γ∗ = (Z∗ + τ W∗ ):Γ∗ = Z∗:Γ∗ . (4.10)


0 0 0

Condition (4.5) means that WJ∗− = 0 due to J− ⊆ J ∗ . As a result,

(4.6) (4.8)
Λ∗J∗ = (Z∗ + τ W∗ )J∗ = τ WJ


≥ 0,
(4.6)
Λ∗J− = (Z∗ + τ W∗ )J− = Z∗J− < 0.

Using the above conditions and (4.6) enables us to show (Λ∗J∗ )− = Z∗J∗ and (Λ∗J− )− = Z∗J− , which
combining J− ∪ J∗ = (M × Γ∗0 ) and conditions (4.9) and (4.10) proves Z∗ ∈ ΠS (Λ∗ ).

4.2 Algorithmic design

We note that equations F(w∗ ; J∗ ) = 0 in (4.5) involve an unknown set J∗ . Therefore, to proceed
with the Newton method, we have to find J∗ , which will be adaptively updated by using the
approximation of w∗. More precisely, let w^ℓ be the current point; we first select a T_ℓ ∈ T(Λ^ℓ; s),
based on which we find the Newton direction d^ℓ ∈ R^{K+MN} by solving the following linear equations:

∇Fµ` (w` ; UT` ) d` = −F(w` ; UT` ), (4.11)

where ∇Fµ (w; J ) is defined as (4.3) and µ` is updated by

µ` = min{νµ`−1 , ρkF(w` ; UT` )k}, (4.12)

with ν ∈ (0, 1) and ρ > 0. The framework of our proposed method is presented in Algorithm 1.

Algorithm 1 SNSCO: Smoothing Newton method for (SCO)


1: Initialize w^0 = (x^0; vec(W^0)), positive parameters maxIt, tol, ρ, τ, µ, γ, and ν, π ∈ (0, 1).
   Select T_0 ∈ T(Λ^0; s) and compute µ_0 = min{µ, ρ‖F(w^0; U_{T_0})‖}. Set ℓ = 0.
2: while ℓ ≤ maxIt and ‖F(w^ℓ; U_{T_ℓ})‖ ≥ tol do
3:   if (4.11) is solvable then
4:     Update d^ℓ by solving (4.11).
5:   else
6:     Update d^ℓ = −F(w^ℓ; U_{T_ℓ}).
7:   end if
8:   Find the minimal integer t_ℓ ∈ {0, 1, 2, · · · } such that

         ‖G(x^ℓ + π^{t_ℓ} d^ℓ_x)‖₀⁺ ≤ (γ + 1)s,                                        (4.13)

     where d^ℓ_x is the subvector formed by the first K entries of d^ℓ.
9:   Update α_ℓ = π^{t_ℓ} and w^{ℓ+1} = w^ℓ + α_ℓ d^ℓ.
10:  Update U_{T_{ℓ+1}}, where T_{ℓ+1} ∈ T(Λ^{ℓ+1}; s).
11:  Update µ_{ℓ+1} by (4.12) and set ℓ = ℓ + 1.
12: end while
13: return w^ℓ.

Remark 4.1. Regarding Algorithm 1, we have some observations.

i) One of the halting conditions makes use of kF(w` ; UT` )k. The reason behind this is that if point
w` satisfies kF(w` ; UT` )k = 0, then it is a τ -stationary point of (SCO) by Theorem 4.1.

ii) Recalling (4.5), we are expected to update d` by solving

∇F(w` ; UT` ) d` = −F(w` ; UT` ), (4.14)

instead of (4.11). However, the major concern is the existence of d^ℓ solving (4.14).
To overcome this drawback, we add a smoothing term −µ_ℓ I_{|J|} to increase the likelihood of
non-singularity of ∇F_{µ_ℓ}(w^ℓ; U_{T_ℓ}). Such an idea has been extensively used in the literature, e.g.,
[7, 43].

iii) When the algorithm derives a direction d^ℓ, we use condition (4.13) to decide the step size. This
condition allows the next point to be chosen in a somewhat larger region by setting γ > 0.
At the same time, it ensures that the next point does not step far away from the feasible region when
a small value of γ is used (e.g., γ = 0.25). In this way, the algorithm performs relatively steadily.
In addition, we will show in Theorem 4.2 that if the starting point is chosen close to a stationary point,
condition (4.13) is always satisfied with π^{t_ℓ} = 1; a sketch of the resulting iteration is given below.
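The iteration sketched below is one possible rendering of Algorithm 1 with the default parameters used later in Section 5; the sketch is ours, and the helper callables F_fun, jacobian_fun, select_T, build_U, and zero_one_norm_G are hypothetical wrappers around (4.2), (4.3), (2.1), (4.1), and (1.2), respectively.

import numpy as np

def snsco(w0, K, s, F_fun, jacobian_fun, select_T, build_U, zero_one_norm_G,
          tau=0.75, rho=1e-2, mu=1e-2, gamma=0.25, nu=0.999, pi=0.25,
          max_it=2000, tol=1e-9):
    # A sketch of Algorithm 1 (SNSCO); the first K entries of w are x.
    w = np.asarray(w0, dtype=float).copy()
    T = select_T(w, tau, s)                                    # T_0 in T(Lambda^0; s)
    U = build_U(w, tau, T)                                     # U_{T_0} from (4.1)
    mu_l = min(mu, rho * np.linalg.norm(F_fun(w, U)))
    for _ in range(max_it):
        Fw = F_fun(w, U)
        if np.linalg.norm(Fw) < tol:
            break
        try:
            d = np.linalg.solve(jacobian_fun(w, U, mu_l), -Fw)  # Newton direction (4.11)
        except np.linalg.LinAlgError:
            d = -Fw                                             # fallback direction
        alpha, t = 1.0, 0                                       # backtracking for (4.13)
        while zero_one_norm_G(w[:K] + alpha * d[:K]) > (gamma + 1) * s and t < 50:
            alpha, t = alpha * pi, t + 1
        w = w + alpha * d
        T = select_T(w, tau, s)
        U = build_U(w, tau, T)
        mu_l = min(nu * mu_l, rho * np.linalg.norm(F_fun(w, U)))  # update (4.12)
    return w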

4.3 Locally quadratic convergence

To establish the locally quadratic convergence, we need the following assumptions for a given τ -
stationary point x∗ of (SCO).

Assumption 4.1. Let x∗ be any τ -stationary point of (SCO). Suppose that ∇2 f (·) and ∇2 G(·)
are locally Lipschitz continuous around x∗ , respectively.

Assumption 4.2. Let x∗ be any τ -stationary point of (SCO). Assume functions f and G are twice
continuously differentiable on RK , ∇J∗ G(x∗ ) is full column rank, where J∗ is given by (3.2), and
∇²f(x∗) + ∇²_{J∗} G(x∗) ◦ W∗_{J∗} is positive definite, where W∗_{J∗} is uniquely decided by
∇_{J∗} G(x∗) ◦ W∗_{J∗} = −∇f(x∗).

We point out that the above two assumptions are related to the regularity conditions [33, 10]
usually used to achieve the convergence results for Newton type methods. Moreover, establishing
the quadratic convergence of the proposed smoothing Newton method, SNSCO, is not trivial because,
differing from a standard system of equations, the τ-stationary equations (4.5) involve an un-
known set UΓ∗0 . If we know this set in advance, then there is not much difficulty in building the
quadratic convergence. However, set UT` may change from one iteration to another. A different
set leads to a different system of equations F(w; UT` ) = 0. Hence, in each step, the algorithm finds
a Newton direction for a different system of equations instead of a fixed system. This is where
the standard proof for quadratic convergence fails to fit our case. Hence, we take a long way to
establish the locally quadratic convergence in the sequel.

Lemma 4.1. Let w∗ be a τ -stationary point with 0 < τ < τ∗ of (SCO). Then there is an η∗ > 0
such that for any w ∈ N(w∗ , η∗ ) and any T ∈ T(Λ; s),

F(w∗ ; UT ) = 0. (4.15)

Proof. a) First of all, let Γ∗+ , Γ∗− , and Γ∗0 be defined for Z∗ , while let Γ+ , Γ− , and Γ0 be defined for

Λ as

Γ+ = {n ∈ N : Λmax
:n > 0} ,

Γ− = {n ∈ N : Λmax
:n < 0} ,
(4.16)

Γ0 = {n ∈ N : Λmax
:n = 0} .

Similar to (2.1), let Γs ⊆ Γ+ extract s indices in Γ+ that correspond to the first s largest elements
in {kΛ+
:n k : n ∈ Γ+ }. Moreover, we define J∗ as (3.2) and J as

J := UT = {(m, n) : Λmn ≥ 0, n ∈ T }, T ∈ T(Λ, s). (4.17)

It follows from (4.5) in Theorem 4.1 that a τ -stationary point w∗ satisfies Γ∗0 ∈ T(Λ∗ ; s), UΓ∗0 = J∗ ,
and

∇f (x∗ ) + ∇J∗ G(x∗ ) ◦ WJ


∗ = 0, Z∗ = 0, W∗ = 0.
∗ J∗ J
(4.18)

Consider any w ∈ N(w∗ , η∗ ) with a sufficiently small radius η∗ > 0. For such w, we define Z and
Λ, which also defines J = UT as (4.17). To show (4.15), we need to prove

∇f (x∗ ) + ∇J G(x∗ ) ◦ WJ
∗ ∗ = 0.
= 0, Z∗J = 0, WJ (4.19)

Suppose that we have



J ⊆ J∗ , WJ = 0. (4.20)

Then condition (4.18) can immediately derive (4.19) due to (4.20) and

∗ = ∇f (x∗ ) + ∇G(x∗ ) ◦ W∗
∇f (x∗ ) + ∇J G(x∗ ) ◦ WJ

= ∇f (x∗ ) + ∇J∗ G(x∗ ) ◦ WJ



= 0.

Therefore, in the following part we aim to prove (4.20).


Suppose there is an index (m0 , n0 ) ∈ J but (m0 , n0 ) ∈
/ J∗ . We now decompose the entire index
set M × N as Table 1.

Table 1: Index set decomposition of M × N.

    • n ∈ Γ∗0 and (m, n) ∈ J∗ :  Z∗_{mn} = 0,  W∗_{mn} ≥ 0,  Λ∗_{mn} ≥ 0;
    • n ∈ Γ∗0 and (m, n) ∈ J− :  Z∗_{mn} < 0,  W∗_{mn} = 0,  Λ∗_{mn} < 0;
    • n ∈ Γ∗+ and m ∈ M :  (Z∗)^max_{:n} > 0,  W∗_{mn} = 0,  (Λ∗)^max_{:n} > 0;
    • n ∈ Γ∗− and m ∈ M :  (Z∗)^max_{:n} < 0,  W∗_{mn} = 0,  (Λ∗)^max_{:n} < 0.
In the table, we used the facts from Λ∗ = Z∗ + τ W ∗ , definition (3.2), and Proposition 2.2 that

∗ = 0,
W:Γ 0 ≥ Z∗:Γ∗ ⊥ W:Γ
∗ ≥0
∗ ∗
0 0 0

Since η∗ > 0 can be set sufficiently small, w can be close to w∗ , and so is Λ to Λ∗ , which shows

∗ max
∀ n ∈ Γ∗+ : (Z∗ )max max
:n = (Λ ):n > 0 =⇒ (Λ):n > 0,
(4.21)
∗ max
∀ n ∈ Γ∗− : (Z∗ )max max
:n = (Λ ):n < 0 =⇒ (Λ):n < 0.

The definition of T(Λ, s) in (2.1) implies T = Γ0 ∪ (Γ+ \ Γs ) for any given T ∈ T(Λ, s). Condition
(4.21) suffices to Γ∗+ ⊆ Γs ⊆ Γ+ and Γ∗− ⊆ Γ− . Therefore, we must have T ⊆ Γ∗0 . Now combining
this condition, Table 1, (4.17), (m0 , n0 ) ∈ J = UT , and (m0 , n0 ) ∈
/ J∗ , we can claim that (m0 , n0 ) ∈
J− , thereby resulting in Λ∗m0 n0 <0. However, since Λ is relatively close to Λ∗ , we have Λm0 n0 <0,
which contradicts to (m0 , n0 ) ∈ J in (4.17). So we prove J ⊆ J∗ , the first condition in (4.20)
∗ = 0. If kZ∗ k+ < s, then W∗ = 0 by Proposition 2.2. The conclusion
Finally, we prove WJ 0
is clearly true. We focus on the case of kZ∗ k+ ∗ ∗ +
0 = s. This indicates |Γ+ | = kZ k0 = s. Again, by
(4.21), we can derive

Γ∗+ = Γs , Γ∗− ⊆ Γ− , Γ∗0 ⊇ T = Γ0 ∪ (Γ+ \ Γs ). (4.22)

∗ = 0, we only need to check W ∗ = 0 due to Table 1 and J ⊆ J , where P := J \ J .


To show WJ P ∗ ∗
Note that W∗ ≥ 0, see Table 1. Suppose there is (m0 , n0 ) ∈ P such that Wm

0 n0
> 0, resulting in
Λ∗m0 n0 > 0 and thus Λm0 n0 > 0 because Λ is close to Λ∗ . Then we have n0 ∈
/ Γ− . In addition,
suppose n0 ∈ T , then (4.17) means (m0 , n0 ) ∈ J , a contradiction. So n0 ∈
/ T . Overall, n0 ∈ Γs =
Γ∗+ by (4.22). However, P ⊆ J∗ , we must have n0 ∈ Γ∗0 due to (m0 , n0 ) ∈ J∗ , which contradicts
with n0 ∈ Γ∗+ . Hence, P = ∅ and hence J = J∗ , there by WJ∗ = 0 from (4.18).

Lemma 4.2. If Hessian ∇2 ϕ(·) is locally Lipschitz continuous around w∗ , then so are gradient
∇ϕ(·) and function ϕ(·).

Proof. Let Lϕ be the Lipschitz continuous constant of ∇2 ϕ(·) around w∗ . For any w ∈ N (w∗ , δ∗ ),
by letting wβ∗ := w∗ + β(w − w∗ ) for some β ∈ (0, 1), the Mean Value Theory results in

R1
k∇ϕ(w) − ∇ϕ(w∗ )k = k 0 [∇2 ϕ(wβ∗ )(w − w∗ )dβk
R1
=k 0 [∇2 ϕ(wβ∗ ) − ∇2 ϕ(w∗ )](w − w∗ )dβk
R1
+k 0 ∇2 ϕ(w∗ )(w − w∗ )dβk
R1
≤ Lϕ kw − w∗ k 0 kwβ∗ − w∗ kdβ + k∇2 ϕ(w∗ )kkw − w∗ k

= (Lϕ δ∗ /2 + k∇2 ϕ(w∗ )k)kw − w∗ k,

showing that gradient ∇ϕ(·) is locally Lipschitz continuous around w∗ , which using the similar
reasoning allows us to check the locally Lipschitz continuity of ϕ(·) around w∗ .

Lemma 4.3. Under Assumption 4.1, for any w ∈ N(w∗ , δ∗ ),

k∇F(w; J ) − ∇F(w∗ ; J )k ≤ L∗ kw − w∗ k. (4.23)

for any fixed J , where L∗ is relied on (f, G, w∗ , δ∗ ).

Proof. By Lemma 4.2, the locally Lipschitz continuity of ∇2 G(·) around x∗ implies that ∇G(·) is
also locally Lipschitz continuous around x∗ . Let the locally Lipschitz constants of ∇2 G(·), ∇G(·),
and ∇2 f (·) around x∗ be L2 , L1 , and Lf . Then we have

k∇J G(x) − ∇J G(x∗ )kF ≤ k∇G(x) − ∇G(x∗ )kF


(4.24)
≤ L1 kx − x∗ k ≤ L1 kw − w∗ k,

and also

k∇2J G(x) ◦ WJ − ∇2J G(x∗ ) ◦ WJ∗ kF

≤ k∇2 G(x) ◦ W − ∇2 G(x∗ ) ◦ W∗ kF

≤ k(∇2 G(x) − ∇2 G(x∗ )) ◦ WkF + k∇2 G(x∗ ) ◦ (W − W∗ )kF


(4.25)
≤ k∇2 G(x) − ∇2 G(x∗ )kF kWkF + k∇2 G(x∗ )kkW − W∗ kF

≤ L2 kx − x∗ k(kW∗ kF + kW − W∗ kF ) + k∇2 G(x∗ )kkW − W∗ kF


h i
≤ L2 (kw∗ kF + δ∗ ) + k∇2 G(x∗ )k kw − w∗ k,

where we used the facts that

max{kx − x∗ k, kW − W∗ kF } ≤ kw − w∗ k

kWkF ≤ kW − W∗ kF + kW∗ kF ≤ kw − w∗ k + kw∗ k ≤ δ∗ + kw∗ k.

Conditions (4.25) and (4.24) enable us to obtain

k∇F(w; J ) − ∇F(w∗ ; J )k

≤ k∇F(w; J ) − ∇F(w∗ ; J )kF


(4.3)
≤ k∇2 f (x) − ∇2 f (x∗ )kF + 2k∇J G(x) − ∇J G(x∗ )kF

+ k∇2J G(x) ◦ WJ − ∇2J G(x∗ ) ◦ WJ∗ kF


(4.25) h i
≤ Lf + 2L1 + L2 (kw∗ kF + δ∗ ) + k∇2 G(x∗ )k kw − w∗ k,

showing the desired result.

Lemma 4.4. Let w∗ be a τ -stationary point with 0 < τ < τ∗ of problem (SCO). If Assumptions
4.2 and 4.1 hold, then there always exist positive constants c∗, C∗, δ∗, and µ∗ such that for any
µ ∈ [0, µ∗ ], w ∈ N(w∗ , δ∗ ), and T ∈ Tτ (Λ; s),

1/c∗ ≤ σmin (∇Fµ (w; UT )) ≤ k∇Fµ (w; UT )k ≤ C∗ . (4.26)

Proof. Consider any w ∈ N(w∗ , δ∗ ) with a sufficiently small radius δ∗ ∈ (0, η∗ ], where η∗ is given
in Lemma 4.1. Similarly, we define Γ+ , Γ− , and Γ0 for Λ as (4.16) and J := UT as (4.17). Using
the notation in (4.4), we have
 
 H(w; J ) 0 
∇F(w; J ) = 

,
 (4.27)
0 0

where
 
 ∇2 f (x) + ∇2J G(x) ◦ WJ ∇J G(x) 
H(w; J ) := 

.
 (4.28)
∇J G(x)> 0

Since w ∈ N(w∗ , δ∗ ) for a sufficiently small δ∗ ≤ η∗ , it follows from (4.20) that J ⊆ J∗ and
∗ = 0. Therefore,
WJ
 
 ∇2 f (x∗ ) + ∇2J∗ G(x∗ ) ◦ WJ
∗ ∇ G(x∗ )
∗ J
H(w∗ ; J ) = 

, (4.29)
 
∇J G(x )∗ > 0

∗ due to J ⊆ J and W∗ = 0.
where we used the fact that ∇2J G(x∗ ) ◦ WJ∗ = ∇2J∗ G(x∗ ) ◦ WJ ∗ ∗ J
Therefore, H(w∗ ; J ) is a sub-matrix of H(w∗ ; J∗ ) owing to J ⊆ J∗ . Recall the full rankness of
∇J∗ G(x∗ ) and the positive definiteness of ∇2 f (x∗ ) + ∇2J∗ G(x∗ ) ◦ WJ
∗ in Assumption 4.2. We can

conclude that H(w∗ ; J ) is non-singular for any J ⊆ J∗ , namely σmin (H(w∗ ; J )) > 0. Then by [39,
Theorem 1] that the maximum singular value of a matrix is no less than the maximum singular
value of its sub-matrix, we obtain

kH(w∗ ; J )k ≤ kH(w∗ ; J∗ )k,


(4.30)
σmin (H(w∗ ; J )) ≥ minJ ⊆J∗ σmin (H(w∗ ; J )) > 0,

which contributes to

k∇Fµ (w; J )k ≤ k∇Fµ (w; J ) − ∇F(w; J )k

+ k∇F(w; J ) − ∇F(w∗ ; J )k + k∇F(w∗ ; J )k


(4.23)
≤ µ + L∗ δ∗ + k∇F(w∗ ; J )k
(4.27)
≤ 2L∗ δ∗ + max{kH(w∗ ; J )k, 1}
(4.30)
≤ 2L∗ δ∗ + max{kH(w∗ ; J∗ )k, 1} =: C∗ .

To show the lower bound of σmin (∇F(w; J )), we need the following fact:

kA − Bk ≥ maxi |σi (A) − σi (B)|

≥ |σi0 (A) − σi0 (B)|


(4.31)
≥ σi0 (A) − σmin (B)

≥ σmin (A) − σmin (B),

for any two matrices A and B, where the first inequality holds from [25, Reminder (2), on Page
76] and i0 satisfies σi0 (B) = σmin (B). Let µ∗ := L∗ δ∗ . The above fact allows us to derive

(4.31)
σmin (∇Fµ (w; J )) ≥ σmin (∇Fµ (w; J )) − k∇Fµ (w; J ) − ∇F(w; J )k

= σmin (∇Fµ (w; J )) − µ


(4.31)
≥ σmin (∇F(w∗ ; J )) − k∇F(w∗ ; J ) − ∇F(w; J )k − µ
(4.23)
≥ σmin (∇F(w∗ ; J )) − L∗ δ∗ − µ
(4.27)
≥ min{σmin (H(w∗ ; J )), 1} − 2µ∗
(4.30)
≥ min{minJ ⊆J∗ σmin (H(w∗ ; J∗ )), 1} − 2µ∗ =: 1/c∗ .

Since δ∗ can be sufficiently small, so is µ∗ and hence c∗ > 0.

Now we present the main convergence results in the following theorem.

Theorem 4.2 (Locally quadratic convergence). Let w∗ be a τ -stationary point of problem (SCO)
with 0 < τ < τ∗ and {w` } be the sequence generated by Algorithm 1. If Assumptions 4.2 and 4.1
hold, then there always exist positive constants c∗, C∗, ε∗, and µ∗ ensuring the following results if
choosing µ ∈ (0, µ∗], γ ≥ |Γ∗0|/s, and w^0 ∈ N(w∗, ε∗).

a) The full Newton update is always admitted.

b) Sequence {d` }`≥0 is well defined and lim`→∞ d` = 0.

c) Whole sequence {w` } converges to w∗ quadratically, namely,

    ‖w^{ℓ+1} − w∗‖ ≤ 3c∗ (L∗ + ρC∗) ‖w^ℓ − w∗‖².

d) The halting condition satisfies

    ‖F(w^{ℓ+1}, U_{T_{ℓ+1}})‖ ≤ 3c∗³ C∗ (L∗ + ρC∗) ‖F(w^ℓ, U_{T_ℓ})‖²

and Algorithm 1 will terminate when

    ℓ ≥ ⌈ log₂( √(3c∗³C∗³(L∗ + ρC∗)) ‖w^0 − w∗‖ ) − log₂(√tol) ⌉.                      (4.32)

Proof. Let η∗ be given in Lemma 4.1, c∗ , C∗ , δ∗ , µ∗ are given in Lemma 4.4, and L∗ be defined in
Lemma 4.3. For notational simplicity, for ` = 0, 1, 2, · · · , let

wβ` := w∗ + β(w` − w∗ ), where β ∈ [0, 1],

J` := UT` , (4.33)
 n oi
1 1
∗ ∈ 0, min δ∗ , 6c∗ (L∗ +ρC∗ ) , 2ρC∗ ,

where the last conditon indicates


n o
max ∗ ρC∗ , 3∗ c∗ (L∗ + ρC∗ ) ≤ 12 . (4.34)

a) We note that µ0 ≤ µ ≤ µ∗ . Then it follows from Lemma 4.1, Lemma 4.4, and w0 ∈ N(w∗ , ∗ )
that for any T0 ∈ Tτ (Λ0 ; s),
(4.15) (4.26)
F(w∗ ; J0 ) = 0, k(∇Fµ0 (w0 ; J0 ))−1 k ≤ c∗ , (4.35)

which means that (4.11) is solvable, that is,


(4.11)
d0 = − (∇Fµ0 (w0 ; J0 ))−1 F(w0 ; J0 ). (4.36)

Recalling (4.2), for given J0 , F(·; J0 ) is locally Lipschitz continuous around x∗ by Lemma 4.2 due
to the locally Lipschitz continuity of ∇2 f (·) and ∇2 G(·). Then by the first condition in (4.35),
kd0 k is close to zero for a sufficiently small ∗ . This allows us to derive the following condition

∀ n ∈ Γ∗+ : (G(x∗ ))max 0 0 max


:n > 0 =⇒ (G(x + dx )):n > 0,

∀ n ∈ Γ∗− : (G(x∗ ))max 0 0 max


:n < 0 =⇒ (G(x + dx )):n < 0.

(We shall emphasize that we can find a strictly positive bound  > 0 such that the above conditions
can be achieved for any ∗ ∈ (0, ] due to the locally Lipschitz continuity of G(·) around x∗ .
Therefore, such an ∗ can be away from zero.) Based on the above conditions, we can obtain

kG(x0 + d0x )k+ ∗ ∗ ∗ + ∗


0 ≤ |Γ+ | + |Γ0 | = kG(x )k0 + |Γ0 | ≤ s + γs.

Overall, we showed that (4.11) is solvable and (4.13) is satisfied with π t` = 1. Hence, the full
Newton is admitted, namely,

w 1 = w 0 + d0 . (4.37)

In addition, by using the fact that µ0 ≤ ρkF(w0 ; J0 )k, we have the following chain of inequalities,

k∇Fµ0 (w0 ; J0 ) − ∇F(w∗ ; J0 )k


(4.3)
≤ k∇F(w0 ; J0 ) − ∇F(w∗ ; J0 )k + µ0

≤ k∇F(w0 ; J0 ) − ∇F(w∗ ; J0 )k + ρkF(w0 ; UT0 )k


(4.36)
≤ k∇F(w0 ; J0 ) − ∇F(w∗ ; J0 )k + ρk∇Fµ0 (w0 ; J0 )d0 k
(4.23,4.26)
≤ L∗ kw0 − w∗ k + ρC∗ kd0 k.

This enables us to obtain

k∇Fµ0 (w0 ; J0 ) − ∇F(wβ0 ; J0 )k

≤ k∇Fµ0 (w0 ; J0 ) − ∇F(w∗ ; J0 )k + k∇F(wβ0 ; J0 ) − ∇F(w∗ ; J0 )k

≤ L∗ kw0 − w∗ k + ρC∗ kd0 k + k∇F(wβ0 ; J0 ) − ∇F(w∗ ; J0 )k


(4.38)
(4.23)
≤ L∗ (kw0 − w∗ k + kw0 − wβ∗ k) + ρC∗ kd0 k

(4.37)
= L∗ (1 + β)kw0 − w∗ k + ρC∗ kw1 − w0 k

≤ (L∗ (1 + β) + ρC∗ )kw0 − w∗ k + ρC∗ kw1 − w∗ k

due to both w0 and wβ0 ∈ N(w∗ , δ∗ ). Note that for a fixed J0 , function F(·; J0 ) is differentiable.
So we have the following mean value expression

R1
F(w0 ; J0 ) = F(w∗ ; J0 ) + 0 ∇F(wβ0 ; J0 )(w0 − w∗ )dβ
(4.39)
(4.35) R 1
= 0 ∇F(wβ0 ; J0 )(w0 − w∗ )dβ,

which ensures the following chain of inequalities,

(4.37)
kw1 − w∗ k = kw0 + d0 − w∗ k

(4.36)
= kw0 − w∗ − (∇Fµ0 (w0 ; J0 ))−1 F(w0 ; J0 )k
(4.35)
≤ c∗ k∇Fµ0 (w;0 J0 )(w0 − w∗ ) − F(w0 ; J0 )k

(4.39) R1
= c∗ k∇Fµ0 (w0 ; J0 )(w0 − w∗ ) − 0 ∇F(wβ0 ; J0 )(w0 − w∗ )dβk
R1
≤ c∗ 0 k∇Fµ0 (w0 ; J0 ) − ∇F(wβ0 ; J0 )k · kw0 − w∗ kdβ
(4.38) R1h i
≤ c∗ 0 (L∗ (1 + β) + ρC∗ )kw0 − w∗ k + ρC∗ kw1 − w∗ k kw0 − w∗ kdβ

= c∗ (3L∗ /2 + ρC∗ )kw0 − w∗ k2 + ρC∗ kw1 − w∗ kkw0 − w∗ k

≤ c∗ (3L∗ /2 + ρC∗ )kw0 − w∗ k2 + (1/2)kw1 − w∗ k,

where the last inequality is from ρC∗ kw0 − w∗ k ≤ ρC∗ ∗ ≤ 1/2 by (4.34). The above condition
immediately results in

kw1 − w∗ k ≤ 3c∗ (L∗ + ρC∗ )kw0 − w∗ k2

≤ 3c∗ ∗ (L∗ + ρC∗ )kw0 − w∗ k (4.40)

(4.34)
≤ (1/2)kw0 − w∗ k < ∗ .

This means w1 ∈ N(w∗ , ∗ ). Replacing J0 by J1 , the same reasoning allows us to show that
for ` = 1, (i) (4.11) is solvable; (ii) the full Newton update is admitted; and (iii) kw2 − w∗ k ≤
3c∗ (L∗ + ρC∗ )kw1 − w∗ k2 ≤ (1/2)kw1 − w∗ k. By the induction, we can conclude that for any `,

• w` ∈ N(w∗ , ∗ );

• (4.11) is solvable;

• the full Newton update is admitted, and

kw`+1 − w∗ k ≤ 3c∗ (L∗ + ρC∗ )kw` − w∗ k2

≤ 3c∗ ∗ (L∗ + ρC∗ )kw` − w∗ k (4.41)

(4.34)
≤ (1/2)kw` − w∗ k.

Therefore, conclusions a) and b) follow immediately.


c) It has shown that w` ∈ N(w∗ , ∗ ), thereby resulting in
(4.15)
F(w∗ , J` ) = 0. (4.42)

By ∇F(w; J ) = ∇F0 (w; J ), wβ` ∈ N(w∗ , ∗ ), and ∗ ≤ δ∗ , Lemma 4.4 allows us to derive

1/c∗ ≤ σmin (∇F(wβ` ; J` )) ≤ k∇F(wβ` ; J` )k ≤ C∗ . (4.43)

Again, for fixed J` , function F(·; J` ) is differentiable. The Mean-Value theorem indicates that there
exists a β0 ∈ (0, 1) satisfying

kF(w` ; J` )k = kF(w∗ ; J` ) + ∇F (wβ` 0 ; J` )(w` − w∗ )k


(4.42)
= k∇F(wβ` 0 ; J` )(w` − w∗ )k
(4.43) h i
∈ (1/c∗ )kw` − w∗ k, C∗ kw` − w∗ k . (4.44)

This contributes to the following chain of inequalities


(4.44)
kF(w`+1 ; J`+1 )k ≤ C∗ kw`+1 − w∗ k
(4.41)
≤ 3c∗ C∗ (L∗ + ρC∗ )kw` − w∗ k2
(4.44)
≤ 3c3∗ C∗ (L∗ + ρC∗ )kF(w` ; J` )k2 .

Finally, the above relation also indicates

kF(w` ; J` )k ≤ 3c3∗ C∗ (L∗ + ρC∗ )kF(w`−1 ; J`−1 )k2


(4.44)
≤ 3c3∗ C∗3 (L∗ + ρC∗ )kw`−1 − w∗ k2
(4.41)
≤ 3c3∗ C∗3 (L∗ + ρC∗ )/22 kw`−2 − w∗ k2
..
.
≤ 3c3∗ C∗3 (L∗ + ρC∗ )/22` kw0 − w∗ k2 .

Therefore, one can easily verify that if (4.32) is satisfied then the term of the right hand side of the
above inequality is smaller than tol, namely, kF(w` ; J` )k < tol, showing the desired result.

Remark 4.2. Regarding the assumptions in Theorem 4.2, µ and γ can be set easily in the numerical
experiments. For example, we could set a small value for µ (e.g., 10−4 ) and let γ ∈ (0, 1) as |Γ∗0 | is
usually pretty small in the numerical experiments. Therefore, to achieve the quadratic convergence
rate, we have to choose a proper starting point w0 , which is apparently impractical since the final
stationary point is unknown in advance.
However, in our numerical experiments, we find that the quadratic convergence rate can
always be observed when solving some problems from different starting points, which indicates that the
proposed algorithm does not seem to rely heavily on the initial points. For example, Fig. 1 presents
the results of SNSCO solving the norm optimization problems (see Section 5.1) under three different
starting points. From the left to the right figure, the three starting points are the point with each
entry being 0, the point with each entry being randomly generated from [0, 0.5], and the point with
each entry being 0.5. It can be clearly seen that the error ‖F(w^ℓ, U_{T_ℓ})‖ declines dramatically once
the iteration count exceeds a certain threshold.
Figure 1: Quadratic convergence rate. (Each of the three panels, one per starting point, plots the error ‖F(w^ℓ, U_{T_ℓ})‖ against the iteration count for α = 0.01, 0.05, 0.10.)

Figure 2: Effect of τ. (Two panels plot the objective value and the number of iterations against τ ∈ [10⁻⁴, 10⁰] for α = 0.01, 0.05, 0.10.)

5 Numerical Experiments
In this section, we will conduct some numerical experiments of SNSCO using MATLAB (R2020a) on
a laptop with 32GB of memory and a Core i9 CPU.

The starting point is always initialized as w^0 = 0. The parameters are set as follows: maxIt =
2000, tol = 10⁻⁹KMN, ρ = 10⁻², µ = 10⁻², γ = 0.25, ν = 0.999, and π = 0.25 if no additional in-
formation is provided. In addition, we always set s = ⌈αN⌉ where α is chosen from {0.01, 0.05, 0.1}.
We point out that our main theory about the τ-stationary point involves an important parameter τ.
We therefore test SNSCO on the norm optimization problem described below under different
choices of τ ∈ [10⁻⁴, 1]. The lines presented in Fig. 2 indicate that the results are quite robust to
τ. Therefore, in the following numerical experiments, we fix it as τ = 0.75.

5.1 Test example

We use the norm optimization problem described in [17, 1] to demonstrate the performance of
SNSCO. The problem takes the form of

    min_{x∈RK} −‖x‖₁,   s.t.   P{ Σ_{k=1}^{K} ξ²_{mk} x²_k ≤ b, m ∈ M } ≥ 1 − α,   x ≥ 0,            (5.1)

where ‖x‖₁ is the 1-norm of x. However, to fit the above problem into our model (SCO) without
other additional constraints, we aim to address the following one,

    min_{x∈RK} λ₂‖x‖² + λ₁‖x⁻‖₁ − ‖x‖₁,   s.t.   P{ Σ_{k=1}^{K} ξ²_{mk} x²_k ≤ b, m ∈ M } ≥ 1 − α,     (5.2)

where λ₂ and λ₁ are two positive penalty parameters. Here, ‖x⁻‖₁ is used to exactly penalize the con-
straint x ≥ 0. We add λ₂‖x‖² to the objective function to guarantee the non-singularity of the coefficient
matrix in (4.11) when updating the Newton direction. In the following numerical experiments we
fix λ₂ = 0.5 and λ₁ = 0.5 for simplicity. It is known from [17] that if ξ_{mk}, m ∈ M, k ∈ K, are
independent and identically distributed standard normal random variables, then the optimal solution
x̄ to problem (5.1) is

    x̄₁ = · · · = x̄_K = sqrt( b / F⁻¹_{χ²_K}( (1−α)^{1/M} ) ),                                         (5.3)

where F⁻¹_{χ²_K}(·) stands for the inverse distribution function of a chi-square distribution with K degrees of freedom.
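For reproducibility of this test problem, the snippet below (ours, not the authors' MATLAB implementation) evaluates the closed-form solution (5.3) with SciPy's chi-square inverse CDF and assembles the matrix G(x) that puts the sampled version of (5.2) into the (SCO) form, with G_{mn}(x) = Σ_k (ξ^n_{mk})² x_k² − b. The sample array xi of shape (N, M, K) and the value b = 10 are hypothetical inputs.

import numpy as np
from scipy.stats import chi2

def optimal_norm_solution(K, M, b, alpha):
    # Closed-form optimum (5.3) for i.i.d. standard normal xi_mk.
    q = chi2.ppf((1.0 - alpha) ** (1.0 / M), df=K)   # F^{-1}_{chi^2_K}((1 - alpha)^{1/M})
    return np.full(K, np.sqrt(b / q))

def G_saa(x, xi, b):
    # G(x) in R^{M x N} for the sampled constraint: G_mn(x) = sum_k xi[n, m, k]^2 x_k^2 - b.
    return np.einsum('nmk,k->mn', xi ** 2, x ** 2) - b

rng = np.random.default_rng(1)
K, M, N, b, alpha = 10, 1, 100, 10.0, 0.05
xbar = optimal_norm_solution(K, M, b, alpha)
xi = rng.standard_normal((N, M, K))
print(-np.abs(xbar).sum())                                  # -||xbar||_1, optimal value of (5.1)
print(float((G_saa(xbar, xi, b).max(axis=0) > 0).mean()))   # empirical violation rate at xbar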

5.2 Single chance constraint

We first focus on the scenarios where model (5.2) has a single chance constraint, namely, M = 1.
We will compare SNSCO with two algorithms proposed in [1]: regularized algorithm (RegAlg) with a
convex start and relaxed algorithm (RelAlg). Both algorithms are adopted to solve problem (5.2).
We choose α ∈ {0.01, 0.05, 0.10}. For each fixed (α, K, N ), we run 100 trials and report the average
results of the objective function values and computational time in seconds.
a) Effect of K. To see this, we fix N = 100 and choose K ∈ {10, 20, · · · , 50}. The average
results are presented in Fig. 3, where we display the computational time in the log domain to make
the differences evident. For the objective function values, SNSCO and RegAlg basically obtain
similar ones, which are slightly better than those of RelAlg. However, SNSCO runs the fastest, taking
less than e⁻³ ≈ 0.05 seconds. The slowest solver is RegAlg, which is not surprising since it often
has to restart the method when the point does not satisfy the optimality conditions.

Figure 3: Effect of K. (Left column: objective values; right column: ln(Time); rows correspond to α = 0.01, 0.05, 0.10; curves for SNSCO, RegAlg, and RelAlg are plotted against K.)

b) Effect of N. To see this, we fix K = 10 and select N ∈ {100, 150, · · · , 300}. We emphasize
that SNSCO is able to solve instances with a much higher dimension N. However, we do not test
instances with larger N because RegAlg would take a long time to solve them. The average
results are given in Fig. 4. Once again, SNSCO produces objective function values similar to those
of RegAlg and runs much faster than the other two algorithms. The advantage in computational
speed tends to become more pronounced as N increases. For example, when N = 300 and α = 0.1, RegAlg
consumes more than e⁶ ≈ 400 seconds while SNSCO only takes e⁻⁵ ≈ 0.007 seconds.

5.3 Joint chance constraint

For problem (5.1) with a joint chance constraint, we fix N = 100 and choose K and M from
{10, 20, · · · , 50}. Average results over 100 trials are reported in Table 2, where M = 10 is
fixed, and Table 3, where K = 10 is fixed. Since the optimal solution x* of problem (5.1) is known
from (5.3), we compare the solution x̄ generated by SNSCO with x* by evaluating the objective
function of problem (5.1). It can be clearly seen that −‖x̄‖₁ is close to −‖x*‖₁ but usually
slightly smaller. This is because SNSCO solves (5.2), a relaxation of problem (5.1).
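
As an additional sanity check of (5.3) in the joint case, the probability that x* satisfies all M
inequalities can be estimated by Monte Carlo and should be close to 1 − α. A minimal sketch follows
(the value of b and the Monte Carlo sample size are illustrative assumptions).

    # Sketch: Monte Carlo check that x* from (5.3) attains coverage ~ 1 - alpha
    # for a joint constraint (M > 1).  b and the sample size are illustrative.
    import numpy as np
    from scipy.stats import chi2

    K, M, b, alpha = 10, 10, 10.0, 0.05
    t = np.sqrt(b / chi2.ppf((1.0 - alpha) ** (1.0 / M), df=K))   # common entry of x* in (5.3)

    rng = np.random.default_rng(1)
    xi = rng.standard_normal((200_000, M, K))
    vals = t ** 2 * (xi ** 2).sum(axis=2)          # sum_k xi_{imk}^2 (x*_k)^2 for each (i, m)
    coverage = (vals <= b).all(axis=1).mean()      # fraction of samples satisfying all m
    print(coverage)                                # should be close to 1 - alpha = 0.95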

Figure 4: Effect of N. (Panels: objective function values and ln(Time) for α = 0.01, 0.05, 0.10;
curves: SNSCO, RegAlg, RelAlg; horizontal axis: N ∈ {100, 150, 200, 250, 300}.)

Table 2: Effect of K (M = 10 fixed), where x* is the optimal solution (5.3), x̄ is the solution
returned by SNSCO, and Time is in seconds.

              α = 0.01                    α = 0.05                    α = 0.10
 K     −‖x*‖₁   −‖x̄‖₁    Time      −‖x*‖₁   −‖x̄‖₁    Time      −‖x*‖₁   −‖x̄‖₁    Time
 10    -5.814   -6.304    0.42      -6.312   -6.453    0.75      -6.581   -6.562    0.86
 20    -9.400   -10.13    0.40      -10.01   -10.32    0.75      -10.34   -10.48    1.01
 30    -12.28   -13.30    0.57      -12.96   -13.41    0.95      -13.32   -13.55    1.06
 40    -14.77   -15.91    0.64      -15.49   -16.01    1.05      -15.88   -16.19    1.34
 50    -16.99   -18.30    0.52      -17.75   -18.39    0.94      -18.15   -18.53    1.15

Table 3: Effect of M (K = 10 fixed); columns as in Table 2.

              α = 0.01                    α = 0.05                    α = 0.10
 M     −‖x*‖₁   −‖x̄‖₁    Time      −‖x*‖₁   −‖x̄‖₁    Time      −‖x*‖₁   −‖x̄‖₁    Time
 10    -5.81    -6.16     0.52      -6.31    -6.43     0.74      -6.58    -6.56     1.01
 20    -5.64    -6.07     0.85      -6.08    -6.22     1.18      -6.32    -6.30     1.31
 30    -5.55    -5.93     0.64      -5.96    -6.01     1.25      -6.18    -6.15     1.94
 40    -5.49    -5.90     1.51      -5.88    -5.95     2.03      -6.09    -6.05     2.46
 50    -5.44    -5.83     1.50      -5.82    -5.92     1.69      -6.02    -5.97     2.79

6 Conclusion

The 0/1 loss function ideally characterizes the constraints of SAA. However, its discontinuous
nature has long impeded the development of numerical algorithms for solving SAA directly. In this
paper, we addressed a general 0/1 constrained optimization problem that includes SAA as a special
case. One key factor in this success was the derivation of the normal cone to the feasible set;
another was the establishment of the τ-stationary equations, a type of optimality condition that
allows us to exploit a smoothing Newton-type method. We believe these results can be extended to a
more general setting where equality or inequality constraints are included in model (SCO); this
deserves future investigation, as the general model has more practical applications [17, 9].

References
[1] Adam, L., Branda, M.: Nonlinear chance constrained problems: optimality conditions, regu-
larization and solvers. J. Optim. Theory. Appl. 170(2), 419–436 (2016)

[2] Ahmed, S., Shapiro, A.: Solving chance-constrained stochastic programs via sampling and
integer programming. In: State-of-the-art decision-making tools in the information-intensive
age, pp. 261–269. Informs (2008)

[3] Ban, L., Mordukhovich, B.S., Song, W.: Lipschitzian stability of parametric variational
inequalities over generalized polyhedra in Banach spaces. Nonlinear Anal. Theory Methods Appl.
74(2), 441–461 (2011)

[4] Beraldi, P., Bruni, M.E.: An exact approach for solving integer problems under probabilistic
constraints with random technology matrix. Ann. Oper. Res. 177(1), 127–137 (2010)

[5] Boufounos, P.T., Baraniuk, R.G.: 1-bit compressive sensing. In: 2008 42nd Annual Conference
on Information Sciences and Systems, pp. 16–21. IEEE (2008)

[6] Charnes, A., Cooper, W.W., Symonds, G.H.: Cost horizons and certainty equivalents: an
approach to stochastic programming of heating oil. Manage Sci. 4(3), 235–263 (1958)

[7] Chen, X., Qi, L., Sun, D.: Global and superlinear convergence of the smoothing Newton method
and its application to general box constrained variational inequalities. Math. Comput. 67(222),
519–540 (1998)

[8] Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)

[9] Curtis, F.E., Wachter, A., Zavala, V.M.: A sequential algorithm for solving nonlinear opti-
mization problems with chance constraints. SIAM J. Optim. 28(1), 930–958 (2018)

[10] Dontchev, A.L., Rockafellar, R.T.: Newton’s method for generalized equations: a sequential
implicit function theorem. Math. Program. 123(1), 139–159 (2010)

[11] Evgeniou, T., Pontil, M., Poggio, T.: Regularization networks and support vector machines.
Adv. Comput. Math. 13(1), 1 (2000)

[12] Friedman, J.H.: On bias, variance, 0/1 loss, and the curse-of-dimensionality. Data Min. Knowl.
Discov. 1(1), 55–77 (1997)

[13] Geletu, A., Hoffmann, A., Kloppel, M., Li, P.: An inner-outer approximation approach to
chance constrained optimization. SIAM J. Optim. 27(3), 1834–1857 (2017)

[14] Hastie, T., Tibshirani, R., Friedman, J.: The elements of statistical learning: data mining,
inference, and prediction. Springer Science & Business Media (2009)

[15] Henrion, R.: Structural properties of linear probabilistic constraints. Optim. 56(4), 425–440
(2007)

[16] Henrion, R., Möller, A.: Optimization of a continuous distillation process under random inflow
rate. Comput. Math. with Appl. 45(1-3), 247–262 (2003)

[17] Hong, L.J., Yang, Y., Zhang, L.: Sequential convex approximations to joint chance constrained
programs: A Monte Carlo approach. Oper. Res. 59(3), 617–630 (2011)

[18] Kataoka, S.: A stochastic programming model. J. Econom. pp. 181–196 (1963)

[19] Lagoa, C.M., Li, X., Sznaier, M.: Probabilistically constrained linear programs and risk-
adjusted controller design. SIAM J. Optim. 15(3), 938–951 (2005)

[20] Lejeune, M.A., Ruszczyński, A.: An efficient trajectory method for probabilistic production-
inventory-distribution problems. Oper. Res. 55(2), 378–394 (2007)

[21] Li, L., Lin, H.T.: Optimizing 0/1 loss for perceptrons by random coordinate descent. In: 2007
International Joint Conference on Neural Networks, pp. 749–754. IEEE (2007)

[22] Luedtke, J.: A branch-and-cut decomposition algorithm for solving chance-constrained math-
ematical programs with finite support. Math. Program. 146(1), 219–244 (2014)

[23] Luedtke, J., Ahmed, S.: A sample approximation approach for optimization with probabilistic
constraints. SIAM J. Optim. 19(2), 674–699 (2008)

[24] Luedtke, J., Ahmed, S., Nemhauser, G.L.: An integer programming approach for linear pro-
grams with probabilistic constraints. Math. Program. 122(2), 247–272 (2010)

[25] Lütkepohl, H.: Handbook of matrices, vol. 1. Wiley Chichester (1996)

[26] Miller, B.L., Wagner, H.M.: Chance constrained programming with joint constraints. Oper.
Res. 13(6), 930–945 (1965)

[27] Nemirovski, A., Shapiro, A.: Convex approximations of chance constrained programs. SIAM
J. Optim. 17(4), 969–996 (2007)

[28] Osuna, E., Girosi, F.: Reducing the run-time complexity of support vector machines. In:
International Conference on Pattern Recognition (submitted) (1998)

[29] Pagnoncelli, B.K., Ahmed, S., Shapiro, A.: Sample average approximation method for chance
constrained programming: theory and applications. J. Optim. Theory. Appl. 142(2), 399–416
(2009)

[30] Peña-Ordieres, A., Luedtke, J.R., Wächter, A.: Solving chance-constrained problems via a
smooth sample-based nonlinear approximation. SIAM J. Optim. 30(3), 2221–2250 (2020)

[31] Prékopa, A.: Contributions to the theory of stochastic programming. Math. Program. 4(1),
202–221 (1973)

[32] Prékopa, A.: Stochastic programming, vol. 324. Springer Science & Business Media (2013)

[33] Robinson, S.M.: Strongly regular generalized equations. Math. Oper. Res. 5(1), 43–62 (1980)

[34] Rockafellar, R.T., Wets, R.J.B.: Variational analysis, vol. 317. Springer Science & Business
Media (2009)

[35] Shapiro, A., Dentcheva, D., Ruszczynski, A.: Lectures on stochastic programming: modeling
and theory. SIAM (2021)

[36] Sun, H., Xu, H., Wang, Y.: Asymptotic analysis of sample average approximation for stochastic
optimization problems with joint chance constraints via conditional value at risk and difference
of convex functions. J. Optim. Theory. Appl. 161(1), 257–284 (2014)

[37] Sun, H., Zhang, D., Chen, Y.: Convergence analysis and a DC approximation method for
data-driven mathematical programs with distributionally robust chance constraints.
http://www.optimization-online.org/DB_HTML/2019/11/7465.html (2019)

[38] Takyi, A.K., Lence, B.J.: Surface water quality management using a multiple-realization
chance constraint method. Water Resour. Res. 35(5), 1657–1670 (1999)

[39] Thompson, R.C.: Principal submatrices IX: Interlacing inequalities for singular values of
submatrices. Linear Algebra Appl. 5(1), 1–12 (1972)

[40] Wang, H., Shao, Y., Zhou, S., Zhang, C., Xiu, N.: Support vector machine classifier via
L_{0/1} soft-margin loss. IEEE Trans. Pattern Anal. Mach. Intell. (2021)

[41] Weisstein, E.W.: Heaviside step function. https://mathworld.wolfram.com/ (2002)

[42] Zhou, S., Luo, Z., Xiu, N., Li, G.Y.: Computing one-bit compressive sensing via double-
sparsity constrained optimization. IEEE Trans. Signal Process 70, 1593–1608 (2022)

[43] Zhou, S., Pan, L., Xiu, N., Qi, H.D.: Quadratic convergence of smoothing Newton's method
for 0/1 loss optimization. SIAM J. Optim. 31(4), 3184–3211 (2021)
