
Published as a conference paper at ICLR 2022

NEURAL NETWORK APPROXIMATION BASED ON HAUSDORFF DISTANCE OF TROPICAL ZONOTOPES

Panagiotis Misiakos^1,†, Georgios Smyrnis^2, Georgios Retsinas^3, Petros Maragos^3
^1 ETH Zurich, ^2 University of Texas at Austin, ^3 National Technical University of Athens
[email protected], [email protected],
[email protected], [email protected]

ABSTRACT

In this work we theoretically contribute to neural network approximation by providing a novel tropical geometrical viewpoint to structured neural network compression. In particular, we show that the approximation error between two neural networks with ReLU activations and one hidden layer depends on the Hausdorff distance of the tropical zonotopes of the networks. This theorem comes as a first step towards a purely geometrical interpretation of neural network approximation. Based on this theoretical contribution, we propose geometrical methods that employ the K-means algorithm to compress the fully connected parts of ReLU activated deep neural networks. We analyze the error bounds of our algorithms theoretically based on our approximation theorem and evaluate them empirically on neural network compression. Our experiments follow a proof-of-concept strategy and indicate that our geometrical tools achieve improved performance over relevant tropical geometry techniques and can be competitive against non-tropical methods.

1 INTRODUCTION

Tropical geometry (Maclagan & Sturmfels, 2015) is a mathematical field based on algebraic geometry
and strongly linked to polyhedral and combinatorial geometry. It is built upon the tropical semiring
which originally refers to the min-plus semiring (Rmin , ∧, +), but may also refer to the max-plus
semiring (Cuninghame-Green, 2012; Butkovič, 2010). In our work, we follow the convention of the
max-plus semiring (Rmax , ∨, +) which replaces the classical operations of addition and multiplication
by max and sum respectively. These operations turn polynomials into piecewise linear functions
making them directly applicable in neural networks.
Tropical mathematics covers a wide range of applications including dynamical systems on weighted
lattices (Maragos, 2017), finite state transducers (Theodosis & Maragos, 2018; 2019) and convex
regression (Maragos & Theodosis, 2020; Tsilivis et al., 2021). Recently, there has been remarkable
theoretical impact of tropical geometry in the study of neural networks and machine learning (Maragos
et al., 2021). Zhang et al. (2018) prove the equivalence of ReLU activated neural networks with
tropical rational mappings. Furthermore, they use zonotopes to compute a bound on the number
of the network’s linear regions, which was already known from (Montúfar et al., 2014). In a
similar context, Charisopoulos & Maragos (2018) compute an upper bound to the number of linear
regions of convolutional and maxout layers and propose a randomized algorithm for linear region
counting. Other works employ tropical geometry to examine the training and further properties of
morphological perceptron (Charisopoulos & Maragos, 2017) and morphological neural networks
(Dimitriadis & Maragos, 2021).
Pruning or, more generally, compressing neural networks has gained interest in recent years due to the surprising
capability of reducing the size of a neural network without compromising performance (Blalock et al.,
2020). As tropical geometry explains the mathematical structure of neural networks, pruning may
also be viewed under the perspective of tropical geometry. Indeed, Alfarra et al. (2020) propose an
unstructured compression algorithm based on sparsifying the zonotope matrices of the network. Also,

† Conducted research as a student at the National Technical University of Athens.


Smyrnis et al. (2020) construct a novel tropical division algorithm that applies to neural network
minimization. A generalization of this applies to multiclass networks (Smyrnis & Maragos, 2020).

Contributions In our work, we contribute to structured neural network approximation from the
mathematical viewpoint of tropical geometry:

• We establish a novel bound on the approximation error between two neural networks with
ReLU activations and one hidden layer. To prove this we bound the difference of the
networks’ tropical polynomials via the Hausdorff distance of their respective zonotopes.
• We construct two geometrical neural network compression methods that are based on
zonotope reduction and employ K-means algorithm for clustering. Our algorithms apply on
the fully connected layers of ReLU activated neural networks.
• Our algorithms are analyzed both theoretically and experimentally. The theoretical evaluation is based on our bound for the neural network approximation error. On the experimental side, we examine how well our algorithms retain the accuracy of convolutional neural networks when compressing their fully connected layers.

2 BACKGROUND ON TROPICAL GEOMETRY


We study tropical geometry from the viewpoint of the max-plus semiring (Rmax , ∨, +) which is
defined as the set Rmax = R ∪ {−∞} equipped with two operations (∨, +). Operation ∨ stands for
max and + stands for sum. In max-plus algebra we define polynomials in the following way.

Tropical polynomials A tropical polynomial f in d variables x = (x_1, x_2, ..., x_d)^T is defined as the function

    f(x) = max_{i∈[n]} { a_i^T x + b_i }                    (1)

where [n] = {1, ..., n}, ai are vectors in Rd and bi is the corresponding monomial coefficient in
Rmax = R ∪ {−∞}. The set of such polynomials constitutes the semiring Rmax [x] of tropical
polynomials. Note that each term aTi x + bi corresponds to a hyperplane in Rd . We thus call the
vectors {ai }i∈[n] the slopes of the tropical polynomial, and {bi }i∈[n] the respective biases. We allow
slopes to be vectors with real coefficients rather than integer ones, as it is normally the case for
polynomials in regular algebra. These polynomials are also referred to as signomials (Duffin &
Peterson, 1973) in the literature.
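As a small illustration, the following sketch evaluates a tropical polynomial from its slope matrix and bias vector; Python with NumPy is assumed here and in the later sketches, since the paper itself provides no code.

    import numpy as np

    def tropical_poly(A, b):
        """f(x) = max_i (a_i^T x + b_i) for slope matrix A (n x d) and bias vector b (n,)."""
        return lambda x: float(np.max(A @ np.asarray(x, dtype=float) + b))

    # The polynomial f(x, y) = max(2x + y + 1, 0) used later in Example 1.
    f = tropical_poly(np.array([[2.0, 1.0], [0.0, 0.0]]), np.array([1.0, 0.0]))
    print(f([1.0, -1.0]))   # max(2 - 1 + 1, 0) = 2.0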

Polytopes Polytopes have been studied extensively (Ziegler, 2012; Grünbaum, 2013) and occur as
a geometric tool for fields such as linear programming and optimization. They also have an important
role in the analysis of neural networks. For instance, Zhang et al. (2018); Charisopoulos & Maragos
(2018) show that linear regions of neural networks correspond to vertices of polytopes. Thus, the
counting of linear regions reduces to a combinatorial geometry problem. In what follows, we explore
this connection of tropical geometry with polytopes.
Consider the tropical polynomial defined in (1). The Newton polytope associated to f (x) is defined
as the convex hull of the slopes of the polynomial
Newt (f ) := conv{ai : i ∈ [n]}
Furthermore, the extended Newton polytope of f (x) is defined as the convex hull of the slopes of the
polynomial extended in the last dimension by the corresponding bias coefficient.
ENewt (f ) := conv{(aTi , bi ) : i ∈ [n]}
The following proposition computes the extended Newton polytope that occurs when a tropical
operation is applied between two tropical polynomials. It will allow us to compute the polytope
representation corresponding to a neural network’s hidden layer.
Proposition 1. (Zhang et al., 2018; Charisopoulos & Maragos, 2018) Let f, g ∈ Rmax[x] be two tropical polynomials. Then for the extended Newton polytopes it is true that

    ENewt(f ∨ g) = conv{ENewt(f) ∪ ENewt(g)}
    ENewt(f + g) = ENewt(f) ⊕ ENewt(g)


[Figure 1 panels: ENewt(f), ENewt(g), ENewt(f ∨ g) with its upper envelope UF(ENewt(f ∨ g)), and ENewt(f + g).]

Figure 1: Illustration of tropical operations between polynomials. The polytope of the max (∨) of f and g corresponds to the convex hull of the union of the points of the two polytopes and the polytope of the sum (+) corresponds to their Minkowski sum.

Here ⊕ denotes Minkowski addition. In particular, for two sets A, B ⊆ Rd it is defined as


A ⊕ B := {a + b | a ∈ A, b ∈ B}
Corollary 1. This result can be extended to any finite set of polynomials using induction.
Example 1. Let f, g be two tropical polynomials in 2 variables, such that
f (x, y) = max(2x + y + 1, 0), g(x, y) = max(x, y, 1)
The tropical operations applied to these polynomials give
f ∨ g = max(2x + y + 1, 0, x, y, 1)
f + g = max(3x + y + 1, x, 2x + 2y + 1, y, 2x + y + 2, 1)
Fig. 1 illustrates the extended Newton polytopes of the original and the computed polynomials.
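The two operations of Proposition 1 can be checked numerically on Example 1. The sketch below (NumPy assumed) lists the point sets whose convex hulls give ENewt(f ∨ g) and ENewt(f + g).

    import numpy as np

    # Each monomial a^T x + b of a tropical polynomial contributes the point (a, b) in R^3.
    F = np.array([[2, 1, 1], [0, 0, 0]])                  # ENewt(f), f = max(2x + y + 1, 0)
    G = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])       # ENewt(g), g = max(x, y, 1)

    union_pts = np.vstack([F, G])                                    # ENewt(f v g) = conv of this union
    minkowski_pts = (F[:, None, :] + G[None, :, :]).reshape(-1, 3)   # ENewt(f + g) = Minkowski sum

    print(union_pts)        # 5 points: the terms of f v g in Example 1
    print(minkowski_pts)    # 6 points: the terms of f + g in Example 1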
The extended Newton polytope provides a geometrical representation of a tropical polynomial. In
addition, it may be used to compute the values that the polynomial attains, as Proposition 2 indicates.
Proposition 2. (Charisopoulos & Maragos, 2018) Let f ∈ Rmax [x] be a tropical polynomial in d
variables. Let U F (ENewt (f )) be the points in the upper envelope of ENewt (f ), where upward
direction is taken with respect to the last dimension of Rd+1 . Then for each i ∈ [n] there exists a
linear region of f on which f (x) = aTi x + bi if and only if (aTi , bi ) is a vertex of U F (ENewt (f )).
Example 2. Using the polynomials from Example 1 we compute a reduced representation for f ∨ g.
f ∨ g = max(2x + y + 1, 0, x, y, 1) = max(2x + y + 1, x, y, 1)
Indeed, the significant terms correspond to the vertices of U F (ENewt (f ∨ g)) shown in Fig. 1.
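The reduction of Example 2 can be reproduced with a small convex-hull computation: keep only the hull facets whose outward normal points upward in the bias coordinate. A sketch assuming SciPy's ConvexHull follows.

    import numpy as np
    from scipy.spatial import ConvexHull

    # Points of ENewt(f v g) = conv{(2,1,1), (0,0,0), (1,0,0), (0,1,0), (0,0,1)}.
    pts = np.array([[2, 1, 1], [0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)

    hull = ConvexHull(pts)
    upper = set()
    for facet, eq in zip(hull.simplices, hull.equations):
        if eq[2] > 1e-9:                # outward normal has positive last (bias) coordinate
            upper.update(facet.tolist())

    print(pts[sorted(upper)])           # (2,1,1), (1,0,0), (0,1,0), (0,0,1): the significant terms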

2.1 TROPICAL GEOMETRY OF NEURAL NETWORKS

Tropical geometry has the capability of expressing the mathematical structure of ReLU activated
neural networks. We review some of the basic properties of neural networks and introduce notation
that will be used in our analysis. For this purpose, we consider the ReLU activated neural network of
Fig. 2 with one hidden layer.

Network tropical equations The network of Fig. 2 consists of an input layer x = (x_1, ..., x_d), a hidden layer f = (f_1, ..., f_n) with ReLU activations, an output layer v = (v_1, ..., v_m) and two linear layers defined by the matrices A, C respectively. As illustrated in Fig. 2, the first linear layer has rows A_{i,:} = a_i^T with biases b_i, and the second has rows C_{j,:} = (c_{j1}, c_{j2}, ..., c_{jn}), as we ignore its biases. Furthermore, the output of the i-th component of the hidden layer f is computed as

    f_i(x) = max( Σ_{k=1}^{d} a_{ik} x_k + b_i , 0 ) = max(a_i^T x + b_i, 0)                    (2)

We deduce that each f_i is a tropical polynomial with two terms. It therefore follows that ENewt(f_i) is a line segment in R^{d+1}. The components of the output layer may be computed as

    v_j(x) = Σ_{i=1}^{n} c_{ji} f_i(x) = Σ_{c_{ji}>0} |c_{ji}| f_i(x) − Σ_{c_{ji}<0} |c_{ji}| f_i(x) = p_j(x) − q_j(x)                    (3)
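The decomposition in (3) can be verified numerically. The sketch below (NumPy assumed, with randomly drawn toy weights) splits a one-hidden-layer ReLU network into its positive and negative tropical polynomials and checks that their difference reproduces the forward pass.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, m = 3, 5, 2
    A, b, C = rng.normal(size=(n, d)), rng.normal(size=n), rng.normal(size=(m, n))

    def forward(x):                       # the network of Fig. 2 (second-layer biases ignored)
        return C @ np.maximum(A @ x + b, 0.0)

    def tropical_split(x):                # v_j = p_j - q_j as in Eq. (3)
        f = np.maximum(A @ x + b, 0.0)    # hidden tropical polynomials f_i
        p = np.where(C > 0, C, 0.0) @ f   # p_j: positive-coefficient combination
        q = np.where(C < 0, -C, 0.0) @ f  # q_j: negative-coefficient combination (sign flipped)
        return p - q

    x = rng.normal(size=d)
    print(np.allclose(forward(x), tropical_split(x)))     # True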


Tropical rational functions In Eq. (3), the functions p_j, q_j are both linear combinations of {f_i} with positive coefficients, which implies that they are tropical polynomials. We conclude that every output node v_j can be written as a difference of two tropical polynomials, which is defined as a tropical rational function. This indicates that the output layer of the neural network of Fig. 2 is equivalent to a tropical rational mapping. In fact, this result holds for deeper networks in general, as demonstrated by the following theorem.

Theorem (Zhang et al., 2018). A ReLU activated deep neural network F : R^d → R^m is equivalent to a tropical rational mapping.

It is not known whether a tropical rational function r(x) admits an efficient geometric representation that determines its values {r(x)} for x ∈ R^d, as holds for tropical polynomials with their polytopes in Proposition 2. For this reason, we choose to work separately on the polytopes of the tropical polynomials p_j, q_j.

[Figure 2: diagram of the network with input nodes x_1, ..., x_d, hidden ReLU nodes f_1, ..., f_n and output nodes v_1, ..., v_m.]

Figure 2: Neural network with one hidden ReLU layer. The first linear layer has weights {a_i^T} with bias {b_i} corresponding to the i-th node, ∀i ∈ [n], and the second has weights {c_ji}, ∀j ∈ [m], i ∈ [n].

Zonotopes Zonotopes are defined as the Minkowski sum of a finite set of line segments. They are a special
case of polytopes that occur as a building block for our
network. These geometrical structures provide a representation of the polynomials pj , qj in (3)
that further allows us to build our compression algorithms. We use the notation Pj , Qj for the
extended Newton polytopes of tropical polynomials pj , qj , respectively. Notice from (3) that for each
component vj of the output pj , qj are written as linear combinations of tropical polynomials that
correspond to linear segments. Thus Pj and Qj are zonotopes. We call Pj the positive zonotope,
corresponding to the positive polynomial pj and Qj the negative one.

Zonotope Generators Each neuron of the hidden layer represents geometrically a line segment
contributing to the positive or negative zonotope. We thus call these line segments generators of the
zonotope. The generators further receive the characterization positive or negative depending on the
zonotope they contribute to. It is intuitive to expect that a zonotope gets more complex as its number
of generators increases. In fact, each vertex of the zonotope can be computed as the sum of vertices of

the generators, where we choose a vertex from each generating line segment, either 0 or c_ji(a_i^T, b_i).
We summarize the above with the following extension of (Charisopoulos & Maragos, 2018).
Proposition 3. P_j, Q_j are zonotopes in R^{d+1}. For each vertex v of P_j there exists a subset of indices I_+ of {1, 2, ..., n} with c_ji > 0, ∀i ∈ I_+, such that v = Σ_{i∈I_+} c_ji (a_i^T, b_i). Similarly, a vertex u of Q_j can be written as u = Σ_{i∈I_−} c_ji (a_i^T, b_i), where I_− corresponds to c_ji < 0, ∀i ∈ I_−.

3 APPROXIMATION OF TROPICAL POLYNOMIALS

In this section we present our central theorem that bounds the error between the original and
approximate neural network, when both have the architecture of Fig. 2. To achieve this we need
to derive a bound for the error of approximating a simpler functional structure, namely the tropical
polynomials that represent the neural network. The motivation behind the geometrical bounding of
the error of the polynomials is Proposition 2. It indicates that a polynomial’s values are determined at
each point of the input space by the vertices of the upper envelope of its extended Newton polytope.
Therefore, it is expected that two tropical polynomials with approximately equal extended Newton
polytopes should attain similar values. In fact, this serves as the intuition for our theorem. The metric
we use to define the distance between extended Newton polytopes is the Hausdorff distance.

Hausdorff distance The distance of a point u ∈ Rd from the finite set V ⊂ Rd is denoted by either
dist(u, V) or dist(V, u) and computed as min_{v∈V} ‖u − v‖, which is the Euclidean distance of u

from its closest point v ∈ V. The Hausdorff distance H(V, U) of two finite point sets V, U ⊂ Rd is
defined as

    H(V, U) = max{ max_{v∈V} dist(v, U), max_{u∈U} dist(V, u) }

Let P, P̃ be two polytopes with their vertex sets denoted by V_P, V_P̃ respectively. We define the Hausdorff distance H(P, P̃) of the two polytopes as the Hausdorff distance of their respective vertex sets V_P, V_P̃. Namely,

    H(P, P̃) := H(V_P, V_P̃)                    (4)
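For finite vertex sets, definition (4) takes only a few lines; the following is a minimal sketch assuming NumPy, with the rows of V and U holding the points.

    import numpy as np

    def hausdorff(V, U):
        """Hausdorff distance between finite point sets V (p x k) and U (q x k)."""
        D = np.linalg.norm(V[:, None, :] - U[None, :, :], axis=-1)   # pairwise Euclidean distances
        return max(D.min(axis=1).max(), D.min(axis=0).max())

    # Example: two vertex sets in the plane.
    V = np.array([[0.0, 0.0], [1.0, 0.0]])
    U = np.array([[0.0, 0.1], [1.0, 0.0], [2.0, 0.0]])
    print(hausdorff(V, U))   # 1.0, driven by the point (2, 0) of U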
Clearly, the Hausdorff distance is a metric of how close two polytopes are to each other. Indeed, it
becomes zero when the two polytopes coincide. According to this metric, the following bound on the
error of tropical polynomial approximation is derived.
Proposition 4. Let p, p̃ ∈ Rmax[x] be two tropical polynomials and let P = ENewt(p), P̃ = ENewt(p̃). Then,

    max_{x∈B} |p(x) − p̃(x)| ≤ ρ · H(P, P̃)

where B = {x ∈ R^d : ‖x‖ ≤ r} is the hypersphere of radius r, and ρ = √(r² + 1).

The above proposition enables us to handle the more general case of neural networks with one hidden
layer, which are equivalent to tropical rational mappings. By repeatedly applying Proposition 4 to
each tropical polynomial corresponding to the networks, we get the following bound.
Theorem 1. Let v, ṽ ∈ Rmax[x] be two neural networks with architecture as in Fig. 2. With P̃_j, Q̃_j we denote the positive and negative zonotopes of ṽ. The following bound holds:

    max_{x∈B} ‖v(x) − ṽ(x)‖₁ ≤ ρ · Σ_{j=1}^{m} [ H(P_j, P̃_j) + H(Q_j, Q̃_j) ]

Remark 1. The reason we choose to compute the approximation error on a bounded hypersphere B is twofold. Firstly, on an unbounded domain the difference of linear terms generally diverges to infinity and, secondly, in practice the working subspace of our dataset is usually bounded, e.g. images.
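As a hedged numerical sanity check of Proposition 4 (and of the flavor of Theorem 1 for a single pair of polynomials), the sketch below builds two nearby tropical polynomials, computes the Hausdorff distance of the vertex sets of their extended Newton polytopes via a convex hull, and compares ρ · H(P, P̃) with the empirical error on samples of B. The random weights, perturbation size and use of SciPy are assumptions for illustration only.

    import numpy as np
    from scipy.spatial import ConvexHull

    def vertices(points):
        return points[ConvexHull(points).vertices]        # vertex set of conv(points)

    def hausdorff(V, U):
        D = np.linalg.norm(V[:, None, :] - U[None, :, :], axis=-1)
        return max(D.min(axis=1).max(), D.min(axis=0).max())

    rng = np.random.default_rng(1)
    d, n, r = 2, 6, 2.0
    A,  b  = rng.normal(size=(n, d)), rng.normal(size=n)
    A2, b2 = A + 0.05 * rng.normal(size=(n, d)), b + 0.05 * rng.normal(size=n)   # perturbed polynomial

    p     = lambda x: np.max(A  @ x + b)
    p_til = lambda x: np.max(A2 @ x + b2)

    P, P_til = vertices(np.column_stack([A, b])), vertices(np.column_stack([A2, b2]))
    bound = np.sqrt(r ** 2 + 1) * hausdorff(P, P_til)     # rho * H(P, P~)

    xs = rng.normal(size=(20000, d))
    xs = xs[np.linalg.norm(xs, axis=1) <= r]              # samples inside the ball B
    err = max(abs(p(x) - p_til(x)) for x in xs)
    print(err <= bound, float(err), float(bound))         # the bound holds empirically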

4 NEURAL NETWORK COMPRESSION ALGORITHMS


Compression problem formulation The tropical geometry theorem on the approximation of
neural networks enables us to derive compression algorithms for ReLU activated neural networks.
Suppose that we want to compress the neural network of Fig. 2 by reducing the number of neurons in
the hidden layer, from n to K. Let us assume that the output of the compressed network is the tropical
rational map ṽ = (ṽ1 , ..., ṽm ). Its j−th component may be written as ṽj (x) = p̃j (x) − q̃j (x) where
using Proposition 3 the zonotopes of p̃_j, q̃_j are generated by c̃_ji(ã_i^T, b̃_i), ∀i. The generators need to be chosen in such a way that ṽ_j(x) ≈ v_j(x) for all x ∈ B. Due to Theorem 1 it suffices to find generators such that the resulting zonotopes have H(P_j, P̃_j), H(Q_j, Q̃_j) as small as possible, ∀j. We have thus formulated neural network compression as a geometrical zonotope approximation problem.

Our approaches Approximating a zonotope with fewer generators is a problem known as zonotope order reduction (Kopetzki et al., 2017). In our case we approach this problem by manipulating the zonotope generators c_ji(a_i^T, b_i), ∀i, j.¹ Each of the algorithms presented will create a subset of altered generators that approximate the original zonotopes. Ideally, we require the approximation to hold simultaneously for all positive and negative zonotopes of each output component v_j. However, this is not always possible, as in the case of multiclass neural networks, and it necessarily leads to heuristic manipulation.

¹ Dealing with the full generated zonotope would lead to exponential computational overhead.


[Figure 3 panels: (a) Original network. (b) Original zonotopes. (c) Resulting zonotopes. (d) Compressed network.]

Figure 3: Illustration of Zonotope K-means execution. The original zonotope P is generated by c_i(a_i^T, b_i) for i = 1, ..., 4 and the negative zonotope Q is generated by the remaining ones, i = 5, 6, 7. The approximation P̃ of P is colored in purple and generated by c̃_i(ã_i^T, b̃_i), i = 1, 2, where the first generator is the K-means center representing the generators 1, 2 of P and the second is the representative center of 3, 4. Similarly, the approximation Q̃ of Q is colored in green and defined by the generators c̃_i(ã_i^T, b̃_i), i = 3, 4, that stand as representative centers for {5, 6} and 7 respectively.

Our first attempt to tackle this problem applies the K-means algorithm to the positive and negative generators, separately. This method is restricted to single-output neural networks. Our second approach further develops this technique for multiclass neural networks. Specifically, it utilizes K-means on the vectors associated with the neural paths passing through a node in the hidden layer, as we define later. The algorithms we present refer to the neural network of Fig. 2 with one hidden layer, but we may apply them repeatedly to compress deeper networks.

4.1 ZONOTOPE APPROXIMATION

Zonotope K-means The first compression approach uses K-means to compress each zonotope of
the network, and covers only the case of a single-output neural network, e.g. as in Fig. 2 but with m = 1. The algorithm reduces the hidden layer size from n to K neurons. We use the notation c_i, i = 1, ..., n, for the weights of the second linear layer, connecting the hidden layer with the output
node. Algorithm 1 is presented below and a demonstration of its execution can be found in Fig. 3.

Algorithm 1: Zonotope K-means Compression

1. Split the generators into positive {c_i(a_i^T, b_i) : c_i > 0} and negative {c_i(a_i^T, b_i) : c_i < 0}.
2. Apply K-means with K/2 centers, separately for both sets of generators, and receive {c̃_i(ã_i^T, b̃_i) : c̃_i > 0}, {c̃_i(ã_i^T, b̃_i) : c̃_i < 0} as output.
3. Construct the final weights. For the first linear layer, the weights and the bias which correspond to the i-th neuron become the vector c̃_i(ã_i^T, b̃_i).
4. The weights of the second linear layer are set to 1 for every hidden layer neuron where c̃_i(ã_i^T, b̃_i) occurs from positive generators and −1 elsewhere.
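A possible implementation of Algorithm 1 is sketched below using scikit-learn's KMeans (any K-means routine would do). The sign bookkeeping, clustering the generators |c_i|(a_i^T, b_i) and restoring the sign through the ±1 weights of the new second layer, is our reading of steps 3-4, and the split of K between the two runs is a simplification.

    import numpy as np
    from sklearn.cluster import KMeans

    def zonotope_kmeans(A, b, c, K, seed=0):
        """A: (n, d) first-layer weights, b: (n,) biases, c: (n,) second-layer weights, K: target size.
        Returns the compressed layer (A_new, b_new, c_new) with c_new in {+1, -1}."""
        G = np.abs(c)[:, None] * np.column_stack([A, b])      # generators |c_i| * (a_i, b_i)
        rows, signs = [], []
        for mask, sign in [(c > 0, 1.0), (c < 0, -1.0)]:
            gens = G[mask]
            if len(gens) == 0:
                continue
            k = min(max(K // 2, 1), len(gens))                # roughly K/2 centers per sign
            km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(gens)
            rows.append(km.cluster_centers_)
            signs.append(sign * np.ones(k))
        rows, signs = np.vstack(rows), np.concatenate(signs)
        return rows[:, :-1], rows[:, -1], signs               # scaled slopes, biases, +/-1 weights

    # Toy usage: compress a 100-neuron single-output hidden layer down to ~10 neurons.
    rng = np.random.default_rng(0)
    A, b, c = rng.normal(size=(100, 4)), rng.normal(size=100), rng.normal(size=100)
    A2, b2, c2 = zonotope_kmeans(A, b, c, K=10)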

Proposition 5. Zonotope K-means produces a compressed neural network with output ṽ satisfying

    (1/ρ) · max_{x∈B} |v(x) − ṽ(x)| ≤ K · δ_max + (1 − 1/N_max) · Σ_{i=1}^{n} |c_i| ‖(a_i^T, b_i)‖

where K is the total number of centers used in both K-means runs, δ_max is the largest distance from a point to its corresponding cluster center and N_max is the maximum cardinality of a cluster.

The above proposition provides an upper bound between the original neural network and the one that
is approximated with Zonotope K-means. In particular, if we use K = n centers the bound of the


approximation error becomes 0, because then δmax = 0 and Nmax = 1. Also, if K ≈ 0 the bound gets
a fixed value depending on the magnitude of the weights of the linear layers.

4.2 MULTIPLE ZONOTOPE APPROXIMATION

The exact positive and negative zonotope approximation performed by the Zonotope K-means algorithm has a main disadvantage: it can only be used in single-output neural networks. Indeed, suppose that we want to employ the preceding algorithm to approximate the zonotopes of each output in a multiclass neural network. That would require 2m separate executions of K-means, which are not necessarily consistent. For instance, it is possible to have c_{j1,i} > 0 and c_{j2,i} < 0 for some output components v_{j1}, v_{j2}. That means that in the compression procedure of v_{j1}, the i-th neuron belongs to the positive generator set, while for v_{j2} it belongs to the negative one. This makes the two compressions incompatible. Moreover, the restriction to single outputs only allows us to compress the final ReLU layer and not any preceding ones.

Neural Path K-means To overcome this obstacle we apply a simultaneous approximation of the
zonotopes. The method is called Neural Path K-means and directly applies K-means to the vectors of weights (a_i^T, b_i, c_{1i}, ..., c_{mi}) associated to each neuron i of the hidden layer. The name of the
algorithm emanates from the fact that the vector associated to each neuron consists of the weights of
all the neural network paths passing from this neuron. The procedure is presented in Algorithm 2.

Algorithm 2: Neural Path K-means Compression

1. Apply K-means with K centers to the vectors (a_i^T, b_i, C_{:,i}^T), i = 1, ..., n, and get the centers (ã_i^T, b̃_i, C̃_{:,i}^T), i = 1, ..., K.
2. Construct the final weights. For the first linear layer matrix the i-th row becomes (ã_i^T, b̃_i), while for the second linear layer matrix the i-th column becomes C̃_{:,i}.
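Algorithm 2 translates almost directly into code. The sketch below (NumPy and scikit-learn assumed) clusters the per-neuron path vectors and rebuilds both layers; it also returns the cluster assignments so that null generators (defined below) can be inspected afterwards.

    import numpy as np
    from sklearn.cluster import KMeans

    def neural_path_kmeans(A, b, C, K, seed=0):
        """A: (n, d) first-layer weights, b: (n,) biases, C: (m, n) second-layer weights."""
        paths = np.column_stack([A, b, C.T])                   # row i = (a_i, b_i, C[:, i])
        km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(paths)
        centers, d = km.cluster_centers_, A.shape[1]
        A_new, b_new, C_new = centers[:, :d], centers[:, d], centers[:, d + 1:].T
        return A_new, b_new, C_new, km.labels_

    # Toy usage: a hidden layer with n = 100 neurons and m = 3 outputs compressed to K = 10.
    rng = np.random.default_rng(0)
    A, b, C = rng.normal(size=(100, 4)), rng.normal(size=100), rng.normal(size=(3, 100))
    A2, b2, C2, labels = neural_path_kmeans(A, b, C, K=10)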

Null Generators Neural Path K-means does not apply compression directly to each zonotope of the network, but is rather a heuristic approach for this task. More precisely, if we focus on the set of generators of the zonotopes of output j, Neural Path K-means might mix positive and negative generators together in the same cluster. For instance, suppose (ã_k^T, b̃_k, C̃_{:,k}^T) is the cluster center corresponding to the vectors (a_i^T, b_i, C_{:,i}^T) for i ∈ I. Then, regarding output j, it is not necessary that all c_ji, i ∈ I, have the same sign. Thus, the compressed positive zonotope P̃_j might contain generators of the original negative zonotope Q_j and vice versa. We call generators c_ji(a_i^T, b_i) contributing to opposite zonotopes null generators.
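Given the cluster assignments and the compressed second-layer matrix from the previous sketch, the null generators N_j can be listed with a sign comparison; this is again a hedged illustration rather than part of the algorithm.

    import numpy as np

    def null_generators(C, C_new, labels):
        """For each output j, the hidden neurons i whose weight c_ji has the opposite sign of the
        weight c~_jk of their assigned cluster center k = labels[i] (the null generators N_j)."""
        center_signs = np.sign(C_new[:, labels])               # (m, n): sign of each neuron's center
        return [np.nonzero(np.sign(C[j]) * center_signs[j] < 0)[0] for j in range(C.shape[0])]

    # Continuing the example above: one index array per output component.
    for j, Nj in enumerate(null_generators(C, C2, labels)):
        print(f"output {j}: {len(Nj)} null generators")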
Proposition 6. Neural Path K-means produces a compressed neural network with output ṽ satisfying

    (1/ρ) · max_{x∈B} ‖v(x) − ṽ(x)‖₁ ≤ √m K δ_max + √m (2 − 1/N_max) Σ_{i=1}^{n} ‖C_{:,i}‖ ‖(a_i^T, b_i)‖
        + (√m δ_max / N_min) Σ_{i=1}^{n} ( ‖(a_i^T, b_i)‖ + ‖C_{:,i}‖ ) + Σ_{j=1}^{m} Σ_{i∈N_j} |c_ji| ‖(a_i^T, b_i)‖

where K is the number of K-means clusters, δ_max the maximum distance from any point to its corresponding cluster center, N_max, N_min the maximum and minimum cardinality of a cluster respectively, and N_j the set of null generators with respect to output j.

The performance of Neural Path K-means is evaluated with Proposition 6. The result we deduce is analogous to that of Zonotope K-means. The bound of the approximation error becomes zero when K
approaches n. Indeed, for K = n we get δmax = 0, Nmax = 1 and Nj = ∅, ∀j ∈ [m]. For lower
values of K, the upper bound reaches a value depending on the magnitude of the weights of the linear
layers together with weights corresponding to null generators.


Table 1: Reporting accuracy of compressed networks for single-output compression methods.

Percentage of      |                        MNIST 3/5                                 |                        MNIST 4/9
remaining neurons  | (Smyrnis et al., 2020) | Zonotope K-means | Neural Path K-means | (Smyrnis et al., 2020) | Zonotope K-means | Neural Path K-means
100% (Original)    | 99.18 ± 0.27           | 99.38 ± 0.09     | 99.38 ± 0.09        | 99.53 ± 0.09           | 99.53 ± 0.09     | 99.53 ± 0.09
5%                 | 99.12 ± 0.37           | 99.42 ± 0.07     | 99.25 ± 0.04        | 98.99 ± 0.09           | 99.52 ± 0.09     | 99.48 ± 0.15
1%                 | 99.11 ± 0.36           | 99.39 ± 0.05     | 99.32 ± 0.03        | 99.01 ± 0.09           | 99.46 ± 0.05     | 99.35 ± 0.17
0.5%               | 99.18 ± 0.36           | 99.41 ± 0.05     | 99.22 ± 0.11        | 98.81 ± 0.09           | 99.35 ± 0.24     | 98.84 ± 1.18
0.3%               | 99.18 ± 0.36           | 99.25 ± 0.37     | 99.19 ± 0.41        | 98.81 ± 0.09           | 98.22 ± 1.38     | 98.22 ± 1.33

Table 2: Reporting theoretical upper bounds of Propositions 5, 6.

Percentage of      |              MNIST 3/5                  |              MNIST 4/9
remaining neurons  | Zonotope K-means | Neural Path K-means  | Zonotope K-means | Neural Path K-means
100%               | 0.00             | 0.00                 | 0.00             | 0.00
10%                | 17.07            | 246.74               | 18.85            | 229.37
2.5%               | 15.35            | 59.42                | 17.02            | 63.37
1%                 | 14.79            | 42.22                | 16.44            | 45.58
0.5%               | 14.57            | 36.47                | 16.20            | 39.71

5 EXPERIMENTS

We conduct experiments on compressing the linear layers of convolutional neural networks. Our
experiments serve as proof-of-concept and indicate that our theoretical claims indeed hold in practice.
The heart of our contribution lies in presenting a novel tropical geometric background for neural network approximation, which we hope will shed light on further research in tropical mathematics.
Our methods compress the linear layers of the network layer by layer. They perform a functional
approximation of the original network and thus they are applicable for both classification and
regression tasks. To compare them with other techniques in the literature we choose methods with
similar structure, i.e. structured pruning techniques without re-training. For example, Alfarra et al.
(2020) proposed a compression algorithm based on the sparsification of the matrices representing the
zonotopes, which served as an inspiration for part of our work. However, their method is unstructured and therefore not directly comparable. The methods we choose for comparison are two tropical methods for single-output (Smyrnis et al., 2020) and multi-output (Smyrnis & Maragos, 2020) networks, Random and L1 Structured pruning, and a modification of ThiNet (Luo et al., 2017) adapted to linear layers. Smyrnis et al. (2020); Smyrnis & Maragos (2020) proposed a novel tropical division framework aimed at the reduction of zonotope vertices. The Random method prunes neurons uniformly at random, while L1 Structured prunes those with the lowest L1 norm of their weights. ThiNet uses a greedy criterion for discarding the neurons that have the smallest contribution to the output of the network.

MNIST Dataset, Pairs 3-5 and 4-9 The first experiment is performed on the binary classification tasks of pairs 3/5 and 4/9 of the MNIST dataset, so both of our proposed methods can be utilized. In Table 1, we compare our methods with the tropical geometrical approach of Smyrnis et al. (2020).
Their method is based on a tropical division framework for network minimization. For fair comparison,
we use the same CNN with two fully connected layers and hidden layer of size 1000. According to
Table 1, our techniques seem to have similar performance. They retain the accuracy of the network
while reducing its size. Moreover, in Table 2 we include an experimental computation of the theoretical bounds provided by Propositions 5 and 6. We notice that the bounds decrease as the number of remaining neurons decreases. We expected the bounds to grow, because the fewer weights we keep, the worse the compression and the larger the error. However, the opposite holds, which means that the bounds are tighter for higher pruning rates. It is also important to mention that the bounds become 0 when we keep all the weights, as expected.

MNIST and Fashion-MNIST Datasets For the second experiment we employ MNIST and
Fashion-MNIST datasets. The corresponding classification task is multiclass, so only Neural Path K-means can be applied. In Table 3, we compare it with the multiclass tropical method of
Smyrnis & Maragos (2020) using the same CNN architecture they do. Furthermore, in plots 4a,
4b we compare Neural Path K-means with ThiNet and baseline pruning methods by compressing


Table 3: Reporting accuracy of compressed networks for multiclass compression methods.

Percentage of      |                    MNIST                          |                Fashion-MNIST
remaining neurons  | (Smyrnis & Maragos, 2020) | Neural Path K-means   | (Smyrnis & Maragos, 2020) | Neural Path K-means
100% (Original)    | 98.60 ± 0.03              | 98.61 ± 0.11          | 88.66 ± 0.54              | 89.52 ± 0.19
50%                | 96.39 ± 1.18              | 98.13 ± 0.28          | 83.30 ± 2.80              | 88.22 ± 0.32
25%                | 95.15 ± 2.36              | 98.42 ± 0.42          | 82.22 ± 2.85              | 86.67 ± 1.12
10%                | 93.48 ± 2.57              | 96.89 ± 0.55          | 80.43 ± 3.27              | 86.04 ± 0.94
5%                 | 92.93 ± 2.59              | 96.31 ± 1.29          | −                         | 83.68 ± 1.06

LeNet5 (LeCun et al., 1998). To get a better idea of how our method performs in deeper architectures
we provide plots 4c,4d that illustrate the performance of compressing a deep neural network with
layers of size 28 ∗ 28, 512, 256, 128 and 10, which we refer to as deepNN. The compression is
executed on all hidden layers beginning from the input and heading to the output. From Table 3, we
deduce that our method performs better than (Smyrnis & Maragos, 2020). Also, it achieves higher accuracy scores and exhibits lower variance, as shown in plots 4a-4d. Neural Path K-means, overall,
seems to have good performance, even competitive to ThiNet. Its worst performance occurs on low
percentages of remaining weights. An explanation for this is that K-means provides a high-quality
compression as long as the number of centers is not less than the number of "real" clusters.

[Figure 4 panels: (a) LeNet5, MNIST. (b) LeNet5, F-MNIST. (c) deepNN, MNIST. (d) deepNN, F-MNIST. (e) AlexNet, CIFAR10. (f) CIFAR-VGG, CIFAR10. (g) AlexNet, CIFAR100. (h) CIFAR-VGG, CIFAR100.]

Figure 4: Neural Path K-means compared with baseline pruning methods and ThiNet. The horizontal axis shows the ratio of remaining neurons in each hidden layer of the fully connected part.

CIFAR Dataset We conduct our final experiment on CIFAR datasets using CIFAR-VGG (Blalock
et al., 2020) and an altered version of AlexNet adapted for CIFAR. The resulting plots are shown in
Fig. 4e-4h. We deduce that Neural Path K-means retains good performance on larger datasets. In particular, in most cases it has slightly better accuracy and lower deviation than the baselines, but behaves worse when keeping almost zero weights.

6 CONCLUSIONS AND FUTURE WORK

We presented a novel theorem bounding the approximation error between two neural networks. The theorem follows from bounding the difference of the tropical polynomials representing the neural networks via the Hausdorff distance of their extended Newton polytopes. We derived geometrical compression algorithms for the fully connected parts of ReLU activated deep neural networks, while their application to convolutional layers is ongoing work. Our algorithms seem to perform well in practice and motivate further research in the direction revealed by tropical geometry.


REFERENCES
Motasem Alfarra, Adel Bibi, Hasan Hammoud, Mohamed Gaafar, and Bernard Ghanem. On the
decision boundaries of deep neural networks: A tropical geometry perspective. arXiv preprint
arXiv:2002.08838, 2020.
Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of
neural network pruning? arXiv preprint arXiv:2003.03033, 2020.
Peter Butkovič. Max-linear systems: theory and algorithms. Springer monographs in mathematics.
Springer, 2010. ISBN 978-1-84996-298-8.
Vasileios Charisopoulos and Petros Maragos. Morphological perceptrons: geometry and training
algorithms. In International Symposium on Mathematical Morphology and Its Applications to
Signal and Image Processing, pp. 3–15. Springer, 2017.
Vasileios Charisopoulos and Petros Maragos. A tropical approach to neural networks with piecewise
linear activations. arXiv preprint arXiv:1805.08749, 2018.
Raymond A Cuninghame-Green. Minimax algebra, volume 166. Springer Science & Business
Media, 2012.
Nikolaos Dimitriadis and Petros Maragos. Advances in morphological neural networks: Training,
pruning and enforcing shape constraints. In Proc. 46th IEEE Int’l Conf. Acoustics, Speech and
Signal Processing (ICASSP-2021), Toronto, June 2021.
Richard James Duffin and Elmor L Peterson. Geometric programming with signomials. Journal of
Optimization Theory and Applications, 11(1):3–35, 1973.
Branko Grünbaum. Convex polytopes, volume 221. Springer Science & Business Media, 2013.
Anna-Kathrin Kopetzki, Bastian Schürmann, and Matthias Althoff. Methods for order reduction of
zonotopes. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pp. 5626–5633.
IEEE, 2017.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A Filter Level Pruning Method for Deep Neural
Network Compression. In 2017 IEEE International Conference on Computer Vision (ICCV), pp.
5068–5076, 2017. doi: 10.1109/ICCV.2017.541.
Diane Maclagan and Bernd Sturmfels. Introduction to tropical geometry, volume 161. American
Mathematical Soc., 2015.
Petros Maragos. Dynamical systems on weighted lattices: general theory. Mathematics of Control,
Signals, and Systems, 29(4):1–49, 2017.
Petros Maragos and Emmanouil Theodosis. Multivariate tropical regression and piecewise-linear
surface fitting. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pp. 3822–3826. IEEE, 2020.
Petros Maragos, Vasileios Charisopoulos, and Emmanouil Theodosis. Tropical geometry and machine
learning. Proceedings of the IEEE, 109(5):728–755, 2021. doi: 10.1109/JPROC.2021.3065238.
Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear
regions of deep neural networks. arXiv preprint arXiv:1402.1869, 2014.
Georgios Smyrnis and Petros Maragos. Multiclass neural network minimization via tropical newton
polytope approximation. In Proc. Int’l Conf. on Machine Learning, PMLR, 2020.
Georgios Smyrnis, Petros Maragos, and George Retsinas. Maxpolynomial division with application to
neural network simplification. In ICASSP 2020-2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pp. 4192–4196. IEEE, 2020.


Emmanouil Theodosis and Petros Maragos. Analysis of the viterbi algorithm using tropical algebra
and geometry. In 2018 IEEE 19th International Workshop on Signal Processing Advances in
Wireless Communications (SPAWC), pp. 1–5. IEEE, 2018.
Emmanouil Theodosis and Petros Maragos. Tropical modeling of weighted transducer algorithms on
graphs. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pp. 8653–8657, 2019. doi: 10.1109/ICASSP.2019.8683127.
Nikos Tsilivis, Anastasios Tsiamis, and Petros Maragos. Sparsity in max-plus algebra and applications
in multivariate convex regression. In Proc. 46th IEEE Int’l Conf. Acoustics, Speech and Signal
Processing (ICASSP-2021), Toronto, June 2021.
Liwen Zhang, Gregory Naitzat, and Lek-Heng Lim. Tropical geometry of deep neural networks. In
International Conference on Machine Learning, pp. 5824–5832. PMLR, 2018.
Günter M Ziegler. Lectures on polytopes, volume 152. Springer Science & Business Media, 2012.


A PROOFS FOR THE SECTION "BACKGROUND ON TROPICAL GEOMETRY"


A.1 PROOF OF PROPOSITION 3

Proof. The first argument follows from the fact that both p_j, q_j are linear combinations of tropical polynomials consisting of two terms. Indeed, we compute

    p_j(x) = Σ_{c_ji>0} c_ji max(a_i^T x + b_i, 0) = Σ_{c_ji>0} max(c_ji a_i^T x + c_ji b_i, 0)

and, by Proposition 1,

    P_j = ⊕_{c_ji>0} ENewt( max(c_ji a_i^T x + c_ji b_i, 0) )

Each ENewt(max(c_ji a_i^T x + c_ji b_i, 0)) is a line segment with endpoints 0 and (c_ji a_i^T, c_ji b_i) = c_ji(a_i^T, b_i). Therefore P_j is written as the Minkowski sum of line segments, which is a zonotope by definition. Similarly Q_j is a zonotope.

Furthermore, from the definition of the Minkowski sum, each point v ∈ P_j may be written as Σ_{c_ji>0} v_i, where each v_i is a point in the segment ENewt(max(c_ji a_i^T x + c_ji b_i, 0)). A vertex of P_j can only occur if v_i is an extreme point of ENewt(max(c_ji a_i^T x + c_ji b_i, 0)) for every i, which is equivalent to either v_i = 0 or v_i = c_ji(a_i^T, b_i). This means that every vertex of P_j corresponds to a subset I_+ ⊆ [n] of indices i with c_ji > 0, for which we choose v_i = c_ji(a_i^T, b_i), and for the rest it holds that v_i = 0. Thus,

    v = Σ_{i∈I_+} c_ji (a_i^T, b_i)

In the same way we derive the analogous result for the negative zonotope Q_j.
Corollary 2. The geometric result concerning the structure of zonotopes can be extended to max-pooling layers. For instance, a max-pooling layer of size 2 × 2 corresponds to a polytope that is constructed as the Minkowski sum of pyramids, which can be seen as a generalization of a zonotope.

B PROOFS FOR THE SECTION "APPROXIMATION OF TROPICAL POLYNOMIALS"

B.1 PROOF OF PROPOSITION 4

Proof. Consider a point x ∈ B and assume that p(x) = a^T x + b, p̃(x) = c^T x + d. Then,

    p(x) − p̃(x) = p(x) − max_{(ã^T, b̃)∈V_P̃} {ã^T x + b̃} ≤ (a^T x + b) − (u^T, v)·(x, 1)^T

where (u^T, v) may be any vertex of P̃. Similarly, we derive

    (r^T, s)·(x, 1)^T − (c^T x + d) ≤ p(x) − p̃(x)

for any vertex (r^T, s) of P. Therefore, we may select the vertices (u^T, v) ∈ P̃, (r^T, s) ∈ P so that their respective distances from (a^T, b) and (c^T, d) are minimized. Choosing them in such a way gives

    p(x) − p̃(x) ≤ (a^T x + b) − (u^T, v)·(x, 1)^T = ((a^T, b) − (u^T, v))·(x, 1)^T
                ≤ ‖(a^T, b) − (u^T, v)‖ ‖(x, 1)‖ ≤ dist((a^T, b), P̃) √(r² + 1)                    (5)

In a similar manner we deduce

    p(x) − p̃(x) ≥ (r^T, s)·(x, 1)^T − (c^T x + d) = ((r^T, s) − (c^T, d))·(x, 1)^T
                ≥ −‖(r^T, s) − (c^T, d)‖ ‖(x, 1)‖ ≥ −dist(P, (c^T, d)) √(r² + 1)                    (6)

Notice that for the relations (5) and (6) we used the Cauchy-Schwartz inequality

    |⟨x, y⟩| ≤ ‖x‖‖y‖ ⇔ −‖x‖‖y‖ ≤ ⟨x, y⟩ ≤ ‖x‖‖y‖

Inequality (5) holds at any point x ∈ B for some vertex (a^T, b) ∈ P, therefore

    p(x) − p̃(x) ≤ ρ · max_{(a^T, b)∈V_P} dist((a^T, b), V_P̃)                    (7)

for all x ∈ B. Similarly, we derive

    p(x) − p̃(x) ≥ min_{(c^T, d)∈V_P̃} ( −ρ · dist(V_P, (c^T, d)) ) = − max_{(c^T, d)∈V_P̃} ρ · dist(V_P, (c^T, d))                    (8)

Combining (7) and (8) gives

    − max_{(c^T, d)∈V_P̃} ρ · dist(V_P, (c^T, d)) ≤ p(x) − p̃(x) ≤ max_{(a^T, b)∈V_P} ρ · dist((a^T, b), V_P̃) ⇔

    |p(x) − p̃(x)| ≤ ρ · max{ max_{(a^T, b)∈V_P} dist((a^T, b), V_P̃), max_{(c^T, d)∈V_P̃} dist(V_P, (c^T, d)) }

Hence, from the definition of the Hausdorff distance of two polytopes we derive the desired upper bound

    |p(x) − p̃(x)| ≤ ρ · H(P, P̃), ∀x ∈ B ⇒ max_{x∈B} |p(x) − p̃(x)| ≤ ρ · H(P, P̃)

Remark 2. Note that with similar proof one may replace the Hausdorff distance of the two polytopes
by the Hausdorff distance of their upper envelopes. This makes our theorem an exact generalization
of Proposition 2. However, this format is difficult to use in practice, because it is computationally
harder to determine the vertices of the upper envelope.

B.2 PROOF OF THEOREM 1

Proof. Notice that we may write

    ‖v(x) − ṽ(x)‖₁ = Σ_{j=1}^{m} |v_j(x) − ṽ_j(x)| = Σ_{j=1}^{m} |(p_j(x) − q_j(x)) − (p̃_j(x) − q̃_j(x))|
        = Σ_{j=1}^{m} |(p_j(x) − p̃_j(x)) − (q_j(x) − q̃_j(x))| ≤ Σ_{j=1}^{m} ( |p_j(x) − p̃_j(x)| + |q_j(x) − q̃_j(x)| )

Thus from Proposition 4 we derive

    max_{x∈B} ‖v(x) − ṽ(x)‖₁ ≤ ρ · Σ_{j=1}^{m} [ H(P_j, P̃_j) + H(Q_j, Q̃_j) ]

C PROOFS AND FIGURES FOR THE SECTION "NEURAL NETWORK COMPRESSION ALGORITHMS"

C.1 ILLUSTRATION OF ZONOTOPE K-MEANS

Below we present, in Fig. 5, a larger version of Fig. 3 demonstrating the execution of Zonotope K-means.


[Figure 5 panels: (a) Original network. (b) Minimized network. (c) Original zonotopes. (d) Approximating zonotopes.]

Figure 5: Illustration of Zonotope K-means execution. The original zonotope P is generated by c_i(a_i^T, b_i) for i = 1, ..., 4 and the negative zonotope Q is generated by the remaining ones, i = 5, 6, 7. The approximation P̃ of P is colored in purple and generated by c̃_i(ã_i^T, b̃_i), i = 1, 2, where the first generator is the K-means center representing the generators 1, 2 of P and the second is the representative center of 3, 4. Similarly, the approximation Q̃ of Q is colored in green and defined by the generators c̃_i(ã_i^T, b̃_i), i = 3, 4, that stand as representative centers for {5, 6} and 7 respectively.

C.2 PROOF OF PROPOSITION 5

Proof. We remind that for the output functions it holds that

    v(x) = p(x) − q(x),    ṽ(x) = p̃(x) − q̃(x)

From the triangle inequality we deduce

    |v(x) − ṽ(x)| = |p(x) − q(x) − (p̃(x) − q̃(x))| ≤ |p(x) − p̃(x)| + |q(x) − q̃(x)|

By Proposition 4, |p(x) − p̃(x)| and |q(x) − q̃(x)| are bounded by H(P, P̃) and H(Q, Q̃) respectively. Therefore, it suffices to get an upper bound for these Hausdorff distances. Let us consider any vertex u = Σ_{i∈I_+} c_i (a_i^T, b_i) of P. For the vertex u ∈ P we need to choose a vertex v ∈ P̃ as close to u as possible, in order to provide an upper bound for dist(u, P̃). Vertex v is selected as follows. For each i ∈ I_+ we select k such that c̃_k(ã_k^T, b̃_k) is the center of the cluster where c_i(a_i^T, b_i) belongs. We denote the set of such clusters by C_+, where each cluster center k appears only once. Then, vertex v is constructed as v = Σ_{k∈C_+} c̃_k(ã_k^T, b̃_k) ∈ P̃. We have that:

    dist(u, P̃) ≤ ‖ Σ_{i∈I_+} c_i(a_i^T, b_i) − Σ_{k∈C_+} c̃_k(ã_k^T, b̃_k) ‖
               ≤ Σ_{k∈C_+} ‖ Σ_{i∈I_{k+}} c_i(a_i^T, b_i) − c̃_k(ã_k^T, b̃_k) ‖
               ≤ Σ_{k∈C_+} Σ_{i∈I_{k+}} ‖ c_i(a_i^T, b_i) − c̃_k(ã_k^T, b̃_k) / |I_{k+}| ‖
               = Σ_{k∈C_+} Σ_{i∈I_{k+}} ‖ c_i(a_i^T, b_i) − ( c_i(a_i^T, b_i) + ε_i ) / |I_{k+}| ‖
               ≤ Σ_{k∈C_+} Σ_{i∈I_{k+}} [ (1 − 1/|I_{k+}|) |c_i| ‖(a_i^T, b_i)‖ + ‖ε_i‖ / |I_{k+}| ]
               ≤ |C_+| · δ_max + (1 − 1/N_max) Σ_{i∈I_+} |c_i| ‖(a_i^T, b_i)‖

where we denote by I_{k+} the set of indices i ∈ I_+ that belong to the center k ∈ C_+, and ε_i = c̃_k(ã_k^T, b̃_k) − c_i(a_i^T, b_i) is the difference of the i-th generator from its corresponding K-means cluster center.

The maximum value of the upper bound occurs when I_+ contains all indices that correspond to c_i > 0. This value gives us an upper bound for max_{u∈V_P} dist(u, P̃). To compute an upper bound for max_{v∈V_P̃} dist(P, v) we assume v = Σ_{k∈C_+} c̃_k(ã_k^T, b̃_k) and consider the vertex Σ_{i∈I_+} c_i(a_i^T, b_i) ∈ P, where I_+ is the set of indices of positive generators corresponding to the union of all clusters corresponding to the centers of C_+. Note that the occurring distance

    ‖ Σ_{i∈I_+} c_i(a_i^T, b_i) − Σ_{k∈C_+} c̃_k(ã_k^T, b̃_k) ‖

was taken into account when computing the upper bound for max_{u∈V_P} dist(u, P̃), and thus both values obtain the same upper bound. Therefore,

    H(P, P̃) ≤ K_+ · δ_max + (1 − 1/N_max) Σ_{i∈I_+} |c_i| ‖(a_i^T, b_i)‖

where K_+ is the number of cluster centers corresponding to P̃ and I_+ the indices corresponding to all positive generators of P. Similarly,

    H(Q, Q̃) ≤ K_− · δ_max + (1 − 1/N_max) Σ_{i∈I_−} |c_i| ‖(a_i^T, b_i)‖

where K_−, I_− are defined in an analogous way for the negative zonotope. Combining the relations gives the desired bound:

    (1/ρ) · |v(x) − ṽ(x)| ≤ H(P, P̃) + H(Q, Q̃) ≤ K · δ_max + (1 − 1/N_max) Σ_{i=1}^{n} |c_i| ‖(a_i^T, b_i)‖


C.3 ILLUSTRATION FOR NEURAL PATH K-MEANS ALGORITHM

Below we illustrate the vectors on which K-means is applied in the multi-output case. The vectors that are compressed consist of the weights of all the edges associated to a hidden layer neuron. The corresponding edges contain all the possible neural paths that begin at some node of the input, end at some node of the output and pass through this hidden node.

[Figure 6: diagram of the one-hidden-layer network with the weights entering and leaving the i-th hidden node highlighted.]

Figure 6: Neural Path K-means for multi-output neural network compression. In green color we highlight the weights corresponding to the i-th vector used by Neural Path K-means.

In the main text we defined the null generators of the zonotopes that occur from the execution of Neural Path K-means. Below we provide an illustration of them.

[Figure 7: clusters of generator points, with the points labeled "Null Generators" highlighted.]

Figure 7: Visualization of K-means in R^{d+1+n}, where d is the input dimension and n the hidden layer size. We color points according to the j-th output component of the network. Black and white points correspond to generators of P_j and Q_j respectively. White vertices in positive (brown) clusters and black vertices in negative (blue) clusters are null generators regarding the j-th output.

C.4 PROOF OF PROPOSITION 6

Proof. Let us first focus on a single output, say the j-th output. As in the proof for Zonotope K-means, we will bound H(P_j, P̃_j), H(Q_j, Q̃_j) for all j ∈ [m]. From the triangle inequality we get

    |v_j(x) − ṽ_j(x)| ≤ |p_j(x) − p̃_j(x)| + |q_j(x) − q̃_j(x)|

Any vertex u ∈ P_j can be written as u = Σ_{i∈I_{j+}} c_ji(a_i^T, b_i), where the set of indices I_{j+} satisfies c_ji > 0, ∀i ∈ I_{j+}, and thus corresponds to positive generators. To choose a nearby vertex from P̃_j we perform the following. For each i ∈ I_{j+} we select the center (ã_k^T, b̃_k, C̃^{(k)T}) of the cluster where (a_i^T, b_i, C_{:,i}^T) belongs, only if c̃_jk > 0. Such a center only exists if c_ji(a_i^T, b_i) is not a null generator. Else, we choose the vector 0 as the representation. Each cluster center, or 0, is taken into account once and the vertex Σ_{k∈C_{j+}} c̃_jk(ã_k^T, b̃_k) ∈ P̃_j is formed. We derive:

    max_{u∈V_{P_j}} dist(u, P̃_j) ≤ ‖ Σ_{i∈I_{j+}} c_ji(a_i^T, b_i) − Σ_{k∈C_{j+}} c̃_jk(ã_k^T, b̃_k) ‖
        ≤ Σ_{k∈C_{j+}} ‖ Σ_{i∈I_{jk+}} c_ji(a_i^T, b_i) − c̃_jk(ã_k^T, b̃_k) ‖ + Σ_{i∈N_{j+}} |c_ji| ‖(a_i^T, b_i)‖
        ≤ Σ_{k∈C_{j+}} Σ_{i∈I_{jk+}} ‖ c_ji(a_i^T, b_i) − c̃_jk(ã_k^T, b̃_k) / |I_{jk+}| ‖ + Σ_{i∈N_{j+}} |c_ji| ‖(a_i^T, b_i)‖
        ≤ Σ_{k∈C_{j+}} Σ_{i∈I_{jk+}} ‖ c_ji(a_i^T, b_i) − (c_ji + ε_ji)( (a_i^T, b_i) + λ_i ) / |I_{jk+}| ‖ + Σ_{i∈N_{j+}} |c_ji| ‖(a_i^T, b_i)‖
        ≤ Σ_{k∈C_{j+}} Σ_{i∈I_{jk+}} [ |ε_ji| ‖λ_i‖ / |I_{jk+}| + (1 − 1/|I_{jk+}|) |c_ji| ‖(a_i^T, b_i)‖ ]
            + Σ_{k∈C_{j+}} Σ_{i∈I_{jk+}} [ ( |ε_ji| ‖(a_i^T, b_i)‖ + |c_ji| ‖λ_i‖ ) / |I_{jk+}| ] + Σ_{i∈N_{j+}} |c_ji| ‖(a_i^T, b_i)‖

where for all i ∈ I_{jk+} the i-th vector of K-means is represented by the k-th center, k ∈ C_{j+}. We also assumed C̃^{(i)} = C_{:,i} + ε^{(i)}, i.e. c̃_ji = c_ji + ε_ji, and (ã_i^T, b̃_i) = (a_i^T, b_i) + λ_i.

The maximum value of the upper bound occurs when I_{j+} contains all indices that correspond to c_ji > 0. To compute an upper bound for max_{v∈V_{P̃_j}} dist(P_j, v) we write the vertex v as v = Σ_{k∈C_{j+}} c̃_jk(ã_k^T, b̃_k) ∈ P̃_j and choose the vertex u = Σ_{i∈I_{j+}} c_ji(a_i^T, b_i) of P_j, where I_{j+} is the set of all indices corresponding to generators that belong to these clusters. As in the proof of Proposition 5, their distance was taken into account when computing the upper bound for max_{u∈V_{P_j}} dist(u, P̃_j). Hence, both obtain the same upper bound, which leads to

    H(P_j, P̃_j) ≤ Σ_{k∈C_{j+}} Σ_{i∈I_{jk+}} [ |ε_ji| ‖λ_i‖ / |I_{jk+}| + (1 − 1/|I_{jk+}|) |c_ji| ‖(a_i^T, b_i)‖ ]
        + Σ_{k∈C_{j+}} Σ_{i∈I_{jk+}} [ ( |ε_ji| ‖(a_i^T, b_i)‖ + |c_ji| ‖λ_i‖ ) / |I_{jk+}| ] + Σ_{i∈N_{j+}} |c_ji| ‖(a_i^T, b_i)‖

where I_{j+} contains all indices corresponding to positive c_ji. Similarly, we deduce

    H(Q_j, Q̃_j) ≤ Σ_{k∈C_{j−}} Σ_{i∈I_{jk−}} [ |ε_ji| ‖λ_i‖ / |I_{jk−}| + (1 − 1/|I_{jk−}|) |c_ji| ‖(a_i^T, b_i)‖ ]
        + Σ_{k∈C_{j−}} Σ_{i∈I_{jk−}} [ ( |ε_ji| ‖(a_i^T, b_i)‖ + |c_ji| ‖λ_i‖ ) / |I_{jk−}| ] + Σ_{i∈N_{j−}} |c_ji| ‖(a_i^T, b_i)‖

where I_{j−} contains all i such that c_ji < 0. In total these bounds together with Proposition 4 give

    (1/ρ) · max_{x∈B} |v_j(x) − ṽ_j(x)| ≤ Σ_{k∈C_j} Σ_{i∈I_{jk}} [ |ε_ji| ‖λ_i‖ / |I_{jk}| + (1 − 1/|I_{jk}|) |c_ji| ‖(a_i^T, b_i)‖ ]
        + Σ_{k∈C_j} Σ_{i∈I_{jk}} [ ( |ε_ji| ‖(a_i^T, b_i)‖ + |c_ji| ‖λ_i‖ ) / |I_{jk}| ] + Σ_{i∈N_j} |c_ji| ‖(a_i^T, b_i)‖

Here we used the notation C_j = C_{j+} ∪ C_{j−} = {1, 2, ..., K} and I_{jk} is either equal to I_{jk+} or I_{jk−} depending on k ∈ C_j. Note that {i | i ∈ I_{jk}, k ∈ C_j} = {1, 2, ..., n} \ N_j ⊆ {1, 2, ..., n}, since every generator that is not null corresponds to some cluster center with the same sign. Also, using N_max ≥ |I_{jk}| ≥ N_min, it follows that

    (1/ρ) · max_{x∈B} |v_j(x) − ṽ_j(x)| ≤ Σ_{i=1}^{n} [ |ε_ji| ‖λ_i‖ / N_min + (1 − 1/N_max) |c_ji| ‖(a_i^T, b_i)‖ ]
        + Σ_{i=1}^{n} [ ( |ε_ji| ‖(a_i^T, b_i)‖ + |c_ji| ‖λ_i‖ ) / N_min ] + Σ_{i∈N_j} |c_ji| ‖(a_i^T, b_i)‖

We will compute the total cost that combines all outputs by applying the inequality

    ( Σ_{j=1}^{m} |u_j| )² ≤ m Σ_{j=1}^{m} |u_j|²  ⇔  Σ_{j=1}^{m} |u_j| ≤ √m ‖(u_1, ..., u_m)‖

which is a direct application of the Cauchy-Schwartz inequality. Together with the relations ‖ε^{(i)}‖ < δ_max, ‖λ_i‖ < δ_max, we get

    (1/ρ) · Σ_{j=1}^{m} max_{x∈B} |v_j(x) − ṽ_j(x)| ≤ Σ_{j=1}^{m} Σ_{i=1}^{n} [ |ε_ji| ‖λ_i‖ / N_min + (1 − 1/N_max) |c_ji| ‖(a_i^T, b_i)‖ ]
        + Σ_{j=1}^{m} Σ_{i=1}^{n} [ ( |ε_ji| ‖(a_i^T, b_i)‖ + |c_ji| ‖λ_i‖ ) / N_min ] + Σ_{j=1}^{m} Σ_{i∈N_j} |c_ji| ‖(a_i^T, b_i)‖

        ≤ Σ_{i=1}^{n} [ √m ‖ε^{(i)}‖ ‖λ_i‖ / N_min + (1 − 1/N_max) √m ‖C_{:,i}‖ ‖(a_i^T, b_i)‖ ]
        + Σ_{i=1}^{n} [ ( √m ‖ε^{(i)}‖ ‖(a_i^T, b_i)‖ + √m ‖C_{:,i}‖ ‖λ_i‖ ) / N_min ] + Σ_{j=1}^{m} Σ_{i∈N_j} |c_ji| ‖(a_i^T, b_i)‖

        ≤ √m K δ_max + √m (2 − 1/N_max) Σ_{i=1}^{n} ‖C_{:,i}‖ ‖(a_i^T, b_i)‖
        + (√m δ_max / N_min) Σ_{i=1}^{n} ( ‖(a_i^T, b_i)‖ + ‖C_{:,i}‖ ) + Σ_{j=1}^{m} Σ_{i∈N_j} |c_ji| ‖(a_i^T, b_i)‖

as desired.
