1 Introduction
Tropical geometry (Maclagan & Sturmfels, 2015) is a mathematical field based on algebraic geometry and strongly linked to polyhedral and combinatorial geometry. It is built upon the tropical semiring, which originally refers to the min-plus semiring (Rmin, ∧, +) but may also refer to the max-plus semiring (Cuninghame-Green, 2012; Butkovič, 2010). In our work, we follow the convention of the max-plus semiring (Rmax, ∨, +), which replaces the classical operations of addition and multiplication by max and sum respectively. These operations turn polynomials into piecewise linear functions, making them directly applicable to neural networks.
Tropical mathematics covers a wide range of applications, including dynamical systems on weighted lattices (Maragos, 2017), finite state transducers (Theodosis & Maragos, 2018; 2019) and convex regression (Maragos & Theodosis, 2020; Tsilivis et al., 2021). Recently, tropical geometry has had a remarkable theoretical impact on the study of neural networks and machine learning (Maragos et al., 2021). Zhang et al. (2018) prove the equivalence of ReLU activated neural networks with tropical rational mappings. Furthermore, they use zonotopes to compute a bound on the number of the network's linear regions, which was already known from (Montúfar et al., 2014). In a similar context, Charisopoulos & Maragos (2018) compute an upper bound on the number of linear regions of convolutional and maxout layers and propose a randomized algorithm for linear region counting. Other works employ tropical geometry to examine the training and further properties of the morphological perceptron (Charisopoulos & Maragos, 2017) and morphological neural networks (Dimitriadis & Maragos, 2021).
Pruning or, more generally, compressing neural networks has gained interest in recent years due to the surprising capability of reducing the size of a neural network without compromising performance (Blalock et al., 2020). As tropical geometry explains the mathematical structure of neural networks, pruning may also be viewed from the perspective of tropical geometry. Indeed, Alfarra et al. (2020) propose an unstructured compression algorithm based on sparsifying the zonotope matrices of the network. Also,
Smyrnis et al. (2020) construct a novel tropical division algorithm that applies to neural network minimization. A generalization of this approach extends to multiclass networks (Smyrnis & Maragos, 2020).
Contributions In our work, we contribute to structured neural network approximation from the mathematical viewpoint of tropical geometry:
• We establish a novel bound on the approximation error between two neural networks with ReLU activations and one hidden layer. To prove this, we bound the difference of the networks' tropical polynomials via the Hausdorff distance of their respective zonotopes.
• We construct two geometrical neural network compression methods that are based on zonotope reduction and employ the K-means algorithm for clustering. Our algorithms apply to the fully connected layers of ReLU activated neural networks.
• We analyze our algorithms both theoretically and experimentally. The theoretical evaluation is based on our bound on the neural network approximation error. On the experimental side, we examine how well our algorithms retain the accuracy of convolutional neural networks when compressing their fully connected layers.
Tropical polynomials A tropical polynomial f in d variables x = (x1, x2, ..., xd)^T is defined as the function
$$f(\mathbf{x}) = \max_{i \in [n]} \{\mathbf{a}_i^T \mathbf{x} + b_i\} \qquad (1)$$
where [n] = {1, ..., n}, the a_i are vectors in R^d and b_i is the corresponding monomial coefficient in Rmax = R ∪ {−∞}. The set of such polynomials constitutes the semiring Rmax[x] of tropical polynomials. Note that each term a_i^T x + b_i corresponds to a hyperplane in R^d. We thus call the vectors {a_i}_{i∈[n]} the slopes of the tropical polynomial, and {b_i}_{i∈[n]} the respective biases. We allow slopes to be vectors with real coefficients rather than integer ones, as is normally the case for polynomials in regular algebra. Such polynomials are also referred to as signomials (Duffin & Peterson, 1973) in the literature.
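For concreteness, a minimal sketch (ours, not from the paper) that evaluates such a polynomial as the pointwise maximum of its affine terms:

```python
import numpy as np

def tropical_poly(A, b, x):
    """Evaluate f(x) = max_{i in [n]} (a_i^T x + b_i).

    A: (n, d) array whose rows are the slopes a_i.
    b: (n,) array of biases b_i.
    x: (d,) input point.
    """
    return np.max(A @ x + b)

# Example: f(x1, x2) = max(x1 + x2, 2*x1 + 1, 0)
A = np.array([[1.0, 1.0], [2.0, 0.0], [0.0, 0.0]])
b = np.array([0.0, 1.0, 0.0])
print(tropical_poly(A, b, np.array([0.5, -1.0])))  # prints 2.0
```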
Polytopes Polytopes have been studied extensively (Ziegler, 2012; Grünbaum, 2013) and serve as a geometric tool for fields such as linear programming and optimization. They also play an important role in the analysis of neural networks. For instance, Zhang et al. (2018); Charisopoulos & Maragos (2018) show that linear regions of neural networks correspond to vertices of polytopes. Thus, counting linear regions reduces to a combinatorial geometry problem. In what follows, we explore this connection of tropical geometry with polytopes.
Consider the tropical polynomial defined in (1). The Newton polytope associated to f(x) is defined as the convex hull of the slopes of the polynomial
$$\mathrm{Newt}(f) := \mathrm{conv}\{\mathbf{a}_i : i \in [n]\}$$
Furthermore, the extended Newton polytope of f(x) is defined as the convex hull of the slopes of the polynomial, each extended in the last dimension by the corresponding bias coefficient:
$$\mathrm{ENewt}(f) := \mathrm{conv}\{(\mathbf{a}_i^T, b_i) : i \in [n]\}$$
The following proposition computes the extended Newton polytope that occurs when a tropical
operation is applied between two tropical polynomials. It will allow us to compute the polytope
representation corresponding to a neural network’s hidden layer.
Proposition 1. (Zhang et al., 2018; Charisopoulos & Maragos, 2018) Let f, g ∈ Rmax[x] be two tropical polynomials. Then for the extended Newton polytopes it holds that
$$\mathrm{ENewt}(f \vee g) = \mathrm{conv}\{\mathrm{ENewt}(f) \cup \mathrm{ENewt}(g)\}$$
$$\mathrm{ENewt}(f + g) = \mathrm{ENewt}(f) \oplus \mathrm{ENewt}(g)$$
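To illustrate Proposition 1, the following sketch (helper names are ours) produces point sets whose convex hulls are the extended Newton polytopes of f ∨ g and f + g, given the points (a_i^T, b_i) of each polynomial:

```python
import numpy as np

def enewt_points(A, b):
    """Points (a_i^T, b_i) in R^{d+1}; their convex hull is ENewt(f)."""
    return np.hstack([A, b[:, None]])

def enewt_max(Pf, Pg):
    """ENewt(f v g) is the convex hull of the union of the two point sets."""
    return np.vstack([Pf, Pg])

def enewt_sum(Pf, Pg):
    """ENewt(f + g) is the Minkowski sum: convex hull of all pairwise sums."""
    return (Pf[:, None, :] + Pg[None, :, :]).reshape(-1, Pf.shape[1])
```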
Figure 1: Illustration of tropical operations between polynomials. The polytope of the max (∨) of f and g corresponds to the convex hull of the union of the points of the two polytopes, and the polytope of the sum (+) corresponds to their Minkowski sum.
Tropical geometry has the capability of expressing the mathematical structure of ReLU activated
neural networks. We review some of the basic properties of neural networks and introduce notation
that will be used in our analysis. For this purpose, we consider the ReLU activated neural network of
Fig. 2 with one hidden layer.
Network tropical equations The network of Fig. 2 consists of an input layer x = (x1, ..., xd), a hidden layer f = (f1, ..., fn) with ReLU activations, an output layer v = (v1, ..., vm) and two linear layers defined by the matrices A, C respectively. As illustrated in Fig. 2, the i-th row of the first linear layer is A_{i,:} = a_i^T with bias b_i, and the j-th row of the second linear layer is C_{j,:} = (c_{j1}, c_{j2}, ..., c_{jn}); we ignore the biases of the second layer. Furthermore, the output of the i-th component of the hidden layer f is computed as
$$f_i(\mathbf{x}) = \max\left(\sum_{k=1}^{d} a_{ik} x_k + b_i,\; 0\right) = \max(\mathbf{a}_i^T \mathbf{x} + b_i,\, 0) \qquad (2)$$
We deduce that each f_i is a tropical polynomial with two terms. It therefore follows that ENewt(f_i) is a line segment in R^{d+1}. The components of the output layer may be computed as
$$v_j(\mathbf{x}) = \sum_{i=1}^{n} c_{ji} f_i(\mathbf{x}) = \sum_{c_{ji}>0} |c_{ji}| f_i(\mathbf{x}) - \sum_{c_{ji}<0} |c_{ji}| f_i(\mathbf{x}) = p_j(\mathbf{x}) - q_j(\mathbf{x}) \qquad (3)$$
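The decomposition in (3) can be computed directly from the layer weights; a short sketch of this bookkeeping (our notation, with C the (m, n) matrix of second-layer weights):

```python
import numpy as np

def relu_layer(A, b, x):
    """Hidden activations f_i(x) = max(a_i^T x + b_i, 0), cf. Eq. (2)."""
    return np.maximum(A @ x + b, 0.0)

def output_decomposition(A, b, C, x, j):
    """Return (p_j(x), q_j(x)) so that v_j(x) = p_j(x) - q_j(x), cf. Eq. (3)."""
    f = relu_layer(A, b, x)
    c = C[j]                                   # weights feeding output neuron j
    p = np.sum(np.abs(c[c > 0]) * f[c > 0])    # positive tropical polynomial
    q = np.sum(np.abs(c[c < 0]) * f[c < 0])    # negative tropical polynomial
    return p, q
```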
From (3), the output layer of the neural network of Fig. 2 is equivalent to a tropical rational mapping. In fact, this result holds for deeper networks in general, as demonstrated by the following theorem.

Theorem 1. (Zhang et al., 2018) A ReLU activated deep neural network F : R^d → R^m is equivalent to a tropical rational mapping.

It is not known whether a tropical rational function r(x) admits an efficient geometric representation that determines its values {r(x)} for x ∈ R^d, as holds for tropical polynomials with their polytopes in Proposition 2. For this reason, we choose to work separately on the polytopes of the tropical polynomials p_j, q_j.

Figure 2: Neural network with one hidden ReLU layer. The first linear layer has weights {a_i^T} with biases {b_i} corresponding to the i-th node, ∀i ∈ [n], and the second has weights {c_{ji}}, ∀j ∈ [m], i ∈ [n].
Zonotope Generators Each neuron of the hidden layer represents geometrically a line segment contributing to the positive or the negative zonotope. We thus call these line segments generators of the zonotope. The generators further receive the characterization positive or negative depending on the zonotope they contribute to. It is intuitive to expect that a zonotope gets more complex as its number of generators increases. In fact, each vertex of the zonotope can be computed as a sum of vertices of the generators, where we choose a vertex from each generating line segment, either 0 or c_{ji}(a_i^T, b_i). We summarize the above with the following extension of (Charisopoulos & Maragos, 2018).

Proposition 3. P_j, Q_j are zonotopes in R^{d+1}. For each vertex v of P_j there exists a subset of indices I_+ of {1, 2, ..., n} with c_{ji} > 0, ∀i ∈ I_+, such that $v = \sum_{i \in I_+} c_{ji}(\mathbf{a}_i^T, b_i)$. Similarly, a vertex u of Q_j can be written as $u = \sum_{i \in I_-} c_{ji}(\mathbf{a}_i^T, b_i)$, where I_− corresponds to c_{ji} < 0, ∀i ∈ I_−.
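In code, the generators of P_j and Q_j can be read off from the layer weights; a minimal sketch (helper names are ours; we use |c_ji| for the negative generators, following the definition of q_j in Eq. (3)):

```python
import numpy as np

def zonotope_generators(A, b, C, j):
    """Line-segment generators of the positive and negative zonotopes of output j."""
    G = np.hstack([A, b[:, None]])                 # rows (a_i^T, b_i) in R^{d+1}
    c = C[j]
    pos = c[c > 0][:, None] * G[c > 0]             # generators c_ji * (a_i^T, b_i) of P_j
    neg = np.abs(c[c < 0])[:, None] * G[c < 0]     # generators |c_ji| * (a_i^T, b_i) of Q_j
    return pos, neg
```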
In this section we present our central theorem that bounds the error between the original and
approximate neural network, when both have the architecture of Fig. 2. To achieve this we need
to derive a bound for the error of approximating a simpler functional structure, namely the tropical
polynomials that represent the neural network. The motivation behind the geometrical bounding of
the error of the polynomials is Proposition 2. It indicates that a polynomial’s values are determined at
each point of the input space by the vertices of the upper envelope of its extended Newton polytope.
Therefore, it is expected that two tropical polynomials with approximately equal extended Newton
polytopes should attain similar values. In fact, this serves as the intuition for our theorem. The metric
we use to define the distance between extended Newton polytopes is the Hausdorff distance.
Hausdorff distance The distance of a point u ∈ R^d from a finite set V ⊂ R^d is denoted by either dist(u, V) or dist(V, u) and computed as min_{v∈V} ‖u − v‖, which is the Euclidean distance of u from its closest point v ∈ V. The Hausdorff distance H(V, U) of two finite point sets V, U ⊂ R^d is defined as
$$H(\mathcal{V}, \mathcal{U}) = \max\left\{ \max_{v \in \mathcal{V}} \mathrm{dist}(v, \mathcal{U}),\; \max_{u \in \mathcal{U}} \mathrm{dist}(\mathcal{V}, u) \right\}$$
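For the finite vertex sets used here, the Hausdorff distance can be computed directly; a small numpy sketch:

```python
import numpy as np

def hausdorff(V, U):
    """Hausdorff distance between finite point sets V, U (one point per row)."""
    D = np.linalg.norm(V[:, None, :] - U[None, :, :], axis=-1)   # pairwise distances
    return max(D.min(axis=1).max(),    # max over v in V of dist(v, U)
               D.min(axis=0).max())    # max over u in U of dist(V, u)
```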
The above proposition enables us to handle the more general case of neural networks with one hidden layer, which are equivalent to tropical rational mappings. By repeatedly applying Proposition 4 to each tropical polynomial corresponding to the networks, we get the following bound.

Theorem 2. Let v, ṽ be two neural networks with architecture as in Fig. 2, and let P̃_j, Q̃_j denote the positive and negative zonotopes of ṽ. The following bound holds:
$$\max_{\mathbf{x} \in B} \|v(\mathbf{x}) - \tilde{v}(\mathbf{x})\|_1 \leq \rho \cdot \sum_{j=1}^{m} \left[ H\big(P_j, \tilde{P}_j\big) + H\big(Q_j, \tilde{Q}_j\big) \right]$$
Remark 1. The reason we choose to compute the approximation error on a bounded hypersphere B is twofold. Firstly, on an unbounded domain the error between linear terms diverges to infinity and, secondly, in practice the working subspace of our dataset is usually bounded, e.g. images.
Our approaches Approximating a zonotope with fewer generators is a problem known as zonotope order reduction (Kopetzki et al., 2017). In our case we approach this problem by manipulating the zonotope generators c_{ji}(a_i^T, b_i), ∀i, j.¹ Each of the algorithms presented creates a set of altered generators that approximate the original zonotopes. Ideally, we would require the approximation to hold simultaneously for all positive and negative zonotopes of each output component v_j. However, this is not always possible, as in the case of multiclass neural networks, and it necessarily leads to heuristic manipulation. Our first attempt to tackle this problem applies the K-means algorithm to the positive and negative generators separately.

¹Dealing with the full generated zonotope would lead to exponential computational overhead.
Figure 3: Demonstration of Zonotope K-means. (a) Original network. (b) Original zonotopes. (c) Resulting zonotopes. (d) Compressed network.

This method is restricted to single output neural networks. Our second approach extends the technique to multiclass neural networks. Specifically, it applies K-means to the vectors associated with the neural paths passing through a node in the hidden layer, as we define later. The algorithms we present refer to the neural network of Fig. 2 with one hidden layer, but we may apply them repeatedly to compress deeper networks.
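A schematic of this layer-by-layer application, with compress_layer standing in for either of the two methods described next (a sketch with our placeholder names, not the paper's pseudocode):

```python
def compress_network(layers, ratio, compress_layer):
    """Apply a one-hidden-layer compression routine to consecutive linear layers.

    layers: list of (A, b) weight/bias pairs of the fully connected part.
    compress_layer(A, b, C, K): returns compressed (A, b, C) with K hidden neurons.
    """
    for idx in range(len(layers) - 1):           # the last layer keeps its output size
        A, b = layers[idx]
        C, b_next = layers[idx + 1]
        K = max(1, int(ratio * A.shape[0]))      # number of neurons to keep
        A_new, b_new, C_new = compress_layer(A, b, C, K)
        layers[idx] = (A_new, b_new)
        layers[idx + 1] = (C_new, b_next)
    return layers
```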
Zonotope K-means The first compression approach uses K-means to compress each zonotope of the network and covers only the case of a single output neural network, e.g. as in Fig. 2 with m = 1. The algorithm reduces the hidden layer size from n to K neurons. We use the notation c_i, i = 1, ..., n for the weights of the second linear layer, connecting the hidden layer with the output node. Algorithm 1 is presented below and a demonstration of its execution can be found in Fig. 3.
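Algorithm 1 itself is not reproduced in this excerpt; the following is a minimal sketch of the Zonotope K-means idea as described above, under the assumption that each cluster centroid is used directly as a new generator with output weight ±1 (one possible factorization of c̃_k(ã_k^T, b̃_k); the actual algorithm may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

def zonotope_kmeans(A, b, c, K_pos, K_neg):
    """Compress a single-output ReLU layer (first-layer weights A, biases b, output weights c)."""
    G = np.hstack([A, b[:, None]])               # rows (a_i^T, b_i)
    new_rows, new_c = [], []
    for sign, K in [(+1, K_pos), (-1, K_neg)]:
        mask = sign * c > 0                      # positive or negative generators
        if not np.any(mask):
            continue
        gens = np.abs(c[mask])[:, None] * G[mask]
        km = KMeans(n_clusters=min(K, mask.sum()), n_init=10).fit(gens)
        new_rows.append(km.cluster_centers_)     # centroid taken as (ã_k^T, b̃_k)
        new_c.append(sign * np.ones(len(km.cluster_centers_)))
    W = np.vstack(new_rows)
    return W[:, :-1], W[:, -1], np.concatenate(new_c)   # Ã, b̃, c̃
```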
Proposition 5. Zonotope K-means produces a compressed neural network with output ṽ satisfying
$$\frac{1}{\rho} \cdot \max_{\mathbf{x} \in B} |v(\mathbf{x}) - \tilde{v}(\mathbf{x})| \leq K \cdot \delta_{\max} + \left(1 - \frac{1}{N_{\max}}\right) \sum_{i=1}^{n} |c_i| \left\|(\mathbf{a}_i^T, b_i)\right\|$$
where K is the total number of centers used in both K-means runs, δ_max is the largest distance from a point to its corresponding cluster center and N_max is the maximum cardinality of a cluster.
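Under the same reading, the right-hand side of Proposition 5 can be evaluated from the K-means output; a small sketch (assuming the generators, labels and centers of both K-means runs are available):

```python
import numpy as np

def prop5_bound(runs, c, A, b, rho=1.0):
    """Upper bound of Proposition 5 on max_{x in B} |v(x) - v~(x)|.

    runs: list of (gens, labels, centers) tuples, one for the positive and one for
    the negative generators; c, A, b are the original layer weights.
    """
    K = sum(centers.shape[0] for _, _, centers in runs)
    delta_max = max(np.linalg.norm(g - cent[lab], axis=1).max() for g, lab, cent in runs)
    N_max = max(np.bincount(lab).max() for _, lab, _ in runs)
    gen_norms = np.linalg.norm(np.hstack([A, b[:, None]]), axis=1)   # ||(a_i^T, b_i)||
    return rho * (K * delta_max + (1.0 - 1.0 / N_max) * np.sum(np.abs(c) * gen_norms))
```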
The above proposition provides an upper bound on the error between the original neural network and the one approximated with Zonotope K-means. In particular, if we use K = n centers the bound on the approximation error becomes 0, because then δ_max = 0 and N_max = 1. Also, for small K the bound approaches a fixed value depending on the magnitude of the weights of the linear layers.
The exact positive and negative zonotope approximation performed by the Zonotope K-means algorithm has a main disadvantage: it can only be used in single output neural networks. Indeed, suppose that we want to employ the preceding algorithm to approximate the zonotopes of each output in a multiclass neural network. That would require 2m separate executions of K-means, which are not necessarily consistent. For instance, it is possible to have c_{j1 i} > 0 and c_{j2 i} < 0 for some output components v_{j1}, v_{j2}. That means that in the compression procedure of v_{j1}, the i-th neuron belongs to the positive generator set, while for v_{j2} it belongs to the negative one. This makes the two compressions incompatible. Moreover, the restriction to single output networks only allows us to compress the final ReLU layer and not any preceding ones.
Neural Path K-means To overcome this obstacle we apply a simultaneous approximation of the zonotopes. The method is called Neural Path K-means and directly applies K-means to the vectors of weights (a_i^T, b_i, c_{1i}, ..., c_{mi}) associated with each neuron i of the hidden layer. The name of the algorithm emanates from the fact that the vector associated with each neuron consists of the weights of all the neural network paths passing through this neuron. The procedure is presented in Algorithm 2.
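Algorithm 2 is likewise not reproduced here; a minimal sketch of the clustering step, again assuming each centroid is split directly into the weights of one compressed neuron:

```python
import numpy as np
from sklearn.cluster import KMeans

def neural_path_kmeans(A, b, C, K):
    """Compress a hidden ReLU layer of a multi-output network with K-means.

    Neuron i is represented by the neural-path vector (a_i^T, b_i, c_1i, ..., c_mi);
    each of the K centroids is split back into the weights of one compressed neuron.
    """
    d = A.shape[1]
    V = np.hstack([A, b[:, None], C.T])          # one row per hidden neuron
    km = KMeans(n_clusters=K, n_init=10).fit(V)
    centers = km.cluster_centers_
    A_new, b_new = centers[:, :d], centers[:, d]          # (ã_k^T, b̃_k)
    C_new = centers[:, d + 1:].T                           # c̃_jk, shape (m, K)
    return A_new, b_new, C_new, km.labels_
```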
Null Generators Neural Path K-means does not apply compression directly to each zonotope of the network, but is rather a heuristic approach to this task. More precisely, if we focus on the set of generators of the zonotopes of output j, Neural Path K-means might mix positive and negative generators together in the same cluster. For instance, suppose (ã_k^T, b̃_k, C̃_{:,k}^T) is the cluster center corresponding to the vectors (a_i^T, b_i, C_{:,i}^T) for i ∈ I. Then, regarding output j, it is not necessary that all c_{ji}, i ∈ I, have the same sign. Thus, the compressed positive zonotope P̃_j might contain generators of the original negative zonotope Q_j and vice versa. We call generators c_{ji}(a_i^T, b_i) that contribute to the opposite zonotope null generators.
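Given the cluster assignment, the null generators for output j are exactly the neurons whose weight c_ji disagrees in sign with the corresponding center weight c̃_jk; a short sketch using the outputs of the function above:

```python
import numpy as np

def null_generators(C, C_new, labels, j):
    """Indices i whose generator lands in the opposite zonotope for output j.

    C: (m, n) original second-layer weights; C_new: (m, K) compressed weights;
    labels[i]: cluster index of hidden neuron i from Neural Path K-means.
    """
    sign_orig = np.sign(C[j])                   # sign of c_ji
    sign_center = np.sign(C_new[j, labels])     # sign of c̃_jk of i's cluster
    return np.where(sign_orig * sign_center < 0)[0]
```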
Proposition 6. Neural Path K-means produces a compressed neural network with output ṽ satisfying
$$\begin{aligned}
\frac{1}{\rho} \cdot \max_{\mathbf{x} \in B} \|v(\mathbf{x}) - \tilde{v}(\mathbf{x})\|_1 &\leq \sqrt{m}\, K \delta_{\max}^2 + \sqrt{m}\left(1 - \frac{1}{N_{\max}}\right) \sum_{i=1}^{n} \|C_{:,i}\| \left\|(\mathbf{a}_i^T, b_i)\right\| \\
&\quad + \frac{\sqrt{m}\,\delta_{\max}}{N_{\min}} \sum_{i=1}^{n} \left( \left\|(\mathbf{a}_i^T, b_i)\right\| + \|C_{:,i}\| \right) + \sum_{j=1}^{m} \sum_{i \in N_j} |c_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\|
\end{aligned}$$
where K is the number of K-means clusters, δ_max the maximum distance from any point to its corresponding cluster center, N_max, N_min the maximum and minimum cardinality of a cluster respectively, and N_j the set of null generators with respect to output j.
The performance of Neural Path K-means is evaluated with Proposition 6. The result is analogous to that of Zonotope K-means. The bound on the approximation error becomes zero when K approaches n. Indeed, for K = n we get δ_max = 0, N_max = 1 and N_j = ∅, ∀j ∈ [m]. For lower values of K, the upper bound approaches a value depending on the magnitude of the weights of the linear layers together with the weights corresponding to null generators.
Table 1: Reporting accuracy of compressed networks for single output compression methods.
5 Experiments
We conduct experiments on compressing the linear layers of convolutional neural networks. Our experiments serve as a proof of concept and indicate that our theoretical claims indeed hold in practice. The heart of our contribution lies in presenting a novel tropical geometrical background for neural network approximation that we hope will shed light on further research towards tropical mathematics.

Our methods compress the linear layers of the network layer by layer. They perform a functional approximation of the original network and are thus applicable to both classification and regression tasks. To compare them with other techniques in the literature we choose methods with a similar structure, i.e. structured pruning techniques without re-training. For example, Alfarra et al. (2020) proposed a compression algorithm based on the sparsification of the matrices representing the zonotopes, which served as an inspiration for part of our work. However, their method is unstructured and thus not directly comparable. The methods we compare against are two tropical methods for single-output (Smyrnis et al., 2020) and multi-output (Smyrnis & Maragos, 2020) networks, Random and L1 Structured pruning, and a modification of ThiNet (Luo et al., 2017) adapted to linear layers. Smyrnis et al. (2020); Smyrnis & Maragos (2020) proposed a novel tropical division framework aimed at reducing the number of zonotope vertices. The Random method prunes neurons uniformly at random, while L1 Structured prunes those whose weights have the lowest L1 norm. Finally, ThiNet uses a greedy criterion to discard the neurons that have the smallest contribution to the output of the network.
MNIST Dataset, Pairs 3-5 and 4-9 The first experiment is performed on the binary classification tasks of pairs 3/5 and 4/9 of the MNIST dataset, so we can utilize both of our proposed methods. In Table 1, we compare our methods with the tropical geometrical approach of Smyrnis et al. (2020). Their method is based on a tropical division framework for network minimization. For a fair comparison, we use the same CNN with two fully connected layers and a hidden layer of size 1000. According to Table 1, our techniques have similar performance: they retain the accuracy of the network while reducing its size. Moreover, in Table 2 we include experimental computation of the theoretical bounds provided by Propositions 5 and 6. We notice that the bounds decrease as fewer weights are retained. We expected the bounds to increase, because the fewer weights we keep, the coarser the compression and the larger the error. However, the opposite holds, which means that the bounds are tighter for higher pruning rates. It is also important to mention that the bounds become 0 when we keep all the weights, as expected.
MNIST and Fashion-MNIST Datasets For the second experiment we employ the MNIST and Fashion-MNIST datasets. The corresponding classification tasks are multiclass, so only Neural Path K-means can be applied. In Table 3, we compare it with the multiclass tropical method of Smyrnis & Maragos (2020) using the same CNN architecture they do. Furthermore, in plots 4a, 4b we compare Neural Path K-means with ThiNet and baseline pruning methods by compressing LeNet5 (LeCun et al., 1998). To get a better idea of how our method performs in deeper architectures, we provide plots 4c, 4d, which illustrate the performance of compressing a deep neural network with layers of size 28×28, 512, 256, 128 and 10, which we refer to as deepNN. The compression is executed on all hidden layers, proceeding from the input towards the output. From Table 3 we deduce that our method performs better than (Smyrnis & Maragos, 2020). It also achieves higher accuracy scores and exhibits lower variance, as shown in plots 4a-4d. Overall, Neural Path K-means seems to perform well, even competitively with ThiNet. Its worst performance occurs at low percentages of remaining weights. An explanation for this is that K-means provides a high-quality compression as long as the number of centers is not less than the number of "real" clusters.
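For reference, one plausible PyTorch definition of the deepNN described above (the layer sizes come from the text; the remaining training details are not specified in this excerpt):

```python
import torch.nn as nn

# Fully connected network with layers 28*28 -> 512 -> 256 -> 128 -> 10.
deepNN = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),
)
```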
Figure 4: Neural Path K-means compared with baseline pruning methods and ThiNet on (a) LeNet5, MNIST; (b) LeNet5, F-MNIST; (c) deepNN, MNIST; (d) deepNN, F-MNIST. The horizontal axis shows the ratio of remaining neurons in each hidden layer of the fully connected part.
CIFAR Dataset We conduct our final experiment on the CIFAR datasets using CIFAR-VGG (Blalock et al., 2020) and a version of AlexNet adapted to CIFAR. The resulting plots are shown in Fig. 4e-4h. We deduce that Neural Path K-means retains good performance on larger datasets. In particular, in most cases it has slightly better accuracy and lower deviation than the baselines, but behaves worse when keeping almost no weights.
We presented a novel theorem bounding the approximation error between two neural networks. This theorem arises from bounding the tropical polynomials representing the neural networks via the Hausdorff distance of their extended Newton polytopes. We derived geometrical compression algorithms for the fully connected parts of ReLU activated deep neural networks, while application to convolutional layers is ongoing work. Our algorithms seem to perform well in practice and motivate further research in the direction revealed by tropical geometry.
References
Motasem Alfarra, Adel Bibi, Hasan Hammoud, Mohamed Gaafar, and Bernard Ghanem. On the
decision boundaries of deep neural networks: A tropical geometry perspective. arXiv preprint
arXiv:2002.08838, 2020.
Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of
neural network pruning? arXiv preprint arXiv:2003.03033, 2020.
Peter Butkovič. Max-linear systems: theory and algorithms. Springer monographs in mathematics.
Springer, 2010. ISBN 978-1-84996-298-8.
Vasileios Charisopoulos and Petros Maragos. Morphological perceptrons: geometry and training
algorithms. In International Symposium on Mathematical Morphology and Its Applications to
Signal and Image Processing, pp. 3–15. Springer, 2017.
Vasileios Charisopoulos and Petros Maragos. A tropical approach to neural networks with piecewise
linear activations. arXiv preprint arXiv:1805.08749, 2018.
Raymond A Cuninghame-Green. Minimax algebra, volume 166. Springer Science & Business
Media, 2012.
Nikolaos Dimitriadis and Petros Maragos. Advances in morphological neural networks: Training,
pruning and enforcing shape constraints. In Proc. 46th IEEE Int’l Conf. Acoustics, Speech and
Signal Processing (ICASSP-2021), Toronto, June 2021.
Richard James Duffin and Elmor L Peterson. Geometric programming with signomials. Journal of
Optimization Theory and Applications, 11(1):3–35, 1973.
Branko Grünbaum. Convex polytopes, volume 221. Springer Science & Business Media, 2013.
Anna-Kathrin Kopetzki, Bastian Schürmann, and Matthias Althoff. Methods for order reduction of
zonotopes. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pp. 5626–5633.
IEEE, 2017.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A Filter Level Pruning Method for Deep Neural
Network Compression. In 2017 IEEE International Conference on Computer Vision (ICCV), pp.
5068–5076, 2017. doi: 10.1109/ICCV.2017.541.
Diane Maclagan and Bernd Sturmfels. Introduction to tropical geometry, volume 161. American
Mathematical Soc., 2015.
Petros Maragos. Dynamical systems on weighted lattices: general theory. Mathematics of Control,
Signals, and Systems, 29(4):1–49, 2017.
Petros Maragos and Emmanouil Theodosis. Multivariate tropical regression and piecewise-linear
surface fitting. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pp. 3822–3826. IEEE, 2020.
Petros Maragos, Vasileios Charisopoulos, and Emmanouil Theodosis. Tropical geometry and machine
learning. Proceedings of the IEEE, 109(5):728–755, 2021. doi: 10.1109/JPROC.2021.3065238.
Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear
regions of deep neural networks. arXiv preprint arXiv:1402.1869, 2014.
Georgios Smyrnis and Petros Maragos. Multiclass neural network minimization via tropical newton
polytope approximation. In Proc. Int’l Conf. on Machine Learning, PMLR, 2020.
Georgios Smyrnis, Petros Maragos, and George Retsinas. Maxpolynomial division with application to
neural network simplification. In ICASSP 2020-2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pp. 4192–4196. IEEE, 2020.
Emmanouil Theodosis and Petros Maragos. Analysis of the viterbi algorithm using tropical algebra
and geometry. In 2018 IEEE 19th International Workshop on Signal Processing Advances in
Wireless Communications (SPAWC), pp. 1–5. IEEE, 2018.
Emmanouil Theodosis and Petros Maragos. Tropical modeling of weighted transducer algorithms on
graphs. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pp. 8653–8657, 2019. doi: 10.1109/ICASSP.2019.8683127.
Nikos Tsilivis, Anastasios Tsiamis, and Petros Maragos. Sparsity in max-plus algebra and applications
in multivariate convex regression. In Proc. 46th IEEE Int’l Conf. Acoustics, Speech and Signal
Processing (ICASSP-2021), Toronto, June 2021.
Liwen Zhang, Gregory Naitzat, and Lek-Heng Lim. Tropical geometry of deep neural networks. In
International Conference on Machine Learning, pp. 5824–5832. PMLR, 2018.
Günter M Ziegler. Lectures on polytopes, volume 152. Springer Science & Business Media, 2012.
Proof of Proposition 3. The first argument follows from the fact that both p_j, q_j are linear combinations of tropical polynomials consisting of two terms. Indeed, we compute
$$p_j(\mathbf{x}) = \sum_{c_{ji}>0} c_{ji} \max(\mathbf{a}_i^T \mathbf{x} + b_i, 0) = \sum_{c_{ji}>0} \max(c_{ji}\mathbf{a}_i^T \mathbf{x} + c_{ji} b_i, 0) \;\overset{\text{Prop. 1}}{\Longrightarrow}\; P_j = \bigoplus_{c_{ji}>0} \mathrm{ENewt}\big(\max(c_{ji}\mathbf{a}_i^T \mathbf{x} + c_{ji} b_i, 0)\big)$$
Each ENewt(max(c_{ji} a_i^T x + c_{ji} b_i, 0)) is a line segment with endpoints 0 and (c_{ji} a_i^T, c_{ji} b_i) = c_{ji}(a_i^T, b_i). Therefore P_j is written as the Minkowski sum of line segments, which is a zonotope by definition. Similarly, Q_j is a zonotope.

Furthermore, from the definition of the Minkowski sum, each point v ∈ P_j may be written as Σ_{c_{ji}>0} v_i, where each v_i is a point in the segment ENewt(max(c_{ji} a_i^T x + c_{ji} b_i, 0)). A vertex of P_j can only occur if v_i is an extreme point of ENewt(max(c_{ji} a_i^T x + c_{ji} b_i, 0)) for every i, which is equivalent to either v_i = 0 or v_i = c_{ji}(a_i^T, b_i). This means that every vertex of P_j corresponds to a subset I_+ ⊆ [n] of indices i with c_{ji} > 0, for which we choose v_i = c_{ji}(a_i^T, b_i), while for the rest v_i = 0. Thus,
$$v = \sum_{i \in I_+} c_{ji}(\mathbf{a}_i^T, b_i)$$
In the same way we derive the analogous result for the negative zonotope Q_j.
Corollary 2. The geometric result concerning the structure of zonotopes can be extended to max-pooling layers. For instance, a max-pooling layer of size 2 × 2 corresponds to a polytope that is constructed as the Minkowski sum of pyramids, which can be viewed as a generalization of a zonotope.
Notice that for relations (5) and (6) we used the Cauchy-Schwarz inequality
$$|\langle \mathbf{x}, \mathbf{y}\rangle| \leq \|\mathbf{x}\|\|\mathbf{y}\| \;\Leftrightarrow\; -\|\mathbf{x}\|\|\mathbf{y}\| \leq \langle \mathbf{x}, \mathbf{y}\rangle \leq \|\mathbf{x}\|\|\mathbf{y}\|$$
Inequality (5) holds at any point x ∈ B for some vertex (a^T, b) ∈ P, therefore
$$p(\mathbf{x}) - \tilde{p}(\mathbf{x}) \geq \min_{(\mathbf{c}^T, d) \in V_{\tilde{P}}} \left( -\rho \cdot d\big(V_P, (\mathbf{c}^T, d)\big) \right) = -\max_{(\mathbf{c}^T, d) \in V_{\tilde{P}}} \rho \cdot d\big(V_P, (\mathbf{c}^T, d)\big) \qquad (8)$$
Hence, from the definition of the Hausdorff distance of two polytopes we derive the desired upper bound
$$|p(\mathbf{x}) - \tilde{p}(\mathbf{x})| \leq \rho \cdot H\big(P, \tilde{P}\big), \; \forall \mathbf{x} \in B \;\Rightarrow\; \max_{\mathbf{x} \in B} |p(\mathbf{x}) - \tilde{p}(\mathbf{x})| \leq \rho \cdot H\big(P, \tilde{P}\big)$$

Remark 2. Note that with a similar proof one may replace the Hausdorff distance of the two polytopes by the Hausdorff distance of their upper envelopes. This makes our theorem an exact generalization of Proposition 2. However, this form is difficult to use in practice, because it is computationally harder to determine the vertices of the upper envelope.
$$\begin{aligned}
\|v(\mathbf{x}) - \tilde{v}(\mathbf{x})\|_1 &= \sum_{j=1}^{m} |v_j(\mathbf{x}) - \tilde{v}_j(\mathbf{x})| = \sum_{j=1}^{m} \big|(p_j(\mathbf{x}) - q_j(\mathbf{x})) - (\tilde{p}_j(\mathbf{x}) - \tilde{q}_j(\mathbf{x}))\big| \\
&= \sum_{j=1}^{m} \big|(p_j(\mathbf{x}) - \tilde{p}_j(\mathbf{x})) - (q_j(\mathbf{x}) - \tilde{q}_j(\mathbf{x}))\big| \leq \sum_{j=1}^{m} \left( |p_j(\mathbf{x}) - \tilde{p}_j(\mathbf{x})| + |q_j(\mathbf{x}) - \tilde{q}_j(\mathbf{x})| \right)
\end{aligned}$$
Below, in Fig. 5, we present a larger version of the figure demonstrating the execution of Zonotope K-means.
Figure 5: Larger version of the Zonotope K-means demonstration of Fig. 3.
Consider a vertex $u = \sum_{i \in I_+} c_i(\mathbf{a}_i^T, b_i)$ of P. For the vertex u ∈ P we need to choose a vertex v ∈ P̃ as close to u as possible, in order to provide an upper bound for dist(u, P̃). Vertex v is selected as follows. For each i ∈ I_+ we select k such that c̃_k(ã_k^T, b̃_k) is the center of the cluster to which c_i(a_i^T, b_i) belongs. We denote the set of such clusters by C_+, where each cluster center k appears only once. Then, vertex v is constructed as $v = \sum_{k \in C_+} \tilde{c}_k(\tilde{\mathbf{a}}_k^T, \tilde{b}_k) \in \tilde{P}$. We have that:
$$\begin{aligned}
\mathrm{dist}(u, \tilde{P}) &\leq \Big\| \sum_{i \in I_+} c_i(\mathbf{a}_i^T, b_i) - \sum_{k \in C_+} \tilde{c}_k(\tilde{\mathbf{a}}_k^T, \tilde{b}_k) \Big\| \\
&\leq \sum_{k \in C_+} \Big\| \sum_{i \in I_{k+}} c_i(\mathbf{a}_i^T, b_i) - \tilde{c}_k(\tilde{\mathbf{a}}_k^T, \tilde{b}_k) \Big\| \\
&\leq \sum_{k \in C_+} \sum_{i \in I_{k+}} \Big\| c_i(\mathbf{a}_i^T, b_i) - \frac{\tilde{c}_k(\tilde{\mathbf{a}}_k^T, \tilde{b}_k)}{|I_{k+}|} \Big\| \\
&= \sum_{k \in C_+} \sum_{i \in I_{k+}} \Big\| c_i(\mathbf{a}_i^T, b_i) - \frac{c_i(\mathbf{a}_i^T, b_i) + \varepsilon_i}{|I_{k+}|} \Big\| \\
&\leq \sum_{k \in C_+} \sum_{i \in I_{k+}} \left[ \left(1 - \frac{1}{|I_{k+}|}\right) |c_i| \left\|(\mathbf{a}_i^T, b_i)\right\| + \frac{\|\varepsilon_i\|}{|I_{k+}|} \right] \\
&\leq |C_+| \cdot \delta_{\max} + \left(1 - \frac{1}{N_{\max}}\right) \sum_{i \in I_+} |c_i| \left\|(\mathbf{a}_i^T, b_i)\right\|
\end{aligned}$$
where we denote by I_{k+} the set of indices i ∈ I_+ that belong to the center k ∈ C_+, and $\varepsilon_i = \tilde{c}_k(\tilde{\mathbf{a}}_k^T, \tilde{b}_k) - c_i(\mathbf{a}_i^T, b_i)$ is the difference of the i-th generator from its corresponding cluster center. To compute an upper bound for $\max_{v \in V_{\tilde{P}}} \mathrm{dist}(P, v)$, we write the vertex v as $v = \sum_{k \in C_+} \tilde{c}_k(\tilde{\mathbf{a}}_k^T, \tilde{b}_k)$ and choose the vertex $u = \sum_{i \in I_+} c_i(\mathbf{a}_i^T, b_i)$ of P; their distance was taken into account when computing the upper bound for $\max_{u \in V_P} \mathrm{dist}(u, \tilde{P})$, and thus both values obtain the same upper bound. Therefore,
$$H(P, \tilde{P}) \leq K_+ \cdot \delta_{\max} + \left(1 - \frac{1}{N_{\max}}\right) \sum_{i \in I_+} |c_i| \left\|(\mathbf{a}_i^T, b_i)\right\|$$
where K_+ is the number of cluster centers corresponding to P̃ and I_+ the indices corresponding to all positive generators of P. Similarly,
$$H(Q, \tilde{Q}) \leq K_- \cdot \delta_{\max} + \left(1 - \frac{1}{N_{\max}}\right) \sum_{i \in I_-} |c_i| \left\|(\mathbf{a}_i^T, b_i)\right\|$$
where K_−, I_− are defined in an analogous way for the negative zonotope. Combining the relations gives the desired bound:
$$\frac{1}{\rho} \cdot |v(\mathbf{x}) - \tilde{v}(\mathbf{x})| \leq H(P, \tilde{P}) + H(Q, \tilde{Q}) \leq K \cdot \delta_{\max} + \left(1 - \frac{1}{N_{\max}}\right) \sum_{i=1}^{n} |c_i| \left\|(\mathbf{a}_i^T, b_i)\right\|$$
Below we illustrate the vectors on which K-means is applied in the multi-output case. The compressed vectors consist of the weights of all the edges associated with a hidden layer neuron. The corresponding edges contain all the possible neural paths that begin at some node of the input, end at some node of the output, and pass through this hidden node.
Figure 6: Neural Path K-means for multi-output neural network compression. In green we highlight the weights corresponding to the i-th vector used by Neural Path K-means.
In the main text we defined the null generators of zonotopes that occur from the execution of Neural Path K-means. Below we provide an illustration of them.

Figure 7: Visualization of K-means in R^{d+1+m}, where d is the input dimension and m the number of outputs. Points are colored according to the j-th output component of the network. Black and white points correspond to generators of P_j and Q_j respectively. White vertices in positive (brown) clusters and black vertices in negative (blue) clusters are null generators with respect to the j-th output.
$$|v_j(\mathbf{x}) - \tilde{v}_j(\mathbf{x})| \leq |p_j(\mathbf{x}) - \tilde{p}_j(\mathbf{x})| + |q_j(\mathbf{x}) - \tilde{q}_j(\mathbf{x})|$$
Any vertex u ∈ P_j can be written as $u = \sum_{i \in I_{j+}} c_{ji}(\mathbf{a}_i^T, b_i)$, where the set of indices I_{j+} satisfies c_{ji} > 0, ∀i ∈ I_{j+}, and thus corresponds to positive generators. To choose a nearby vertex from P̃_j we proceed as follows. For each i ∈ I_{j+} we select the center $(\tilde{\mathbf{a}}_k^T, \tilde{b}_k, \tilde{C}_{:,k}^T)$ of the cluster to which $(\mathbf{a}_i^T, b_i, C_{:,i}^T)$ belongs, but only if c̃_{jk} > 0. Such a center exists only if c_{ji}(a_i^T, b_i) is not a null generator; we denote the set of the selected cluster centers by C_{j+}.
$$\begin{aligned}
\max_{u \in V_{P_j}} \mathrm{dist}(u, \tilde{P}_j) &\leq \Big\| \sum_{i \in I_{j+}} c_{ji}(\mathbf{a}_i^T, b_i) - \sum_{k \in C_{j+}} \tilde{c}_{jk}(\tilde{\mathbf{a}}_k^T, \tilde{b}_k) \Big\| \\
&\leq \sum_{k \in C_{j+}} \Big\| \sum_{i \in I_{jk+}} c_{ji}(\mathbf{a}_i^T, b_i) - \tilde{c}_{jk}(\tilde{\mathbf{a}}_k^T, \tilde{b}_k) \Big\| + \sum_{i \in N_{j+}} |c_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\| \\
&\leq \sum_{k \in C_{j+}} \sum_{i \in I_{jk+}} \Big\| c_{ji}(\mathbf{a}_i^T, b_i) - \frac{\tilde{c}_{jk}(\tilde{\mathbf{a}}_k^T, \tilde{b}_k)}{|I_{jk+}|} \Big\| + \sum_{i \in N_{j+}} |c_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\| \\
&\leq \sum_{k \in C_{j+}} \sum_{i \in I_{jk+}} \Big\| c_{ji}(\mathbf{a}_i^T, b_i) - \frac{(c_{ji} + \varepsilon_{ji})\big((\mathbf{a}_i^T, b_i) + \lambda_i\big)}{|I_{jk+}|} \Big\| + \sum_{i \in N_{j+}} |c_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\| \\
&\leq \sum_{k \in C_{j+}} \sum_{i \in I_{jk+}} \left[ \frac{|\varepsilon_{ji}|\|\lambda_i\|}{|I_{jk+}|} + \left(1 - \frac{1}{|I_{jk+}|}\right) |c_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\| \right] \\
&\quad + \sum_{k \in C_{j+}} \sum_{i \in I_{jk+}} \frac{|\varepsilon_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\| + |c_{ji}| \|\lambda_i\|}{|I_{jk+}|} + \sum_{i \in N_{j+}} |c_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\|
\end{aligned}$$
The maximum value of the upper bound occurs when I_{j+} contains all indices that correspond to c_{ji} > 0. To compute an upper bound for $\max_{v \in V_{\tilde{P}_j}} \mathrm{dist}(P_j, v)$ we write the vertex v as $v = \sum_{k \in C_{j+}} \tilde{c}_{jk}(\tilde{\mathbf{a}}_k^T, \tilde{b}_k) \in \tilde{P}_j$ and choose the vertex $u = \sum_{i \in I_{j+}} c_{ji}(\mathbf{a}_i^T, b_i)$ of P_j, where I_{j+} is the set of all indices corresponding to generators that belong to these clusters. As in the proof of Proposition 5, their distance was taken into account when computing the upper bound for $\max_{u \in V_{P_j}} \mathrm{dist}(u, \tilde{P}_j)$. Hence, both obtain the same upper bound, which leads to
$$\begin{aligned}
H(P_j, \tilde{P}_j) &\leq \sum_{k \in C_{j+}} \sum_{i \in I_{jk+}} \left[ \frac{|\varepsilon_{ji}|\|\lambda_i\|}{|I_{jk+}|} + \left(1 - \frac{1}{|I_{jk+}|}\right) |c_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\| \right] \\
&\quad + \sum_{k \in C_{j+}} \sum_{i \in I_{jk+}} \frac{|\varepsilon_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\| + |c_{ji}| \|\lambda_i\|}{|I_{jk+}|} + \sum_{i \in N_{j+}} |c_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\|
\end{aligned}$$
where I_{j+} contains all indices corresponding to positive c_{ji}. Similarly, we deduce
$$\begin{aligned}
H(Q_j, \tilde{Q}_j) &\leq \sum_{k \in C_{j-}} \sum_{i \in I_{jk-}} \left[ \frac{|\varepsilon_{ji}|\|\lambda_i\|}{|I_{jk-}|} + \left(1 - \frac{1}{|I_{jk-}|}\right) |c_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\| \right] \\
&\quad + \sum_{k \in C_{j-}} \sum_{i \in I_{jk-}} \frac{|\varepsilon_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\| + |c_{ji}| \|\lambda_i\|}{|I_{jk-}|} + \sum_{i \in N_{j-}} |c_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\|
\end{aligned}$$
where I_{j−} contains all i such that c_{ji} < 0. In total, these bounds together with Proposition 4 give
$$\begin{aligned}
\frac{1}{\rho} \cdot \max_{\mathbf{x} \in B} |v_j(\mathbf{x}) - \tilde{v}_j(\mathbf{x})| &\leq \sum_{k \in C_j} \sum_{i \in I_{jk}} \left[ \frac{|\varepsilon_{ji}|\|\lambda_i\|}{|I_{jk}|} + \left(1 - \frac{1}{|I_{jk}|}\right) |c_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\| \right] \\
&\quad + \sum_{k \in C_j} \sum_{i \in I_{jk}} \frac{|\varepsilon_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\| + |c_{ji}| \|\lambda_i\|}{|I_{jk}|} + \sum_{i \in N_j} |c_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\|
\end{aligned}$$
Here we used the notation C_j = C_{j+} ∪ C_{j−} = {1, 2, ..., K}, and I_{jk} is equal to either I_{jk+} or I_{jk−} depending on k ∈ C_j. Note that {i | i ∈ I_{jk}, k ∈ C_j} = {1, 2, ..., n} \ N_j ⊆ {1, 2, ..., n}, since every generator that is not null corresponds to some cluster center with the same sign. Also, using N_max ≥ |I_{jk}| ≥ N_min, it follows that
$$\begin{aligned}
\frac{1}{\rho} \cdot \max_{\mathbf{x} \in B} |v_j(\mathbf{x}) - \tilde{v}_j(\mathbf{x})| &\leq \sum_{i=1}^{n} \left[ \frac{|\varepsilon_{ji}|\|\lambda_i\|}{N_{\min}} + \left(1 - \frac{1}{N_{\max}}\right) |c_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\| \right] \\
&\quad + \sum_{i=1}^{n} \frac{|\varepsilon_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\| + |c_{ji}| \|\lambda_i\|}{N_{\min}} + \sum_{i \in N_j} |c_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\|
\end{aligned}$$
We compute the total cost combining all outputs by applying the inequality
$$\left(\sum_{j=1}^{m} |u_j|\right)^2 \leq m \sum_{j=1}^{m} |u_j|^2 \;\Leftrightarrow\; \sum_{j=1}^{m} |u_j| \leq \sqrt{m}\, \|(u_1, ..., u_m)\|$$
which yields
$$\begin{aligned}
\frac{1}{\rho} \cdot \max_{\mathbf{x} \in B} \|v(\mathbf{x}) - \tilde{v}(\mathbf{x})\|_1 &\leq \sum_{i=1}^{n} \left[ \frac{\sqrt{m}\, \varepsilon^{(i)} \|\lambda_i\|}{N_{\min}} + \left(1 - \frac{1}{N_{\max}}\right) \sqrt{m}\, \|C_{:,i}\| \left\|(\mathbf{a}_i^T, b_i)\right\| \right] \\
&\quad + \sum_{i=1}^{n} \frac{\sqrt{m}\, \varepsilon^{(i)} \left\|(\mathbf{a}_i^T, b_i)\right\| + \sqrt{m}\, \|C_{:,i}\| \|\lambda_i\|}{N_{\min}} + \sum_{j=1}^{m} \sum_{i \in N_j} |c_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\| \\
&\leq \sqrt{m}\, K \delta_{\max}^2 + \sqrt{m}\left(1 - \frac{1}{N_{\max}}\right) \sum_{i=1}^{n} \|C_{:,i}\| \left\|(\mathbf{a}_i^T, b_i)\right\| \\
&\quad + \frac{\sqrt{m}\,\delta_{\max}}{N_{\min}} \sum_{i=1}^{n} \left( \left\|(\mathbf{a}_i^T, b_i)\right\| + \|C_{:,i}\| \right) + \sum_{j=1}^{m} \sum_{i \in N_j} |c_{ji}| \left\|(\mathbf{a}_i^T, b_i)\right\|
\end{aligned}$$
as desired.