A Triangle Inequality for Cosines

Erich Schubert
1 Introduction
Cosine similarity has some interesting properties that make it a popular choice
in certain applications, in particular in text analysis. First of all, it is easy to see
that sim(x, y) = sim(αx, y) = sim(x, αy) for any α > 0, i.e., the similarity is
invariant to scaling vectors with a positive scalar. In text analysis, this often is a
desirable property as repeating the contents of a document multiple times does
not change the information of the document substantially. Formally, Cosine similarity can be seen as the dot product of L2-normalized vectors, i.e., $\mathrm{sim}(x, y) = \langle x, y\rangle / (\|x\| \cdot \|y\|)$. Secondly,
the computation of Cosine similarity is fairly efficient for sparse vectors: rather
than storing the vectors as a long array of values, most of which are zero, they
can be encoded for example as pairs (i, v) of an index i and a value v, where
only the non-zero pairs are stored and kept in sorted order. The dot product of
two such vectors can then be efficiently computed by a merge operation, where
only those indexes i need to be considered that occur in both vectors, because in $\langle x, y\rangle = \sum_i x_i y_i$ only those terms matter where both $x_i$ and $y_i$ are non-zero.
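To make this concrete, the following is a minimal sketch (our illustration, not code from any particular library) of the merge-based dot product over two sparse vectors stored as sorted index arrays with parallel value arrays; the representation and method name are assumptions for illustration.

```java
// Sketch: dot product of two sparse vectors, each stored as a sorted array of
// indexes with a parallel array of the corresponding non-zero values.
class SparseDot {
    static double dot(int[] xIdx, double[] xVal, int[] yIdx, double[] yVal) {
        double sum = 0;
        int i = 0, j = 0;
        while (i < xIdx.length && j < yIdx.length) {
            if (xIdx[i] < yIdx[j]) {
                i++;                          // index only present in x: term is zero
            } else if (xIdx[i] > yIdx[j]) {
                j++;                          // index only present in y: term is zero
            } else {
                sum += xVal[i++] * yVal[j++]; // index present in both: accumulate x_i * y_i
            }
        }
        return sum;
    }
}
```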
In popular literature, you will often find the claim that Cosine similarity is
more suited for high-dimensional data. As we will see below, it cannot be supe-
rior to Euclidean distance because of the close relationship of the two, hence this
must be considered a myth. Research on intrinsic dimensionality has shown that
Cosine similarity is also affected by the distance concentration effect [12] as well
as the hubness phenomenon [16], two key aspects of the “curse of dimensional-
ity” [22]. The main difference is that we are usually using the Cosine similarity
on sparse data, which has a much lower intrinsic dimensionality than the vector
space dimensionality suggests.
Consider the Euclidean distance of two normalized vectors x and y. By ex-
panding the binomials, we obtain:
$d_{\text{Euclidean}}(x, y) := \sqrt{\sum\nolimits_i (x_i - y_i)^2} = \sqrt{\sum\nolimits_i (x_i^2 + y_i^2 - 2 x_i y_i)}$
$\phantom{d_{\text{Euclidean}}(x, y)} = \sqrt{\|x\|^2 + \|y\|^2 - 2\langle x, y\rangle} = \sqrt{\langle x, x\rangle + \langle y, y\rangle - 2\langle x, y\rangle}$   (1)
if $\|x\| = \|y\| = 1$: $\quad = \sqrt{2 - 2 \cdot \mathrm{sim}(x, y)}$   (2)
where the last step relies on the vectors being normalized to unit length. Hence
we have an extremely close relationship between Cosine similarity and (squared)
Euclidean distance of the normalized vectors:
$\mathrm{sim}(x, y) = 1 - \tfrac{1}{2}\, d_{\text{Euclidean}}^2\left(\tfrac{x}{\|x\|}, \tfrac{y}{\|y\|}\right).$   (3)
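As a quick numerical sanity check of Eq. 3 (a self-contained sketch with arbitrarily chosen example vectors), normalizing two vectors and comparing $1 - \tfrac{1}{2} d_{\text{Euclidean}}^2$ to the Cosine similarity should agree up to rounding:

```java
// Sketch: verify sim(x, y) = 1 - 0.5 * d_Euclidean(x/|x|, y/|y|)^2  (Eq. 3).
public class RelationCheck {
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 0, 3}, y = {0, 1, 4, 2};       // arbitrary example vectors
        double nx = Math.sqrt(dot(x, x)), ny = Math.sqrt(dot(y, y));
        double sim = dot(x, y) / (nx * ny);                 // Cosine similarity
        double d2 = 0;                                      // squared Euclidean distance of the normalized vectors
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] / nx - y[i] / ny;
            d2 += diff * diff;
        }
        System.out.println(sim + " vs " + (1 - 0.5 * d2));  // both should print (almost) the same value
    }
}
```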
While we can also compute Euclidean distance more efficiently for sparse
vectors using the scalar product form of Eq. 1, this computation is prone to a numerical problem called "catastrophic cancellation" for small distances (when $\langle x, x\rangle \approx \langle x, y\rangle \approx \langle y, y\rangle$), which can be problematic in clustering (see, e.g., [18,9]).
Hence, working with Cosines directly is preferable when possible, and an additional motivation for this work was to obtain a triangle inequality directly on the similarities, avoiding this numerical problem (although, as we will see below, we cannot avoid it completely unless we can afford to compute many trigonometric functions).
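The following toy sketch (made-up vectors, single precision to make the effect visible) illustrates the cancellation: for two nearly identical unit vectors, the dot-product form of Eq. 1 loses most significant digits, while the direct difference form does not.

```java
// Sketch: catastrophic cancellation in the dot-product form of Eq. 1.
public class CancellationDemo {
    public static void main(String[] args) {
        float[] x = {0.6f, 0.8f};                      // a unit vector
        float[] y = {0.6f + 1e-4f, 0.8f - 0.75e-4f};   // a tiny, made-up perturbation of x
        float xx = x[0] * x[0] + x[1] * x[1];
        float yy = y[0] * y[0] + y[1] * y[1];
        float xy = x[0] * y[0] + x[1] * y[1];
        // <x,x>, <y,y>, <x,y> are all close to 1, so their difference is dominated by rounding error:
        double viaDots = Math.sqrt(Math.max(0, xx + yy - 2 * xy));
        double direct = Math.hypot(x[0] - y[0], x[1] - y[1]);  // the difference form keeps the small distance
        System.out.println(viaDots + " vs " + direct);
    }
}
```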
In common literature, the term "Cosine distance" usually refers to a dissimilarity function defined as
$d_{\text{Cosine}}(x, y) := 1 - \mathrm{sim}(x, y)$   (4)
which unfortunately is not a metric, i.e., it does not satisfy the triangle inequality.
There are two less common alternatives, namely:
$d_{\text{SqrtCosine}}(x, y) := \sqrt{2 - 2\,\mathrm{sim}(x, y)} = d_{\text{Euclidean}}\left(\tfrac{x}{\|x\|}, \tfrac{y}{\|y\|}\right)$   (5)
$d_{\arccos}(x, y) := \arccos(\mathrm{sim}(x, y))$   (6)
both of which are metrics (and, e.g., available in ELKI [19]).
Eq. 5 directly follows from Eq. 3, while the second one is the angle between the
vectors itself (the arc length, not the cosine of the angle), for which we easily
obtain the triangle inequality by looking at the arc through x, y, z. We will use
these metrics below to obtain a triangle inequality for Cosines.
Because the triangle inequality is the central rule for avoiding distance compu-
tations in many metric search indexes (as well as in many other algorithms), we
would like to obtain a triangle inequality for Cosine similarity. Given the close
relationship to squared Euclidean distance outlined in the previous section, one
obvious approach would be to just use Euclidean distance instead of Cosine. If
we know that our data is normalized (which is a best practice when using Cosine
similarities), we can make the computation slightly more efficient using Eq. 5,
but we wanted to avoid this because (i) computing the square root takes 10–
50 CPU cycles (depending on the exact CPU, precision, and input value) and
(ii) the subtraction in this equation is prone to catastrophic cancellation when
the two vectors are similar, i.e., we may have precision issues when finding the
nearest neighbors. Hence, we would like to develop techniques that primarily rely on similarities instead of distances, yet allow pruning similar to that of the (very successful) metric search acceleration techniques.
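To recall how such pruning works, here is a generic pivot-filtering sketch in the style of LAESA-like methods (our simplification, not code from any of the cited systems): a precomputed pivot distance and the triangle inequality yield a lower bound on d(q, x) that lets us skip candidates in a range query without computing their distance.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: pivot filtering for a range query. The triangle inequality gives
// d(q, x) >= |d(q, p) - d(p, x)|; if this lower bound already exceeds the
// query radius, the exact distance d(q, x) never needs to be computed.
public class PivotRangeQuery {
    interface Metric { double distance(double[] a, double[] b); }

    static List<double[]> rangeQuery(List<double[]> data, double[] pivot,
                                     double[] pivotDist,   // precomputed d(pivot, data[i])
                                     double[] q, double radius, Metric d) {
        List<double[]> result = new ArrayList<>();
        double dqp = d.distance(q, pivot);                  // one distance computation to the pivot
        for (int i = 0; i < data.size(); i++) {
            double lower = Math.abs(dqp - pivotDist[i]);    // triangle-inequality lower bound
            if (lower > radius) continue;                   // pruned without computing d(q, data[i])
            if (d.distance(q, data.get(i)) <= radius)
                result.add(data.get(i));
        }
        return result;
    }
}
```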
Using Eq. 5 and the triangle inequality of Euclidean distance, we obtain
$\sqrt{1 - \mathrm{sim}(x, y)} \leq \sqrt{1 - \mathrm{sim}(x, z)} + \sqrt{1 - \mathrm{sim}(z, y)}$
$\mathrm{sim}(x, y) \geq 1 - \left(\sqrt{1 - \mathrm{sim}(x, z)} + \sqrt{1 - \mathrm{sim}(z, y)}\right)^2$
$\mathrm{sim}(x, y) \geq \mathrm{sim}(x, z) + \mathrm{sim}(z, y) - 1 - 2\sqrt{(1 - \mathrm{sim}(x, z))(1 - \mathrm{sim}(z, y))}$   (7)
which, unfortunately, does not appear to allow much further simplification. In
order to remove the square root, we can approximate it using the smaller of the
two similarities sim⊥ (x, y, z) := min{sim(x, z), sim(z, y)}:
$\mathrm{sim}(x, y) \geq \mathrm{sim}(x, z) + \mathrm{sim}(z, y) - 1 - 2\,(1 - \mathrm{sim}_{\bot}(x, y, z))$
$\mathrm{sim}(x, y) \geq \mathrm{sim}(x, z) + \mathrm{sim}(z, y) + 2\,\mathrm{sim}_{\bot}(x, y, z) - 3$   (8)
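As a small sketch (the method names are ours), both bounds translate directly into code and only require the two already-known similarities:

```java
// Sketch: lower bounds for sim(x, y) given sim(x, z) and sim(z, y).
class SimilarityBounds {
    static double lowerBoundEuclid(double simXZ, double simZY) {   // Eq. 7
        return simXZ + simZY - 1 - 2 * Math.sqrt((1 - simXZ) * (1 - simZY));
    }

    static double lowerBoundCosine(double simXZ, double simZY) {   // Eq. 8
        return simXZ + simZY + 2 * Math.min(simXZ, simZY) - 3;
    }
    // Values below -1 can be clamped to the trivial bound of -1.
}
```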
A tighter bound can be obtained from the arc length metric $d_{\arccos}$ of Eq. 6. Its triangle inequality gives $\arccos(\mathrm{sim}(x, y)) \leq \arccos(\mathrm{sim}(x, z)) + \arccos(\mathrm{sim}(z, y))$, and because the cosine is monotonically decreasing on $[0, \pi]$, this translates into
$\mathrm{sim}(x, y) \geq \cos\left(\arccos(\mathrm{sim}(x, z)) + \arccos(\mathrm{sim}(z, y))\right)$   (9)
(the bound remains valid when the two angles sum to more than $\pi$, because the angle between x and y is then also bounded by $2\pi$ minus this sum). Using the addition theorem $\cos(\alpha + \beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta$ together with $\sin(\arccos s) = \sqrt{1 - s^2}$, we can avoid the trigonometric functions entirely at the cost of a square root:
$\mathrm{sim}(x, y) \geq \mathrm{sim}(x, z) \cdot \mathrm{sim}(z, y) - \sqrt{(1 - \mathrm{sim}(x, z)^2) \cdot (1 - \mathrm{sim}(z, y)^2)}$   (10)
This triangle inequality is tighter than the one based on Euclidean distance, and hence we can expect better pruning power than using an index for Euclidean distance or $d_{\text{SqrtCosine}}$ (Eq. 5) in a metric index, while the computational cost has been reduced to the low "overhead" of Euclidean distances. Eq. 9 suggests that this is the tightest possible bound we can obtain, because it directly uses the angles rather than the chord lengths used by Euclidean distance. This bound yields a very interesting insight: while the triangle inequality for Euclidean distances – and for the arc lengths – is additive, the main term of this inequality in the Cosine domain is multiplicative.
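The following sketch (method names are ours) contrasts a literal implementation of Eq. 9 with the trigonometry-free form of Eq. 10; both return the same value, but the second needs only multiplications and one square root, which is exactly the difference measured in the runtime experiments below.

```java
// Sketch: the Arccos form (Eq. 9) and the mathematically equivalent Mult form (Eq. 10).
class ArccosVsMult {
    static double lowerBoundArccos(double simXZ, double simZY) {   // Eq. 9: three trigonometric calls
        return Math.cos(Math.acos(simXZ) + Math.acos(simZY));
    }

    static double lowerBoundMult(double simXZ, double simZY) {     // Eq. 10: one square root, no trigonometry
        return simXZ * simZY
             - Math.sqrt((1 - simXZ * simXZ) * (1 - simZY * simZY));
    }
}
```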
We also investigated approximations to further reduce the computation overhead. Approximating the last term of Eq. 10 using only the smaller of the two similarities yields a cheap bound that is tighter than Eq. 8, but still too loose.
We can also expand and approximate the last term using both the smaller
and the larger value sim> (x, y, z) := max{sim(x, z), sim(z, y)}:
$\sqrt{(1 - \mathrm{sim}(x, z)^2) \cdot (1 - \mathrm{sim}(z, y)^2)}$
$= \sqrt{(1 - \mathrm{sim}(x, z)) \cdot (1 + \mathrm{sim}(x, z)) \cdot (1 - \mathrm{sim}(z, y)) \cdot (1 + \mathrm{sim}(z, y))}$
$\leq \sqrt{(1 + \mathrm{sim}_{>}(x, y, z))^2 \cdot (1 - \mathrm{sim}_{\bot}(x, y, z))^2}$
$= (1 + \mathrm{sim}_{>}(x, y, z)) \cdot (1 - \mathrm{sim}_{\bot}(x, y, z))$
$= 1 + \mathrm{sim}_{>}(x, y, z) - \mathrm{sim}_{\bot}(x, y, z) - \mathrm{sim}(x, z) \cdot \mathrm{sim}(z, y)$
Substituting this into Eq. 10 yields the bound $\mathrm{sim}(x, y) \geq 2\,\mathrm{sim}(x, z) \cdot \mathrm{sim}(z, y) - 1 - \mathrm{sim}_{>}(x, y, z) + \mathrm{sim}_{\bot}(x, y, z)$.
Fig. 1: Four points A, B, C, D illustrating Ptolemy's theorem for concyclic points, and the counter example for angular space.
A related tool for accelerating similarity search is Ptolemy's inequality, used by Ptolemaic indexing [6,10]: for any four points A, B, C, D in Euclidean space, $\|AC\| \cdot \|BD\| \leq \|AB\| \cdot \|CD\| + \|BC\| \cdot \|AD\|$, i.e., the two products of opposing sides sum to more than the product of the diagonals. If the four points are concyclic (as illustrated in Fig. 1), this becomes an
equality. While this has some interesting properties for data indexing, it does
not appear to be suitable for angular space, as shown by the counter example
illustrated in Fig. 1, despite the similarity of the setting. The key difference is
that we are interested in the arc lengths, whereas Ptolemy’s inequality uses the
chord lengths. For a simple counter example, we place the four points equally spaced on the equator (or any other great circle) of a 3d sphere for illustrative purposes (the example also works in 2-dimensional data). The rectangle (on the sphere) becomes a great circle, and as we spaced the points equally, the angle from one point to the next is $\pi/2$. The diagonals connect antipodal points and hence have angle $\pi$. It is easy to see that $\frac{\pi}{2} \cdot \frac{\pi}{2} + \frac{\pi}{2} \cdot \frac{\pi}{2} = \frac{\pi^2}{2} \not\geq \pi^2$, and hence the angles are not Ptolemaic. Using the cosines of the angles and Eq. 4, we get $1 - \cos\frac{\pi}{2} = 1$ for each side and $1 - \cos\pi = 2$ for each diagonal, and $1 \cdot 1 + 1 \cdot 1 = 2 \not\geq 2 \cdot 2 = 4$, so the Cosine distance of Eq. 4 is not Ptolemaic either.
4 Experiments
Table 1 summarizes the six bounds that we compare concerning their suitability
for metric indexing. Note that we will not investigate the actual performance in
a similarity index here, but plan to do this in future work. Instead, we want to
focus on the bounds themselves concerning three properties:
1. how tight the bounds are, i.e., how much pruning power we lose
2. whether we can observe numerical instabilities
3. the differences in the computational effort necessary
Fig. 3: Lower bounds for the similarity sim(x, y) given sim(x, z) and sim(z, y)
using different inequalities from Table 1.
In Fig. 2, we plot the resulting lower bound for the similarity sim(x, y) given
sim(x, z) and sim(z, y) using the Euclidean-based bound (Eq. 7) in Fig. 2a and
the Arccos-based bound (Eq. 9) in Fig. 2b. A striking difference is visible in the
negative domain: if x and z are opposite directions, and z and y are also opposite,
then x and y must in turn be similar. The Arccos-based bound produces positive
bounds here, while the Euclidean-based bound can go down to −7. But in many
cases where we employ Cosine similarity, our data will be restricted to the non-
negative domain, so this is likely not an issue, and could maybe be solved with a
simple sign check. Upon closer inspection, we can observe that the bounds found
by the Arccos-based approach tend to be substantially higher in particular for
input similarities around 0.5. Fig. 2c visualizes the difference between the two
bounds. We can see that the Euclidean bounds are never higher than the Arccos
bound, which is unsurprising, as the latter is tight while the former is merely a lower
bound. But we can also see that the difference between the two (and hence the
pruning power) can be as big as 0.5. This maximum is attained when the input
Cosine similarities are 0.5 (i.e., the known angles are 60°): the Euclidean bound is −1 then, while the Arccos-based bound is −0.5.

Fig. 5: Differences between simplified bounds and the tight arccos bound.

In the typical use case of Cosine on non-negative values, both bounds are effectively trivial. However, there still
is a substantial difference for larger input similarities. Averaging over a uniformly sampled grid of input values, considering only those where both bounds are non-
negative, the average Euclidean bound is 0.2447, while the average Arccos-based
bound is 0.3121, about 27.5% higher. Hence, using the Arccos-based bound is
likely to yield better performance.
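Such an average can be computed with a few lines; the sketch below is our illustration, and the grid range and step size are assumptions not stated in the text, so the resulting numbers may deviate slightly.

```java
// Sketch: average the Euclidean-based (Eq. 7) and Arccos-based (Eq. 9) lower bounds
// over a grid of input similarities, counting only points where both bounds are non-negative.
public class BoundAverages {
    public static void main(String[] args) {
        double sumEuclid = 0, sumArccos = 0;
        long count = 0;
        int steps = 1000;                                 // assumed grid resolution
        for (int i = 0; i <= steps; i++) {
            for (int j = 0; j <= steps; j++) {
                double a = i / (double) steps, b = j / (double) steps;
                double euclid = a + b - 1 - 2 * Math.sqrt((1 - a) * (1 - b));
                double arccos = Math.cos(Math.acos(a) + Math.acos(b));
                if (euclid >= 0 && arccos >= 0) {
                    sumEuclid += euclid;
                    sumArccos += arccos;
                    count++;
                }
            }
        }
        System.out.println(sumEuclid / count + " " + sumArccos / count);
    }
}
```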
In the following, we focus on the non-negative domain, to improve the read-
ability of the figures. In Fig. 3, we show all six bounds from Table 1. Fig. 3a is
the Euclidean bound, Fig. 3b is the Arccos bound we just saw. Fig. 3c is the
multiplicative version (Eq. 10), which yields no noticeable difference to the Arc-
cos bound (mathematically, they are equivalent). Fig. 3d is the simplified bound
derived from Cosine, whereas Fig. 3e and Fig. 3f are the two bounds derived
from the multiplicative version of the Arccos bound. We observe that Mult-LB1
(Fig. 3f) is the best of the lower bounds, but also that none of the simplified
bounds is a very close approximation to the optimal bounds in the first row. We
obtain the relationship of the presented lower bounds visualized in Fig. 4. Even though the Mult-LB1 bound is the best of the simplified bounds, the divergence from the arccos bound (cf. Fig. 5)
can be quite substantial, at least when the two input similarities are not very
close. As can be seen from the isolines in the figures (at steps of 0.1), even if we consider a bound that is worse by 0.1 or 0.2 acceptable, there remains a fairly large region of relevant inputs (e.g., where one similarity is close to 1.0 and the other close to 0.8) where the loss in pruning performance may well outweigh the slightly larger computational cost of using the Mult bound instead.
All runtime measurements were performed on an Intel i7-8650U using a single thread, and with the CPU's
turbo-boost disabled such that the clock rate is stable at 1.9 GHz to reduce
measurement noise as well as heat effects. As a baseline, we include a simple
add operation to measure the cost of memory access to a pre-generated array of
2 million random numbers. Because trigonometric functions are fairly expensive,
we also evaluate the JaFaMa library for fast math as an alternative to the JDK
built-ins. JMH is set to perform 5 warmup iterations and 10 measurement itera-
tions of 10 seconds each, to improve the accuracy of our measurements. We try to
follow best practices in Java benchmarking (JMH is a well-suited tool for micro-
benchmarking in Java), but nevertheless, the results with different programming
languages (such as C) can be different due to different compiler optimization, and
the usual pitfalls with runtime benchmarks remain [8]. Table 2 gives the results
of our experiments. In these experiments, the runtime benefits of the simplified
equations are minuscule. Apparently, the CPU can alleviate the latency of the square root to a large extent (e.g., via pipelining), and compared to the memory
access cost of the baseline operation, the additional 1.6 nanoseconds will likely
not matter for most applications. The benchmark, however, clearly shows the
benefit of the “Mult” version over the “Arccos” version, which mathematically
is equivalent but differs considerably in run time. While the use of JaFaMa as
replacement reduces the runtime considerably, the much simpler “Mult” version
still wins hands-down and hence is the version we ultimately recommend using.
While "Mult-LB2" is marginally faster, it is also much less accurate and hence less useful, as seen in Section 4.1.
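For readers who want to repeat the comparison, a JMH benchmark could be set up roughly as follows; this is a sketch following the description above (2 million pre-generated random inputs, 5 warmup and 10 measurement iterations of 10 s), not the actual harness used for Table 2, and the seed and input distribution are assumptions.

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

// Sketch: JMH micro-benchmark comparing the Arccos (Eq. 9) and Mult (Eq. 10) variants.
@State(Scope.Thread)
@Warmup(iterations = 5, time = 10, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 10, timeUnit = TimeUnit.SECONDS)
public class BoundBenchmark {
    double[] a = new double[2_000_000];
    double[] b = new double[2_000_000];

    @Setup
    public void setup() {
        Random rnd = new Random(42);                      // assumed seed and input distribution
        for (int i = 0; i < a.length; i++) {
            a[i] = rnd.nextDouble();
            b[i] = rnd.nextDouble();
        }
    }

    @Benchmark
    public double arccos() {
        double sum = 0;
        for (int i = 0; i < a.length; i++)
            sum += Math.cos(Math.acos(a[i]) + Math.acos(b[i]));
        return sum;                                       // returned so the JIT cannot drop the loop
    }

    @Benchmark
    public double mult() {
        double sum = 0;
        for (int i = 0; i < a.length; i++)
            sum += a[i] * b[i] - Math.sqrt((1 - a[i] * a[i]) * (1 - b[i] * b[i]));
        return sum;
    }
}
```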
5 Conclusions
In this article, we introduce a triangle inequality for Cosine similarity. We study
different ways of obtaining a triangle inequality, as well as different attempts
at finding an even faster bound. The experiments show that a mathematically
equivalent version² of the Arccos-based bound is the best trade-off of accuracy and run time.

² Eq. 10 expanded using $(1 - x^2) = (1 + x)(1 - x)$ to obtain the variant $\mathrm{sim}(x, z) \cdot \mathrm{sim}(z, y) - \sqrt{(1 + \mathrm{sim}(x, z))(1 - \mathrm{sim}(x, z))(1 + \mathrm{sim}(z, y))(1 - \mathrm{sim}(z, y))}$
We cannot, however, rule out that there exists a more efficient equation that
could be used instead. As this paper shows, there can be more than one version
of the same bound that performs very differently due to the functions involved.
We hope to spur new research in the domain of accelerating similarity search
with metric indexes, as this equation allows many existing indexes (such as M-
trees, VP-trees, cover trees, LAESA, and many more) to be transformed into
an efficient index for Cosine similarity. Integrating this equation into algorithms
will enable the acceleration of data mining algorithms in various domains, and
the use of Cosine similarity directly (without having to transform the similarities
into distances first) may both allow simplification as well as optimization of al-
gorithms. Furthermore, we hope that this research can eventually be transferred
to other similarity functions besides Cosine similarity. We believe it is a valuable
insight that the triangle inequality for Cosine distance contains the product of
the existing similarities (but also a non-negligible correction term), whereas the
triangle inequality for distance metrics is additive. We wonder if there exists a
similarity equivalent of the definition of a metric (i.e., a “simetric”), with sim-
ilar axioms but for the dual case of similarity functions, but the results above
indicate that we will likely not be able to obtain a much more elegant general
formulation of a triangle inequality for similarities.
References
1. Beygelzimer, A., Kakade, S.M., Langford, J.: Cover trees for nearest
neighbor. In: Int. Conf. Machine Learning, ICML. pp. 97–104 (2006).
https://doi.org/10.1145/1143844.1143857
2. Bozkaya, T., Özsoyoglu, Z.M.: Indexing large metric spaces for similar-
ity search queries. ACM Trans. Database Syst. 24(3), 361–404 (1999).
https://doi.org/10.1145/328939.328959
3. Brin, S.: Near neighbor search in large metric spaces. In: Dayal, U., Gray, P.M.D.,
Nishio, S. (eds.) Int. Conf. Very Large Data Bases, VLDB. pp. 574–584. Morgan
Kaufmann (1995)
4. Chávez, E., Ludueña, V., Reyes, N., Roggero, P.: Faster proximity searching with
the distal SAT. In: Int. Conf. Similarity Search and Applications, SISAP. pp. 58–69
(2014). https://doi.org/10.1007/978-3-319-11988-5_6
5. Ciaccia, P., Patella, M., Zezula, P.: M-tree: An efficient access method for similarity
search in metric spaces. In: Int. Conf. Very Large Data Bases, VLDB. pp. 426–435
(1997)
6. Hetland, M.L., Skopal, T., Lokoc, J., Beecks, C.: Ptolemaic access methods: Chal-
lenging the reign of the metric space model. Inf. Syst. 38(7), 989–1006 (2013).
https://doi.org/10.1016/j.is.2012.05.011
7. Jagadish, H.V., Ooi, B.C., Tan, K., Yu, C., Zhang, R.: iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Trans. Database
Syst. 30(2), 364–397 (2005). https://doi.org/10.1145/1071610.1071612
8. Kriegel, H., Schubert, E., Zimek, A.: The (black) art of runtime evaluation: Are
we comparing algorithms or implementations? Knowl. Inf. Syst. 52(2), 341–378
(2017). https://doi.org/10.1007/s10115-016-1004-2
9. Lang, A., Schubert, E.: BETULA: numerically stable cf-trees for BIRCH clustering.
In: Int. Conf. Similarity Search and Applications, SISAP. pp. 281–296 (2020).
https://doi.org/10.1007/978-3-030-60936-8_22
10. Lokoc, J., Hetland, M.L., Skopal, T., Beecks, C.: Ptolemaic indexing of the sig-
nature quadratic form distance. In: Int. Conf. Similarity Search and Applications.
pp. 9–16 (2011). https://doi.org/10.1145/1995412.1995417
11. Micó, L., Oncina, J., Vidal, E.: A new version of the nearest-neighbour ap-
proximating and eliminating search algorithm (AESA) with linear preprocess-
ing time and memory requirements. Pattern Recognit. Lett. 15(1), 9–17 (1994).
https://doi.org/10.1016/0167-8655(94)90095-7
12. Nanopoulos, A., Radovanovic, M., Ivanovic, M.: How does high dimensionality
affect collaborative filtering? In: ACM Conf. Recommender Systems, RecSys. pp.
293–296 (2009). https://doi.org/10.1145/1639714.1639771
13. Navarro, G.: Searching in metric spaces by spatial approximation. VLDB J. 11(1),
28–46 (2002). https://doi.org/10.1007/s007780200060
14. Novak, D., Batko, M., Zezula, P.: Metric index: An efficient and scalable solution
for precise and approximate similarity search. Inf. Syst. 36(4), 721–733 (2011).
https://doi.org/10.1016/j.is.2010.10.002
15. Omohundro, S.M.: Five balltree construction algorithms. Tech. Rep. TR-89-063,
International Computer Science Institute (ICSI) (1989)
16. Radovanovic, M., Nanopoulos, A., Ivanovic, M.: Nearest neighbors in high-
dimensional data: the emergence and influence of hubs. In: Int. Conf. Machine
Learning, ICML. pp. 865–872 (2009). https://doi.org/10.1145/1553374.1553485
17. Ruiz, G., Santoyo, F., Chávez, E., Figueroa, K., Tellez, E.S.: Extreme pivots for
faster metric indexes. In: Int. Conf. Similarity Search and Applications, SISAP.
pp. 115–126 (2013). https://doi.org/10.1007/978-3-642-41062-8_12
18. Schubert, E., Gertz, M.: Numerically stable parallel computation of (co-)variance.
In: Int. Conf. Scientific and Statistical Database Management, SSDBM. pp. 10:1–
10:12 (2018). https://doi.org/10.1145/3221269.3223036
19. Schubert, E., Zimek, A.: ELKI: A large open-source library for data analysis -
ELKI release 0.7.5 "Heidelberg". CoRR abs/1902.03616 (2019), http://arxiv.
org/abs/1902.03616
20. Uhlmann, J.K.: Satisfying general proximity/similarity queries with metric
trees. Inf. Process. Lett. 40(4), 175–179 (1991). https://doi.org/10.1016/0020-
0190(91)90074-R
21. Yianilos, P.N.: Data structures and algorithms for nearest neighbor search in gen-
eral metric spaces. In: ACM/SIGACT-SIAM Symposium on Discrete Algorithms,
SODA. pp. 311–321 (1993)
22. Zimek, A., Schubert, E., Kriegel, H.: A survey on unsupervised outlier detection
in high-dimensional numerical data. Stat. Anal. Data Min. 5(5), 363–387 (2012).
https://doi.org/10.1002/sam.11161