NMSLib Manual
Language Technologies Institute,
Carnegie Mellon University,
Pittsburgh, PA, USA
[email protected]
Version 1.5
1 Introduction
The library can be used from C++ and Python (via Python bindings, see § 2.6). In addition, it is also possible to
build a query server (see § 2.5), which can be used from Java (or other languages
supported by Apache Thrift). Java has a native client, i.e., it works on many
platforms without requiring a C++ library to be installed.
Even though our methods are generic, they often outperform specialized
methods for the Euclidean and/or angular distance (i.e., for the cosine simi-
larity). Tables 1 and 2 contain results (as of May 2016) of NMSLIB compared
to the best implementations that participated in a public evaluation code-named
ann-benchmarks. Our main competitors are:
Fig. 1: 1.19M vectors from GloVe (100 dimensions, trained from tweets), cosine
similarity.
Search methods for non-metric spaces are especially interesting. This domain
does not provide sufficiently generic exact search methods. We may know very
little about analytical properties of the distance or the analytical representation
may not be available at all (e.g., if the distance is computed by a black-box
device [49]). In many cases it is not possible to search exactly and instead one
has to resort to approximate search procedures.
This is why methods are evaluated in terms of efficiency-effectiveness trade-
offs rather than merely in terms of their efficiency. As mentioned previously,
we believe that there is no “one-size-fits-all” search method. Therefore, it is
important to provide a variety of methods each of which may work best for
some specific classes of data.
Our commitment to efficiency affected several design decisions:
query ball (centered at the query object q) with the radius r, or, formally, all
the objects {o_i} such that d(o_i, q) ≤ r. In generic spaces, the distance is not
necessarily symmetric. Thus, two types of queries can be considered. In a left
query, the object is the left argument of the distance function, while the query
is the right argument. In a right query, q is the first argument and the object is
the second, i.e., the right, argument.
The queries can be answered either exactly, i.e., by returning a complete
result set that does not contain erroneous elements, or, approximately, e.g.,
by finding only some answers. Thus, the methods are evaluated in terms of
efficiency-effectiveness trade-offs rather than merely in terms of their efficiency.
One common effectiveness metric is recall. In the case of the nearest neighbor
search, it is computed as an average fraction of true neighbors returned by the
method. If ground-truth judgements (produced by humans) are available, it is
also possible to compute the accuracy of a k-NN based classification (see § 3.5.2).
2 Getting Started
2.1 What’s new in version 1.5 (major changes)
– We have adopted a new method: a hierarchical (navigable) small-world graph
(HNSW), contributed by Yury Malkov [37], see § 5.5.2.
– We have improved the performance of two core methods: SW-graph (§ 5.5.1) and
NAPP (§ 5.4.5).
– We have written basic tuning guidelines for SW-graph, HNSW, and NAPP, see § 6.
– We have modified the workflow of our benchmarking utility experiment and
improved handling of the gold standard data, see § 3.4.6;
– We have updated the API so that methods can save and restore indices, see
§ 7.3.
– We have implemented a server, which can have clients in C++, Java, Python,
and other languages supported by Apache Thrift, see § 2.5.
– We have implemented generic Python bindings that work for non-vector
spaces, see § 2.6.
– Last, we retired the older methods permutation, permutation_incsort, and
permutation_vptree. The latter two methods are superseded by proj_incsort
and proj_vptree, respectively.
2.2 Prerequisites
NMSLIB was developed and tested on 64-bit Linux. Yet, almost all the code can
be built and run on 64-bit Windows (two notable exceptions are: LSHKIT and
NN-Descent). Building the code requires a modern C++ compiler that supports
C++11. Currently, we support GNU C++ (≥ 4.7), the Intel compiler (≥ 14), Clang
(≥ 3.4), and Visual Studio (≥ 12; one can use the free Express version). Under Linux, the build process relies on
CMake. Under Windows, one could use Visual Studio projects stored in the
repository. These projects are for Visual Studio 14 (2015). However, they can be
downgraded to work with Visual Studio 12 (see § 3.2).
More specifically, for Linux we require:
Installing C++11 compilers can be tricky, because they are not always provided
as a standard package; this is why we briefly review the installation process here.
In addition, installing a compiler does not necessarily make it the default compiler.
One way to fix this on Linux is to set the environment variables CXX and CC,
e.g., to g++-4.7 and gcc-4.7 for the GNU 4.7 compiler.
After Apache Thrift is installed, you need to build the library itself. Then,
change the directory to query_server/cpp_client_server and type make (the makefile
may need to be modified if Apache Thrift is installed to a non-standard
location). The query server has a similar set of parameters to the benchmarking
utility experiment. For example, you can start the server as follows:
There are also three sample clients implemented in C++, Python, and Java.
A client reads a string representation of the query object from the standard
stream. The format is the same as the format of objects in a data file. Here is an
example of searching for ten vectors closest to the first data set vector (stored
in row one) of a provided sample data file:
export DATA_FILE=../../sample_data/final8_10K.txt
head -1 $DATA_FILE | ./query_client -p 10000 -a localhost -k 10
For instructions on using generated code, please consult the Apache Thrift tu-
torial.
make
sudo make install
For an example of using our library in Python, see the script test_nmslib_vect.py.
Generic spaces are supported as well (see python_gen_bindings). However,
they work only for spaces that properly define serialization and
de-serialization (see a brief description in § 7.2).
Afterwards, one can simply use the provided Visual Studio solution file. The
solution file references several project (*.vcxproj) files: NonMetricSpaceLib.vcxproj
is the main project file that is used to build the library itself. The output is stored
in the folder similarity_search\x64. A more detailed description of the build
process on Windows is given in § 3.2.
Note that the core library, the test utilities, as well as examples of the standalone
applications (projects sample_standalone_app1 and sample_standalone_app2)
can be built without installing Boost.
The build process is different under Linux and Windows. In the following sections,
we consider these differences in more detail.
If you do not set variables CXX and CC, the default C++ compiler is used (which
can be fine, if it is the right compiler already).
To create makefiles for a release version of the code, type:
cmake -DCMAKE_BUILD_TYPE=Release .
If you did not create any makefiles before, you can shortcut by typing:
cmake .
To create makefiles for a debug version of the code, type:
cmake -DCMAKE_BUILD_TYPE=Debug .
Once the makefiles are created, build the code by typing:
make
If cmake complains about the wrong version of GCC, it is most likely that you
forgot to set the environment variables CXX and CC (as described above). If this
is the case, make these variables point to the correct version of the compiler.
Important note: do not forget to delete the cmake cache and makefiles before
re-creating the makefiles. For example, assuming the current directory is
similarity_search, you can delete the file CMakeCache.txt and the directory CMakeFiles.
Also note that, for some reason, cmake might sometimes ignore the environment
variables CXX and CC. In this unlikely case, you can specify the compiler directly
through cmake arguments, e.g., via -DCMAKE_CXX_COMPILER and -DCMAKE_C_COMPILER
for the GNU C++ Release build.
The build process creates several binaries, most importantly, the main benchmarking
utility experiment. The directory similarity_search/release contains
release versions of these binaries. Debug versions are placed into the folder
similarity_search/debug.
Important note: a shortcut command:
cmake .
(re)-creates makefiles for the previously created build. When you type cmake .
for the first time, it creates release makefiles. However, if you create debug make-
files and then type cmake ., this will not lead to creation of release makefiles!
If the user cannot install necessary libraries to a standard location, it is still
possible to build a project. First, download Boost to some local directory. Assume
it is $HOME/boost_download_dir. Then, set the corresponding environment
variable, which will inform cmake about the location of the Boost files:
export BOOST_ROOT=$HOME/boost_download_dir
Second, the user needs to install the additional libraries. Assume that the
lib-files are installed to $HOME/local_lib, while the corresponding include files are
installed to $HOME/local_include. Then, the user needs to invoke cmake with
the following arguments (after possibly deleting previously created cache and
makefiles):
cmake . -DCMAKE_LIBRARY_PATH=$HOME/local_lib \
-DCMAKE_INCLUDE_PATH=$HOME/local_include \
-DBoost_NO_SYSTEM_PATHS=true
Note the last option: sometimes an old version of Boost is installed in a standard
system location. Setting the variable Boost_NO_SYSTEM_PATHS to true tells cmake
to ignore such an installation.
To use the library in external applications, which do not belong to the library
repository, one needs to install the library first. Assume that an installation loca-
tion is the folder NonMetrLibRelease in the home directory. Then, the following
commands do the trick:
cmake \
-DCMAKE_INSTALL_PREFIX=$HOME/NonMetrLibRelease \
-DCMAKE_BUILD_TYPE=Release .
make install
The directory sample_standalone_app contains two sample programs (see the files
sample_standalone_app1.cc and sample_standalone_app2.cc) that use NMSLIB
binaries installed in the folder $HOME/NonMetrLibRelease.
one can simply run the binary eclipse (in a newly created directory eclipse).
On the first start, Eclipse will ask you to select a workspace location. This is the
place where the project metadata and (optionally) the actual project source
files are stored. The following description is given for Eclipse Europa; it may be a bit
different for newer versions of Eclipse.
After selecting the workspace, the user can import the Eclipse project stored
in the GitHub repository. Go to the menu File, sub-menu Import, category
General and choose to import an existing project into the workspace as shown
in Fig. 3. After that, select a root directory. To this end, go to the directory
where you checked out the contents of the GitHub repository and enter
the sub-directory similarity_search. You should now be able to see the project
Non-Metric-Space-Library as shown in Fig. 4. You can now finalize the import
by pressing the button Finish.
After building you can debug the project. To do this, you need to create
a debug configuration. As an example, one configuration can be found in the
project folder launches. Right click on the item sample.launch, choose the
option Debug as (in the drop-down menu), and click on sample (in the pop-up
menu). Do not forget to edit command line arguments before you actually debug
the application!
After switching to the debug perspective, Eclipse may stop the debugger
in the file dl-debug.c as shown in Fig. 7. If this happens, simply press the
continue icon a couple of times until the debugger enters the code belonging to
the library.
Additional configurations can be created by right clicking on the project
name (left pane), selecting Properties in the pop-up menu and clicking on
Run/Debug settings. The respective screenshot is shown in Fig. 6.
Note that this manual contains only a basic introduction to Eclipse. If the
user is new to Eclipse, we recommend reading additional documentation available
online.
Download Visual Studio 2015 Express for Desktop. Download and install the respective
Boost binaries. Please use the default installation directory on disk C:. At
the end of this section, we explain how to select a different location of the Boost
files, as well as how to downgrade the project to build it with Visual Studio 2013
(if this is really necessary).
After downloading Visual Studio and installing Boost (version 1.59, 64-bit
binaries for MSVC-14), it is straightforward to build the project using the provided
Visual Studio solution file. The solution file references several (sub-)project
(*.vcxproj) files, which can be built either separately or all together.
The main sub-project is NonMetricSpaceLib, which is built before any other
sub-project. The sub-projects sample_standalone_app1 and sample_standalone_app2
are examples of using the library in a standalone mode. Unlike building under
Linux, we provide no installation procedure yet. In a nutshell, the installation
consists of copying the library binary as well as the directory with header files.
There are three possible configurations for the binaries: Release, Debug, and
RelWithDebInfo (release with debug information). The corresponding output
files are placed into the subdirectories:
similarity_search\x64\Release,
similarity_search\x64\Debug,
similarity_search\x64\RelWithDebInfo.
Unlike other compilers, there seems to be no way to automatically detect the CPU
type in Visual Studio (it is also not possible to opt for using only SSE4). By
default, only SSE2 is enabled (because it is supported by all 64-bit CPUs).
Therefore, if the user's CPU supports AVX extensions, it is recommended to modify
the code generation settings as shown in the screenshot in Fig. 8. This should
be done for all sub-projects and all binary configurations. Note that you can set
a property for all projects at once if you select all the sub-projects, right-click,
and then choose Properties in the pop-up menu.
The core library, the unit test binary, as well as examples of the standalone
applications can be built without installing Boost. However, Boost libraries
are required for the binaries experiment.exe, tune_vptree.exe, and
test_integr.exe.
We reiterate that one needs 64-bit Boost binaries compiled with the
same version of Visual Studio as the NMSLIB binaries. If you download the
installer for Boost 1.59 and install it to the default location, then you do not have to
change the project files. Should you install Boost into a different folder, the location
of the Boost binaries and header files needs to be specified in the project settings for
all three build configurations (Release, Debug, RelWithDebInfo). An example
of specifying the location of Boost libraries (binaries) is given in Fig. 9.
In the unlikely case that the user has to use the older Visual Studio 12, the project
files need to be downgraded. To do so, one has to manually edit every *.vcxproj
file by replacing each occurrence of <PlatformToolset>v140</PlatformToolset>
with <PlatformToolset>v120</PlatformToolset>. Additionally, one has to
download Boost binaries compatible with the older Visual Studio and modify
the project files accordingly. In particular, one may need to modify the options
Additional Include Directories and Additional Library Directories.
§ 3.5.2) fall in a certain pre-recorded range. Because almost all our methods are
randomized, there is a great deal of variance in the observed performance
characteristics. Thus, some tests may occasionally fail if, e.g., the actual recall value
is slightly lower or higher than the expected minimum or maximum. From the
error message, it should be clear whether the discrepancy is substantial, i.e., something
went wrong, or not, i.e., we merely observe an unlikely outcome due to randomization.
The exact search methods, however, should always have a nearly perfect recall.
Variance is partly due to using low-dimensional test sets. In the future, we
plan to change this. For high-dimensional data sets, the outcomes are much more
stable despite the randomized nature of most implementations.
Finally, one is encouraged to run a small-scale test using the script test_run.sh
on Linux (test_run.bat on Windows). The test includes several important methods,
which can be very efficient. Both the Linux and the Windows scripts expect
a dense vector space file (see § 4 and § 9) and the number of threads as the
first two parameters. The Linux script has an additional third parameter: if the
user specifies 1, the software creates a plot (see § 3.7). On Linux, the user can
verify that for the Colors data set [25] the plot looks as follows:
[Plot: recall on the x-axis, log-scale performance on the y-axis; curves shown for HNSW, NAPP, SW-graph, and VP-tree.]
On Windows, the script test_run.bat creates only the data file (output file K=10.dat)
and the report file (output file K=10.rep). The report file can be checked manually.
The data file can be used to create a plot via Excel or Google Docs.
3.4.1 Space and distance value type A distance function can return an
integer (int), a single-precision (float), or a double-precision (double) real
value. The distance value type is specified via the option --distType, while the
space itself is selected via the option --spaceType (see the command-line examples below).
3.4.2 Input Data/Test Set There are two options that define the data to
be indexed: --dataFile (the location of the data file) and --maxNumData (the
maximum number of data elements to use).
The input file can be indexed either completely or partially. In the latter case,
the user can create the index using only the first --maxNumData elements.
For testing, the user can use a separate query set. It is, again, possible to
limit the number of queries via the option --maxNumQuery.
3.4.3 Query Type Our framework supports the k-NN and the range search.
The user can request to run both types of queries. For example, by specifying
the option --knn with values 1 and 10 and the option --range with radii 0.01,
0.1, and 1, the user requests to run queries of five different types: 1-NN, 10-NN,
as well as three range queries with radii 0.01, 0.1, and 1.
A method can have a single set of index-time parameters, which is specified via
the option --createIndex. In addition to the set of index-time parameters, the
method can have multiple sets of query-time parameters, which are specified
using the (possibly repeating) option --queryTimeParams.
For each set of query-time parameters, i.e., for each occurrence of the option
--queryTimeParams, the benchmarking utility experiment carries out an evaluation
using the specified set of queries and a query type (e.g., a 10-NN search
with queries from a specified file). If the user does not specify any query-time
parameters, there is only one evaluation to be carried out. This evaluation uses
default query-time parameters. In general, we ensure that whenever a query-time
parameter is omitted, the default value is used.
Note that this parameter is a coefficient: the actual number of entries is defined
relative to the result set size. For example, if a range search returns 30 entries
and the value of --maxCacheGSRelativeQty is 10, then 30 × 10 = 300 entries
are saved in the gold standard cache file.
– A number of points closer to the query than the nearest returned point. This
metric is equal to pos(o_1) − 1. If o_1 is always the true nearest object, its
positional distance is one and, thus, the number of closer points is always
equal to zero.
– A relative position error: for the point o_i it is equal to pos(o_i)/i; an aggregate
value is obtained by computing the geometric mean over all returned o_i.
– Recall, which is equal to the fraction of all correct answers retrieved.
– Classification accuracy, which is equal to the fraction of labels correctly
predicted by a k-NN based classification procedure.
The first two metrics represent a so-called rank (approximation) error. The
closer the returned objects are to the query object, the better the quality of
the search response and the lower the rank approximation error.
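To make these definitions concrete, below is a small Python sketch (an illustration only, not the code used by the benchmarking utility) that computes recall, the number of closer points, and the relative position error for a single query, given the exact ranking of the whole data set and an approximate k-NN answer. The helper names are ours.

import math

def knn_quality(exact_ranking, returned):
    # exact_ranking: all object ids sorted by the true distance to the query (closest first)
    # returned: ids returned by an approximate k-NN search, also sorted by distance
    k = len(returned)
    pos = {obj: i + 1 for i, obj in enumerate(exact_ranking)}  # pos(o): 1-based true rank
    recall = len(set(returned) & set(exact_ranking[:k])) / k
    closer_than_nearest = pos[returned[0]] - 1                 # points closer than the 1st answer
    # geometric mean of pos(o_i)/i over the returned points
    rel_pos_err = math.exp(sum(math.log(pos[o] / (i + 1))
                               for i, o in enumerate(returned)) / k)
    return recall, closer_than_nearest, rel_pos_err

exact = [3, 7, 1, 9, 4, 2]         # hypothetical exact ordering of six objects
approx = [3, 1, 4]                 # a 3-NN answer produced by some approximate method
print(knn_quality(exact, approx))  # recall 2/3, 0 closer points, rel. pos. error ~1.36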
If the user specifies the option --outFilePrefix, the benchmarking results are
stored to the file system. A prefix of result files is defined by the parameter
--outFilePrefix while the suffix is defined by a type of the search procedure
(the k-NN or the range search) as well as by search parameters (e.g., the range
search radius). For each type of search, two files are generated: a report in a
human-readable format, and a tab-separated data file intended for automatic
processing. The data file contains only the average values, which can be used to,
e.g., produce efficiency-effectiveness plots as described in § 3.7.
An example of a human-readable report (confidence intervals are in square
brackets) is given in Table 2. In addition to averages, the human-readable report
provides 95% confidence intervals. In the case of bootstrapping, statistics collected
for several splits of the data set are aggregated. For the retrieval time and
the number of distance computations, this is done via a classic fixed-effect model
adopted in meta-analysis [27]. When dealing with other performance metrics, we
employ a simplistic approach of averaging split-specific values and computing the
sample variance over split-specific averages (the distribution of many metric values
is not normal; there are approaches to resolve this issue, e.g., applying a transformation,
but an additional investigation is needed to understand which approaches work best).
Note that for all metrics, except the relative position error, an average is computed
using the arithmetic mean. For the relative error, however, we use the geometric mean [29].
We provide a Python script to generate performance graphs from the tab-separated
data files produced by the benchmarking utility experiment. The plotting
script is genplot_configurable.py. In addition to Python, it requires LaTeX
and PGF. This script is supposed to run only on Linux. For a working example
of using the script, please see another script: test_run.sh. This script runs several
experiments, saves data, and generates a plot (if requested by the user).
Consider the following example of using genplot configurable.py:
../scripts/genplot_configurable.py \
-n MethodName \
-i result_K\=1.dat -o plot_1nn \
-x 1~norm~Recall \
-y 1~log~ImprEfficiency \
-a axis_desc.txt \
-m meth_desc.txt \
-l "2~(0.96,-.2)" \
-t "ImprEfficiency vs Recall" \
--xmin 0.01 --xmax 1.2 --ymin -2 --ymax 10
Here the goal is to process the tab-separated data file result_K=1.dat, which
was generated by a 1-NN search, and save the plot to the output file plot_1nn.pdf.
Note that one should not explicitly specify the extension of the output file (as
.pdf is always implied). Also note that, in addition to the PDF file, the script
generates the source LaTeX file. The source LaTeX file can be post-edited and/or
embedded directly into a LaTeX source (see the PGF documentation for details).
This can be useful for scientific publishing.
The parameter -n specifies the name of the field that stores method/index
mnemonic names. In the case of the benchmarking utility experiment, this field
is called MethodName.
-t "ImprEfficiency vs Recall"
The title of the plot is defined by -t (specify -t "" if you do not want to
print the title). Finally, note that the user can specify the bounding rectangle
for the plot via the options --xmin, --xmax, --ymin, and --ymax.
4 Spaces
4.2 Lp-norms
The Lp distance between vectors x and y is given by the formula:

L_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}    (1)
In the limit (p → ∞), the Lp distance becomes the Maximum metric, also known
as the Chebyshev distance:
L_\infty(x, y) = \max_{i=1}^{n} |x_i - y_i|    (2)
L∞ and all spaces Lp for p ≥ 1 are true metrics. They are symmetric, equal
to zero only for identical elements, and, most importantly, satisfy the triangle
inequality. However, the Lp norm is not a metric if p < 1.
In the case of dense vectors, we have reasonably efficient implementations
for Lp distances where p is either integer or infinity. The most efficient imple-
mentations are for L1 (Manhattan), L2 (Euclidean), and L∞ (Chebyshev). As
explained in the author’s blog, we compute exponents through square rooting.
Table 3 (fragment): spaces, their mnemonic names, and distance formulas.

Angular distance: angulardist, angulardist_sparse, angulardist_sparse_fast (13, 1.4, 3.5)
    \arccos\left( \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\sqrt{\sum_{i=1}^{n} y_i^2}} \right)

SQFD: sqfd_minus_func, sqfd_heuristic_func:alpha=..., sqfd_gaussian_func:alpha=... (see § 4.7 for details) (0.05, 0.05, 0.03)

Non-metric spaces (symmetric distance)

Lp (generic p < 1): lp:p=..., lp_sparse:p=... (0.1-3, 0.1-1)
    \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}

Cosine distance: cosinesimil, cosinesimil_sparse, cosinesimil_sparse_fast (13, 1.4, 3.5)
    1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\sqrt{\sum_{i=1}^{n} y_i^2}}
This works best when the number of digits after the binary point is small, e.g.,
if p = 0.125.
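As an illustration of this trick (a sketch of the idea, not the library's actual code), the following Python function computes v^p through repeated square rooting when p has a short binary expansion after the binary point:

import math

def pow_via_sqrt(v, p, max_bits=8):
    # Computes v**p for v >= 0 and p in (0, 1) with a short binary expansion.
    # Repeated square roots give v**(1/2), v**(1/4), ...; multiplying the terms
    # corresponding to the set bits of p yields v**p (truncated after max_bits bits).
    result = 1.0
    root = v
    frac = p
    for _ in range(max_bits):
        root = math.sqrt(root)
        frac *= 2
        if frac >= 1.0:
            result *= root
            frac -= 1.0
        if frac == 0.0:
            break
    return result

print(pow_via_sqrt(2.0, 0.125), 2.0 ** 0.125)  # p = 0.125 needs only three square roots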
Any Lp space can have a dense and a sparse variant. Sparse vector spaces
have their own mnemonic names, which differ from dense-space mnemonic names
in that they contain the suffix sparse (see also Table 3). For instance, l1
and l1_sparse are both L1 spaces, but the former is dense and the latter is
sparse. The mnemonic names of the L1, L2, and L∞ spaces (passed to the benchmarking
utility) are l1, l2, and linf, respectively. Other generic Lp spaces have the
name lp, which is used in combination with a parameter. For instance, L3 is
denoted as lp:p=3.
Distance functions for sparse-vector spaces are far less efficient due to a
costly, branch-heavy operation of matching sparse vector indices (between two
sparse vectors).
The cosine distance is not a true metric, but it can be converted into one by
applying a monotonic transformation (i.e., subtracting the cosine distance from
one and taking the inverse cosine). The resulting distance function is a true
metric, which is called the angular distance. The angular distance is computed
using the following formula:

d(x, y) = \arccos\left( \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\sqrt{\sum_{i=1}^{n} y_i^2}} \right)
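As a simple illustration (not the library's optimized implementation), the cosine and angular distances between dense vectors can be computed as follows:

import numpy as np

def cosine_distance(x, y):
    sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - sim

def angular_distance(x, y):
    sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return float(np.arccos(np.clip(sim, -1.0, 1.0)))  # clipping guards against rounding

x, y = np.array([1.0, 0.0, 1.0]), np.array([1.0, 1.0, 0.0])
print(cosine_distance(x, y), angular_distance(x, y))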
In the case of sparse spaces, to compute the scalar product, we need to obtain
an intersection of vector element ids corresponding to non-zero elements.
A classic text-book intersection algorithm (akin to a merge sort) is not particularly
efficient, apparently due to frequent branching. For single-precision floating
point vector elements, we provide a more efficient implementation that relies on
the all-against-all comparison SIMD instruction _mm_cmpistrm. This implementation
(inspired by the set intersection algorithm of Schlegel et al. [47]) is about
2.5-3 times faster than a pure C++ implementation based on the merge-sort
approach.
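For reference, here is a Python sketch of the classic merge-style intersection mentioned above (the slower baseline, not the SIMD-based implementation):

def sparse_dot(ids_a, vals_a, ids_b, vals_b):
    # ids_* are sorted lists of non-zero element ids, vals_* the matching values
    i = j = 0
    result = 0.0
    while i < len(ids_a) and j < len(ids_b):
        if ids_a[i] == ids_b[j]:
            result += vals_a[i] * vals_b[j]
            i += 1
            j += 1
        elif ids_a[i] < ids_b[j]:  # advance the list with the smaller id
            i += 1
        else:
            j += 1
    return result

print(sparse_dot([1, 4, 7], [0.5, 1.0, 2.0], [4, 7, 9], [3.0, 1.0, 5.0]))  # 1.0*3.0 + 2.0*1.0 = 5.0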
This divergence is symmetric, but it is not a metric function. However, the square
root of the Jensen-Shannon divergence is a proper metric [20], which we call
the Jensen-Shannon metric.
A straightforward implementation of Eq. 3 is inefficient for two reasons (at
least when one uses the GNU C++ compiler): (1) computation of logarithms is a
slow operation; (2) the case of zero x_i and/or y_i requires conditional processing,
i.e., costly branches.
A better method is to pre-compute logarithms of data at index time. It is also
necessary to compute logarithms of a query vector. However, this operation has
little cost, since it is carried out once for each nearest-neighbor or range query.
Pre-computation leads to a 3-10 fold improvement depending on the sparsity of
vectors, albeit at the expense of requiring twice as much space. Unfortunately,
it is not possible to avoid computation of the third logarithm: it needs to be
computed at points that are not known until we see the query vector.
However, it is possible to approximate it with a very good precision, which
should be sufficient for the purpose of approximate searching. Let us rewrite
Equation 3 as follows:
\frac{1}{2}\sum_{i=1}^{n}\left[ x_i \log x_i + y_i \log y_i - (x_i + y_i)\log\frac{x_i + y_i}{2} \right] =

= \frac{1}{2}\sum_{i=1}^{n}\left[ x_i \log x_i + y_i \log y_i \right] - \sum_{i=1}^{n}\frac{x_i + y_i}{2}\log\frac{x_i + y_i}{2} =

= \frac{1}{2}\sum_{i=1}^{n}\left[ x_i \log x_i + y_i \log y_i \right] - \sum_{i=1}^{n}\frac{x_i + y_i}{2}\left[ \log\frac{1}{2} + \log\max(x_i, y_i) + \log\left(1 + \frac{\min(x_i, y_i)}{\max(x_i, y_i)}\right) \right]    (4)
We can pre-compute all the logarithms in Eq. 4 except for \log\left(1 + \frac{\min(x_i, y_i)}{\max(x_i, y_i)}\right).
However, its argument value is in a small range: from one to two. We can
discretize this range, compute logarithms at many intermediate points, and save the
computed values in a table. Finally, we employ SIMD instructions to implement
this approach. This is a very efficient approach, which results in a very
small (around 10^-6 on average) relative error for the value of the Jensen-Shannon
divergence.
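The following Python sketch illustrates the idea (it is an illustration, not the library's SIMD implementation): the first two logarithms are taken from pre-computed arrays and the third one is looked up in a table over the range [1, 2]. It assumes strictly positive vectors, and the table size is chosen arbitrarily:

import numpy as np

TABLE_SIZE = 65536
GRID = np.linspace(1.0, 2.0, TABLE_SIZE)
LOG_TABLE = np.log(GRID)

def approx_log_1_to_2(z):
    # z is assumed to lie in [1, 2]; round to the nearest grid point
    idx = np.rint((z - 1.0) * (TABLE_SIZE - 1)).astype(int)
    return LOG_TABLE[idx]

def js_divergence(x, y, log_x, log_y):
    # x, y: strictly positive vectors; log_x, log_y: their precomputed logarithms
    mn, mx = np.minimum(x, y), np.maximum(x, y)
    log_mx = np.where(x >= y, log_x, log_y)        # log of the larger coordinate, precomputed
    log_avg = np.log(0.5) + log_mx + approx_log_1_to_2(1.0 + mn / mx)
    return 0.5 * np.sum(x * log_x + y * log_y) - np.sum(0.5 * (x + y) * log_avg)

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.1, 0.6, 0.3])
print(js_divergence(x, y, np.log(x), np.log(y)))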
Another possible approach is to use an efficient approximation for logarithm
computation. As our tests show, this method is about 1.5 times faster (1.5 vs
1.0 billion logarithms per second), but for logarithms in the range [1, 2],
the relative error is one order of magnitude higher (for a single logarithm) than for
the table-based discretization approach.
Bregman divergences are typically non-metric distance functions, which are equal
to the difference between some convex differentiable function f and its first-order
Taylor expansion [10,11]. More formally, given a convex and differentiable function
f (of many variables), its corresponding Bregman divergence d_f(x, y) is
equal to:

d_f(x, y) = f(x) - f(y) - \left( \nabla f(y) \cdot (x - y) \right),

where x \cdot y denotes the scalar product of vectors x and y. In this library, we
implement the generalized KL-divergence and the Itakura-Saito divergence, which
correspond to the functions f = \sum_i x_i \log x_i - \sum_i x_i and f = -\sum_i \log x_i,
respectively. The generalized KL-divergence is equal to:

\sum_{i=1}^{n} \left[ x_i \log \frac{x_i}{y_i} - x_i + y_i \right],

while the Itakura-Saito divergence is equal to:

\sum_{i=1}^{n} \left[ \frac{x_i}{y_i} - \log \frac{x_i}{y_i} - 1 \right].
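The following Python sketch (an illustration, not library code) instantiates the Bregman divergence for f(x) = \sum_i x_i \log x_i - \sum_i x_i and checks that it coincides with the closed-form generalized KL-divergence given above:

import numpy as np

def bregman(f, grad_f, x, y):
    # d_f(x, y) = f(x) - f(y) - <grad f(y), x - y>
    return f(x) - f(y) - np.dot(grad_f(y), x - y)

f = lambda v: np.sum(v * np.log(v) - v)
grad_f = lambda v: np.log(v)                      # d/dv_i (v_i log v_i - v_i) = log v_i

def generalized_kl(x, y):
    return np.sum(x * np.log(x / y) - x + y)      # closed form

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.3, 0.4, 0.3])
print(bregman(f, grad_f, x, y), generalized_kl(x, y))  # the two values coincide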
We currently provide implementations for the Levenshtein distance and its length-
normalized variant. The original Levenshtein distance is equal to the minimum
number of insertions, deletions, and substitutions (but not transpositions) re-
quired to obtain one string from another [31]. The distance between strings p
and s is computed using the classic O(m × n) dynamic programming solution,
where m and n are lengths of strings p and s, respectively. The normalized Lev-
enshtein distance is obtained by dividing the original Levenshtein distance by
the maximum of string lengths. If both strings are empty, the distance is equal
to zero.
While the original Levenshtein distance is a metric distance, the normalized
Levenshtein function is not, because the triangle inequality may not hold. In
practice, when there is little variance in string length, the violation of the tri-
angle inequality is infrequent and, thus, the normalized Levenshtein distance is
approximately metric for many real data sets.
Technically, the classic Levenshtein distance is equal to C_{m,n}, where C_{i,j} is
computed via the classic recursion:

C_{i,j} = \min \begin{cases} 0, & \text{if } i = j = 0 \\ C_{i-1,j} + 1, & \text{if } i > 0 \\ C_{i,j-1} + 1, & \text{if } j > 0 \\ C_{i-1,j-1} + [p_i \neq s_j], & \text{if } i, j > 0 \end{cases}    (5)
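A direct Python transcription of this dynamic program and of the length-normalized variant (an illustration, not the library's implementation) looks as follows:

def levenshtein(p, s):
    m, n = len(p), len(s)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i                                   # i deletions
    for j in range(n + 1):
        C[0][j] = j                                   # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            C[i][j] = min(C[i - 1][j] + 1,            # deletion
                          C[i][j - 1] + 1,            # insertion
                          C[i - 1][j - 1] + (p[i - 1] != s[j - 1]))  # substitution or match
    return C[m][n]

def normalized_levenshtein(p, s):
    if not p and not s:
        return 0.0
    return levenshtein(p, s) / max(len(p), len(s))

print(levenshtein("kitten", "sitting"), normalized_levenshtein("kitten", "sitting"))  # 3, 3/7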
Images can be compared using a family of metric functions called the Signature
Quadratic Form Distance (SQFD). During the preprocessing stage, each image
is converted to a set of n signatures (the number of signatures n is a parameter).
To this end, a fixed number of pixels is randomly selected. Then, each pixel
is represented by a 7-dimensional vector with the following components: three
color, two position, and two texture elements. These 7-dimensional vectors are
clustered by the standard k-means algorithm with n centers. Finally, each cluster
is represented by an 8-dimensional vector, called a signature. A signature includes
a 7-dimensional centroid and a cluster weight (the number of cluster points
divided by the total number of randomly selected pixels). Cluster weights form
a signature histogram.
The SQFD is computed as a quadratic form applied to a 2n-dimensional vec-
tor constructed by combining images’ signature histograms. The combination
vector includes n unmodified signature histogram values of the first image fol-
lowed by n negated signature histogram values of the second image. Unlike the
classic quadratic form distance, where the quadratic form matrix is fixed, in the
case of the SQFD, the matrix is re-computed for each pair of images. This can
be seen as computing the distance between infinite-dimensional vectors each of
which has only a finite number of non-zero elements.
To compute the quadratic form matrix, we introduce the new global enu-
meration of signatures, in which a signature k from the first image has number
k, while the signature k from the second image has number n + k. To obtain
a quadratic form matrix element in row i column j we first compute the Eu-
clidean distance d between the i-th and the j-th signature. Then, the value d is
transformed using one of three functions: negation (the minus function −d),
the heuristic function 1/(α + d), and the Gaussian function exp(−αd^2). The larger
the distance, the smaller the coefficient in the matrix of the quadratic form.
Note that the SQFD is a family of distances parameterized by the choice of
the transformation function and α. For further details, please, see the thesis of
Beecks [4].
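The following Python sketch (our own illustration based on the description above, not the library's code) computes the SQFD with the Gaussian transformation for two images given as sets of signatures; as is common, the distance is taken to be the square root of the quadratic form:

import numpy as np

def sqfd_gaussian(cent_a, w_a, cent_b, w_b, alpha=1.0):
    # cent_a, cent_b: n x 7 arrays of cluster centroids; w_a, w_b: signature weights
    cents = np.vstack([cent_a, cent_b])              # global enumeration of signatures
    w = np.concatenate([w_a, -np.asarray(w_b)])      # weights of the second image are negated
    d = np.linalg.norm(cents[:, None, :] - cents[None, :, :], axis=-1)  # pairwise distances
    A = np.exp(-alpha * d ** 2)                      # similarity matrix of the quadratic form
    form = float(w @ A @ w)
    return np.sqrt(max(form, 0.0))

rng = np.random.default_rng(0)
cent_a, cent_b = rng.random((4, 7)), rng.random((4, 7))
w_a = np.full(4, 0.25)
w_b = np.full(4, 0.25)
print(sqfd_gaussian(cent_a, w_a, cent_b, w_b, alpha=2.0))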
5 Search Methods
Implemented search methods can be broadly divided into the following cate-
gories:
– Space partitioning methods (including a specialized method bbtree for Breg-
man divergences) § 5.1;
– Locality Sensitive Hashing (LSH) methods § 5.2;
– Filter-and-refine methods based on projection to a lower-dimensional space
§ 5.3;
– Filtering methods based on permutations § 5.4;
– Methods that construct a proximity graph § 5.5;
– Miscellaneous methods § 5.6.
In the following subsections (§ 5.1-5.6), we describe implemented methods,
explain their parameters, and provide examples of their use via the benchmarking
utility experiment (experiment.exe on Windows). Note that a few parameters
are query-time parameters, which means that they can be changed without re-building
the index (see § 3.4.6). For a description of the utility experiment, see
§ 3.4. For several methods we provide basic tuning guidelines, see § 6.
schemes exploit locality. They either divide the data into clusters or create,
possibly approximate, Voronoi partitions. In the latter case, for example, we can
select several centers/pivots π_i and associate data points with the closest center.
If the current partition contains fewer than bucketSize (a method parame-
ter) elements, we stop partitioning of the space and place all elements belonging
to the current partition into a single bucket. If, in addition, the value of the
parameter chunkBucket is set to one, we allocate a new chunk of memory that
contains a copy of all bucket vectors. This method often halves retrieval time at
the expense of extra memory consumed by a testing utility (e.g., experiment),
as it does not deallocate memory occupied by the original vectors.
Classic hierarchical space partitioning methods for metric spaces are exact. It
is possible to make them approximate via an early termination technique, where
we terminate the search after exploring a pre-specified number of partitions. To
implement this strategy, we define an order of visiting partitions. In the case
of clustering methods, we first visit partitions that are closer to a query point.
In the case of hierarchical space partitioning methods such as the VP-tree, we
greedily explore partitions containing the query.
In NMSLIB, the early termination condition is defined in terms of the max-
imum number of buckets (parameter maxLeavesToVisit) to visit before termi-
nating the search procedure. By default, the parameter maxLeavesToVisit is
set to a large number (2147483647), which means that no early termination is
employed. The parameter maxLeavesToVisit is supported by many, but not all
space partitioning methods.
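As a toy illustration of this early-termination strategy (a sketch, not NMSLIB's implementation), consider a flat clustering index where buckets are visited in the order of increasing distance from the query to the cluster center and at most max_leaves_to_visit buckets are scanned:

import heapq

def knn_early_termination(query, clusters, dist, k, max_leaves_to_visit):
    # clusters: list of (center, bucket) pairs; points are plain tuples of floats
    order = sorted(clusters, key=lambda c: dist(query, c[0]))
    heap = []                                        # max-heap (by negated distance) of size k
    for center, bucket in order[:max_leaves_to_visit]:
        for p in bucket:
            d = dist(query, p)
            if len(heap) < k:
                heapq.heappush(heap, (-d, p))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, p))
    return sorted((-neg_d, p) for neg_d, p in heap)

l2 = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
clusters = [((0.0, 0.0), [(0.1, 0.1), (0.2, 0.0)]), ((5.0, 5.0), [(4.9, 5.1)])]
print(knn_early_termination((0.0, 0.2), clusters, l2, k=2, max_leaves_to_visit=1))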
There are several ways to obtain/specify optimal parameters for the VP-tree:
– It can be used with other VP-tree based methods, in particular, with the
projection VP-tree (see § 5.3.2).
– It allows the user to specify a separate query set, which can be useful when
queries cannot be accurately modelled by a bootstrapping approach (sam-
pling queries from the main data set).
– Once the optimal values are computed, they can be further re-used without
the need to start the tuning procedure each time the index is created.
– However, the user is fully responsible for specifying the size of the test data
set and the value of the parameter desiredRecall: the system will not try
to change them for optimization purposes.
If automatic tuning fails, the user can restart the procedure with a smaller
value of desiredRecall. Alternatively, the user can manually specify values of
parameters: alphaLeft, alphaRight, expLeft, and expRight (by default expo-
nents are one).
The following is an example of testing the VP-tree with the benchmarking
utility experiment without auto-tuning (note the separation into index-
and query-time parameters):
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method vptree \
--createIndex bucketSize=10,chunkBucket=1 \
--queryTimeParams alphaLeft=2.0,alphaRight=2.0,\
expLeft=1,expRight=1,\
maxLeavesToVisit=500
To initiate auto-tuning, one may use the following command line (note that
we do not use the parameter maxLeavesToVisit here):
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
40 Bilegsaikhan Naidan and Leonid Boytsov
--method vptree \
--createIndex tuneK=1,desiredRecall=0.9,\
bucketSize=10,chunkBucket=1
5.1.2 Multi-Vantage Point Tree It is possible to have more than one pivot
per tree level. In the binary version of the multi-vantage point tree (MVP-tree),
which is implemented in NMSLIB, there are two pivots. Thus, each partition
divides the space into four parts, which are similar to partitions created by two
levels of the VP-tree. The difference is that the VP-tree employs three pivots to
divide the space into four parts, while in the MVP-tree two pivots are used.
In addition, in the MVP-tree we memorize distances between a data object
and the first maxPathLen (a method parameter) pivots on the path connecting the
root and the leaf that stores this data object. Because mapping an object to a
vector of distances (to maxPathLen pivots) defines a contractive embedding into
the metric space with the L∞ distance, these values can be used to improve the
filtering capacity of the MVP-tree and, consequently, to reduce the number of
distance computations.
The following is an example of testing the MVP-tree with the benchmarking
utility experiment:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method mvptree \
--createIndex maxPathLen=4,bucketSize=10,chunkBucket=1 \
--queryTimeParams maxLeavesToVisit=500
5.1.3 GH-Tree A GH-tree [52] is a binary tree. In each node the data set
is divided using two randomly selected pivots. Elements closer to one pivot are
placed into a left subtree, while elements closer to the second pivot are placed
into a right subtree.
The following is an example of testing the GH-tree with the benchmarking
utility experiment:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method ghtree \
--createIndex bucketSize=10,chunkBucket=1 \
--queryTimeParams maxLeavesToVisit=10
5.1.4 List of Clusters The list of clusters [14] is an exact search method for
metric spaces, which relies on flat (i.e., non-hierarchical) clustering. Clusters are
created sequentially starting by randomly selecting the first cluster center. Then,
close points are assigned to the cluster and the clustering procedure is applied
to the remaining points. Closeness is defined either in terms of the maximum
radius, or in terms of the maximum number (bucketSize) of points closest to
the center.
Next we select cluster centers according to one of the policies: random selec-
tion, a point closest to the previous center, a point farthest from the previous
center, a point that minimizes the sum of distances to the previous center, and
a point that maximizes the sum of distances to the previous center. In our ex-
perience, a random selection strategy (a default one) works well in most cases.
The search algorithm iterates over the constructed list of clusters and checks
if answers can potentially belong to the currently selected cluster (using the
triangle inequality). If the cluster can contain an answer, each cluster element
is compared directly against the query. Next, we use the triangle inequality to
verify if answers can be outside the current cluster. If this is not possible, the
search is terminated.
We modified this exact algorithm by introducing an early termination condition.
The clusters are visited in the order of increasing distance from the query to
a cluster center. The search process stops after visiting maxLeavesToVisit clusters.
Our version is supposed to work for metric spaces (and symmetric distance
functions), but it can also be used with mildly non-metric symmetric distances
such as the cosine distance.
An example of testing the list of clusters using the bucketSize as a parameter
to define the size of the cluster:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method list_clusters \
--createIndex useBucketSize=1,bucketSize=100,strategy=random \
--queryTimeParams maxLeavesToVisit=5
An example of testing the list of clusters using the radius as a parameter to
define the size of the cluster:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method list_clusters \
--createIndex useBucketSize=0,radius=0.2,strategy=random \
--queryTimeParams maxLeavesToVisit=5
5.1.5 SA-tree The Spatial Approximation tree (SA-tree) [39] aims to approx-
imate the Voronoi partitioning. A data set is recursively divided by selecting
several cluster centers in a greedy fashion. Then, all remaining data points are
assigned to the closest cluster center.
A cluster-selection procedure first randomly chooses the main center point
and arranges the remaining objects in the order of increasing distances to this
center. It then iteratively fills the set of clusters as follows: We start from the
empty cluster list. Then, we iterate over the set of data points and check if there
is a cluster center that is closer to this point than the main center point. If no
such cluster exists (i.e., the point is closer to the main center point than to any
of the already selected cluster centers), the point becomes a new cluster center
(and is added to the list of clusters). Otherwise, the point is added to the nearest
cluster from the list.
After the cluster centers are selected, each of them is indexed recursively
using the already described algorithm. Before this, however, we check if there
are points that need to be reassigned to a different cluster. Indeed, because the
list of clusters keeps growing, we may miss the nearest cluster if it had not yet been
added to the list. To fix this, we need to compute distances between every cluster
point and the cluster centers that were not yet selected at the moment of the point's
assignment to the cluster.
Currently, the SA-tree is an exact search method for metric spaces without
any parameters. The following is an example of testing the SA-tree with the
benchmarking utility experiment:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method satree
5.1.6 bbtree A Bregman ball tree (bbtree) is an exact search method for
Bregman divergences [11]. The bbtree divides data into two clusters (each cov-
ered by a Bregman ball) and recursively repeats this procedure for each cluster
until the number of data points in a cluster falls below bucketSize. Then, such
clusters are stored as a single bucket.
At search time, the method relies on properties of Bregman divergences to
compute the shortest distance to a covering ball. This is a rather expensive
iterative procedure that may require several computations of direct and inverse
gradients, as well as of several distances.
Additionally, Cayton [11] employed an early termination method: the algorithm
can be told to stop after processing maxLeavesToVisit buckets. The
resulting method is an approximate search procedure.
Our implementation of the bbtree uses the same code to carry out the nearest-
neighbor and the range searching. Such an implementation of the range searching
is somewhat suboptimal and a better approach exists [12].
The following is an example of testing the multi-probe LSH with the bench-
marking utility experiment. We aim to achieve the recall value 0.25 (parameter
desiredRecall) for the 1-NN search (parameter tuneK):
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method lsh_multiprobe \
--createIndex desiredRecall=0.25,tuneK=1,\
T=5,L=25,H=16535
The classic version of the LSH for L2 can be tested as follows:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method lsh_gaussian \
--createIndex W=2,L=5,M=40,H=16535
There are two ways to use LSH for L1 . First, we can invoke the implemen-
tation based on the Cauchy distribution:
release/experiment \
--distType float --spaceType l1 --testSetQty 5 --maxNumQuery 100 \
--knn 1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method lsh_cauchy \
--createIndex W=2,L=5,M=10,H=16535
Second, we can use an L1 implementation based on thresholding. Note that it
does not use the width parameter W:
release/experiment \
--distType float --spaceType l1 --testSetQty 5 --maxNumQuery 100 \
--knn 1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method lsh_threshold \
--createIndex L=5,M=60,H=16535
All but the classic random projections are distance-based and can be applied to
an arbitrary space with a distance function. Random projections can be applied
only to vector spaces. A more detailed description of projection approaches is
given in § A.
We provide two basic implementations to generate candidates. One is based
on brute-force searching in the projected space and another builds a VP-tree
over objects’ projections. In what follows, these methods are described in detail.
release/experiment \
--distType float --spaceType cosinesimil --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method proj_incsort \
--createIndex projType=rand,projDim=4 \
--queryTimeParams useCosine=1,dbScanFrac=0.01
mnemonic name is proj_vptree. For this method, one needs to specify both the
parameters of the VP-tree (see § 5.1.1) and the projection parameters, as in the case
of brute-force searching of projections (see § 5.3.1).
The major difference from the brute-force search over projections is that,
instead of choosing between the L2 and the cosine distance as the distance in the
projected space, one uses the method's parameter projSpaceType to specify an
arbitrary one. Similar to the regular VP-tree implementation, optimal α_left and
α_right are determined by the utility tune_vptree via a grid-search-like procedure
(tune_vptree.exe on Windows).
This method, unfortunately, tends to perform worse than the VP-tree applied
to the original space. The only exceptions are spaces with high intrinsic
(and, perhaps, representational) dimensionality, where VP-trees (even with an
approximate search algorithm) are useless unless dimensionality is reduced
substantially. One example is Wikipedia tf-idf vectors, see § 9.
The following is an example of testing the VP-tree over projections with the
benchmarking utility experiment:
release/experiment \
--distType float --spaceType cosinesimil --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method proj_vptree \
--createIndex projType=rand,projDim=4,projSpaceType=cosinesimil \
--queryTimeParams alphaLeft=2,alphaRight=2,dbScanFrac=0.01
and an end position in the i-th list. For each data point, we allocate a zero-initialized
counter. We further create a projection of the query and use numPivot
binary searches to find, in each list i, the data point whose i-th projection
coordinate is closest to that of the query. In each of the numPivot lists, we make
both high_i and low_i point to the found data entry. In addition, for each data
point found, we increase its counter. Note that a single data point may appear
to be the closest with respect to more than one projection coordinate!
After that, we run a series of iterations. In each iteration, we increase the numPivot
pointers high_i and decrease the numPivot pointers low_i (unless we have reached
the beginning or the end of a list). For each data entry at which a pointer points,
we increase the value of its counter. Obviously, when we complete the traversal of
all numPivot lists, each counter will have the value numPivot (recall that each
data point appears exactly once in each of the lists). Thus, sooner or later the
value of a counter becomes equal to or larger than numPivot × minFreq, where
minFreq is a method's parameter, e.g., 0.5.
The first point whose counter becomes equal to or larger than numPivot ×
minFreq becomes the first candidate entry to be compared directly against the
query. The next point whose counter reaches the threshold value numPivot ×
minFreq becomes the second candidate, and so on and so forth. The total number
of candidate entries is defined by the parameter dbScanFrac. Instead of
all numPivot lists, it is possible to use only the numPivotSearch lists that
correspond to the smallest absolute values of the query's projection coordinates. In
this case, the counter threshold is numPivotSearch × minFreq. By default,
numPivot = numPivotSearch.
Note that the parameters numPivotSearch and dbScanFrac were introduced by
us; they were not employed in the original version of OMEDRANK.
The following is an example of testing OMEDRANK with the benchmarking
utility experiment:
release/experiment \
--distType float --spaceType cosinesimil --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method omedrank \
--createIndex projType=rand,numPivot=8 \
--queryTimeParams minFreq=0.5,dbScanFrac=0.02
Rather than relying on distance values directly, we can assess the similarity of objects
based on their relative distances to reference points (i.e., pivots). For each data
point x, we can arrange the pivots π_i in the order of increasing distance from x
(for simplicity we assume that there are no ties). This arrangement is called a
permutation. The permutation is essentially a pivot ranking. Technically, it is a
vector whose i-th element keeps the (ordinal) position of the i-th pivot in the
set of pivots sorted by their distance from x.
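For illustration purposes only (this is not the library's code), the following Python sketch computes the permutation of a point with respect to a set of pivots, and the L1 (Spearman footrule) distance between two permutations:

import numpy as np

def permutation(x, pivots, dist):
    d = np.array([dist(x, p) for p in pivots])
    order = np.argsort(d)                  # pivot indices sorted by distance from x
    perm = np.empty_like(order)
    perm[order] = np.arange(len(pivots))   # perm[i] = ordinal position (rank) of pivot i
    return perm

def footrule(perm_a, perm_b):
    # L1 distance between permutations; small values suggest that the two points
    # rank the pivots similarly and are therefore likely to be close to each other
    return int(np.sum(np.abs(perm_a - perm_b)))

l2 = lambda a, b: float(np.linalg.norm(a - b))
pivots = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
x, y = np.array([0.1, 0.2]), np.array([0.9, 0.1])
print(footrule(permutation(x, pivots, l2), permutation(y, pivots, l2)))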
5.4.4 Metric Inverted File (MI-File) relies on an inverted index over
permutations [2]. We select a (potentially large) subset of pivots (parameter
numPivot). Using these pivots, we compute a permutation for every data point.
Then, the numPivotIndex closest pivots are memorized in the index. If pivot
number i is the pos-th closest pivot for the object x, we add the
pair (pos, x) to posting list number i. All posting lists are kept sorted in the
order of the increasing first element (equal to the ordinal position of the pivot
in a permutation).
During searching, we compute the permutation of the query and select the posting
lists corresponding to the numPivotSearch closest pivots. These posting
lists are processed as follows: imagine that we selected posting list i and the position
of pivot i in the permutation of the query is pos. Then, using posting
list i, we retrieve all candidate records for which the position of pivot i in
their respective permutations is from pos − maxPosDiff to pos + maxPosDiff.
This allows us to update the estimate of the L1 distance between retrieved
candidate records' permutations and the permutation of the query (see [2] for more
details).
Finally, we select at most dbScanFrac · N objects (N is the total number of
indexed objects) with the smallest estimates of the L1 distance between their
permutations and the permutation of the query. These objects are compared directly
against the query. The filtering step of the MI-file is expensive. Therefore, this
method is efficient only for computationally intensive distances.
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method mi-file \
--createIndex numPivot=128,numPivotIndex=16 \
--queryTimeParams numPivotSearch=4,dbScanFrac=0.01
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--cachePrefixGS napp_gold_standard \
--method napp \
--createIndex numPivot=32,numPivotIndex=8,chunkIndexSize=1024 \
--queryTimeParams numPivotSearch=8 \
--saveIndex napp_index
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method napp \
--createIndex numPivot=32,numPivotIndex=8,chunkIndexSize=1024 \
--queryTimeParams useSort=1,dbScanFrac=0.01,numPivotSearch=8
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 --range 0.1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method perm_incsort_bin \
--createIndex numPivot=32,binThreshold=16 \
--queryTimeParams dbScanFrac=0.05
dates for the list of graph nodes, but this operation takes little time compared
to searching for NN neighboring points.
An example of testing this method using the utility experiment is as follows:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--cachePrefixGS sw-graph \
--method sw-graph \
--createIndex NN=3,initIndexAttempts=5,indexThreadQty=4 \
--queryTimeParams initSearchAttempts=1,efSearch=10 \
--saveIndex sw-graph_index
used to keep only the best neighbors. Specifically, the maximum number of neigh-
bors in all layers but the ground layer is maxM (an index-time parameter, which
is equal to M by default). The maximum number of neighbors for the ground
layer is maxM0 (an index-time parameter, which is equal to 2×M by default). The
choice of the heuristic is controlled by the parameter delaunay type.
A search algorithm is similar to the indexing algorithm. It starts from the
maximum-level layer and proceeds to lower-level layers by searching one layer
at a time. For all layers higher than the ground layer, the search algorithm is
a 1-NN search that greedily follows the closest neighbor (this is equivalent to
having efSearch=1). The closest point found at layer h + 1 is used as a
starting point for the search carried out at layer h. For the ground layer, we
carry out a k-NN search whose quality is controlled by the parameter efSearch (in
the paper by Malkov and Yashunin [37] this parameter is denoted as ef). The
ground-layer search relies on the same algorithm as we use for the SW-graph,
but it does not carry out multiple sub-searches starting from different random
data points.
For L2 and the cosine similarity, HNSW has optimized implementations,
which are enabled by default. To enforce the use of the generic algorithm, set
the parameter skip_optimized_index to one.
Similar to SW-graph, the indexing algorithm can be expensive. It is, there-
fore, accelerated by running parallel searches in multiple threads. The number of
threads is defined by the parameter indexThreadQty. By default, this parameter
is equal to the number of virtual cores.
A sample command line to test HNSW using the utility experiment:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method hnsw \
--createIndex M=10,efConstruction=20,indexThreadQty=4,searchMethod=0 \
--queryTimeParams efSearch=10
HNSW is capable of saving an index for the optimized L2 and cosine-similarity
implementations. Here is an example for the cosine similarity:
release/experiment \
--distType float --spaceType cosinesimil --testSetQty 5 --maxNumQuery 100 \
--knn 1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--cachePrefixGS hnsw \
--method hnsw \
--createIndex M=10,efConstruction=20,indexThreadQty=4,searchMethod=4 \
--queryTimeParams efSearch=10 \
--saveIndex hnsw_index
This process is governed by parameters rho and delta. Parameter rho defines
a fraction of the data set that is randomly sampled for neighborhood propaga-
tion. A good value that works in many cases is rho = 0.5. As the indexing
algorithm iterates, fewer and fewer neighborhoods change (when we attempt to
improve the local neighborhood structure via neighborhood propagation). The
parameter delta defines a stopping condition in terms of the fraction of modified
edges in the k-NN graph (the exact definition can be inferred from the code). A
good default value is delta=0.001. The indexing algorithm is multi-threaded:
the method uses all available cores.
When NN-descent was incorporated into NMSLIB, there was no open-source
search algorithm released, only the code to construct a k-NN graph. Therefore,
we use the same algorithm as for the SW-graph [35,36]. The new, open-source,
version of NN-descent (code-named kgraph), which does include the search al-
gorithm, can be found on GitHub.
Here is an example of testing this method using the utility experiment:
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method nndes \
--createIndex NN=10,rho=0.5,delta=0.001 \
--queryTimeParams initSearchAttempts=3
NN                   For a newly added point, find this number of closest points
                     that form the initial neighborhood of the point. When more
                     points are added, this neighborhood may be expanded.
efConstruction       The depth of the search that is used to find neighbors during
                     indexing. This parameter is analogous to efSearch.
initIndexAttempts    The number of random search restarts carried out to add one
                     point.
indexThreadQty       The number of indexing threads. The default value is equal to
                     the number of (logical) CPU cores.
initSearchAttempts   The number of random search restarts.
NN                   For each point, find this number of closest points (neighbors).
rho                  A fraction of the data set that is randomly sampled for
                     neighborhood propagation.
delta                A stopping condition in terms of the fraction of updated edges
                     in the k-NN graph.
initSearchAttempts   The number of random search restarts.
Note: mnemonic method names are given in round brackets.
No parameters.
release/experiment \
--distType float --spaceType l2 --testSetQty 5 --maxNumQuery 100 \
--knn 1 \
--dataFile ../sample_data/final8_10K.txt --outFilePrefix result \
--method mult_index \
--createIndex methodName=vptree,indexQty=5 \
--queryTimeParams maxLeavesToVisit=2
6 Tuning Guidelines
In the following subsections, we provide brief tuning guidelines. These guide-
lines are broad-brush: The user is expected to find optimal parameters through
experimentation on some development data set.
6.1 NAPP
Generally, increasing the overall number of pivots numPivot helps to improve
performance. However, using a large number of pivots leads to increased indexing
times. A good compromise is to use numPivot somewhat larger than √N, where
N is the overall number of data points. Similarly, the number of pivots to be
indexed (numPivotIndex) should be somewhat larger than √numPivot. Finally,
17 In fact, all of the methods except for the sequential (i.e., brute-force) search are
randomized.
6.2 SW-graph and HNSW
The basic guidelines are similar for both methods. Specifically, increasing the
value of efConstruction improves the quality of a constructed graph and leads
to higher accuracy of search. However, this also leads to longer indexing times.
Similarly, increasing the value of efSearch improves recall at the expense of
longer retrieval time. The reasonable range of values for these parameters is
100-2000.
In the case of SW-graph, the user can also specify the number of sub-
searches: initIndexAttempts and initSearchAttempts used during indexing
and retrieval, respectively. However, we find that in most cases the number of
sub-searches needs to be set to one. Yet, for large values of efConstruction
and efSearch (e.g., larger than 2000) it sometimes makes sense to increase the
number of sub-searches rather than further increasing efConstruction and/or
efSearch.
The recall values are also affected by parameters NN (for SW-graph) and M
(HNSW). Increasing the values of these parameters (to a certain degree) leads to
better recall and shorter retrieval times (at the expense of longer indexing time).
For low and moderate recall values (e.g., 60-80%) increasing these parameters
may lead to longer retrieval times. The reasonable range of values for these
parameters is 5-100.
Finally, in the case of HNSW, there is a trade-off between retrieval performance
and indexing time related to the choice of the pruning heuristic (controlled
by the parameter delaunay_type). Specifically, by default delaunay_type is
equal to 1. Using delaunay_type=1 improves performance, especially at high
recall values (> 80%), at the expense of longer indexing times. Therefore, for
lower recall values, we recommend using delaunay_type=0.
It is possible to add new spaces and search methods. This is done in three steps,
which we only outline here. A more detailed description can be found in § 7.2
and § 7.3.
In the first step, the user writes the code that implements the functionality
of a method or a space. In the second step, the user writes a special helper file
containing a function that creates the method or space. In this helper file, it is
necessary to include the method/space header.
Because we tend to give the helper file the same name as the header file of
the respective method/space, method/space headers should not be included using
quotes (in other words, use only angle brackets): otherwise, the code fails to
compile under Visual Studio. Here is an example of a proper include-directive:
#include <method/vptree.h>
In the third step, the user adds the registration code to either the file
init_spaces.h (for spaces) or to the file init_methods.h (for methods). This step
has two sub-steps. First, the user includes the previously created helper file into
either init_spaces.h or init_methods.h. Second, the function initMethods or
initSpaces is extended by adding a macro call that actually registers the space
or method in a factory class.
Note that no explicit/manual modification of makefiles (or other configura-
tion files) is required. However, you have to re-run cmake each time a new source
file is created (addition of header files does not require a cmake run). This is
necessary to automatically update makefiles so that they include new source
files.
It is noteworthy that all implementations of methods and spaces are mostly
template classes parameterized by the distance value type. Recall that the distance
function can return an integer (int), a single-precision (float), or a
double-precision (double) real value. The user may choose to provide specializations
for all possible distance value types or decide to focus, e.g., only on integer-valued
distances.
The user can also add new applications, which are meant to be a part of the
testing framework/library. However, adding new applications does require minor
editing of the meta-makefile CMakeLists.txt (and re-running cmake § 3.1) on
Linux, or creation of new Visual Studio sub-projects on Windows (see § 3.2). It
is also possible to create standalone applications that use the library. Please see
§ 3.1 and § 3.2 for details.
In the following subsections, we consider extension tasks in more detail. For il-
lustrative purposes, we created a zero-functionality space (DummySpace), method
(DummyMethod), and application (dummy_app). These zero-functionality examples
can also be used as starting points to develop fully functional code.
function SaveIndex. Note, however, that most methods do not support index
(de)-serialization.
Depending on parameters passed to the benchmarking utility, two test sce-
narios are possible. In the first scenario, the user specifies separate data and test
files. In the second scenario, a test file is created by bootstrapping: The data
set is randomly divided into a training and a test set. Then, we call the function
RunAll and subsequently Execute for all possible test sets.
The function Execute is the main workhorse: it creates queries, runs searches,
produces gold standard data, and collects execution statistics. There are two
types of queries: nearest-neighbor and range queries, which are represented by
(template) classes RangeQuery and KNNQuery. Both classes inherit from the class
Query. Similar to spaces, these template classes are parameterized by the type
of the distance value.
Both types of queries are similar in that they implement the Radius function
and the function CheckAndAddToResult. In the case of the range query, the
radius of a query is constant. However, in the case of the nearest-neighbor query,
the radius typically decreases as we compare the query with previously unseen
data objects (by calling the function CheckAndAddToResult). In both cases, the
value of the function Radius can be used to prune unpromising partitions and
data points.
This commonality between RangeQuery and KNNQuery allows us in many
cases to carry out a nearest-neighbor query using an algorithm designed to
answer range queries. Thus, in many cases a single implementation of a search
method can answer queries of both types.
A query object proxies distance computations during the testing phase. Namely,
during the indexing phase, the distance function is accessible through the function
IndexTimeDistance, which is defined in the class Space. During the testing phase,
a search method can compute a distance only by accessing the functions Distance,
DistanceObjLeft (for left queries), and DistanceObjRight (for right queries), which
are member functions of the class Query. The function Distance accepts two
parameters (i.e., object pointers) and can be used to compare two arbitrary objects.
The functions DistanceObjLeft and DistanceObjRight are used to compare data
objects with the query. Note that it is the query object that memorizes the number
of distance computations. This allows us to compute the variance in the number
of distance evaluations and, consequently, a respective confidence interval.
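As an illustration of this proxying mechanism, a (hypothetical) brute-force search over left queries could be written against this interface roughly as follows; the exact signatures of Radius, DistanceObjLeft, and CheckAndAddToResult are simplified assumptions and do not reproduce the library headers:

// A simplified illustration (not actual library code): a brute-force search over
// left queries expressed via the query interface described above.
template <typename dist_t, typename QueryType>
void BruteForceSearchLeft(const ObjectVector& data, QueryType* query) {
  for (const Object* obj : data) {
    // The query object proxies (and counts) the distance computation.
    dist_t d = query->DistanceObjLeft(obj);
    // For a range query, Radius() is fixed; for a k-NN query it shrinks as better
    // candidates are found, which is what allows smarter methods to prune.
    if (d <= query->Radius()) {
      query->CheckAndAddToResult(d, obj);
    }
  }
}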
that are responsible for reading objects from files, interpreting the structure of
the data blobs (stored in the Object), and computing a distance between two
objects.
For dense vector spaces, the easiest way to create a new space is to create
a functor (function object class) that computes the distance. Then, this functor
should be used to instantiate the template VectorSpaceGen. A sample implementation
of this approach can be found in sample_standalone_app1.cc. However,
as we explain below, additional work is needed if the space should work correctly
with all projection methods (see § 5.3) or any other methods that rely on
projections (e.g., OMEDRANK § 5.3.3).
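For illustration, such a functor might look as follows (a minimal sketch modeled on, but not copied from, sample_standalone_app1.cc; the exact template arguments of VectorSpaceGen should be checked against the corresponding header):

#include <cmath>
#include <cstddef>

// A functor that computes the L1 distance between two dense float vectors of length qty.
struct DistL1 {
  float operator()(const float* x, const float* y, std::size_t qty) const {
    float res = 0;
    for (std::size_t i = 0; i < qty; ++i) res += std::fabs(x[i] - y[i]);
    return res;
  }
};

// The functor is then used to instantiate the generic dense-vector space, e.g.:
//   similarity::VectorSpaceGen<float, DistL1> customSpace;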
To further illustrate the process of developing a new space, we created a
sample zero-functionality space DummySpace. It is represented by the header file
space_dummy.h and the source file space_dummy.cc. The user is encouraged to
study these files and read the comments. Here we focus only on the main aspects
of creating a new space.
The sample files include a template class DummySpace (see Table 11), which
is declared and defined in the namespace similarity. It is a direct ancestor of
the class Space.
It is possible to provide the complete implementation of the DummySpace
in the header file. However, this would make compilation slower. Instead, we
recommend using the mechanism of explicit template instantiation. To this
end, the user should instantiate the template in the source file for all possible
combinations of parameters. In our case, the source file space_dummy.cc contains
the following lines:
template class SpaceDummy<int>;
template class SpaceDummy<float>;
template class SpaceDummy<double>;
Most importantly, the user needs to implement the function HiddenDistance,
which computes the distance between objects, and the function CreateObjFromStr,
which creates a data point object from an instance of the C++ class string. For
simplicity (even though this is not the most efficient approach), all our spaces
create objects from textual representations. However, this is not a principal
limitation, because a C++ string can hold binary data as well. Perhaps the next
most important function is ReadNextObjStr, which reads a string representation
of the next object from a file. A file is represented by a reference to a subclass
of the class DataFileInputState.
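Purely for illustration, a HiddenDistance for a space that stores dense single-precision vectors inside an Object might be sketched as follows; the accessors data() and datalength() and the lack of error checking are simplifying assumptions, and the authoritative example remains space_dummy.cc:

#include <cmath>
#include <cstddef>

// A simplified illustration (not the real space_dummy.cc): L2 distance over a dense
// float payload stored inside an Object; the accessors data() and datalength() are
// assumptions made for this example, and error checking is omitted.
template <typename dist_t>
dist_t SpaceDummy<dist_t>::HiddenDistance(const Object* obj1, const Object* obj2) const {
  const float* x = reinterpret_cast<const float*>(obj1->data());
  const float* y = reinterpret_cast<const float*>(obj2->data());
  const std::size_t qty = obj1->datalength() / sizeof(float);  // number of vector elements
  double sum = 0;
  for (std::size_t i = 0; i < qty; ++i) sum += (x[i] - y[i]) * (x[i] - y[i]);
  return static_cast<dist_t>(std::sqrt(sum));
}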
Compared to previous releases, the new Space API is substantially more
complex. This is necessary to standardize reading/writing of generic objects. In
turn, this has been essential to implementing a generic query server. The query
server accepts data points in the same format as they are stored in a data file.
The above-mentioned function CreateObjFromStr is used for de-serialization of
both the data points stored in a file and query data points passed to the query
server.
Additional complexity arises from the need to update space parameters after
a space object is created. This permits a more complex storage model where,
/*
* Write a string representation of the next object to a file. We totally delegate
 * this to a Space object, because it may package the string representation,
 * e.g., in the form of an XML fragment.
*/
virtual void WriteNextObj(const Object& obj, const string& externId,
DataFileOutputState &) const;
/** End of standard functions to read/write/create objects */
...
/*
* CreateDenseVectFromObj and GetElemQty() are only needed, if
* one wants to use methods with random projections.
*/
virtual void CreateDenseVectFromObj(const Object* obj, dist_t* pVect,
size_t nElem) const {
throw runtime_error("Cannot create vector for the space: " + StrDesc());
}
virtual size_t GetElemQty(const Object* object) const {return 0;}
protected:
virtual dist_t HiddenDistance(const Object* obj1,
const Object* obj2) const;
// Don't permit copying and/or assigning
DISABLE_COPY_AND_ASSIGN(SpaceDummy);
};
e.g., parameters are stored in a special dedicated header file, while data points
are stored elsewhere, e.g., split among several data files. To support such func-
tionality, we have a function that opens a data file (OpenReadFileHeader) and
creates a state object (sub-classed from DataFileInputState), which keeps the
current file(s) state as well as all space-related parameters. When we read data
points using the function ReadNextObjStr, the state object is updated. The
function ReadNextObjStr may also read an optional external identifier for an
object. When it produces a non-empty identifier it is memorized by the query
server and is further used for query processing (see § 2.5). After all data points
are read, this state object is supposed to be passed to the Space object in the
following fashion:
unique_ptr<DataFileInputState>
inpState(space->ReadDataset(dataSet, externIds, fileName, maxNumRec));
space->UpdateParamsFromFile(*inpState);
For a more advanced implementation of the space-related functions, please see
the file space_vector.cc.
Remember that the function HiddenDistance should not be directly accessible
by classes that are not friends of the Space. As explained in § 7.1,
during the indexing phase, HiddenDistance is accessible through the function
Space::IndexTimeDistance. During the testing phase, a search method can
compute a distance only by accessing the functions Distance, DistanceObjLeft, or
DistanceObjRight, which are member functions of the Query. This is not a
perfect solution, and we are contemplating better ways to proxy distance
computations.
To implement a vector space that works properly with projection
methods and classic random projections, we need to define the functions GetElemQty
and CreateDenseVectFromObj. In the case of a dense vector space, GetElemQty
should return the number of vector elements stored in the object. For sparse
vector spaces, it should return zero. The function CreateDenseVectFromObj
extracts elements stored in a vector. For dense vector spaces, it merely copies
vector elements to a buffer. For sparse vector spaces, it should do some
kind of basic dimensionality reduction. Currently, we do it via the hashing trick
(see § A).
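For a dense vector space, the copying behavior described above might be sketched as follows (an illustration with assumed accessor names rather than the library's actual implementation; see space_vector.cc for the real code):

#include <algorithm>
#include <cstddef>

// Illustrative helper (not library code): copy up to nElem elements of a dense vector
// stored in an Object into the caller's buffer; accessor names are assumptions.
template <typename dist_t>
void CopyDenseVectFromObj(const Object* obj, dist_t* pVect, std::size_t nElem) {
  const dist_t* elems = reinterpret_cast<const dist_t*>(obj->data());
  const std::size_t qty = std::min(nElem, obj->datalength() / sizeof(dist_t));
  for (std::size_t i = 0; i < qty; ++i) pVect[i] = elems[i];
}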
Importantly, we need to “tell” the library about the space by registering the
space in the space factory. At runtime, the space is created through a helper
function. In our case, it is called CreateDummy. The function accepts only one
parameter, which is a reference to an object of the type AllParams:
pmgr.GetParamRequired("param1", param1);
pmgr.GetParamRequired("param2", param2);
pmgr.CheckUnused();
This macro should be placed into the function initSpaces in the file init_spaces.h.
Last but not least, we need to add the include-directive for the helper function,
which creates the class, to the file init_spaces.h as follows:
#include "factory/space/space_dummy.h"
Similar to the space and query classes, a search method is implemented using
a template class, which is parameterized by the distance value type (see
Table 12). Note again that the constructor of the class does not create an index
in memory. The index is created using either the function CreateIndex (from
scratch) or the function LoadIndex (from a previously created index image). The
index can be saved to disk using the function SaveIndex. It does not have to
be a comprehensive index that contains a copy of the data set. Instead, it is
sufficient to memorize only the index structure itself (because the data set is
always loaded separately). Also note that most methods do not support index
(de)-serialization.
There are two search functions, each of which receives two parameters. The
first parameter is a pointer to a query (either a range or a k-NN query). The
second parameter is currently unused. Note again that during the search phase,
a search method can compute a distance only by accessing functions Distance,
DistanceObjLeft, or DistanceObjRight, which are member functions of a
query object. The function IndexTimeDistance should not be used in a func-
tion Search, but it can be used in the function CreateIndex. If the user attempts
to invoke IndexTimeDistance during the test phase, the program will terminate.18
Finally, we need to “tell” the library about the method, by registering the
method in the method factory, similarly to registering a space. At runtime, the
method is created through a helper function, which accepts several parameters.
One parameter is a reference to an object of the type AllParams. In our case,
the function name is CreateDummy:
#include <method/dummy.h>
namespace similarity {
template <typename dist_t>
Index<dist_t>* CreateDummy(bool PrintProgress,
const string& SpaceType,
Space<dist_t>& space,
const ObjectVector& DataObjects) {
return new DummyMethod<dist_t>(space, DataObjects);
}

}  // end of namespace similarity
The helper file is then included into init_methods.h:
#include "factory/method/dummy.h"
Then, this file is further modified by adding the following lines to the function
initMethods:
When adding the method, please consider expanding the test utility test_integr.
This is especially important if, for some combination of parameters, the method
is expected to return all answers (and thus has perfect recall). Then, if we
break the code in the future, this will be detected by test_integr.
18 As noted previously, we want to compute the number of times the distance was
computed for each query. This allows us to estimate the variance. Hence, during the
testing phase, the distance function should be invoked only through a query object.
To create a test case, the user needs to add one or more test cases to the
file test_integr.cc. A test case is an instance of the class MethodTestCase. It
encodes the range of plausible values for the following performance parameters:
the recall, the number of points closer to the query than the nearest returned
point, and the improvement in the number of distance computations.
8 Notes on Efficiency
There are also situations when efficient automatic vectorization is hardly possible.
For instance, we provide an efficient implementation of the scalar product
for sparse single-precision floating-point vectors. It relies on the all-against-all
comparison SIMD instruction _mm_cmpistrm. However, it requires keeping the
data in a special format, which makes automatic vectorization impossible.
Intel SSE extensions that provide SIMD instructions are automatically detected
by all compilers except Visual Studio. If some SSE extensions are not
available, the compilation process produces warnings like the following one:
9 Data Sets
Currently, we provide mostly vector space data sets, which come in either dense
or sparse format. For simplicity, these are textual formats where each row of
the file contains a single vector. If a row starts with a prefix in the form
label:<non-negative integer value> <white-space>, the integer value is
interpreted as the object's class label.
The vectors are sparse and most values are not specified. It is up to a designer
of the space to decide on the default value for an unspecified vector element. All
existing implementations use zero as the default value. Again, elements can be
separated by commas or colons instead of spaces.
In addition, the directory previous_releases_scripts contains the full set
of scripts that can be used to reproduce our NIPS’13, SISAP’13, DA’14, and
VLDB’15 results [7,8,41,38]. However, one would need to use an older software
version (1.0 for NIPS’13 and 1.1 for VLDB’15). Additionally, to reproduce our
previous results, one needs to obtain data sets using the scripts
data/get_data_nips2013.sh and data/get_data_vldb2015.sh. Note that for all
evaluations except VLDB’15, you need the previous version of the software (1.0),
which can be downloaded from here.
If you use any of the provided data sets, please consider citing the sources
(see Section 10 for details). Also note that the data will be downloaded in
compressed form. You would need the standard gunzip or bunzip2 to
uncompress all the data except the Wikipedia (sparse and dense) vectors. The
Wikipedia data is compressed using 7z, which provides superior compression
ratios.
The code that was written entirely by the authors is distributed under the
business-friendly Apache License. The best way to acknowledge the use of this
code in a scientific publication is to provide the URL of the GitHub repository19
and to cite our engineering paper [7]:
@incollection{Boytsov_and_Bilegsaikhan:sisap2013,
year={2013},
isbn={978-3-642-41061-1},
booktitle={Similarity Search and Applications},
volume={8199},
19 https://github.com/searchivarius/NonMetricSpaceLib
11 Acknowledgements
Bileg and Leo gratefully acknowledge support by the iAd Center20 and the
Open Advancement of Question Answering Systems (OAQA) group21. We also
thank Lawrence Cayton for providing data sets (and allowing us to make them
public); Nikita Avrelin for implementing the first version of the SW-graph;
Yury Malkov and Dmitry Yashunin for contributing a hierarchical modification
of the SW-graph, as well as for guidelines for tuning the original SW-graph;
David Novak for the suggestion to use external pivots in permutation algorithms;
Daniel Lemire for contributing the implementation of the original
Schlegel et al. [47] intersection algorithm. We also thank Andrey Savchenko,
Alexander Ponomarenko, and Yury Malkov for suggestions to improve the library
and the documentation.
20 https://web.archive.org/web/20160306011711/http://www.iad-center.com/
21 http://oaqa.github.io/
References
32. C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate
string searches. In Data Engineering, 2008. ICDE 2008. IEEE 24th International
Conference on, pages 257–266. IEEE, 2008.
33. Q. Lv, M. Charikar, and K. Li. Image similarity search with compact data struc-
tures. In Proceedings of the thirteenth ACM international conference on Informa-
tion and knowledge management, pages 208–217. ACM, 2004.
34. Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: efficient
indexing for high-dimensional similarity search. In Proceedings of the 33rd inter-
national conference on Very large data bases, pages 950–961. VLDB Endowment,
2007.
35. Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov. Scalable distributed al-
gorithm for approximate nearest neighbor search problem in high dimensional gen-
eral metric spaces. In Similarity Search and Applications, pages 132–147. Springer,
2012.
36. Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov. Approximate nearest
neighbor algorithm based on navigable small world graphs. Inf. Syst., 45:61–68,
2014.
37. Y. A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest
neighbor search using Hierarchical Navigable Small World graphs. ArXiv e-prints,
Mar. 2016.
38. B. Naidan, L. Boytsov, and E. Nyberg. Permutation search methods are efficient,
yet faster search is possible. PVLDB, 8(12):1618–1629, 2015.
39. G. Navarro. Searching in metric spaces by spatial approximation. The VLDB
Journal, 11(1):28–46, 2002.
40. S. B. Needleman and C. D. Wunsch. A general method applicable to the search for
similarities in the amino acid sequence of two proteins. J Mol Biol, 48(3):443–453,
March 1970.
41. A. Ponomarenko, N. Avrelin, B. Naidan, and L. Boytsov. Comparative analysis of
data structures for approximate nearest neighbor search. In DATA ANALYTICS
2014, The Third International Conference on Data Analytics, pages 125–130, 2014.
42. A. Ponomarenko, Y. Malkov, A. Logvinov, and V. Krylov. Approximate nearest
neighbor search small world approach, 2011. Available at http://www.iiis.org/
CDs2011/CD2011IDI/ICTA_2011/Abstract.asp?myurl=CT175ON.pdf.
43. R. Řehůřek and P. Sojka. Software Framework for Topic Modelling with Large
Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP
Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.
cz/publication/884893/en.
44. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpa-
thy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale
Visual Recognition Challenge, 2014.
45. H. Sakoe and S. Chiba. A dynamic programming approach to continuous speech
recognition. In Proceedings of the Seventh International Congress on Acoustics,
pages 65–68, August 1971. paper 20C13.
46. D. Sankoff. The early introduction of dynamic programming into computational
biology. Bioinformatics, 16(1):41–47, 2000.
47. B. Schlegel, T. Willhalm, and W. Lehner. Fast sorted-set intersection using SIMD
instructions. In ADMS@VLDB, pages 1–8, 2011.
48. C. Silpa-Anan and R. I. Hartley. Optimised kd-trees for fast image descriptor
matching. In CVPR, 2008.
49. T. Skopal. Unified framework for fast exact and approximate search in dissimilarity
spaces. ACM Trans. Database Syst., 32(4), Nov. 2007.
A.1 Classic Random Projections
The classic random projections work only for vector spaces (both sparse and
dense). At index time, we generate projDim vectors by sampling their elements
from the standard normal distribution N (0, 1) and orthonormalizing them. 22
Coordinates in the projection spaces are obtained by computing scalar products
between a given vector and each of the projDim randomly generated vectors.
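The following self-contained sketch (ours, for illustration only, assuming projDim does not exceed the original dimensionality) shows this index-time step: sample projDim vectors from N(0,1), orthonormalize them via Gram-Schmidt, and compute the projected coordinates as scalar products:

#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

using Vec = std::vector<float>;

static float dot(const Vec& a, const Vec& b) {
  float s = 0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// Sample projDim vectors of dimensionality dim from N(0,1) and orthonormalize
// them via Gram-Schmidt (assuming projDim <= dim, so all vectors can be kept).
std::vector<Vec> makeProjectionBasis(std::size_t projDim, std::size_t dim, unsigned seed = 0) {
  std::mt19937 gen(seed);
  std::normal_distribution<float> gauss(0.0f, 1.0f);
  std::vector<Vec> basis;
  while (basis.size() < projDim) {
    Vec v(dim);
    for (std::size_t i = 0; i < dim; ++i) v[i] = gauss(gen);
    for (std::size_t j = 0; j < basis.size(); ++j) {   // subtract projections onto earlier vectors
      float c = dot(v, basis[j]);
      for (std::size_t i = 0; i < dim; ++i) v[i] -= c * basis[j][i];
    }
    float norm = std::sqrt(dot(v, v));
    if (norm < 1e-6f) continue;                        // re-sample a (nearly) dependent vector
    for (std::size_t i = 0; i < dim; ++i) v[i] /= norm;
    basis.push_back(v);
  }
  return basis;
}

// The projected representation of a data (or query) vector: one scalar product
// per randomly generated basis vector.
Vec project(const std::vector<Vec>& basis, const Vec& x) {
  Vec res(basis.size());
  for (std::size_t i = 0; i < basis.size(); ++i) res[i] = dot(basis[i], x);
  return res;
}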
In the case of sparse vector spaces, the dimensionality is first reduced via
the hashing trick: the value of the element i is equal to the sum of values for all
elements whose indices are hashed into number i. After hashing, classic random
projections are applied. The dimensionality of the intermediate space is defined
by a method’s parameter intermDim.
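A minimal sketch of the hashing trick for a sparse vector follows (our illustration; the library's actual implementation and its choice of hash function may differ):

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Collapse a sparse vector, given as (index, value) pairs, into a dense vector of
// dimensionality intermDim: element i accumulates the values of all elements whose
// indices hash into i (a plain modulo is used here purely for illustration).
std::vector<float> hashTrick(const std::vector<std::pair<std::uint32_t, float> >& sparseVec,
                             std::size_t intermDim) {
  std::vector<float> dense(intermDim, 0.0f);
  for (std::size_t j = 0; j < sparseVec.size(); ++j)
    dense[sparseVec[j].first % intermDim] += sparseVec[j].second;
  return dense;
}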
The hashing trick is used purely for efficiency reasons. However, for large
enough values of the intermediate dimensionality, it has virtually no adverse
effect on performance. For example, in the case of Wikipedia tf-idf vectors (see
§ 9), it is safe to use the value intermDim=4096.
Random projections work best if both the source and the target space are
Euclidean and the distance is either L2 or the cosine distance. In this case,
there are theoretical guarantees that the projection preserves distances in
the original space well (see, e.g., [6]).
22 If the dimensionality of the projection space is larger than the dimensionality of the
original space, only the first projDim vectors are orthonormalized. The remaining
ones are simply divided by their norms.
A.2 FastMap
FastMap, introduced by Faloutsos and Lin [23], is also a type of random-projection
method. At indexing time, we randomly select projDim pairs of points Ai and Bi.
The i-th coordinate of vector x is computed using the formula:
xi = ( d(Ai, x)^2 + d(Ai, Bi)^2 - d(Bi, x)^2 ) / ( 2 d(Ai, Bi) )    (7)
Given points A and B in the Euclidean space, Eq. 7 gives the length of the
orthogonal projection of x to the line connecting A and B. However, FastMap
can be used in non-Euclidean spaces as well.