
International Journal of Computer Vision (2021) 129:23–79

https://doi.org/10.1007/s11263-020-01359-2

Image Matching from Handcrafted to Deep Features: A Survey


Jiayi Ma1 · Xingyu Jiang1 · Aoxiang Fan1 · Junjun Jiang2 · Junchi Yan3

Received: 9 January 2020 / Accepted: 15 July 2020 / Published online: 4 August 2020
© The Author(s) 2020

Abstract
As a fundamental and critical task in various visual applications, image matching can identify then correspond the same or similar structure/content from two or more images. Over the past decades, a growing number and diversity of methods have been proposed for image matching, particularly with the development of deep learning techniques in recent years. However, open questions remain about which method is a suitable choice for specific applications with respect to different scenarios and task requirements, and how to design better image matching methods with superior performance in accuracy, robustness and efficiency. This encourages us to conduct a comprehensive and systematic review and analysis of these classical and recent techniques. Following the feature-based image matching pipeline, we first introduce feature detection, description, and matching techniques from handcrafted methods to trainable ones and provide an analysis of the development of these methods in theory and practice. Second, we briefly introduce several typical image matching-based applications for a comprehensive understanding of the significance of image matching. In addition, we also provide a comprehensive and objective comparison of these classical and latest techniques through extensive experiments on representative datasets. Finally, we conclude with the current status of image matching technologies and deliver insightful discussions and prospects for future works. This survey can serve as a reference for (but not limited to) researchers and engineers in image matching and related fields.

Keywords Image matching · Graph matching · Feature matching · Registration · Handcrafted features · Deep learning

Communicated by V. Lepetit.
Corresponding author: Junchi Yan.

1 Electronic Information School, Wuhan University, Wuhan 430072, China
2 School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
3 Department of Computer Science and Engineering, and MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai 200240, China

1 Introduction

Vision-based artificial systems, as widely used to guide machines to perceive and understand the surroundings for better decision making, have been playing a significant role in the age of global automation and artificial intelligence. However, how to process the perceived information under specific requirements and understand the differences and/or relationships among multiple visual targets are crucial topics in various fields, including computer vision, pattern recognition, image analysis, security, and remote sensing. As a critical and fundamental problem in these complicated tasks, image matching, also known as image registration or correspondence, aims to identify then correspond the same or similar structure/content from two or more images. This technique is used for high-dimensional structure recovery as well as information identification and integration, such as 3-D reconstruction, visual simultaneous localization and mapping (VSLAM), image mosaic, image fusion, image retrieval, target recognition and tracking, as well as change detection, etc.


Image matching has rich meaning in pairing two objects, thus deriving many specific tasks, such as sparse feature matching, dense matching (like image registration and stereo matching), patch matching (retrieval), 2-D and 3-D point set registration, and graph matching. Image matching in general consists of two parts, namely, the nature of the matched features and the matching strategy, which indicate what are used to match and how to match them, respectively. The ultimate goals are to geometrically warp the sensed image into the common spatial coordinate system of the reference image and align their common area pixel-to-pixel (i.e., image registration). To this end, a direct strategy, also known as the area-based method, registers two images by using the similarity measurement of the original image pixel intensity or information after pixel-domain transformation in sliding windows of predefined size or even the entire images, without attempting to detect any salient image structure.

Another classic and widely adopted pipeline called the feature-based method, i.e., feature detection and description, feature matching, transform model estimation, and image resampling and transformation, has been introduced in the prestigious survey paper (Zitova and Flusser 2003) and applied in various fields. Feature-based image matching is popular due to its flexibility and robustness and its capability for a wide range of applications. In particular, feature detection can extract distinctive structure from an image, and feature description may be regarded as an image representation method that is widely used in image coding and similarity measurements such as image classification and retrieval. In addition, due to their strong ability in deep feature acquisition and non-linear expression, applying deep learning techniques for image information representation and/or similarity measurement, as well as for parameter regression of image pair transformation, are hot topics in today's image matching community, and they have been proven to achieve better matching performance and present greater potential compared with traditional methods.

In real-world settings, images for matching are usually taken from the same or similar scene/object but captured at different times, from different viewpoints, or with different imaging modalities. A robust and efficient matching strategy is therefore desirable to establish correct correspondences, thus stimulating various methods for achieving better efficiency, robustness, and accuracy. Although numerous techniques have been devised over the decades, developing a unified framework remains a challenging task in terms of the following aspects:

– Area-based methods that directly match images often depend on an appropriate patch similarity measurement for creating pixel-level matches between images. They can be computationally expensive and are sensitive to image distortion, appearance changes by noise, varying illumination, and different imaging sensors, which can have a negative impact on similarity measurement and match searching. As a result, these methods usually only work well under small rotation, scaling, and local deformation.
– Feature-based matching methods are often more efficient and can better handle geometrical deformation. But they are based on salient feature detection and description, feature matching, and geometrical model estimation, which can also be challenging. On the one hand, in feature-based image matching, it is difficult to define and extract a high percentage and a large number of features belonging to the same positions in 3-D space in the real world to ensure matchability. On the other hand, matching N feature points to N feature points detected in another image would create a total of N! possible matchings, and thousands of features are usually extracted from high-resolution images while dominant outliers and noise are typically included in the point sets, which lead to significant difficulties for existing matching methods. Although various local descriptors have been proposed and coupled with detected features to ease the matching process, the use of local appearance information will unavoidably result in ambiguity and numerous false matches, especially for images with low quality, repeated contents, and those undergoing serious nonrigid deformations and extreme viewpoint changes.
– A predefined transformation model is often required to indicate the geometrical relation between two images or point sets. But it may vary on different data and is unknown beforehand, thus hard to model. A simple parametric model is often insufficient for image pairs that involve non-rigid transformations caused by ground surface fluctuation and image viewpoint variations, multiple targets with different motion properties, and also local distortions.
– The emergence of deep learning has provided a new way and has shown great potential to address image matching problems. However, it still faces several challenges. The option of learning from images for direct registration or transformation model estimation is limited when applied to wide-baseline image stereo or registration under complex and serious deformation. The application of convolutional neural networks (CNNs) to sparse point data for matching, registration, and transformation model estimation is also difficult, because the points to be matched, known as unstructured or non-Euclidean data due to their disordered and dispersed nature, make it difficult to operate on and extract the spatial relationships between two or more points (e.g., neighboring elements, relative positions, and length and angle information among multiple points) using a deep convolutional technique.
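To make the feature-based pipeline discussed above concrete, the following is a minimal sketch of its stages using OpenCV. The detector choice (ORB), the ratio-test threshold, and the RANSAC reprojection threshold are illustrative assumptions rather than settings recommended by this survey.

```python
import cv2
import numpy as np

def match_images(path1, path2):
    """Minimal feature-based matching pipeline: detect, describe, match, estimate model."""
    img1 = cv2.imread(path1, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(path2, cv2.IMREAD_GRAYSCALE)

    # 1) Feature detection and description (ORB is used here purely for illustration).
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # 2) Descriptor matching with a ratio test to discard ambiguous correspondences.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    knn = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in knn if m.distance < 0.8 * n.distance]

    # 3) Geometric model estimation (here a homography) with RANSAC to reject mismatches.
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=3.0)
    return H, int(inlier_mask.sum()) if inlier_mask is not None else 0
```

Replacing ORB with any of the detector/descriptor pairs reviewed in Sects. 2 and 3 leaves the rest of such a pipeline unchanged, which is one reason the feature-based formulation is so widely adopted.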


Fig. 1 Structure of this survey

Existing surveys are focused on different parts of image matching tasks and fail to cover the literature from the last decade. For instance, the early reviews (Zitova and Flusser 2003; Tuytelaars and Mikolajczyk 2008; Strecha et al. 2008; Aanæs et al. 2012; Heinly et al. 2012; Awrangjeb et al. 2012; Li et al. 2015) typically focus on handcrafted methods, which are not sufficient to provide a valuable reference for investigating CNN-based methods. Most recent reviews involve trainable techniques, but they merely cover a single part of the image matching community, either focusing on detectors (Huang et al. 2018; Lenc and Vedaldi 2014), descriptors (Balntas et al. 2017; Schonberger et al. 2017), or specific matching tasks (Ferrante and Paragios 2017; Haskins et al. 2020; Yan et al. 2016b; Maiseli et al. 2017), while many others pay more attention to related applications (Fan et al. 2019; Guo et al. 2016; Zheng et al. 2018; Piasco et al. 2018). In this survey, we aim to provide an up-to-date and comprehensive summary and assessment of existing image matching methods, especially the recently introduced learning-based ones. More importantly, we provide a detailed evaluation and analysis of mainstream methods, which is missing in the existing literature.

This survey mainly focuses on feature-based matching, although patch matching, point set registration, and other related matching tasks are also reviewed. The overall organization is presented in Fig. 1; Sects. 2 and 3 describe the feature detection and description techniques, respectively, from handcrafted methods to trainable ones. Patch matching is classified under the feature description domain, and 3-D point set features are also reviewed. In Sect. 4, we present different matching methods, including area-based image matching, pure point set registration, image descriptor similarity matching and mismatch removal, graph matching, and learning-based methods. Sections 5 and 6 respectively introduce the image matching-based visual applications and evaluation metrics, including the performance comparison. In Sect. 7, we conclude and discuss possible future developments.

2 Feature Detection

Early image features were annotated manually, and such features are still used in some low-quality image matching. With the development of computer vision and the requirement for automatic matching approaches, many feature detection methods have been introduced to extract stable and distinct features from images.


2.1 Overview of Feature Detectors

Detected features represent specific semantic structures in an image or the real world and can be divided into corner features (Moravec 1977; Harris et al. 1988; Smith and Brady 1997; Rosten and Drummond 2006; Rublee et al. 2011), blob features (Lowe 2004; Bay et al. 2006; Agrawal et al. 2008; Yi et al. 2016), line/edge features (Harris et al. 1988; Smith and Brady 1997; Canny 1987; Perona and Malik 1990), and morphological region features (Matas et al. 2004; Mikolajczyk et al. 2005). However, the most popular features used for matching are points (a.k.a. keypoints or interest points). Points are easy to extract and can be defined in a simplified form compared with line and region features, and they can be roughly classified into corners and blobs.

A good interest point must be easy to find and ideally fast to compute, as an interest point at a good location is crucial for further feature description and matching. To promote (i) matchability, (ii) the capability for subsequent applications, and (iii) matching efficiency and reduction of storage requirements, many required properties have been proposed for reliable feature extraction (Zitova and Flusser 2003; Tuytelaars and Mikolajczyk 2008), including repeatability, invariance, robustness, and efficiency. The common idea for feature detection is to construct a feature response to distinguish salient points, lines, and regions from one another and from flat and nondistinctive image areas. Detectors can accordingly be classified into gradient-, intensity-, second-order derivative-, contour curvature-, region segmentation-, and learning-based ones. In the following, we provide a comprehensive introduction of feature detectors along these categories, focusing more on learning-based methods, to guide researchers on how traditional and trainable detectors work and to give insights into their strengths and weaknesses.

2.2 Corner Features

A corner feature can, for example, be defined as the crossing point of two straight lines with the forms of "L", "T", or "X", or as a high-curvature point of a contour. The common idea of corner detection is to compute a corner response and distinguish it from edge, flat, or other less distinctive image areas. Different strategies can be utilized for traditional corner searching, namely, gradient-, intensity-, and contour curvature-based ones. Refer to Zitova and Flusser (2003), Li et al. (2015), Tuytelaars and Mikolajczyk (2008) and Rosten et al. (2010) for details.

2.2.1 Gradient-Based Detectors

A gradient-based corner response prefers the use of first-order information in an image to distinguish the corner feature. The earliest automatic corner detection method can be traced to the Moravec detector (Moravec 1977), which first introduced the concept of "interest points" to define distinct feature points, which are extracted based on the autocorrelation of the local intensity. This method calculates and searches the minimum intensity variation of each pixel from a shifted window in eight directions, and an interest point is detected if this minimum is above the given threshold.

However, the Moravec detector is not invariant to direction or image rotation due to the discontinuous comparing directions and sizes. The famous Harris corner detector (Harris et al. 1988) was introduced to address the anisotropy and computational complexity problems. The goal of the Harris method is to find the directions of the fastest and lowest grey-value changes using a second-order moment matrix, or auto-correlation matrix; thus, it is invariant to orientation and illumination and has reliable repeatability and distinctiveness. Harris was further improved in Shi and Tomasi (1993) for better tracking performance by making the features more "spread out" and located more accurately.
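The Harris response just described can be sketched in a few lines. The Gaussian window scale and the empirical constant k = 0.04 below are common choices assumed for illustration rather than values prescribed by this survey.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(image, sigma=1.5, k=0.04):
    """Harris corner response from the second-moment (auto-correlation) matrix."""
    img = image.astype(np.float32)
    # First-order image gradients.
    ix = sobel(img, axis=1)
    iy = sobel(img, axis=0)
    # Elements of the second-moment matrix M, averaged over a Gaussian window.
    sxx = gaussian_filter(ix * ix, sigma)
    syy = gaussian_filter(iy * iy, sigma)
    sxy = gaussian_filter(ix * iy, sigma)
    # R = det(M) - k * trace(M)^2; large positive R indicates a corner,
    # large negative R an edge, and |R| near 0 a flat region.
    det_m = sxx * syy - sxy * sxy
    trace_m = sxx + syy
    return det_m - k * trace_m ** 2
```

Corners are then typically taken as local maxima of R above a threshold after non-maximum suppression.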
2.2.2 Intensity-Based Detectors

Several template- or intensity comparison-based corner detectors have been proposed by comparing the intensity of the surrounding pixels with that of the center pixel to simplify the image gradient computing. Due to their binary nature, they are widely used in many modern applications, particularly some with storage and real-time requirements. The intensity-based corner detector, namely, smallest univalue segment assimilating nucleus (SUSAN) (Smith and Brady 1997), is based on the brightness similarity between the local radius region pixels and the nucleus. SUSAN can be implemented rapidly because it does not require gradient computation. Many analogous methods have been proposed based on the concept of brightness comparison, the most famous of which is the FAST detector (Trajković and Hedley 1998). FAST uses a binary comparison of each pixel along a circle pattern against the central pixel and then determines more reliable corner features using a machine learning (i.e., ID3 tree, Quinlan 1986) strategy, which is trained on a large number of similar scene images and can generate the best criteria for corner selection.

As an improvement of SUSAN, FAST is extremely efficient with high repeatability and is used more widely. To improve FAST without loss of efficiency, FAST-ER (Rosten et al. 2010) was introduced to enhance the repeatability by generalizing the detector based on further pixel intensity comparisons centered on the nucleus. Another improvement is AGAST (Mair et al. 2010), in which two more pixel brightness comparison criteria are defined, after which an optimal and specialized decision tree is trained in an extended configuration space, thus rendering the FAST detector more generic and adaptive. To combine the efficiency of FAST


and the reliability of the Harris detector, Rublee et al. (2011) proposed an integrated feature detector and descriptor for matching called ORB. ORB uses the Harris response to select a certain number of FAST corners as the final detected features. The gray-scale centroid of the local patch and the center pixel itself form a vector that represents the main direction of the ORB feature, which helps calculate the similarity of the binary descriptor in ORB. Recently, a saddle-like detector (Aldana-Iuit et al. 2016) has been proposed to extract interest points. In this detector, the saddle condition is verified efficiently by intensity comparisons on two concentric rings with certain geometric constraints. The Saddle detector can achieve higher repeatability and greater spread-out than traditional methods and even modern trainable ones (Komorowski et al. 2018).

2.2.3 Curvature-Based Detectors

Another strategy for corner feature extraction is based on detected high-level image structures, such as edges, contours, and salient regions. Corner features can be defined directly as the midpoints/endpoints of, or by sparse sampling from, an edge or contour (Belongie et al. 2002). These are subsequently used for shape matching or point registration, especially for image pairs with little texture or of binary type. The curvature-based strategy aims to extract corner points by searching for maximum curvature along detected curve-like image edges. This strategy starts with an edge extraction and selection method, and the two subsequent steps are curve smoothing and curvature estimation. The corners are finally determined by selecting the curvature extremum points. In general, an edge detector is often needed first for contour curvature-based corner detection.

In curve smoothing, the slope and curvature are difficult to evaluate due to the quantized position of a curve point. Noise and local deformation in a curve may also seriously affect the feature stability and distinctiveness. Therefore, smoothing methods should be applied before or during the curvature calculation to make the curvature extremum points more distinct from other curve points. Two smoothing strategies, namely, direct and indirect methods, are generally utilized. Direct smoothing, such as Gaussian smoothing (Mokhtarian and Suomela 1998; Pinheiro and Ghanbari 2010), removes noise but may change curve locations to a certain extent. In comparison, indirect smoothing strategies, e.g., the region-of-support method or the chord-length-based method (Ramer 1972; Awrangjeb and Lu 2008), may preserve the curve point locations.

As for curvature estimation, for each point of the smoothed curve, a significance response measure, i.e., the curvature, is needed for corner searching. Curvature estimation methods are also generally classified as direct and indirect. The former is based on an algebraic or geometric estimation, such as cosine, local curvature, and tangential deflection (Mokhtarian and Suomela 1998; Rosenfeld and Weszka 1975; Pinheiro and Ghanbari 2010). The latter estimates the curvature in an indirect way and is often used as a significance measure, such as counting the number of curve points within several moving rectangles along the curve (Masood and Sarfraz 2007), using the perpendicular distances from curve points to the chord connecting the two endpoints of the curve (Ramer 1972), and other alternatives (Zhang et al. 2010, 2015). Compared with indirect estimation methods, the direct ones are more sensitive to noise and local variation because fewer neighboring points are considered.

Finally, corners can be determined with a thresholding strategy to remove false and indistinctive points (Mokhtarian and Suomela 1998; Awrangjeb and Lu 2008). Additional details can be obtained from a contour curvature-based corner survey (Awrangjeb et al. 2012). More recently, a multiscale segmentation-based corner detector named MSFD (Mustafa et al. 2018) has been proposed for wide-baseline scene matching and reconstruction. Feature points in MSFD are detected at the intersections of the boundaries of three or more regions obtained by off-the-shelf segmentation methods. MSFD can generate rich and accurate corner features for wide-baseline image matching and high reconstruction performance.

The above-mentioned corner feature detectors tend to be located on the contour or edge structures of an image (i.e., they are not well spread out and have an uneven distribution), and they are limited by the scale and affine transformations between two images. Among the three types of corner detection strategies, the gradient-based methods are able to locate corners more accurately, whereas the intensity-based methods have an advantage in efficiency. The contour curvature-based methods require more computation, but they are a better choice for processing textureless or binary images, such as infrared and medical images, because image cue-based feature descriptors are unworkable for these types of images and point-based descriptors are often coupled for the matching task (i.e., point set registration or shape matching). Please refer to Sects. 3 and 4 for details.

2.3 Blob Features

A blob feature is commonly indicated as a local closed region (e.g., with a regular shape of circle or ellipse), inside which the pixels are considered similar to one another and distinct from the surrounding neighborhoods. A blob feature can be written in the form (x, y, θ), with (x, y) being the pixel coordinate of the feature location and θ indicating the blob shape information of the feature, including scale and/or affine shape. Numerous blob feature detectors have been introduced over the past decades, and they can be roughly classified into second-order partial derivative- and


region segmentation-based detectors. Second-order partial derivative-based methods rely on Laplacian scale selection and/or Hessian matrix calculation for affine invariance, while segmentation-based methods prefer to detect blob features by segmenting the morphological regions first and then estimating the affine information with ellipse fitting. Compared with corner features, blob features are more useful for visual applications with high precision requirements, because more image cues are utilized for feature identification and representation, thus enabling the blob features to be more accurate and robust to image transformation.

2.3.1 Second-Order Partial Derivative-Based Detectors

In methods based on second-order partial derivatives, the Laplacian of Gaussian (LoG) (Lindeberg 1998) is applied based on scale space theory. Here, the Laplace operator is first used for edge detection in accordance with the zero crossings in the second-order differential of an image, and Gaussian convolution filtering is then applied as a preprocessing step to reduce noise.

LoG can detect the local extremum point and the area with normalized response arising from the circular symmetry of the Gaussian kernel. Different standard deviations of the Gaussian function can detect scale-invariant blobs at different scales by searching the extremum in the multi-scale space as the final stable blob feature. The difference of Gaussians (DoG) (Lowe et al. 1999; Lowe 2004) filter can be used to approximate the LoG filter, and it greatly speeds up the computation. Another classical blob feature detection strategy is based on the determinant of Hessian (DoH) (Mikolajczyk and Schmid 2001, 2004). This is more affine invariant because the eigenvalues and eigenvectors of the second-moment matrix can be applied to estimate and correct the affine region.

Interest point detection using DoG, DoH, or both has been widely utilized in recent visual applications. The famous SIFT (Lowe et al. 1999; Lowe 2004) extracts keypoints as the local extrema in a DoG pyramid, filtered using the Hessian matrix of the local intensity values (the corresponding description part will be reviewed in the next section). Mikolajczyk et al. combined the Harris and Hessian detectors with the Laplacian and Hessian matrices for scale and affine feature detection (Mikolajczyk and Schmid 2001, 2004), i.e., the Harris/Hessian-Laplacian/affine detectors. SURF (Bay et al. 2006) accelerates SIFT by approximating the Hessian matrix-based detector using Haar wavelet calculations together with an integral image strategy, thus simplifying the construction of the second-order differential template.

Several SIFT- and SURF-based improvements have been successively proposed for better properties in subsequent applications. Such improvements include a fully affine invariant SIFT detector (ASIFT) (Morel and Yu 2009), a center-surround extremum (Agrawal et al. 2008) strategy feature detector with the Laplace calculation approximated by the proposed bilateral filtering to enhance the efficiency, and the efficient approximation of DoH with piecewise triangle filters in DARTs (Marimon et al. 2010). In addition, a cosine-modulated Gaussian filter is utilized in the SIFT-ER detector (Mainali et al. 2013) to obtain high feature detectability with minimum scale-space localization errors, in which the filterbank system has a highly accurate filter approximation without any image sub/upsampling. An edge foci-based blob detector (Zitnick and Ramnath 2011) has also been introduced for the matching task. In this detector, an edge focus is defined as a point in an image that is roughly equidistant from the closest edges with orientations perpendicular to this point.

Unlike the circle-like Gaussian response function, a nonlinear partial differential equation is applied in the KAZE detector for blob feature searching with nonlinear diffusion filtering (Alcantarilla et al. 2012). An accelerated version called AKAZE (Alcantarilla and Solutions 2011) is implemented by embedding fast explicit diffusion in a pyramidal framework to dramatically speed up feature detection in nonlinear scale spaces. However, it still suffers from high computational complexity. Another method is WADE (Salti et al. 2013), which implements nonlinear feature detection by a wave propagation function.

2.3.2 Segmentation-Based Detectors

The segmentation-based blob detectors begin with an irregular region segmentation based on constant pixel intensity or zero gradient. One of the most famous region segmentation-based blob features is the maximally stable extremal region (MSER) (Matas et al. 2004). It extracts regions that remain stable under a large range of intensity thresholding values. This approach does not need extra processing for scale estimation and is robust to large viewpoint changes. The term "maximally stable" describes the threshold selection process, given that every extremal region is a connected component of a thresholded watershed image. An extension to MSER was introduced in Kimmel et al. (2011) to exploit shape structure cues. Other improvements are based on the watershed regions of principal curvature images (Deng et al. 2007; Ferraz and Binefa 2012) or consider color information for higher discrimination (Forssén 2007).

Similar to MSER, other segmentation-based features, such as intensity- and edge-based regions (Tuytelaars and Van Gool 2004), are also used for affine covariant region detection. However, feature detection of this type is of less use for feature matching, and it has gradually developed toward saliency detection and segmentation in computer vision. Specific method investigations and comprehensive reviews can be found in Mikolajczyk et al. (2005) and Li et al. (2015).
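As a complement to the LoG/DoG discussion above, the following is a minimal sketch of difference-of-Gaussians blob detection, where keypoints are taken as local extrema of the DoG stack in both space and scale. The sigma levels and the response threshold are assumed for illustration; a full detector such as SIFT additionally refines locations to subpixel/subscale accuracy and rejects edge-like responses with a Hessian ratio test.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def dog_blob_detection(image, sigmas=(1.6, 2.26, 3.2, 4.53, 6.4), threshold=0.01):
    """Detect blob-like keypoints as local extrema in a difference-of-Gaussians stack."""
    image = image.astype(np.float32) / 255.0
    # Build the Gaussian scale space and take differences of adjacent levels.
    gaussians = np.stack([gaussian_filter(image, s) for s in sigmas])
    dog = gaussians[1:] - gaussians[:-1]                      # shape (S-1, H, W)
    # A point is a candidate if it maximizes |DoG| in its 3x3x3 space-scale neighborhood.
    response = np.abs(dog)
    local_max = maximum_filter(response, size=(3, 3, 3)) == response
    keypoints = np.argwhere(local_max & (response > threshold))
    # Each keypoint is (scale_index, row, col); map the scale index back to sigma.
    return [(r, c, sigmas[s]) for s, r, c in keypoints]
```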


2.4 Learnable Features

In recent years, data-driven learning-based methods have achieved significant progress in general visual pattern recognition tasks, and they have also been applied to image feature detection. This pipeline can be roughly divided into the use of classical learning and the use of deep learning.

2.4.1 Classical Learning-Based Detectors

As early as the past decade, classical learning-based methods, such as decision trees, support vector machines (SVMs), and other classifiers (as opposed to deep learning), had already been used in handcrafted keypoint detection (Trajković and Hedley 1998; Strecha et al. 2009; Hartmann et al. 2014; Richardson and Olson 2013). The FAST (Trajković and Hedley 1998) detector was the first attempt to use traditional learning for reliable and matchable point identification, and similar strategies have been applied in many subsequent improvements (Mair et al. 2010; Rublee et al. 2011). Strecha et al. (2009) trained the WaldBoost classifier to learn keypoints with high repeatability on pre-aligned training sets. More recently, Hartmann et al. (2014) showed that it can be learnt from a structure-from-motion (SfM) pipeline to predict which candidate points are matchable, thus significantly reducing the number of interest points without losing excessive true matches. Meanwhile, Richardson and Olson (2013) reported that hand-designed detectors can be learned by random sampling in the space of convolutional filters and tried to find the optimal filter using a learning strategy over frequency-domain constraints. However, classical learning has only been used for reliable feature selection through classifier learning, rather than the extraction of interest features directly from raw images, until the emergence of deep learning.

2.4.2 Deep Learning-Based Detectors

Inspired by the handcrafted feature detectors, a general solution for CNN-based detection is to construct response maps to search for interest points in a supervised (Yi et al. 2016; Verdie et al. 2015; Zhang et al. 2017b), self-supervised (Zhang and Rusinkiewicz 2018; DeTone et al. 2018), or unsupervised manner (Lenc and Vedaldi 2016; Savinov et al. 2017; Ono et al. 2018; Georgakis et al. 2018; Barroso-Laguna et al. 2019). The task is often converted into a regression problem that can be trained in a differentiable way under transformation and imaging condition invariance constraints. Supervised methods have shown the benefits of using anchors (e.g., obtained from the SIFT method) to guide their training, but the performance can be largely restricted by the method of anchor construction, because the anchor itself is intrinsically difficult to define reasonably and may prevent the network from proposing new keypoints in case no anchor exists in the proximity (Barroso-Laguna et al. 2019). Self-supervised and unsupervised methods train detectors without any human annotations, and only the geometric constraints between two images are required for optimization guidance; simple human aid is sometimes required for pre-training (DeTone et al. 2018). In addition, many methods integrate feature detection into the entire matching pipeline by jointly training with feature description and matching (Yi et al. 2016; DeTone et al. 2018; Ono et al. 2018; Shen et al. 2019; Dusmanu et al. 2019; Choy et al. 2016; Rocco et al. 2018; Revaud et al. 2019), which can enhance the final matching performance and optimize the entire procedure in an end-to-end manner.

For instance, TILDE (Verdie et al. 2015) trains multiple piecewise linear regression models to detect repeatable keypoints under drastic imaging changes of weather and lighting conditions. First, it identifies good keypoint candidates in multiple training images taken from the same viewpoints using DoG for training set collection, and then it trains a general regressor to predict a score map, whose maxima after non-maximum suppression (NMS) can be regarded as the desired interest points.

DetNet (Lenc and Vedaldi 2016) is the first fully general formulation for learning local covariant features; it casts the detection task as a regression problem and then derives a covariance constraint to automatically learn stable anchors for local feature detection under geometric transformations. Meanwhile, Quad-net (Savinov et al. 2017) realizes keypoint detection under transformation-invariant quantile ranking with a single real-valued response function, enabling it to learn the detector completely from scratch by optimizing for a repeatable ranking. A similar detector in Zhang and Rusinkiewicz (2018) combines this "ranking" loss with a "peakedness" loss and produces a more repeatable detector.

Zhang et al. (2017b) proposed the TCDET detector by defining a novel formulation based on the new concepts of "standard patch" and "canonical feature" to place equal focus on discriminativeness and the covariant constraint. The proposed detector can detect discriminative and repeatable features under diverse image transformations. Key.Net (Barroso-Laguna et al. 2019) combines handcrafted and learned CNN filters within a shallow multiscale architecture to obtain a light and efficient trainable detector. The handcrafted filters provide anchor structures for localizing, scoring, and ranking repeatable features that are fed to the learned filters. The CNN is used to represent the scale space by detecting keypoints at different levels, and the loss function is defined to detect robust feature points from different scales and maximize the repeatability score. Affine region-based interest points are also learned using CNNs in Mishkin et al. (2017, 2018).

The methods that integrate a detector into a matching pipeline are similar to those solely designed for detection


reviewed above. The main difference may lie in the way of training, and the core challenge is to make the entire process differentiable. For example, Yi et al. (2016) attempted to train a detector, an orientation estimator, and a descriptor jointly based on inputting four patches. Their proposed LIFT can be regarded as a trainable version of SIFT and requires supervision from an SfM system for determining the feature anchors. The training procedure is conducted individually from descriptor to detector and can use the learned results to guide the detector training, thus promoting detectability. Unlike LIFT, SuperPoint (DeTone et al. 2018) introduces a fully convolutional model that inputs full-sized images and jointly computes pixel-level interest point locations and associated descriptors in one forward pass; a synthetic dataset is constructed for pseudo-ground truth generation and pre-training, and the homography adaptation module enables it to achieve self-supervised training while promoting detection repeatability.

LF-Net (Ono et al. 2018) confines the end-to-end pipeline to one branch to optimize the entire procedure in a differentiable way; it also uses a fully convolutional network operating on full-sized images to generate a rich feature score map, which can then be used to extract keypoint locations and the feature attributes, such as scale and orientation; simultaneously, it performs a differentiable form of NMS, namely softargmax, for subpixel localization, increasing the accuracy and saliency of keypoints. Similar to LF-Net, RF-Net (Shen et al. 2019) selects high-response pixels as keypoints on multiple scales, but the response maps are constructed from receptive feature maps. Bhowmik et al. (2020) indicated that increased accuracy for these low-level matching scores does not necessarily translate to better performance in high-level vision tasks; thus, they embedded the feature detector in a complete vision pipeline, where the learnable parameters are trained in an end-to-end manner. The authors overcome the discrete nature of keypoint selection and descriptor matching using principles from reinforcement learning. Luo et al. (2020) proposed ASLFeat to explore the local shape information of feature points and enhance the accuracy of point detection by jointly learning local feature detectors and descriptors. Another detection-related learning-based method is to estimate the orientation (Moo Yi et al. 2016), while the spatial transformer network (STN) (Jaderberg et al. 2015) can also be a great reference in deep learning-based detectors for rotation invariance (Yi et al. 2016; Ono et al. 2018).

Unlike local feature descriptors, there is little review on salient feature detectors, particularly for the recent CNN-based techniques. To the best of our knowledge, the most recent survey (Lenc and Vedaldi 2014) focuses on local feature detection. It introduces the basic ideas of several well-known methods, from handcrafted detectors to accelerated and learned ones.
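The differentiable NMS mentioned for LF-Net is essentially a soft-argmax over a local score-map window; the sketch below shows the idea for a single keypoint. The window size and temperature are assumed values, and border handling is omitted for brevity.

```python
import numpy as np

def soft_argmax_refine(score_map, peak_rc, window=5, temperature=10.0):
    """Refine an integer keypoint location to subpixel accuracy with a soft-argmax.

    A softmax over the local window turns scores into weights, and the keypoint is the
    score-weighted average of pixel coordinates; unlike a hard argmax, every step is
    differentiable, so a detector using it can be trained end to end.
    """
    r0, c0 = peak_rc
    half = window // 2
    patch = score_map[r0 - half:r0 + half + 1, c0 - half:c0 + half + 1]
    weights = np.exp(temperature * (patch - patch.max()))
    weights /= weights.sum()
    rows, cols = np.mgrid[-half:half + 1, -half:half + 1]
    dr = float((weights * rows).sum())
    dc = float((weights * cols).sum())
    return r0 + dr, c0 + dc
```

Because every operation is a weighted average rather than a hard selection, gradients can flow from a matching loss back into the network that produced the score map.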
pixels as keypoints on multiscales, but the response maps saliency is determined by the eigenvector. In this way, points
are constructed by receptive feature maps. Bhowmik et al. with large variations along each principal direction are iden-
(2020) indicated that increased accuracy for these low-level tified. Analogous to ISS, Mian et al. (2010) also utilized the
matching scores does not necessarily translate to better per- scatter matrix to prune nondistinctive points but with a differ-
formance in high-level vision tasks, thus they embedded ent curvature-based saliency measurement. Sun et al. (2009)
the feature detector in a complete vision pipeline, where presented the heat kernel signature (HKS) method, based
the learnable parameters are trained in an end-to-end man- on the properties of the heat diffusion process on a shape.
ner. The authors overcome the discrete nature of keypoint In this method, the saliency measurement is defined by the
selection and descriptor matching using principles from rein- restriction of the heat kernel to the temporal domain. The
forcement learning. Luo et al. (2020) proposed ASFeat to heat kernel is uniquely determined by the underlying man-
explore local shape information of feature points and enhance ifold, which makes HKS a compact characterization of the
the accuracy of points detection, by jointly learning local shape.
feature detectors and descriptors. Another detection-related
learning-based method is to estimate the orientation (Moo Yi 2.5.2 Adaptive-Scale Detectors
et al. 2016), while the spatial transformation network (STN)
(Jaderberg et al. 2015) could also be a great reference in It is desirable to adaptively fit with the scale in detection.
deep learning-based detectors for rotation invariance (Yi et al. For this purpose, Unnikrishnan and Hebert (2008) proposed
2016; Ono et al. 2018). a Laplace-Beltrami scale space by computing the designed
Unlike local feature descriptors, there is little review on function on the increasing support around each point. This
salient feature detectors, particularly for the recent CNN- function is defined by a novel operator that reflects the local
based techniques. To our best knowledge, the most recent mean curvature of the underlying shape and provides the
survey (Lenc and Vedaldi 2014) focuses on local feature saliency information. Zaharescu et al. (2009) presented the
detection. It introduces the basic idea of several well- MeshDoG method, which is analogous to the DoG operator
known methods from handcrafted detectors to accelerated in the 2-D case (Lowe 2004); nonetheless, the operator is
and learned ones. computed on a scalar function defined on the manifold. The


output of the DoG operator represents the saliency for keypoint detection. Castellani et al. (2008) also built a scale space using the DoG operator but directly on the 3-D mesh. Mian et al. (2010) proposed an automatic scale selection technique for extracting scale-invariant features. The scale space is built by increasing the support size, and automatic scale selection at each keypoint is performed by using NMS along the scale axis. The disadvantage of the scale sensitivity of HKS was addressed by Bronstein and Kokkinos (2010), who used the Fourier transform magnitude to extract a scale-invariant quantity from the HKS without the need to perform scale selection. Sipiran and Bustos (2011) extended the well-known Harris operator (1988) to 3-D data with an adaptive-scale determination technique. Readers are referred to Tombari et al. (2013) for further discussion of other adaptive-scale detectors. Salti et al. (2015) devised a learning-based 3-D keypoint detector, whereby the keypoint detection problem was cast as a binary classification problem to determine whose support can be correctly matched by a predefined 3-D descriptor.
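To illustrate the fixed-scale saliency idea behind ISS-like detectors discussed above, the sketch below scores each point of an (N, 3) point array by the eigenvalues of the scatter matrix of its spherical neighborhood. The support radius and the eigenvalue-ratio thresholds are illustrative assumptions, not values taken from any particular method.

```python
import numpy as np
from scipy.spatial import cKDTree

def iss_like_saliency(points, radius=0.05, ratio21=0.8, ratio32=0.8):
    """Score each 3-D point by the eigenvalues of the scatter matrix of its neighborhood.

    With eigenvalues l1 >= l2 >= l3, a point is kept only if its neighborhood varies
    distinctly along every principal direction (small l2/l1 and l3/l2), and the
    smallest eigenvalue l3 is used as the saliency, in the spirit of ISS.
    """
    tree = cKDTree(points)
    saliency = np.zeros(len(points))
    for i, p in enumerate(points):
        idx = tree.query_ball_point(p, radius)
        if len(idx) < 5:
            continue  # too few neighbors for a stable scatter matrix
        neigh = points[idx] - points[idx].mean(axis=0)
        cov = neigh.T @ neigh / len(idx)
        l3, l2, l1 = np.linalg.eigvalsh(cov)  # ascending order
        if l2 / (l1 + 1e-12) < ratio21 and l3 / (l2 + 1e-12) < ratio32:
            saliency[i] = l3
    return saliency
```

Keypoints would then be taken as local maxima of this saliency within the same support radius.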
2.6 Summary

The basic idea of feature detectors is to distinguish the interest feature from others through the response value, which leads to the solution of two problems: (i) how to define discriminant patterns in an image, and (ii) how to repeatedly detect the salient feature under different image conditions and image qualities (Zhang et al. 2017b). Along with the development of these detectors, the main improvements and common strategies relate to four aspects, i.e., the feature response type and improvements in efficiency, robustness, and accuracy, which lead to an increase in the matchability of detected features and improved performance of their subsequent applications.

For traditional methods, using more image cues can result in better robustness and repeatability, but usually requires more computational cost. In addition to using low-order feature detectors, several strategies, such as approximation and pre-computation, are designed to largely speed up the computation while maintaining the matchability. To ensure robustness, scale and affine information estimation is usually required when searching for stable features. For accuracy enhancement, local extremum searching for subpixel accuracy and an NMS strategy in pixel and scale space to avoid locally clustered features are two popular choices in traditional pipelines.

As for learning-based detectors, repeatable and salient keypoints can be extracted based on high-level cues captured by CNNs, beyond intensity, gradient, or second-order derivatives. The efficiency largely depends on the network structure, and early deep learning methods are often time-consuming. Recently proposed methods, such as SuperPoint and Key.Net, have already achieved real-time implementation while maintaining state-of-the-art performance. Multiscale sampling or a changed receptive field can make these deep learning-based detectors invariant to scale, where the scale or rotation information is directly estimated in the networks. They can achieve promising results, because deep learning techniques can easily distinguish the same structures, despite the fact that images suffer from appearance variations and geometrical transformation. The accuracy can be optimized directly in the loss function of the learning-based methods, and a differentiable form of NMS is often used for subpixel-accuracy localization and repeatability enhancement.

3 Feature Description

Once discriminative interest points are detected from raw images, a local patch descriptor is required to be coupled with each feature in order to establish feature correspondence correctly and efficiently across two or more images. In other words, the feature descriptors are commonly used to transform the original local information around the interest point into a stable and discriminative form, usually a high-dimensional vector, so that two corresponding features are as close as possible in the descriptor space and two non-corresponding features are as far apart as possible.

3.1 Overview of Feature Descriptors

The processing procedure of feature description can be divided into three steps: local low-level feature extraction, spatial pooling, and feature normalization (Lowe 2004; Rublee et al. 2011; Brown et al. 2010). First, the low-level information of a local image region has to be extracted. This information consists of pixel intensity and gradient or is obtained from a series of steerable filters. Subsequently, the local patch is divided into several parts, the local information is pooled in each part, and the parts are then concatenated by using pooling methods such as rectangular gridding (Lowe 2004), polar gridding (Mikolajczyk and Schmid 2005), Gaussian sampling (Tola et al. 2010), and others (Rublee et al. 2011); the joint feature representation is transformed into a more discriminative one that may preserve significant information in a simplified form for better matching performance. Finally, a descriptor is obtained from the normalized results of the pooled local information, which aims to map the aggregated results into a long vector of either floating-point or binary values for easily evaluating the similarity between image features.

Similar to feature detectors, existing descriptors are proposed and improved to become highly robust, efficient, and discriminant for addressing image matching problems. Estimating a good size and orientation for a cropped image


patch is a core problem in the task of feature description and matching. By correctly identifying the size and orientation, the matching methods can be robust and invariant to global and/or local deformations, such as rotation and scaling. The original intention of feature description is focused on discrimination enhancement compared with direct similarity measurement using raw image information. Numerous well-designed descriptors can improve the discrimination and matching performance by using pooling parameter optimization, sampling rule design, or the use of machine learning and deep learning techniques.

Feature description has drawn increasing attention. Descriptors can be regarded as distinguishable and robust representations of given images and are widely used not only in image matching but also in image coding for image retrieval, face recognition, and other tasks that are based on image similarity measurements. However, direct similarity measurement of two image patches using raw image information is regarded as an area-based image matching method, which will be reviewed in the next section. As for image patch-based feature descriptors, we will review the traditional ones, i.e., floating-point and binary descriptors, in terms of their data types. A separate subsection is devoted to the recent data-driven methods, including classical machine learning- and emerging deep learning-based methods. We will comprehensively review handcrafted and learning-based feature description methods and show the connections among these methods to provide useful instructions for readers toward their further research, especially for developing better description approaches using deep learning/CNN techniques. In addition, we will also review 3-D feature descriptors, where features are typically obtained from point data without any image pixel information but with spatial position relationships (e.g., 3-D point cloud registration).

3.2 Handcrafted Feature Descriptors

Handcrafted feature descriptors often depend on expert prior knowledge and are still widely used in many visual applications. Following the construction procedure of a traditional local descriptor, the first step is to extract low-level information, which can be briefly classified into image gradient and intensity. Subsequently, the commonly used pooling and normalizing strategies, such as statistics and comparison, are applied to generate long and simple vectors for discriminative description with respect to the data type (float or binary). Therefore, handcrafted descriptors mostly rely on the knowledge of their authors, and description strategies can be classified into gradient statistic-, local binary pattern statistic-, local intensity comparison-, and local intensity order statistic-based methods.

3.2.1 Gradient Statistic-Based Descriptors

Gradient statistic methods are often used to form float-type descriptors such as the histogram of oriented gradients (HOG) (Dalal and Triggs 2005), as introduced in SIFT (Lowe et al. 1999; Lowe 2004) and its improved versions (Bay et al. 2006; Morel and Yu 2009; Dong and Soatto 2015; Tola et al. 2010), and they are still widely used in several modern visual tasks. In SIFT, the feature scale and orientation are respectively determined by DoG computation and the largest bin in a histogram of gradient orientations from a local circular region around the detected keypoint, thus achieving scale and rotation invariance. In the description stage, the local region of the detected feature is first rectangularly divided into 4 × 4 non-overlapping grids based on the normalized scale and rotation; then a histogram of gradient orientations with 8 bins is computed in each cell and embedded into a 128-dimensional float vector as the SIFT descriptor.

Another representative descriptor, namely SURF (Bay et al. 2006), accelerates the SIFT operator by using the responses of Haar wavelets to approximate the gradient computation; integral images are also applied to avoid repeated computation in the Haar wavelet responses, enabling more efficient computation than SIFT. Other improvements based on these two typically focus on discrimination, efficiency, robustness, and coping with specific image data or tasks. For instance, CSIFT (Abdel-Hakim and Farag 2006) uses additional color information to enhance the discrimination, and ASIFT (Morel and Yu 2009) simulates all image views obtainable by varying the two camera axis orientation parameters for full affine invariance. Mikolajczyk and Schmid (2005) use a polar division and histogram statistics of gradient orientations. SIFT-rank (Toews and Wells 2009) has been proposed to investigate ordinal image description based on off-the-shelf SIFT for invariant feature correspondence. A Weber's law-based method (WLD) (Chen et al. 2009) has been studied to compute a histogram by encoding differential excitations and orientations at certain locations.

Arandjelović and Zisserman (2012) used a square root (Hellinger) kernel instead of the standard Euclidean distance measurement to transform the original SIFT space into the RootSIFT space and yielded superior performance without increasing processing or storage requirements. Dong and Soatto (2015) modified SIFT by pooling the gradient orientations across different domain sizes and proposed the DSP-SIFT descriptor. Another efficient dense descriptor for wide-baseline stereo based on SIFT, namely DAISY (Tola et al. 2010), uses a log-polar grid arrangement and a Gaussian pooling strategy to approximate the histograms of gradient orientations. Inspired by DAISY, DARTs (Marimon et al. 2010) efficiently compute the scale space and reuse it for descriptors, thus resulting in high efficiency. Several handcrafted float-type descriptors have also been proposed


recently and have shown promising performance; for example, the pattern of local gravitational force descriptor (Bhattacharjee and Roy 2019) is inspired by the law of universal gravitation and can be regarded as a combination of force magnitude and angle.
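As a concrete reference for the gradient-statistic family above, the sketch below builds a SIFT-like 4 × 4 × 8 = 128-dimensional descriptor from a 16 × 16 patch that is assumed to be already scale- and rotation-normalized; the Gaussian weighting and trilinear vote interpolation of the full SIFT descriptor are omitted for brevity.

```python
import numpy as np

def sift_like_descriptor(patch):
    """Build a 4x4 grid of 8-bin gradient-orientation histograms (128-D) from a 16x16 patch."""
    assert patch.shape == (16, 16)
    gy, gx = np.gradient(patch.astype(np.float32))
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx) % (2 * np.pi)
    descriptor = np.zeros((4, 4, 8), dtype=np.float32)
    for r in range(16):
        for c in range(16):
            # Each pixel votes with its gradient magnitude into the orientation bin of its cell.
            bin_idx = int(orientation[r, c] / (2 * np.pi) * 8) % 8
            descriptor[r // 4, c // 4, bin_idx] += magnitude[r, c]
    vec = descriptor.ravel()
    # Normalize, clip large components, and renormalize (standard SIFT post-processing).
    vec /= (np.linalg.norm(vec) + 1e-7)
    vec = np.minimum(vec, 0.2)
    return vec / (np.linalg.norm(vec) + 1e-7)
```

The RootSIFT variant mentioned above would simply L1-normalize such a vector and take an element-wise square root, so that Euclidean distances between descriptors approximate the Hellinger kernel.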

3.2.2 Local Binary Pattern Statistic-Based Descriptors 3.2.4 Local Intensity Order Statistic-Based Descriptors

Different from SIFT-like approaches, several intensity statistic- Thus far, many methods have been devised using orders
based methods, which are inspired by the local binary pattern of pixel values rather than raw intensities, achieving more
(LBP) (Ojala et al. 2002), have been proposed in the past promising performance (Tang et al. 2009; Toews and Wells
decades. LBP has properties that favor its usage in inter- 2009). Pooling by intensity orders is invariant to rotation
est region description, such as tolerance against illumination and monotonic intensity changes and also encodes ordi-
change and computational simplicity. The drawbacks are nal information into descriptor; the intensity order-pooling
that the operator produces a rather long histogram and is scheme may enable the descriptors to be rotation-invariant
insignificantly robust in flat image areas. Center-symmetric without estimation of a reference orientation as SIFT, which
LBP (CS-LBP) (Heikkilä et al. 2009) (using SVM for clas- appears as a major error source for most existing methods.
sifier training) is a modified version of LBP combining the To solve this problem, Tang et al. proposed the ordinal spa-
strengths of SIFT and LBP to address the flat area problem. tial intensity distribution (Tang et al. 2009) method, which
Specifically, CS-LBP uses a SIFT-like grid and replaces the normalizes captured texture information and structure infor-
gradient information with an LBP-based feature. To address mation using an ordinal and spatial intensity histogram; the
the noise, center-symmetric local ternary pattern (CS-LTP) proposed method is invariant to any monotonically increas-
(Gupta et al. 2010) suggests the use of a histogram of rel- ing brightness changes.
ative orders in patch and a histogram of LBP codes, such Fan et al. (2011) pooled local features based on their gra-
as histogram of relative intensities. The two CS-based meth- dient and intensity orders in multiple support regions and
ods are designed to be more robust to Gaussian noise than proposed the multi-support region order-based gradient his-
previously considered descriptors. RLBP (Chen et al. 2013) togram and the multi-support region rotation and intensity
improves the robustness of LBP by changing the coding bit; monotonic invariant descriptor methods. A similar strategy
a completed modeling of the LBP operator and an associ- was used in LIOP (Wang et al. 2011, 2015), to encode the
ated completed LBP scheme (Guo et al. 2010) have been local ordinal information of each pixel. In that work, the over-
developed for texture classification. LBP-like methods are all ordinal information was used to divide the local patch into
widely used in texture representation and face recognition subregions, which were used to accumulate LIOP. LIOP was
community, and additional details can be found in the review further improved into OIOP/MIOP (Wang et al. 2015), which
literature (Huang et al. 2011). can then encode overall ordinal information for noise and
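To make the center-symmetric comparison idea concrete, a minimal sketch is given below. The 3x3 neighborhood, the comparison threshold, and the plain histogram accumulation are illustrative simplifications rather than the exact settings of the cited works, which compute such histograms over a SIFT-like grid of cells.

```python
import numpy as np

def cs_lbp_code(patch, threshold=0.01):
    """CS-LBP code for the 3x3 neighborhood around the patch center.

    Only the four center-symmetric pixel pairs are compared, which yields a
    4-bit code (16 bins) instead of the 256 bins of plain LBP.
    """
    c = np.asarray(patch, dtype=float)
    # Eight neighbors of the central pixel, listed clockwise from the top-left.
    n = [c[0, 0], c[0, 1], c[0, 2], c[1, 2], c[2, 2], c[2, 1], c[2, 0], c[1, 0]]
    code = 0
    for bit in range(4):  # compare each pixel with its center-symmetric partner
        if n[bit] - n[bit + 4] > threshold:
            code |= 1 << bit
    return code

def cs_lbp_histogram(region):
    """Accumulate CS-LBP codes over all 3x3 neighborhoods of a gray-scale region."""
    region = np.asarray(region, dtype=float)
    hist = np.zeros(16)
    for i in range(1, region.shape[0] - 1):
        for j in range(1, region.shape[1] - 1):
            hist[cs_lbp_code(region[i - 1:i + 2, j - 1:j + 2])] += 1
    return hist / max(hist.sum(), 1.0)  # normalized 16-bin histogram
```

Because only the four center-symmetric pairs are compared, the histogram stays short, which is precisely what makes CS-LBP usable as a cell feature inside a SIFT-like spatial grid.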
3.2.3 Local Intensity Comparison-Based Descriptors

Another form of descriptors is based on the comparison of local intensities; these are also called binary descriptors, and the core challenge is the selection rule for the comparisons. Because of their limited distinctiveness, these methods are mostly limited to short-baseline matching. Calonder et al. (2010) proposed the BRIEF descriptor, built by concatenating the results of binary intensity tests for several random point pairs in an image patch. Rublee et al. (2011) proposed rotated BRIEF combined with oriented FAST corners and selected robust binary tests using a machine learning strategy in their ORB algorithm to alleviate the limitations in rotation and scale change. Leutenegger et al. (2011) developed the BRISK method using a concentric circle sampling strategy with increasing radius. Inspired by the retina structure, Alahi et al. (2012) proposed the FREAK descriptor by comparing image intensities over a retinal sampling pattern for fast computing and matching with low memory cost while remaining robust to scale, rotation, and noise. Handcrafted binary descriptors and classical machine learning techniques are also widely studied and shall be introduced in the learning-based subsection.
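The core of all of these descriptors is a fixed set of pairwise intensity tests; the sketch below illustrates this in its simplest BRIEF-like form. The number of bits, the patch size, and the uniform random sampling of test locations are assumptions made here for illustration (BRIEF smooths the patch first, and learned variants such as ORB select the test pattern from data).

```python
import numpy as np

def make_test_pattern(n_bits=256, patch_size=32, seed=0):
    """Pre-generate random point pairs inside the patch (kept fixed for all patches)."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, patch_size, size=(n_bits, 4))  # (x1, y1, x2, y2) per bit

def brief_descriptor(patch, pattern):
    """Binary descriptor: one intensity comparison per pre-generated point pair."""
    patch = np.asarray(patch, dtype=float)
    x1, y1, x2, y2 = pattern[:, 0], pattern[:, 1], pattern[:, 2], pattern[:, 3]
    return (patch[y1, x1] < patch[y2, x2]).astype(np.uint8)

def hamming_distance(d1, d2):
    """Matching binary descriptors reduces to counting differing bits."""
    return int(np.count_nonzero(d1 != d2))
```

Since the descriptor is a bit string, candidate matches are compared with the Hamming distance, which is what makes these methods attractive for real-time and memory-constrained settings.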
3.2.4 Local Intensity Order Statistic-Based Descriptors

Thus far, many methods have been devised using the orders of pixel values rather than raw intensities, achieving more promising performance (Tang et al. 2009; Toews and Wells 2009). Pooling by intensity orders is invariant to rotation and monotonic intensity changes and also encodes ordinal information into the descriptor; the intensity order-pooling scheme may enable the descriptors to be rotation-invariant without estimating a reference orientation as in SIFT, which appears to be a major error source for most existing methods. To solve this problem, Tang et al. proposed the ordinal spatial intensity distribution (Tang et al. 2009) method, which normalizes the captured texture and structure information using an ordinal and spatial intensity histogram; the proposed method is invariant to any monotonically increasing brightness change.

Fan et al. (2011) pooled local features based on their gradient and intensity orders in multiple support regions and proposed the multi-support region order-based gradient histogram and the multi-support region rotation and intensity monotonic invariant descriptor methods. A similar strategy was used in LIOP (Wang et al. 2011, 2015) to encode the local ordinal information of each pixel. In that work, the overall ordinal information was used to divide the local patch into subregions, which were used to accumulate LIOP. LIOP was further improved into OIOP/MIOP (Wang et al. 2015), which can also encode overall ordinal information for noise and distortion robustness. The authors also proposed a learning-based quantization to improve its distinctiveness.

3.3 Learning-Based Feature Descriptors

Handcrafted descriptors, as reviewed above, require expertise to design and may disregard useful patterns hidden in the data. This requirement has prompted investigations into learning-based descriptors, which have recently become dominantly popular due to their data-driven nature and promising performance. In the following, we will discuss a group of classical learning-based descriptors introduced before the deep learning era.

3.3.1 Classical Learning-Based Descriptors

The learning-based descriptors can be traced back to PCA-SIFT (Ke et al. 2004), in which principal component analysis (PCA) is used to form a robust and compact descriptor by
reducing the dimensionality of a vector made of the local image gradients. Cai et al. (2010) investigated the use of linear discriminant projections to reduce dimensionality and improve the discriminability of local descriptors. Brown et al. (2010) introduced a learning framework with a set of building blocks for constructing descriptors, using Powell minimization and the linear discriminant analysis (LDA) technique to find the optimal parameters. Simonyan et al. (2014) presented a novel formulation that casts the spatial pooling and dimensionality reduction in descriptor learning as convex optimization problems based on Brown's work (Brown et al. 2010). Meanwhile, Trzcinski et al. (2012, 2014) applied the boosting trick to learn boosted, complex non-linear local visual feature representations from multiple gradient-based weak learners.

Apart from the above-mentioned float-valued descriptors, binary descriptors are also of great interest in classical descriptor learning due to their beneficial properties, such as low storage requirements and high matching speed. A natural way to obtain binary descriptors is to learn them from given float-valued descriptors. This task is conventionally achieved by hashing methods, which learn compact representations of high-dimensional data while maintaining their similarity in the new space. Locality sensitive hashing (LSH) (Gionis et al. 1999) is arguably the most popular unsupervised hashing method. This method generates embeddings via random projections and has been used for many large-scale search tasks. Some variants of LSH include kernelized LSH (Kulis and Grauman 2009), spectral hashing (Weiss et al. 2009), semantic hashing (Salakhutdinov and Hinton 2009) and p-stable distribution-based LSH (Datar et al. 2004). These variants are unsupervised by design.
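The random-projection idea behind basic LSH is compact enough to sketch directly; the code below binarizes pre-computed float descriptors with random sign projections. The number of bits and the Gaussian projection matrix are illustrative choices, not the settings of any specific cited variant.

```python
import numpy as np

def lsh_hash(descriptors, n_bits=64, seed=0):
    """Binarize float descriptors with random sign projections (basic LSH).

    Descriptors that are close in the original space tend to agree on many bits,
    so the Hamming distance on the codes approximates their original similarity.
    """
    descriptors = np.atleast_2d(np.asarray(descriptors, dtype=float))
    rng = np.random.default_rng(seed)
    projections = rng.standard_normal((descriptors.shape[1], n_bits))
    return (descriptors @ projections > 0).astype(np.uint8)
```

Nearby descriptors agree on most bits with high probability, so matching on the resulting codes can stand in for the original float-valued similarity at a fraction of the storage and matching cost.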
Supervised hashing methods have also been extensively investigated, where different machine learning strategies have been proposed to learn feature spaces tailored to specific tasks. In this case, a plethora of methods have been proposed (Kulis and Darrell 2009; Wang et al. 2010; Strecha et al. 2012; Liu et al. 2012a; Norouzi and Blei 2011; Gong et al. 2013; Shakhnarovich 2005), among which image matching is considered an important experimental validation task. For example, the LDA technique is utilized in Strecha et al. (2012) to aid hashing. Semi-supervised sequential learning algorithms are proposed in Liu et al. (2012a) and Wang et al. (2010) to find discriminative projections. Minimal loss hashing (Norouzi and Blei 2011) provided a new formulation to learn binary hash functions on the basis of structural SVMs with latent variables. Gong et al. (2012) proposed searching for a rotation of zero-centered data that minimizes the quantization error of mapping the descriptor to the vertices of a zero-centered binary hypercube.

Trzcinski and Lepetit (2012) and Trzcinski et al. (2017) reported that a straightforward way of developing binary descriptors is to directly learn representations from image patches. In Trzcinski and Lepetit (2012), they proposed to project image patches to a discriminant subspace by using a linear combination of a few simple filters and then threshold their coordinates to create the compact binary descriptor. The success of descriptors (e.g., SIFT) in image matching indicates that non-linear filters, such as gradient responses, are more suitable than linear ones. Trzcinski et al. (2017) therefore proposed to learn a hash function of the same form as an AdaBoost strong classifier, i.e. the sign of a linear combination of nonlinear weak learners, for each descriptor bit. This work is more general and powerful than Trzcinski and Lepetit (2012), which is based on simple thresholded linear projections. Trzcinski et al. (2017) also proposed to generate binary descriptors that are independently adapted per patch; this objective is achieved by inter- and intra-class online optimization of the descriptors.

3.3.2 Deep Learning-Based Descriptors

Descriptors using deep techniques are usually formulated within a supervised learning problem. The objective is to learn a representation that enables two matched features to be as close as possible while the unmatched ones are far apart in the measuring space (Schonberger et al. 2017). Descriptor learning is often conducted with cropped local patches centered on the detected keypoints; thus, it is also known as patch matching. In general, existing methods take two forms, namely, metric learning (Weinberger and Saul 2009; Zagoruyko and Komodakis 2015; Han et al. 2015; Kedem et al. 2012; Wang et al. 2017) and descriptor learning (Simo-Serra et al. 2015; Balntas et al. 2016a, 2017; Zhang et al. 2017c; Mishchuk et al. 2017; Wei et al. 2018; He et al. 2018; Tian et al. 2019; Luo et al. 2019), according to the output of the deep network. These two forms are often jointly trained. Specifically, metric learning methods often learn a discriminative metric for similarity measurement with raw patches or generated descriptors as inputs. By contrast, descriptor learning tends to generate the descriptor representation from raw images or patches. Such a process requires a measurement method, such as the L2 distance or a trained metric network, for similarity evaluation. In contrast with single metric learning, the use of CNNs to generate description vectors is more flexible and may save time by avoiding repeated computation when a large number of candidate patches are available for correspondence search. Deep learning has achieved satisfying performance in feature description due to its strong ability in information extraction and representation.

Descriptors with deep learning techniques can be regarded as an extension of those based on classical learning (Schonberger et al. 2017). For instance, the Siamese structure in Chopra et al. (2005) and the commonly used loss
functions, such as hinge, Siamese, triplet, ranking, and contrastive losses, have been borrowed and modified in recent deep methods. Specifically, Zagoruyko and Komodakis (2015) proposed their DeepCompare and demonstrated how to directly learn a general patch similarity function from raw image pixels. In such a scenario, various Siamese-type CNN models are applied to encode the similarity function. These models are then trained to identify the positive and negative image patch pairs. The attempted network structures include Siamese with shared or unshared weights and a central-surround form. MatchNet (Han et al. 2015) is proposed to simultaneously learn the descriptor and the metric. Such a technique is implemented by cascading a Siamese-like description network and a fully convolutional decision network, converting the task into a classification problem under a cross-entropy loss. DeepDesc (Simo-Serra et al. 2015) uses CNNs to learn discriminant patch representations together with L2 distance measuring. In particular, it trains a Siamese network with pairs of positive and negative patches by minimizing the pairwise hinge loss, and the proposed hard negative mining strategy alleviates the imbalance between positive and negative samples. Consequently, the description performance is significantly enhanced. Wang et al. (2014) proposed a novel deep ranking model to learn fine-grained image similarity. The model employs a triplet-based hinge loss and ranking function to characterize fine-grained image similarity relationships, and a multiscale neural network architecture is utilized to capture the global visual properties and image semantics.

Kumar et al. (2016) first used the global loss to enlarge the distance margin between positive and negative patch pairs. It is implemented through triplet and Siamese networks trained with a combination of triplet and global losses. TFeat (Balntas et al. 2016b) proposes to utilize triplets of training samples for CNN-based patch description and matching; it is implemented with shallow convolutional networks and a fast hard negative mining strategy. In L2Net (Tian et al. 2017), Tian et al. applied a progressive sampling strategy to optimize the relative distance-based loss function in the Euclidean space. The authors of that work considered the intermediate feature map and the compactness of the descriptor to achieve better performance. HardNet (Mishchuk et al. 2017) achieves a further improvement over L2Net by using a simple hinge triplet loss with "hardest-within-batch" mining. PN-Net (Balntas et al. 2016a) uses ideas introduced in the fields of distance metric learning and online boosting by simultaneously training with positive and negative constraints. The proposed SoftPN loss function exhibits faster convergence and lower error than the hinge loss or SoftMax ratio (Wang et al. 2014; Zagoruyko and Komodakis 2015). Zhang et al. (2017c) trained their networks by using their proposed global orthogonal regularization together with a triplet loss to encourage the descriptors to be sufficiently "spread out" and thus fully utilize the descriptor space.
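As a concrete illustration of the triplet losses with hard negative mining used by several of the methods above, the following sketch computes a hinge triplet loss with "hardest-within-batch" negative mining in the spirit of HardNet. It is written with NumPy for self-containedness (a training implementation would use a deep learning framework), and it mines negatives only over the positive column of the batch, which is a simplification of the original scheme.

```python
import numpy as np

def hardest_in_batch_triplet_loss(anchors, positives, margin=1.0):
    """Hinge triplet loss with hardest-within-batch negative mining.

    anchors[i] and positives[i] are L2-normalized descriptors of the same
    physical point; every positives[j] with j != i acts as a candidate negative.
    """
    diff = anchors[:, None, :] - positives[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-8)          # pairwise distances
    pos = np.diag(dist)                                  # matching pairs
    neg = dist + np.diag(np.full(len(dist), np.inf))     # mask the positives
    hardest_neg = neg.min(axis=1)                        # closest non-matching descriptor
    return float(np.maximum(0.0, margin + pos - hardest_neg).mean())
```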
Descriptor learning based on average precision (He et al. 2018) introduces a general-purpose learning-to-rank formulation. This approach is defined by the constraint that true matches should be ranked above all false patch matches, and it is optimized for both binary and real-valued local feature descriptors. BinGAN (Zieba et al. 2018) proposes a regularization method for generative adversarial networks (Goodfellow et al. 2014) to learn discriminative yet compact binary representations of image patches. In comparison, other methods focused on binary descriptor learning are proposed in Erin Liong et al. (2015), Lin et al. (2016a) and Duan et al. (2017). Beyond the loss function, network structure, regularization and hard negative mining, Wei et al. (2018) learned a discriminative deep descriptor by using kernelized subspace pooling, and Tian et al. (2019) used second-order similarity in their SOSNet. In ContextDesc, a more recent method, Luo et al. (2019) combined the local patch similarity constraint with the spatial geometrical constraint of interest points to train their networks, which largely improves the matching performance.1

As mentioned for the CNN-based detectors, an increasing number of end-to-end learning methods integrate the feature description together with the detectors into the complete matching pipeline. These methods are similar to those that have been singly designed for the description reviewed above. The main difference may lie in the way of training and the design of the entire network structure. The core challenge is to make the whole process differentiable and trainable. For example, LIFT (Yi et al. 2016) attempts to simultaneously implement keypoint detection, orientation estimation, and feature description by end-to-end CNN networks.

SuperPoint (DeTone et al. 2018) proposes a self-supervised framework for training interest point detectors and descriptors for multiple-view geometrical problems. The fully convolutional model operates on full-sized images and jointly computes pixel-level interest point locations and associated descriptors, which is in contrast with patch-based networks. LF-Net (Ono et al. 2018) devises a two-branch setup and creates virtual target responses iteratively to allow training from scratch without handcrafted priors. This technique realizes feature map generation, scale-invariant keypoint detection using top-K selection and NMS, orientation estimation, and descriptor extraction. In LF-Net, the target function includes an image-level loss (satisfying additional constraints among image pairs, the depth map, and the essential matrix), a patch-wise loss (learning keypoints that are good for matching and involving the geometric consistency of the orientation and scale components), and a triplet loss for descriptor learning.

1 https://image-matching-workshop.github.io/leaderboard/.
Subsequently, RF-Net (Shen et al. 2019) creates an end-to-end trainable matching framework that is modified from the LF-Net structure. First, the constructed receptive feature maps lead to effective keypoint detection. Second, a general loss function term, that is, the neighbor mask, facilitates training patch selection to enhance the stability of descriptor training. D2-Net (Dusmanu et al. 2019) uses a single CNN to play a dual role: it simultaneously acts as a dense feature descriptor and a feature detector. In Bhowmik et al. (2020), keypoint selection and descriptor matching are optimized under high-level vision tasks by using principles from reinforcement learning. In addition, Li et al. (2020) introduced dual-resolution correspondence networks to obtain pixel-wise correspondences in a coarse-to-fine manner by extracting feature maps at different resolutions.

Beyond feature matching for the same target or scene, semantic matching for images that are captured from similar targets/scenes has also been studied using CNNs, and a distinct improvement has been achieved. The semantic matching problem may pose a challenge for handcrafted methods due to the required understanding of semantic similarity. To this end, UCN (Choy et al. 2016) uses deep metric learning to directly learn a feature space that preserves either geometric or semantic similarity. The use of such an approach also helps generate dense and accurate correspondences for either geometric or semantic correspondence tasks. Specifically, UCN implements a fully convolutional architecture with a correspondence contrastive loss for fast training and testing, and proposes a convolutional spatial transformer for local patch normalization. NCN (Rocco et al. 2018) develops an end-to-end trainable CNN architecture based on the classic idea of disambiguating feature matching by using semi-local constraints to find reliable dense correspondences between a pair of images. This framework identifies sets of spatially consistent matches by analyzing the neighboring consensus patterns for a global geometric model. The model can be efficiently trained via weak supervision without any manual annotations of point correspondences. This type of framework can be applied to both category-level and instance-level matching tasks, and other similar methods are presented in Han et al. (2017), Plötz and Roth (2018), Chen et al. (2018), Laskar and Kannala (2018), Kim et al. (2018, 2020), Ufer and Ommer (2017) and Wang et al. (2018).

3.4 3-D Feature Descriptors

Extensive studies on 3-D feature descriptors have been conducted. As previously mentioned, many researchers have turned their attention to the deep learning paradigm due to its revolutionary success in numerous different areas. This fact motivates us to categorize modern descriptors into two groups, i.e. handcrafted and learning-based ones. Guo et al. (2016) presented a comprehensive performance evaluation of conventional handcrafted 3-D feature descriptors, while the learning-based methods were left out. In the following section, we provide a brief introduction of the state-of-the-art handcrafted descriptors and the learning-based ones.

3.4.1 Handcrafted 3-D Descriptors

Guo et al. (2016) divided the handcrafted descriptors into spatial distribution histogram- and geometric attribute histogram-based descriptors, with the former representing the local feature by histograms that encode the spatial distributions of the points in the support region. In general, a local reference frame/axis is constructed for each keypoint. Accordingly, the 3-D support region is partitioned into bins to form a histogram, and the value of each bin is calculated by accumulating the spatial distribution measurements. Some representative works include the spin image (Johnson and Hebert 1999), 3-D shape context (Frome et al. 2004), unique shape context (Tombari et al. 2010a), rotational projection statistics (Guo et al. 2013) and tri-spin-image (Guo et al. 2015). The geometric attribute histogram descriptors, in turn, represent the local features by generating histograms from the statistics of geometric attributes (e.g., normals, curvatures) in the support region. These histograms include the local surface patch (Chen and Bhanu 2007), THRIFT (Flint et al. 2007), point feature histogram (Rusu et al. 2008), fast point feature histogram (Rusu et al. 2009) and signature of histograms of orientations (Tombari et al. 2010b). Apart from the geometric attribute and spatial distribution histogram-based descriptors, Zaharescu et al. (2009) introduced the MeshHoG descriptor, which is analogous to SIFT (Lowe 2004) and uses gradient information to generate a histogram.

The spectral descriptors, such as the global point signature (Rustamov 2007), HKS (Sun et al. 2009) and wave kernel signature (WKS) (Aubry et al. 2011), also make up an important category in this area. These descriptors are obtained from the spectral decomposition of the Laplace-Beltrami operator associated with the shape. The global point signature (Rustamov 2007) utilizes the eigenvalues and eigenfunctions of the Laplace-Beltrami operator on the shape to represent the local feature of points. The HKS (Sun et al. 2009) and WKS (Aubry et al. 2011) are based on the heat diffusion process and the temporal evolution of quantum mechanical particles on the shape, respectively.
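Given the eigendecomposition of the Laplace-Beltrami operator, the heat kernel signature is only a few lines of code. The sketch below assumes the smallest eigenvalues and corresponding eigenfunctions have already been computed by a mesh-processing library, and the per-time-scale normalization is one common but optional choice rather than part of the original definition.

```python
import numpy as np

def heat_kernel_signature(evals, evecs, times):
    """Heat kernel signature HKS(x, t) = sum_i exp(-lambda_i * t) * phi_i(x)^2.

    evals: (k,) smallest Laplace-Beltrami eigenvalues;
    evecs: (n_vertices, k) corresponding eigenfunctions;
    returns an (n_vertices, len(times)) descriptor, one row per vertex.
    """
    evals = np.asarray(evals, dtype=float)
    evecs = np.asarray(evecs, dtype=float)
    hks = np.stack([(np.exp(-evals * t) * evecs ** 2).sum(axis=1) for t in times], axis=1)
    return hks / (hks.sum(axis=0, keepdims=True) + 1e-12)  # optional per-scale normalization
```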
3.4.2 Learning-Based 3D Descriptors

Efforts have also been devoted to generalizing spectral descriptors by using different learning schemes. Litman and Bronstein (2014) generalized the spectral descriptors to a generic family and proposed to learn from examples for obtaining optimized descriptors for a specific task. The learning scheme resembles the spirit of Wiener filter in signal
processing. Rodolà et al. (2014) proposed a learning method that enables the wave kernel descriptor to recognize a broader class of deformations from the example set by using a random forest classifier. Windheuser et al. (2014) proposed a metric learning method to improve the representation of the spectral descriptors. Modern deep learning techniques have also been successfully applied. Masci et al. (2015) made the first attempt and introduced a generalization of the CNN paradigm to non-Euclidean manifolds for shape correspondences. Subsequently, Boscaini et al. proposed to learn descriptors by spectral convolutional networks (Boscaini et al. 2015) and anisotropic CNNs (Boscaini et al. 2016). Monti et al. (2017) proposed a unified framework for generalizing CNN architectures to non-Euclidean domains (graphs and manifolds). Xie et al. (2016) constructed a deep metric network to form a binary spectral shape descriptor for shape characterization; the input is based on the eigenvalue decomposition of the Laplace-Beltrami operator.

In the spatial domain, the differences among the various deep learning methods often lie in the representation of the consumed data. Wei et al. (2016) trained a deep CNN on the depth map representation of shapes to find correspondences. Zeng et al. (2017) proposed to use a 3-D deep CNN for learning a local volumetric patch descriptor. This descriptor consumes a voxel grid of truncated distance function values of the local region. Elbaz et al. (2017) proposed a deep neural network auto-encoder to address the 3-D matching problem. The authors used a random sphere cover set algorithm to detect feature points and projected each local region into a depth map as input to the neural network for producing descriptors. Khoury et al. (2017) parameterized the input by using spherical histograms centered at each point and utilized fully connected networks to generate low-dimensional descriptors. Georgakis et al. (2018) recently employed a Siamese architecture network that processes depth maps. Zhou et al. (2018) proposed to learn from the images of multiple views for the description of 3-D keypoints. Wang et al. (2018b) parameterized the multiscale localized neighborhoods of a keypoint into regular 2-D grids as the input of a triplet-architecture CNN. Deng et al. (2018) first presented an order-free network on the basis of PointNet (Qi et al. 2017a). This network can consume raw point clouds to exploit the full sparsity in the 3-D matching task.

3.5 Summary

As previously mentioned, the image patch descriptor is designed to enable accurate and effective correspondence establishment between detected feature points. The objective is to transform the original image information into a discriminative and stable representation that makes the two matched features as close as possible, while the unmatched ones are far apart. To this end, the descriptors should be easy to compute, with low computation and storage requirements. These descriptors should also maintain their discriminative and invariant properties under serious deformations and changing imaging conditions. In the following, we provide a comprehensive analysis of the handcrafted descriptors and introduce the mechanisms by which the learning-based methods can partly address these challenges and achieve promising performance.

Following the construction procedure of traditional local descriptors, the first step is to extract the low-level information, which can be briefly classified into image gradient and intensity. Specifically, the gradient information can be regarded as a higher-order image cue than the raw intensity. A pooling strategy together with a histogram or statistical scheme is often required to form a float descriptor. This strategy is thus more invariant to geometrical transformations (the pooling and statistical scheme arguably make it more independent of pixel position and geometrical variation). Nevertheless, it requires additional computation for gradient calculation and statistics, as well as for the distance measure of float-type data. LBP-based methods typically have high discriminative ability and good robustness to illumination change and image contrast, and they are frequently used in texture representation and face recognition.

In contrast with the gradient- and/or statistic-based methods, the simple comparison strategy on image intensity sacrifices considerable discrimination and robustness. A classical machine learning technique is therefore often designed to identify the most useful bits. These types of methods typically require a reference orientation estimation to achieve rotation invariance, which appears to be a major error source for most existing methods. However, the use of intensity orders is intrinsically invariant to rotation and intensity changes without any geometrical estimation, and it can achieve promising performance due to the combination of intensity orders with a statistical strategy.

Learning-based methods have largely avoided the requirement of manual experience and prior knowledge. They automatically optimize and obtain the optimal parameters and directly construct the desired descriptor. Traditional learning methods aim to make the generated descriptors superior in terms of efficiency, low storage, and discrimination. However, the image cues used, such as intensity and gradient, are still of low order, and these methods highly rely on the framework of handcrafted methods. Nevertheless, the objective functions, training skills, and datasets that appeared at that time are significant and useful for designing better learning-based methods. Thus, the emergence of deep learning has further advanced the procedure begun with traditional learning.

Several skills can help improve the discriminability and robustness of deep descriptors. On the one hand, the central-surround and triplet (or even higher-order) structures may provide substantial information to learn from. The hard negative sample mining strategy makes the network focus
on hard samples (though it may result in overfitting as well) and can thus achieve better matching performance. More reliable loss functions should also be designed according to the basic and intrinsic properties of the description task. For instance, the recently designed triplet, ranking, contrastive, and global losses are superior to the early simple hinge and cross-entropy losses. On the other hand, valid and comprehensive ground-truth datasets are also required for better matching performance and generalization ability. Training a descriptor together with a detector in the complete matching pipeline in an end-to-end manner has also drawn great attention at present. This can jointly optimize the detector and descriptor and thus achieve encouraging performance, and unsupervised training in this setting can work without any labeled ground-truth patch data. By using deep techniques, current descriptors can achieve significant matching performance across image pairs with appearance variations, such as illumination and day-night changes. However, these descriptors still suffer from serious geometrical deformations, such as large rotations or low-overlap image pairs. The low generalization ability to new types of data is another limitation.

The overall performance of a descriptor also depends on an appropriate detector. Different combinations of detectors and descriptors may result in varied matching performance. For this reason, the descriptors should be chosen according to the specific task and the type of image data. The advanced descriptors using deep learning have shown great potential.

4 Matching Methods

The matching task aims to establish correct image pixel or point correspondences between two images with or without using feature detection and/or description. This task plays a significant role in the entire image matching pipeline. Different definitions of the matching task are introduced for specific applications and scenarios and may show their own strengths.

4.1 Overview of Matching Methods

Over the past decades in the image matching community, existing methods can be roughly classified into two categories, namely area-based and feature-based (Zitova and Flusser 2003; Litjens et al. 2017). Area-based methods typically refer to dense matching, also known as image registration, which usually does not detect features. In feature-based methods, once the feature points and their local descriptors are extracted from the image pairs, the image matching task can be converted into matching them in indirect or direct ways, which correspond to the use and non-use of the local image descriptors.

Direct feature matching aims to establish the correspondences between two given feature sets by directly using spatial geometrical relations and optimization methods, and it can be roughly classified into graph matching and point set registration. In comparison, indirect feature matching methods typically cast the matching task into a two-stage problem. Such a task commonly starts with establishing preliminary correspondences through the similarity of descriptors, with distances judged in the measuring space. Thereafter, the false matches are removed from the putative match sets by using extra local and/or global geometrical constraints. Dense matching from sparse feature correspondences often requires a post-process of transform model estimation, followed by image resampling and interpolation (warping).

We will separate the learning-based methods from the area- and feature-based methods and introduce them in a dedicated subsection. From the aspect of input data, learning from images and learning from point data are the two main forms of learning-based matching. These methods can achieve better performance in some scenarios compared to the traditional ones. The matching task in 3-D cases is also briefly introduced in this section.

4.2 Area-Based Matching

Area-based methods aim for image registration and establish dense pixel correspondences by directly using the pixel intensities of the entire image. A similarity metric together with an optimization method is needed for geometrical transformation estimation and common-area alignment by minimizing the overall dissimilarity between the target and the warped moving image. Consequently, several handcrafted similarity metrics are frequently used, including correlation-like, domain transformation, and mutual information (MI) methods. Optimization methods and transformation models are also required to perform the final registration task (Zitova and Flusser 2003).

In the image registration community, correlation-like methods, which are regarded as a classical representative of area-based methods, correspond two images by maximizing the similarities of two sliding windows (Zitova and Flusser 2003; Li et al. 2015). For example, the maximum correlation of wavelet features has been developed for automatic registration (Le Moigne et al. 2002). However, this type of method may greatly suffer from serious image deformations (it can only be successfully applied when slight rotation and scaling are present), from windows containing a smooth area without any prominent details, and from a huge computational burden.

Domain-transform methods tend to align two images on the basis of converting the original images into another domain, such as phase correlation based on the Fourier shift theorem (Reddy and Chatterji 1996; Liu et al. 2005; Chen et al.
1994; Takita et al. 2003; Foroosh et al. 2002), and Walsh transform-based methods (Lazaridis and Petrou 2006; Pan et al. 2008). Such methods are robust against correlated and frequency-dependent noise and non-uniform, time-varying illumination disturbances. Nevertheless, these methods have limitations in the case of image pairs with significantly different spectral contents or a small overlap area.
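The Fourier-shift idea behind phase correlation can be sketched in a few lines. The version below only recovers an integer translation between two images of equal size (extensions with log-polar resampling handle rotation and scale), and the sign of the recovered shift depends on which image is treated as the reference.

```python
import numpy as np

def phase_correlation(img_ref, img_mov):
    """Estimate the translation between two same-sized images via the Fourier shift theorem.

    The normalized cross-power spectrum of two translated images is a pure phase
    term whose inverse FFT is (ideally) a delta peak at the displacement.
    """
    F1 = np.fft.fft2(img_ref)
    F2 = np.fft.fft2(img_mov)
    cross_power = F1 * np.conj(F2)
    cross_power /= np.abs(cross_power) + 1e-12
    corr = np.fft.ifft2(cross_power).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap large indices to negative shifts.
    if dy > img_ref.shape[0] // 2:
        dy -= img_ref.shape[0]
    if dx > img_ref.shape[1] // 2:
        dx -= img_ref.shape[1]
    return dy, dx
```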
Based on information theory, MI, used for example in non-rigid image registration together with B-splines (Klein et al. 2007) and as conditional MI (Loeckx et al. 2009), is a measure of the statistical dependency between two images and works with the entire image (Maes et al. 1997). Thus, MI is particularly suitable for the registration of multi-modal images (Chen et al. 2003a, b; Johnson et al. 2001). Recently, Cao et al. (2020) proposed a structure consistency boosting transform to enhance the structural similarity in multi-spectral and multi-modal image registration problems, thus avoiding spectral information distortion. However, MI exhibits difficulty in determining the global maximum over the entire search space, which inevitably reduces its robustness. Moreover, the optimization methods (e.g., continuous optimization, discrete optimization, and their hybrid forms) and transformation models (e.g., rigid, affine, thin-plate spline (TPS), elastic body, and diffusion models) are considered sufficiently mature. Please refer to Zitova and Flusser (2003), Dawn et al. (2010), Sotiras et al. (2013) and Ferrante and Paragios (2017) for representative literature and further details.
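As an illustration of how such a metric is evaluated, the sketch below estimates MI from the joint gray-level histogram of two images of equal size; the number of bins is an illustrative choice, and a registration method would maximize this score over the parameters of the transformation applied to the moving image.

```python
import numpy as np

def mutual_information(img1, img2, bins=32):
    """Mutual information between two (already aligned or warped) images."""
    joint, _, _ = np.histogram2d(img1.ravel(), img2.ravel(), bins=bins)
    pxy = joint / joint.sum()                      # joint intensity distribution
    px = pxy.sum(axis=1, keepdims=True)            # marginal of image 1
    py = pxy.sum(axis=0, keepdims=True)            # marginal of image 2
    nonzero = pxy > 0
    return float((pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])).sum())
```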
The area-based methods are acceptable for medical or remote sensing image registration, where many feature-based methods are no longer workable because the images often contain few textural details and a large variance of image appearance due to the different imaging sensors. However, the area-based methods may greatly suffer from serious geometrical transformations and local deformations. Deep learning has also proven its efficacy in this area: the early learning-based methods are usually employed as a direct extension of the classical registration framework, whereas later ones use a reinforcement learning paradigm to iteratively estimate the transformation, or even directly estimate the deformation field in an end-to-end manner. Area-based matching with learning strategies will be reviewed in the part on learning-based matching.

4.3 Graph Matching Methods

Given the feature points extracted from an image, we can construct a graph by associating each feature point with a node and specifying edges. This representation naturally provides convenience for investigating the intrinsic structure of image data, especially for the matching problem. By this definition, graph matching (GM) refers to the establishment of node-to-node correspondences between two or multiple graphs. Owing to its importance and fundamental challenge, GM has been a long-standing research area over the decades and is still of great interest to researchers. From the problem setting perspective, GM can be divided into two categories, namely, exact and inexact matching. Exact matching methods consider GM to be a special case of the graph or subgraph isomorphism problem. It aims to find the bijection of two binary (sub)graphs; consequently, all edges are strictly preserved (Babai 2018; Cook and Holder 2006; Levi 1973). In fact, this requirement is too strict for real-world tasks like computer vision. Hence researchers often resort to inexact matching with weighted attributes on nodes and edges. Such an approach enjoys good flexibility and utility in practice. Therefore, we primarily concentrate on the review of inexact matching methods in this survey.

To some extent, GM possesses a simple yet general formulation of the feature matching problem, which encodes the geometrical cues into node affinities (first-order relations) and edge affinities (second-order relations) to deduce the true correspondences between two graphs. Aside from the geometrical cues, high-level information about the feature points can also be incorporated in GM (e.g. descriptor similarities as node affinities); this information only serves as a supplementary cue and is not strictly required. In its general and recent form, GM can be formulated as a Quadratic Assignment Problem (QAP) (Loiola et al. 2007). Although different forms exist in the literature, the main body of research has focused on Lawler's QAP (Lawler 1963). Given two graphs $G_1=(V_1,E_1)$ and $G_2=(V_2,E_2)$, where $|V_1|=n_1$ and $|V_2|=n_2$, each node $v_i\in V_1$ or $v_j\in V_2$ represents a feature point, and each edge $e_i\in E_1$ or $e_j\in E_2$ is defined over a pair of nodes. Without loss of generality, we assume $n_1\ge n_2$; Lawler's QAP formulation of GM can then be written as:

$$\max\ J(\mathbf{X})=\mathrm{vec}(\mathbf{X})^\top\mathbf{K}\,\mathrm{vec}(\mathbf{X}),\quad \text{s.t.}\ \mathbf{X}\in\{0,1\}^{n_1\times n_2},\ \mathbf{X}\mathbf{1}_{n_2}\le\mathbf{1}_{n_1},\ \mathbf{X}^\top\mathbf{1}_{n_1}=\mathbf{1}_{n_2}, \qquad (1)$$

where $\mathbf{X}$ denotes the permutation matrix, i.e. $\mathbf{X}_{ij}=1$ indicates that node $v_i\in V_1$ corresponds to node $v_j\in V_2$ and $\mathbf{X}_{ij}=0$ otherwise, $\mathrm{vec}(\mathbf{X})$ denotes the column-wise vectorization of $\mathbf{X}$, and $\mathbf{1}_{n_1}$ and $\mathbf{1}_{n_2}$ respectively denote column vectors of all ones. $\mathbf{K}$ denotes the affinity matrix, whose diagonal and off-diagonal entries encode the first-order (node-to-node) and second-order (edge-to-edge) affinities between the two graphs. No universal approach exists to construct the affinity matrix; however, a simple strategy is to use the similarities of feature descriptors [e.g. shape context (Belongie et al. 2001)] and the differences of edge lengths to determine the node and edge affinities.

Koopmans–Beckmann's QAP is another popular formulation. It differs from Lawler's QAP and is expressed as:

$$J(\mathbf{X})=\mathrm{tr}(\mathbf{K}_p^\top\mathbf{X})+\mathrm{tr}(\mathbf{A}_1\mathbf{X}\mathbf{A}_2\mathbf{X}^\top), \qquad (2)$$
where $\mathbf{A}_1$ and $\mathbf{A}_2$ are the weighted adjacency matrices of the two graphs, respectively, and $\mathbf{K}_p$ is the node affinity matrix. In Zhou and De la Torre (2015), the relation between Koopmans–Beckmann's and Lawler's QAP has been investigated, which reveals that Koopmans–Beckmann's QAP can be regarded as a special case of Lawler's.

The GM problem is translated into finding the optimal one-to-one correspondences $\mathbf{X}$ that maximize the overall affinity score $J(\mathbf{X})$. As a combinatorial QAP in general, GM is known to be NP-hard. Most methods relax the stringent constraints and provide approximate solutions at an affordable overhead. In this regard, many relaxation strategies have been introduced in the literature, leading to a variety of GM solvers. In the following, we briefly review the influential ones through the development course of GM.

4.3.1 Spectral Relaxations

The first group of methods follows a strategy of spectral relaxation. Leordeanu and Hebert (2005) proposed to replace the one-to-one mapping constraint and the binary constraint by constraining $\|\mathrm{vec}(\mathbf{X})\|_2^2=1$. In this case, the solution $\mathbf{X}$ can be obtained by solving an eigenvector problem. Each element in $\mathbf{X}$ is interpreted as the association of one correspondence with the optimal cluster (the true correspondences). A discretization strategy is used to enforce the mapping constraints. The idea was later improved by Cour et al. (2007), who explicitly considered enforcing the one-to-one mapping constraint to achieve a tighter relaxation. This method can also be solved in closed form as an eigenvector problem. Liu and Yan (2010) proposed to detect multiple visual patterns by using an $\ell_1$-norm-based spectral relaxation technique, i.e. constraining $\|\mathrm{vec}(\mathbf{X})\|_1=1$. The solution can be efficiently obtained by the replicator equation from evolutionary game theory. Jiang et al. (2014) presented a non-negative matrix factorization technique, which extends the constraint to $\|\mathrm{vec}(\mathbf{X})\|_p=1$, $p\in[1,2]$. Meanwhile, Egozi et al. (2012) presented a fairly different approach. In their work, they provided a probabilistic interpretation of spectral matching schemes and derived a novel probabilistic matching scheme wherein the affinity matrix is also updated during the iteration process. With Koopmans–Beckmann's QAP formulation, the spectral methods (Umeyama 1988; Scott and Longuet-Higgins 1991; Shapiro and Brady 1992; Caelli and Kosinov 2004) relax $\mathbf{X}$ to be orthogonal, i.e. $\mathbf{X}^\top\mathbf{X}=\mathbf{I}$. This formulation can be solved in closed form as an eigenvalue problem. These methods possess the merit of efficiency due to the loose relaxation; however, their accuracy is not advantageous in general.
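A compact sketch of this spectral matching idea applied to Eq. (1) is given below: a power iteration approximates the leading eigenvector of the affinity matrix, and a greedy pass enforces the one-to-one constraint. It assumes a non-negative affinity matrix with candidate matches indexed column-wise, and the greedy discretization is a simplification of the procedures used in the cited works.

```python
import numpy as np

def spectral_matching(K, n1, n2, n_iters=100):
    """Spectral relaxation of Eq. (1): leading eigenvector of K + greedy discretization.

    K is the (n1*n2) x (n1*n2) non-negative affinity matrix with candidate matches
    ordered column-wise, i.e. candidate (i, j) maps to index j * n1 + i.
    """
    x = np.ones(n1 * n2) / np.sqrt(n1 * n2)
    for _ in range(n_iters):                      # power iteration for the leading eigenvector
        x = K @ x
        x /= np.linalg.norm(x) + 1e-12
    score = x.reshape(n2, n1).T                   # score[i, j] for matching node i to node j
    X = np.zeros((n1, n2), dtype=int)
    while score.max() > 0:                        # greedy one-to-one discretization
        i, j = np.unravel_index(np.argmax(score), score.shape)
        X[i, j] = 1
        score[i, :] = -1
        score[:, j] = -1
    return X
```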
4.3.2 Convex Relaxations

Many studies have turned to investigating convex relaxations of the original problem to obtain theoretical advantages in solving the non-convex QAP. Strong convex relaxations can be obtained by lifting methods that add auxiliary variables representing quadratic monomials in the original variables, which enables additional convex constraints on the lifted variables. Semi-definite programming (SDP) is a general tool for combinatorial problems and has been applied to solving GM (Schellewald and Schnörr 2005; Torr 2003; Zhao et al. 1998; Kezurer et al. 2015). The SDP relaxation is quite tight and allows finding a strong approximation in polynomial time. However, the high computational cost prohibits its scalability. Some other lifting methods with linear programming (LP) relaxations have also been developed (Almohamad and Duffuaa 1993; Adams and Johnson 1994). The dual problem of the LP relaxations has recently been extensively considered to solve GM (Swoboda et al. 2017; Chen and Koltun 2015; Torresani et al. 2012; Zhang et al. 2016), which has a strong link with MAP inference algorithms.

4.3.3 Convex-to-Concave Relaxations

One useful strategy is to utilize the path-following technique. This approach gradually traverses a convex-to-concave sequence of relaxations of the original problem to finally find a good solution with the constraints satisfied. The computational complexity is also much lower than that of the lifting methods. Zaslavskiy et al. (2009) adopted this strategy for the GM problem with Koopmans–Beckmann's QAP formulation, which was later extended to directed graphs (Liu et al. 2012b) and partial matching (Liu and Qiao 2014). Zhou and De la Torre (2015) presented a unified framework of GM based on the factorization of the affinity matrix in Lawler's QAP. Such a framework effectively reduces the computational complexity and reveals the relation between Koopmans–Beckmann's and Lawler's QAPs. The (advanced) doubly stochastic (DS) relaxation methods improve upon these approaches by identifying tighter formulations (Fogel et al. 2013; Dym et al. 2017; Bernard et al. 2018), where the tightness of the spectral, SDP, and DS relaxations is discussed and theoretically verified.
4.3.4 Continuous Relaxations

A large volume of GM methods has focused on devising accurate or efficient algorithms to solve the QAP approximately, albeit with no global optimality guarantee. In most cases, $\mathbf{X}$ is simply relaxed to be continuous, as a DS matrix. Gold and Rangarajan (1996) proposed a graduated assignment algorithm, which performs gradient ascent on the relaxed problem under an annealing schedule. The convergence of this method has been revisited and improved by Tian et al. (2012) with a soft-constrained mechanism. van Wyk and van Wyk (2004) proposed to enforce the one-to-one mapping constraint by successively projecting onto the convex set of the desired integer constraints. Leordeanu et al. (2009) proposed an efficient algorithm that optimizes in the (quasi-)discrete domain by solving a sequence of linear assignment problems. Many well-known optimization techniques, such as ADMM (Lê-Huu and Paragios 2017), tabu search (Adamczewski et al. 2015) and the multiplicative update algorithm (Jiang et al. 2017a), have also been tested. Recent studies also include Jiang et al. (2017b) and Yu et al. (2018), which introduce new schemes to asymptotically approximate the original QAP, and Maron and Lipman (2018), which presents a new (probably) concave relaxation technique. Yu et al. (2020b) introduced a determinant regularization technique together with gradient-based optimization to relax this problem into the continuous domain.

4.3.5 Multi-graph Matching

In contrast to the classic two-graph matching setting, jointly matching a batch of graphs with consistent correspondences, i.e. multi-graph matching, has recently drawn increasing attention due to its methodological advantage and potential to incorporate cross-graph information. Arguably, one central issue of multi-graph matching lies in the enforcement of cycle-consistency for a feasible solution. In general, this concept refers to the fact that the bijective correspondence between two graphs shall be consistent with one derived through an intermediate graph. Put more concretely, for any pair of graphs $G_a$ and $G_b$ with their node correspondence matrix $\mathbf{X}_{ab}$, let $G_c$ be an intermediate graph; the cycle consistency constraint then enforces $\mathbf{X}_{ac}\mathbf{X}_{cb}=\mathbf{X}_{ab}$, where $\mathbf{X}_{ac}$ and $\mathbf{X}_{cb}$ are the matching solutions between $G_a$ and $G_c$ and between $G_c$ and $G_b$, respectively.
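This constraint is straightforward to verify. The sketch below measures how strongly a triplet of pairwise matching matrices violates it, which is the kind of quantity the methods reviewed next either enforce exactly or penalize.

```python
import numpy as np

def cycle_consistency_error(X_ac, X_cb, X_ab):
    """Fraction of correspondences violating X_ac @ X_cb == X_ab.

    All inputs are binary correspondence matrices; the returned value is zero
    for a perfectly cycle-consistent triplet of graphs.
    """
    composed = (X_ac @ X_cb > 0).astype(int)
    return float(np.abs(composed - X_ab).sum()) / max(int(X_ab.sum()), 1)
```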
Existing multi-graph matching methods can be roughly grouped into three lines of work. For the methods falling into the first group, the multi-graph matching problem is solved by an iterative procedure that computes a number of two-graph matching tasks (Yan et al. 2013, 2014, 2015a, b; Jiang et al. 2020b). In each iteration, a two-graph matching solution is computed to locally maximize the affinity score, which can leverage off-the-shelf pairwise matching solvers; in Jiang et al. (2020b), for example, both an offline batch mode and an online setting are considered to explore the concept of cycle-consistency over pairwise matching. Another body of work takes the initial (noisy) pairwise matching results as input and aims to recover a globally consistent pairwise matching set (Kim et al. 2012; Pachauri et al. 2013; Huang and Guibas 2013; Chen et al. 2014; Zhou et al. 2015; Wang et al. 2018; Hu et al. 2018). In these methods, matching over all graphs is jointly and equally considered to form a bulk matrix that includes all pairwise matchings, and the intrinsic structure of this matrix induced by the matching problem, such as cycle-consistency, is investigated. The last group utilizes clustering or low-rank recovery techniques to solve multi-graph matching, which provides a new perspective in the feature space for the problem (Zeng et al. 2012; Yan et al. 2015c, 2016a; Tron et al. 2017). More recently, the multi-graph matching problem has been considered in an optimization framework with a theoretically well-grounded convex relaxation (Swoboda et al. 2019), or with projected power iterations to search for a feasible solution (Bernard et al. 2019).

4.3.6 Other Paradigms

Although the QAP formulation is prevalent in GM, it is not the only way to formulate the problem. Numerous methods deal with GM from different perspectives or paradigms and also form an important category in this field.

Cho et al. (2010) provided a random walk view of GM and devised a technique to obtain the solution by simulating random walks on the association graph. Lee et al. (2010) and Suh et al. (2012) introduced Monte Carlo methods to improve the matching robustness. Cho and Lee (2012) further devised a progressive GM method, which combines the progression of graphs with the matching of graphs to reduce the computational complexity. Wang et al. (2018a) proposed to use a functional representation of graphs and conduct matching by minimizing the discrepancy between the original and the transformed graphs. Subsequently, in order to suppress the matching of outliers, Wang et al. (2020) assigned zero-valued vectors to the potential outliers in the obtained optimal correspondence matrix. The affinity matrix plays a key role in the GM problem; however, a handcrafted $\mathbf{K}$ is vulnerable to scale and rotation differences. To this end, unsupervised (Leordeanu et al. 2012) and supervised (Caetano et al. 2009) methods have been devised to learn $\mathbf{K}$. Zanfir and Sminchisescu (2018) recently addressed this issue with an end-to-end deep learning scheme. Wang et al. (2020) introduced a fully trainable framework for graph matching, in which a graph network block module is utilized and the learning of node/edge affinities and the solving of the combinatorial optimization are considered simultaneously.
The extension of GM to a high-order formulation is a natural way to improve robustness by further exploiting the geometrical cues. This leads to a tensor-based objective (Lee et al. 2011), also called hypergraph matching:

$$J_H(\mathbf{X})=\mathbf{H}\otimes_1\mathbf{x}\otimes_2\mathbf{x}\cdots\otimes_m\mathbf{x}, \qquad (3)$$

where $m$ is the order of the affinities, $\mathbf{H}$ denotes the $m$-th order tensor encoding the affinities between hyperedges of the graphs, $\otimes_k$ is the tensor product, and $\mathbf{x}=\mathrm{vec}(\mathbf{X})$. Representative studies on hypergraph matching include Zass and Shashua (2008), Chertok and Keller (2010), Lee et al. (2011), Chang and Kimia (2011), Duchenne et al. (2011) and Yan et al. (2015d).

4.4 Point Set Registration Methods

Point set registration (PSR) aims to estimate the spatial transformation that optimally aligns two point sets. In feature matching, different formulations are adopted in PSR and GM. For two point sets, GM methods determine the alignment by maximizing the overall affinity score of unary and pairwise correspondences. By contrast, PSR methods determine the underlying global transformation. Given two point sets $\{\mathbf{x}_i\}_{i=1}^{n_1}$ and $\{\mathbf{y}_j\}_{j=1}^{n_2}$, the general conventional objective can be expressed as

$$\min\ J(\mathbf{P},\boldsymbol{\theta})=\sum_{i,j}p_{ij}\,\|\mathbf{y}_j-T(\mathbf{x}_i,\boldsymbol{\theta})\|_2^2+g(\mathbf{P}),\quad \text{s.t.}\ \boldsymbol{\theta}\in\Theta,\ \mathbf{P}\in\{0,1\}^{n_1\times n_2},\ \mathbf{P}\mathbf{1}_{n_2}\le\mathbf{1}_{n_1},\ \mathbf{P}^\top\mathbf{1}_{n_1}\le\mathbf{1}_{n_2}, \qquad (4)$$

where $\boldsymbol{\theta}$ denotes the parameters of the predefined transformation $T$ and $\Theta$ its feasible parameter set, and $\mathbf{P}$ is the correspondence matrix. The regularization term $g(\mathbf{P})$ avoids trivial solutions such as $\mathbf{P}=\mathbf{0}$. Compared to GM, this model only represents the general principles and does not necessarily cover all the algorithms for PSR. For example, a probabilistic interpretation or a density-based objective can be used, and the constraints on $\mathbf{P}$ may be only partially imposed during optimization, all of which differ from the above formulation.

PSR poses a stronger assumption on the data, that is, the existence of a global transformation between the point sets, which is the key feature that differentiates it from GM. Although the generality is restricted, this assumption leads to low computational complexity because of the few parameters needed for global transformation models. Increasingly sophisticated transformation models have been developed, from rigid to non-rigid ones, in order to enhance the generalization ability. Various schemes have also been proposed to improve robustness against degradations such as noise, outliers, and missing points.

4.4.1 ICP and Its Variants

PSR has been an important research topic for the last few decades in computer vision, and the iterative closest point (ICP) algorithm is a popular method (Besl and McKay 1992). ICP iteratively alternates between hard assignment of correspondences to the closest points in the two point sets and closed-form rigid transformation estimation until convergence. The ICP algorithm is widely used as a baseline due to its simplicity and low computational complexity. However, a good initialization is required because ICP is prone to be trapped in local optima. Numerous studies in the PSR field, such as EM-ICP (Granger and Pennec 2002), LM-ICP (Fitzgibbon 2003), and TriICP (Chetverikov et al. 2005), have been proposed to improve ICP. The reader is referred to a recent survey (Pomerleau et al. 2013) for a detailed discussion of ICP's variants. The robust point matching (RPM) algorithm (Gold et al. 1998) was proposed to overcome the ICP limitations; a soft assignment and a deterministic annealing strategy are adopted, and the rigid transformation model is generalized to a non-rigid one by using the thin-plate spline [TPS-RPM (Chui and Rangarajan 2003)].
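A minimal rigid ICP loop is easy to write down. The sketch below uses a brute-force nearest-neighbor search and the closed-form Kabsch/Procrustes update, and it assumes a reasonable initial pose and comparable overlap between the two sets, which are exactly the conditions under which ICP behaves well.

```python
import numpy as np

def icp(source, target, n_iters=50):
    """Minimal rigid ICP: closest-point assignment + closed-form Procrustes update.

    source: (n, d), target: (m, d). Returns R, t such that source @ R.T + t
    is aligned to target.
    """
    R = np.eye(source.shape[1])
    t = np.zeros(source.shape[1])
    for _ in range(n_iters):
        moved = source @ R.T + t
        # Hard assignment: nearest target point for every transformed source point.
        d2 = ((moved[:, None, :] - target[None, :, :]) ** 2).sum(-1)
        matched = target[d2.argmin(axis=1)]
        # Closed-form rigid update (Kabsch), fitted to the current correspondences.
        mu_s, mu_m = source.mean(0), matched.mean(0)
        U, _, Vt = np.linalg.svd((source - mu_s).T @ (matched - mu_m))
        D = np.eye(len(U))
        D[-1, -1] = np.sign(np.linalg.det(Vt.T @ U.T))  # avoid reflections
        R = Vt.T @ D @ U.T
        t = mu_m - R @ mu_s
    return R, t
```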
4.4.2 EM-Based Methods

RPM is also a representative of the EM-like PSR methods, which form an important category in this field. The EM-like methods formulate PSR as the optimization of either a weighted squared loss function or the log-likelihood of a Gaussian mixture model (GMM), and a local optimum is searched for through EM or EM-like algorithms. The posterior probability of each correspondence is computed in the E-step, and the transformation is refined in the M-step. Sofka et al. (2007) investigated the modeling of uncertainty in the registration process and presented a covariance-driven correspondence method in an EM-like framework. Myronenko and Song (2010) proposed the well-known coherent point drift (CPD) method, in which a probabilistic framework is established on the basis of a GMM; here, the EM algorithm is utilized for maximum likelihood estimation of the parameters. Horaud et al. (2011) developed an expectation conditional maximization-based probabilistic method, which allows the use of anisotropic covariances for the mixture model components and improves over the isotropic covariance case. Ma et al. (2016b) and Zhang et al. (2017a) exploited the unification of local and global features in the GMM-based probabilistic framework. Lawin et al. (2018) presented a density-adaptive PSR method that models the underlying structure of the scene as a latent probability distribution.
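To make the E-step concrete, the sketch below computes CPD-style soft correspondence posteriors between a data set and the current GMM centroids. The outlier weight and the isotropic-variance model follow the spirit of Myronenko and Song (2010), while the M-step (updating the transformation and the variance) is omitted for brevity.

```python
import numpy as np

def correspondence_posteriors(X, Y, sigma2, w=0.1):
    """E-step of a CPD-style GMM formulation: soft correspondence probabilities.

    Y: (M, D) GMM centroids (the moving set after the current transformation);
    X: (N, D) data points; sigma2: isotropic variance; w: assumed outlier ratio.
    P[m, n] is the posterior that data point x_n was generated by centroid y_m.
    """
    M, D = Y.shape
    N = X.shape[0]
    d2 = ((X[None, :, :] - Y[:, None, :]) ** 2).sum(-1)              # (M, N)
    num = np.exp(-d2 / (2.0 * sigma2))
    uniform = (w / (1.0 - w)) * M * (2.0 * np.pi * sigma2) ** (D / 2.0) / N
    return num / (num.sum(axis=0, keepdims=True) + uniform)
```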
4.4.3 Density-Based Methods

Density-based methods introduce generative models to the PSR problem, in which no explicit point correspondence is established. Each point set is represented by a density function, such as GMM. Registration is achieved by the
minimization of a statistical discrepancy measure between the two density functions. Tsin and Kanade (2004) were the first to propose such a method: they used kernel density functions to model the point sets, and the discrepancy measure is defined as a kernel correlation. Meanwhile, Glaunes et al. (2004) represented the point sets by using relaxed Dirac delta functions. They then determined the optimal diffeomorphic transformation that minimizes the distance between the two distributions. Jian and Vemuri (2011) extended this approach by using a GMM-based representation and minimizing the L2 error between the densities. The authors also provided a unified framework for density-based PSR; many popular methods, including Myronenko and Song (2010) and Tsin and Kanade (2004), can be regarded as special cases in theory. Campbell and Petersson (2015) proposed to use a support vector-parameterized GMM for adaptive data representation. This approach can improve the robustness of density-based methods to noise, outliers, and occlusions. Recently, Liao et al. (2020) utilized fuzzy clusters to represent a scanned point set and then registered two point sets by minimizing a fuzzy weighted sum of distances between their fuzzy cluster centers.

4.4.4 Optimization-Based Methods

A group of optimization-based methods has been proposed to provide globally optimal solutions and alleviate the local optimum issue. These methods generally search in a limited transformation space, such as rotation, translation, and scaling, to save time. Stochastic optimization techniques, including genetic algorithms (Silva et al. 2005; Robertson and Fisher 2002), particle swarm optimization (Li et al. 2009), particle filtering (Sandhu et al. 2010) and simulated annealing schemes (Papazov and Burschka 2011; Blais and Levine 1995), are widely used, but no convergence is guaranteed. Meanwhile, branch and bound (BnB) is a well-established optimization technique that can efficiently search for the globally optimal solution in the transformation space and forms the theoretical basis of many optimization-based methods, including Li and Hartley (2007), Parra Bustos et al. (2014), Campbell and Petersson (2016), Yang et al. (2016) and Liu et al. (2018b). In addition to these methods, Maron et al. (2016) introduced a semidefinite programming (SDP) relaxation-based method, in which a global solution is guaranteed for isometric shape matching. Lian et al. (2017) formulated PSR as a concave QAP by eliminating the rigid transformation variables, and BnB is utilized to achieve a globally optimal solution. Yao et al. (2020) presented a formulation for robust non-rigid PSR based on a globally smooth robust estimator for data fitting and regularization, which is optimized by a majorization-minimization algorithm that reduces each iteration to solving a simple least-squares problem. Another method, in Iglesias et al. (2020), presents a study of global optimality conditions for PSR with missing data. This method applies Lagrangian duality to generate a candidate solution for the primal problem and thus enables obtaining the corresponding dual variable in closed form.

4.4.5 Miscellaneous Methods

Apart from the commonly used rigid model and the non-rigid transformation models based on TPS (Chui and Rangarajan 2003) or Gaussian radial basis functions (Myronenko and Song 2010), additional complex deformations are also considered in the literature. These models include simple articulated extensions, such as Horaud et al. (2011) and Gao and Tedrake (2019). A smooth locally affine model is introduced as the transformation model and developed under the ICP framework in non-rigid ICP (Amberg et al. 2007), which is also adopted in Li et al. (2008). However, this model should be used in conjunction with sparse hand-selected feature correspondences as it allows many degrees of freedom. A different linear skinning model, which does not require user involvement in the registration process, has been proposed and applied in another work (Chang and Zwicker 2009).

Another line of PSR methods introduces shape descriptors into the registration process. Local shape descriptors, such as spin images (Johnson and Hebert 1999), shape contexts (Belongie et al. 2001), integral volume (Gelfand et al. 2005) and point feature histograms (Rusu et al. 2009), are generated, and sparse feature correspondences are established by a similarity constraint on the descriptors. Subsequently, the underlying rigid transformation can be estimated using random sample consensus (RANSAC) (Fischler and Bolles 1981) or BnB search (Bazin et al. 2012). Ma et al. (2013b) proposed a robust algorithm based on the L2E estimator for the non-rigid case.

Some new schemes for PSR based on different observations have emerged. Golyanik et al. (2016) modeled point sets as particles with gravity as an attractive force, and registration is accomplished by solving the differential equations of Newtonian mechanics. Ma et al. (2015a) and Wang et al. (2016) proposed the use of context-aware Gaussian fields to address the PSR problem. Vongkulbhisal et al. (2017, 2018) proposed the discriminative optimization method. This approach learns the search direction from training data to guide the optimization without the need to define cost functions. Danelljan et al. (2016) and Park et al. (2017) considered the color information of point sets, whereas Evangelidis and Horaud (2018) and Giraldo et al. (2017) addressed the problem of jointly registering multiple point sets.

4.5 Descriptor Matching with Mismatch Removal

Descriptor matching followed by mismatch removal, also called indirect image matching, casts the matching task into a two-stage problem. This method commonly starts with
establishing preliminary correspondences through the similarity of local image descriptors, with the distances judged in the measuring space. Several common strategies, including fixed threshold (FT), nearest neighbor (NN, also called brute-force matching), mutual NN (MNN), and NN distance ratio (NNDR), are available for the construction of putative match sets. Thereafter, the false matches are removed from the putative match sets by using extra local and/or global geometrical constraints. We briefly divide the mismatch removal methods into resampling-based, non-parametric model-based, and relaxed methods. In the following sections, we introduce these methods in detail and provide a comprehensive analysis.

4.5.1 Putative Match Set Construction

Suppose that we have detected and extracted M and N local features to be matched from the two considered images I1 and I2. The descriptor matching stage operates by computing the pairwise distance matrix with M × N entries and then selecting the potential true matches through the aforementioned rules.

The FT strategy keeps the matches whose distances fall below a fixed threshold. However, this strategy can be sensitive and may incur numerous one-to-many matchings, in contrast to the one-to-one correspondence nature, which results in poor performance in the feature matching task. The NN strategy can effectively deal with the data sensitivity problem and recall more potential true matches. Such a strategy has been applied in various descriptor matching methods, but it cannot avoid the one-to-many cases. In mutual NN descriptor matching, each feature in I1 looks for its NN in I2 (and vice versa), and the feature pairs that are mutual NNs become candidate matches in the putative match set. This type of strategy can obtain a high ratio of correct matches but may sacrifice many other true correspondences. The NNDR assumes that the distance difference between the first and second NN is significant; hence, the use of the distance ratio with a predefined threshold yields robust and promising matching performance without sacrificing many true matches. However, NNDR relies on a stable distance distribution of the descriptors; even though the method is widely used and performs well in SIFT-like descriptor matching, it is no longer applicable for descriptors of other types, such as binary or some learning-based descriptors (Rublee et al. 2011; Ono et al. 2018).

The optimal choice among these strategies for descriptor matching should rely on the property of the descriptor and the specific application. For example, MNN is stricter than the others, with a high inlier ratio, but it may sacrifice many other potential true matches. By contrast, NN and NNDR tend to be more general in the feature matching task with relatively better performance. Mikolajczyk and Schmid (2005) presented a simple test of these candidate match selection strategies. Although various approaches are available for putative feature correspondence construction, the use of only local appearance information and simple similarity-based putative match selection strategies will unavoidably result in a large number of incorrect matches, particularly when images undergo serious non-rigid deformation, extreme viewpoint changes, low quality, and/or repeated contents. Therefore, a robust, accurate, and efficient mismatch elimination method is urgently required in the second stage to preserve as many true matches as possible while keeping the mismatches to a minimum by using additional geometrical constraints.
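The following Python sketch illustrates how the above selection strategies can be combined in practice; it is a toy implementation operating on raw descriptor arrays, and the 0.8 ratio threshold and the mutual-NN option are illustrative assumptions rather than recommendations.

```python
"""Illustrative NumPy sketch of putative match construction combining the
mutual nearest neighbour (MNN) check with the distance-ratio (NNDR) test."""
import numpy as np

def putative_matches(desc1, desc2, ratio=0.8, mutual=True):
    # Pairwise Euclidean distances between the two descriptor sets (M x N).
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=-1)
    nn12 = d.argmin(axis=1)                      # NN of each feature of I1 in I2
    nn21 = d.argmin(axis=0)                      # NN of each feature of I2 in I1
    matches = []
    for i, j in enumerate(nn12):
        if mutual and nn21[j] != i:              # MNN: must be each other's NN
            continue
        first, second = np.sort(d[i])[:2]
        if first >= ratio * second:              # NNDR: first NN must be distinctive
            continue
        matches.append((i, j))
    return matches
```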
4.5.2 Resampling-Based Methods

Resampling is (arguably) the most prevalent paradigm and is represented by the classic RANSAC algorithm (Fischler and Bolles 1981). Basically, the two images are assumed to be coupled by a certain parametric geometric relation, such as a projective transformation or epipolar geometry. The RANSAC algorithm then follows a hypothesize-and-verify strategy: repeatedly sample a minimal subset from the data (e.g., four correspondences for a projective transformation and seven correspondences for a fundamental matrix), estimate a model as a hypothesis, and verify its quality by the number of consistent inliers. Finally, the correspondences consistent with the optimal model are recognized as inliers.
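A bare-bones version of this hypothesize-and-verify loop is sketched below for a 2-D affine model (three-point minimal sample); the iteration count and inlier threshold are arbitrary, and real systems would add adaptive stopping and degeneracy checks.

```python
"""Minimal hypothesise-and-verify loop in the spirit of RANSAC, written for a
2-D affine model purely for illustration."""
import numpy as np

def fit_affine(src, dst):
    # Solve dst ~= A @ [x, y, 1]^T in least squares for the 2x3 matrix A.
    X = np.hstack([src, np.ones((len(src), 1))])
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return A.T                                              # 2 x 3

def ransac_affine(src, dst, n_iters=1000, thresh=3.0, seed=0):
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    ones = np.ones((len(src), 1))
    for _ in range(n_iters):
        sample = rng.choice(len(src), size=3, replace=False)   # minimal subset
        A = fit_affine(src[sample], dst[sample])               # hypothesis
        proj = np.hstack([src, ones]) @ A.T                    # verification
        inliers = np.linalg.norm(proj - dst, axis=1) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on the consensus set, as most RANSAC variants do.
    return fit_affine(src[best_inliers], dst[best_inliers]), best_inliers
```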
Various methods have been proposed to improve the performance of RANSAC. In MLESAC (Torr and Zisserman 1998, 2000), the model quality is verified by a maximum likelihood process, which, albeit under certain assumptions, can improve the results and is less sensitive to the predefined threshold. The idea of modifying the verification stage has been utilized and further extended in many following studies because of its simple implementation. The modification of the sampling strategy has also been considered in quite a few studies because of the appealing efficiency enhancement. In essence, diverse prior information is incorporated to increase the probability of selecting an all-inlier sample subset. Specifically, the inliers are assumed to be spatially coherent in NAPSAC (Nasuto and Craddock 2002), or to exist within some groupings in GroupSAC (Ni et al. 2009). PROSAC (Chum and Matas 2005) exploits an a priori predicted inlier probability, and EVSAC (Fragoso et al. 2013) uses an estimate of confidence based on extreme value theory of the correspondences. Another seminal work is the locally optimized RANSAC (LO-RANSAC) (Chum et al. 2003), with the key observation that taking minimal subsets can amplify the underlying noise and yield hypotheses that are far from the ground truth. This problem is addressed by introducing a local optimization procedure when arriving at the so-far-the-best model. In the original paper, local optimization is implemented as an iterated least-squares fitting process with a shrinking inlier-outlier threshold inside an inner RANSAC.
This inner RANSAC uses larger-than-minimal samples and is applied only to the inliers of the current model. The computational cost issue of LO-RANSAC is addressed in Lebeda et al. (2012), where several implementation improvements are suggested. The local optimization step is augmented with a graph-cut technique in Barath and Matas (2018). Many improving strategies for RANSAC are integrated in USAC (Raguram et al. 2012). More recently, Barath et al. (2019b) applied σ-consensus in their MAGSAC to eliminate the need for a user-defined threshold by marginalizing over a range of noise scales. Thereafter, observing that nearby points are more likely to originate from the same geometric model, Barath et al. (2019a) extracted the local structure for global sampling and parameter model estimation by drawing samples from gradually growing neighborhoods. Based on the above two methods, they introduced MAGSAC++ (Barath et al. 2020) with a new scoring function. This method avoids the need for an inlier-outlier decision: a novel marginalization procedure formulated as an M-estimation is solved by an iteratively re-weighted least-squares procedure, and the progressively growing sampling strategy of Barath et al. (2019a) is also applied for RANSAC-like robust estimation.

Some fundamental shortcomings are exhibited by the resampling methods despite their efficacy in wide applications of computer vision. For example, the theoretically required runtime grows exponentially with the increase of the outlier rate. The minimal subset sampling strategy only applies to parametric models and fails to handle image pairs undergoing complex transformations, such as non-rigid ones. This situation motivates researchers to develop new algorithms divorced from the resampling paradigm.

4.5.3 Non-parametric Model-Based Methods

A group of non-parametric model-based methods have been proposed. Instead of simple parametric models, non-parametric models address more general priors in matching, e.g., motion coherence, and can deal with degenerate scenarios. These methods are distinguished by the different deformation functions used to model the transformation and the different means to cope with gross outliers. Pilet et al. (2008) proposed the use of a triangulated 2-D mesh to model the deformation, using a tailored robust estimator to eliminate the detrimental effect of outliers. The idea of robust estimators is also leveraged in Gay-Bellile et al. (2008), with the Huber estimator, and Ma et al. (2015), with the L2E estimator, despite their different modeling of the deformation. A fairly different method is proposed in Li and Hu (2010), in which the support vector regression technique is employed to robustly estimate a correspondence function and reject mismatches.

The seminal work vector field consensus (VFC) (Ma et al. 2013a, 2014) introduces a new framework for non-rigid matching. The deformation function is restricted within a reproducing kernel Hilbert space in association with Tikhonov regularization to enforce the smoothness constraint. The estimation is conducted in a Bayesian model, where the outliers are explicitly considered for robustness. The VFC algorithm and its variants (Ma et al. 2015b, 2017a, 2019b) have been proven effective.
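The smoothness prior underlying such non-parametric models can be illustrated with a simple kernel ridge regression on the displacement field, as in the sketch below; this omits the Bayesian outlier modeling of the full VFC algorithm, and the kernel width, regularization weight and tolerance are placeholder values that depend on the coordinate scale.

```python
"""Sketch of the smoothness idea behind non-parametric matching models:
fit a displacement field with a Gaussian kernel and Tikhonov (ridge)
regularisation, then flag matches that deviate from the fitted field."""
import numpy as np

def gaussian_kernel(A, B, beta=0.1):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-beta * d2)

def fit_vector_field(x, y, lam=1.0, beta=0.1):
    # x: source keypoints (N x 2), y: matched target keypoints (N x 2).
    # Field f(.) = sum_i K(., x_i) c_i, with (K + lam * I) C = y - x.
    K = gaussian_kernel(x, x, beta)
    C = np.linalg.solve(K + lam * np.eye(len(x)), y - x)
    return lambda q: q + gaussian_kernel(q, x, beta) @ C

def keep_coherent(x, y, tol=5.0, **kw):
    # Keep only matches that agree with the smoothly interpolated field.
    f = fit_vector_field(x, y, **kw)
    return np.linalg.norm(f(x) - y, axis=1) < tol      # boolean inlier mask
```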
4.5.4 Relaxed Methods

The recent trend has been towards developing relaxed methods for matching, where the geometric constraint is made less strict to accommodate even more complex scenarios, such as motion discontinuities arising from image pairs with wide baselines or with objects undergoing independent motions. Certain GM methods (Leordeanu and Hebert 2005; Liu and Yan 2010) are available for such requirements and use quadratic models that incorporate pairwise geometric relations of correspondences to find the potentially correct ones. However, the results are often coarse.

Lipman et al. (2014) considered deformations that are piecewise affine; they then formulated feature matching as a constrained optimization problem that seeks a deformation consistent with the most correspondences while exerting a bounded distortion. Lin et al. (2014, 2017) proposed to identify true matches with likelihood functions estimated using a nonlinear regression technique in a specially designed domain of correspondence, where motion coherence is imposed while discontinuities are also allowed. This concept corresponds to enforcing a local motion coherence constraint. Ma et al. (2018a, 2019d) presented a locality-preserving approach for matching, whereby a global distortion model is relaxed to focus on the locality of each correspondence in exchange for generality and efficiency. The derived criterion has been proven able to rapidly and accurately filter erroneous matches. A similar method appeared in Bian et al. (2017), wherein a simple criterion based on local supporting matches is introduced to reject outliers. Jiang et al. (2020a) cast feature matching as a spatial clustering problem with outliers, adaptively clustering the putative matches into several motion-consistent clusters together with an outlier/mismatch cluster. Another method in Lee et al. (2020) formulates the feature matching problem as a Markov random field that uses both local descriptor distances and relative geometric similarities to enhance robustness and accuracy.
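As a toy illustration of such relaxed, locality-based criteria, the sketch below keeps a putative match only if enough of its spatial neighbors share a similar displacement; the neighborhood size and thresholds are illustrative, and the rule is only loosely inspired by the locality-preserving and local-support methods cited above.

```python
"""Sketch of a relaxed, locality-based mismatch filter: a putative match is
kept if enough of its k nearest neighbours in the first image move with a
similar displacement."""
import numpy as np

def locality_filter(x, y, k=8, disp_tol=20.0, min_support=4):
    disp = y - x                                       # displacement of each match
    keep = np.zeros(len(x), dtype=bool)
    for i in range(len(x)):
        nbr = np.argsort(np.linalg.norm(x - x[i], axis=1))[1:k + 1]
        support = np.linalg.norm(disp[nbr] - disp[i], axis=1) < disp_tol
        keep[i] = support.sum() >= min_support         # enough coherent neighbours
    return keep
```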
4.6 Learning for Matching

Apart from detectors or descriptors, learning-based matching methods are commonly used to substitute traditional methods in information extraction and representation or in model regression. The matching step by learning can be roughly classified into image-based and point-based learning. Based
on the traditional methods, the former aims to cope with three typical tasks, namely image registration (Wu et al. 2015a), stereo matching (Poursaeed et al. 2018), and camera localization or transformation estimation (Poursaeed et al. 2018; Erlik Nowruzi et al. 2017; Yin and Shi 2018). Such methods can directly realize task-based learning without attempting to detect any salient image structure (e.g., interest points) in advance. By contrast, point-based learning operates on the extracted point sets; such methods are commonly used for point data processing, such as classification, segmentation (Qi et al. 2017a, b) and registration (Simonovsky et al. 2016; Liao et al. 2017). Researchers have also used these for correct match selection and geometrical transformation model estimation from putative match sets (Moo Yi et al. 2018; Ma et al. 2019a; Zhao et al. 2019; Ranftl and Koltun 2018; Poursaeed et al. 2018).

4.6.1 Learning from Images

Matching methods of image-based learning often use CNNs for image-level latent information extraction and similarity measurement, as well as geometrical relation estimation. Therefore, patch-based learning (Sect. 3.3: learning-based feature descriptors) is frequently used as an extension of area-based image registration and stereo matching, because traditional similarity measurements in a sliding window can easily be replaced in a deep manner, i.e., with deep descriptors. However, the success achieved by researchers in using deep learning in spatial transformer networks (STN) (Jaderberg et al. 2015) and optical flow estimation (FlowNet) (Dosovitskiy et al. 2015) has aroused a wave of studies on directly estimating the geometrical transformation or non-parametric deformation field with deep learning techniques, even achieving an end-to-end trainable framework.

Image registration. For area-based image registration, early deep learning is generally used as a direct extension of the classical registration framework; later works use the reinforcement learning paradigm to iteratively estimate the transformation, or even directly estimate the deformation or displacement field for the registration task. The most intuitive approach is to use deep networks to estimate the similarity measurement for the target image pair in order to drive an iterative optimization procedure. In this way, the classical measure metrics, such as correlation-like and MI methods, can be substituted with superior deep metrics. For instance, Wu et al. (2015a) achieved deformable image registration by using a convolutional stacked auto-encoder (CAE) to discover compact and highly discriminative features from the observed image patch data for similarity metric learning. Similarly, to obtain a better similarity measure, Simonovsky et al. (2016) used a deep network trained from a few aligned image pairs. In addition, a fast, deformable image registration method called Quicksilver (Yang et al. 2017b) has been devised for the patch-wise prediction of a deformation model directly using image appearance, whereby a deep encoder-decoder network is used for predicting the large deformation diffeomorphic model. Inspired by deep convolution, Revaud et al. (2016) introduced a dense matching algorithm based on a hierarchical correlation architecture. This method can handle complex non-rigid deformations and repetitive textured regions. Arar et al. (2020) introduced an unsupervised multi-modal image registration technique based on an image-to-image translation network with geometry-preserving constraints.

Different from metric learning, a trained agent can be used for image registration within a reinforcement learning paradigm, typically for estimating a rigid transformation model or a deformable field. Liao et al. (2017) first used reinforcement learning for rigid image registration, in which an artificial agent and a greedy supervised approach, coupled with an attention-driven hierarchical strategy, are used to realize the "strategy learning" process and find the best sequence of motion actions that yields image alignment. An artificial agent, which explores the parametric space of a statistical deformation model by training from a large number of synthetically deformed image pairs, is also trained in Krebs et al. (2017) to cope with the deformable registration problem and the difficulty of extracting reliable ground-truth deformation fields from real data. Instead of using a single agent, Miao et al. (2018) proposed a multi-agent reinforcement learning paradigm for medical image registration in which an auto-attention mechanism is used to attend to multiple image regions. However, reinforcement learning is often used to predict iterative updates of the regression procedure and still consumes considerable computation in the iterative process.

To reduce the runtime and avoid explicitly defining a dissimilarity metric, end-to-end registration in one shot has received increasing attention. Sokooti et al. (2017) first designed deep regression networks to directly learn a displacement vector field from a pair of input images. Another method in de Vos et al. (2017) similarly trained a deep network to regress and output the parameters of a spatial transformation, which can then generate the displacement field to warp the moving image to the target image. However, a similarity metric between image pairs is still required to achieve unsupervised optimization. More recently, a deep learning framework has been introduced in de Vos et al. (2019) for unsupervised affine and deformable image registration; the trained networks can be used to register pairs of unseen images in one shot. Similar methods that regard deep networks as a regressor can directly learn a parametric transformation model from image pairs, such as the fundamental matrix (Poursaeed et al. 2018), a homography (DeTone et al. 2016), and non-rigid deformation (Rocco et al. 2017).

Many other end-to-end image-level learning-based registration methods have been presented. Chen et al. (2019)
proposed end-to-end trainable deep networks to directly predict the dense displacement field for image alignment. Wang and Zhang (2020) introduced DeepFLASH for efficient deformable medical image registration, which is implemented in a low-dimensional band-limited space and thus dramatically reduces the computational and memory requirements. To simultaneously enhance the topology preservation and smoothness of the transformation model, Mok and Chung (2020) proposed an efficient unsupervised symmetric image registration method that maximizes the similarity between images within the space of diffeomorphic maps and estimates both forward and inverse transformations simultaneously. In Truong et al. (2020), the authors introduced a universal network for geometric matching, optical flow estimation and semantic correspondence, which can achieve both high accuracy and robustness by investigating the combined use of global and local correlation layers. See more details in the registration-specific reviews (Ferrante and Paragios 2017; Haskins et al. 2020).

Stereo matching. Over the past years, analogous to registration, numerous studies in stereo matching have focused on accurately computing the matching cost by using deep convolutional techniques and refining the disparity map (Zbontar and LeCun 2015; Luo et al. 2016; Zbontar and LeCun 2016; Shaked and Wolf 2017). In addition to the deep descriptors, such as DeepCompare (Zagoruyko and Komodakis 2015) and MatchNet (Han et al. 2015), Zbontar and LeCun (2015) introduced a deep Siamese network to compute the matching cost, which is trained to predict the similarity between image patches. They further proposed a series of CNNs (Zbontar and LeCun 2016) for the binary classification of pairwise matching and applied these to disparity estimation. Similarly converting the computation of matching costs into a multi-label classification problem, Luo et al. (2016) proposed an efficient Siamese network for fast stereo matching. In addition, Shaked and Wolf (2017) improved the performance by computing the matching cost with their proposed constant highway networks and estimating disparity with reflective confidence learning.
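A minimal Siamese matching-cost network in the spirit of these works is sketched below in PyTorch; the architecture and patch size are arbitrary and do not reproduce any published model.

```python
"""Toy Siamese network for patch-wise matching cost: a shared convolutional
encoder maps each grey-level patch to a descriptor, and the cost is the
negative cosine similarity between the two descriptors."""
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, patch):                   # patch: (B, 1, H, W)
        return F.normalize(self.net(patch), dim=1)

def matching_cost(encoder, left_patch, right_patch):
    # Lower cost = more similar; both branches share the same weights.
    return 1.0 - (encoder(left_patch) * encoder(right_patch)).sum(dim=1)

encoder = PatchEncoder()
cost = matching_cost(encoder, torch.randn(4, 1, 11, 11), torch.randn(4, 1, 11, 11))
```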
The end-to-end deep manner for this matching task has drawn increasing attention in recent years. For instance, Mayer et al. (2016) trained an end-to-end CNN in their DispNet to obtain a fine disparity map, which was extended by Pang et al. (2017) with a two-stage CNN called cascade residual learning (CRL). More recently, a spatial pyramid pooling module together with a 3-D convolutional strategy has been introduced in Chang and Chen (2018). This approach can exploit global context information to enhance stereo matching. Inspired by CycleGAN (Zhu et al. 2017) and to deal with the domain gap, Liu et al. (2020) proposed an end-to-end training framework to translate all synthetic stereo images into realistic ones while maintaining epipolar constraints. This method is implemented through joint optimization between domain translation and stereo matching. Another method in Yang et al. (2020) learns the wavelet coefficients of the disparity rather than the disparity itself, which can learn global context information from the low-frequency submodule and details from the others. Moreover, the guided strategy (Zhang et al. 2019a; Poggi et al. 2019) is also utilized for stereo matching.

Stereo matching with deep convolutional techniques has become dominant owing to its top performance on public benchmarks.² However, the use of CNNs in the stereo matching community is limited by the input image pairs, which are generally captured from a binocular camera with a narrow baseline and epipolar rectification. Nevertheless, the network structures, basic ideas, and some tricks or strategies in these learning-based stereo matching methods may serve as a strong reference for general image matching tasks.

² http://www.cvlibs.net/datasets/kitti/eval_scene_flow.php?benchmark=stereo

4.6.2 Learning from Points

Learning from points is not as popular as learning from images for feature extraction, representation and similarity measurement. Point-based learning, particularly for feature matching, has only been introduced in recent years, because using CNNs on point data is more difficult than on raw images owing to the unordered structure and dispersed nature of sparse points. Moreover, operating on and extracting the spatial relationships, such as neighboring elements, relative positions, lengths, and angle information, among multiple points using deep convolutional techniques is challenging. However, using deep learning techniques to solve point-based tasks has received increasing consideration. These techniques can be roughly divided into parameter fitting (Brachmann et al. 2017; Ranftl and Koltun 2018) and point classification and/or segmentation (Qi et al. 2017a, b; Moo Yi et al. 2018; Ma et al. 2019a; Zhao et al. 2019). The former is inspired by the classical RANSAC algorithm and aims to estimate the transformation model, such as the fundamental matrix (Ranftl and Koltun 2018) or epipolar geometry (Brachmann and Rother 2019), by means of a data-driven optimization strategy with CNNs. The latter instead tends to train a classifier to identify the true matches from the putative match set. Generally, parameter fitting and point classification are trained jointly for performance enhancement.

For trainable fundamental matrix estimation, Brachmann et al. (2017) proposed a differentiable RANSAC, termed DSAC, which is based on reinforcement learning in an end-to-end manner. They replaced the deterministic hypothesis selection with a probabilistic selection to decrease the expected loss and optimize the learnable parameters. Subsequently, Ranftl and Koltun (2018) presented a trainable method for
fundamental matrix estimation from noisy correspondences, which is cast as a series of weighted homogeneous least-squares problems, where the robust weights are estimated with deep networks. Similar to DSAC, the use of learning techniques to improve the resampling strategy is also introduced in Brachmann and Rother (2019) and Kluger et al. (2020). Brachmann and Rother (2019) proposed NG-RANSAC, a robust estimator using learned guidance of hypothesis sampling. It uses the inlier count itself as the training objective to facilitate self-supervised learning, and it can incorporate non-differentiable task loss functions and non-differentiable minimal solvers. CONSAC (Kluger et al. 2020) is introduced as a robust estimator for multiple parametric model fitting; it uses a neural network to sequentially update the conditional sampling probabilities for the hypothesis selection.

Learning-based mismatch removal methods have also been developed in recent years. Moo Yi et al. (2018) first attempted to introduce a learning-based technique termed learning to find good correspondences (LFGC), which aims to train a network from a set of sparse putative matches together with the image intrinsics under rigid geometrical transformation constraints, and to label the test correspondences as inliers or outliers while outputting the camera motion simultaneously. However, LFGC may sacrifice many true correspondences to estimate the motion parameters, failing to handle general matching problems, such as deformable and non-rigid image matching. To this end, Ma et al. (2019a) proposed a general framework to learn a two-class classifier for mismatch removal, called LMR, which uses only a few images and a handcrafted geometrical representation for training. Their method showed promising matching performance with linearithmic time complexity. More recently, Zhang et al. (2019b) focused on geometry recovery based on their order-aware networks (OAN) and achieved promising performance on pose estimation. Sarlin et al. (2020) proposed SuperGlue to match two sets of local features by jointly finding correspondences and rejecting non-matchable points. This method is implemented with graph neural networks (Scarselli et al. 2009) for the differentiable optimization of a transport problem. A similar graph neural network pipeline has been adopted by an emerging research branch, namely deep graph matching (Wang et al. 2019; Yu et al. 2020a; Fey et al. 2020), where cross-graph convolution (Wang et al. 2019), channel-independent embedding (Yu et al. 2020a) and spline-based convolution (Fey et al. 2020) are proposed and adopted for supervised graph correspondence learning.
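The sketch below gives a minimal PyTorch classifier of this kind: each putative match is a 4-D point processed by shared per-point layers with a simple context normalization over the set of correspondences; the layer sizes are illustrative and no training loop is shown.

```python
"""Minimal point-based inlier classifier: each putative match (x1, y1, x2, y2)
is processed by shared 1x1 convolutions (per-point MLPs) interleaved with a
context normalisation across the correspondence dimension."""
import torch
import torch.nn as nn

class ContextNorm(nn.Module):
    # Normalise each channel over the set of correspondences (dim=2).
    def forward(self, x):                      # x: (B, C, N)
        return (x - x.mean(dim=2, keepdim=True)) / (x.std(dim=2, keepdim=True) + 1e-6)

class InlierClassifier(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(4, channels, 1), ContextNorm(), nn.ReLU(),
            nn.Conv1d(channels, channels, 1), ContextNorm(), nn.ReLU(),
            nn.Conv1d(channels, 1, 1),         # one logit per correspondence
        )

    def forward(self, matches):                # matches: (B, N, 4)
        logits = self.net(matches.transpose(1, 2)).squeeze(1)
        return torch.sigmoid(logits)           # inlier probability per match

probs = InlierClassifier()(torch.rand(2, 500, 4))
```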
Even though applying CNNs to point data is difficult, the latest techniques have shown great potential for matrix estimation and point data classification with deep regressors and classifiers, particularly for challenging data or scenarios. Moreover, the multi-layer perceptron methods in natural language processing and the graph convolutional techniques may serve as great references for addressing such dispersed and unstructured point data in the matching task.

4.7 Matching in 3-D Cases

Similar to its 2-D counterpart, 3-D matching methods often involve two steps, namely, keypoint detection and local feature description, and a sparse correspondence set can then be established by calculating the similarities between descriptors. Although most methods use local feature descriptors, which are designed to be robust to noise and deformations, to establish correspondences between 3-D instances, a variety of classical and recent works fall into other categories. We refer the readers to the recent surveys (Biasotti et al. 2016; Van Kaick et al. 2011) in the shape matching area, given that a detailed review of this literature is beyond the scope of this paper.

The embedding methods aim to parametrize the complex matching problem with fewer degrees of freedom for tractability by exploiting some natural assumptions (e.g., approximate isometry). A traditional approach is proposed by Elad and Kimmel (2003) to match shapes by embedding them in an intermediate Euclidean space. In this approach, the geodesic distances are approximated by Euclidean ones, and the original non-rigid registration problem is reduced to rigid registration in the intermediate space. Notably, other works developed conformal mapping approaches that also use an embedding space (Lipman and Funkhouser 2009; Kim et al. 2011; Zeng et al. 2010).

A more direct approach is to find a point-wise matching between (subsets of) points on the shapes by minimizing the structural distortion. This formulation was developed by Bronstein et al. (2006), who introduced a highly non-convex and non-differentiable objective and a generalized multidimensional scaling technique for optimization. Some researchers have also attempted to mitigate the prohibitively high computational complexity issue (Sahillioglu and Yemez 2011; Tevs et al. 2011) while considering the quadratic assignment formulation (Rodola et al. 2012, 2013; Chen and Koltun 2015; Wang et al. 2011) in graph matching.

The family of methods based on the functional map framework was first developed by Ovsjanikov et al. (2012). Instead of point-to-point matching in Euclidean space, these methods represent the correspondences using the functional map between two manifolds, which can be characterized by linear operators. The functional map can be encoded in a compact form by using the eigenbases of the Laplace-Beltrami operator. Most natural constraints on the map, such as landmark correspondences and operator commutativity, become linear in this formulation, leading to an efficient solution. This approach was adopted and extended in many follow-up works (Aflalo et al. 2016; Kovnatsky et al. 2015; Pokrass et al. 2013; Rodolà et al. 2017; Litany et al. 2017).
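The core least-squares step of the functional map framework can be written in a few lines of NumPy, as sketched below; the eigenbases and descriptor functions are assumed to be precomputed, and no regularization or commutativity constraints are added.

```python
"""Sketch of basic functional-map estimation: given k Laplace-Beltrami
eigenfunctions per shape and corresponding descriptor functions, the map C
is the least-squares solution of C A ~= B in the spectral domain."""
import numpy as np

def estimate_functional_map(phi1, phi2, F1, F2):
    # phi1: (n1, k), phi2: (n2, k) eigenbases; F1: (n1, d), F2: (n2, d)
    # corresponding descriptor functions on the two shapes.
    A = np.linalg.pinv(phi1) @ F1               # spectral coefficients, shape 1
    B = np.linalg.pinv(phi2) @ F2               # spectral coefficients, shape 2
    X, *_ = np.linalg.lstsq(A.T, B.T, rcond=None)   # solves A^T X ~= B^T
    return X.T                                   # (k, k) functional map C

def pointwise_map(C, phi1, phi2):
    # Recover a point-to-point map via nearest neighbours in the spectral domain.
    emb1, emb2 = phi1 @ C.T, phi2                # rows are spectral embeddings
    d = ((emb2[:, None, :] - emb1[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)                      # each point on shape 2 -> shape 1
```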

Point set learning in 3-D cases for registration is also a hot topic. Yew et al. (2020) proposed RPM-Net for rigid point cloud registration, which is less sensitive to initialization and improves convergence with learned fused features. Gojcic et al. (2020) introduced an end-to-end multiview point cloud registration framework that directly learns to register all views of a scene in a globally consistent manner. Pais et al. (2020) introduced a learning architecture for 3-D point registration, namely 3DRegNet. This method can identify true point correspondences from a set of putative matches and regress the motion parameters to align the scans into a common reference frame. Choy et al. (2020) used high-dimensional convolutional networks to detect linear subspaces in high-dimensional spaces and then applied them to 3-D registration under rigid motions and to image correspondence estimation.

4.8 Summary

Given a pair of images of a similar object/scene, and with or without feature detection and/or description, the matching task has been extended into several different forms, such as image registration, stereo matching, feature matching, graph matching, and point set registration. These different matching definitions are generally introduced for specific applications, each with its own strengths.

Traditional image registration and stereo achieve dense matching by means of patch-wise similarity measurement together with an optimization strategy that searches for the overall optimal solution. However, they are conducted on image pairs with a high overlapping area (slight geometrical deformation) or from a binocular camera, and they may require a large computational burden while being limited by handcrafted measuring metrics.

The introduction of deep learning has promoted registration accuracy and disparity estimation owing to advancements in network design and loss definition, as well as abundant training samples. However, we also find that deep learning for these matching tasks is usually applied to image pairs undergoing slight geometrical deformation, such as medical image registration and binocular stereo matching. Applying it to more complex scenarios, such as wide-baseline stereo or image registration with serious geometric deformations, still remains open.

Feature-based matching can effectively address the limitations in large-viewpoint, wide-baseline, and serious non-rigid image matching problems. Among the methods proposed in the literature, the most popular strategy is to construct putative matches based on descriptor distance, followed by a robust estimator such as RANSAC. However, a large number of mismatches in the putative match sets may negatively affect the performance of the subsequent visual task and also require considerable time for model estimation. Therefore, a mismatch removal method is required and integrated to preserve as many true matches as possible while keeping the mismatches to a minimum level using extra geometrical constraints. Specifically, the resampling-based methods, such as RANSAC, can estimate the latent parametric model and simultaneously remove the outliers. However, their theoretically required runtime grows exponentially with the increase in outlier rate, and they cannot process image pairs that undergo more complex non-rigid transformations. The non-parametric model-based methods can handle the non-rigid image matching problem by using a high-dimensional non-parametric model, but defining the objective function and finding the optimal solution in a more complex solution space remain challenging. Different from the global constraints in the resampling and non-parametric model-based methods, the relaxed mismatch removal methods are commonly conducted under a local coherence assumption on the potential inliers. Thus, much simpler but efficient rules are designed to filter out the outliers while preserving the inliers within an extremely short time. However, methods of this type are limited by their parameter sensitivity; moreover, they are prone to preserving evident outliers, thereby affecting the accuracy of subsequent pose estimation and image registration.

In addition, the image patch-based descriptor may not be workable when matching is required on low-texture images, shapes, semantic images, or raw points directly captured by specific devices. Therefore, for performing the matching task in these situations, the graph matching and point set registration methods are more suitable. The graph structure among neighboring points and the overall correspondence matrix are used to optimize and find the optimal solution. However, these pure point-based methods are limited by their computational burden and outlier sensitivity. Therefore, designing appropriate problem formulations and constraint conditions, and proposing more efficient optimization methods, are still open problems in the image matching community and require further research attention.

Analogously to image-based learning, an increasing number of studies have used deep learning in the feature-based matching community. The latest techniques have shown great potential for matrix estimation (e.g., the fundamental matrix) and point data classification (such as mismatch removal) with deep regressors and classifiers, particularly for handling challenging data or scenarios. However, applying convolutional networks to point data is not as easy as on raw images owing to the unordered structure and dispersed nature of sparse points. Nevertheless, recent studies have shown the feasibility of using the graph convolutional strategy and multi-layer perceptron methods, together with specific normalization, on such point data. In addition to rigid transformation parameter estimation, matching on point data with non-rigid and even serious deformation by using deep convolutional techniques may be a more challenging and significant problem.
5 Matching-Based Applications

Image matching is a fundamental problem in computer vision and is considered a critical prerequisite in a wide range of applications. In this section, we briefly review several representative applications.

5.1 Structure-from-Motion

Structure-from-motion (SfM) involves recovering the 3-D structure of a stationary scene from a series of images, which are obtained from different viewpoints, by estimating the camera motions corresponding to these images. SfM involves three main stages, namely, (i) feature matching across images, (ii) camera pose estimation, and (iii) recovery of the 3-D structure using the estimated motion and features. Its efficacy largely depends on the admissible set of feature matches.

In modern SfM systems (Schonberger and Frahm 2016; Wu 2018; Sweeney et al. 2015), the feature matching pipeline is widely adopted across images, i.e., feature detection, description, and nearest-neighbor matching, to provide initial correspondences. The initial correspondences contain a number of outliers; thus, geometric verification is required, which is tackled via fundamental matrix estimation using RANSAC (Fischler and Bolles 1981). This can potentially be addressed by mismatch removal methods.
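A typical two-view front end of this kind can be assembled from OpenCV primitives as in the sketch below; the image paths are placeholders, and the ratio and RANSAC thresholds are common but arbitrary defaults.

```python
"""Sketch of a two-view matching front end with OpenCV: SIFT detection and
description, ratio-test matching, and geometric verification via a
RANSAC-estimated fundamental matrix."""
import cv2
import numpy as np

img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input images
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Nearest-neighbour matching with Lowe's ratio test (NNDR).
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [p[0] for p in matcher.knnMatch(des1, des2, k=2)
        if len(p) == 2 and p[0].distance < 0.8 * p[1].distance]

pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# Geometric verification: epipolar geometry estimated with RANSAC.
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
inliers = [m for m, ok in zip(good, mask.ravel()) if ok]
```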
been developed to use image similarities to identify previ-
Meanwhile, to enhance the SfM task, researchers have
ously visited places. Feature matching results are naturally
focused on performing robust feature matching, i.e., thus
applicable to measure the similarity of two scenes and have
establishing rich and accurate correspondences. Evidently,
been the bases of many state-of-the-art methods. For exam-
advanced descriptors can greatly affect this task (Fan et al.
ple, Liu and Zhang (2012) performed feature matching with
2019). Moreover, Shah et al. (2015) proposed a geometry-
SIFT between the current image and each previously vis-
aware approach, which initially uses a small sample of
ited image, after which they determined the closed loop on
features to estimate the epipolar geometry between the
the basis of the number of accurate matches in the results.
images and leverages it for the guided matching of the
Zhang et al. (2011) used directed matching of raw features
remaining features. Lin et al. (2016b) utilized RANSAC to
extracted from images for detecting loop-closure events. To
guide the training of match consistency curves for differ-
achieve loop closure detection, Wu et al. (2014) used LSH
entiating true and false matches. Their approach traces the
as the basic technique by matching the binary visual features
common problems of wide-baselines and repeated structures
in the current view of a robot with the visual features in the
for reconstructing modern cities. These correspondences are
robot appearance map. Liu et al. (2015a) developed a consen-
also the prerequisites for camera pose estimation, and the
sus constraint to prune outliers and verified the superiority
effective substitution of commonly used RANSAC for this
of their methods for loop closure detection.
task has also been investigated (Moo Yi et al. 2018), with a
pre-stage of identifying good correspondences.
5.3 Visual Homing

5.2 Simultaneous Localization and Mapping Visual homing aims to navigate a robot from an arbitrary
starting position to a goal or home position based solely
Acquiring maps of the environment is a fundamental task on visual information. This is often accomplished by esti-
for autonomous mobile robots, thereby forming the basis mating a homing vector/direction (pointing from the current
of many different higher-level tasks, such as navigation and position to the home position) from two panoramic images,
localization. The problem of simultaneous localization and which are captured respectively at the current position and
mapping (SLAM) (Davison et al. 2007; Mur-Artal et al. 2015; the home position. Conventionally, feature matching serves

which are captured at the current position and the home position, respectively. Conventionally, feature matching serves as the building block of correspondence methods in visual homing research (Möller et al. 2010). In this category, the homing vector can be determined by transforming the correspondences into motion flows (Ma et al. 2018b; Churchill and Vardy 2013; Liu et al. 2013; Zhao and Ma 2017). Ramisa et al. (2011) combined the average landmark vector with invariant feature points automatically detected in panoramic images to achieve autonomous visual homing. However, the feature matches are solely determined by the similarity of the descriptors in this method, thus leading to a number of mismatches. The presence of outliers has been verified to be the reason for performance degradation in visual homing (Schroeter and Newman 2008). In order to resolve the degradation caused by mismatches, Liu et al. (2013) used a RANSAC-like method to remove mismatches. Meanwhile, Zhao and Ma (2017) proposed a visual homing method based on simultaneous mismatch removal and robust interpolation of sparse motion flows under a smoothness prior. Ma et al. (2018b) also proposed a guided locality preserving matching method to handle extremely large proportions of outliers and improve the robustness of visual homing.

5.4 Image Registration and Stitching

Image registration is the process of aligning two or more images of the same scene obtained from different viewpoints, at different times, or from different sensors (Zitova and Flusser 2003). In the past decades, feature-based methods, in which the key requirement is feature matching, have gained increasing attention owing to their robustness and efficiency. Once the correspondence is established, image registration is reduced to estimating the transformation model (e.g., rigid, affine, or projective). Finally, the source image is transformed by means of the mapping functions, which rely on some interpolation technique (e.g., bilinear or nearest neighbor).

A large number of works have been proposed for feature matching and image registration. Ma et al. (2015b) proposed a Bayesian formulation for rigid and non-rigid feature matching and image registration. To further exploit the geometrical cues, the locally linear transforming constraint is incorporated. They also recently proposed a guided locality preserving matching method (Ma et al. 2018a). Their proposed method can significantly reduce the computational complexity and is able to deal with a more complex transformation model. For non-rigid image registration, Pilet et al. (2008) and Gay-Bellile et al. (2008) proposed solutions in which robust matching techniques are insensitive to outliers. Some efforts (Paul and Pati 2016; Ma et al. 2017b; Yang et al. 2017a) also attempted to modify feature detectors and descriptors to improve the registration process.

The problem of multi-modal image registration is more complicated owing to the high variability of appearance caused by different modalities, which frequently arises in medical image and multi-sensor image analysis. For example, Chen et al. (2010) developed the partial intensity invariant feature descriptor (PIIFD) to register retinal images, whereas Wang et al. (2015) extended PIIFD in a more robust registration framework with the SURF detector (Bay et al. 2008) and a single Gaussian point matching model. On the basis of the characteristics of multi-modal images, Liu et al. (2018a) proposed an affine and contrast invariant descriptor for IR and visible image registration. Du et al. (2018) also proposed an IR and visible image registration method based on a scale-invariant PIIFD feature and locality preserving matching. Ye et al. (2017) proposed a novel feature descriptor based on the structural properties of images for multi-modal registration. A detailed discussion of feature matching-based, multi-modal registration techniques in the medical image analysis area, which are categorized as geometric methods, can be found in Sotiras et al. (2013).

Meanwhile, image stitching or image mosaicking involves obtaining a wider field of view of a scene from a sequence of partial views (Ghosh and Kaabouch 2016). Compared with image registration, image stitching deals with low-overlapping images and requires accurate pixel-level alignment to avoid visual discontinuities. Feature-based stitching methods are popular in this area because of their invariance properties and efficiency. For example, in order to identify geometrically consistent feature matches and achieve accurate homography estimation, Brown and Lowe (2007) proposed the use of SIFT (Lowe 2004) feature matching and the RANSAC (Fischler and Bolles 1981) algorithm. Lin et al. (2011) used SIFT (Lowe 2004) to pre-compute matches and then jointly estimated the matching and the smoothly varying affine fields for better stitching performance. Interested readers can refer to the comprehensive surveys (Ghosh and Kaabouch 2016; Bonny and Uddin 2016) for an overview of more feature-based image mosaicking and stitching methods.

5.5 Image Fusion

To generate an image more conducive to subsequent applications, image fusion is adopted to combine the meaningful information from images acquired by different sensors or under different shooting settings (Pohl and Van Genderen 1998), wherein the source images have been accurately aligned in advance. The very premise of image fusion is to register the source images using feature matching methods, and the accuracy of registration directly affects the fusion quality. Liu et al. (2017) used a CNN to jointly generate the activity level measurement and fusion rules for multi-focus image fusion. Meanwhile, Ma et al. (2019c) proposed an end-to-end model for infrared and visible image fusion, which generates images with a dominant infrared intensity and an additional visible gradient under the framework of generative adversarial networks. Subsequently, they introduced a detail loss and
a target edge-enhancement loss to further enrich the texture details (Ma et al. 2020).

A group of methods aim to fuse images based on local features, among which dense SIFT is the most popular. Liu et al. (2015b) proposed the fusion of multi-focus images with the dense scale-invariant feature transform, wherein the local feature descriptors are used not only as the activity level measurement but also to match the mis-registered pixels between multiple source images to improve the quality of the fusion results. Similarly, Hayat and Imran (2019) proposed a ghost-free multi-exposure image fusion technique using the dense SIFT descriptor with a guided filter, which can produce high-quality images using ordinary cameras. In addition, Chen et al. (2015) and Ma et al. (2016a) introduced methods that can perform image registration and image fusion simultaneously, thus fulfilling image fusion on unaligned image pairs.

5.6 Image Retrieval, Object Recognition and Tracking

Feature matching can be used to measure the similarity between images, thereby enabling a series of high-level applications, including image retrieval (Zhou et al. 2017), object recognition, and tracking. The goal of image retrieval is to retrieve all images that exhibit scenes similar to a given query image. In local feature-based image retrieval, the image similarity is intrinsically determined by the feature matches between images. Thus, the image similarity score can be obtained by aggregating votes from the matched features. In Zhou et al. (2011), the relevance score is simply determined by the number of feature matches across two images. In Jégou et al. (2010), the scoring function is defined as a cumulation of the squared term frequency-inverse document frequency weights on shared visual words, which is essentially a bag-of-features inner product.
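A toy version of such tf-idf weighted bag-of-words scoring is sketched below; the visual-word histograms are assumed to be given by a separate quantization step, and the weighting follows the standard tf-idf recipe rather than any specific cited system.

```python
"""Toy tf-idf bag-of-visual-words similarity scoring for image retrieval."""
import numpy as np

def tfidf_scores(db_hists, query_hist):
    # db_hists: (num_images, vocab) visual-word counts; query_hist: (vocab,)
    df = (db_hists > 0).sum(axis=0) + 1                 # document frequency per word
    idf = np.log(len(db_hists) / df)

    def embed(h):
        v = (h / max(h.sum(), 1)) * idf                 # term frequency * idf
        return v / (np.linalg.norm(v) + 1e-12)

    db = np.stack([embed(h) for h in db_hists])
    return db @ embed(query_hist)                       # cosine similarity per image
```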
Moreover, geometric context verification, a common technique for refining initial image retrieval results, is directly related to feature matching. By incorporating geometrical information, geometric context verification can be used to address the false match problem caused by the ambiguity of local descriptors and the quantization loss. For image retrieval, a large group of methods estimate the transformation model explicitly to verify the tentative matches. For example, Philbin et al. (2007) used a RANSAC-like method to find the inlier correspondences, whereas Avrithis and Tolias (2014) developed a simple spatial matching model inspired by Hough voting in the transformation space. Another line of works addresses geometric context verification without explicitly handling a transformation model. For example, Sivic and Zisserman (2003) utilized the consistency of spatial context in local feature groups to verify the tentative correspondences. Zhou et al. (2010) proposed the spatial coding method, whereby the valid visual word matches are identified by verifying the global relative position consistency.

With the function of measuring similarity, feature matching also plays an important role in object recognition and tracking. For example, Lowe et al. (1999) used SIFT features to match sample images and new images. In their proposed method, the potential model pose is identified through a Hough transform hash table and then through a least-squares fit to achieve a final estimate of the model parameters. The presence of the object is strongly evident if at least three keys agree on the model parameters with low residuals. Modern attempts at object recognition also include some specifically handcrafted features (Dalal and Triggs 2005; Hinterstoisser et al. 2012) and, more recently, deep learning approaches (Wohlhart and Lepetit 2015).

Tracking basically refers to estimating the trajectory of an object over images. Feature matching across images is the basis of feature-based tracking, and a variety of algorithms for these tasks have been proposed in the literature. The feature matching pipeline is adopted in most visual tracking systems, except that the matching is constrained to those known features that are predicted to lie close to the encountered position. The readers are referred to the comprehensive evaluation of different feature detectors and descriptors for tracking by Gauglitz et al. (2011), and to the recently presented benchmark (Wu et al. 2015b), which covers a review of modern object tracking methods as well as the role played by feature representation methods.

6 Experiments

Diverse methods for image matching have been proposed, particularly as deep learning techniques are becoming increasingly popular. However, the question of which method would be suitable for specific applications under different scenarios and requirements still remains. We are thus encouraged to conduct a more comprehensive and objective comparative analysis of these classical and state-of-the-art techniques.

6.1 Overview of Existing Reviews

To evaluate the existing matching methods at an early time, the classical image registration survey (Zitova and Flusser 2003) provided several definitions for the evaluation of registration accuracy, including localization error, matching error, and alignment or registration error. In 2005, Mikolajczyk et al. evaluated affine region detectors (Mikolajczyk et al. 2005) and local descriptors (Mikolajczyk and Schmid 2005) against changes of viewpoint, scale, illumination, blur, and image compression on their own proposed VGG (a.k.a. Oxford) datasets.
Fig. 2 Examples of the five datasets. The ground truth is given using colored correspondences. The head and tail of each arrow in the motion field correspond to the positions of feature points in two images (blue = true positive, black = true negative). For visibility, at most 100 randomly selected matches are presented in the image pairs, and the true negatives are not shown (Color figure online)

They also presented a comprehensive comparison of repeatability and accuracy for detectors, and of recall versus 1 − precision for descriptors. Subsequently, Strecha et al. (2008) published a dense 3-D dataset for wide-baseline stereo and for 3-D geometrical and camera pose evaluation.

In addition, Aanæs et al. (2012) evaluated some representative detectors using a large dataset with known camera positions, controlled illumination, and 3-D models, namely, DTU. At the same time, Heinly et al. (2012) compared the traditional float and binary feature operators in 2012 and evaluated their matching performance with inter-combinations of existing detectors and descriptors on public and their own datasets. The evaluation was conducted on more systematic performance metrics consisting of putative match ratio, precision, matching score, recall, and entropy. Similarly, using the inter-combination strategy, Mukherjee et al. (2015) provided a comparative experimental analysis for selecting an appropriate combination of various detectors and descriptors in order to solve image matching problems on different image data.
More recently, inspired by emerging deep learning tech- From the above mentioned, we can know that several com-
niques, Balntas et al. (2017) reported that existing defective prehensive and thorough evaluation of feature detectors and
datasets and evaluation metrics may lead to unreliable com- descriptors can be found in Komorowski et al. (2018), Lenc
parative results. Thus, they proposed and publicized a large and Vedaldi (2014), Heinly et al. (2012) and Schonberger
benchmark for handcrafted and learned local image descrip- et al. (2017). However, in order to evaluate the local feature
tors called Hpathes. They also comprehensively evaluated methods, many studies compared the matching performances
the performance of widely used handcrafted descriptors on a 3-D reconstruction task, including the works of Fan
and recent deep ones with extensive experiments on patch et al. (2019) and Schonberger et al. (2017). In the 3-D case,
recognition, patch verification, image matching, and patch Tombari et al. (2013) presented a thorough evaluation of sev-
retrieval. Schonberger et al. (2017) conducted an experimen- eral state-of-the-art 3-D keypoint detectors, and Guo et al.
tal evaluation of learned local features, including classical (2016) compared ten popular local feature descriptors in the
machine learning based variants of SIFT and recent CNN- contexts of 3-D object recognition, 3-D shape retrieval, and

Fig. 3 Quantitative performance of the state-of-the-art mismatch removal algorithms on the introduced five datasets. The statistics of precision, recall, F-score and runtime are reported for each dataset, and the average values are given in the legend. From top to bottom, the statistics of DAISY, DTU, Retina, RemoteSensing and VGG. The results are presented as cumulative distributions; a point on the curve with coordinate (x, y) denotes that (100 · x) percent of image pairs have a performance value (i.e., precision, recall, F-score or runtime) no more than y

In the 3-D case, Tombari et al. (2013) presented a thorough evaluation of several state-of-the-art 3-D keypoint detectors, and Guo et al. (2016) compared ten popular local feature descriptors in the contexts of 3-D object recognition, 3-D shape retrieval, and 3-D modeling. Several matching related applications, such as image retrieval (Zheng et al. 2018) and visual localization (Piasco et al. 2018), have also been evaluated recently. We refer the readers to these works for a detailed discussion of their performance. For mismatch removal, point set registration, graph matching, and the application performance of pose estimation and loop-closure detection, we will present both quantitative and qualitative comparisons.

6.2 Results on Mismatch Removal

We conduct experiments on five image matching datasets with ground truth. Our primary aim is to evaluate different mismatch removal methods.

The features of each image are assumed to be detected and described, and the open source VLFeat toolbox is used to determine the putative correspondences using SIFT (Lowe 2004). The details of the adopted datasets are described as follows, and some representative image matching examples from the used datasets are illustrated in Fig. 2. The ground truth of each dataset is either checked by the provided geometrical transform matrix, such as a homography, or provided in the manner that each match is manually labeled as true or false. The experiments of this part are performed on a desktop with a 3.4 GHz Intel Core i5-7500 CPU and 8 GB memory.

DAISY (Tola et al. 2010): The dataset consists of wide baseline image pairs with ground truth depth maps, including two short image sequences and several individual image pairs. We match every two images within each sequence and use all the individual pairs, which creates in total 47 image pairs for evaluation. This dataset is challenging due to the large number of matches, which is up to 8000. The average number of matches and the inlier rate are 1191.6 and 77.99%, respectively.

DTU (Aanæs et al. 2016): The dataset is originally designated for multiple-view stereo evaluation, which involves a number of different scenes with a wide range of objects. The ground truth camera positions and internal camera parameters have high accuracy. Two scenes are selected from this dataset (i.e., Frustum and House), from which we create 130 image pairs for evaluation. These scenes generally have large viewpoint changes. The average number of matches and the inlier rate are 729.3 and 58.83%, respectively.

Retina (Ma et al. 2019d): It consists of 70 retinal image pairs with non-rigid transformation. Due to the different modalities between images, ambiguous putative matches are generated, resulting in a small number of correct matches and a low inlier ratio. The average number of matches and the inlier rate are 158.4 and 41.56%, respectively.

RemoteSensing (Ma et al. 2019d): There are 161 remote sensing image pairs including color-infrared, SAR, and panchromatic photographs. The feature matching task for such image pairs typically arises in image-based positioning as well as navigation and change detection. The average number of matches and the inlier rate are 767.6 and 68.50%, respectively.

VGG (Mikolajczyk and Schmid 2005): It contains 40 image pairs either of planar scenes or captured by a camera in a fixed position during acquisition. Hence, the image transformation can be precisely described by a homography. The ground truth homographies are included in the dataset.

These abovementioned datasets are collected and made available online.3 In addition, a small UAV image registration dataset (SUIRD) is also provided for image registration or matching research. This dataset includes 60 pairs of low-altitude remote sensing images captured by a small UAV and their ground truth. The image pairs contain viewpoint changes in horizontal, vertical, mixed and extreme patterns, which produce problems of low overlap, image distortion and severe outliers.4

Throughout the experiments, we use three evaluation metrics: precision, recall, and F-score. Given the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), the precision is obtained by

P = TP / (TP + FP).    (5)

The recall is given as follows:

R = TP / (TP + FN).    (6)

The F-score, as a summary statistic of precision and recall, is obtained as follows:

F = 2 × P × R / (P + R).    (7)

The mismatch removal methods include: RANSAC (Fischler and Bolles 1981) (abbreviated as RS), SM (Leordeanu and Hebert 2005), ICF (Li and Hu 2010), GS (Liu and Yan 2010), LO-RANSAC (Lebeda et al. 2012) (abbreviated as LRS), VFC (Ma et al. 2014), LPM (Ma et al. 2019d), GMS (Bian et al. 2017), and LFGC (Moo Yi et al. 2018).

Figure 3 shows the performance on the five datasets evaluated by precision, recall, F-score, and runtime with cumulative distributions. In addition, the average values of each statistic are summarized in Table 1 for a more straightforward comparison. The graph matching methods, SM and GS, show relatively weak performance, since the graphical model, albeit with strong generality, only exploits shallow pairwise geometric constraints. The random sampling methods, RS and LRS, hold the key assumption that the image pairs are related by parametric models. This assumption seems to work well on these datasets; however, their time costs are not favorable. The non-parametric interpolation method VFC is relatively robust and outperforms ICF. However, its computational cost is higher than that of some other strong competitors, e.g., LPM. LPM is simple to implement. It utilizes a more relaxed geometric constraint, yet it achieves surprisingly excellent performance and becomes the best performer considering the time cost. Compared with GMS, it obtains much better performance with only a slight increase in runtime. The recent trend has suggested a deep learning paradigm for differentiating mismatches, e.g., LFGC. LFGC has proven to be much more effective than the traditional methods.

3 https://github.com/StaRainJ/Imgae_matching_Datasets.
4 https://github.com/yyangynu/SUIRD/tree/master/SUIRD_v2.2.
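To make the protocol above concrete, the following minimal sketch (Python/NumPy; the function and variable names are illustrative, not those of our actual evaluation code) computes the per-image-pair precision, recall, and F-score of Eqs. (5)-(7) from a predicted inlier mask and the ground-truth labels of the putative matches.

```python
import numpy as np

def match_statistics(pred_inlier, gt_inlier):
    """Precision, recall and F-score of a mismatch removal result on one
    image pair, following Eqs. (5)-(7).

    pred_inlier : boolean array, True where the method keeps a putative match.
    gt_inlier   : boolean array, True where the putative match is a true match.
    """
    pred_inlier = np.asarray(pred_inlier, dtype=bool)
    gt_inlier = np.asarray(gt_inlier, dtype=bool)
    tp = np.sum(pred_inlier & gt_inlier)     # kept and correct
    fp = np.sum(pred_inlier & ~gt_inlier)    # kept but actually a mismatch
    fn = np.sum(~pred_inlier & gt_inlier)    # true match that was discarded
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall > 0 else 0.0)
    return precision, recall, f_score
```

The per-pair values returned by such a routine are averaged over all image pairs of a dataset to obtain the statistics reported in Table 1.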

Table 1 Quantitative performance of the state-of-the-art mismatch removal algorithms on the introduced five datasets

Dataset        Metric  RS       SM       ICF      GS       LRS      VFC      LPM      GMS      LFGC
DAISY          P       97.52    92.95    94.71    96.16    97.45    97.42    98.29    92.51    97.63
DAISY          R       87.99    85.32    37.67    74.59    93.83    92.48    96.10    85.65    43.89
DAISY          F       92.31    88.37    49.21    81.83    95.51    94.72    97.16    88.71    59.90
DAISY          T       8.37e1   4.60e2   1.09e4   8.71e4   1.33e2   5.63e1   2.52e1   5.54e0   8.02e0
DTU            P       94.65    82.26    95.62    92.73    93.79    93.07    95.52    88.01    89.34
DTU            R       87.94    93.14    50.59    74.31    92.10    91.41    95.21    81.85    49.19
DTU            F       90.92    85.65    62.57    80.05    92.76    91.98    95.28    83.94    61.22
DTU            T       1.92e2   5.46e1   9.20e2   1.39e4   1.19e2   4.67e2   1.21e1   1.06e0   7.01e0
Retina         P       94.82    75.75    70.90    72.80    94.57    94.76    93.97    79.25    96.96
Retina         R       92.82    99.62    84.14    97.03    97.09    94.39    90.22    69.45    81.53
Retina         F       93.63    85.14    71.62    80.32    95.73    94.42    91.52    73.32    88.22
Retina         T       1.65e2   2.97e0   3.48e1   5.79e2   5.19e1   5.75e0   2.30e0   7.5e-1   7.02e0
RemoteSensing  P       99.07    85.31    88.93    99.20    99.02    99.39    98.12    97.31    42.22
RemoteSensing  R       98.12    93.14    60.37    87.56    99.55    98.17    98.81    88.53    36.90
RemoteSensing  F       98.53    86.13    63.03    92.44    99.27    98.76    98.37    92.08    38.02
RemoteSensing  T       5.98e2   8.70e1   1.39e3   1.81e4   4.65e2   6.08e2   1.24e1   1.09e0   8.55e0
VGG            P       97.34    95.33    92.35    96.62    96.48    98.03    97.04    96.12    94.86
VGG            R       98.65    87.92    50.78    95.31    98.13    97.50    85.79    82.27    38.43
VGG            F       97.88    90.38    62.17    94.91    97.25    97.55    87.17    85.58    53.43
VGG            T       1.01e2   4.38e1   9.27e2   6.33e4   7.56e1   1.44e2   1.08e1   1.14e0   4.07e1

The average statistics of precision (P), recall (R) and F-score (F) in percentage and runtime (T) in milliseconds (scientific notation) are reported for each dataset. Methods: RS = RANSAC (Fischler and Bolles 1981), SM (Leordeanu and Hebert 2005), ICF (Li and Hu 2010), GS (Liu and Yan 2010), LRS = LO-RANSAC (Lebeda et al. 2012), VFC (Ma et al. 2014), LPM (Ma et al. 2019d), GMS (Bian et al. 2017), LFGC (Moo Yi et al. 2018)
Fig. 4 2-D shape contours used in our experiments, from left to right, fish, whale, fu, beijing

Fig. 5 The bunny and wolf patterns of 3-D point clouds used in our experiments

However, in our case, it has a restricted performance with low recall and high precision, resulting in its failure on RemoteSensing. This finding indicates that the learning methods are data-dependent with limited generality.

6.3 Results on Point Set Registration

The experiments for point set registration consist of two parts: non-rigid registration with 2-D shape contour data and rigid registration with 3-D point cloud data. In the 2-D case, six representative methods, namely, TPS-RPM (Chui and Rangarajan 2003), GMM (Jian and Vemuri 2011), CPD (Myronenko and Song 2010), L2E (Ma et al. 2013b), PR-GLS (Ma et al. 2015a), and APM (Lian et al. 2017), are evaluated. In the 3-D case, the rigid versions of GMM and CPD as well as ICP (Besl and McKay 1992) and GoICP (Yang et al. 2016) are evaluated. The experiments of this part are performed on a desktop with a 3.4 GHz Intel Core i5-7500 CPU and 8 GB memory.

The point data are normalized as inputs, thus allowing the use of a fixed threshold to evaluate the registration performance. Specifically, a point is accurately aligned if its distance to the ground truth corresponding point is below a given threshold. Thus, we can define the accuracy of registration as the percentage of accurately aligned points. In our experiment, the threshold is empirically set to 0.1. Four patterns are collected to evaluate the non-rigid 2-D registration results, as shown in Fig. 4. We also create five deformed shapes for each pattern as the data to be registered, generating a total of 20 instances. We further conduct noise, outlier, and rotation experiments on these instances. For the 3-D case, as shown in Fig. 5, two patterns are used, and we exert random rotations to create 20 instances for each pattern. Noise and outlier experiments are also conducted on these 40 instances.

The results of non-rigid 2-D registration are presented in Fig. 6. The outlier experiments of APM are excluded due to its prohibitive runtime with the increase in data points. The experimental setting is relatively challenging, and the weaknesses of each method have emerged. For instance, TPS-RPM is generally robust to outliers, but it can be degraded in the case of severe noise. CPD and GMM have similar performances and are sensitive to outliers. L2E and PR-GLS utilize the information of the shape context descriptor to guide the registration, but their performances are unstable. APM can only deal with affine deformation, thus leading to its inferior performance. However, compared to the other methods, which are only locally convergent and fail to handle violent rotations, APM is invariant to rotation owing to its global optimality.

The results of rigid 3-D point cloud registration are presented in Fig. 7. In our random rotation settings, the locally convergent methods, i.e., GMM, CPD, and ICP, fail to accurately register the point clouds. In this regard, the globally optimal method, GoICP, outperforms them by a large margin.
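As a rough illustration of this accuracy metric, the sketch below (Python/NumPy; the names and the normalization assumption are ours, not taken from any particular registration toolbox) computes the percentage of accurately aligned points for a registered point set.

```python
import numpy as np

def registration_accuracy(registered_pts, gt_pts, threshold=0.1):
    """Percentage of accurately aligned points: a point counts as aligned if
    its distance to the ground-truth corresponding point is below `threshold`
    (point sets are assumed to be normalized beforehand, as in the text)."""
    registered_pts = np.asarray(registered_pts, dtype=float)
    gt_pts = np.asarray(gt_pts, dtype=float)
    dist = np.linalg.norm(registered_pts - gt_pts, axis=1)  # per-point residual
    return float(np.mean(dist < threshold))
```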

Fig. 6 Quantitative evaluation of non-rigid 2-D shape contour registration (average accuracy of TPS-RPM, GMM, CPD, L2E, PR-GLS and APM versus noise level, degree of rotation and outlier ratio, and runtime as a cumulative distribution)

Fig. 7 Quantitative evaluation of rigid 3-D point cloud registration (average accuracy of GMM, CPD, ICP and GoICP versus noise level and outlier ratio, and runtime as a cumulative distribution)

6.4 Results on Graph Matching

Graph matching represents an alternative means to establish correspondences between two feature sets. Here, we evaluate seven state-of-the-art methods in the literature, namely, SM (Leordeanu and Hebert 2005), SMAC (Cour et al. 2007), IPFP (Leordeanu et al. 2009), RRWM (Cho et al. 2010), TM (Duchenne et al. 2011), GNCCP (Liu and Qiao 2014), and FGM (Zhou and De la Torre 2015), on several extensively used and publicly available datasets. These datasets include the CMU house sequence (Cho et al. 2010; Zhou and De la Torre 2015), the car and motorbike dataset (Zhou and De la Torre 2015; Leordeanu et al. 2012), and the Chinese character dataset (Liu and Qiao 2014; Zhang et al. 2016). The experiments of this part are performed on a desktop with a 3.4 GHz Intel Core i5-7500 CPU and 8 GB memory.

The CMU house sequence consists of 111 images of a toy house captured from different viewpoints. Each image has 30 manually marked landmark points with known correspondences. We match all images spaced by 5, 10, . . . , 110 frames and compute the average performance per separation gap; larger gaps indicate more challenging scenes due to the increasing perspective changes. We build the graph using Delaunay triangulation and construct the affinity matrix simply from the edge distances as in Zhou and De la Torre (2015), except for TM, which is of higher order. Different from the original equal-size 30-node to 30-node matching, we also remove some nodes and conduct unequal-size matching experiments with the settings of 25 versus 30 and 20 versus 30 on this dataset to test the robustness of these algorithms, as presented in Fig. 8. The figure shows that in equal-size matching, most GM methods can achieve near-optimal performance, except for the spectral relaxed baselines. For unequal-size matching, a performance gap emerges. In summary, FGM achieves the best performance at the highest time cost, and RRWM is the most balanced algorithm, which is only inferior to FGM in accuracy but is much more efficient.

The car and motorbike dataset consists of 30 pairs of car images and 20 pairs of motorbike images obtained from the PASCAL challenges (Everingham et al. 2010). Each pair contains 30–60 ground-truth correspondences. We consider the most general graph, wherein the edges are directed and the edge features are asymmetrical. Similarly, the graph is built with Delaunay triangulation, and the affinity matrix is constructed as in Zhou and De la Torre (2015) except for TM. To test the robustness to outliers, 2 to 20 outliers are randomly selected from the background. As shown in Fig. 9, the path following algorithms, i.e., GNCCP and FGM, outperform all other methods except TM, albeit with the highest time costs. RRWM remains competitive with high accuracy and low runtime. The higher-order TM has achieved remarkable, consistently optimal performance in this experiment. Moreover, its runtime is reasonable due to the random sampling strategy adopted for constructing the third-order affinity matrix.
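For reference, the following sketch illustrates the kind of pairwise setup described above: Delaunay edges define the graphs, the affinity between two candidate assignments rewards consistent edge lengths, and a spectral relaxation in the spirit of SM (Leordeanu and Hebert 2005) is discretized greedily. It is a simplified, unoptimized Python illustration with assumed parameter values (e.g., sigma) and illustrative names, not the exact implementation used in our experiments.

```python
import numpy as np
from scipy.spatial import Delaunay

def delaunay_edge_lengths(pts):
    """Undirected Delaunay edges of a 2-D point set, mapped to their lengths."""
    pts = np.asarray(pts, dtype=float)
    edges = set()
    for tri in Delaunay(pts).simplices:
        for a, b in ((tri[0], tri[1]), (tri[1], tri[2]), (tri[0], tri[2])):
            edges.add((min(a, b), max(a, b)))
    return {(i, j): np.linalg.norm(pts[i] - pts[j]) for (i, j) in edges}

def spectral_matching(pts1, pts2, sigma=0.5, iters=50):
    """SM-style matching: power iteration for the leading eigenvector of the
    assignment affinity matrix, followed by greedy one-to-one discretization."""
    n1, n2 = len(pts1), len(pts2)
    cand = [(i, a) for i in range(n1) for a in range(n2)]  # candidate assignments
    e1, e2 = delaunay_edge_lengths(pts1), delaunay_edge_lengths(pts2)
    K = np.zeros((len(cand), len(cand)))
    for p, (i, a) in enumerate(cand):
        for q, (j, b) in enumerate(cand):
            if i == j or a == b:
                continue                                   # conflicting assignments
            d1 = e1.get((min(i, j), max(i, j)))
            d2 = e2.get((min(a, b), max(a, b)))
            if d1 is not None and d2 is not None:          # both graphs have the edge
                K[p, q] = np.exp(-(d1 - d2) ** 2 / sigma ** 2)
    x = np.ones(len(cand)) / np.sqrt(len(cand))            # power iteration
    for _ in range(iters):
        x = K @ x
        norm = np.linalg.norm(x)
        if norm == 0:
            break
        x /= norm
    matches, used_i, used_a = [], set(), set()             # greedy discretization
    for p in np.argsort(-x):
        i, a = cand[p]
        if x[p] <= 0 or i in used_i or a in used_a:
            continue
        matches.append((i, a))
        used_i.add(i)
        used_a.add(a)
    return matches
```

Higher-order methods such as TM replace the pairwise affinity matrix by a (sampled) third-order tensor, while path-following methods such as GNCCP and FGM optimize the same objective under relaxations that are gradually tightened toward a permutation.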

Fig. 8 Quantitative evaluation on the CMU house dataset (accuracy and runtime of SM, SMAC, IPFP, RRWM, TM, GNCCP and FGM versus the frame-separation baseline). Top row (from left to right): illustration of equal-size matching with ground-truth correspondence, 30 versus 30 matching results and its runtime statistics. Bottom row (from left to right): example of unequal-size matching with ground-truth correspondence, 25 versus 30 matching results and 20 versus 30 matching results

Fig. 9 Quantitative evaluation on the car and motorbike dataset (accuracy and runtime of SM, SMAC, IPFP, RRWM, TM, GNCCP and FGM versus the number of outliers). Top row (from left to right): example of equal-size matching with ground-truth correspondence, car image matching results and the runtime statistics. Bottom row (from left to right): example of unequal-size matching with ground-truth correspondence, motorbike matching results and the runtime statistics

The direct comparison of pairwise and higher-order graph matching methods can be unfair, but the results still exhibit the efficacy of utilizing higher-order information in GM.

The Chinese character dataset has four hand-written Chinese characters with marked features, wherein each character has 10 samples. We create matching instances between all pairs of samples for each character, i.e., 45 instances each. The average performance is summarized in Table 2 and an example is shown in Fig. 10. The scene is relatively challenging, and we use simple edge distances to construct the affinity matrix, resulting in relatively low accuracy for all methods. However, the superior performers are still evident. FGM and TM perform similarly, but TM is more efficient.

Fig. 10 Examples of the Chinese character dataset

6.5 Results on Pose Estimation

Camera pose estimation aims to determine the position and orientation of the camera with respect to the object or scene, which is a significant step in 3-D computer vision tasks, such as SfM, SLAM, and visual localization for self-driving cars and augmented reality. Traditional approaches estimate the pose from a set of 2-D to 3-D matches between pixels in a query image and 3-D points in a scene model. However, the 3-D model is typically obtained via SfM, thus leading to potentially inaccurate pose estimates. To address this problem, one alternative is to use a set of 2-D to 2-D correspondences between two or more images of the same scene.

To estimate the camera pose, the putative sparse feature correspondences must first be constructed with an off-the-shelf feature matcher, such as SIFT. The most classical pipeline is the combination of SIFT and RANSAC, where the geometric model is estimated and converted into the relative camera pose, i.e., a rotation matrix and a translation vector. Many advanced handcrafted methods and trainable ones are considered as good options for their superior performance. Here, we integrate some typical mismatch removal methods between SIFT and RANSAC, while some learning-based methods can intrinsically output the transformation matrix from their networks, which can be directly used for this task. In addition, two different datasets, covering indoor and outdoor scenes, are used in this experiment. The performance is characterized by the mean average precision (mAP), as depicted in Table 3. The experiments of this part are performed on a server with a 2.00 GHz Intel Xeon CPU and 128 GB memory.

In the following, we briefly introduce the datasets and evaluation metrics to be used and provide quantitative comparisons and analyses.

Outdoor scenes. We adopt Yahoo's YFCC100M dataset (Thomee et al. 2016), which contains 100 million publicly accessible tourist photos from the Internet, subsequently curated into 72 image sequences for SfM. From this dataset, 68 sequences are selected as valid raw data. Next, we use visual SfM (Heinly et al. 2015) to recover the camera poses and generate the ground truth. This dataset is divided into disjoint subsets for training (60%), validation (20%), and testing (20%). For fairness, all learning-based methods are re-trained on the same training set.

Indoor scenes. We adopt the SUN3D dataset (Xiao et al. 2013), which is an RGBD video dataset with camera poses computed by generalized bundle adjustment. Specifically, all samples in this dataset are subsampled every 10 frames from videos of office-like scenes. This dataset is extremely challenging for sparse correspondence methods due to the few distinctive features, heavy repetitive elements, and substantial self-occlusions. Zhang et al. (2019b) reported that some sequences in this dataset do not provide camera poses. Thus, these sequences are dropped and 239 sequences are finally obtained as valid data. Similar to the outdoor scenes, the SUN3D dataset is split into disjoint subsets for training (60%), validation (20%), and testing (20%).

Evaluation Metrics. Once potential inliers are obtained, it is possible to efficiently estimate the rotation and translation vectors by RANSAC. The performance can be evaluated using the angular difference between the estimated and ground-truth vectors, i.e., the closest arc distance in degrees, as the error metric. First, a curve is generated by classifying each pose as accurate or not: the precision is computed with respect to a given angle threshold from 0° to 180°, and a normalized cumulative curve is built. Second, the area under the curve (AUC) is computed up to a maximum threshold of 5°, 10°, or 20°. Since the curve itself measures precision, its AUC can be regarded as the mAP metric.

Several traditional mismatch removal methods, i.e., GMS (Bian et al. 2017), ICF (Li and Hu 2010), LPM (Ma et al. 2019d), SM (Leordeanu and Hebert 2005) and VFC (Ma et al. 2014), are used for evaluation on the pose estimation task, in addition to two deep-learning-based methods, i.e., LFGC (Moo Yi et al. 2018) and OAN (Zhang et al. 2019b). For these methods, pose estimation results are obtained by a subsequent RANSAC procedure. In addition, plain RANSAC (Fischler and Bolles 1981) is also included for comparison. As shown in Table 3, on the adopted datasets, the performances of the traditional methods are very limited due to the dominant outliers. In contrast, the deep-learning-based methods significantly outperform the traditional methods and are resilient to the high outlier ratio.
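A minimal sketch of this error metric and the AUC computation is given below (Python/NumPy). It assumes the relative rotations and translation directions have already been recovered, e.g., from an essential matrix estimated with RANSAC, and the helper names are illustrative rather than part of any specific library.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle (degrees) between an estimated and ground-truth rotation."""
    cos = (np.trace(R_est @ R_gt.T) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error_deg(t_est, t_gt):
    """Closest arc distance (degrees) between two translation directions;
    the sign ambiguity of the recovered direction is ignored."""
    t_est = t_est / (np.linalg.norm(t_est) + 1e-12)
    t_gt = t_gt / (np.linalg.norm(t_gt) + 1e-12)
    cos = np.clip(np.abs(np.dot(t_est, t_gt)), 0.0, 1.0)
    return np.degrees(np.arccos(cos))

def pose_auc(errors_deg, max_threshold_deg=5.0, steps=100):
    """Area under the normalized cumulative error curve up to a maximum
    angular threshold (the mAP-style number reported in Table 3)."""
    errors = np.asarray(errors_deg, dtype=float)
    thresholds = np.linspace(0.0, max_threshold_deg, steps + 1)
    precision = [np.mean(errors <= t) for t in thresholds]   # cumulative curve
    return float(np.trapz(precision, thresholds) / max_threshold_deg)
```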

Table 2 Evaluation by average accuracy and runtime on the Chinese character dataset

Dataset             SM       SMAC     IPFP     RRWM     TM       GNCCP    FGM
Character1 (Acc)    0.2151   0.2690   0.3325   0.5548   0.7611   0.5508   0.6048
Character1 (Time)   0.0045   0.0293   0.0089   0.0408   0.2834   0.3138   2.1719
Character2 (Acc)    0.3449   0.4464   0.6580   0.8097   0.7729   0.8879   0.8986
Character2 (Time)   0.0033   0.0121   0.0064   0.0129   0.2300   0.1351   1.1638
Character3 (Acc)    0.2413   0.2595   0.3889   0.5151   0.9000   0.5040   0.6500
Character3 (Time)   0.0039   0.0284   0.0081   0.0343   0.2835   0.2932   1.9731
Character4 (Acc)    0.2077   0.2338   0.2879   0.5082   0.5787   0.4242   0.6116
Character4 (Time)   0.0033   0.0113   0.0062   0.0218   0.2290   0.2189   1.3926

Methods: SM (Leordeanu and Hebert 2005), SMAC (Cour et al. 2007), IPFP (Leordeanu et al. 2009), RRWM (Cho et al. 2010), TM (Duchenne et al. 2011), GNCCP (Liu and Qiao 2014), FGM (Zhou and De la Torre 2015)

Table 3 mAP performance of representative methods for pose estimation on YFCC100M and SUN3D datasets

Dataset     Threshold  GMS    ICF    LPM    SM     VFC    RANSAC  LFGC   OAN
YFCC100M    5°         1.61   3.71   6.26   3.77   7.79   9.08    47.98  52.18
YFCC100M    10°        3.30   7.76   11.65  7.79   13.87  14.28   –      –
YFCC100M    20°        7.04   16.13  21.79  16.48  24.13  22.80   –      –
SUN3D       5°         0.39   3.48   6.36   5.08   7.65   2.85    15.98  17.50
SUN3D       10°        1.22   6.65   11.14  9.13   12.54  5.61    –      –
SUN3D       20°        3.94   13.52  19.52  16.79  20.78  11.22   –      –

The threshold column gives the maximum angular error (in degrees) used for the AUC. Methods: GMS (Bian et al. 2017), ICF (Li and Hu 2010), LPM (Ma et al. 2019d), SM (Leordeanu and Hebert 2005), VFC (Ma et al. 2014), RANSAC (Fischler and Bolles 1981), LFGC (Moo Yi et al. 2018), OAN (Zhang et al. 2019b)
Table 4 Maximum recall rate (%) at 100% precision for loop-closure detection

Dataset       LPM    GMS    GS     SM     ICF    RANSAC  LORANSAC  VFC
Lip6Indoor    91.82  88.64  87.73  91.36  88.18  93.18   93.18     90.45
Lip6Outdoor   54.89  54.23  54.06  55.22  51.41  56.22   56.55     54.06
NewCollege    84.99  84.26  84.75  85.96  64.89  85.47   84.99     86.44
CityCentre    73.08  70.05  71.66  71.30  45.28  74.33   75.04     71.12

Methods: LPM (Ma et al. 2019d), GMS (Bian et al. 2017), GS (Liu and Yan 2010), SM (Leordeanu and Hebert 2005), ICF (Li and Hu 2010), RANSAC (Fischler and Bolles 1981), LORANSAC (Lebeda et al. 2012), VFC (Ma et al. 2014)

6.6 Results on Loop-Closure Detection

Appearance-based loop-closure detection is a fundamental component in visual SLAM. The essence involves recognizing previously visited areas of the environment. This task is crucial for reducing the drift of the estimated trajectory caused by the accumulative error and contributes to globally consistent mapping.

Appearance-based loop-closure detection only uses image similarity to identify previously visited places. This category commonly starts with the construction of a set of putative correspondences by a feature operator, such as SIFT, between the current image and each previously visited image. Then, the closed loop is determined on the basis of the number of accurate matches obtained using mismatch removal methods. This solution is simple but relatively effective.

However, the computational cost of directly performing feature matching between the current image and every previously visited image would be large. To ensure the real-time performance of loop-closure detection, we use a two-step approach, as sketched below. In the first step, loop-closure candidates are selected by the BoW method with a presupposed score threshold, which is fast and easy to implement. However, the BoW method only considers whether or not a feature exists and neglects the spatial arrangement of the features, thereby leading to the perceptual aliasing problem. Thus, in the second step, a robust feature matching algorithm is required to determine whether a loop-closure candidate is a true loop-closure event.
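The two-step scheme can be sketched as follows, using OpenCV-style SIFT matching and RANSAC homography verification for the second step; the BoW scoring, threshold values, and function names are illustrative assumptions rather than the exact implementation evaluated here.

```python
import cv2
import numpy as np

def is_loop_closure(img_query, img_candidate, min_inliers=20, ratio=0.8):
    """Step 2: geometric verification of a BoW loop-closure candidate by
    SIFT matching followed by RANSAC homography inlier counting."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img_query, None)
    kp2, des2 = sift.detectAndCompute(img_candidate, None)
    if des1 is None or des2 is None:
        return False
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = []
    for pair in matcher.knnMatch(des1, des2, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])                      # Lowe's ratio test
    if len(good) < 4:                                 # homography needs >= 4 matches
        return False
    src = np.float32([kp1[m.queryIdx].pt for m in good])
    dst = np.float32([kp2[m.trainIdx].pt for m in good])
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return mask is not None and int(mask.sum()) >= min_inliers

def detect_loops(query_idx, images, bow_scores, score_thresh=0.3):
    """Step 1 + Step 2: shortlist previously visited frames by BoW similarity,
    then keep only the candidates that pass geometric verification."""
    loops = []
    for j, score in enumerate(bow_scores):            # bow_scores[j]: similarity of
        if j == query_idx or score < score_thresh:    # frame j to the query frame
            continue
        if is_loop_closure(images[query_idx], images[j]):
            loops.append(j)
    return loops
```

In practice, any of the mismatch removal methods compared in Table 4 can replace the plain RANSAC homography test in the second step; only the inlier-counting decision changes.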
To evaluate the effectiveness and compare the performance of the loop-closure detection methods based on feature matching, we conduct extensive experiments on four different datasets, including NewCollege, CityCentre, Lip6Indoor, and Lip6Outdoor. The performance is characterized by the maximum recall that can be achieved at 100% precision, as shown in Table 4. The experiments are performed on a desktop with a 2.6 GHz Intel Core CPU and 16 GB memory.

The NewCollege and CityCentre datasets are obtained from the work of Cummins and Newman (2008). The NewCollege dataset contains 1,073 images with a size of 640 × 480, and the CityCentre dataset contains 1,237 images with a size of 640 × 480. The images were recorded by the vision system of a wheeled robotic platform while traversing 2.2 km through a college's campus grounds and adjoining parks with buildings, roads, gardens, cars, and people. The environment is outdoor and dynamic.

The Lip6Indoor and Lip6Outdoor datasets are obtained from Angeli et al. (2008). The Lip6Indoor dataset has 388 images with a size of 240 × 192; it is an indoor image sequence with a strong perceptual aliasing problem. The Lip6Outdoor dataset has 1,063 images with a size of 240 × 192; it is a long outdoor image sequence of a street with many buildings, cars, and people. Both image sequences are captured with a single monocular handheld camera. In addition, a binary matrix is defined as the ground truth for each dataset, whose rows and columns correspond to images at different time indices. Each element in this binary matrix denotes the presence (set to 1) or absence (set to 0) of a loop-closure event between the corresponding frame pair.

To generate consistent maps, the loop-closure detection module should obtain true positive detections to provide information for the back-end optimization, thereby reducing the drift of the estimated trajectory caused by the accumulative error. However, the loop-closure detection result must also include no false positive detections, as these can affect the performance of the full SLAM system and result in a completely inaccurate map. In summary, the loop-closure mechanism should work at 100% precision while maintaining a high recall rate. In such cases, the evaluation of loop-closure detection algorithms is performed in terms of precision-recall metrics. Here, precision is the ratio of the number of true positive loop-closure detections to the total number of positive loop-closure detections identified by the system, and recall is the ratio between the true positive loop-closure detections and the total actual loop-closure events defined by the ground truth of the dataset. We therefore focus on the maximum recall that can be achieved at 100% precision, indicating that the loop-closure detection result includes no false positive detections and avoids a negative influence on the full SLAM system.

Some of the representative mismatch removal methods are adopted for comparison in our experiment. The quantitative comparisons, with respect to the maximum recall rate at a precision of 100% on the different datasets, are presented in Table 4. From the results, we can see that the methods that pursue relaxed geometric constraints, i.e., LPM (Ma et al. 2019d), GMS (Bian et al. 2017), GS (Liu and Yan 2010), SM (Leordeanu and Hebert 2005), ICF (Li and Hu 2010) and VFC (Ma et al. 2014), are less favored in this task. In comparison, the resampling methods that exploit parametric models of the correspondences, i.e., RANSAC (Fischler and Bolles 1981) and LORANSAC (Lebeda et al. 2012), can give better results for loop-closure detection.

7 Conclusions and Future Trends

Image matching has played a significant role in various visual applications and has attracted considerable attention. Researchers have also achieved significant progress in this field in the past few decades. Therefore, we provide a comprehensive review of the existing image matching methods, from handcrafted to trainable ones, in order to provide better reference and understanding for the researchers in this community.

Image matching can be briefly classified into area- and feature-based matching. Area-based methods are used to achieve dense matching without detecting any salient feature points from the images. They are preferred in high-overlap image matching (such as medical image registration) and narrow-baseline stereo (such as binocular stereo matching). Deep learning-based techniques have drawn increasing attention for such a pipeline. Therefore, we provide a brief review of these types of methods in Sect. 4 and focus more on the learning-based ones.

Feature-based image matching can effectively address image matching problems with large viewpoint changes, wide baselines, and serious non-rigid deformations. It can be used in a pipeline of salient feature detection, discriminative description, and reliable matching, often including transformation model estimation. Following this procedure, feature detection can extract the distinctive structure from the image. Meanwhile, feature description may be regarded as an image representation method, which is widely used for image coding and similarity measurement. The matching step can be extended into different types of matching forms, such as graph matching, point set registration, descriptor matching and mismatch removal, as well as the matching task in 3-D cases. These are more flexible and applicable than area-based methods, thereby receiving considerable attention in the image matching area. Therefore, we review them following their development from traditional techniques to classical learning and deep learning. Moreover, to provide a comprehensive understanding of the significance of image matching, we introduce several applications related to image matching. We also provide comprehensive and objective comparisons and analyses of these classical and deep learning-based techniques through extensive experiments on representative datasets.

Despite the considerable development in both theory and performance, image matching remains an open problem with challenges for further efforts.

– The two-stage strategy for feature matching, which has been widely adopted in the literature, performs mismatch removal on only a small set of potential correspondences with sufficiently similar descriptors. However, this may lead to restricted performance in recall, which can be problematic for some scenarios.
– In a different scenario, correspondences are sought not between projections of physically the same points in different images, but between semantic analogs across different instances within a category. This requires new paradigms for feature matching in feature description and mismatch removal.
– Joint matching of multiple images has been proven to drastically boost the matching performance of pairwise matching and has attracted considerable attention in recent years. However, the complexity is still the main concern of the problem. Thus, practical and efficient algorithms are required.
– In recent years, deep learning schemes have rapidly evolved and shown tremendous improvements in many research fields related to computer vision. However, in the literature of feature matching, most works have applied deep learning techniques to feature detection and description. Thus, the potential capacity for accurate feature matching can be further explored in the future.
– Image matching among multi-modal images is still an unsolved problem. In the future, deep learning techniques can be used for better feature detection and description performance.
– Feature matching is a fundamental task in computer vision. However, its application has not been sufficiently explored. Thus, one promising research direction is to customize modern feature matching techniques to satisfy different requirements of practical vision tasks, e.g., SfM and SLAM.

Acknowledgements This work was partly supported by the National Natural Science Foundation of China under Grant Nos. 61773295 and 61972250, Natural Science Foundation of Hubei Province under Grant No. 2019CFA037, and National Key Research and Development Program of China under Grant No. 2018AAA0100704.

Compliance with ethical standards

Conflict of Interest The authors declare no conflict of interest.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Aanæs, H., Dahl, A. L., & Pedersen, K. S. (2012). Interesting interest points. International Journal of Computer Vision, 97(1), 18–35.
Aanæs, H., Jensen, R. R., Vogiatzis, G., Tola, E., & Dahl, A. B. (2016). Large-scale data for multiple-view stereopsis. International Journal of Computer Vision, 120(2), 153–168.

123
66 International Journal of Computer Vision (2021) 129:23–79

Adamczewski, K., Suh, Y., & Mu Lee, K. (2015). Discrete tabu search descriptors. In Proceedings of the IEEE conference on computer
for graph matching. In Proceedings of the IEEE international con- vision and pattern recognition, pp. 5173–5182.
ference on computer vision, pp. 109–117. Balntas, V., Riba, E., Ponsa, D., & Mikolajczyk, K. (2016b). Learning
Adams, W. P., & Johnson, T. A. (1994). Improved linear programming- local feature descriptors with triplets and shallow convolutional
based lower bounds for the quadratic assignment problem. neural networks. In Proceedings of the British machine vision con-
DIMACS Series in Discrete Mathematics and Theoretical Com- ference, pp. 1–11.
puter Science, 16, 43–77. Barath, D., & Matas, J. (2018). Graph-cut ransac. In Proceedings of the
Aflalo, Y., Dubrovina, A., & Kimmel, R. (2016). Spectral general- IEEE conference on computer vision and pattern recognition, pp.
ized multi-dimensional scaling. International Journal of Computer 6733–6741.
Vision, 118(3), 380–392. Barath, D., Ivashechkin, M., & Matas, J. (2019a). Progressive napsac:
Agrawal, M., Konolige, K., & Blas, M. R. (2008). Censure: Center Sampling from gradually growing neighborhoods. arXiv preprint
surround extremas for realtime feature detection and matching. In arXiv:1906.02295.
Proceedings of the European conference on computer vision, pp. Barath, D., Matas, J., & Noskova, J. (2019b). Magsac: Marginalizing
102–115. sample consensus. In Proceedings of the IEEE conference on com-
Alahi, A., Ortiz, R., & Vandergheynst, P. (2012). Freak: Fast retina key- puter vision and pattern recognition, pp. 10,197–10,205.
point. In Proceedings of the IEEE conference on computer vision Barath, D., Noskova, J., Ivashechkin, M., & Matas, J. (2020).
and pattern recognition, pp. 510–517. Magsac++, a fast, reliable and accurate robust estimator. In Pro-
Alcantarilla, P. F., Bartoli, A., & Davison, A. J. (2012) Kaze features. ceedings of the IEEE/CVF conference on computer vision and
In Proceedings of the European conference on computer vision, pattern recognition, pp. 1304–1312.
pp. 214–227. Barroso-Laguna, A., Riba, E., Ponsa, D., & Mikolajczyk, K. (2019).
Alcantarilla, P. F., & Solutions, T. (2011). Fast explicit diffusion for Key.net: Keypoint detection by handcrafted and learned CNN
accelerated features in nonlinear scale spaces. IEEE Transactions filters. In Proceedings of the IEEE international conference on
on Pattern Analysis and Machine Intelligence, 34(7), 1281–1298. computer vision, pp. 5836–5844.
Aldana-Iuit, J., Mishkin, D., Chum, O., & Matas, J. (2016). In the sad- Bay, H., Tuytelaars, T., & Van Gool, L. (2006). Surf: Speeded up robust
dle: Chasing fast and repeatable features. In Proceedings of the features. In Proceedings of the European conference on computer
international conference on pattern recognition, pp. 675–680. vision, pp. 404–417.
Almohamad, H., & Duffuaa, S. O. (1993). A linear programming Bay, H., Ess, A., Tuytelaars, T., & Van Gool, L. (2008). Speeded-up
approach for the weighted graph matching problem. IEEE Trans- robust features (surf). Computer Vision and Image Understanding,
actions on Pattern Analysis and Machine Intelligence, 15(5), 110(3), 346–359.
522–525. Bazin, J.C., Seo, Y., & Pollefeys, M. (2012). Globally optimal consensus
Amberg, B., Romdhani, S., & Vetter, T. (2007). Optimal step nonrigid set maximization through rotation search. In Proceedings of the
ICP algorithms for surface registration. In Proceedings of the IEEE Asian conference on computer vision, pp. 539–551.
conference on computer vision and pattern recognition, pp. 1–8. Bellavia, F., & Colombo, C. (2020). Is there anything new to say about
Angeli, A., Filliat, D., Doncieux, S., & Meyer, J. A. (2008). A fast and sift matching? International Journal of Computer Vision, 128(3),
incremental method for loop-closure detection using bags of visual 1847–1866.
words. In: IEEE transactions on robotics, pp. 1027–1037. Belongie, S., Malik, J., & Puzicha, J. (2001). Shape context: A new
Arandjelović, R., & Zisserman, A. (2012). Three things everyone should descriptor for shape matching and object recognition. In Advances
know to improve object retrieval. In Proceedings of the IEEE in neural information processing systems, pp. 831–837.
conference on computer vision and pattern recognition, pp. 2911– Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object
2918. recognition using shape contexts. IEEE Transactions on Pattern
Arar, M., Ginger, Y., Danon, D., Bermano, A. H., & Cohen-Or, D. Analysis and Machine Intelligence, 4, 509–522.
(2020). Unsupervised multi-modal image registration via geome- Bernard, F., Theobalt, C., & Moeller, M. (2018). Ds*: Tighter lifting-
try preserving image-to-image translation. In Proceedings of the free convex relaxations for quadratic matching problems. In
IEEE/CVF conference on computer vision and pattern recognition, Proceedings of the IEEE conference on computer vision and pat-
pp. 13,410–13,419. tern recognition, pp. 4310–4319.
Aubry, M., Schlickewei, U., & Cremers, D. (2011). The wave kernel Bernard, F., Thunberg, J., Swoboda, P., & Theobalt, C. (2019).
signature: A quantum mechanical approach to shape analysis. In Hippi: Higher-order projected power iterations for scalable multi-
Proceedings of the IEEE international conference on computer matching. In Proceedings of the IEEE international conference on
vision workshops, pp. 1626–1633. computer vision, pp. 10,284–10,293.
Avrithis, Y., & Tolias, G. (2014). Hough pyramid matching: Speeded-up Besl, P. J., & McKay, N. D. (1992). Method for registration of 3-d
geometry re-ranking for large scale image retrieval. International shapes. In Sensor fusion IV: Control paradigms and data struc-
Journal of Computer Vision, 107(1), 1–19. tures, Vol. 1611, pp. 586–607.
Awrangjeb, M., & Lu, G. (2008). Robust image corner detection based Bhattacharjee, D., & Roy, H. (2019). Pattern of local gravitational force
on the chord-to-point distance accumulation technique. IEEE (plgf): A novel local image descriptor. In IEEE transactions on
Transactions on Multimedia, 10(6), 1059–1072. pattern analysis and machine intelligence.
Awrangjeb, M., Lu, G., & Fraser, C. S. (2012). Performance compar- Bhowmik, A., Gumhold, S., Rother, C., & Brachmann, E. (2020). Rein-
isons of contour-based corner detectors. IEEE Transactions on forced feature points: Optimizing feature detection and description
Image Processing, 21(9), 4167–4179. for a high-level task. In Proceedings of the IEEE/CVF conference
Babai, L. (2018). Groups, graphs, algorithms: The graph isomorphism on computer vision and pattern recognition, pp. 4948–4957.
problem. In Proceedings of the international congress of mathe- Bian, J., Lin, W. Y., Matsushita, Y., Yeung, S. K., Nguyen, T. D., &
maticians, pp. 3319–3336. Cheng, M. M. (2017). Gms: Grid-based motion statistics for fast,
Balntas, V., Johns, E., Tang, L., & Mikolajczyk, K. (2016a). Pn-net: ultra-robust feature correspondence. In Proceedings of the IEEE
Conjoined triple deep network for learning local image descriptors. conference on computer vision and pattern recognition, pp. 4181–
arXiv preprint arXiv:1601.05030. 4190.
Balntas, V., Lenc, K., Vedaldi, A., & Mikolajczyk, K. (2017). Hpatches: Biasotti, S., Cerri, A., Bronstein, A., & Bronstein, M. (2016). Recent
A benchmark and evaluation of handcrafted and learned local trends, applications, and perspectives in 3d shape similarity assess-

123
International Journal of Computer Vision (2021) 129:23–79 67

ment. In Computer graphics forum, Vol. 35, Wiley Online Library, descriptors. In Computer graphics forum, Vol. 27, Wiley Online
pp. 87–119. Library, pp. 643–652.
Blais, G., & Levine, M. D. (1995). Registering multiview range data to Chang, J. R., & Chen, Y. S. (2018). Pyramid stereo matching network.
create 3d computer objects. IEEE Transactions on Pattern Analysis In Proceedings of the IEEE conference on computer vision and
and Machine Intelligence, 17(8), 820–824. pattern recognition, pp. 5410–5418.
Bonny, M. Z., & Uddin, M. S. (2016). Feature-based image stitching Chang, W., & Zwicker, M. (2009). Range scan registration using
algorithms. In Proceedings of the international workshop on com- reduced deformable models. In Computer graphics forum, Vol. 28,
putational intelligence, pp. 198–203. Wiley Online Library, pp. 447–456.
Boscaini, D., Masci, J., Melzi, S., Bronstein, M. M., Castellani, U., Chang, M. C., & Kimia, B. B. (2011). Measuring 3d shape similarity
& Vandergheynst, P. (2015). Learning class-specific descriptors by graph-based matching of the medial scaffolds. Computer Vision
for deformable shapes using localized spectral convolutional net- and Image Understanding, 115(5), 707–720.
works. In Computer graphics forum, Vol. 34, Wiley Online Library, Chen, Q., & Koltun, V. (2015). Robust nonrigid registration by convex
pp. 13–23. optimization. In Proceedings of the IEEE international conference
Boscaini, D., Masci, J., Rodolà, E., & Bronstein, M. (2016). Learning on computer vision, pp. 2039–2047.
shape correspondence with anisotropic convolutional neural net- Chen, Y., Guibas, L., & Huang, Q. (2014). Near-optimal joint object
works. In Advances in neural information processing systems, pp. matching via convex relaxation. In Proceedings of the interna-
3189–3197. tional conference on machine learning, pp. 100–108.
Brachmann, E., & Rother, C. (2019). Neural-guided RANSAC: Learn- Chen, Y. C., Huang, P. H., Yu, L. Y., Huang, J. B., Yang, M. H., & Lin,
ing where to sample model hypotheses. In Proceedings of the IEEE Y. Y. (2018). Deep semantic matching with foreground detection
international conference on computer vision, pp. 4322–4331. and cycle-consistency. In Proceedings of the Asian conference on
Brachmann, E., Krull, A., Nowozin, S., Shotton, J., Michel, F., computer vision, pp. 347–362.
Gumhold, S., & Rother, C. (2017). Dsac-differentiable RANSAC Chen, J., Kellokumpu, V., Zhao, G., & Pietikäinen, M. (2013). Rlbp:
for camera localization. In Proceedings of the IEEE conference on Robust local binary pattern. In Proceedings of the British machine
computer vision and pattern recognition, pp. 6684–6692. vision conference.
Bronstein, M. M., & Kokkinos, I. (2010). Scale-invariant heat kernel Chen, J., Wang, L., Li, X., & Fang, Y. (2019). Arbicon-net: Arbitrary
signatures for non-rigid shape recognition. In Proceedings of the continuous geometric transformation networks for image registra-
IEEE conference on computer vision and pattern recognition, pp. tion. In Advances in neural information processing systems, pp.
1704–1711. 3410–3420.
Bronstein, A. M., Bronstein, M. M., & Kimmel, R. (2006). General- Chen, H. M., Arora, M. K., & Varshney, P. K. (2003a). Mutual
ized multidimensional scaling: a framework for isometry-invariant information-based image registration for remote sensing data.
partial surface matching. Proceedings of the National Academy of International Journal of Remote Sensing, 24(18), 3701–3706.
Sciences, 103(5), 1168–1172. Chen, H., & Bhanu, B. (2007). 3d free-form object recognition in range
Brown, M., Hua, G., & Winder, S. (2010). Discriminative learning of images using local surface patches. Pattern Recognition Letters,
local image descriptors. IEEE Transactions on Pattern Analysis 28(10), 1252–1262.
and Machine Intelligence, 33(1), 43–57. Chen, Q. S., Defrise, M., & Deconinck, F. (1994). Symmetric phase-
Brown, M., & Lowe, D. G. (2007). Automatic panoramic image stitch- only matched filtering of Fourier-Mellin transforms for image
ing using invariant features. International Journal of Computer registration and recognition. IEEE Transactions on Pattern Anal-
Vision, 74(1), 59–73. ysis and Machine Intelligence, 16(12), 1156–1168.
Caelli, T., & Kosinov, S. (2004). An eigenspace projection clustering Chen, C., Li, Y., Liu, W., & Huang, J. (2015). Sirf: Simultaneous satel-
method for inexact graph matching. IEEE Transactions on Pattern lite image registration and fusion in a unified framework. IEEE
Analysis and Machine Intelligence, 26(4), 515–519. Transactions on Image Processing, 24(11), 4213–4224.
Caetano, T. S., McAuley, J. J., Cheng, L., Le, Q. V., & Smola, A. J. Chen, J., Shan, S., He, C., Zhao, G., Pietikainen, M., Chen, X., et al.
(2009). Learning graph matching. IEEE Transactions on Pattern (2009). Wld: A robust local image descriptor. IEEE Transactions
Analysis and Machine Intelligence, 31(6), 1048–1058. on Pattern Analysis and Machine Intelligence, 32(9), 1705–1720.
Cai, H., Mikolajczyk, K., & Matas, J. (2010). Learning linear discrimi- Chen, J., Tian, J., Lee, N., Zheng, J., Smith, R. T., & Laine, A. F. (2010).
nant projections for dimensionality reduction of image descriptors. A partial intensity invariant feature descriptor for multimodal
IEEE Transactions on Pattern Analysis and Machine Intelligence, retinal image registration. IEEE Transactions on Biomedical Engi-
33(2), 338–352. neering, 57(7), 1707–1718.
Calonder, M., Lepetit, V., Strecha, C., & Fua, P. (2010). Brief: Binary Chen, H. M., Varshney, P. K., & Arora, M. K. (2003b). Perfor-
robust independent elementary features. In Proceedings of the mance of mutual information similarity measure for registration
European conference on computer vision, pp. 778–792. of multitemporal remote sensing images. IEEE Transactions on
Campbell, D., & Petersson, L. (2015). An adaptive data representation Geoscience and Remote Sensing, 41(11), 2445–2454.
for robust point-set registration and merging. In Proceedings of Chertok, M., & Keller, Y. (2010). Efficient high order matching.
the IEEE international conference on computer vision, pp. 4292– IEEE Transactions on Pattern Analysis and Machine Intelligence,
4300. 32(12), 2205–2215.
Campbell, D., & Petersson, L. (2016). Gogma: Globally-optimal gaus- Chetverikov, D., Stepanov, D., & Krsek, P. (2005). Robust Euclidean
sian mixture alignment. In Proceedings of the IEEE conference on alignment of 3d point sets: The trimmed iterative closest point
computer vision and pattern recognition, pp. 5685–5694. algorithm. Image and Vision Computing, 23(3), 299–309.
Canny, J. (1987). A computational approach to edge detection. In Read- Cho, M., & Lee, K. M. (2012). Progressive graph matching: Making
ings in computer vision, Elsevier, pp. 184–203. a move of graphs via probabilistic voting. In Proceedings of the
Cao, S. Y., Shen, H. L., Chen, S. J., & Li, C. (2020). Boosting structure IEEE conference on computer vision and pattern recognition, pp.
consistency for multispectral and multimodal image registration. 398–405.
IEEE Transactions on Image Processing, 29, 5147–5162. Cho, M., Lee, J., & Lee, K. M. (2010). Reweighted random walks for
Castellani, U., Cristani, M., Fantoni, S., & Murino, V. (2008). Sparse graph matching. In Proceedings of the European conference on
points matching by combining 3d mesh saliency with statistical computer vision, pp. 492–505.

123
68 International Journal of Computer Vision (2021) 129:23–79

Chopra, S., Hadsell, R., LeCun, Y., et al. (2005). Learning a similar- DeTone, D., Malisiewicz, T., & Rabinovich, A. (2018). Superpoint:
ity metric discriminatively, with application to face verification. Self-supervised interest point detection and description. In Pro-
In Proceedings of the IEEE conference on computer vision and ceedings of the IEEE conference on computer vision and pattern
pattern recognition, pp. 539–546. recognition workshops, pp. 224–236.
Choy, C. B., Gwak, J., Savarese, S., & Chandraker, M. (2016). Univer- Dong, J., & Soatto, S. (2015). Domain-size pooling in local descrip-
sal correspondence network. In Advances in neural information tors: Dsp-sift. In Proceedings of the IEEE conference on computer
processing systems, pp. 2414–2422. vision and pattern recognition, pp. 5097–5106.
Choy, C., Lee, J., Ranftl, R., Park, J., & Koltun, V. (2020). High- Dorai, C., & Jain, A. K. (1997). Cosmos-a representation scheme for
dimensional convolutional networks for geometric pattern recog- 3d free-form objects. IEEE Transactions on Pattern Analysis and
nition. In Proceedings of the IEEE/CVF conference on computer Machine Intelligence, 19(10), 1115–1130.
vision and pattern recognition, pp. 11,227–11,236. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov,
Chui, H., & Rangarajan, A. (2003). A new point matching algorithm for V., Van Der Smagt, P., Cremers, D., & Brox, T. (2015). Flownet:
non-rigid registration. Computer Vision and Image Understanding, Learning optical flow with convolutional networks. In Proceed-
89(2–3), 114–141. ings of the IEEE international conference on computer vision, pp.
Chum, O., & Matas, J. (2005). Matching with prosac-progressive sam- 2758–2766.
ple consensus. In Proceedings of the IEEE conference on computer Duan, Y., Lu, J., Wang, Z., Feng, J., & Zhou, J. (2017). Learning deep
vision and pattern recognition, pp. 220–226. binary descriptor with multi-quantization. In Proceedings of the
Chum, O., Matas, J., & Kittler, J. (2003). Locally optimized ransac. In IEEE conference on computer vision and pattern recognition, pp.
Proceedings of the joint pattern recognition symposium, Springer, 1183–1192.
pp. 236–243. Duchenne, O., Bach, F., Kweon, I. S., & Ponce, J. (2011). A tensor-based
Churchill, D., & Vardy, A. (2013). An orientation invariant visual hom- algorithm for high-order graph matching. IEEE Transactions on
ing algorithm. Journal of Intelligent & Robotic Systems, 71(1), Pattern Analysis and Machine Intelligence, 33(12), 2383–2395.
3–29. Du, Q., Fan, A., Ma, Y., Fan, F., Huang, J., & Mei, X. (2018). Infrared
Cook, D. J., & Holder, L. B. (2006). Mining graph data. New York: and visible image registration based on scale-invariant piifd feature
Wiley. and locality preserving matching. IEEE Access, 6, 64107–64121.
Cour, T., Srinivasan, P., & Shi, J. (2007). Balanced graph matching. In Dusmanu, M., Rocco, I., Pajdla, T., Pollefeys, M., Sivic, J., Torii, A., &
Advances in neural information processing systems, pp. 313–320. Sattler, T. (2019). D2-net: A trainable cnn for joint description and
Cummins, M., & Newman, P. (2008). Fab-map: Probabilistic localiza- detection of local features. In Proceedings of the IEEE conference
tion and mapping in the space of appearance. The International on computer vision and pattern recognition, pp. 8092–8101.
Journal of Robotics Research, 27(6), 647–665. Dym, N., Maron, H., & Lipman, Y. (2017). Ds++: A flexible, scal-
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for able and provably tight relaxation for matching problems. arXiv
human detection. In Proceedings of the IEEE conference on com- preprint arXiv:1705.06148.
puter vision and pattern recognition, pp. 886–893. Egozi, A., Keller, Y., & Guterman, H. (2012). A probabilistic approach
Danelljan, M., Meneghetti, G., Shahbaz Khan, F., & Felsberg, M. to spectral graph matching. IEEE Transactions on Pattern Analysis
(2016). A probabilistic framework for color-based point set regis- and Machine Intelligence, 35(1), 18–27.
tration. In Proceedings of the IEEE conference on computer vision Elad, A., & Kimmel, R. (2003). On bending invariant signatures for
and pattern recognition, pp. 1818–1826. surfaces. IEEE Transactions on Pattern Analysis and Machine
Datar, M., Immorlica, N., Indyk, P., & Mirrokni, V. S. (2004). Locality- Intelligence, 25(10), 1285–1295.
sensitive hashing scheme based on p-stable distributions. In Elbaz, G., Avraham, T., & Fischer, A. (2017). 3d point cloud registration
Proceedings of the twentieth annual symposium on computational for localization using a deep neural network auto-encoder. In Pro-
geometry, pp. 253–262. ceedings of the IEEE conference on computer vision and pattern
Davison, A. J., Reid, I. D., Molton, N. D., & Stasse, O. (2007). recognition, pp. 4631–4640.
Monoslam: Real-time single camera slam. IEEE Transactions on Endres, F., Hess, J., Engelhard, N., Sturm, J., Cremers, D., & Burgard,
Pattern Analysis and Machine Intelligence, 6, 1052–1067. W. (2012). An evaluation of the rgb-d slam system. In Proceedings
Dawn, S., Saxena, V., & Sharma, B. (2010). Remote sensing image reg- of the IEEE international conference on robotics and automation,
istration techniques: A survey. In Proceedings of the international pp. 1691–1696.
conference on image and signal processing, pp. 103–112. Erin Liong, V., Lu, J., Wang, G., Moulin, P., & Zhou, J. (2015). Deep
de Vos, B. D., Berendsen, F. F., Viergever, M. A., Sokooti, H., Staring, hashing for compact binary codes learning. In Proceedings of the
M., & Isgum, I. (2019). A deep learning framework for unsuper- IEEE conference on computer vision and pattern recognition, pp.
vised affine and deformable image registration. Medical Image 2475–2483.
Analysis, 52, 128–143. Erlik Nowruzi, F., Laganiere, R., & Japkowicz, N. (2017). Homogra-
de Vos, B. D., Berendsen, F. F., Viergever, M. A., Staring, M., & Isgum, phy estimation from image pairs with hierarchical convolutional
I. (2017). End-to-end unsupervised deformable image registration networks. In Proceedings of the IEEE international conference on
with a convolutional neural network. In Deep learning in medi- computer vision, pp. 913–920.
cal image analysis and multimodal learning for clinical decision Evangelidis, G. D., & Horaud, R. (2018). Joint alignment of multiple
support, Springer, pp. 204–212. point sets with batch and incremental expectation-maximization.
Deng, H., Birdal, T., & Ilic, S. (2018). Ppfnet: Global context aware IEEE Transactions on Pattern Analysis and Machine Intelligence,
local features for robust 3d point matching. In Proceedings of the 40(6), 1397–1410.
IEEE conference on computer vision and pattern recognition, pp. Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisser-
195–205. man, A. (2010). The pascal visual object classes (voc) challenge.
Deng, H., Zhang, W., Mortensen, E., Dietterich, T., & Shapiro, L. International Journal of Computer Vision, 88(2), 303–338.
(2007). Principal curvature-based region detector for object recog- Fan, B., Kong, Q., Wang, X., Wanga, Z., Xiang, S., Pan, C., et al. (2019).
nition. In Proceedings of the IEEE conference on computer vision A performance evaluation of local features for image-based 3d
and pattern recognition, pp. 1–8. reconstruction. IEEE Transactions on Image Processing, 28(10),
DeTone, D., Malisiewicz, T., & Rabinovich, A. (2016). Deep image 4774–4789.
homography estimation. arXiv preprint arXiv:1606.03798.

123
International Journal of Computer Vision (2021) 129:23–79 69

Fan, B., Wu, F., & Hu, Z. (2011). Rotationally invariant descriptors using Giraldo, L. G. S., Hasanbelliu, E., Rao, M., & Principe, J. C. (2017).
intensity order pooling. IEEE Transactions on Pattern Analysis Group-wise point-set registration based on rényi’s second order
and Machine Intelligence, 34(10), 2031–2045. entropy. In Proceedings of the IEEE conference on computer vision
Ferrante, E., & Paragios, N. (2017). Slice-to-volume medical image and pattern recognition, pp. 2454–2462.
registration: A survey. Medical Image Analysis, 39, 101–123. Glaunes, J., Trouvé, A., & Younes, L. (2004). Diffeomorphic matching
Ferraz, L., & Binefa, X. (2012). A sparse curvature-based detector of of distributions: A new approach for unlabelled point-sets and sub-
affine invariant blobs. Computer Vision and Image Understanding, manifolds matching. In Proceedings of the IEEE conference on
116(4), 524–537. computer vision and pattern recognition, pp. 712–718.
Fey, M., Lenssen, J. E., Morris, C., Masci, J., & Kriege, N. M. (2020). Gojcic, Z., Zhou, C., Wegner, J. D., Guibas, L. J., & Birdal, T. (2020).
Deep graph matching consensus. In International conference on Learning multiview 3d point cloud registration. In Proceedings of
learning representations. the IEEE/CVF conference on computer vision and pattern recog-
Fischler, M. A., & Bolles, R. C. (1981). Random sample consensus: nition, pp. 1759–1769.
a paradigm for model fitting with applications to image analysis Gold, S., & Rangarajan, A. (1996). A graduated assignment algorithm
and automated cartography. Communications of the ACM, 24(6), for graph matching. IEEE Transactions on Pattern Analysis and
381–395. Machine Intelligence, 18(4), 377–388.
Fitzgibbon, A. W. (2003). Robust registration of 2d and 3d point sets. Gold, S., Rangarajan, A., Lu, C. P., Pappu, S., & Mjolsness, E. (1998).
Image and Vision Computing, 21(13–14), 1145–1153. New algorithms for 2d and 3d point matching: Pose estimation and
Flint, A., Dick, A., & Van Den Hengel, A. (2007). Thrift: Local 3d correspondence. Pattern Recognition, 31(8), 1019–1031.
structure recognition. In Proceedings of the biennial conference Golyanik, V., Aziz Ali, S., & Stricker, D. (2016). Gravitational approach
on digital image computing techniques and applications, pp. 182– for point set registration. In Proceedings of the IEEE conference
188. on computer vision and pattern recognition, pp. 5802–5810.
Fogel, F., Jenatton, R., Bach, F., & d’Aspremont, A. (2013). Convex Gong, Y., Kumar, S., Rowley, H. A., & Lazebnik, S. (2013). Learning
relaxations for permutation problems. In Advances in neural infor- binary codes for high-dimensional data using bilinear projections.
mation processing systems, pp. 1016–1024. In Proceedings of the IEEE conference on computer vision and
Foroosh, H., Zerubia, J. B., & Berthod, M. (2002). Extension of phase pattern recognition, pp. 484–491.
correlation to subpixel registration. IEEE Transactions on Image Gong, Y., Lazebnik, S., Gordo, A., & Perronnin, F. (2012). Iterative
Processing, 11(3), 188–200. quantization: A procrustean approach to learning binary codes for
Forssén, P. E. (2007). Maximally stable colour regions for recognition large-scale image retrieval. IEEE Transactions on Pattern Analysis
and matching. In Proceedings of the IEEE conference on computer and Machine Intelligence, 35(12), 2916–2929.
vision and pattern recognition, pp. 1–8. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D.,
Fragoso, V., Sen, P., Rodriguez, S., & Turk, M. (2013). Evsac: accel- Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adver-
erating hypotheses generation by modeling matching scores with sarial nets. In Advances in neural information processing systems,
extreme value theory. In Proceedings of the IEEE international pp. 2672–2680.
conference on computer vision, pp. 2472–2479. Granger, S., & Pennec, X. (2002). Multi-scale em-icp: A fast and robust
Frome, A., Huber, D., Kolluri, R., Bülow, T., & Malik, J. (2004). Rec- approach for surface registration. In Proceedings of the European
ognizing objects in range data using regional point descriptors. In conference on computer vision, pp. 418–432.
Proceedings of the European conference on computer vision, pp. Guo, Y., Bennamoun, M., Sohel, F., Lu, M., Wan, J., & Kwok, N. M.
224–237. (2016). A comprehensive performance evaluation of 3d local fea-
Gao, W., & Tedrake, R. (2019). Filterreg: Robust and efficient prob- ture descriptors. International Journal of Computer Vision, 116(1),
abilistic point-set registration using Gaussian filter and twist 66–89.
parameterization. In Proceedings of the IEEE conference on com- Guo, Y., Sohel, F., Bennamoun, M., Lu, M., & Wan, J. (2013). Rotational
puter vision and pattern recognition, pp. 11,095–11,104. projection statistics for 3d local surface description and object
Gauglitz, S., Höllerer, T., & Turk, M. (2011). Evaluation of interest point recognition. International Journal of Computer Vision, 105(1),
detectors and feature descriptors for visual tracking. International 63–86.
Journal of Computer Vision, 94(3), 335–360. Guo, Y., Sohel, F., Bennamoun, M., Wan, J., & Lu, M. (2015). A novel
Gay-Bellile, V., Bartoli, A., & Sayd, P. (2008). Direct estimation of local surface feature for 3d object recognition under clutter and
nonrigid registrations with image-based self-occlusion reasoning. occlusion. Information Sciences, 293, 196–213.
IEEE Transactions on Pattern Analysis and Machine Intelligence, Guo, Z., Zhang, L., & Zhang, D. (2010). A completed modeling of local
32(1), 87–104. binary pattern operator for texture classification. IEEE Transac-
Gelfand, N., Mitra, N. J., Guibas, L. J., & Pottmann, H. (2005). Robust tions on Image Processing, 19(6), 1657–1663.
global registration. In Symposium on geometry processing, Vol. 2, Gupta, R., Patil, H., & Mittal, A. (2010). Robust order-based methods
Vienna, Austria, p. 5. for feature description. In Proceedings of the IEEE conference on
Georgakis, G., Karanam, S., Wu, Z., Ernst, J., & Kosecká, J. (2018). computer vision and pattern recognition, pp. 334–341.
End-to-end learning of keypoint detector and descriptor for pose Han, X., Leung, T., Jia, Y., Sukthankar, R., & Berg, A. C. (2015).
invariant 3d matching. In Proceedings of the IEEE conference on Matchnet: Unifying feature and metric learning for patch-based
computer vision and pattern recognition, pp. 1965–1973. matching. In Proceedings of the IEEE conference on computer
Ghosh, D., & Kaabouch, N. (2016). A survey on image mosaicing vision and pattern recognition, pp. 3279–3286.
techniques. Journal of Visual Communication and Image Repre- Han, K., Rezende, R. S., Ham, B., Wong, K. Y. K., Cho, M., Schmid,
sentation, 34, 1–11. C., & Ponce, J. (2017). Scnet: Learning semantic correspondence.
Gil, A., Mozos, O. M., Ballesta, M., & Reinoso, O. (2010). A compara- In Proceedings of the IEEE international conference on computer
tive evaluation of interest point detectors and local descriptors for vision, pp. 1831–1840.
visual slam. Machine Vision and Applications, 21(6), 905–920. Harris, C. G., Stephens, M., et al. (1988). A combined corner and edge
Gionis, A., Indyk, P., Motwani, R., et al. (1999). Similarity search in detector. In Proceedings of the Alvey vision conference, pp. 147–
high dimensions via hashing. In Proceedings of the international 151.
conference on very large databases, pp. 518–529.

123
70 International Journal of Computer Vision (2021) 129:23–79

Hartmann, W., Havlena, M., & Schindler, K. (2014). Predicting match- Jiang, X., Ma, J., Jiang, J., & Guo, X. (2020a). Robust feature matching
ability. In Proceedings of the IEEE conference on computer vision using spatial clustering with heavy outliers. IEEE Transactions on
and pattern recognition, pp. 9–16. Image Processing, 29, 736–746.
Haskins, G., Kruger, U., & Yan, P. (2020). Deep learning in medical Jiang, B., Zhao, H., Tang, J., & Luo, B. (2014). A sparse nonnega-
image registration: A survey. Machine Vision and Applications, tive matrix factorization technique for graph matching problems.
31(1), 8. Pattern Recognition, 47(2), 736–747.
Hayat, N., & Imran, M. (2019). Ghost-free multi exposure image fusion Jian, B., & Vemuri, B. C. (2011). Robust point set registration using
technique using dense sift descriptor and guided filter. Journal of Gaussian mixture models. IEEE Transactions on Pattern Analysis
Visual Communication and Image Representation, 62, 295–308. and Machine Intelligence, 33(8), 1633–1645.
He, K., Lu, Y., & Sclaroff, S. (2018). Local descriptors optimized for Jin, Y., Mishkin, D., Mishchuk, A., Matas, J., Fua, P., Yi, K. M., &
average precision. In Proceedings of the IEEE conference on com- Trulls, E. (2020). Image matching across wide baselines: From
puter vision and pattern recognition, pp. 596–605. paper to practice. arXiv preprint arXiv:2003.01587.
Heikkilä, M., Pietikäinen, M., & Schmid, C. (2009). Description of Johnson, K., Cole-Rhodes, A., Zavorin, I., & Le Moigne, J. (2001).
interest regions with local binary patterns. Pattern Recognition, Mutual information as a similarity measure for remote sensing
42(3), 425–436. image registration. In Geo-spatial image and data exploitation II,
Heinly, J., Dunn, E., & Frahm, J. M. (2012). Comparative evaluation pp. 51–61.
of binary features. In Proceedings of the European conference on Johnson, A. E., & Hebert, M. (1999). Using spin images for efficient
computer vision, pp. 759–773. object recognition in cluttered 3d scenes. IEEE Transactions on
Heinly, J., Schonberger, J. L., Dunn, E., & Frahm, J. M. (2015). Recon- Pattern Analysis and Machine Intelligence, 21(5), 433–449.
structing the world* in six days*(as captured by the yahoo 100 Ke, Y., Sukthankar, R., et al. (2004). Pca-sift: A more distinctive repre-
million image dataset). In Proceedings of the IEEE conference on sentation for local image descriptors. In Proceedings of the IEEE
computer vision and pattern recognition, pp. 3287–3295. conference on computer vision and pattern recognition, pp. 506–
Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G., Konolige, 513.
K., & Navab, N. (2012). Model based training, detection and pose Kedem, D., Tyree, S., Sha, F., Lanckriet, G. R., & Weinberger, K. Q.
estimation of texture-less 3d objects in heavily cluttered scenes. (2012). Non-linear metric learning. In Advances in neural infor-
InProceedings of the Asian conference on computer vision, pp. mation processing systems, pp. 2573–2581.
548–562. Kezurer, I., Kovalsky, S. Z., Basri, R., & Lipman, Y. (2015). Tight
Horaud, R., Forbes, F., Yguel, M., Dewaele, G., & Zhang, J. (2011). relaxation of quadratic matching. In Computer graphics forum,
Rigid and articulated point registration with expectation condi- Vol. 34, Wiley Online Library, pp. 115–128.
tional maximization. IEEE Transactions on Pattern Analysis and Khoury, M., Zhou, Q. Y., & Koltun, V. (2017). Learning compact
Machine Intelligence, 33(3), 587–602. geometric features. In Proceedings of the IEEE international con-
Hu, N., Huang, Q., Thibert, B., & Guibas, L. J. (2018). Distributable ference on computer vision, pp. 153–161.
consistent multi-object matching. In Proceedings of the IEEE Kim, S., Lin, S., JEON, S. R., Min, D., & Sohn, K. (2018). Recurrent
conference on computer vision and pattern recognition, pp. 2463– transformer networks for semantic correspondence. In Advances
2471. in neural information processing systems, pp. 6126–6136.
Huang, Q. X., & Guibas, L. (2013). Consistent shape maps via semidef- Kim, V.G., Lipman, Y., & Funkhouser, T. (2011). Blended intrinsic
inite programming. In Computer graphics forum, Vol. 32, Wiley maps. In ACM transactions on graphics, Vol. 30, ACM, p. 79.
Online Library, pp. 177–186. Kim, V. G., Li, W., Mitra, N. J., DiVerdi, S., & Funkhouser, T. A. (2012).
Huang, X., Cheng, X., Geng, Q., Cao, B., Zhou, D., Wang, P., Lin, Y., & Exploring collections of 3d models using fuzzy correspondences.
Yang, R. (2018). The apolloscape dataset for autonomous driving. ACM Transactions on Graphics, 31(4), 54–1.
In Proceedings of the IEEE conference on computer vision and Kimmel, R., Zhang, C., Bronstein, A., & Bronstein, M. (2011). Are mser
pattern recognition workshops, pp. 954–960. features really interesting? IEEE Transactions on Pattern Analysis
Huang, D., Shan, C., Ardabilian, M., Wang, Y., & Chen, L. (2011). and Machine Intelligence, 33(11), 2316–2320.
Local binary patterns and its application to facial image analysis: Kim, S., Min, D., Lin, S., & Sohn, K. (2020). Discrete-continuous trans-
a survey. IEEE Transactions on Systems, Man, and Cybernetics, formation matching for dense semantic correspondence. IEEE
Part C (Applications and Reviews), 41(6), 765–781. Transactions on Pattern Analysis and Machine Intelligence, 42(1),
Iglesias, J. P., Olsson, C., & Kahl, F. (2020). Global optimality for point 59–73.
set registration using semidefinite programming. In Proceedings of Klein, S., Staring, M., & Pluim, J. P. (2007). Evaluation of optimiza-
the IEEE/CVF conference on computer vision and pattern recog- tion methods for nonrigid medical image registration using mutual
nition, pp. 8287–8295. information and b-splines. IEEE Transactions on Image Process-
Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). Spatial trans- ing, 16(12), 2879–2890.
former networks. In Advances in neural information processing Kluger, F., Brachmann, E., Ackermann, H., Rother, C., Yang, M. Y.,
systems, pp. 2017–2025. & Rosenhahn, B. (2020). Consac: Robust multi-model fitting by
Jégou, H., Douze, M., & Schmid, C. (2010). Improving bag-of-features conditional sample consensus. In Proceedings of the IEEE/CVF
for large scale image search. International Journal of Computer conference on computer vision and pattern recognition, pp. 4634–
Vision, 87(3), 316–336. 4643.
Jiang, B., Tang, J., Ding, C., & Luo, B. (2017b). Binary constraint Komorowski, J., Czarnota, K., Trzcinski, T., Dabala, L., & Lynen, S.
preserving graph matching. In Proceedings of the IEEE conference (2018). Interest point detectors stability evaluation on apolloscape
on computer vision and pattern recognition, pp. 4402–4409. dataset. In Proceedings of the European conference on computer
Jiang, B., Tang, J., Ding, C., Gong, Y., & Luo, B. (2017a). Graph match- vision, pp. 727–739.
ing via multiplicative update algorithm. In Advances in neural Kovnatsky, A., Bronstein, M. M., Bresson, X., & Vandergheynst, P.
information processing systems, pp. 3187–3195. (2015). Functional correspondence by matrix completion. In Pro-
Jiang, Z., Wang, T., & Yan, J. (2020b). Unifying offline and online ceedings of the IEEE conference on computer vision and pattern
multi-graph matching via finding shortest paths on supergraph. In recognition, pp. 905–914.
IEEE transactions on pattern analysis and machine intelligence. Krebs, J., Mansi, T., Delingette, H., Zhang, L., Ghesu, F. C., Miao,
S., Maier, A. K., Ayache, N., Liao, R., & Kamen, A. (2017).

123
International Journal of Computer Vision (2021) 129:23–79 71

Robust non-rigid registration through agent-based action learning. Levi, G. (1973). A note on the derivation of maximal common subgraphs
In Proceedings of the international conference on medical image of two directed or undirected graphs. Calcolo, 9(4), 341.
computing and computer-assisted intervention, pp. 344–352. Li, H., & Hartley, R. (2007). The 3d–3d registration problem revisited.
Kulis, B., & Darrell, T. (2009). Learning to hash with binary reconstruc- In Proceedings of the IEEE international conference on computer
tive embeddings. In Advances in neural information processing vision, pp. 1–8.
systems, pp. 1042–1050. Li, X., Han, K., Li, S., & Prisacariu, V. A. (2020). Dual-resolution
Kulis, B., & Grauman, K. (2009). Kernelized locality-sensitive hashing correspondence networks. arXiv preprint arXiv:2006.08844.
for scalable image search. In Proceedings of the IEEE international Li, H., Shen, T., & Huang, X. (2009). Global optimization for alignment
conference on computer vision, pp. 2130–2137. of generalized shapes. In Proceedings of the IEEE conference on
Kumar, B., Carneiro, G., Reid, I., et al. (2016). Learning local image computer vision and pattern recognition, pp. 856–863.
descriptors with deep siamese and triplet convolutional networks Li, H., Sumner, R. W., & Pauly, M. (2008). Global correspondence
by minimising global loss functions. In Proceedings of the IEEE optimization for non-rigid registration of depth scans. In Computer
conference on computer vision and pattern recognition, pp. 5385– graphics forum, Vol. 27, Wiley Online Library, pp. 1421–1430.
5394. Lian, W., Zhang, L., & Yang, M. H. (2017). An efficient globally optimal
Laskar, Z., & Kannala, J. (2018). Semi-supervised semantic matching. algorithm for asymmetric point matching. IEEE Transactions on
In Proceedings of the European conference on computer vision Pattern Analysis and Machine Intelligence, 39(7), 1281–1293.
workshop, pp. 1–11. Liao, R., Miao, S., de Tournemire, P., Grbic, S., Kamen, A., Mansi,
Lawin, F. J., Danelljan, M., Khan, F., Forssén, P. E., & Felsberg, M. T., & Comaniciu, D. (2017). An artificial agent for robust image
(2018). Density adaptive point set registration. In Proceedings of registration. In Proceedings of the thirty-first AAAI conference on
the IEEE international conference on computer vision, pp. 3829– artificial intelligence, pp. 4168–4175.
3837. Liao, Q., Sun, D., & Andreasson, H. (2020). Point set registration
Lawler, E. L. (1963). The quadratic assignment problem. Management for 3d range scans using fuzzy cluster-based metric and efficient
Science, 9(4), 586–599. global optimization. In IEEE transactions on pattern analysis and
Lazaridis, G., & Petrou, M. (2006). Image registration using the Walsh machine intelligence.
transform. IEEE Transactions on Image Processing, 15(8), 2343– Li, X., & Hu, Z. (2010). Rejecting mismatches by correspondence func-
2357. tion. International Journal of Computer Vision, 89(1), 1–17.
Le Moigne, J., Campbell, W. J., & Cromp, R. F. (2002). An automated Li, Z., Mahapatra, D., Tielbeek, J. A., Stoker, J., van Vliet, L. J., & Vos,
parallel image registration technique based on the correlation of F. M. (2015). Image registration based on autocorrelation of local
wavelet features. IEEE Transactions on Geoscience and Remote structure. IEEE Transactions on Image Processing, 35(1), 63–75.
Sensing, 40(8), 1849–1864. Lin, W. Y. D., Cheng, M. M., Lu, J., Yang, H., Do, M. N., & Torr, P.
Lebeda, K., Matas, J., & Chum, O. (2012). Fixing the locally optimized (2014). Bilateral functions for global motion modeling. In Pro-
ransac–full experimental evaluation. In Proceedings of the British ceedings of the European conference on computer vision, pp.
machine vision conference, pp. 1–11. 341–356.
Lee, J., Cho, M., & Lee, K. M. (2010). A graph matching algorithm using Lin, W. Y., Liu, S., Jiang, N., Do, M. N., Tan, P., & Lu, J. (2016b).
data-driven markov chain monte carlo sampling. In Proceedings Repmatch: Robust feature matching and pose for reconstructing
of the international conference on pattern recognition, pp. 2816– modern cities. In Proceedings of the European conference on com-
2819. puter vision, pp. 562–579.
Lee, J., Cho, M., & Lee, K. M. (2011). Hyper-graph matching via Lin, W. Y., Liu, S., Matsushita, Y., Ng, T. T., & Cheong, L. F. (2011).
reweighted random walks. In Proceedings of the IEEE conference Smoothly varying affine stitching. In Proceedings of the IEEE con-
on computer vision and pattern recognition, pp. 1633–1640. ference on computer vision and pattern recognition, pp. 345–352.
Lee, S., Lim, J., & Suh, I. H. (2020). Progressive feature matching: Lin, K., Lu, J., Chen, C. S., & Zhou, J. (2016a). Learning compact
Incremental graph construction and optimization. In IEEE trans- binary descriptors with unsupervised deep neural networks. In Pro-
actions on image processing. ceedings of the IEEE conference on computer vision and pattern
Lê-Huu, D. K., & Paragios, N. (2017). Alternating direction graph recognition, pp. 1183–1192.
matching. In Proceedings of the IEEE conference on computer Lindeberg, T. (1998). Feature detection with automatic scale selection.
vision and pattern recognition, pp. 4914–4922. International Journal of Computer Vision, 30(2), 79–116.
Lenc, K., & Vedaldi, A. (2014). Large scale evaluation of local image Lin, W. Y., Wang, F., Cheng, M. M., Yeung, S. K., Torr, P. H., Do, M.
feature detectors on homography datasets. In Proceedings of the N., et al. (2017). Code: Coherence based decision boundaries for
British machine vision conference. feature correspondence. IEEE Transactions on Pattern Analysis
Lenc, K., & Vedaldi, A. (2016). Learning covariant feature detectors. and Machine Intelligence, 40(1), 34–47.
In Proceedings of the European conference on computer vision, Lipman, Y., & Funkhouser, T. (2009). Möbius voting for surface corre-
pp. 100–117. spondence. ACM Transactions on Graphics, 28(3), 72.
Leordeanu, M., & Hebert, M. (2005). A spectral technique for corre- Lipman, Y., Yagev, S., Poranne, R., Jacobs, D. W., & Basri, R. (2014).
spondence problems using pairwise constraints. In Proceedings of Feature matching with bounded distortion. ACM Transactions on
the IEEE international conference on computer vision, pp. 1482– Graphics, 33(3), 26.
1489. Litany, O., Remez, T., Rodolà, E., Bronstein, A., & Bronstein, M.
Leordeanu, M., Hebert, M., & Sukthankar, R. (2009). An integer pro- (2017). Deep functional maps: Structured prediction for dense
jected fixed point method for graph matching and map inference. shape correspondence. In Proceedings of the IEEE international
In Advances in neural information processing systems, pp. 1114– conference on computer vision, pp. 5659–5667.
1122. Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F.,
Leordeanu, M., Sukthankar, R., & Hebert, M. (2012). Unsupervised Ghafoorian, M., et al. (2017). A survey on deep learning in medical
learning for graph matching. International Journal of Computer image analysis. Medical Image Analysis, 42, 60–88.
Vision, 96(1), 28–45. Litman, R., & Bronstein, A. M. (2014). Learning spectral descriptors for
Leutenegger, S., Chli, M., & Siegwart, R. (2011). Brisk: Binary robust deformable shape correspondence. IEEE Transactions on Pattern
invariant scalable keypoints. In Proceedings of the IEEE interna- Analysis and Machine Intelligence, 36(1), 171–180.
tional conference on computer vision, pp. 2548–2555.

123
72 International Journal of Computer Vision (2021) 129:23–79

Liu, H., & Yan, S. (2010). Common visual pattern discovery via spatially Luo, Z., Shen, T., Zhou, L., Zhang, J., Yao, Y., Li, S., Fang, T., &
coherent correspondences. In Proceedings of the IEEE conference Quan, L. (2019). Contextdesc: Local descriptor augmentation with
on computer vision and pattern recognition, pp. 1609–1616. cross-modality context. In Proceedings of the IEEE conference on
Liu, Y., & Zhang, H. (2012). Indexing visual features: Real-time loop computer vision and pattern recognition, pp. 2527–2536.
closure detection using a tree structure. In Proceedings of the IEEE Luo, Z., Zhou, L., Bai, X., Chen, H., Zhang, J., Yao, Y., Li, S., Fang, T., &
international conference on robotics and automation, pp. 3613– Quan, L. (2020). Aslfeat: Learning local features of accurate shape
3618. and localization. In Proceedings of the IEEE/CVF conference on
Liu, Y., Feng, R., & Zhang, H. (2015a). Keypoint matching by outlier computer vision and pattern recognition, pp. 6589–6598.
pruning with consensus constraint. In Proceedings of the IEEE Ma, J., Zhao, J., Jiang, J., Zhou, H., Zhou, Y., Wang, Z., & Guo, X.
international conference on robotics and automation, pp. 5481– (2018b). Visual homing via guided locality preserving matching.
5486. In Proceedings of the IEEE international conference on robotics
Liu, W., Wang, J., Ji, R., Jiang, Y. G., & Chang, S. F. (2012a). Supervised and automation, pp. 7254–7261.
hashing with kernels. In Proceedings of the IEEE conference on Ma, J., Zhao, J., Tian, J., Tu, Z., & Yuille, A. L. (2013b). Robust esti-
computer vision and pattern recognition, pp. 2074–2081. mation of nonrigid transformation for point set registration. In
Liu, Y., Wang, C., Song, Z., & Wang, M. (2018b). Efficient global point Proceedings of the IEEE conference on computer vision and pat-
cloud registration by matching rotation invariant features through tern recognition, pp. 2147–2154.
translation search. In Proceedings of the European conference on Ma, J., Chen, C., Li, C., & Huang, J. (2016a). Infrared and visible
computer vision, pp. 448–463. image fusion via gradient transfer and total variation minimization.
Liu, R., Yang, C., Sun, W., Wang, X., & Li, H. (2020). Stereogan: Information Fusion, 31, 100–109.
Bridging synthetic-to-real domain gap by joint optimization of Maes, F., Collignon, A., Vandermeulen, D., Marchal, G., & Suetens,
domain translation and stereo matching. In Proceedings of the P. (1997). Multimodality image registration by maximization of
IEEE/CVF conference on computer vision and pattern recogni- mutual information. IEEE Transactions on Image, 16(2), 187–198.
tion, pp. 12,757–12,766. Mainali, P., Lafruit, G., Yang, Q., Geelen, B., Van Gool, L., & Lauw-
Liu, X., Ai, Y., Zhang, J., & Wang, Z. (2018a). A novel affine and con- ereins, R. (2013). Sifer: Scale-invariant feature detector with error
trast invariant descriptor for infrared and visible image registration. resilience. International Journal of Computer Vision, 104(2), 172–
Remote Sensing, 10(4), 658. 197.
Liu, Y., Chen, X., Peng, H., & Wang, Z. (2017). Multi-focus image Mair, E., Hager, G. D., Burschka, D., Suppa, M., & Hirzinger, G. (2010).
fusion with a deep convolutional neural network. Information Adaptive and generic corner detection based on the accelerated
Fusion, 36, 191–207. segment test. In Proceedings of the European conference on com-
Liu, H., Guo, B., & Feng, Z. (2005). Pseudo-log-polar Fourier transform puter vision, pp. 183–196.
for image registration. IEEE Signal Processing Letters, 13(1), 17– Maiseli, B., Gu, Y., & Gao, H. (2017). Recent developments and trends
20. in point set registration methods. Journal of Visual Communication
Liu, Y., Liu, S., & Wang, Z. (2015b). Multi-focus image fusion with and Image Representation, 46, 95–106.
dense sift. Information Fusion, 23, 139–155. Ma, J., Jiang, X., Jiang, J., Zhao, J., & Guo, X. (2019a). LMR: Learning
Liu, M., Pradalier, C., & Siegwart, R. (2013). Visual homing from scale a two-class classifier for mismatch removal. IEEE Transactions on
with an uncalibrated omnidirectional camera. IEEE Transactions Image Processing, 28(8), 4045–4059.
on Robotics, 29(6), 1353–1365. Ma, J., Jiang, J., Liu, C., & Li, Y. (2017a). Feature guided Gaussian
Liu, Z. Y., & Qiao, H. (2014). GNCCP–graduated nonconvexityand mixture model with semi-supervised em and local geometric con-
concavity procedure. IEEE Transactions on Pattern Analysis and straint for retinal image registration. Information Sciences, 417,
Machine Intelligence, 36(6), 1258–1267. 128–142.
Liu, Z. Y., Qiao, H., & Xu, L. (2012b). An extended path following algo- Ma, J., Jiang, J., Zhou, H., Zhao, J., & Guo, X. (2018a). Guided locality
rithm for graph-matching problem. IEEE Transactions on Pattern preserving feature matching for remote sensing image registration.
Analysis and Machine Intelligence, 34(7), 1451–1456. IEEE Transactions on Geoscience and Remote Sensing, 56(8),
Li, Y., Wang, S., Tian, Q., & Ding, X. (2015). A survey of recent 4435–4447.
advances in visual feature detection. Neurocomputing, 149, 736– Ma, J., Liang, P., Yu, W., Chen, C., Guo, X., Wu, J., et al. (2020). Infrared
751. and visible image fusion via detail preserving adversarial learning.
Loeckx, D., Slagmolen, P., Maes, F., Vandermeulen, D., & Suetens, Information Fusion, 54, 85–98.
P. (2009). Nonrigid image registration using conditional mutual Ma, J., Qiu, W., Zhao, J., Ma, Y., Yuille, A. L., & Tu, Z. (2015). Robust
information. IEEE Transactions on Image, 29(1), 19–29. l2 e estimation of transformation for non-rigid registration. IEEE
Loiola, E. M., de Abreu, N. M. M., Boaventura-Netto, P. O., Hahn, P., & Transactions on Signal Processing, 63(5), 1115–1129.
Querido, T. (2007). A survey for the quadratic assignment problem. Marimon, D., Bonnin, A., Adamek, T., & Gimeno, R. (2010). Darts:
European Journal of Operational Research, 176(2), 657–690. Efficient scale-space extraction of daisy keypoints. In Proceedings
Lowe, D.G., et al. (1999). Object recognition from local scale-invariant of the IEEE conference on computer vision and pattern recogni-
features. In Proceedings of the IEEE international conference on tion, pp. 2416–2423.
computer vision, pp. 1150–1157. Maron, H., & Lipman, Y. (2018). (probably) concave graph matching. In
Lowe, D. G. (2004). Distinctive image features from scale-invariant Advances in Neural information processing systems, pp. 406–416.
keypoints. International Journal of Computer Vision, 60(2), 91– Maron, H., Dym, N., Kezurer, I., Kovalsky, S., & Lipman, Y. (2016).
110. Point registration via efficient convex relaxation. ACM Transac-
Lowry, S., & Andreasson, H. (2018). Logos: Local geometric support tions on Graphics, 35(4), 73.
for high-outlier spatial verification. In Proceedings of the IEEE Masci, J., Boscaini, D., Bronstein, M., & Vandergheynst, P. (2015).
international conference on robotics and automation, pp. 7262– Geodesic convolutional neural networks on Riemannian mani-
7269. folds. In Proceedings of the IEEE international conference on
Luo, W., Schwing, A. G., & Urtasun, R. (2016). Efficient deep learning computer vision workshops, pp. 37–45.
for stereo matching. In Proceedings of the IEEE conference on Masood, A., & Sarfraz, M. (2007). Corner detection by sliding rectan-
computer vision and pattern recognition, pp. 5695–5703. gles along planar curves. Computers & Graphics, 31(3), 440–448.

123
International Journal of Computer Vision (2021) 129:23–79 73

Matas, J., Chum, O., Urban, M., & Pajdla, T. (2004). Robust wide- Mishkin, D., Radenovic, F., & Matas, J. (2018). Repeatability is not
baseline stereo from maximally stable extremal regions. Image enough: Learning affine regions via discriminability. In Proceed-
and Vision Computing, 22(10), 761–767. ings of the European conference on computer vision, pp. 284–300.
Ma, W., Wen, Z., Wu, Y., Jiao, L., Gong, M., Zheng, Y., et al. (2017b). Mitra, R., Doiphode, N., Gautam, U., Narayan, S., Ahmed, S., Chan-
Remote sensing image registration with modified sift and enhanced dran, S., & Jain, A. (2018). A large dataset for improving patch
feature matching. IEEE Geoscience and Remote Sensing Letters, matching. arXiv preprint arXiv:1801.01466.
14(1), 3–7. Mok, T. C., & Chung, A. (2020). Fast symmetric diffeomorphic image
Ma, J., Wu, J., Zhao, J., Jiang, J., Zhou, H., & Sheng, Q. Z. (2019b). registration with convolutional neural networks. In Proceedings of
Nonrigid point set registration with robust transformation learn- the IEEE/CVF conference on computer vision and pattern recog-
ing under manifold regularization. IEEE Transactions on Neural nition, pp. 4644–4653.
Networks and Learning Systems, 30(12), 3584–3597. Mokhtarian, F., & Suomela, R. (1998). Robust image corner detection
Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, through curvature scale space. IEEE Transactions on Pattern Anal-
A., & Brox, T. (2016). A large dataset to train convolutional net- ysis and Machine Intelligence, 20(12), 1376–1381.
works for disparity, optical flow, and scene flow estimation. In Möller, R., Krzykawski, M., & Gerstmayr, L. (2010). Three 2d-warping
Proceedings of the IEEE conference on computer vision and pat- schemes for visual robot navigation. Autonomous Robots, 29(3–4),
tern recognition, pp. 4040–4048. 253–291.
Ma, J., Yu, W., Liang, P., Li, C., & Jiang, J. (2019c). Fusiongan: A gen- Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., & Bronstein,
erative adversarial network for infrared and visible image fusion. M. M. (2017). Geometric deep learning on graphs and manifolds
Information Fusion, 48, 11–26. using mixture model CNNs. In Proceedings of the IEEE conference
Ma, J., Zhao, J., Jiang, J., Zhou, H., & Guo, X. (2019d). Locality preserv- on computer vision and pattern recognition, pp. 5115–5124.
ing matching. International Journal of Computer Vision, 127(5), Moo Yi, K., Trulls, E., Ono, Y., Lepetit, V., Salzmann, M., & Fua, P.
512–531. (2018). Learning to find good correspondences. In Proceedings of
Ma, J., Zhao, J., Ma, Y., & Tian, J. (2015a). Non-rigid visible and the IEEE conference on computer vision and pattern recognition,
infrared face registration via regularized gaussian fields criterion. pp. 2666–2674.
Pattern Recognition, 48(3), 772–784. Moo Yi, K., Verdie, Y., Fua, P., & Lepetit, V. (2016). Learning to assign
Ma, J., Zhao, J., Tian, J., Bai, X., & Tu, Z. (2013a). Regularized vector orientations to feature points. In Proceedings of the IEEE confer-
field learning with sparse approximation for mismatch removal. ence on computer vision and pattern recognition, pp. 107–116.
Pattern Recognition, 46(12), 3519–3532. Moravec, H. P. (1977). Techniques towards automatic visual obstacle
Ma, J., Zhao, J., Tian, J., Yuille, A. L., & Tu, Z. (2014). Robust point avoidance.
matching via vector field consensus. IEEE Transactions on Image Morel, J. M., & Yu, G. (2009). Asift: A new framework for fully affine
Processing, 23(4), 1706–1721. invariant image comparison. SIAM Journal on Imaging Sciences,
Ma, J., Zhao, J., & Yuille, A. L. (2016b). Non-rigid point set registration 2(2), 438–469.
by preserving global and local structures. IEEE Transactions on Mukherjee, D., Wu, Q. J., & Wang, G. (2015). A comparative experi-
Image Processing, 25(1), 53–64. mental study of image feature detectors and descriptors. Machine
Ma, J., Zhou, H., Zhao, J., Gao, Y., Jiang, J., & Tian, J. (2015b). Robust Vision and Applications, 26(4), 443–466.
feature matching for remote sensing image registration via locally Mur-Artal, R., Montiel, J. M. M., & Tardos, J. D. (2015). ORB-SLAM:
linear transforming. IEEE Transactions on Geoscience and Remote A versatile and accurate monocular slam system. IEEE Transac-
Sensing, 53(12), 6469–6481. tions on Robotics, 31(5), 1147–1163.
Mian, A., Bennamoun, M., & Owens, R. (2010). On the repeatability Mustafa, A., Kim, H., & Hilton, A. (2018). Msfd: Multi-scale
and quality of keypoints for local feature-based 3d object retrieval segmentation-based feature detection for wide-baseline scene
from cluttered scenes. International Journal of Computer Vision, reconstruction. IEEE Transactions on Image Processing, 28(3),
89(2–3), 348–361. 1118–1132.
Miao, S., Piat, S., Fischer, P., Tuysuzoglu, A., Mewes, P., Mansi, T., & Myronenko, A., & Song, X. (2010). Point set registration: Coherent
Liao, R. (2018). Dilated fcn for multi-agent 2d/3d medical image point drift. IEEE Transactions on Pattern Analysis and Machine
registration. In Proceedings of the thirty-second AAAI conference Intelligence, 32(12), 2262–2275.
on artificial intelligence, pp. 4694–4701. Nasuto, D., & Craddock, J. B. R. (2002). Napsac: High noise, high
Mikolajczyk, K., & Schmid, C. (2001). Indexing based on scale invari- dimensional robust estimation-it’s in the bag. In Proceedings of
ant interest points. In Proceedings of the IEEE international the British machine vision conference, pp. 458–467.
conference on computer vision, pp. 525–531. Ni, K., Jin, H., & Dellaert, F. (2009). Groupsac: Efficient consensus in
Mikolajczyk, K., & Schmid, C. (2004). Scale & affine invariant interest the presence of groupings. In Proceedings of the IEEE interna-
point detectors. International Journal of Computer Vision, 60(1), tional conference on computer vision, pp. 2193–2200.
63–86. Norouzi, M., & Blei, D. M. (2011). Minimal loss hashing for compact
Mikolajczyk, K., & Schmid, C. (2005). A performance evaluation binary codes. In Proceedings of the international conference on
of local descriptors. IEEE Transactions on Pattern Analysis and machine learning, pp. 353–360.
Machine Intelligence, 27(10), 1615–1630. Nüchter, A., Lingemann, K., Hertzberg, J., & Surmann, H. (2007).
Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., 6d SLAM–3d mapping outdoor environments. Journal of Field
Schaffalitzky, F., et al. (2005). A comparison of affine region detec- Robotics, 24(8–9), 699–722.
tors. International Journal of Computer Vision, 65(1–2), 43–72. Ojala, T., Pietikäinen, M., & Mäenpää, T. (2002). Multiresolution gray-
Mishchuk, A., Mishkin, D., Radenovic, F., & Matas, J. (2017). Working scale and rotation invariant texture classification with local binary
hard to know your neighbor’s margins: Local descriptor learning patterns. IEEE Transactions on Pattern Analysis and Machine
loss. In Advances in neural information processing systems, pp. Intelligence, 24(7), 971–987.
4826–4837. Ono, Y., Trulls, E., Fua, P., & Yi, K. M. (2018). LF-NET: Learning
Mishkin, D., Radenovic, F., & Matas, J. (2017). Learning dis- local features from images. In Advances in neural information
criminative affine regions via discriminability. arXiv preprint processing systems, pp. 6234–6244.
arXiv:1711.06704.

123
74 International Journal of Computer Vision (2021) 129:23–79

Ovsjanikov, M., Ben-Chen, M., Solomon, J., Butscher, A., & Guibas, Qi, C.R., Su, H., Mo, K., & Guibas, L. J. (2017a). Pointnet: Deep
L. (2012). Functional maps: A flexible representation of maps learning on point sets for 3d classification and segmentation. In
between shapes. ACM Transactions on Graphics, 31(4), 30. Proceedings of the IEEE conference on computer vision and pat-
Pachauri, D., Kondor, R., & Singh, V. (2013). Solving the multi-way tern recognition, pp. 652–660.
matching problem by permutation synchronization. In Advances Qi, C. R., Yi, L., Su, H., & Guibas, L. J. (2017b). Pointnet++: Deep
in neural information processing systems, pp. 1860–1868. hierarchical feature learning on point sets in a metric space. In
Pais, G. D., Ramalingam, S., Govindu, V. M., Nascimento, J. C., Chel- Advances in neural information processing systems, pp. 5099–
lappa, R., & Miraldo, P. (2020). 3dregnet: A deep neural network 5108.
for 3d point registration. In Proceedings of the IEEE/CVF confer- Quinlan, J. R. (1986). Induction of decision trees. Machine Learning,
ence on computer vision and pattern recognition, pp. 7193–7203. 1(1), 81–106.
Pan, W. H., Wei, S. D., & Lai, S. H. (2008). Efficient NCC-based image Raguram, R., Chum, O., Pollefeys, M., Matas, J., & Frahm, J. M. (2012).
matching in Walsh-Hadamard domain. In Proceedings of the Euro- USAC: A universal framework for random sample consensus.
pean conference on computer vision, pp. 468–480. IEEE Transactions on Pattern Analysis and Machine Intelligence,
Pang, J., Sun, W., Ren, J. S., Yang, C., & Yan, Q. (2017). Cascade resid- 35(8), 2022–2038.
ual learning: A two-stage convolutional neural network for stereo Ramer, U. (1972). An iterative procedure for the polygonal approxima-
matching. In Proceedings of the IEEE international conference on tion of plane curves. Computer Graphics and Image Processing,
computer vision, pp. 887–895. 1(3), 244–256.
Papazov, C., & Burschka, D. (2011). Stochastic global optimization for Ramisa, A., Goldhoorn, A., Aldavert, D., Toledo, R., & de Mantaras,
robust point set registration. Computer Vision and Image Under- R. L. (2011). Combining invariant features and the ALV hom-
standing, 115(12), 1598–1609. ing method for autonomous robot navigation based on panoramas.
Park, J., Zhou, Q. Y., & Koltun, V. (2017). Colored point cloud registra- Journal of Intelligent & Robotic Systems, 64(3–4), 625–649.
tion revisited. In Proceedings of the IEEE international conference Ranftl, R., & Koltun, V. (2018). Deep fundamental matrix estimation.
on computer vision, pp. 143–152. In Proceedings of the European conference on computer vision,
Parra Bustos, A., Chin, T. J., & Suter, D. (2014). Fast rotation search with pp. 284–299.
stereographic projections for 3d registration. In Proceedings of the Reddy, B. S., & Chatterji, B. N. (1996). An FFT-based technique for
IEEE conference on computer vision and pattern recognition, pp. translation, rotation, and scale-invariant image registration. IEEE
3930–3937. Transactions on Image Processing, 5(8), 1266–1271.
Paul, S., & Pati, U. C. (2016). Remote sensing optical image registration Revaud, J., Weinzaepfel, P., De Souza, C., Pion, N., Csurka, G., Cabon,
using modified uniform robust sift. IEEE Geoscience and Remote Y., & Humenberger, M. (2019). R2d2: Repeatable and reliable
Sensing Letters, 13(9), 1300–1304. detector and descriptor. arXiv preprint arXiv:1906.06195.
Perona, P., & Malik, J. (1990). Scale-space and edge detection using Revaud, J., Weinzaepfel, P., Harchaoui, Z., & Schmid, C. (2016). Deep-
anisotropic diffusion. IEEE Transactions on Pattern Analysis and matching: Hierarchical deformable dense matching. International
Machine Intelligence, 12(7), 629–639. Journal of Computer Vision, 120(3), 300–323.
Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2007). Richardson, A., & Olson, E. (2013). Learning convolutional filters for
Object retrieval with large vocabularies and fast spatial matching. interest point detection. In Proceedings of the IEEE international
In Proceedings of the IEEE conference on computer vision and conference on robotics and automation, pp. 631–637.
pattern recognition, pp. 1–8. Robertson, C., & Fisher, R. B. (2002). Parallel evolutionary registration
Piasco, N., Sidibé, D., Demonceaux, C., & Gouet-Brunet, V. (2018). of range data. Computer Vision and Image Understanding, 87(1–
A survey on visual-based localization: On the benefit of heteroge- 3), 39–50.
neous data. Pattern Recognition, 74, 90–109. Rocco, I., Arandjelovic, R., & Sivic, J. (2017). Convolutional neural
Pilet, J., Lepetit, V., & Fua, P. (2008). Fast non-rigid surface detection, network architecture for geometric matching. In Proceedings of
registration and realistic augmentation. International Journal of the IEEE conference on computer vision and pattern recognition,
Computer Vision, 76(2), 109–122. pp. 6148–6157.
Pinheiro, A. M., & Ghanbari, M. (2010). Piecewise approximation of Rocco, I., Cimpoi, M., Arandjelović, R., Torii, A., Pajdla, T., & Sivic,
contours through scale-space selection of dominant points. IEEE J. (2018). Neighbourhood consensus networks. In Advances in
Transactions on Image Processing, 19(6), 1442–1450. neural information processing systems, pp. 1651–1662.
Plötz, T., & Roth, S. (2018). Neural nearest neighbors networks. In Rodola, E., Bronstein, A. M., Albarelli, A., Bergamasco, F., & Torsello,
Advances in Neural information processing systems, pp. 1087– A. (2012). A game-theoretic approach to deformable shape match-
1098. ing. In Proceedings of the IEEE conference on computer vision and
Poggi, M., Pallotti, D., Tosi, F., & Mattoccia, S. (2019). Guided stereo pattern recognition, pp. 182–189.
matching. In Proceedings of the IEEE conference on computer Rodolà, E., Cosmo, L., Bronstein, M. M., Torsello, A., & Cremers, D.
vision and pattern recognition, pp. 979–988. (2017). Partial functional correspondence. In Computer graphics
Pohl, C., & Van Genderen, J. L. (1998). Review article multisensor forum, Vol. 36, Wiley Online Library, pp. 222–236.
image fusion in remote sensing: concepts, methods and applica- Rodolà, E., Rota Bulo, S., Windheuser, T., Vestner, M., & Cremers,
tions. International Journal of Remote Sensing, 19(5), 823–854. D. (2014). Dense non-rigid shape correspondence using random
Pokrass, J., Bronstein, A. M., Bronstein, M. M., Sprechmann, P., & forests. In Proceedings of the IEEE conference on computer vision
Sapiro, G. (2013). Sparse modeling of intrinsic correspondences. and pattern recognition, pp. 4177–4184.
In Computer graphics forum, Vol. 32, Wiley Online Library, pp. Rodola, E., Torsello, A., Harada, T., Kuniyoshi, Y., & Cremers, D.
459–468. (2013). Elastic net constraints for shape matching. In Proceed-
Pomerleau, F., Colas, F., Siegwart, R., & Magnenat, S. (2013). Com- ings of the IEEE international conference on computer vision, pp.
paring ICP variants on real-world data sets. Autonomous Robots, 1169–1176.
34(3), 133–148. Rosenfeld, A., & Weszka, J. S. (1975). An improved method of angle
Poursaeed, O., Yang, G., Prakash, A., Fang, Q., Jiang, H., Hariharan, B., detection on digital curves. IEEE Transactions on Computers,
& Belongie, S. (2018). Deep fundamental matrix estimation with- 100(9), 940–941.
out correspondences. In Proceedings of the European conference
on computer vision workshop, pp. 1–13.

123
International Journal of Computer Vision (2021) 129:23–79 75

Rosten, E., & Drummond, T. (2006). Machine learning for high-speed ings of the IEEE winter conference on applications of computer
corner detection. In Proceedings of the European conference on vision, pp. 278–285.
computer vision, pp. 430–443. Shaked, A., & Wolf, L. (2017). Improved stereo matching with con-
Rosten, E., Porter, R., & Drummond, T. (2010). Faster and better: A stant highway networks and reflective confidence learning. In
machine learning approach to corner detection. IEEE Transactions Proceedings of the IEEE conference on computer vision and pat-
on Pattern Analysis and Machine Intelligence, 32(1), 105–119. tern recognition, pp. 4641–4650.
Rublee, E., Rabaud, V., Konolige, K., & Bradski, G. R. (2011). Orb: Shakhnarovich, G. (2005). Learning task-specific similarity. Ph.D. the-
An efficient alternative to sift or surf. In Proceedings of the IEEE sis, Massachusetts Institute of Technology.
international conference on computer vision, pp. 2564–2571. Shapiro, L. S., & Brady, J. M. (1992). Feature-based correspondence:
Rustamov, R. M. (2007). Laplace-Beltrami eigenfunctions for defor- An eigenvector approach. Image and Vision Computing, 10(5),
mation invariant shape representation. In Proceedings of the 283–288.
Eurographics symposium on geometry processing, pp. 225–233. Shen, X., Wang, C., Li, X., Yu, Z., Li, J., Wen, C., Cheng, M., & He, Z.
Rusu, R. B., Blodow, N., & Beetz, M. (2009). Fast point feature his- (2019). RF-NET: An end-to-end image matching network based on
tograms (fpfh) for 3d registration. In Proceedings of the IEEE receptive field. In Proceedings of the IEEE conference on computer
international conference on robotics and automation, pp. 3212– vision and pattern recognition, pp. 8132–8140.
3217. Shi, J., & Tomasi, C. (1993). Good features to track. Technical report,
Rusu, R. B., Blodow, N., Marton, Z. C., & Beetz, M. (2008). Aligning Cornell University.
point cloud views using persistent feature histograms. In Proceed- Silva, L., Bellon, O. R. P., & Boyer, K. L. (2005). Precision range
ings of the IEEE/RSJ international conference on intelligent robots image registration using a robust surface interpenetration measure
and systems, pp. 3384–3391. and enhanced genetic algorithms. IEEE Transactions on Pattern
Sahillioglu, Y., & Yemez, Y. (2011). Coarse-to-fine combinatorial Analysis and Machine Intelligence, 27(5), 762–776.
matching for dense isometric shape correspondence. In Computer Simonovsky, M., Gutiérrez-Becker, B., Mateus, D., Navab, N., &
graphics forum, Vol. 30, Wiley Online Library, pp. 1461–1470. Komodakis, N. (2016). A deep metric for multimodal registra-
Salakhutdinov, R., & Hinton, G. (2009). Semantic hashing. Interna- tion. In Proceedings of the international conference on medical
tional Journal of Approximate Reasoning, 50(7), 969–978. image computing and computer-assisted intervention, pp. 10–18.
Salti, S., Lanza, A., & Di Stefano, L. (2013). Keypoints from symmetries Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Learning local
by wave propagation. In Proceedings of the IEEE conference on feature descriptors using convex optimization. IEEE Transactions
computer vision and pattern recognition, pp. 2898–2905. on Pattern Analysis and Machine Intelligence, 36(8), 1573–1585.
Salti, S., Tombari, F., Spezialetti, R., & Di Stefano, L. (2015). Learn- Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., & Moreno-