
In Proceedings of the IEEE International Conference on Computer Vision, Beijing, China, October 2005.

The Pyramid Match Kernel:


Discriminative Classification with Sets of Image Features

Kristen Grauman and Trevor Darrell


Massachusetts Institute of Technology
Computer Science and Artificial Intelligence Laboratory
Cambridge, MA, USA

Abstract
Discriminative learning is challenging when examples are sets of features, and the sets vary in cardinality and lack any sort of meaningful ordering. Kernel-based classification methods can learn complex decision boundaries, but a kernel over unordered set inputs must somehow solve for correspondences – generally a computationally expensive task that becomes impractical for large set sizes. We present a new fast kernel function which maps unordered feature sets to multi-resolution histograms and computes a weighted histogram intersection in this space. This "pyramid match" computation is linear in the number of features, and it implicitly finds correspondences based on the finest resolution histogram cell where a matched pair first appears. Since the kernel does not penalize the presence of extra features, it is robust to clutter. We show the kernel function is positive-definite, making it valid for use in learning algorithms whose optimal solutions are guaranteed only for Mercer kernels. We demonstrate our algorithm on object recognition tasks and show it to be accurate and dramatically faster than current approaches.

Figure 1: The pyramid match kernel intersects histogram pyramids formed over local features, approximating the optimal correspondences between the sets' features.

1. Introduction

A variety of representations used in computer vision consist of unordered sets of features or parts, where each set varies in cardinality, and the correspondence between the features across each set is unknown. For instance, an image may be described by a set of detected local affine-invariant regions, a shape may be described by a set of local descriptors defined at each edge point, or a person's face may be represented by a set of patches with different facial parts. In such cases, one set of feature vectors denotes a single instance of a particular class of interest (an object, scene, shape, face, etc.), and it is expected that the number of features will vary across examples due to viewpoint changes, occlusions, or inconsistent detections by the interest operator.

To perform learning tasks like categorization or recognition with such representations is challenging. While generative methods have had some success, kernel-based discriminative methods are known to represent complex decision boundaries very efficiently and generalize well to unseen data [24, 21]. For example, the Support Vector Machine (SVM) is a widely used approach to discriminative classification that finds the optimal separating hyperplane between two classes. Kernel functions, which measure similarity between inputs, introduce non-linearities to the decision functions; the kernel non-linearly maps two examples from the input space to the inner product in some feature space. However, conventional kernel-based algorithms are designed to operate on fixed-length vector inputs, where each vector entry corresponds to a particular global attribute for that instance; the commonly used general-purpose kernels defined on ℝ^n inputs (e.g., Gaussian RBF, polynomial) are not applicable in the space of vector sets.

In this work we propose a pyramid match kernel – a new kernel function over unordered feature sets that allows them to be used effectively and efficiently in kernel-based learning methods. Each feature set is mapped to a multi-resolution histogram that preserves the individual features' distinctness at the finest level. The histogram pyramids are then compared using a weighted histogram intersection computation, which we show defines an implicit correspondence based on the finest resolution histogram cell where a matched pair first appears (see Figure 1).
The similarity measured by the pyramid match approximates the similarity measured by the optimal correspondences between feature sets of unequal cardinality (i.e., the partial matching that optimally maps points in the lower cardinality set to some subset of the points in the larger set, such that the summed similarities between matched points is maximal). Our kernel is extremely efficient and can be computed in time that is linear in the sets' cardinality. We show that our kernel function is positive-definite, meaning that it is appropriate to use with learning methods that guarantee convergence to a unique optimum only for positive-definite kernels (e.g., SVMs).

Because it does not penalize the presence of superfluous data points, the proposed kernel is robust to clutter. As we will show, this translates into the ability to handle unsegmented images with varying backgrounds or occlusions. The kernel also respects the co-occurrence relations inherent in the input sets: rather than matching features in a set individually, ignoring potential dependencies conveyed by features within one set, our similarity measure captures the features' joint statistics.

Other approaches to this problem have recently been proposed [25, 14, 3, 12, 27, 16, 20], but unfortunately each of these techniques suffers from some number of the following drawbacks: computational complexities that make large feature set sizes infeasible; limitations to parametric distributions which may not adequately describe the data; kernels that are not positive-definite (do not guarantee unique solutions for an SVM); limitations to sets of equal size; and failure to account for dependencies within feature sets.

Our method addresses all of these issues, resulting in a kernel appropriate for comparing unordered, variable-length feature sets within any existing kernel-based learning paradigm. We demonstrate our algorithm with object recognition tasks and show that its accuracy is comparable to current approaches, while requiring significantly less computation time.

Method           | Complexity   | C | P | M | U
-----------------|--------------|---|---|---|---
Match [25]       | O(dm^2)      |   |   | x | x
Exponent [14]    | O(dm^2)      |   | x | x | x
Greedy [3]       | O(dm^2)      | x |   | x | x
Princ. ang. [27] | O(dm^3)      | x | x |   |
Bhattach.'s [12] | O(dm^3)      | x | x |   | x
KL-div. [16]     | O(dm^2)      | x |   |   | x
Pyramid          | O(dm log D)  | x | x | x | x

Table 1: Comparing kernel approaches to matching unordered sets. Columns show each method's computational cost and whether its kernel captures co-occurrences (C), is positive-definite (P), does not assume a parametric model (M), and can handle sets of unequal cardinality (U). d is vector dimension, m is maximum set cardinality, and D is diameter of vector space. "Pyramid" refers to the proposed kernel.

2. Related Work

In this section, we review related work on discriminative classification with sets of features, using kernels and SVMs for recognition, and multi-resolution image representations.

Object recognition is a challenging problem that requires strong generalization ability from a classifier in order to cope with the broad variety in illumination, viewpoint, occlusions, clutter, intra-class appearances, and deformations that images of the same object or object class will exhibit. While researchers have shown promising results applying SVMs to object recognition, they have generally used global image features – ordered features of equal length measured from the image as a whole, such as color or grayscale histograms or vectors of raw pixel data [5, 18, 17]. Such global representations are known to be sensitive to real-world imaging conditions, such as occlusions, pose changes, or image noise.

Recent work has shown that local features invariant to common image transformations (e.g., SIFT [13]) are a powerful representation for recognition, because the features can be reliably detected and matched across instances of the same object or scene under different viewpoints, poses, or lighting conditions. Most approaches, however, perform recognition with local feature representations using nearest-neighbor (e.g., [1, 8, 22, 2]) or voting-based classifiers followed by an alignment step (e.g., [13, 15]); both may be impractical for large training sets, since their classification times increase with the number of training examples. An SVM, on the other hand, identifies a sparse subset of the training examples (the support vectors) to delineate a decision boundary.

Kernel-based learning algorithms, which include SVMs, kernel PCA, or Gaussian Processes, have become well-established tools that are useful in a variety of contexts, including discriminative classification, regression, density estimation, and clustering [21]. More recently, attention has been focused on developing specialized kernels that can more fully leverage these tools for situations where the data cannot be naturally represented by a Euclidean vector space, such as graphs, strings, or trees.

Several researchers have designed similarity measures that operate on sets of unordered features. See Table 1 for a concise comparison of the approaches. The authors of [25] propose a kernel that averages over the similarities of the best matching feature found for each feature member within the other set. The use of the "max" operator in this kernel makes it non-Mercer (i.e., not positive-definite – see Section 3), and thus it lacks convergence guarantees when used in an SVM. A similar kernel is given in [14], which also considers all possible feature matchings but raises the similarity between each pair of features to a given power. Both [25] and [14] have a computational complexity that is quadratic in the number of features.
Furthermore, both match each feature in a set independently, ignoring potentially useful co-occurrence information. In contrast, our kernel captures the joint statistics of co-occurring features by matching them concurrently as a set.

The method given in [3] is based on finding a sub-optimal matching between two sets using a greedy heuristic; although this results in a non-Mercer kernel, the authors provide a means of tuning the kernel hyperparameter so as to limit the probability that a given kernel matrix is not positive-definite. The authors of [27] measure similarity in terms of the principal angle between the two linear subspaces spanned by two sets' vector elements. This kernel is only positive-definite for sets of equal cardinality, and its complexity is cubic in the number of features. In [20], an algebraic kernel is used to combine similarities given by vector-based kernels, with the weighting chosen to reflect whether the features are in alignment (ordered). When set cardinalities vary, inputs are padded with zeros so as to form equally-sized matrices.

In [12], a Gaussian is fit to each set of vectors, and then the kernel value between two sets is the Bhattacharyya affinity between their Gaussian distributions. As noted by the authors, the method is constrained to using a Gaussian model in order to have a closed form solution. In practice, the method in [12] is also limited to sets with small cardinality, because its complexity is cubic in the number of features. Similarly, the authors of [16] fit a Gaussian to a feature set, and then compare sets using KL-divergence as a distance measure. Unlike the kernels of [12] and [16], which are based on parametric models that assume inputs will fit a certain form, our method is model-free and maintains the distinct data points in the representation.

An alternative approach when dealing with unordered set data is to designate prototypical examples from each class, and then represent examples in terms of their distances to each prototype; standard algorithms that handle vectors in a Euclidean space are then applicable. The authors of [28] build such a classifier for handwritten digits, and use the shape context distance of [1] as the measure of similarity. The issues faced by such a prototype-based method are determining which examples should serve as prototypes, choosing how many there should be, and updating the prototypes properly when new types of data are encountered.

Our feature representation is based on a multi-resolution histogram, or "pyramid", which is computed by binning data points into discrete regions of increasingly larger size. Single-level histograms have been used in various visual recognition systems, one of the first being that of [23], where the intersection of global color histograms was used to compare images. Pyramids have been shown to be a useful representation in a wide variety of image processing tasks – see [9] for a summary.

In [10], multi-resolution histograms are compared with L1 distance to approximate a least-cost matching of equal-mass global color histograms for nearest neighbor image retrievals. This work inspired our use of a similar representation for point sets. However, unlike [10], our method builds a discriminative classifier, and it compares histograms with a weighted intersection rather than L1. Our method allows inputs to have unequal cardinalities and thus enables partial matchings, which is important in practice for handling clutter and unsegmented images.

We believe ours is the first work to advocate for the use of a histogram pyramid as an explicit discriminative feature formed over sets, and the first to show its connection to optimal partial matching when used with a hierarchical weighted histogram intersection similarity measure.

3. Approach

Kernel-based learning algorithms [21, 24] are founded on the idea of embedding data into a Euclidean space, and then seeking linear relations among the embedded data. For example, an SVM finds the optimal separating hyperplane between two classes in an embedded space (also referred to as the feature space). A kernel function K : X × X → ℝ serves to map pairs of data points in an input space X to their inner product in the embedding space F, thereby evaluating the similarities between all points and determining their relative positions. Linear relations are sought in the embedded space, but a decision boundary may still be non-linear in the input space, depending on the choice of a feature mapping function Φ : X → F.

The main contribution of this work is a new kernel function based on implicit correspondences that enables discriminative classification for unordered, variable-length sets of vectors. The kernel is provably positive-definite. The main advantages of our algorithm are its efficiency, its use of implicit correspondences that respect the joint statistics of co-occurring features, and its resistance to clutter or "superfluous" data points.

The basic idea of our method is to map sets of features to multi-resolution histograms, and then compare the histograms with a weighted histogram intersection measure in order to approximate the similarity of the best partial matching between the feature sets. We call the proposed kernel a "pyramid match kernel" because input sets are converted to multi-resolution histograms.

3.1. The Pyramid Match Kernel

We consider an input space X of sets of d-dimensional feature vectors that are bounded by a sphere of diameter D and whose minimum inter-vector distance is √(2d) (this may be enforced by scaling the data appropriately):

X = { x | x = {[f_1^1, ..., f_d^1], ..., [f_1^{m_x}, ..., f_d^{m_x}]} },    (1)
where m_x varies across instances in X.
The feature extraction function Ψ is defined as:
Ψ(x) = [H_{−1}(x), H_0(x), ..., H_L(x)],    (2)

where L = log_2 D, x ∈ X, H_i(x) is a histogram vector formed over data x using d-dimensional bins of side length 2^i, and H_i(x) has dimension r_i = (D / (2^i √d))^d. In other words, Ψ(x) is a vector of concatenated histograms, where each subsequent component histogram has bins that double in size (in all d dimensions) compared to the previous one. The bins in the finest-level histogram H_{−1} are small enough that each d-dimensional data point from sets in X falls into
its own bin, and then the bin size increases until all data
points from sets in X fall into a single bin at level L.
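To make this construction concrete, the following is a minimal Python sketch of the pyramid-building step. It is illustrative rather than the authors' implementation: the function name, the Counter-based sparse storage, and the assumption that the features have been scaled so that D bounds every coordinate are choices made here for clarity.

```python
import math
from collections import Counter
from typing import List, Sequence

def build_pyramid(points: Sequence[Sequence[float]], D: float) -> List[Counter]:
    """Map a set of d-dimensional feature vectors to a multi-resolution
    histogram [H_0(x), ..., H_L(x)], stored sparsely per level.

    Level i uses d-dimensional bins of side length 2**i, and L is chosen so
    that all points share a single bin at the coarsest level.  H_{-1} is
    omitted here because no matches are counted at that level (Eqn. 3 starts
    its sum at i = 0).
    """
    L = int(math.ceil(math.log2(D)))
    pyramid = []
    for i in range(L + 1):
        side = 2.0 ** i
        level = Counter()
        for p in points:
            # Bin index tuple for this point at resolution level i.
            level[tuple(int(math.floor(coord / side)) for coord in p)] += 1
        pyramid.append(level)
    return pyramid
```

Storing each level as a map over occupied bins keeps only the non-zero entries, in line with the sparse O(m log D) storage analysis given in Section 3.4.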
The pyramid match kernel K_Δ measures similarity between point sets based on implicit correspondences found within this multi-resolution histogram space. The similarity between two input sets is defined as the weighted sum of the number of feature matchings found at each level of the pyramid formed by Ψ:

K_Δ(Ψ(y), Ψ(z)) = Σ_{i=0}^{L} w_i N_i,    (3)

where N_i signifies the number of newly matched pairs at level i. A new match is defined as a pair of features that were not in correspondence at any finer resolution level.

Figure 2: A pyramid match determines a partial correspondence by matching points once they fall into the same histogram bin. In this example, two 1-D feature sets are used to form two histogram pyramids. Each row corresponds to a pyramid level. H_{−1} is not pictured here because no matches are formed at the finest level. In (a), the set y is on the left side, and the set z is on the right. (Points are distributed along the vertical axis, and these same points are repeated at each level.) Light dotted lines are bin boundaries, bold dashed lines indicate a pair matched at this level, and bold solid lines indicate a match already formed at a finer resolution level. In (b) multi-resolution histograms are shown, with bin counts along the horizontal axis. In (c) the intersection pyramid between the histograms in (b) is shown. K_Δ uses this to measure how many new matches occurred at each level. I_i refers to I(H_i(y), H_i(z)). Here, I_i = 2, 4, 5 across levels, and therefore the number of new matches found at each level is N_i = 2, 2, 1. The sum over N_i, weighted by w_i = 1, 1/2, 1/4, gives the pyramid match similarity.

The kernel implicitly finds correspondences between point sets, if we consider two points matched once they fall into the same histogram bin (starting at the finest resolution level where each point is guaranteed to be in its own bin). The matching is equivalent to a hierarchical process: vectors not found to correspond at a high resolution have the opportunity to be matched at lower resolutions. For example, in Figure 2, there are two points matched at the finest scale, two new matches at the medium scale, and one at the coarsest scale. K_Δ's output value reflects the overall similarity of the matching: each newly matched pair at level i contributes a value w_i that is proportional to how similar two points matching at that level must be, as determined by the bin size. Note that the sum in Eqn. 3 starts with index i = 0, because the definition of Ψ ensures that no points match at level i = −1.

To calculate N_i, the kernel makes use of a histogram intersection function I, which measures the "overlap" between two histograms' bins:

I(A, B) = Σ_{j=1}^{r} min(A^{(j)}, B^{(j)}),    (4)

where A and B are histograms with r bins, and A^{(j)} denotes the count of the j-th bin of A.

Histogram intersection effectively counts the number of points in two sets which match at a given quantization level, i.e., fall into the same bin. To calculate the number of newly matched pairs N_i induced at level i, it is sufficient to compute the difference between successive histogram levels' intersections:

N_i = I(H_i(y), H_i(z)) − I(H_{i−1}(y), H_{i−1}(z)),    (5)

where H_i refers to the i-th component histogram generated by Ψ in Eqn. 2. Note that the kernel is not searching explicitly for similar points – it never computes distances between the vectors in each set. Instead, it simply uses the change in intersection values at each histogram level to count the matches as they occur.

The number of new matches found at each level in the pyramid is weighted according to the size of that histogram's bins: matches made within larger bins are weighted less than those found in smaller bins.
Since the largest diagonal of a d-dimensional hypercube bin with sides of length 2^i has length 2^i √d, the maximal distance between any two points in one bin doubles at each increasingly coarser histogram in the pyramid. Thus, the number of new matches induced at level i is weighted by 1/2^i to reflect the (worst-case) similarity of points matched at that level. Intuitively, this means that similarity between vectors (features in y and z) at a finer resolution – where features are most distinct – is rewarded more heavily than similarity between vectors at a coarser level.

From Eqns. 3, 4, and 5, we define the (un-normalized) pyramid match kernel function:

K̃_Δ(Ψ(y), Ψ(z)) = Σ_{i=0}^{L} (1/2^i) [ I(H_i(y), H_i(z)) − I(H_{i−1}(y), H_{i−1}(z)) ],    (6)

where y, z ∈ X, and H_i(x) is the i-th histogram in Ψ(x). We normalize this value by the product of each input's self-similarity to avoid favoring larger input sets, arriving at the final kernel value K_Δ(P, Q) = (1/√C) K̃_Δ(P, Q), where C = K̃_Δ(P, P) K̃_Δ(Q, Q).

In order to alleviate quantization effects that may arise due to the discrete histogram bins, we can combine the kernel values resulting from multiple (T) pyramid matches formed under different multi-resolution histograms with randomly shifted bins. Each dimension of each of the T pyramids is shifted by an amount chosen uniformly at random between 0 and D. This yields T feature mappings Ψ_1, ..., Ψ_T that are applied as in Eqn. 2 to map an input set y to T multi-resolution histograms: [Ψ_1(y), ..., Ψ_T(y)]. For inputs y and z, the combined kernel value is then Σ_{j=1}^{T} K_Δ(Ψ_j(y), Ψ_j(z)).
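Continuing the sketch above, the kernel value itself can be computed from two such pyramids by accumulating the weighted intersection differences of Eqn. 6 and then applying the self-similarity normalization. Again, this is an illustrative sketch rather than the authors' code, and it assumes both pyramids were built over the same grid.

```python
from collections import Counter
from typing import List

def intersect(A: Counter, B: Counter) -> int:
    """Histogram intersection I(A, B) = sum over bins of min(A_j, B_j)."""
    if len(B) < len(A):
        A, B = B, A
    return sum(min(count, B[b]) for b, count in A.items() if b in B)

def pyramid_match(py_y: List[Counter], py_z: List[Counter]) -> float:
    """Un-normalized pyramid match (Eqn. 6): weighted sum of new matches."""
    value, prev = 0.0, 0  # I(H_{-1}(y), H_{-1}(z)) = 0 by construction
    for i, (Hy, Hz) in enumerate(zip(py_y, py_z)):
        curr = intersect(Hy, Hz)
        value += (curr - prev) / (2.0 ** i)   # w_i = 1 / 2^i times N_i
        prev = curr
    return value

def normalized_pyramid_match(py_y: List[Counter], py_z: List[Counter]) -> float:
    """K_Delta(P, Q) = (1 / sqrt(C)) * K~_Delta(P, Q), C = K~(P,P) K~(Q,Q)."""
    C = pyramid_match(py_y, py_y) * pyramid_match(py_z, py_z)
    return pyramid_match(py_y, py_z) / (C ** 0.5)
```

To apply the T randomly shifted grids described above, one would build T pyramids per input, offsetting the grid origin by a random amount in each dimension, and sum the T resulting kernel values.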
3.2. Partial Match Correspondences

Our kernel allows sets of unequal cardinalities, and therefore it enables partial matchings, where the points of the smaller set are mapped to some subset of the points in the larger set. Dissimilarity is only judged on the most similar part of the empirical distributions, and superfluous data points are ignored; the result is a robust similarity measure that accommodates inputs expected to contain extraneous vector entries. This is a common situation when recognizing objects in images, due for instance to background variations, clutter, or changes in object pose that cause different subsets of features to be visible. Thus, the proposed kernel is equipped to handle unsegmented examples, as we will demonstrate in Section 4.

By construction, the pyramid match offers an approximation of the optimal correspondence-based matching between two feature sets, in which the overall similarity between corresponding points is maximized. When input sets have equal cardinalities, histogram intersection can be reduced to an L1 distance: I(H(y), H(z)) = m − (1/2) ||H(y) − H(z)||_{L1} if m = |y| = |z| [23]. Intersection over the pyramid with weights set to w_i = 1/2^i then strictly approximates the optimal bipartite matching [10]. With variable cardinalities no similar proof is available, but we show empirically below that the intersection of multi-resolution histograms approximates the best partial matching both in simulation and in practice.

Since the pyramid match defines correspondences across entire sets simultaneously, it inherently accounts for dependencies between various features occurring in one set. In contrast, previous approaches have used each feature in a set to independently index into the second set; this ignores possibly useful information that is inherent in the co-occurrence of a set of distinctive features, and it fails to distinguish between instances where an object has varying numbers of similar features since multiple features may be matched to a single feature in the other set [25, 14].

3.3. Satisfying Mercer's Condition

Only positive semi-definite kernels guarantee an optimal solution to kernel-based algorithms based on convex optimization, including SVMs. According to Mercer's theorem, a kernel K is positive semi-definite if and only if

K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩,  ∀ x_i, x_j ∈ X,    (7)

where ⟨·,·⟩ denotes a scalar dot product. This ensures that the kernel corresponds to an inner product in some feature space, where kernel methods can search for linear relations [21].

Histogram intersection on single resolution histograms over multi-dimensional data is a positive-definite similarity function [17]. Using this construct and the closure properties of valid kernel functions, we can show that the pyramid match kernel is a Mercer kernel. The definition given in Eqn. 6 is algebraically equivalent to

K̃_Δ(Ψ(y), Ψ(z)) = min(|y|, |z|) / 2^L + Σ_{i=0}^{L−1} (1/2^{i+1}) I(H_i(y), H_i(z)),    (8)

since I(H_{−1}(y), H_{−1}(z)) = 0, and I(H_L(y), H_L(z)) = min(|y|, |z|) by the construction of the pyramid. Given that Mercer kernels are closed under both addition and scaling by a positive constant [21], we only need to show that the minimum cardinality between two sets (min(|y|, |z|)) corresponds to a positive semi-definite kernel.

The cardinality of an input set x can be encoded as a binary vector containing |x| ones followed by Z − |x| zeros, where Z is the maximum cardinality of any set. The inner product between two such expansions is equivalent to the cardinality of the smaller set, thus satisfying Mercer's condition. Note that this binary expansion and the one in [17] only serve to prove positive-definiteness and are never computed explicitly. Therefore, K_Δ is valid for use in existing learning methods that require Mercer kernels.
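The equivalence between Eqn. 6 and Eqn. 8 is also easy to confirm numerically; the small self-contained check below uses the intersection values from the Figure 2 example (I_i = 2, 4, 5, with I(H_{−1}) = 0 and I(H_L) = min(|y|, |z|), as the construction guarantees).

```python
# Sanity check (illustrative): Eqn. 6 and Eqn. 8 give the same value.
# Intersection values from the Figure 2 example: I_i = 2, 4, 5.
I = [2, 4, 5]
L = len(I) - 1
min_card = I[L]  # the coarsest intersection equals min(|y|, |z|) by construction

eqn6 = sum((I[i] - (I[i - 1] if i > 0 else 0)) / 2 ** i for i in range(L + 1))
eqn8 = min_card / 2 ** L + sum(I[i] / 2 ** (i + 1) for i in range(L))

assert abs(eqn6 - eqn8) < 1e-12  # both equal 3.25 for this example
```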
3.4. Efficiency

The time required to compute Ψ for an input set with m d-dimensional features is O(dz log D), where z = max(m, k) and k is the maximum feature value in a single dimension. (Typically m > k.) The bin coordinates corresponding to non-zero histogram entries for each of the log D quantization levels are computed directly during a scan of the m input vectors; these entries are sorted by the bin indices and the bin counts for all entries with the same index are summed to form one entry. This sorting requires only O(dm + kd) time using the radix-sort algorithm, a linear time sorting algorithm that is applicable to the integer bin indices [6]. The histogram pyramid that results is high-dimensional, but very sparse, with only O(m log D) non-zero entries that need to be stored.

The complexity of K_Δ is O(dm log D), since computing the intersection values for histograms that have been sorted by bin index requires time linear in the number of non-zero entries (not the number of actual bins). Generating multiple pyramid matches with randomly shifted grids simply scales the complexity by T, the constant number of shifts. All together, the complexity of computing both the pyramids and kernel values is O(T dm log D). In contrast, current approaches have polynomial dependence on m, which limits the practicality of large input sizes. See Table 1 for complexity comparisons.
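As an illustration of why the intersection cost is linear in the number of occupied bins (and not the nominal number of bins), the sketch below intersects two histograms stored as (bin index, count) lists that are already sorted by bin index, as the radix-sort pass above would leave them. The list-based representation and function name are illustrative choices, not the authors' implementation.

```python
from typing import List, Tuple

def sorted_intersection(A: List[Tuple[tuple, int]], B: List[Tuple[tuple, int]]) -> int:
    """I(A, B) computed with a single merge pass over sorted sparse entries.

    Each list holds (bin_index, count) pairs for occupied bins only, so the
    cost is linear in the number of non-zero entries.
    """
    total, a, b = 0, 0, 0
    while a < len(A) and b < len(B):
        if A[a][0] == B[b][0]:
            total += min(A[a][1], B[b][1])
            a += 1
            b += 1
        elif A[a][0] < B[b][0]:
            a += 1
        else:
            b += 1
    return total
```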
4. Results

In this section we show that in simulation the pyramid match kernel approximates the best partial matching of feature sets, and then we report on object recognition experiments with baseline comparisons to other methods.

4.1. Approximate Partial Matchings

As described in Section 3, the pyramid match approximates the optimal correspondence-based matching between two feature sets. While for the case of equal cardinalities it reduces to an L1 norm in a space that is known to strictly approximate the optimal bijective matching [10], empirically we find the pyramid kernel approximates the optimal partial matching of unequal cardinality sets.

We conducted an experiment to evaluate how close the correspondences implicitly assigned by the pyramid match are to the true optimal correspondences – the matching that results in the maximal summed similarity between corresponding points. We compared our kernel's outputs to those produced by the optimal partial matching obtained via a linear programming solution to the transportation problem [19]. (This optimal solution requires time exponential in the number of features in the worst case, although it often exhibits polynomial-time behavior in practice. In contrast, the pyramid kernel's complexity is only linear in the number of features.)
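For small sets, this optimal partial matching reference can also be computed directly; the sketch below is one illustrative way to do so, using SciPy's linear_sum_assignment to map every point of the smaller set to a distinct point of the larger set. The paper itself obtains the optimum via a linear programming solution to the transportation problem [19], so this is a stand-in baseline, not the authors' procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_partial_match_cost(Y: np.ndarray, Z: np.ndarray) -> float:
    """Least-cost partial matching: each point of the smaller set is matched
    to a distinct point of the larger set, minimizing summed L2 distances."""
    if len(Y) > len(Z):
        Y, Z = Z, Y                        # make Y the smaller set
    cost = np.linalg.norm(Y[:, None, :] - Z[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)   # optimal one-to-one assignment
    return cost[rows, cols].sum()

# Example in the spirit of the simulation: random 2-D sets of unequal size.
rng = np.random.default_rng(0)
y = rng.uniform(1, 1000, size=(int(rng.integers(5, 101)), 2))
z = rng.uniform(1, 1000, size=(100, 2))
print(optimal_partial_match_cost(y, z))
```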
We generated two data sets, each with 100 point sets containing 2-D points with values uniformly distributed between one and 1000. In one data set, each point set had equal cardinalities (100 points each), while in the other cardinalities varied randomly from 5 to 100. Figure 3 shows the results of 10,000 pairwise set-to-set comparisons computed according to the correspondences produced by the optimal matching, the pyramid match with T = 1, and the L1 embedding of [10], respectively, for each of these sets. Note that in these figures we plot distance (inverse similarity), and the values were sorted according to the optimal measure's magnitudes for visualization purposes.

Figure 3 (plots of distance for the L1 embedding, the pyramid match, and the optimal measure; left: approximation of the optimal bijective matching, right: approximation of the optimal partial matching): The pyramid match approximates the optimal correspondences, even for sets of unequal cardinalities (right). See text for details. (This figure is best viewed in color.)

This figure shows that our method does indeed find matchings that are consistently on par with the optimal solution. In the equal cardinality case (plot on left), both the pyramid match and the L1 embedding produce good approximations; both are on average less than 9% away from the optimal measure.

However, more importantly, the pyramid match can also approximate the partial matching for the unequal cardinality case (plot on right): its matchings continue to follow the optimal matching's trend since it does not penalize outliers, whereas the L1 embedding fails because it requires all points to match to something. Our method is again on average less than 9% away from the optimal matching's measure for the unequal cardinality case, while the L1 matching has an average error of 400%. Space constraints do not permit their inclusion, but additional experiments have shown that this trend continues for larger dimensions.

4.2. Object Recognition

For our object recognition experiments we use SVM classifiers, which are trained by specifying the matrix of kernel values between all pairs of training examples. The kernel's similarity values determine the examples' relative positions in an embedded space, and quadratic programming is used to find the optimal separating hyperplane between the two classes in this space. We use the implementation given by [4]. When kernel matrices have dominant diagonals we use the transformation suggested in [26]: a sub-polynomial kernel is applied to the original kernel values, followed by an empirical kernel mapping that embeds the distance measure into a feature space.
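As an illustration of this setup (not the implementation of [4], which is LIBSVM), a pyramid match Gram matrix can be supplied to any SVM package that accepts precomputed kernels; the sketch below uses scikit-learn, with K_train, K_test, and y_train as hypothetical placeholders for the kernel matrices and labels.

```python
import numpy as np
from sklearn.svm import SVC

# K_train: (n_train, n_train) pyramid match values between training examples.
# K_test:  (n_test, n_train) pyramid match values between test and training examples.
# y_train: class labels for the training examples.
def train_and_predict(K_train: np.ndarray, y_train: np.ndarray, K_test: np.ndarray):
    clf = SVC(kernel="precomputed")   # the Gram matrix is supplied directly
    clf.fit(K_train, y_train)
    return clf.predict(K_test)
```

Note that scikit-learn's default multi-class strategy (one-vs-one) differs from the one-versus-all classifiers used in the experiments below.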
Local affine- or scale-invariant feature descriptors extracted from a sparse set of interest points in an image have been shown to be an effective, compact representation (e.g., [13, 15]). This is a good context in which to test our kernel function, since such local features have no inherent ordering, and it is expected that the number of features will vary across examples. In the following we experiment with two publicly available databases and demonstrate that our method achieves comparable object recognition performance at a significantly lower computational cost than other state-of-the-art approaches. All run-times reported below include the time needed to compute both the pyramids and the weighted intersections.

A performance evaluation given in [7] compares the methods of [12, 27, 25] in the context of an object categorization task using images from the publicly available ETH-80 database (http://www.vision.ethz.ch/projects/categorization/). The experiment uses eight object classes, with 10 unique objects and five widely separated views of each, for a total of 400 images. A Harris detector is used to find interest points in each image, and various local descriptors (SIFT [13], JET, patches) are used to compose the feature sets. A one-versus-all SVM classifier is trained for each kernel type, and performance is measured via cross-validation, where all five views of an object are held out at once. Note that no instances of a test object are ever present in the training set, so this is a categorization task (as opposed to recognition of the same object).

The experiments show the polynomial-time methods of [25] and [12] performing best, with a classification rate of 74% using on average 40 SIFT features per image [7]. Using 120 interest points, the Bhattacharyya kernel [12] achieves 85% accuracy. However, the study also concluded that the cubic complexity of the method given in [12] made it impractical to use the desired number of features.

We evaluated our method on this same subset of the ETH-80 database under the same conditions provided in [7], and it achieved a recognition rate of 83% using PCA-SIFT [11] features from all Harris-detected interest points (averages 153 points per image) and T = 8. Restricting ourselves to an average of 40 interest points yields a recognition rate of 73%. Thus our method performs comparably to the others at their best for this data set, but is much more efficient than those tested above, requiring time only linear in the number of features.

In fact, the ability of a kernel to handle large numbers of features can be critical to its success. An interest operator may be tuned to select only the most "salient" features, but in our experiments we found that the various approaches' recognition rates always benefitted from having larger numbers of features per image with which to judge similarity. Figure 4 depicts the run-time versus recognition accuracy of our method as compared to the kernel of [25] (called the "match" kernel), which has O(dm^2) complexity. Each point in the figure represents one experiment; the saliency threshold of the Harris interest operator was adjusted to generate varying numbers of features, thus trading off accuracy versus run-time. Computing a kernel matrix for the same data is significantly faster with the pyramid match kernel, and for similar run-times our method produces much better recognition results.

Figure 4 (plot of recognition accuracy (%) versus time to generate a 400 x 400 kernel matrix (sec) on ETH-80 images, for the match kernel and the pyramid match kernel): Allowing the same run-time, the pyramid match kernel (with T = 1) produces better recognition rates than an approach that computes pairwise distances between features in order to match them. See text for details.

We also tested our method with a challenging database of 101 objects recently developed at Caltech (http://www.vision.caltech.edu/Image_Datasets/Caltech101/). This database was obtained using Google Image Search, and the images contain significant clutter, occlusions, and intra-class appearance variation. We used the pyramid match kernel with a one-versus-all SVM classifier on the latest version of the database (which does not contain duplicated images). We used the SIFT detector of [13] and 10-dimensional PCA-SIFT descriptors [11] to form the input feature sets, which ranged in size from 14 to 4,118 features, with an average of 454 features per image. We set T = 2. We trained our algorithm with 30 unsegmented images per object class; all detected interest point features were included in the input sets. This is an advantage of our approach: since it seeks the best correspondence with some subset of the images' features, it handles unsegmented, cluttered data well.

Eight runs using randomly selected training sets yielded a recognition rate of 43% on the remainder of the database examples. Note that chance performance would be 1%. For this data, performing a single image matching with our method (computing four pyramids and two kernel values) on average required only 0.05 seconds.
5. Conclusions

We have developed a new fast kernel function that is suitable for discriminative classification with unordered sets of local features. Our pyramid match kernel approximates the optimal partial matching by computing a weighted intersection over multi-resolution histograms, and requires time linear in the number of features. The kernel is robust to clutter since it does not penalize the presence of extra features, respects the co-occurrence statistics inherent in the input sets, and is provably positive-definite. We have applied our kernel to SVM-based object recognition tasks, and demonstrated recognition performance with accuracy comparable to current methods, but at a much lower computational cost.

Acknowledgments

We would like to thank John Lee for his help running experiments for this paper, and members of the MIT Vision Interface group and Mark Stephenson for reading earlier drafts.

References

[1] S. Belongie, J. Malik, and J. Puzicha. Shape Matching and Object Recognition Using Shape Contexts. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(4):509-522, April 2002.
[2] A. Berg, T. Berg, and J. Malik. Shape Matching and Object Recognition using Low Distortion Correspondences. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, San Diego, CA, June 2005.
[3] S. Boughorbel, J.-P. Tarel, and F. Fleuret. Non-Mercer Kernels for SVM Object Recognition. In British Machine Vision Conference, London, UK, Sept 2004.
[4] C. Chang and C. Lin. LIBSVM: a library for SVMs, 2001.
[5] O. Chapelle, P. Haffner, and V. Vapnik. SVMs for Histogram-Based Image Classification. Transactions on Neural Networks, 10(5), Sept 1999.
[6] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, 1990.
[7] J. Eichhorn and O. Chapelle. Object Categorization with SVM: Kernels for Local Features. Technical report, MPI for Biological Cybernetics, 2004.
[8] K. Grauman and T. Darrell. Fast Contour Matching Using Approximate Earth Mover's Distance. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Washington, D.C., June 2004.
[9] E. Hadjidemetriou, M. Grossberg, and S. Nayar. Multiresolution Histograms and their Use for Recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26(7):831-847, July 2004.
[10] P. Indyk and N. Thaper. Fast Image Retrieval via Embeddings. In 3rd Intl Wkshp on Statistical and Computational Theories of Vision, Nice, France, Oct 2003.
[11] Y. Ke and R. Sukthankar. PCA-SIFT: A More Distinctive Representation for Local Image Descriptors. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Washington, D.C., June 2004.
[12] R. Kondor and T. Jebara. A Kernel Between Sets of Vectors. In Proceedings of International Conference on Machine Learning, Washington, D.C., Aug 2003.
[13] D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91-110, Jan 2004.
[14] S. Lyu. Mercer Kernels for Object Recognition with Local Features. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, San Diego, CA, June 2005.
[15] K. Mikolajczyk and C. Schmid. Indexing Based on Scale Invariant Interest Points. In Proc. IEEE International Conf. on Computer Vision, Vancouver, Canada, July 2001.
[16] P. Moreno, P. Ho, and N. Vasconcelos. A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications. In NIPS, Vancouver, Dec 2003.
[17] F. Odone, A. Barla, and A. Verri. Building Kernels from Binary Strings for Image Matching. IEEE Trans. on Image Processing, 14(2):169-180, Feb 2005.
[18] D. Roobaert and M. Van Hulle. View-Based 3D Object Recognition with Support Vector Machines. In IEEE Intl Workshop on Neural Networks for Signal Processing, Madison, WI, Aug 1999.
[19] Y. Rubner, C. Tomasi, and L. Guibas. The Earth Mover's Distance as a Metric for Image Retrieval. International Journal of Computer Vision, 40(2):99-121, 2000.
[20] A. Shashua and T. Hazan. Algebraic Set Kernels with Application to Inference Over Local Image Representations. In NIPS, Vancouver, Canada, Dec 2005.
[21] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[22] J. Sivic and A. Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proc. IEEE International Conf. on Computer Vision, Nice, Oct 2003.
[23] M. Swain and D. Ballard. Color Indexing. International Journal of Computer Vision, 7(1):11-32, 1991.
[24] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.
[25] C. Wallraven, B. Caputo, and A. Graf. Recognition with Local Features: the Kernel Recipe. In Proc. IEEE International Conf. on Computer Vision, Nice, France, Oct 2003.
[26] J. Weston, B. Scholkopf, E. Eskin, C. Leslie, and W. Noble. Dealing with Large Diagonals in Kernel Matrices. In Principles of Data Mining and Knowledge Discovery, volume 243 of SLNCS, 2002.
[27] L. Wolf and A. Shashua. Learning Over Sets Using Kernel Principal Angles. Journal of Machine Learning Research, 4:913-931, Dec 2003.
[28] H. Zhang and J. Malik. Learning a Discriminative Classifier Using Shape Context Distances. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Madison, WI, June 2003.
