Grauman Darrell Iccv05
Abstract
Discriminative learning is challenging when examples are sets of features, and the sets vary in cardinality and lack any sort of meaningful ordering. Kernel-based classification methods can learn complex decision boundaries, but a kernel over unordered set inputs must somehow solve for correspondences, generally a computationally expensive task that becomes impractical for large set sizes. We present a new fast kernel function which maps unordered feature sets to multi-resolution histograms and computes a weighted histogram intersection in this space. This “pyramid match” computation is linear in the number of features, and it implicitly finds correspondences based on the finest resolution histogram cell where a matched pair first appears. Since the kernel does not penalize the presence of extra features, it is robust to clutter. We show the kernel function is positive-definite, making it valid for use in learning algorithms whose optimal solutions are guaranteed only for Mercer kernels. We demonstrate our algorithm on object recognition tasks and show it to be accurate and dramatically faster than current approaches.

1. Introduction

A variety of representations used in computer vision consist of unordered sets of features or parts, where each set varies in cardinality, and the correspondence between the features across each set is unknown. For instance, an image may be described by a set of detected local affine-invariant regions, a shape may be described by a set of local descriptors defined at each edge point, or a person’s face may be represented by a set of patches with different facial parts. In such cases, one set of feature vectors denotes a single instance of a particular class of interest (an object, scene, shape, face, etc.), and it is expected that the number of features will vary across examples due to viewpoint changes, occlusions, or inconsistent detections by the interest operator.

To perform learning tasks like categorization or recognition with such representations is challenging. While generative methods have had some success, kernel-based discriminative methods are known to represent complex decision boundaries very efficiently and generalize well to unseen data [24, 21]. For example, the Support Vector Machine (SVM) is a widely used approach to discriminative classification that finds the optimal separating hyperplane between two classes. Kernel functions, which measure similarity between inputs, introduce non-linearities to the decision functions; the kernel non-linearly maps two examples from the input space to the inner product in some feature space. However, conventional kernel-based algorithms are designed to operate on fixed-length vector inputs, where each vector entry corresponds to a particular global attribute for that instance; the commonly used general-purpose kernels defined on $\mathbb{R}^n$ inputs (e.g., Gaussian RBF, polynomial) are not applicable in the space of vector sets.

In this work we propose a pyramid match kernel, a new kernel function over unordered feature sets that allows them to be used effectively and efficiently in kernel-based learning methods. Each feature set is mapped to a multi-resolution histogram that preserves the individual features’ distinctness at the finest level. The histogram pyramids are then compared using a weighted histogram intersection computation, which we show defines an implicit correspondence based on the finest resolution histogram cell where a matched pair first appears (see Figure 1).

Figure 1: The pyramid match kernel intersects histogram pyramids formed over local features, approximating the optimal correspondences between the sets’ features.

The similarity measured by the pyramid match approximates the similarity measured by the optimal correspon-
dences between feature sets of unequal cardinality (i.e., the partial matching that optimally maps points in the lower cardinality set to some subset of the points in the larger set, such that the summed similarities between matched points is maximal). Our kernel is extremely efficient and can be computed in time that is linear in the sets’ cardinality. We show that our kernel function is positive-definite, meaning that it is appropriate to use with learning methods that guarantee convergence to a unique optimum only for positive-definite kernels (e.g., SVMs).

Because it does not penalize the presence of superfluous data points, the proposed kernel is robust to clutter. As we will show, this translates into the ability to handle unsegmented images with varying backgrounds or occlusions. The kernel also respects the co-occurrence relations inherent in the input sets: rather than matching features in a set individually, ignoring potential dependencies conveyed by features within one set, our similarity measure captures the features’ joint statistics.

Other approaches to this problem have recently been proposed [25, 14, 3, 12, 27, 16, 20], but unfortunately each of these techniques suffers from some number of the following drawbacks: computational complexities that make large feature set sizes infeasible; limitations to parametric distributions which may not adequately describe the data; kernels that are not positive-definite (do not guarantee unique solutions for an SVM); limitations to sets of equal size; and failure to account for dependencies within feature sets.

Our method addresses all of these issues, resulting in a kernel appropriate for comparing unordered, variable-length feature sets within any existing kernel-based learning paradigm. We demonstrate our algorithm with object recognition tasks and show that its accuracy is comparable to current approaches, while requiring significantly less computation time.

2. Related Work

In this section, we review related work on discriminative classification with sets of features, using kernels and SVMs for recognition, and multi-resolution image representations.

Object recognition is a challenging problem that requires strong generalization ability from a classifier in order to cope with the broad variety in illumination, viewpoint, occlusions, clutter, intra-class appearances, and deformations that images of the same object or object class will exhibit. While researchers have shown promising results applying SVMs to object recognition, they have generally used global image features: ordered features of equal length measured from the image as a whole, such as color or grayscale histograms or vectors of raw pixel data [5, 18, 17]. Such global representations are known to be sensitive to real-world imaging conditions, such as occlusions, pose changes, or image noise.

Recent work has shown that local features invariant to common image transformations (e.g., SIFT [13]) are a powerful representation for recognition, because the features can be reliably detected and matched across instances of the same object or scene under different viewpoints, poses, or lighting conditions. Most approaches, however, perform recognition with local feature representations using nearest-neighbor (e.g., [1, 8, 22, 2]) or voting-based classifiers followed by an alignment step (e.g., [13, 15]); both may be impractical for large training sets, since their classification times increase with the number of training examples. An SVM, on the other hand, identifies a sparse subset of the training examples (the support vectors) to delineate a decision boundary.

Kernel-based learning algorithms, which include SVMs, kernel PCA, or Gaussian Processes, have become well-established tools that are useful in a variety of contexts, including discriminative classification, regression, density estimation, and clustering [21]. More recently, attention has been focused on developing specialized kernels that can more fully leverage these tools for situations where the data cannot be naturally represented by a Euclidean vector space, such as graphs, strings, or trees.

Method            | Complexity    | C | P | M | U
Match [25]        | O(dm^2)       | - | - | x | x
Exponent [14]     | O(dm^2)       | - | x | x | x
Greedy [3]        | O(dm^2)       | x | - | x | x
Princ. ang. [27]  | O(dm^3)       | x | x | - | -
Bhattach.’s [12]  | O(dm^3)       | x | x | - | x
KL-div. [16]      | O(dm^2)       | x | - | - | x
Pyramid           | O(dm log D)   | x | x | x | x

Table 1: Comparing kernel approaches to matching unordered sets. Columns show each method’s computational cost and whether its kernel captures co-occurrences (C), is positive-definite (P), does not assume a parametric model (M), and can handle sets of unequal cardinality (U). d is vector dimension, m is maximum set cardinality, and D is diameter of vector space. “Pyramid” refers to the proposed kernel.

Several researchers have designed similarity measures that operate on sets of unordered features. See Table 1 for a concise comparison of the approaches. The authors of [25] propose a kernel that averages over the similarities of the best matching feature found for each feature member within the other set. The use of the “max” operator in this kernel makes it non-Mercer (i.e., not positive-definite; see Section 3), and thus it lacks convergence guarantees when used in an SVM. A similar kernel is given in [14], which also considers all possible feature matchings but raises the similarity between each pair of features to a given power. Both [25] and [14] have a computational complexity that is quadratic in the number of features. Furthermore, both
match each feature in a set independently, ignoring potentially useful co-occurrence information. In contrast, our kernel captures the joint statistics of co-occurring features by matching them concurrently as a set.

The method given in [3] is based on finding a sub-optimal matching between two sets using a greedy heuristic; although this results in a non-Mercer kernel, the authors provide a means of tuning the kernel hyperparameter so as to limit the probability that a given kernel matrix is not positive-definite. The authors of [27] measure similarity in terms of the principal angle between the two linear subspaces spanned by two sets’ vector elements. This kernel is only positive-definite for sets of equal cardinality, and its complexity is cubic in the number of features. In [20], an algebraic kernel is used to combine similarities given by vector-based kernels, with the weighting chosen to reflect whether the features are in alignment (ordered). When set cardinalities vary, inputs are padded with zeros so as to form equally-sized matrices.

In [12], a Gaussian is fit to each set of vectors, and then the kernel value between two sets is the Bhattacharyya affinity between their Gaussian distributions. As noted by the authors, the method is constrained to using a Gaussian model in order to have a closed form solution. In practice, the method in [12] is also limited to sets with small cardinality, because its complexity is cubic in the number of features. Similarly, the authors of [16] fit a Gaussian to a feature set, and then compare sets using KL-divergence as a distance measure. Unlike the kernels of [12] and [16], which are based on parametric models that assume inputs will fit a certain form, our method is model-free and maintains the distinct data points in the representation.

An alternative approach when dealing with unordered set data is to designate prototypical examples from each class, and then represent examples in terms of their distances to each prototype; standard algorithms that handle vectors in a Euclidean space are then applicable. The authors of [28] build such a classifier for handwritten digits, and use the shape context distance of [1] as the measure of similarity. The issues faced by such a prototype-based method are determining which examples should serve as prototypes, choosing how many there should be, and updating the prototypes properly when new types of data are encountered.

Our feature representation is based on a multi-resolution histogram, or “pyramid”, which is computed by binning data points into discrete regions of increasingly larger size. Single-level histograms have been used in various visual recognition systems, one of the first being that of [23], where the intersection of global color histograms was used to compare images. Pyramids have been shown to be a useful representation in a wide variety of image processing tasks; see [9] for a summary.

In [10], multi-resolution histograms are compared with $L_1$ distance to approximate a least-cost matching of equal-mass global color histograms for nearest neighbor image retrievals. This work inspired our use of a similar representation for point sets. However, unlike [10], our method builds a discriminative classifier, and it compares histograms with a weighted intersection rather than $L_1$. Our method allows inputs to have unequal cardinalities and thus enables partial matchings, which is important in practice for handling clutter and unsegmented images.

We believe ours is the first work to advocate for the use of a histogram pyramid as an explicit discriminative feature formed over sets, and the first to show its connection to optimal partial matching when used with a hierarchical weighted histogram intersection similarity measure.

3. Approach

Kernel-based learning algorithms [21, 24] are founded on the idea of embedding data into a Euclidean space, and then seeking linear relations among the embedded data. For example, an SVM finds the optimal separating hyperplane between two classes in an embedded space (also referred to as the feature space). A kernel function $K : X \times X \to \mathbb{R}$ serves to map pairs of data points in an input space $X$ to their inner product in the embedding space $F$, thereby evaluating the similarities between all points and determining their relative positions. Linear relations are sought in the embedded space, but a decision boundary may still be non-linear in the input space, depending on the choice of a feature mapping function $\Phi : X \to F$.

The main contribution of this work is a new kernel function based on implicit correspondences that enables discriminative classification for unordered, variable-length sets of vectors. The kernel is provably positive-definite. The main advantages of our algorithm are its efficiency, its use of implicit correspondences that respect the joint statistics of co-occurring features, and its resistance to clutter or “superfluous” data points.

The basic idea of our method is to map sets of features to multi-resolution histograms, and then compare the histograms with a weighted histogram intersection measure in order to approximate the similarity of the best partial matching between the feature sets. We call the proposed kernel a “pyramid match kernel” because input sets are converted to multi-resolution histograms.

3.1. The Pyramid Match Kernel

We consider an input space $X$ of sets of $d$-dimensional feature vectors that are bounded by a sphere of diameter $D$ and whose minimum inter-vector distance is $\sqrt{d}/2$ (see footnote 1):

$$X = \left\{ \mathbf{x} \;\middle|\; \mathbf{x} = \left\{ [f_1^1, \ldots, f_d^1], \ldots, [f_1^{m_x}, \ldots, f_d^{m_x}] \right\} \right\}, \tag{1}$$

where $m_x$ varies across instances in $X$.

Footnote 1: This may be enforced by scaling the data appropriately.
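Footnote 1 leaves the scaling step unspecified. As a rough illustration only, the sketch below rescales a collection of feature sets so that the smallest distance between any two pooled vectors equals $\sqrt{d}/2$. The function name `scale_feature_sets`, the decision to pool vectors across all sets, and the brute-force pairwise scan are assumptions made for this example; they are not taken from the paper.

```python
import math
from itertools import combinations


def scale_feature_sets(feature_sets):
    """Uniformly rescale all sets so the minimum pairwise distance is sqrt(d)/2.

    feature_sets: list of sets, each a list of equal-length coordinate tuples.
    Hypothetical helper; a quadratic scan is used for clarity, not speed.
    """
    # Pool every vector from every set (an assumption: the bound is taken globally).
    points = [p for s in feature_sets for p in s]
    d = len(points[0])
    min_dist = min(math.dist(p, q) for p, q in combinations(points, 2))
    if min_dist == 0:
        raise ValueError("duplicate vectors cannot be separated by scaling")
    # One global factor keeps the relative geometry of all sets intact.
    factor = (math.sqrt(d) / 2) / min_dist
    return [[tuple(factor * c for c in p) for p in s] for s in feature_sets]


example = [[(0.0, 0.0), (3.0, 4.0)], [(1.0, 0.0)]]
scaled = scale_feature_sets(example)
```

A single global factor is used so that distances between sets are scaled consistently; the diameter $D$ of the rescaled data grows by the same factor.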
The feature extraction function $\Psi$ is defined as:

$$\Psi(\mathbf{x}) = [H_{-1}(\mathbf{x}), H_0(\mathbf{x}), \ldots, H_L(\mathbf{x})], \tag{2}$$

where $L = \lceil \log_2 D \rceil$, $\mathbf{x} \in X$, $H_i(\mathbf{x})$ is a histogram vector formed over data $\mathbf{x}$ using $d$-dimensional bins of side length $2^i$, and $H_i(\mathbf{x})$ has a dimension $r_i = \left( \frac{D}{2^i \sqrt{d}} \right)^d$. In other words, $\Psi(\mathbf{x})$ is a vector of concatenated histograms, where each subsequent component histogram has bins that double in size (in all $d$ dimensions) compared to the previous one. The bins in the finest-level histogram $H_{-1}$ are small enough that each $d$-dimensional data point from sets in $X$ falls into its own bin, and then the bin size increases until all data points from sets in $X$ fall into a single bin at level $L$.

The pyramid match kernel $K_\Delta$ measures similarity between point sets based on implicit correspondences found within this multi-resolution histogram space. The similarity between two input sets is defined as the weighted sum of the number of feature matchings found at each level of the pyramid formed by $\Psi$:

$$K_\Delta(\Psi(\mathbf{y}), \Psi(\mathbf{z})) = \sum_{i=0}^{L} w_i N_i, \tag{3}$$

where $N_i$ signifies the number of newly matched pairs at level $i$. A new match is defined as a pair of features that were not in correspondence at any finer resolution level.

The kernel implicitly finds correspondences between point sets, if we consider two points matched once they fall into the same histogram bin (starting at the finest resolution level where each point is guaranteed to be in its own bin). The matching is equivalent to a hierarchical process: vectors not found to correspond at a high resolution have the opportunity to be matched at lower resolutions. For example, in Figure 2, there are two points matched at the finest scale, two new matches at the medium scale, and one at the coarsest scale. $K_\Delta$’s output value reflects the overall similarity of the matching: each newly matched pair at level $i$ contributes a value $w_i$ that is proportional to how similar two points matching at that level must be, as determined by the bin size. Note that the sum in Eqn. 3 starts with index $i = 0$, because the definition of $\Psi$ ensures that no points match at level $i = -1$.

To calculate $N_i$, the kernel makes use of a histogram intersection function $I$, which measures the “overlap” between two histograms’ bins:

$$I(A, B) = \sum_{j=1}^{r} \min\left( A^{(j)}, B^{(j)} \right), \tag{4}$$

where $A$ and $B$ are histograms with $r$ bins, and $A^{(j)}$ denotes the count of the $j$th bin of $A$.

[Figure 2 panels: (a) Point sets; (b) Histogram pyramids; (c) Intersections]

Figure 2: A pyramid match determines a partial correspondence by matching points once they fall into the same histogram bin. In this example, two 1-D feature sets are used to form two histogram pyramids. Each row corresponds to a pyramid level. $H_{-1}$ is not pictured here because no matches are formed at the finest level. In (a), the set $\mathbf{y}$ is on the left side, and the set $\mathbf{z}$ is on the right. (Points are distributed along the vertical axis, and these same points are repeated at each level.) Light dotted lines are bin boundaries, bold dashed lines indicate a pair matched at this level, and bold solid lines indicate a match already formed at a finer resolution level. In (b) multi-resolution histograms are shown, with bin counts along the horizontal axis. In (c) the intersection pyramid between the histograms in (b) is shown. $K_\Delta$ uses this to measure how many new matches occurred at each level. $I_i$ refers to $I(H_i(\mathbf{y}), H_i(\mathbf{z}))$. Here, $I_i = 2, 4, 5$ across levels, and therefore the numbers of new matches found at each level are $N_i = 2, 2, 1$. The sum over $N_i$, weighted by $w_i = 1, \frac{1}{2}, \frac{1}{4}$, gives the pyramid match similarity.

Histogram intersection effectively counts the number of points in two sets which match at a given quantization level, i.e., fall into the same bin. To calculate the number of newly matched pairs $N_i$ induced at level $i$, it is sufficient to compute the difference between successive histogram levels’ intersections:

$$N_i = I(H_i(\mathbf{y}), H_i(\mathbf{z})) - I(H_{i-1}(\mathbf{y}), H_{i-1}(\mathbf{z})), \tag{5}$$

where $H_i$ refers to the $i$th component histogram generated by $\Psi$ in Eqn. 2. Note that the kernel is not searching explicitly for similar points: it never computes distances between the vectors in each set. Instead, it simply uses the change in intersection values at each histogram level to count the matches as they occur.

The number of new matches found at each level in the pyramid is weighted according to the size of that histogram’s bins: matches made within larger bins are weighted less than those found in smaller bins. Since the largest diagonal of a $d$-dimensional hypercube bin with sides of length $2^i$ has length $2^i \sqrt{d}$, the maximal distance
between any two points in one bin doubles at each increasingly coarser histogram in the pyramid. Thus, the number of new matches induced at level $i$ is weighted by $\frac{1}{2^i}$ to reflect the (worst-case) similarity of points matched at that level. Intuitively, this means that similarity between vectors (features in $\mathbf{y}$ and $\mathbf{z}$) at a finer resolution, where features are most distinct, is rewarded more heavily than similarity between vectors at a coarser level.

From Eqns. 3, 4, and 5, we define the (un-normalized) pyramid match kernel function:

$$\tilde{K}_\Delta(\Psi(\mathbf{y}), \Psi(\mathbf{z})) = \sum_{i=0}^{L} \frac{1}{2^i} \Big( I(H_i(\mathbf{y}), H_i(\mathbf{z})) - I(H_{i-1}(\mathbf{y}), H_{i-1}(\mathbf{z})) \Big), \tag{6}$$

where $\mathbf{y}, \mathbf{z} \in X$, and $H_i(\mathbf{x})$ is the $i$th histogram in $\Psi(\mathbf{x})$. We normalize this value by the product of each input’s self-similarity to avoid favoring larger input sets, arriving at the final kernel value $K_\Delta(P, Q) = \frac{1}{\sqrt{C}} \tilde{K}_\Delta(P, Q)$, where $C = \tilde{K}_\Delta(P, P)\, \tilde{K}_\Delta(Q, Q)$.

In order to alleviate quantization effects that may arise due to the discrete histogram bins, we can combine the kernel values resulting from multiple ($T$) pyramid matches formed under different multi-resolution histograms with randomly shifted bins. Each dimension of each of the $T$ pyramids is shifted by an amount chosen uniformly at random between 0 and $D$. This yields $T$ feature mappings $\Psi_1, \ldots, \Psi_T$ that are applied as in Eqn. 2 to map an input set $\mathbf{y}$ to $T$ multi-resolution histograms: $[\Psi_1(\mathbf{y}), \ldots, \Psi_T(\mathbf{y})]$. For inputs $\mathbf{y}$ and $\mathbf{z}$, the combined kernel value is then $\sum_{j=1}^{T} K_\Delta(\Psi_j(\mathbf{y}), \Psi_j(\mathbf{z}))$.

3.2. Partial Match Correspondences

Our kernel allows sets of unequal cardinalities, and therefore it enables partial matchings, where the points of the smaller set are mapped to some subset of the points in the larger set. Dissimilarity is only judged on the most similar part of the empirical distributions, and superfluous data points are ignored; the result is a robust similarity measure that accommodates inputs expected to contain extraneous vector entries. This is a common situation when recognizing objects in images, due for instance to background variations, clutter, or changes in object pose that cause different subsets of features to be visible. Thus, the proposed kernel is equipped to handle unsegmented examples, as we will demonstrate in Section 4.

By construction, the pyramid match offers an approximation of the optimal correspondence-based matching between two feature sets, in which the overall similarity between corresponding points is maximized. When input sets have equal cardinalities, histogram intersection can be reduced to an $L_1$ distance: $I(H(\mathbf{y}), H(\mathbf{z})) = m - \frac{1}{2} \| H(\mathbf{y}) - H(\mathbf{z}) \|_{L_1}$ if $m = |\mathbf{y}| = |\mathbf{z}|$ [23]. Intersection over the pyramid with weights set to $w_i = \frac{1}{2^i}$ then strictly approximates the optimal bipartite matching [10]. With variable cardinalities no similar proof is available, but we show empirically below that the intersection of multi-resolution histograms approximates the best partial matching both in simulation and in practice.

Since the pyramid match defines correspondences across entire sets simultaneously, it inherently accounts for dependencies between various features occurring in one set. In contrast, previous approaches have used each feature in a set to independently index into the second set; this ignores possibly useful information that is inherent in the co-occurrence of a set of distinctive features, and it fails to distinguish between instances where an object has varying numbers of similar features since multiple features may be matched to a single feature in the other set [25, 14].

3.3. Satisfying Mercer’s Condition

Only positive semi-definite kernels guarantee an optimal solution to kernel-based algorithms based on convex optimization, including SVMs. According to Mercer’s theorem, a kernel $K$ is positive semi-definite if and only if

$$K(\mathbf{x}_i, \mathbf{x}_j) = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle, \quad \forall \mathbf{x}_i, \mathbf{x}_j \in X, \tag{7}$$

where $\langle \cdot \rangle$ denotes a scalar dot product. This ensures that the kernel corresponds to an inner product in some feature space, where kernel methods can search for linear relations [21].

Histogram intersection on single resolution histograms over multi-dimensional data is a positive-definite similarity function [17]. Using this construct and the closure properties of valid kernel functions, we can show that the pyramid match kernel is a Mercer kernel. The definition given in Eqn. 6 is algebraically equivalent to

$$\tilde{K}_\Delta(\Psi(\mathbf{y}), \Psi(\mathbf{z})) = \frac{\min(|\mathbf{y}|, |\mathbf{z}|)}{2^L} + \sum_{i=0}^{L-1} \frac{1}{2^{i+1}} I(H_i(\mathbf{y}), H_i(\mathbf{z})), \tag{8}$$

since $I(H_{-1}(\mathbf{y}), H_{-1}(\mathbf{z})) = 0$, and $I(H_L(\mathbf{y}), H_L(\mathbf{z})) = \min(|\mathbf{y}|, |\mathbf{z}|)$ by the construction of the pyramid. Given that Mercer kernels are closed under both addition and scaling by a positive constant [21], we only need to show that the minimum cardinality between two sets ($\min(|\mathbf{y}|, |\mathbf{z}|)$) corresponds to a positive semi-definite kernel.

The cardinality of an input set $\mathbf{x}$ can be encoded as a binary vector containing $|\mathbf{x}|$ ones followed by $Z - |\mathbf{x}|$ zeros, where $Z$ is the maximum cardinality of any set. The inner product between two such expansions is equivalent to the cardinality of the smaller set, thus satisfying Mercer’s condition. Note that this binary expansion and the one in [17] only serve to prove positive-definiteness and are never computed explicitly. Therefore, $K_\Delta$ is valid for use in existing learning methods that require Mercer kernels.
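The equivalence between the difference form of Eqn. 6 and the sum form of Eqn. 8, and the binary-expansion argument for the minimum cardinality, can be checked numerically on a small case. The sketch below is only an illustration under stated assumptions: the helper names (`pyramid`, `intersection`, `pyramid_match`, `pyramid_match_eqn8`, `cardinality_embedding`) are hypothetical, bins are unshifted and aligned to the origin, histograms are stored sparsely as `Counter` objects, and the 1-D example points were chosen by hand so that no point of `y` shares a level $-1$ bin with a point of `z` and all points fall into one bin at level $L = 3$.

```python
import math
from collections import Counter


def pyramid(points, L):
    """Sparse histograms H_{-1}, ..., H_L; level i uses bins of side 2**i."""
    return [Counter(tuple(math.floor(c / 2.0 ** i) for c in p) for p in points)
            for i in range(-1, L + 1)]


def intersection(a, b):
    """Histogram intersection of Eqn. 4, summed over the non-empty bins of a."""
    return sum(min(n, b.get(k, 0)) for k, n in a.items())


def pyramid_match(y, z, L):
    """Un-normalized kernel in the difference form of Eqn. 6."""
    inter = [intersection(a, b) for a, b in zip(pyramid(y, L), pyramid(z, L))]
    # inter[k] holds the intersection at level k - 1, so level i is inter[i + 1].
    return sum((inter[i + 1] - inter[i]) / 2.0 ** i for i in range(L + 1))


def pyramid_match_eqn8(y, z, L):
    """The same value written in the sum form of Eqn. 8."""
    inter = [intersection(a, b) for a, b in zip(pyramid(y, L), pyramid(z, L))]
    return (min(len(y), len(z)) / 2.0 ** L
            + sum(inter[i + 1] / 2.0 ** (i + 1) for i in range(L)))


def cardinality_embedding(n, Z):
    """Binary expansion from Section 3.3: n ones followed by Z - n zeros."""
    return [1] * n + [0] * (Z - n)


# Hand-picked 1-D example sets with values in [0, 8), so L = 3.
y = [(0.2,), (2.2,), (5.2,)]
z = [(0.7,), (3.2,), (6.7,), (7.2,)]
k6 = pyramid_match(y, z, 3)       # 1.75 for this data
k8 = pyramid_match_eqn8(y, z, 3)  # also 1.75

# Inner product of the two binary expansions equals min(|y|, |z|) = 3.
dot = sum(a * b for a, b in zip(cardinality_embedding(len(y), 4),
                                cardinality_embedding(len(z), 4)))
```

For these sets the level-wise intersections are 0, 1, 2, 3, 3 (levels $-1$ through 3), so one new match appears at each of levels 0, 1, and 2, and both forms give $1 + \frac{1}{2} + \frac{1}{4} = 1.75$. The two forms agree only when the conditions stated after Eqn. 8 hold, which is why the example data were constructed to satisfy them.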
3.4. Efficiency

The time required to compute $\Psi$ for an input set with [...] single dimension. (Typically $m > k$.) The bin coordinates [...] scan of the $m$ input vectors; these entries are sorted by the bin indices and the bin counts for all entries with the same index are summed to form one entry. This sorting requires only $O(dm + kd)$ time using the radix-sort algorithm, a linear time sorting algorithm that is applicable to the integer bin indices [6]. The histogram pyramid that results is high-dimensional, but very sparse, with only $O(m \log D)$ non-zero entries that need to be stored.

The complexity of $K_\Delta$ is $O(dm \log D)$, since computing the intersection values for histograms that have been sorted by bin index requires time linear in the number of non-zero entries (not the number of actual bins). Generating multiple pyramid matches with randomly shifted grids simply scales the complexity by $T$, the constant number of shifts. All together, the complexity of computing both the pyramids and kernel values is $O(T dm \log D)$. In contrast, current approaches have polynomial dependence on $m$, which limits the practicality of large input sizes. See Table 1 for complexity comparisons.

4. Results

In this section we show that in simulation the pyramid match kernel approximates the best partial matching of feature sets, and then we report on object recognition experiments with baseline comparisons to other methods.

4.1. Approximate Partial Matchings

As described in Section 3, the pyramid match approximates the optimal correspondence-based matching between two feature sets. While for the case of equal cardinalities it reduces to an $L_1$ norm in a space that is known to strictly approximate the optimal bijective matching [10], empirically we find the pyramid kernel approximates the optimal partial matching of unequal cardinality sets.

We conducted an experiment to evaluate how close the correspondences implicitly assigned by the pyramid match are to the true optimal correspondences, that is, the matching that results in the maximal summed similarity between corresponding points. We compared our kernel’s outputs to those produced by the optimal partial matching obtained via a linear programming solution to the transportation problem [19] (see footnote 2).

Footnote 2: This optimal solution requires time exponential in the number of features in the worst case, although it often exhibits polynomial-time behavior in practice. In contrast, the pyramid kernel’s complexity is only linear in the number of features.

[Figure 3 plots: left panel “Approximation of the optimal bijective matching”, right panel “Approximation of the optimal partial matching”; vertical axis: Distance; curves: L1, Pyramid match, Optimal]

Figure 3: The pyramid match approximates the optimal correspondences, even for sets of unequal cardinalities (right). See text for details. (This figure is best viewed in color.)

We generated two data sets, each with 100 point sets containing 2-D points with values uniformly distributed between one and 1000. In one data set, each point set had equal cardinalities (100 points each), while in the other cardinalities varied randomly from 5 to 100. Figure 3 shows the results of 10,000 pairwise set-to-set comparisons computed according to the correspondences produced by the optimal matching, the pyramid match with $T = 1$, and the $L_1$ embedding of [10], respectively, for each of these sets. Note that in these figures we plot distance (inverse similarity), and the values were sorted according to the optimal measure’s magnitudes for visualization purposes.

This figure shows that our method does indeed find matchings that are consistently on par with the optimal solution. In the equal cardinality case (plot on left), both the pyramid match and the $L_1$ embedding produce good approximations; both are on average less than 9% away from the optimal measure.

However, more importantly, the pyramid match can also approximate the partial matching for the unequal cardinality case (plot on right): its matchings continue to follow the optimal matching’s trend since it does not penalize outliers, whereas the $L_1$ embedding fails because it requires all points to match to something. Our method is again on average less than 9% away from the optimal matching’s measure for the unequal cardinality case, while the $L_1$ matching has an average error of 400%. Space constraints do not permit their inclusion, but additional experiments have shown that this trend continues for larger dimensions.

4.2. Object Recognition

For our object recognition experiments we use SVM classifiers, which are trained by specifying the matrix of kernel values between all pairs of training examples. The kernel’s similarity values determine the examples’ relative positions in an embedded space, and quadratic programming is used to find the optimal separating hyperplane between the two classes in this space. We use the implementation given by [4]. When kernel matrices have dominant diagonals we use the transformation suggested in [26]: a sub-polynomial kernel is applied to the original kernel values,
Object recognition on ETH-80 images
85