some of our results to demonstrate
the impact with real image data.
Our approach enables rapid, scalable search for meaningful metrics that
were previously restricted to artificially
modestly sized inputs or databases.
Additionally, we show how minimal
annotations can be exploited to learn
how to compare images more reliably.
Both contributions support the ultimate goal of harnessing the potential
of very large repositories and providing
direct access to visual content.
comparing images with
Local feature matches
Earlier work in content-based image
retrieval focused on global representations that describe each image with
a single vector of attributes, such as a
color histogram, or an ordered list of
intensity values or filter responses.
While vector representations permit
the direct application of standard
distance functions and indexing
structures, they are known to be prohibitively sensitive to realistic image
conditions. For example, consider
stacking the images in Figure 2 one
on top of the other, and then checking the intensity at any given pixel for
each example—it is quite likely that
few of them would be in agreement,
even though each image contains a
koala as its most prominent object.
Much recent work shows that
decomposing an image into its component parts (or so-called “local features”)
grants resilience to image transformations and variations in object appearance. 23, 30 Typically, one either takes a
dense sample of regions at multiple
scales, or else uses an interest operator to identify the most salient regions
in an image. Possible salient points
include pixels marking high contrast
(edges), or points selected for a region’s
repeatability at multiple scales (see
Tuytelaars30 for a survey). Then, for
each detected region, a feature descriptor vector is formed. Descriptors may
be lists of pixel values within a patch, or
histograms of oriented contrast within
the regions, 23 for example. The result
is one set of local appearance or shape
description vectors per image, often
numbering on the order of 2,000 or
more features per image.
The idea behind such representat-
ions is to detect strong similarity
between local portions of related
images, even when the images appear
quite different at the global level.
Local features are more reliable for
several reasons:
the Pyramid match algorithm
To address this challenge, we developed the pyramid match—a linear-time matching function over
unordered feature sets—and showed
how it allows local features to be
used efficiently within the context
of multiple image search and learning problems. 12 The pyramid match
approximates the similarity measured by the optimal partial matching
between feature sets of variable car-dinalities. Because the matching is
partial, some features may be ignored
without penalty to the overall set
similarity. This tolerance makes the
measure robust in situations where
superfluous or “outlier” features may
appear. Note that our work focuses
on the image matching and indexing
aspects of the problem, and is flexible
to the representation choice, that is,
which particular image feature detectors and descriptors are used as input.
We consider a feature space V of
d-dimensional vectors for which the
values have a maximal range D. The
point sets we match will come from
the input space S, which contains
sets of feature vectors drawn from
V: S = {X|X = {x1, …, xm}}, where each-feature xi Î V Í d, and m = |X|. We
can think of each xi as a descriptor for
one of the elliptical image regions on
the koalas in Figure 2. Note that the
point dimension d is fixed for all features in V, but the value of m may vary
across instances in S.
Given point sets X, y Î S, with
|X| £ |y|, the optimal partial
matching p* pairs each point in X
to some unique point in y such
that the total distance between
matched points is minimized: p*
where pi
specifies which point is matched
to xi, and ||.|| 1 denotes the L1
norm. For sets with m features, the
Hungarian algorithm computes the
optimal match in O(m3) time, which