structure is much more effective for
high-dimensional features.
The center plot shows accuracy as
a function of computation time when
the eight categories of the same dataset
are learned using local feature matches
between images. The plot compares
the performance of the pyramid match
to an exact matching function that
averages the cost between the closest
features in one set to the other. The horizontal axis measures the total training
time, which is directly affected by the
size of the feature sets. To vary the size
of a typical set, we tune the saliency
parameter controlling how many
regions are detected per image. For
both methods, more features lead to
striking accuracy improvements; this
behavior is expected since introducing more features assures better coverage of all potentially relevant image
regions. However, the linear-time pyramid match offers a key advantage in
terms of computational cost, reaching
peak performance for significantly less
computation time.
On the Caltech- 101 benchmark, we
have shown that classifiers employing
the PMK with a variety of features currently yield one of the most accurate
results in the field, 20 with 74% accuracy on the 101-way decision problem
when training with just 15 exemplars
per class. Figure 4c shows example
images from four pairs of categories
in the Caltech- 101 dataset that cause
the most confusion for the pyramid
match: schooner and ketch, lotus and
figure 5. the learned metric.
water lily, gerenuk and kangaroo, and
nautilus and brain. In each row, the
two images on the left have local features that match quite well to the two
on the right, as compared to images
from any of the other 100 classes in
the dataset. Some of these confused
category pairs have rather subtle distinctions in appearance. However,
the case of the gerenuk and kangaroo
reveals a limitation of the completely
local description, as by definition it
fails to capture the significance of
the global contour shapes of the two
objects.
Overall, approaches based on the
pyramid match consistently show
accuracy that is competitive with (or
better than) the state of the art while
requiring dramatically less computation time. This complexity advantage
frees us to consider much richer representations than were previously practical. Methods that compute explicit
correspondences require about one
minute to match a novel example; in
the time that these methods recognize
a single object, the pyramid match can
recognize several hundred. 15 Due to its
flexibility and efficiency, the pyramid
match has been adapted and extended
for use within a number of tasks, such
as scene recognition, 22 video indexing, 6 human action recognition, 24 and
robot localization. 25
Learning image metrics
Thus far, we have considered how to
robustly measure image similarity
in situations where we have no back-
ground knowledge; that is, where the
system only has access to the image
content itself. However, in many cases
the system could also receive external
side-information that might benefit its
comparisons. For example, if provided
with partially annotated image exam-
ples, or if a user wants to enforce simi-
larity between certain types of images,
then we ought to use those constraints
to adapt the similarity measure.
Caltech 101: gains over original (nonlearned) kernels
ML+CORR
ML+PMK
CORR
PMK
(a)
(b)
5 10 15 20 25 30
0
10
20
30
40
50
60
70
Mean accuracy/class (%)
Number of constrained examples per class
(a) by constraining some examples to be similar (green solid line), and others to be dissimilar (red dotted lines), the learned metric refines the original distance
function so that examples are close only when they share the relevant features. (b) retrieval accuracy is improved by replacing two matching-based metrics (PMK
and Corr) with their learned counterparts (Ml + PMK and Ml + Corr).