this illustrative example, even though
we may measure high similarity
between the greenery portions of both
the cockatoo and koala images, the
dissimilarity constraint serves to refocus the metric on the other (distinct)
parts of the images.
In the case of the pyramid match,
the weights associated with matches
at different pyramid levels can be
treated as learnable parameters.
While fixing the weights according to
the bin diameters gives the most accurate approximation of true inter-feature distances in a geometric sense,
when we have some annotated images
available, we can directly learn the
weights that will best map same-class
images close together. 18
The idea is that the best matching function is specific to the class
of images, and is inherently defined
by the variability a given class exhibits. For example, to distinguish one
skyscraper from another, we might
expect same-class examples to contain some very tight local feature correspondences, whereas to distinguish
all skyscrapers from koalas, we expect
feature matches to occur at greater
distances even among same-class
examples. While the same type of
image feature may be equally relevant
in both situations, what is unique is
the distance at which similarity is significant for that feature. Therefore, by
learning the reward (weight) associated with each matching level in the
pyramid, we can automatically determine how close feature matches must
be in order to be considered significant for a given object class.
To achieve this intuition, we
observe that the PMK can be written as
a weighted sum of base kernels, where
each base kernel is the similarity computed at a given bin resolution. We
thus can compute the weights using a
form of kernel alignment, 3 where we
find the optimal combination of kernel matrices that most closely mimics
the “ideal” kernel on the training data,
that is, the one that gives maximal
similarity values for in-class examples
and minimal values for out-of-class
examples (see Jain et al. 18 for details).
We have also shown how image
retrieval can benefit from learning the
Mahalanobis parameterization for several distinct base metrics, including
h
figure 6. Query hashes.
Database
h
…
if hash functions guarantee a high
probability of collision for similar images under a metric of interest, one can
search a large database in sublinear
time via lsh techniques. the query
hashes directly to bins with similar
examples, and only that subset needs
to be searched. We have designed lsh
functions for matching, learned distances, and arbitrary kernel functions.
hash table
Query
10010
?
10101
10111
…
matching functions. 19 Given points
{x1, ..., xn}, with xi Î d, a positive-definite d × d matrix A parameterizes
the squared Mahalanobis distance:
dA(xi, xj) = (xi - xj)T A(xi - xj). ( 1)
A generalized inner product measures the pairwise similarity associated with that distance: SA(xixj) = x T i Axj.
Thus for a kernel K(xi, xj) = f(xi) Tf(xj),
the parameters transform the inner
product in the implicit feature space
as f(xi)TAf(xj). Given a set of inter-example distance constraints, one
can directly learn a matrix A to yield
a measure that is more accurate for
a given problem. We use the efficient
method of Davis et al. 7 because it is
kernelizable. This method optimizes
the parameters of A so as to minimize
how much that matrix diverges from
an initial user-provided “base” parameterization, while satisfying constraints that require small distances
between examples specified as similar, and large distances between pairs
specified as dissimilar.
Figure 5b shows the significant
retrieval accuracy gains achieved
when we learn image metrics using
two matching-based kernels as the
base metrics. The first kernel is the
PMK, the approximate matching mea-
sure defined here. The second kernel
is defined in Zhang et al., 33 and uses
exhaustive comparisons between
features to compute a one-to-many
match based on both descriptor and
positional agreement; we refer to it
as CORR for “correspondence.” For
this dataset of 101 object types, note
that chance performance would yield
an accuracy rate of only 1%. Both
base metrics do the most they can by
matching the local image features;
the learned parameters adapt those
metrics to better reflect the side-infor-
mation specifying a handful of images
from each class that ought to be near
(or far) from the others.
searching image collections
in sublinear time
Now that we have designed effective
similarity measures, how will image
search scale? We must be able to use
these metrics to query a very large image
database—potentially on the order of
millions of examples or more. Certainly,
a naive linear scan that compares the
query against every database image is