actually feature the landmark. This
matching algorithm is computationally expensive but is easily parallelized.
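The article does not include code, but a minimal sketch of how the pairwise matching stage might be distributed across worker processes, assuming per-image SIFT descriptors have already been extracted (OpenCV and Python's process pool are our choices here, not the authors' system):

from concurrent.futures import ProcessPoolExecutor
from itertools import combinations

import cv2

def match_pair(job):
    # One independent unit of work: match the SIFT descriptors of a
    # single image pair, keeping only matches that pass Lowe's ratio test.
    (i, j), desc_i, desc_j = job
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = []
    for knn in matcher.knnMatch(desc_i, desc_j, k=2):
        if len(knn) == 2 and knn[0].distance < 0.75 * knn[1].distance:
            good.append((knn[0].queryIdx, knn[0].trainIdx))
    return (i, j), good

def match_all_pairs(descriptors):
    # descriptors: one float32 SIFT descriptor array per image.
    jobs = [((i, j), descriptors[i], descriptors[j])
            for i, j in combinations(range(len(descriptors)), 2)]
    with ProcessPoolExecutor() as pool:
        # Image pairs are independent, so they match in parallel across cores.
        return dict(pool.map(match_pair, jobs))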
Figure 5 illustrates SIFT feature matching in more detail. Given the input photo (a) at the top left of the figure, SIFT extracts a number of features, each consisting of a salient location and scale in the image, as well as a high-dimensional descriptor summarizing the appearance of that feature. A subset of detected feature locations, depicted as yellow circles, is superimposed on the image (b) on the top right. The image is
shown again on the bottom (c) next to an image from a similar viewpoint; we can match SIFT features to find a correspondence between these images. Because of the robustness of SIFT, most of these matches are correct.

Figure 8. Pose network—can you tell where the second photo was taken relative to the first? (The figure shows an image pair with matching features, the relative camera poses, and geotags given as latitude/longitude coordinates, such as 50°05′ N 14°25′ E.)

Figure 9. 3D reconstructions.
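As a concrete illustration of this stage (a sketch using OpenCV's SIFT implementation, not the authors' code; the file names are placeholders), extracting and matching features for one image pair looks roughly like this:

import cv2

img1 = cv2.imread("photo1.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder file names
img2 = cv2.imread("photo2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
# Each keypoint carries a location and scale; each descriptor is a
# 128-dimensional summary of local appearance around that keypoint.
kp1, desc1 = sift.detectAndCompute(img1, None)
kp2, desc2 = sift.detectAndCompute(img2, None)

# Keep a match only when its best candidate is clearly better than the
# second best (Lowe's ratio test); this filtering is one reason most
# surviving matches are correct.
matcher = cv2.BFMatcher(cv2.NORM_L2)
pairs = matcher.knnMatch(desc1, desc2, k=2)
good = [m for m, n in (p for p in pairs if len(p) == 2)
        if m.distance < 0.75 * n.distance]
print(len(good), "putative correspondences")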
Once we have the network of visual
connectivity between images, we need
to estimate the precise position and
orientation of the camera used to capture each image—that is, exactly where
the photographer was standing and in
which direction he or she was pointing
the camera—as well as the 3D coordinates of every SIFT point matched in
the images. It turns out that this can be posed as an enormous optimization problem, in which the 3D location of each scene point and the pose of each camera are estimated subject to the constraints induced by the same scene points appearing in multiple images.
This optimization tries to find the camera and scene geometry that, when related to each other through perspective
projection, most closely agree with the
2D SIFT matches found between the
images. The problem is difficult to solve, not only because of its size but also because the objective function is highly nonlinear.
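In the computer vision literature this joint estimation is known as structure from motion, and the nonlinear refinement step as bundle adjustment. The article does not write the objective down explicitly; in common notation (ours, not the authors'), it can be sketched as

\min_{\{R_i,\, t_i\},\, \{X_j\}} \sum_{(i,j) \in \mathcal{O}} \left\| \pi\!\left( R_i X_j + t_i \right) - x_{ij} \right\|^2

where R_i and t_i are the rotation and translation of camera i, X_j is the 3D position of scene point j, x_{ij} is the 2D location at which point j was matched in image i, \pi is perspective projection, and \mathcal{O} is the set of observations. The nonlinearity arises from \pi, which divides by depth, and from the rotation parameters.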
However, information in the visual network, together with absolute location information from geotags, can help with this reconstruction task. Consider a pair of visually overlapping
images, such as the two photographs
shown in the upper left of Figure 8. Using the computed SIFT matches and
geometric reasoning algorithms, we
can determine the geometric relationship between these two images—say, that image 2 was taken to the left of image 1, with the camera rotated slightly clockwise. We can compute such a relative camera pose for each edge in the network. By combining many such pairwise relationships, we can build up a network of geometric information on top of the set of images, such as the small network shown on the right of Figure 8.
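For a single edge, this geometric reasoning is commonly done by fitting an essential matrix to the matched feature locations and decomposing it. A minimal sketch using OpenCV's RANSAC-based solver (the intrinsics matrix K is an assumed input, not something the article specifies):

import cv2

def relative_pose(pts1, pts2, K):
    # pts1, pts2: Nx2 float arrays of matched SIFT feature locations.
    # RANSAC discards remaining mismatches while fitting the essential
    # matrix, which encodes the relative geometry of the two views.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    # Decompose into rotation R and translation direction t; from two
    # views alone the translation is known only up to scale.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t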
have geotags for some images, shown
as latitude/longitude coordinates.
Unfortunately, these geotags are very
noisy and can at times be hundreds of
meters away from a photo’s true location. On the other hand, some geotags
are quite accurate. If we knew which
were good, we could propagate locations from those photos to their neigh-