depth perception. Fortunately, we have two eyes, and our
brains can estimate depth by correlating points between
the two images we perceive. This gives us some hope that
from multiple photos of a scene, we can recover the shape
of that scene.
Consider the three images of a cube shown in Figure
1a. We do not know where these images were taken, and
we do not know a priori that they depict a specific shape
(in this case, a cube). However, suppose we do know that
the corners of the cube as seen in the images, i.e., the 2D
projections of the 3D corners, are in correspondence: we
know that the 2D dots with the same color correspond to
the same 3D points. This correspondence gives us a powerful set of constraints on the 3D geometry of the cameras
and points.
10 One way to state these constraints is that
given scene geometry (represented with 3D points) and
camera geometry (a 3D position and orientation for each
camera), we can predict where the 2D projections of each
point should be in each image via the equations of perspective projection; we can then compare these projections to
our original measurements.
Concretely, let Xi, i = 1,…, 8 denote the 3D positions of the
corners of the cube and let Rj , cj , and fj , j = 1, 2, 3 denote the
orientation, position, and the focal length of the three cam-
eras. Then if xij is the image of point Xi in image j, we can
write down the image formation equations as
xij = fj ∏ (Rj (Xi – cj)),
( 1)
where P is the projection function: P(x, y, z) = (x/z, y/z). The
Structure from Motion (SfM) problem is to infer Xi, Rj, cj, and fj
from the observations xi j.
The standard way to do this is to formulate the problem as
an optimization problem that minimizes the total squared
reprojection error:
( 2)
Here, i ∼ j indicates that the point Xi is visible in image j. An
example reconstruction, illustrating reprojection error, is
shown in Figure 1. While this toy problem is easily solved,
( 2) is in general a difficult nonlinear least squares problem
with many local minima, and has millions of parameters in
large scenes. Section 5 describes the various techniques we
use to solve ( 2) at scale.
figure 1. (a) three images of a cube, from unknown viewpoints.
the color-coded dots on the corners show the known
correspondence between certain 2D points in these images;
each set of dots of the same color are projections of the same
3D point. (b) a candidate reconstruction of the 3D points
(larger colored points) and cameras for the image collection
shown above. each image in the collection has an associated
position and orientation. this reconstruction largely agrees
with the observed 2D projections; when the red 3D point is
projected into each image (depicted with the dotted lines),
the predicted projection is close to the observed one. in the case
of camera 3 the projection is slightly off; the resulting residual
is called the reprojection error, and is what we seek to minimize.
(a)
Camera 3
Camera 1
(b)
Camera 2
3. the coRResPonDence PRoBLem
In the cube example above, we assumed that we were given
as input a set of 2D correspondences between the input
images. In reality, these correspondences are not given
and also have to be estimated from the images. How can we
do so automatically? This is the correspondence problem.
To solve the correspondence problem between two
images, we might consider every patch in the first image and
find the most similar patch in the second image. However,
this algorithm quickly runs into problems. First, many
image patches might be very difficult to match. For instance,
a patch of clear blue sky is very challenging to match
unambiguously across two images, as it looks like any other
patch of sky, i.e., it is not distinct. Second, what happens if
the second image is taken at a different time of day or with a
different level of zoom?
The last 10 years have seen the development of algorithms for taking an image and detecting the most distinctive, repeatable features in that image. Such feature
detectors not only reduce an image representation to
a more manageable size, but also produce much more
robust features for matching, invariant to many kinds
of image transformations. One of the most successful of these detectors is SIFT (Scale-Invariant Feature
Transform).
13
Once we detect features in an image, we can match
features across image pairs by finding similar-looking
features. While exhaustive matching of all features
between two images is prohibitively expensive, excellent
results have been reported with approximate nearest
neighbor search18; we use the ANN library.
3 For each pair
of images, the features of one image are inserted into a
k-d tree and the features from the other image are used as
queries. For each query if the nearest neighbor returned
by ANN is sufficiently far away from the next nearest
neighbor, it is declared a match.
13