The problem of track generation can be formulated as
the problem of finding connected components in a graph
where the vertices are the features in all the images and
edges connect matching features. Since the matching
information is stored locally on the compute node where
the matches were computed, the track generation process
is distributed and proceeds in two stages. First, each node
generates tracks from its local matching data. This data is
gathered at the master node and then broadcast over the
network to all the nodes. Second, each node is assigned a
connected component of the match graph (which can be
processed independently of all other components), and
stitches together tracks for that component.
5. cit Y-scaLe sfm
Once the tracks are generated, the next step is to use a SfM
algorithm on each connected component of the match
graph to recover the camera poses and a 3D position for
every track.
Directly solving Equation 2 is a hard nonlinear optimization problem. Most SfM systems for unordered photo
collections are incremental, starting with a small reconstruction, then growing a few images at a time, triangulating new points, and doing one or more rounds of nonlinear
least squares optimization (known as bundle adjustment20)
to minimize the reprojection error. This process is repeated
until no more images can be added. However, due to the
scale of our collections, running such an incremental
approach on all the photos at once was impractical.
To remedy this, we observed that Internet photo collections by their very nature are redundant. Many photographs are taken from nearby viewpoints (e.g., the front of
the Colosseum) and processing all of them does not necessarily add to the reconstruction. Thus, it is preferable
to find and reconstruct a minimal subset of photographs
that capture the essential geometry of the scene (called a
skeletal set in Snavely et al.
19). Once this subset is reconstructed, the remaining images can be added to the reconstruction in one step by estimating each camera’s pose
with respect to known 3D points matched to that image.
This process results in an order of magnitude or more
improvement in performance.
Having reduced the SfM problem to its skeletal set,
the primary bottleneck in the reconstruction process is
the solution of ( 2) using bundle adjustment. Levenberg–
Marquardt (LM) is the algorithm of choice for solving
bundle adjustment problems; the key computational bottleneck in each iteration of LM is the solution of a symmetric positive definite linear system known as the normal
equations.
We developed new high-performance bundle adjust-
ment software that, depending upon the problem size,
chooses between a truncated or an exact step LM algo-
rithm. In the first case, a preconditioned conjugate gra-
dient method is used to approximately solve the normal
equations. In the second case, CHOLMOD,
4 a sparse direct
method for computing Cholesky factorizations, is used.
The first algorithm has low time complexity per iteration,
but uses more LM iterations, while the second converges
faster at the cost of more time and memory per iteration.
The resulting code uses significantly less memory than the
state-of-the-art methods and runs up to an order of magni-
tude faster. The runtime and memory savings depend upon
the sparsity of the linear system involved.
1
6. muLtiVie W steReo
SfM recovers camera poses and 3D points. However, the
reconstructed 3D points are usually sparse, containing only
distinctive image features that match well across photographs. The next stage in 3D reconstruction is to take the
registered images and recover dense and accurate models
using a multiview stereo (MVS) algorithm.
MVS algorithms recover 3D geometric information
much in the same way our visual system perceives depth
by fusing two views. In the MVS setting, we may have many
images that see the same point and could be potentially
used for depth estimation. Figure 3 illustrates how a basic
algorithm estimates a depth value at a single pixel. To
recover a dense model, we estimate depths for every pixel
in every image and then merge the resulting 3D points into
a single model.
For city-scale MVS reconstruction, the number of photos
is well beyond what any standard MVS algorithm can operate on at once due to prohibitive memory consumption.
Therefore, a key task is to group photos into a small number
of manageable sized clusters that can each be used to reconstruct a part of the scene well.
Concretely, if we consider the SfM points as a sparse
proxy for the dense MVS reconstruction, we want a clustering such that
1. Each SfM point is visible from enough images in a
cluster.
2. The total number of clusters is small.
3. The size of each cluster is constrained to be lower than
figure 3. a standard window-based multiview stereo algorithm.
Given a pixel and an image window around it, we hypothesize a finite
number of depths along its viewing ray. at each depth, the window is
projected into the other images, and consistency among textures at
these image projections is evaluated. at the true depth (highlighted
in green), the consistency score is at its maximum.
Hypothesized
depths
Correct
depth
Wndow