with the same class as the surrounding background. The second duck is mislabeled as a boat. While this may seem absurd to humans, it is a reasonable mistake for an algorithm: the context is consistent (boats also co-occur with water), and the model was trained on only a handful of images containing ducks.
We have explored scene understanding as a pixel-labeling task, surveying a number of technical challenges facing scene-understanding algorithms and current trends toward addressing them. Active research along these lines is supported by the growing availability of high-quality datasets; for example, better low-level feature representations are now being learned automatically from large volumes of data rather than engineered by hand.26 Researchers are also looking toward mid-level visual cues (also called "attributes") to overcome some of the limitations of scarce training data; for example, knowing an object has feathers narrows the range of possible labels for the object.
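The attribute idea can be sketched as a simple label-filtering step. The attribute table and label names below are hypothetical illustrations, not from any particular system:

```python
# Hypothetical table mapping attributes to the labels compatible with them.
ATTRIBUTE_LABELS = {
    "has_feathers": {"bird", "duck", "chicken"},
    "is_rigid": {"boat", "car", "building"},
}

def candidate_labels(all_labels, observed_attributes):
    """Intersect the full label set with each observed attribute's
    compatible labels, shrinking the set of candidates to classify among."""
    labels = set(all_labels)
    for attr in observed_attributes:
        labels &= ATTRIBUTE_LABELS.get(attr, labels)
    return labels
```

Detecting "has_feathers" on a region immediately rules out boat and water, even if the classifier has seen few duck exemplars.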
Moreover, improved learning algorithms based on structured-prediction models31 mean large numbers of parameters can be tuned simultaneously. This results not only in better parameters but also enables the use of richer models (such as those with parameterized higher-order terms).
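One simple instance of structured prediction is the structured perceptron, sketched below for chain-structured labeling (an illustrative example, not necessarily the method of reference 31; for brevity it learns only the pairwise transition weights, with the unary scores held fixed):

```python
import numpy as np

def viterbi(emis, trans):
    """Best label sequence under unary scores emis (T x K) and
    pairwise transition scores trans (K x K)."""
    T, K = emis.shape
    score = emis[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans + emis[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def perceptron_step(emis, gold, trans, lr=1.0):
    """Structured-perceptron update: decode with the current parameters,
    then push trans toward the gold transitions and away from the
    predicted ones. Returns the pre-update prediction."""
    pred = viterbi(emis, trans)
    for t in range(1, len(gold)):
        trans[gold[t - 1], gold[t]] += lr
        trans[pred[t - 1], pred[t]] -= lr
    return pred
```

Because decoding scores the whole sequence at once, the unary and pairwise terms are traded off jointly at training time rather than tuned in isolation.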
Other models being studied are hybrid models (such as grid-structured
These examples of semantic segmentation are indicative of more general trends in scene-understanding algorithms. In general, more sophisticated features that incorporate contextual information (such as pixel location and global and shape-based features) perform much better than local appearance features alone. Moreover, CRF models, with their pairwise smoothness priors, improve performance over independent pixel classification, but the benefit shrinks as the sophistication of the features used by the independent classifiers increases. This trade-off is to be expected: richer features both raise the baseline performance and encode contextual information that can act as a surrogate for the smoothness assumption.
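The interplay between independent pixel classification and the pairwise smoothness prior can be sketched with a minimal Potts-model CRF. Iterated conditional modes (ICM) is used here purely for brevity; practical systems typically rely on graph cuts or message passing:

```python
import numpy as np

def icm_denoise(unary, lam=1.0, iters=5):
    """Approximately minimize a Potts-model CRF energy with iterated
    conditional modes (ICM).

    unary : (H, W, K) array of per-pixel label costs (e.g., -log p from
            an independent classifier).
    lam   : weight of the pairwise smoothness term.
    """
    H, W, K = unary.shape
    labels = unary.argmin(axis=2)  # independent per-pixel prediction
    for _ in range(iters):
        for i in range(H):
            for j in range(W):
                cost = unary[i, j].copy()
                # Potts pairwise term: cost lam for each disagreeing 4-neighbor
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        cost += lam * (np.arange(K) != labels[ni, nj])
                labels[i, j] = cost.argmin()
    return labels
```

An isolated pixel whose unary scores weakly favor the wrong label is flipped to agree with its neighbors, exactly the smoothing effect the pairwise prior provides over independent classification.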
The qualitative results from the Darwin software framework (see Figure 8) also highlight a few points. As shown, the accuracy of the predictions is generally good, and the model identifies the boundaries between object categories quite well. The labeling of foreground objects, however, occasionally leaks into the background. This leakage is more prominent in the MSRC results and can be attributed, in part, to rough ground-truth labeling in that dataset. In models that use superpixels, these boundary errors can also be caused by inaccurate over-segmentations.
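How an inaccurate over-segmentation produces such boundary errors can be seen in a toy majority-vote sketch (a hypothetical helper for illustration, not Darwin's actual pipeline):

```python
import numpy as np

def superpixel_vote(pixel_labels, superpixels):
    """Assign every pixel the majority label of its superpixel.

    If a superpixel straddles an object boundary (an inaccurate
    over-segmentation), the minority pixels inside it are forced to the
    majority label, so the predicted boundary 'leaks' across the true one.
    """
    out = pixel_labels.copy()
    for sp in np.unique(superpixels):
        mask = superpixels == sp
        vals, counts = np.unique(pixel_labels[mask], return_counts=True)
        out[mask] = vals[counts.argmax()]
    return out
```

No matter how accurate the per-pixel classifier, the final labeling can only follow boundaries that the superpixels themselves respect.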
An interesting result is the labeling of the ducks in Figure 8 (MSRC,
left column, third row down). Here,
the water is classified correctly, but
both ducks are labeled incorrectly.
The white duck is mislabeled as water by the model due to both confusion of its local appearance with that
of water and a strong smoothness
assumption preferring to label it
Figure 7. Example semantic segmentation for an image from the MSRC dataset.
Shown are the original image (left) and color-coded pixel labels (right) from different scene-understanding models.
The models vary by features (local appearance versus local and global appearance) and model complexity (independent
pixel classification versus a CRF model with pairwise term); see Figure 8 for the related color legend.
Figure 8. Representative results on two standard scene-understanding datasets produced by the Darwin software library; shown are the original image and the predicted class-label overlay; best viewed in color.