or eight-connected neighborhoods
around each pixel and often involve a
contrast-sensitive smoothness prior34
that discourages adjacent pixels from
taking different labels when the pixels
are similar in color.
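As a rough, hypothetical sketch (not the formulation of any particular system), the following Python code computes contrast-sensitive penalties for the horizontal and vertical edges of a four-connected pixel grid, using the common exponentiated color-difference form; the function name, the `beta` heuristic, and the `weight` parameter are assumptions made here for concreteness.

```python
import numpy as np

def contrast_sensitive_weights(img, beta=None, weight=1.0):
    """Hypothetical sketch of a contrast-sensitive, Potts-style pairwise prior."""
    img = np.asarray(img, dtype=np.float64)  # H x W x 3 color image
    # Squared color difference between each pixel and its right / lower neighbor.
    dh = np.sum((img[:, 1:] - img[:, :-1]) ** 2, axis=-1)   # horizontal edges
    dv = np.sum((img[1:, :] - img[:-1, :]) ** 2, axis=-1)   # vertical edges
    if beta is None:
        # One common heuristic: inverse of twice the mean squared color difference.
        beta = 1.0 / (2.0 * np.mean(np.concatenate([dh.ravel(), dv.ravel()])) + 1e-8)
    # Penalty paid only when two neighbors take different labels:
    # large where colors are similar, small across strong color edges.
    return weight * np.exp(-beta * dh), weight * np.exp(-beta * dv)
```

In a sketch like this, the returned edge weights would scale the penalty for assigning different labels to neighboring pixels, so label changes are cheap across image edges and expensive inside homogeneous regions.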
Moreover, many well-labeled datasets are readily available, and many researchers use them to develop and compare scene-understanding algorithms. To give a flavor of what can be achieved, consider the following results, obtained with the Darwin software framework, on two standard datasets:
The Stanford Background Dataset (SBD)10 consisting of 715 images of rural, urban, and harbor scenes.
Images are labeled from two different
label sets: the first captures semantic
class and includes seven background
classes (sky, tree, road, grass, water,
building, and mountain) and a single
foreground object class; the second
captures scene geometry (sky, vertical, and horizontal). Each image pixel
is allocated two labels, one semantic
and one geometric; and
The Microsoft Research Cambridge
(MSRC) dataset34 consisting of 591 images.
Pixels are labeled with one of 23 different classes. However, because the horse and mountain classes occur only rarely, they are often discarded. Pixels not belonging to one of the remaining 21 categories are ignored during training and evaluation. One drawback of this dataset is that the ground-truth labeling is rough and often incorrect near object boundaries. Nevertheless, the
dataset contains a diverse set of images
and is widely used.
As scene-understanding research matures, larger and more diverse datasets are becoming more important for applying existing scene-understanding algorithms and inspiring new ones. The PASCAL Visual Object Classes (VOC) dataset6 is a very large collection of images annotated with object-bounding boxes and pixelwise segmentation masks for 20 different (foreground) object categories. It contains approximately 20,000 images organized into numerous challenges, with training, validation, and evaluation image sets pre-specified.
Another large dataset of interest to
scene-understanding researchers is
the SIFT Flow dataset,29 a subset of outdoor images from the LabelMe image repository (http://labelme.csail.mit.edu), which contains 2,688 images
annotated using 33 diverse object and
background categories. Performing well on both these datasets requires a combination of many of the techniques discussed in this article.
Accuracy of scene-understanding algorithms can be evaluated by many measures, including sophisticated boundary-quality metrics and intersection-over-union (Jaccard) scores.
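For concreteness, a per-class intersection-over-union score can be computed from two integer label maps as in the sketch below; the function name and the handling of classes absent from both maps are choices made here, not specified in the text.

```python
import numpy as np

def jaccard_score(pred, truth, label):
    """Intersection-over-union (Jaccard) score for one class label."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    p, t = (pred == label), (truth == label)
    union = np.logical_or(p, t).sum()
    if union == 0:
        return float("nan")  # class absent from both prediction and ground truth
    return float(np.logical_and(p, t).sum()) / float(union)
```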
The simplest measure computes the
percentage of pixels that were correctly
labeled by the model on a “hold out,”
or separate, set of images, referred to
as the “test set” or “evaluation set.” As
is standard practice when evaluating
machine-learning algorithms, these
images should not be viewed during
training of the model parameters. Formally, we can write
$$\mathrm{acc}_{\text{pixelwise}} = \frac{1}{n} \sum_{i=1}^{n} [\![\,\hat{y}_i = y_i\,]\!] \qquad (8)$$
where $\hat{y}_i$ is the label for pixel $i$ predicted by the algorithm, $y_i$ is the ground-truth label for pixel $i$, and $[\![\cdot]\!]$ is the indicator function taking value one when its argument is true and zero otherwise.
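A minimal sketch of Equation 8 over NumPy label maps might look as follows; the function name is illustrative only.

```python
import numpy as np

def pixelwise_accuracy(pred, truth):
    """Equation 8: fraction of pixels whose predicted label equals the ground truth."""
    return float(np.mean(np.asarray(pred) == np.asarray(truth)))
```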
An alternative evaluation metric that better accounts for performance on rare categories is class-averaged accuracy,
$$\mathrm{acc}_{\text{class-avg}} = \frac{1}{|\mathcal{L}|} \sum_{\ell \in \mathcal{L}} \frac{\sum_{i=1}^{n} [\![\,(\hat{y}_i = \ell) \wedge (y_i = \ell)\,]\!]}{\sum_{i=1}^{n} [\![\,y_i = \ell\,]\!]} \qquad (9)$$
The different accuracy measures defined
by Equation 8 and Equation 9 are often
referred to in statistics as “micro averaging” and “macro averaging,” respectively.
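A corresponding sketch of the macro-averaged measure in Equation 9 follows; labels that never occur in the ground truth are skipped here to avoid division by zero, a boundary case the equation leaves implicit, and the function name is again an assumption.

```python
import numpy as np

def class_averaged_accuracy(pred, truth, labels):
    """Equation 9: per-class recall averaged over the label set."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    per_class = []
    for label in labels:
        mask = (truth == label)          # ground-truth pixels of this class
        if not mask.any():
            continue                     # skip labels absent from the ground truth
        per_class.append(float(np.mean(pred[mask] == label)))
    return float(np.mean(per_class))
```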
State-of-the-art performance on the
semantic categories of the MSRC and
Stanford Background datasets is approximately 86% and 77% pixelwise
accuracy, respectively; class-averaged
accuracies are typically 5%–10% less.
On larger datasets, performance can
be quite poor without top-down and
contextual cues, especially on the less
frequently occurring classes.
Illustrating the effects of different aspects of a scene-understanding
model, Figure 7 includes results on an
example image from the MSRC dataset. Classifying pixels independently
(left results column) produces very
noisy predictions, as shown. Adding
a pairwise smoothness term helps remove the noise (right side). However,
when the features are weak (top row),
the algorithm cannot correctly classify the object in the image, though
the background is easily identified using local features. Stronger features,
including local and global cues, as
discussed, coupled with the pairwise
smoothness term, produce the correct labeling.
[Figure 6. Combining bottom-up and top-down information in a CRF framework. The bottom-up process forms region hypotheses and maps them to semantic labels; the top-down process represents part-based shape constraints (a shape mask prior) on object instances.]