This is illustrated as a probabilistic graphical model (see,
e.g., Koller and Friedman⁵) in Figure 5. There is one node
for each random variable: the shaded nodes are observed
(these are deterministic functions of the image), and the
unshaded nodes are hidden or unknown, and need to be
inferred. There is a directed edge into each node from all
the variables it directly depends on. For example, the $g \to S$
arc reflects the scene classifier; the $g \to Y^t$ arc reflects the
location priming based on the gist; the $S \to N^t$ arc reflects
the object counts given the scene category; the $O^t_i \to c^t_i$ arc
reflects the fact that the presence or absence of an object
of type $t$ in patch $i$ affects the detector score or confidence
$c^t_i$; the $O^t_i \to l^t_i$ arc is a deterministic link encoding the
location of patch $i$; the $Y^t \to l^t_i$ arc reflects the $p(l^t_i \mid Y^t, O^t_i)$
term; finally, there are the $O^t_i \to \Sigma^t$ and $N^t \to \Sigma^t$ arcs, which are
simply a trick for enforcing the constraint $N^t = \sum_{i=1}^{D} I(O^t_i = 1)$.
The $\Sigma^t$ node is a dummy node used to enforce the constraint
between the $N^t$ nodes and the $O^t_i$ nodes. Specifically, it is
"clamped" to a fixed state, and we then define
$p(\Sigma^t \mid O^t_{1:D}, N^t = n) = I(\sum_i O^t_i = n)$. Conditional on the
observed child $\Sigma^t$, all the parent nodes, $N^t$ and $O^t_i$, become
correlated due to the "explaining away" phenomenon.⁵
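The effect of the clamped sum node can be illustrated with a small sketch for a single object type. This is not the authors' implementation: the function name, the shared prior `prior_on`, and the likelihood values in the usage note are hypothetical, and the code simply enumerates every configuration of the presence indicators, keeping the count equal to the number of active indicators.

```python
from itertools import product

def posterior_presence(lik_on, lik_off, prior_on, count_prior):
    """Exact posterior p(O_i = 1 | evidence) for one object type, by enumeration.

    lik_on[i]      : p(c_i | O_i = 1), detector-score likelihood (hypothetical values)
    lik_off[i]     : p(c_i | O_i = 0)
    prior_on       : prior p(O_i = 1), shared by all patches (an assumption)
    count_prior[n] : p(N = n | scene) for n = 0..D
    """
    D = len(lik_on)
    marginals = [0.0] * D
    total = 0.0
    for config in product((0, 1), repeat=D):  # all 2^D settings of O_1..O_D
        n = sum(config)
        # The clamped sum node only admits N == sum_i O_i, so N is tied to n.
        weight = count_prior[n]
        for i, o in enumerate(config):
            weight *= prior_on * lik_on[i] if o else (1 - prior_on) * lik_off[i]
        total += weight
        for i, o in enumerate(config):
            if o:
                marginals[i] += weight
    return [m / total for m in marginals]
```

With two patches, `lik_on=[0.9, 0.6]`, `lik_off=[0.1, 0.4]`, `prior_on=0.5`, and a count prior that allows exactly one object (`[0, 1, 0]`), the weaker patch's posterior falls well below the 0.6 it would have under a flat count prior: the stronger detection "explains away" the single allowed object.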
From Figure 5, it is clear that by conditioning on S, we
can perform inference on each type of object independently
in parallel. The time complexity for exact inference in this
model is $O(ST2^D)$, ignoring the cost of running the detectors. (Techniques for quickly evaluating detectors on large
images, using cascades of features, are discussed in Viola
and Jones²⁰.) We can speed up inference in several ways. For
example, we can prune out improbable object categories
(and not run their detectors) if $p(N^t > 0 \mid g)$ is too low, which
is very effective since $g$ is fast to compute. Of the categories
that survive, we can just run their detectors in the primed
region, near $E(Y^t \mid g)$. This will reduce the number of detections $D$ per category. Finally, if necessary, we can use Monte
Carlo inference (such as Gibbs sampling) in the resulting
pruned graphical model to reduce time complexity.
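The first two speed-ups can be sketched as a simple filtering step. This is a minimal illustration under stated assumptions: the function name, the dictionary inputs, and the thresholds `min_presence` and `band_fraction` are hypothetical and do not come from the text, which does not specify how "too low" or "near" are quantified.

```python
def select_detector_regions(p_present, expected_y, image_height,
                            min_presence=0.05, band_fraction=0.25):
    """Choose which detectors to run and the vertical band for each.

    p_present[t]  : p(N^t > 0 | g), predicted from the gist (hypothetical input)
    expected_y[t] : E(Y^t | g), expected vertical location in pixels
    min_presence and band_fraction are illustrative values, not from the paper.
    """
    half_band = band_fraction * image_height / 2
    regions = {}
    for t, p in p_present.items():
        if p < min_presence:
            continue  # prune: this category's detector is never run
        top = max(0, expected_y[t] - half_band)
        bottom = min(image_height, expected_y[t] + half_band)
        regions[t] = (top, bottom)  # run the detector only inside this band
    return regions
```

For a 400-pixel-high image with `p_present={"car": 0.8, "boat": 0.01}` and `expected_y={"car": 300, "boat": 100}`, only the car detector survives, restricted to rows 250 through 350.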
Examples of the integrated system in action are shown in
Figure 6c: We see that location priming, based on the gist,
has down-weighted the scores of the detections in improbable locations, thus eliminating false positives. In the second row, the local detector produces a confident
detection for one car, but the second car produces a low-confidence
detection. As the low-confidence detection falls inside the
predicted region, the confidence of the detection increases.
Note that in this example there are two false alarms that
also happen to fall within the predicted region. In this case,
the overall system will increase the magnitude of the error.
If the detector produces errors that are contextually correct,
the integrated model will not be able to discard them. The
third row shows a different failure of the integrated model. In this case, the structure of the scene makes
the system think that this is a street scene, and it then mistakes
the boats for cars. Despite these sources of error, the performance of the integrated system is substantially better
than the performance of the car detectors in isolation.
For a more quantitative study of the performance of
our method, we used the scenes dataset from Oliva and
Torralba¹¹, consisting of 2688 images covering 8 scene categories (streets, building facades, skyscrapers, highways, mountainous landscapes, coast, beach, and fields). We use half of
the dataset to train the models and the other half for testing.
Figure 7 shows performance on two tasks: object localization and object presence detection. The plots are
precision–recall curves: the horizontal axis (recall) denotes the
percentage of cars in the database that have been detected
for a particular detection threshold, and the vertical axis (precision) is
the percentage of correct detections for the same threshold.
Different points in the graph are achieved by varying the decision threshold. For both tasks, the plot shows the performance of an object detector alone, the performance of
the integrated model, and the performance of an integrated
model with an oracle that provides the true context for each image.
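The way such a curve is traced out by sweeping the decision threshold can be sketched as follows. This is a generic illustration, not the evaluation code used for Figure 7; the function and argument names are hypothetical, and tied scores are not treated specially.

```python
def precision_recall_curve(scores, is_correct, num_ground_truth):
    """Return (recall, precision) pairs, one per detection threshold.

    scores[i]        : detector confidence for detection i
    is_correct[i]    : True if detection i matches a ground-truth car
    num_ground_truth : total number of cars in the test set
    """
    # Lowering the threshold admits detections in order of decreasing score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve = []
    true_positives = 0
    for rank, i in enumerate(order, start=1):
        true_positives += bool(is_correct[i])
        recall = true_positives / num_ground_truth  # fraction of cars found
        precision = true_positives / rank           # fraction of detections that are correct
        curve.append((recall, precision))
    return curve
```

Each point corresponds to accepting all detections whose score is at least that of the detection at the given rank, so the curve moves toward higher recall as the threshold is lowered.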
Figure 5. Integrated system represented as a directed graphical model. We show two object types, $t$ and $t'$, for simplicity. The observed
variables are shaded circles; the unknown variables are clear circles. Variables are defined in the text. The $\Sigma^t$ node is a dummy node used
to enforce the constraint between the $N^t$ nodes and the $O^t_i$ nodes. $O^t_i$ = indicator of presence of object class $t$ in box $i$; $Y^t$ = vertical location of
object class $t$; $N^t$ = number of instances of object class $t$; $l^t_i$ = location of box $i$ for object class $t$; $c^t_i$ = score of box $i$ for object class $t$; $D$ = number
of high-confidence detections; $g$ = gist descriptor; $S$ = scene category.