has shown that simple global image features, known as the
“gist” of the image, are sufficient to provide robust predictions about the presence and location of different object
categories. Such features are fast to compute, and provide
information that is useful for many classes and locations
simultaneously.
In this paper, which is an extension of our previous
work,8, 9, 17 we present a simple approach for combining standard sliding-window object detection systems, which use
local, “bottom up” image features, with systems that predict the presence and location of object categories based
on global, or “top-down,” image features. These global features serve to define the context in which the object detection is happening. The importance of context is illustrated
in Figure 1, which shows that the same black “blob,” when
placed in different surroundings, can be interpreted as a
plate or bottle on the table, a cell phone, a pedestrian or car,
or even a shoe. Another example is shown in Figure 2: it is
easy to infer that there is very probably a computer monitor
behind the blacked-out region of the image.
Figure 1. In the presence of image degradation (e.g., blur), object
recognition is strongly influenced by contextual information. The
visual system makes assumptions regarding an object's identity based
on its size and location in the scene. In these images, the same black
blob can be interpreted as a plate, bottle, cell phone, car, pedestrian,
or shoe, depending on the context. (Each circled blob has identical
pixels, but in some cases has been rotated.)
Figure 2. What is hidden behind the mask? In this example, context
is so strong that one can reliably infer that the hidden object is a
computer monitor.
108 Communications of the ACM | March 2010 | Vol. 53 | No. 3
We are not the first to point out the importance of context in computer vision. For example, Strat and Fischler
emphasized its importance in their 1991 paper.16 However,
there are two key differences between our approach and
previous work. First, in early work, such as that of Strat and Fischler,16 the systems
consisted of hand-engineered if–then rules, whereas more
recent systems, ours included, rely on statistical models that are fit to data.
Second, most other approaches define the context in terms
of other objects6, 13, 14, 18; but this introduces a chicken-and-egg
problem: to detect an object of type 1 you first have to
detect an object of type 2. By contrast, we propose a hierarchical approach, in which we define the context in terms of
an overall scene category. This can be reliably inferred using
global image features. Conditioned on the scene category,
we assume that objects are independent. While not strictly
true, this assumption results in a simple yet effective approach, as we will
show below.
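The conditional independence assumption can be made concrete with a small numerical sketch. The code below is an illustration only, not the model as fitted in this paper: the function names, the two scene categories, the two object classes, and every probability value are hypothetical. It assumes we already have a scene posterior P(S | g) computed from the global features g, and a table P(O_i = 1 | S) of object presence probabilities per scene.

```python
import numpy as np


def object_presence_given_gist(p_scene_given_gist, p_object_given_scene):
    """Marginal presence probability of each object given the gist.

    p_scene_given_gist:   shape (S,), P(S = s | g), must sum to 1.
    p_object_given_scene: shape (S, N), P(O_i = 1 | S = s).
    Returns shape (N,): P(O_i = 1 | g) = sum_s P(O_i = 1 | s) P(s | g).
    """
    p_scene = np.asarray(p_scene_given_gist, dtype=float)
    p_obj = np.asarray(p_object_given_scene, dtype=float)
    if p_scene.ndim != 1 or p_obj.ndim != 2 or p_obj.shape[0] != p_scene.shape[0]:
        raise ValueError("expected shapes (S,) and (S, N)")
    if not np.isclose(p_scene.sum(), 1.0):
        raise ValueError("scene posterior must sum to 1")
    return p_scene @ p_obj


def joint_presence_given_gist(p_scene_given_gist, p_object_given_scene, presence):
    """Joint probability of one presence/absence configuration.

    Uses the assumption that objects are independent given the scene:
    P(o_1, ..., o_N | g) = sum_s P(s | g) * prod_i P(o_i | s).
    """
    p_scene = np.asarray(p_scene_given_gist, dtype=float)
    p_obj = np.asarray(p_object_given_scene, dtype=float)
    present = np.asarray(presence, dtype=bool)
    # Pick P(O_i = 1 | s) where the object is present, else P(O_i = 0 | s).
    per_object = np.where(present, p_obj, 1.0 - p_obj)  # shape (S, N)
    return float(p_scene @ per_object.prod(axis=1))


# Hypothetical numbers: scenes = (office, street), objects = (monitor, car).
p_scene = [0.9, 0.1]
p_obj = [[0.80, 0.01],   # office: monitor likely, car unlikely
         [0.02, 0.70]]   # street: monitor unlikely, car likely

marginals = object_presence_given_gist(p_scene, p_obj)
both = joint_presence_given_gist(p_scene, p_obj, [True, True])
```

With these made-up values, the joint probability of seeing both a monitor and a car differs from the product of the two marginals. Objects that are independent given the scene are therefore still correlated once the scene is marginalized out, which is how the scene category carries contextual information between object classes without one detector depending on another.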
In the following sections, we describe the different components of our model. We will start by showing how we can
represent contextual information without using objects as
an intermediate representation. Then we will show how that
representation can be integrated with an object detector.
2. GLOBAL IMAGE FEATURES: THE GIST OF AN IMAGE
In the same way that an object can be recognized without
decomposing it into a set of nameable parts (e.g., the most
successful face detectors do not try to detect the eyes and
mouth first; instead, they search for less semantically meaningful features), scenes can also be recognized without necessarily decomposing them into objects. The advantage of
this is that it provides an additional source of contextual
information for object
recognition. As suggested by Oliva and Schyns and by Oliva and
Torralba,10, 11 it is possible to build a global representation of
the scene that bypasses object identities, in which the scene
is represented as a single entity. Recent work in computer
vision has highlighted the importance of global scene representations for scene recognition1, 7, 11 and as a source of contextual information.3, 9, 17 These representations are based
on computing statistics of low-level features (such as oriented
edges and vector-quantized image patches, similar to representations available in early visual areas) over fixed image
regions. One example of a global image representation is the