a set of local features extracted from image I at patch i for
class t. The details of the features and classifier that we used
can be found in Torralba et al. 19
For simplicity, in this paper, we select the D most confident detections (after performing local nonmaximum suppression); let their locations be denoted by t i , for i Î { 1, …, D }.
Figure 6a gives an illustration of the output of our system on
a typical image. For the results in this paper, we set D = 10
so that no correct detections are discarded and still small
enough to be efficient. In the figure we show the top D = 4
detections to avoid clutter. The locations of each detection t i
are indicated by the position and scale of the box, and their
confidences ct i are indicated by the thickness of the border. In Figure 6b (top), we see that although the system has
detected the car, it has also detected three false positives.
This is fairly typical of this kind of approach. Below we will
see how to eliminate many of these false positives by using
global context.
where J is the number of mixture components for each class
conditional density. Some examples of scene classification
are shown in Figure 4. As shown in Quattoni and Torralba, 12
this technique classifies 75% of the images correctly across
15 different scene categories. Other classifiers give similar
performance.
Once we have estimated the scene category, we can pre-
dict the number of objects that are present using
( 3)
where p(Nt = n|S = s) is estimated by simple counting.
3. 2. object presence detection using global image
features
To determine if an object class is present in an image given
the gist, we could directly learn a binary classifier of the
form p(Pt = 1|g). Similarly, to predict the number of objects,
we could learn an ordinal regression function of the form
p(Nt|g). Instead, we choose a two-step approach in which we
first estimate the category or type of scene, p(S = s|g), and then
use this to predict the number of objects present, p(Nt|S = s).
This approach has the benefit of having an explicit representation of the scene category (e.g., a street, a highway, a forest)
which is also an important desired output of an integrated
model.
We can classify the scene using a simple Parzen-window
based density estimator
3. 3. object localization using global image features
The gist captures the overall spatial layout of the image, and
hence can be used to predict the expected vertical location of
each object class before running any detectors; we call this
location priming. However, the gist is not useful for predicting the horizontal locations of objects, which are usually not
very constrained by the overall structure of the scene (except
possibly by the horizontal location of other objects, a possibility we ignore in this paper).
We can use any nonlinear regression function to learn the
mapping from gist to expected vertical location. We used a
mixture of experts model, 4 which is a simple weighted average
of locally linear regression models. More precisely, we define
figure 4. Predicting the presence/absence of cars in images and their locations using gist. the outputs shown here do not incorporate any
information coming from a car detector and are only based on context. note that in the dataset used to fit the distributions of object counts
for each scene category, it is more common to find cars in street scenes (with many cars circulating and parked) than in highway scenes,
where there are many shots of empty roads. hence the histogram for highway shows p(Ncar = 0) = 0.6.
Highway
P(N= n | S= s)
Insidecity
Insidecity
0
0.2
0.4
0.6
1234
Number instances
56789 0
0
0.2
0.4
0.6
Number instances
123456789 0
0
0.2
0.4
0.6
Number instances
123456789 0
Street
0.2
0.1
0
0
123456789
Number instances
Open country
1
0
123456789 0
Number instances
1
Mountain
0
123456789 0
Number instances