In the 2007, 2008, and 2009 PASCAL VOC competitions our
system obtained the highest AP score in 6, 7, and 7 out of 20
categories, respectively. Our entry was declared the winner of
the competition in 2008 and 2009. In the 2010 competition,
our system won in 3 of 20 categories, and the 3 systems that
achieved a higher mean AP score (averaged over all classes)
were all extensions of our system using additional features,
richer context, and more parts. Table 1 summarizes the AP
scores of our system on the 2010 dataset, together with the
best scores achieved across all systems that entered the official
competition. We also show the effect of two post-processing
methods that improve the quality of our detections.
The first method, bounding-box prediction, demonstrates the added benefit of inferring latent structure at test time: we use a linear regression model to predict the true bounding box of a hypothesis from the inferred part configuration. The second method, context rescoring, computes a new confidence score for each detection with a polynomial-kernel SVM whose features are the base detection score and the highest score of each of the 20 object-class detectors in the same image. This method can learn co-occurrence constraints between object classes; because cars and sofas tend not to co-occur, car detections should be downweighted if an image contains a high-scoring sofa.
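The bounding-box prediction step can be sketched as a least-squares fit; the feature layout and toy data below are illustrative assumptions rather than the paper's exact formulation:

```python
import numpy as np

# Sketch of bounding-box prediction: given feature vectors derived from
# inferred part locations (one row per training detection) and the
# ground-truth boxes, fit an independent least-squares linear regressor
# per box coordinate, then apply it to new detections.

def fit_bbox_regressor(part_features, true_boxes):
    """part_features: (n, d) array; true_boxes: (n, 4) array (x1, y1, x2, y2).
    Returns a (d + 1, 4) weight matrix including a bias row."""
    n = part_features.shape[0]
    X = np.hstack([part_features, np.ones((n, 1))])  # append bias term
    # Least-squares solution for all four coordinates at once.
    W, *_ = np.linalg.lstsq(X, true_boxes, rcond=None)
    return W

def predict_bbox(W, part_features):
    n = part_features.shape[0]
    X = np.hstack([part_features, np.ones((n, 1))])
    return X @ W

# Toy usage: features might be (x, y) positions of two parts; the target
# boxes here are an exact linear function of the features plus a bias.
rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 4))
boxes = feats @ rng.normal(size=(4, 4)) + np.array([1.0, 2.0, 3.0, 4.0])
W = fit_bbox_regressor(feats, boxes)
pred = predict_bbox(W, feats)
print(np.allclose(pred, boxes, atol=1e-6))  # exact fit on noiseless toy data
```

In practice one regressor is trained per object class, and the features would encode the root and part locations of the detection.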
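The feature construction behind context rescoring can be sketched as follows; the function name, shapes, and toy scores are hypothetical, and the polynomial-kernel SVM that consumes these vectors is not shown:

```python
import numpy as np

# Build context-rescoring features: for each detection, concatenate its
# own score with the highest score achieved by each of the 20 class
# detectors in the same image. The resulting 21-dimensional vectors
# would then be fed to a polynomial-kernel SVM.

def context_features(det_scores, image_class_scores):
    """det_scores: (n,) base scores of detections in one image.
    image_class_scores: (num_classes,) max score per class detector in
    that image. Returns an (n, num_classes + 1) feature matrix."""
    n = det_scores.shape[0]
    ctx = np.tile(image_class_scores, (n, 1))  # shared per-image context
    return np.hstack([det_scores[:, None], ctx])

# Toy example: 3 car detections in an image where the sofa detector
# (class index 17, say) fired strongly -- these features let the SVM
# learn to downweight the car detections.
scores = np.array([1.2, 0.4, -0.3])
class_max = np.zeros(20)
class_max[17] = 2.5   # high-scoring sofa
F = context_features(scores, class_max)
print(F.shape)  # → (3, 21)
```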
Figure 4 shows some models learned from the PASCAL
VOC 2010 dataset. Figure 5 shows some example detections
using those models. We show both high-scoring correct
detections and high-scoring false positives. These examples
illustrate how our models can handle significant variations
in appearance such as in the case of cars and horses.
In some categories our false detections are often due to
similarities among objects in different categories, such as
between horse and cow or between car and bus. In other categories, false detections are often due to the relatively strict bounding-box overlap criterion. The two false positives shown
for the person category are due to insufficient overlap with
the ground-truth bounding box. The same is true for the cat
category, where we often detect the face of a cat and report a
bounding box that has relatively little overlap with the correct bounding box that encloses the whole object. In fact, the
top 20 highest-scoring false-positive detections for the cat
category all correspond to cat faces. This is an extreme case, but
it gives an explanation for our low AP score in this category.
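The overlap criterion in question is intersection over union (IoU), which PASCAL VOC requires to exceed 0.5 for a detection to count as correct. A minimal sketch, with toy box coordinates, shows why a face-only detection fails against a whole-body ground-truth box:

```python
# PASCAL VOC scores a detection as correct only when the
# intersection-over-union (IoU) of the predicted and ground-truth
# boxes exceeds 0.5.

def iou(a, b):
    """Boxes as (x1, y1, x2, y2). Returns intersection over union."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

whole_cat = (0, 0, 100, 200)   # ground truth: entire cat
face_only = (20, 0, 80, 60)    # detection covering just the face
print(iou(face_only, whole_cat))  # → 0.18, well below the 0.5 threshold
```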
Many positive training examples of cats contain only the
face, and our cat mixture model has a component dedicated
to detecting cat faces, while another component captures an
entire cat. Sometimes the wrong mixture component has
the highest score, suggesting that our scores across different
components could be better calibrated.
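One standard way to calibrate scores across components is Platt-style sigmoid scaling, fit per component; the paper does not specify a calibration method, so everything below (the logistic fit, thresholds, and synthetic scores) is an illustrative assumption:

```python
import numpy as np

# Hypothetical per-component score calibration in the style of Platt
# scaling: for each mixture component, fit a sigmoid
# p = 1 / (1 + exp(-(a*s + b))) mapping raw scores s to probabilities,
# so that scores produced by different components become comparable.

def fit_sigmoid(scores, labels, lr=0.1, steps=2000):
    """1-D logistic regression by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        grad = p - labels  # derivative of the log loss w.r.t. the logit
        a -= lr * float(np.mean(grad * scores))
        b -= lr * float(np.mean(grad))
    return a, b

def calibrate(score, params):
    a, b = params
    return 1.0 / (1.0 + np.exp(-(a * score + b)))

# Toy data: a "face" component whose raw scores run systematically
# higher than a "whole cat" component of the same reliability.
rng = np.random.default_rng(2)
s_face = rng.normal(2.0, 1.0, 300)
y_face = (s_face + rng.normal(0.0, 0.5, 300) > 2.0).astype(float)
s_body = rng.normal(0.0, 1.0, 300)
y_body = (s_body + rng.normal(0.0, 0.5, 300) > 0.0).astype(float)
face_params = fit_sigmoid(s_face, y_face)
body_params = fit_sigmoid(s_body, y_body)
# Each component's own operating point now maps to a comparable
# probability near 0.5, despite the offset in raw scores.
print(calibrate(2.0, face_params), calibrate(0.0, body_params))
```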
Figure 4. Visualizations of some of the models learned on the PASCAL 2010 dataset.