Figure 5. (Top) Histogram of the area under the precision-recall curve
(AUC-PR) for three classification problems using class-specific
object-part representations. (Bottom) Average AUC-PR for each
classification problem.
[Histogram panels: Faces, Motorbikes, Cars; series: first, second, and third layer; x-axis: area under the PR curve (AUC), 0 to 1.]

Features      Faces        Motorbikes   Cars
First layer   0.39 ± 0.17  0.44 ± 0.21  0.43 ± 0.19
Second layer  0.86 ± 0.13  0.69 ± 0.22  0.72 ± 0.23
Third layer   0.95 ± 0.03  0.81 ± 0.13  0.87 ± 0.15

Higher layers in the CDBN learn features that are not only
higher level, but also more specific to particular object categories. We quantitatively measured the specificity of each
layer by determining how indicative each individual feature is
of object categories. (This setting contrasts with most work in
object classification, which focuses on the informativeness of
the entire feature set, rather than individual features.) More
specifically, we considered three CDBNs trained on faces,
motorbikes, and cars, respectively. For each CDBN, we tested
the informativeness of individual features from each layer for
distinguishing among these three categories. For each feature, we computed the area under the precision-recall curve
(larger means more specific). In detail, for any given image,
we computed the layer-wise activations using our algorithm,
partitioned the activation into L × L regions for each group,
and computed the q highest quantile activation for each
region and each group. If the q highest quantile activation
in region i was g, we then defined a Bernoulli random variable
X_{i,L,q} with probability g of being 1. To measure the informativeness between a feature and the class label, we computed
the mutual information between X_{i,L,q} and the class label. We
report results using (L, q) values that maximized the average
mutual information (averaging over i). Then for each feature,
by comparing its values over positive and negative examples,
we obtained the precision-recall curve for each classification
problem. As shown in Figure 5, the higher-level representations are more selective for the specific object class.
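The per-feature measurement described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the code used for the experiments: the function names are hypothetical, q is taken to be a fraction in [0, 1] passed to np.quantile, activations are assumed to lie in [0, 1] so that g can serve as a Bernoulli probability, and the area under the precision-recall curve is approximated by average precision over the ranked examples.

```python
import numpy as np

def quantile_features(activation, L, q):
    """Split a 2-D activation map into an L x L grid and return the
    q-quantile of the activations in each cell (row-major, length L*L).
    Assumption: q is a fraction in [0, 1]."""
    rows = np.array_split(np.arange(activation.shape[0]), L)
    cols = np.array_split(np.arange(activation.shape[1]), L)
    return np.array([np.quantile(activation[np.ix_(r, c)], q)
                     for r in rows for c in cols])

def mutual_information(g, labels):
    """Mutual information (in bits) between X and the class label Y,
    where X ~ Bernoulli(g[n]) for example n. The joint distribution
    is the average over examples."""
    g = np.asarray(g, dtype=float)
    labels = np.asarray(labels)
    p_x1 = g.mean()                      # marginal P(X = 1)
    mi = 0.0
    for y in np.unique(labels):
        mask = labels == y
        p_y = mask.mean()                # P(Y = y)
        p_x1_y = g[mask].mean()          # P(X = 1 | Y = y)
        for p_xy, p_x in ((p_y * p_x1_y, p_x1),
                          (p_y * (1.0 - p_x1_y), 1.0 - p_x1)):
            if p_xy > 0:                 # skip zero-probability cells
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

def pr_auc(scores, is_positive):
    """Area under the precision-recall curve, approximated by average
    precision: the mean of precision values at each positive example
    when examples are ranked by decreasing score."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    order = np.argsort(-scores, kind="stable")
    hits = is_positive[order]
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(np.sum(precision * hits) / hits.sum())
```

In this sketch, each feature would be scored by calling quantile_features on its activation map for every image, choosing (L, q) by the average of mutual_information over regions, and then passing the resulting values and the positive/negative labels to pr_auc.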
We further tested if the CDBN can learn hierarchical
object-part representations when trained on images from
several object categories, rather than just one. We trained
the second and third layer representations using unlabeled
images randomly selected from four object categories (cars,
faces, motorbikes, and airplanes). As shown in Figure 4 (far
right), the second layer learns class-specific and shared
parts, and the third layer learns more object-specific representations. The training examples were unlabeled, so, in a
sense, the third layer implicitly clusters the images by object
category. As before, we quantitatively measured the specificity of each layer’s individual features to object categories.
Since the training was completely unsupervised, whereas
the AUC-PR statistic requires knowing which specific
object or object parts the learned bases should represent,
we computed the conditional entropy instead. Specifically,
we computed the quantile features g for each layer as previously described, and measured conditional entropy
H(class | g > 0.95). Informally speaking, conditional entropy measures the entropy of the posterior over class labels when
a feature is active. Since lower conditional entropy corresponds to a more peaked posterior, it indicates greater specificity. As shown in Figure 6, the higher-layer features have
progressively lower conditional entropy, suggesting that they
activate more selectively to specific object classes.

Figure 6. Histogram of conditional entropy for the representation
learned from the mixture of four object classes.
[Histogram series: first, second, and third layer; x-axis: conditional entropy, 0 to 2.]

Figure 7. Hierarchical probabilistic inference. For each column:
(top) input image; (middle) reconstruction from the second layer
units after a single bottom-up pass, by projecting the second layer
activations into the image space; (bottom) reconstruction from
the second layer units after 20 iterations of block Gibbs sampling.

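As a rough illustration of the conditional entropy H(class | g > 0.95) used above, the following sketch estimates the class posterior empirically from the examples on which a feature is active and returns its entropy. The function name and arguments are hypothetical, and measuring entropy in bits (base-2 logarithm) is an assumption; with four equally likely classes this gives a maximum of 2 bits.

```python
import numpy as np

def conditional_entropy(g, labels, n_classes, threshold=0.95):
    """Entropy (in bits, an assumption) of the empirical class
    posterior over the examples where the feature value g exceeds
    the threshold. Returns NaN if the feature is never active."""
    g = np.asarray(g, dtype=float)
    labels = np.asarray(labels)          # integer class ids 0..n_classes-1
    active = g > threshold
    if not active.any():
        return float("nan")
    counts = np.bincount(labels[active], minlength=n_classes)
    p = counts / counts.sum()            # empirical P(class | g > threshold)
    p = p[p > 0]                         # 0 * log(0) is taken as 0
    return float(np.sum(p * np.log2(1.0 / p)))
```

A feature that fires only on one class yields an entropy of 0, while a feature that fires equally often on all four classes yields 2 bits, so lower values indicate greater specificity.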
4.5. Hierarchical probabilistic inference
Lee and Mumford [19] proposed that the human visual cortex
can be modeled conceptually as performing “hierarchical
Bayesian inference.” For example, if you observe
a face image with its left half in dark illumination, you
would still be able to recognize the face and further infer
the darkened parts by combining the image with your prior
knowledge of faces. In this experiment, we show that our
model can tractably perform such (approximate) hierarchical
probabilistic inference in full-sized images. More specifically,
we tested the network’s ability to infer the locations of hidden
object parts.
To generate examples for evaluation, we used Caltech-101
face images (distinct from the ones the network was trained
on). For each image, we simulated an occlusion by zeroing out the left half of the image. We then sampled from
the joint posterior over all the hidden layers by performing