| Labeled training samples | CDBN | Ranzato et al.27 | Hinton and Salakhutdinov11 | Weston et al.34 |
|---|---|---|---|---|
| 1,000 | 2.62% ± 0.12% | 3.21% | — | 2.73% |
| 2,000 | 2.13% ± 0.10% | 2.53% | — | — |
| 3,000 | 1.91% ± 0.09% | — | — | 1.83% |
| 5,000 | 1.59% ± 0.11% | 1.52% | — | — |
| 60,000 | 0.82% | 0.64% | 1.20% | 1.50% |
Figure 4. Columns 1–4: the second layer bases (top) and the third layer bases (bottom) learned from specific object categories. Column 5: the second layer bases (top) and the third layer bases (bottom) learned from a mixture of four object categories (faces, cars, airplanes, motorbikes).
classification accuracy relative to the first layer alone.
Overall, we achieve 57.7% test accuracy using 15 training
images per class, and 65.4% test accuracy using 30 training
images per class. Our result is competitive with state-of-the-art results using a single type of highly specialized features, such as SIFT, geometric blur, and shape-context.3, 16, 38 In addition, recall that the CDBN was trained entirely
from natural scenes, which are completely unrelated to
the classification task. Hence, the strong performance of
these features implies that our CDBN learned a highly general representation of images.
We note that current state-of-the-art methods use multiple kernels (or features) together, instead of using a single
type of features. For example, Gehler and Nowozin6 reported better performance than ours (77.7% for 30 training images/class), but they combined many state-of-the-art
features (or kernels) to improve performance. In another
approach, Yu et al.36 used kernel regularization with a
(previously published) state-of-the-art kernel matrix to improve
the performance of their convolutional neural network
model (achieving 67.4% for 30 training examples/class).
However, we expect our features can also be used in both
settings to further improve performance.
4.3. Handwritten digit classification
We also evaluated the performance of our model on the
MNIST handwritten digit classification task, a widely used
benchmark for testing hierarchical representations. We
trained 40 first layer bases from MNIST digits, each 12 × 12
pixels, and 40 second layer bases, each 6 × 6. The pooling
ratio C was 2 for both layers. The first layer bases learned pen-strokes
that comprise the digits, and the second layer bases
learned bigger digit-parts that combine the pen-strokes. We
constructed feature vectors by concatenating the first and
second (pooling) layer activations, and used an SVM for classification
using these features. For each labeled training set
size, we report the test error averaged over 10 randomly chosen
training sets, as shown in Table 2. For the full training
set, we obtained 0.8% test error. Our result is comparable to
the state of the art.27
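The evaluation protocol described above (10 randomly chosen training sets per labeled-set size) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the `train_and_test` callback, and the seed are hypothetical, and the sketch assumes the ± figures are a sample standard deviation across the 10 trials, which the text does not state.

```python
import random
import statistics


def error_over_random_subsets(pool_size, n_labeled, train_and_test,
                              n_trials=10, seed=0):
    """Average test error over `n_trials` randomly chosen labeled subsets.

    `train_and_test(indices)` is a hypothetical callback. It should train a
    classifier on the labeled examples at `indices` and return the test
    error as a fraction in [0, 1].
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(n_trials):
        # Draw a labeled subset without replacement from the training pool.
        indices = rng.sample(range(pool_size), n_labeled)
        errors.append(train_and_test(indices))
    # ASSUMPTION: spread is reported as the sample standard deviation.
    return statistics.mean(errors), statistics.stdev(errors)


def dummy_train_and_test(indices):
    # Stand-in for "extract features, fit an SVM, score on the test set".
    return 1.0 / len(indices) ** 0.5


mean_err, spread = error_over_random_subsets(60000, 1000, dummy_train_and_test)
print(f"{100 * mean_err:.2f}% +/- {100 * spread:.2f}%")
```

In a real run, the callback would build the concatenated first and second pooling layer feature vectors and fit the SVM; the dummy above only exercises the sampling and averaging logic.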
4.4. Unsupervised learning of object parts
We now show that our algorithm can learn hierarchical
object-part representations without knowing the position of
the objects and the object-parts. Building on the first layer
representation learned from natural images, we trained two
additional CDBN layers using unlabeled images from single
Caltech-101 categories. Training was performed on up to 100
images, and testing was performed on images different from
those in the training set. The pooling ratio for the first layer
was set to 3. The second layer contained 40 bases, each 10 × 10,
and the third layer contained 24 bases, each 14 × 14. The
pooling ratio in both cases was 2. We used 0.005 as the target
sparsity level in both the second and third layers. As shown in
Figure 4, the second layer learned features that corresponded
to object parts, even though the algorithm was not given any
labels that specified the locations of either the objects or
their parts. The third layer learned to combine the second
layer’s part representations into more complex, higher-level
features. Our model successfully learned hierarchical object-part representations of most of the other Caltech-101 categories as well. We note that some of these categories (such as
elephants and chairs) have fairly high intra-class appearance
variation, due to deformable shapes or different viewpoints.
Despite this variation, our model still learns hierarchical,
part-based representations fairly robustly.
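One way to see why higher layers can represent whole object parts is to work out how many input pixels each basis spans. The sketch below is illustrative only: the `Layer` class and `receptive_fields` function are hypothetical names, it assumes stride-1 filtering with non-overlapping C × C pooling, and the first-layer basis width of 10 pixels is an assumed value that this section does not state. The remaining sizes and pooling ratios are the ones given above.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    filter_width: int   # width of each square basis, in units of the layer below
    pooling_ratio: int  # C: each pooling unit covers a C x C block of detection units


def receptive_fields(layers):
    """Return (detection_width, pooling_width) in input pixels for each layer.

    ASSUMPTION: stride-1 filtering and non-overlapping C x C pooling.
    """
    rf, stride = 1, 1  # a single input pixel; neighbouring units are 1 pixel apart
    widths = []
    for layer in layers:
        # A filter spans `filter_width` units of the layer below, spaced `stride` apart.
        detection_rf = rf + (layer.filter_width - 1) * stride
        # A pooling unit spans `pooling_ratio` adjacent detection units.
        pooling_rf = detection_rf + (layer.pooling_ratio - 1) * stride
        stride *= layer.pooling_ratio
        widths.append((detection_rf, pooling_rf))
        rf = pooling_rf
    return widths


stack = [
    Layer(filter_width=10, pooling_ratio=3),  # first layer; width 10 is an ASSUMPTION
    Layer(filter_width=10, pooling_ratio=2),  # second layer: 10 x 10 bases
    Layer(filter_width=14, pooling_ratio=2),  # third layer: 14 x 14 bases
]
for depth, (det, pool) in enumerate(receptive_fields(stack), start=1):
    print(f"layer {depth}: detection {det} px, pooling {pool} px")
```

Under these assumptions, a second layer basis spans 39 × 39 input pixels and a third layer basis spans 120 × 120, which is consistent with the second layer capturing object parts and the third layer combining them into larger structures.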