For example, convolutions between K filters and an input
image are more efficient both in memory and time than
computing K·N_H^2 separate inner products between the input
image and each of the basis vectors (without weight sharing). As a result, inference in a three-layer network (with
200 × 200 input images) with weight-sharing but without
max-pooling is about 10 times slower. Without weight-sharing, it is more than 100 times slower.
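The equivalence behind this comparison can be checked numerically. The following is a minimal sketch, not the implementation used for the timings above. It assumes a square N_V × N_V input, K square N_W × N_W filters, and "valid" filtering, so that each hidden group has N_H = N_V − N_W + 1 units per side; the function names and the small sizes are hypothetical. It computes the hidden-unit inputs once by sliding the K shared filters over the image and once as K·N_H^2 explicit inner products with unshared, full-size weight vectors.

```python
import numpy as np

def filter_valid(image, filters):
    """Slide each of the K shared filters over the image ('valid' positions only)."""
    n_w = filters.shape[1]
    # windows[i, j] is the n_w x n_w patch whose top-left corner is (i, j).
    windows = np.lib.stride_tricks.sliding_window_view(image, (n_w, n_w))
    # Contract every patch with every filter; the result has shape (K, n_h, n_h).
    return np.einsum("ijab,kab->kij", windows, filters)

def inner_products_unshared(image, filters):
    """Same quantity as K * n_h**2 separate inner products.

    Each hidden unit gets its own full-size weight vector (zero outside its
    receptive field), mimicking a layer without weight sharing.
    """
    k, n_w, _ = filters.shape
    n_v = image.shape[0]
    n_h = n_v - n_w + 1
    x = image.ravel()
    out = np.empty((k, n_h, n_h))
    for g in range(k):
        for i in range(n_h):
            for j in range(n_h):
                w = np.zeros((n_v, n_v))
                w[i:i + n_w, j:j + n_w] = filters[g]
                out[g, i, j] = w.ravel() @ x
    return out

# Hypothetical small sizes: N_V = 12, K = 3, N_W = 4, hence N_H = 9.
rng = np.random.default_rng(0)
image = rng.standard_normal((12, 12))
filters = rng.standard_normal((3, 4, 4))

shared = filter_valid(image, filters)
unshared = inner_products_unshared(image, filters)
assert np.allclose(shared, unshared)
```

In this sketch the shared form stores K·N_W^2 weights, whereas the unshared form materializes one N_V^2-long weight vector for each of the K·N_H^2 hidden units, which illustrates where the memory and time savings come from.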
In contemporary work that was done independently of
ours, Desjardins and Bengio4 and Norouzi et al.21 also applied
convolutional weight-sharing to RBMs. Our work, however,
developed more sophisticated elements such as probabilistic max-pooling to make the algorithm more scalable.
In another contemporary work, Salakhutdinov and
Hinton29 proposed an algorithm to train Boltzmann machines
with layer-wise connections (i.e., the same topological structure as in DBNs, but with undirected connections). They called
this model the deep Boltzmann machine (DBM). Specifically,
they proposed algorithms for pretraining and fine-tuning
DBMs. Our treatment of undirected connections is closely
related to DBMs. However, our model is different from theirs
because we apply convolutional structures and incorporate
probabilistic max-pooling into the architecture. Although
their work is not convolutional and does not scale to images as
large as those handled by our model, we note that their pretraining algorithm
(a modification of contrastive divergence that duplicates the
visible units or hidden units when training the RBMs) or fine-tuning algorithm (joint training of all the parameters using a
stochastic approximation procedure32, 35, 37) can also be applied
to our model to improve the training procedure.
4. EXPERIMENTAL RESULTS
4.1. Learning hierarchical representations from natural images
We first tested our model’s ability to learn hierarchical representations of natural images. Specifically, we trained a
CDBN with two hidden layers from the Kyoto natural image
dataset.h The first layer consisted of 24 groups (or “bases”)i
of 10 × 10 pixel filters, while the second layer consisted of
100 bases, each one 10 × 10 as well. Since the images were
real-valued, we used Gaussian visible units for the first-layer CRBM. The pooling ratio C for each layer was 2, so the
second-layer bases covered roughly twice as large an area
as the first-layer bases. We used 0.003 as the target sparsity
for the first layer and 0.005 for the second layer.
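The spatial extent implied by this configuration can be worked out with a few lines of arithmetic. The sketch below is not from the original experiments: it assumes stride-1 "valid" filtering and non-overlapping C × C pooling blocks, and the helper name is hypothetical. It reports how far one second-layer basis reaches, measured both in first-layer detection units and in input pixels.

```python
def second_layer_footprint(n_w1, c1, n_w2):
    """Side length covered by one second-layer basis.

    n_w1: first-layer filter width in pixels
    c1:   first-layer pooling ratio
    n_w2: second-layer filter width in pooling units
    Assumes stride-1 'valid' filtering and non-overlapping pooling blocks.
    """
    # n_w2 adjacent pooling units each cover c1 detection units per side.
    in_detection_units = n_w2 * c1
    # Adjacent detection units are one pixel apart and each sees n_w1 pixels.
    in_pixels = in_detection_units + n_w1 - 1
    return in_detection_units, in_pixels

# Values quoted in the text: 10 x 10 filters in both layers, pooling ratio 2.
det_units, pixels = second_layer_footprint(n_w1=10, c1=2, n_w2=10)
print(det_units, pixels)  # 20 29
```

Under these assumptions a second-layer basis spans 20 first-layer detection units per side, twice the 10-unit width of a first-layer filter, which is one way to read the factor of two stated above. Measured in raw pixels the span is somewhat larger (29), because each detection unit already sees a 10-pixel-wide patch.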
As Figure 3 (top) shows, the learned first layer bases are
oriented, localized edge filters; this result is consistent
with much previous work.1, 9, 22, 23, 28, 33 We note that sparsity
regularization during training was necessary to learn these
oriented edge filters; when this term was removed, the algorithm failed to learn oriented edges. The learned second
layer bases are shown in Figure 3 (bottom), and many of
them empirically responded selectively to contours, corners,
angles, and surface boundaries in the images. This result is
qualitatively consistent with previous work.12, 13, 18
h Available at http://www.cnbc.cmu.edu/cplab/data_kyoto.html
i We will call one hidden group’s weights a “basis.”
Figure 3. The first layer bases (top) and the second layer bases
(bottom) learned from natural images. Each second layer basis
(filter) was visualized as a weighted linear combination of the first
layer bases.
Table 1. Test classification accuracy for the Caltech-101 data.
Training size (per class)       15               30
CDBN (first layer)              53.2% ± 1.2%     60.5% ± 1.1%
CDBN (first + second layer)     57.7% ± 1.5%     65.4% ± 0.5%
Raina et al.24                  46.6%            —
Ranzato et al.27                —                54.0%
Mutch and Lowe20                51.0%            56.0%
Lazebnik et al.16               54.0%            64.6%
Zhang et al.38                  59.0% ± 0.56%    66.2% ± 0.5%
4.2. Self-taught learning for object recognition
In the self-taught learning framework,24 a large amount of
unlabeled data can help supervised learning tasks, even
when the unlabeled data do not share the same class labels
or the same generative distribution with the labeled data. In
previous work, sparse coding was used to train single-layer
representations from unlabeled data, and the learned representations were used to construct features for supervised
learning tasks.
We used a similar procedure to evaluate our two-layer
CDBN, described in Section 4.1, on the Caltech-101 object
classification task. More specifically, given an image from
the Caltech-101 dataset,5 we scaled the image so that its
longer side was 150 pixels and computed the activations
of the first and second (pooling) layers of our CDBN. We
repeated this procedure after reducing the input image
by half and concatenated all the activations to construct
features. We used an SVM with a spatial pyramid matching kernel for classification, and the parameters of the
SVM were cross-validated. We randomly selected 15 or
30 images per class for the training and testing sets, and
normalized the result such that classification accuracy for
each class was equally weighted (following the standard
protocol). We report results averaged over 10 random trials, as shown in Table 1. First, we observe that combining the first and second layers significantly improves the