each layer comprises a set of binary or real-valued units.
Two adjacent layers have a full set of connections between
them, but no two units in the same layer are connected.
Hinton et al. [10] proposed an efficient algorithm for training DBNs by greedily training each layer (from lowest to
highest) as an RBM using the previous layer’s activations
as inputs.
For example, once a layer of the network is trained, the
parameters W_ij, b_j, and c_i are frozen and the hidden unit values
(given the data) are inferred. These inferred values serve
as the input data used to train the next higher layer in the
network. Hinton et al. [10] showed that by repeatedly applying such a procedure, one can learn a multilayered DBN. In
some cases, this iterative greedy algorithm can be shown to
optimize a variational lower bound on the data likelihood, if each layer has at least as many units as the layer
below. This greedy layer-wise training approach has been
shown to provide a good initialization of the parameters of the
multilayered network.
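The greedy layer-wise procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes binary units, uses one step of contrastive divergence (CD-1) as the per-layer RBM learner, and passes hidden activation probabilities (rather than samples) up to the next layer. The names `train_rbm` and `train_dbn` and all hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.1, rng=None):
    """Train a binary RBM with CD-1 (an assumed learner, full-batch updates)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n_samples, n_visible = data.shape
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))  # weights W_ij
    b = np.zeros(n_hidden)                                  # hidden biases b_j
    c = np.zeros(n_visible)                                 # visible biases c_i
    for _ in range(epochs):
        # Positive phase: hidden probabilities and samples given the data.
        ph = sigmoid(data @ W + b)
        h = (rng.random(ph.shape) < ph).astype(float)
        # Negative phase: one Gibbs step back to the visibles and up again.
        pv = sigmoid(h @ W.T + c)
        ph_neg = sigmoid(pv @ W + b)
        # CD-1 parameter updates.
        W += lr * (data.T @ ph - pv.T @ ph_neg) / n_samples
        b += lr * (ph - ph_neg).mean(axis=0)
        c += lr * (data - pv).mean(axis=0)
    return W, b, c

def train_dbn(data, layer_sizes, **kwargs):
    """Greedily train one RBM per layer, from lowest to highest."""
    layers = []
    x = data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(x, n_hidden, **kwargs)
        layers.append((W, b, c))   # parameters are frozen from here on
        x = sigmoid(x @ W + b)     # inferred hidden values feed the next layer
    return layers
```

Calling `train_dbn(data, [n1, n2])` on a binary data matrix returns one `(W, b, c)` triple per layer; each layer only ever sees the activations inferred by the layer below it.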
Figure 2. Convolutional RBM with probabilistic max-pooling. For
simplicity, only group k of the detection layer and the pooling layer are
shown. The basic CRBM corresponds to a simplified structure with
only the visible layer and the detection (hidden) layer. See text for details.
[Figure labels: V (visible layer, size N_V); W^k (filter, size N_W); H^k (detection layer, size N_H, units h^k_{i,j}); P^k (pooling layer, size N_P, units p^k_α); C.]
3. ALGORITHM
Both RBMs and DBNs ignore the 2D structure of images, so
weights that detect a given feature must be learned separately for each location. This redundancy makes it difficult
to scale these models to full images. One possible way of
scaling up is to use massive parallel computation, such as
using GPUs, as shown in Raina et al. [25] However, this method
may still suffer from having a huge number of parameters.
In this section, we present a new method that scales up
DBNs using weight-sharing. Specifically, we introduce
our model, the convolutional DBN (CDBN), where weights
are shared among all locations in an image. This model
scales well because inference can be done efficiently using
convolution.
an N_V × N_V array of binary units. The hidden layer consists
of K groups, where each group is an N_H × N_H array of binary
units, resulting in N_H^2 K hidden units. Each of the K groups
is associated with an N_W × N_W filter (N_W ≜ N_V − N_H + 1); the filter
weights are shared across all the hidden units within the
group. In addition, each hidden group has a bias b_k and all
visible units share a single bias c.
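To make the effect of weight sharing concrete, the parameter count of a CRBM can be compared with that of a fully connected RBM with the same number of hidden units. The sketch below is only an arithmetic illustration; the image size, filter size, and number of groups are hypothetical values, not settings taken from the text.

```python
def rbm_param_count(n_visible, n_hidden):
    """Fully connected RBM: one weight per visible-hidden pair, plus all biases."""
    return n_visible * n_hidden + n_hidden + n_visible

def crbm_param_count(n_w, k):
    """CRBM: K filters of size N_W x N_W, one bias per group, one shared visible bias."""
    return k * n_w * n_w + k + 1

# Hypothetical sizes chosen only for illustration.
n_v, n_w, k = 200, 10, 24
n_h = n_v - n_w + 1            # N_H = N_V - N_W + 1
n_hidden = k * n_h * n_h       # N_H^2 * K hidden units

print("fully connected RBM:", rbm_param_count(n_v * n_v, n_hidden))
print("CRBM:", crbm_param_count(n_w, k))
```

The CRBM count does not depend on the image size at all, which is why the model scales to full images.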
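We define the energy function E(v, h) as
$$E(v, h) = -\sum_{k=1}^{K} \sum_{i,j=1}^{N_H} \sum_{r,s=1}^{N_W} h^k_{ij} W^k_{rs} v_{i+r-1,\, j+s-1} - \sum_{k=1}^{K} b_k \sum_{i,j=1}^{N_H} h^k_{ij} - c \sum_{i,j=1}^{N_V} v_{ij}. \tag{7}$$
Using the operators defined previously,
$$E(v, h) = -\sum_{k=1}^{K} h^k \bullet (\tilde{W}^k * v) - \sum_{k=1}^{K} b_k \sum_{i,j} h^k_{ij} - c \sum_{i,j} v_{ij}. \tag{8}$$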
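The two ways of writing the energy can be checked against each other numerically. The sketch below assumes the usual energy of a weight-shared binary RBM: each hidden unit h^k_{ij} interacts with the N_W × N_W image patch whose top-left corner is at (i, j) through filter W^k, each group has a bias b_k, and the visible units share a bias c. The function names are hypothetical, and indices are zero-based.

```python
import numpy as np

def valid_correlate(v, W):
    """Valid convolution of v with the flipped filter:
    out[i, j] = sum_{r,s} W[r, s] * v[i + r, j + s]."""
    n_w = W.shape[0]
    n_h = v.shape[0] - n_w + 1
    out = np.empty((n_h, n_h))
    for i in range(n_h):
        for j in range(n_h):
            out[i, j] = np.sum(W * v[i:i + n_w, j:j + n_w])
    return out

def energy_explicit(v, h, W, b, c):
    """Energy written as explicit sums over groups, positions, and filter offsets."""
    K, n_h, _ = h.shape
    n_w = W.shape[1]
    e = 0.0
    for k in range(K):
        for i in range(n_h):
            for j in range(n_h):
                for r in range(n_w):
                    for s in range(n_w):
                        e -= h[k, i, j] * W[k, r, s] * v[i + r, j + s]
        e -= b[k] * h[k].sum()
    return e - c * v.sum()

def energy_operator(v, h, W, b, c):
    """Same energy using the element-wise product-and-sum operator."""
    e = -c * v.sum()
    for k in range(h.shape[0]):
        e -= np.sum(h[k] * valid_correlate(v, W[k])) + b[k] * h[k].sum()
    return e

rng = np.random.default_rng(0)
K, n_v, n_w = 2, 6, 3
n_h = n_v - n_w + 1
v = (rng.random((n_v, n_v)) < 0.5).astype(float)
h = (rng.random((K, n_h, n_h)) < 0.5).astype(float)
W = rng.standard_normal((K, n_w, n_w))
b = rng.standard_normal(K)
c = 0.3
print(np.isclose(energy_explicit(v, h, W, b, c), energy_operator(v, h, W, b, c)))
```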
3.1. Notation
For notational convenience, we will make several simplifying assumptions. First, we assume that all inputs to
the algorithm are N_V × N_V images, even though there is no
requirement that the inputs be square, equally sized, or even
2D. We also assume that all units are binary-valued, while
noting that it is straightforward to extend the formulation
to the real-valued visible units (see Section 2.1). We use * to
denote convolution (see footnote b), and • to denote an element-wise product followed by summation, i.e., A • B = tr(A^T B). We place a
tilde above an array (Ã) to denote flipping the array horizontally and vertically.
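The three operators can be illustrated in a few lines of NumPy. This is a sketch for intuition only; the helper names `flip`, `bullet`, and `conv2d` are hypothetical, and the loop-based convolution is written for clarity rather than speed.

```python
import numpy as np

def flip(a):
    """The tilde operator: flip an array horizontally and vertically."""
    return a[::-1, ::-1]

def bullet(a, b):
    """Element-wise product followed by summation; equals tr(a^T b)."""
    return np.sum(a * b)

def conv2d(a, b, mode):
    """2D convolution of an m x m array a with an n x n array b (m >= n).
    mode="valid" gives (m - n + 1) x (m - n + 1);
    mode="full" gives (m + n - 1) x (m + n - 1)."""
    n = b.shape[0]
    if mode == "full":
        a = np.pad(a, n - 1)       # zero-pad every side by n - 1
    size = a.shape[0] - n + 1
    fb = flip(b)                   # convolution slides the flipped filter
    out = np.empty((size, size))
    for i in range(size):
        for j in range(size):
            out[i, j] = np.sum(a[i:i + n, j:j + n] * fb)
    return out

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
B = rng.standard_normal((3, 3))
print(conv2d(A, B, "valid").shape, conv2d(A, B, "full").shape)
```

Note that convolving with a flipped filter, `conv2d(A, flip(B), "valid")`, slides B over A without flipping it, which is why the flipped filter appears when computing the input to the hidden units.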
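As with standard RBMs (Section 2.1), we can perform block
Gibbs sampling using the following conditional distributions:
$$P(h^k_{ij} = 1 \mid v) = \sigma\big((\tilde{W}^k * v)_{ij} + b_k\big) \tag{9}$$
$$P(v_{ij} = 1 \mid h) = \sigma\Big(\big(\textstyle\sum_{k} W^k * h^k\big)_{ij} + c\Big) \tag{10}$$
where σ(·) is the sigmoid function (see footnote c). Gibbs sampling forms
the basis of our inference and learning algorithms.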
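One step of block Gibbs sampling in a CRBM might look like the sketch below. It assumes the conditionals take the usual sigmoid form for a weight-shared binary RBM: hidden units receive a valid convolution of the image with the flipped filter plus b_k, and visible units receive the sum over groups of a full convolution of each hidden group with its filter plus c. All function names are hypothetical and indices are zero-based.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def valid_correlate(v, W):
    """Bottom-up input: out[i, j] = sum_{r,s} W[r, s] * v[i + r, j + s]."""
    n_w = W.shape[0]
    n_h = v.shape[0] - n_w + 1
    out = np.empty((n_h, n_h))
    for i in range(n_h):
        for j in range(n_h):
            out[i, j] = np.sum(W * v[i:i + n_w, j:j + n_w])
    return out

def full_convolve(h, W):
    """Top-down input: out[a, b] = sum_{r,s} W[r, s] * h[a - r, b - s]."""
    n_w, n_h = W.shape[0], h.shape[0]
    out = np.zeros((n_h + n_w - 1, n_h + n_w - 1))
    for r in range(n_w):
        for s in range(n_w):
            out[r:r + n_h, s:s + n_h] += W[r, s] * h
    return out

def sample_hidden(v, W, b, rng):
    """Sample all hidden units at once given the visible layer."""
    p = np.stack([sigmoid(valid_correlate(v, W[k]) + b[k])
                  for k in range(W.shape[0])])
    return (rng.random(p.shape) < p).astype(float), p

def sample_visible(h, W, c, rng):
    """Sample all visible units at once given the hidden layer."""
    total = sum(full_convolve(h[k], W[k]) for k in range(W.shape[0]))
    p = sigmoid(total + c)
    return (rng.random(p.shape) < p).astype(float), p

def gibbs_step(v, W, b, c, rng):
    """One block Gibbs step: v -> h -> v'."""
    h, _ = sample_hidden(v, W, b, rng)
    v_new, _ = sample_visible(h, W, c, rng)
    return v_new, h
```

Because units within a layer are conditionally independent given the other layer, each layer is sampled in a single block, and both directions reduce to convolutions with the shared filters.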
3.2. Convolutional RBM
First, we introduce the convolutional RBM (CRBM).
Intuitively, the CRBM is similar to the RBM, but the weights
between the hidden and visible layers are shared among all
locations in an image. The basic CRBM consists of two layers: an input layer V and a hidden layer H (corresponding to
the lower two layers in Figure 2). The input layer consists of
3.3. Probabilistic max-pooling
To learn high-level representations, we stack CRBMs into a
multilayer architecture analogous to DBNs. This architecture is based on a novel operation that we call probabilistic
max-pooling.
In general, higher-level feature detectors need information
from progressively larger input regions. Existing
b The convolution of an m × m array with an n × n array (m > n) may result in
an (m + n − 1) × (m + n − 1) array (full convolution) or an (m − n + 1) × (m − n + 1)
array (valid convolution). Rather than inventing a cumbersome notation to
distinguish between these cases, we let it be determined by context.
c For the case of real-valued visible units, we can follow the standard formulation
as in Section 2.1 and show that
( 11)