translation-invariant representations (e.g., convolutional
networks) often involve two kinds of alternating layers:
“detection” layers, where responses are computed by
convolving a feature detector with the previous layer, and
“pooling” layers, which shrink the representation of the
detection layers by a constant factor. More specifically,
each unit in a pooling layer computes the maximum activation of the units in a small region of the detection layer.
Shrinking the representation with max-pooling allows
higher-layer representations to be invariant to small translations of the input and reduces the computational burden.
Max-pooling was intended only for deterministic and
feed-forward architectures,17 and it is difficult to perform
probabilistic inference (e.g., computing posterior probabilities) since max-pooling is a deterministic operator. In contrast, we are interested in a generative model of images that
supports full probabilistic inference. Hence, we designed
our generative model so that inference involves max-pooling-like behavior.
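For reference, the deterministic max-pooling operation described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `max_pool` and the list-of-lists layout are assumptions made here, not part of the model definition.

```python
def max_pool(detection, c):
    """Deterministic max-pooling over non-overlapping c x c blocks.

    `detection` is a square 2-D list of activations whose side length
    is a multiple of `c`; the result is smaller by a factor of `c`
    along each dimension.
    """
    n = len(detection)
    if n % c != 0:
        raise ValueError("side length must be a multiple of c")
    pooled = []
    for bi in range(0, n, c):          # top row of each block
        row = []
        for bj in range(0, n, c):      # left column of each block
            block = [detection[i][j]
                     for i in range(bi, bi + c)
                     for j in range(bj, bj + c)]
            row.append(max(block))     # each pooling unit keeps the block maximum
        pooled.append(row)
    return pooled
```

Moving an activation to a different position inside the same block leaves the pooled output unchanged, which is the small-translation invariance mentioned above.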
To simplify the notation, we consider a model with a visible layer V, a detection layer H, and a pooling layer P, as
shown in Figure 2. The detection and pooling layers both
have K groups of units, and each group of the pooling layer has $N_P \times N_P$ binary units. For each $k \in \{1, \ldots, K\}$, the pooling layer $P^k$ shrinks the representation of the detection layer $H^k$ by a factor of C along each dimension, where C is a small integer such as 2 or 3. In other words, the detection layer $H^k$ is partitioned into blocks of size $C \times C$, and each block $a$ is connected to exactly one binary unit $p^k_a$ in the pooling layer (i.e., $N_P = N_H / C$). Formally, we define $B_a \triangleq \{(i, j) : h_{ij} \text{ belongs to the block } a\}$.
The detection units in the block $B_a$ and the pooling unit $p_a$ are connected in a single potential which enforces the following constraints: at most one of the detection units may be on, and the pooling unit is on if and only if a detection unit is on. By adding this constraint, we can efficiently sample from the network without explicitly enumerating all $2^{C^2}$ configurations, as we show later. With this constraint, we can consider these $C^2 + 1$ units as a single (softmax) random variable which may take on one of $C^2 + 1$ possible values: one value for each of the detection units being on, and one value indicating that all units are off.
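The softmax view of a block can be illustrated with a minimal Python sketch. It assumes that each detection unit in the block receives a real-valued bottom-up input (the hypothetical argument `signals`), that the probability of "only unit i is on" is proportional to the exponential of that input, and that the "all off" state has an input of 0. The names `block_state_probs` and `sample_block` are made up for this example.

```python
import math
import random

def block_state_probs(signals):
    """Probabilities of the C*C + 1 states of one pooling block.

    State i < len(signals) means "only detection unit i is on";
    the last state means "all units off" (input fixed to 0 here,
    an assumption of this sketch).
    """
    m = max(0.0, max(signals))                     # shift for numerical stability
    weights = [math.exp(s - m) for s in signals]   # one weight per "unit i on"
    weights.append(math.exp(-m))                   # weight of the "all off" state
    total = sum(weights)
    return [w / total for w in weights]

def sample_block(signals, rng=random):
    """Draw one joint sample (detection units, pooling unit) for a block."""
    probs = block_state_probs(signals)
    state = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    h = [1 if i == state else 0 for i in range(len(signals))]
    p = 1 if state < len(signals) else 0           # pooling unit on iff some unit is on
    return h, p
```

Every sample satisfies both constraints by construction: at most one detection unit is on, and the pooling unit equals the sum of the detection units. Because blocks are disjoint, a full layer could be sampled by calling `sample_block` once per block.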
We formally define the energy function of this simplified
probabilistic max-pooling-CRBM as follows:
$$E(v, h) = -\sum_{k} \sum_{i,j} \left( h^k_{i,j} (\tilde{W}^k * v)_{i,j} + b_k h^k_{i,j} \right) - c \sum_{i,j} v_{i,j} \quad \text{subject to} \quad \sum_{(i,j) \in B_a} h^k_{i,j} \le 1, \ \forall k, a. \tag{13}$$

We now discuss sampling the detection layer H and the pooling layer P given the visible layer V. Note that hidden units in group k receive the following bottom-up signal from layer V:

$$I(h^k_{i,j}) \triangleq b_k + (\tilde{W}^k * v)_{i,j}.$$

Now, we sample each block independently as a multinomial function of its inputs. Suppose $h^k_{i,j}$ is a hidden unit contained in block $a$ (i.e., $(i, j) \in B_a$). Then the increase in energy caused by turning on unit $h^k_{i,j}$ is $-I(h^k_{i,j})$, and the conditional probability is given by

$$P(h^k_{i,j} = 1 \mid v) = \frac{\exp(I(h^k_{i,j}))}{1 + \sum_{(i',j') \in B_a} \exp(I(h^k_{i',j'}))} \tag{14}$$

$$P(p^k_a = 0 \mid v) = \frac{1}{1 + \sum_{(i',j') \in B_a} \exp(I(h^k_{i',j'}))} \tag{15}$$

In our implementation, we sample the random variables $\{h^k_{i,j}\}$ and $p^k_a$ in each block $a$ from a multinomial distribution, and this can be done in parallel since the blocks are disjoint (i.e., each hidden unit belongs to only one block). Sampling the visible layer V given the hidden layer H can be performed in the same way as described in Section 3.2 (e.g., Equation 10 or 11).

3.4. Training via sparsity regularization

Our model is overcomplete in that the size of the representation is much larger than the size of the inputs. In fact, since the first hidden layer of the network contains K groups of units, each roughly the size of the image, it is overcomplete roughly by a factor of K. In general, overcomplete models run the risk of learning trivial solutions, such as feature detectors representing single pixels. One common solution is to force the representation to be "sparse," meaning only a tiny fraction of the units should be active in relation to a given stimulus. Following Lee et al.,18 we regularize the objective function (log-likelihood) to encourage each hidden unit group to have a mean activation close to a small constant. Specifically, we find that the following simple update (followed by contrastive divergence update) works well in practice:

$$\Delta b_k^{\text{sparsity}} \propto p - \frac{1}{N_H^2} \sum_{i,j} P(h^k_{i,j} = 1 \mid v) \tag{16}$$

where p is a target sparsity, and each image is treated as a mini-batch. The learning rate for the sparsity update is chosen as a value that makes the hidden group's average activation (over the entire training data) close to the target sparsity, while allowing variations depending on specific input images. The overall training algorithm for the convolutional RBM (with probabilistic max-pooling) is described in Algorithm 1.d
d To reduce the variance, we followed Hinton and Salakhutdinov11 by setting $V^{n} := \mathbb{E}_{p(V \mid H^{(n-1)})}[V \mid H^{(n-1)}]$; also, we used 1-step CD ($N_{cd} = 1$).
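The sparsity update of Section 3.4 (Equation 16) can be sketched as follows. This is one simple reading, under the assumption that the bias of each hidden group is moved in proportion to the gap between the target sparsity and that group's mean activation probability on the current image. The function name, the argument names, and the explicit `learning_rate` are hypothetical choices made for this example.

```python
def sparsity_bias_update(biases, hidden_probs, target_sparsity, learning_rate):
    """Return new hidden biases after one sparsity update.

    biases:        list of K floats, one shared bias per hidden group.
    hidden_probs:  list of K groups; each group is a 2-D list holding the
                   activation probability of every unit in that group
                   for the current image (the mini-batch).
    """
    new_biases = []
    for bias, group in zip(biases, hidden_probs):
        values = [q for row in group for q in row]
        mean_activation = sum(values) / len(values)   # average over the group
        # Raise the bias if the group is too quiet, lower it if too active.
        new_biases.append(bias + learning_rate * (target_sparsity - mean_activation))
    return new_biases
```

In a training loop this step would be applied once per image, alongside the contrastive divergence update, with `learning_rate` tuned so that the average activation over the training data ends up near the target.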