repeat {over the training data (e.g., a set of training images)}
    Set V(0) := V (e.g., set the current image as a mini-batch).
    Compute the posterior Q(0) ≜ P(H | V(0)) (Equations 14 and 15).
    Sample H(0) from Q(0).
    for n = 1 to N_cd do
        Sample V(n) from P(V | H(n−1)) (Equation 10 or 11).c
        Compute the posterior Q(n) ≜ P(H | V(n)) (Equations 14 and 15).
        Sample H(n) from Q(n).
    end for
    Update weights and biases with contrastive divergence
    and sparsity regularization (Equations 17 to 19).
until convergence

To sample the detection layer H and pooling layer P, note
that the detection layer H^k receives a bottom-up signal
from layer V (Equation 21), and the pooling layer P^k
receives a top-down signal from layer H′ (Equation 22).
Now, we sample each of the blocks independently as a
multinomial function of their inputs, as in Section 3.3.
If (i, j) ∈ B_α, the conditional probabilities are given by
Equations 23 and 24.
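The block-wise sampling step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that each detection unit in a block has a logit equal to its bottom-up input plus the top-down input of the block's pooling unit, and that the "all units off" state has logit 0. The names `block_probabilities` and `sample_block` are hypothetical.

```python
import math
import random


def block_probabilities(bottom_up, top_down=0.0):
    """Return (p_units, p_off) for one pooling block.

    bottom_up: bottom-up inputs, one per detection unit in the block.
    top_down:  top-down input to the block's pooling unit (assumed scalar).
    """
    logits = [b + top_down for b in bottom_up]
    m = max(0.0, max(logits))            # shift by the max for stability
    weights = [math.exp(x - m) for x in logits]
    off_weight = math.exp(-m)            # the "all off" state has logit 0
    z = off_weight + sum(weights)
    return [w / z for w in weights], off_weight / z


def sample_block(bottom_up, top_down=0.0, rng=random):
    """Sample a block with at most one active unit, plus its pooling unit."""
    p_units, _ = block_probabilities(bottom_up, top_down)
    u = rng.random()
    h = [0] * len(bottom_up)
    cumulative = 0.0
    for i, p in enumerate(p_units):
        cumulative += p
        if u < cumulative:
            h[i] = 1
            break
    pool = max(h)                        # pooling unit is on iff some unit is on
    return h, pool
```

With four units and all inputs at zero, each unit and the off state get probability 1/5. A positive `top_down` value lowers the probability of the off state, which is how a higher layer can encourage a detection in the block.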
Specifically, the biases of a given layer are learned twice:
once when the layer is treated as the “hidden” layer of the
CRBM (using the lower layer as visible units), and once
when it is treated as the “visible” layer (using the upper
layer as hidden units). We resolved this problem by simply fixing the biases with the learned hidden biases in
the former case (i.e., using only the biases learned when
treating the given layer as the hidden layer of the CRBM).
However, we note that a potentially better solution would
be to jointly train all the weights for the entire CDBN,
using the greedily trained weights as the initialization
(e.g., Hinton et al. 10, 29).
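Returning to the training loop, one contrastive divergence update can be sketched for a small, fully connected binary RBM. This is a deliberate simplification and not the convolutional model of the paper: there is no weight sharing or pooling, the mini-batch is a single example, and the sparsity term is an assumed simple form that nudges each hidden bias toward a target activation. All names and default values (`cd_update`, `target`, `sparsity_lr`) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, c):
    # P(h_j = 1 | v); W[i][j] connects visible unit i and hidden unit j.
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]


def visible_probs(h, W, b):
    # P(v_i = 1 | h) for binary visible units.
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]


def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]


def cd_update(v0, W, b, c, n_cd=1, lr=0.1, target=0.1, sparsity_lr=0.1,
              rng=random):
    """Apply one CD-n_cd update in place; return the final hidden posterior."""
    q0 = hidden_probs(v0, W, c)          # Q(0) = P(H | V(0))
    h = sample(q0, rng)                  # H(0)
    vn, qn = v0, q0
    for _ in range(n_cd):
        vn = sample(visible_probs(h, W, b), rng)   # V(n)
        qn = hidden_probs(vn, W, c)                # Q(n)
        h = sample(qn, rng)                        # H(n)
    for i in range(len(b)):
        for j in range(len(c)):
            W[i][j] += lr * (v0[i] * q0[j] - vn[i] * qn[j])
        b[i] += lr * (v0[i] - vn[i])
    for j in range(len(c)):
        # CD term plus an assumed sparsity term pulling q0[j] toward target.
        c[j] += lr * (q0[j] - qn[j]) + sparsity_lr * (target - q0[j])
    return qn
```

Calling `cd_update` repeatedly over the training examples corresponds to the outer `repeat` loop; a convergence check is left out of this sketch.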
3.6. Hierarchical probabilistic inference
Once the parameters have all been learned, we compute the
network’s representation of an image by sampling from the
joint distribution over all of the hidden layers conditioned
on the input image. To sample from this distribution, we use
block Gibbs sampling, where each layer’s units are sampled
in parallel (see Sections 2.1 and 3.3).
To illustrate the algorithm, we describe a case with one
visible layer V, a detection layer H, a pooling layer P, and
another, subsequently higher detection layer H′. Suppose H′
has K′ groups of nodes, and there is a set of shared weights
G = {G^{1,1}, …, G^{K,K′}}, where G^{k,ℓ} is a weight matrix
connecting pooling unit P^k to detection unit H′^ℓ. The
definition can be extended to deeper networks in a
straightforward way.
Note that the energy function for this sub-network consists
of two kinds of potentials: unary terms for each of the groups
in the detection layers and interaction terms between V and H
and between P and H′ (Equation 20).e
e To avoid clutter, we removed all the terms that do not depend on h and p.
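As a sketch consistent with this description, the energy and the two input signals could be written as follows. The symbols here are assumptions, since they are not defined in this section: W^k and b_k are taken to be the filter and bias of detection group H^k, b′_ℓ the bias of group H′^ℓ, * denotes convolution, • denotes an element-wise product summed over all entries, and W̃^k is W^k flipped horizontally and vertically. Collecting the terms that multiply a single unit h^k_{i,j}, and those that multiply a single pooling unit p^k_α, gives the bottom-up and top-down signals.

```latex
\begin{align*}
E(v, h, p, h') ={}& -\sum_{k} v \bullet (W^{k} * h^{k})
                    - \sum_{k} b_{k} \sum_{i,j} h^{k}_{i,j} \\
                  & -\sum_{k,\ell} p^{k} \bullet (G^{k,\ell} * h'^{\ell})
                    - \sum_{\ell} b'_{\ell} \sum_{i,j} h'^{\ell}_{i,j}, \\
I(h^{k}_{i,j}) \triangleq{}& b_{k} + (\tilde{W}^{k} * v)_{i,j}, \\
I(p^{k}_{\alpha}) \triangleq{}& \sum_{\ell} (G^{k,\ell} * h'^{\ell})_{\alpha}.
\end{align*}
```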
As an alternative to block Gibbs sampling, mean-field (e.g.,
Salakhutdinov et al. 30) can be used to approximate the
posterior distribution. In all our experiments except for
Section 4.5, we used the mean-field approximation to estimate
the hidden layer activations given the input.f
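To make the mean-field alternative concrete, the following toy sketch iterates the expected activations of one pooling block and one higher-layer unit. It is an illustration under strong assumptions rather than the procedure used in the experiments: the network is reduced to a single block connected to a single unit h′ by a hypothetical scalar weight `g`, with a hypothetical bias `b_top`, and the iteration starts from zero top-down input so that the first pass is purely bottom-up.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mean_field(bottom_up, g, b_top, n_iters=5):
    """Mean-field estimates for one pooling block and one higher-layer unit.

    bottom_up: bottom-up inputs to the detection units of the block.
    g:         hypothetical scalar weight between the pooling unit and h'.
    b_top:     hypothetical bias of the higher-layer unit h'.
    Returns a list of (q_units, q_pool, q_top), one entry per iteration.
    """
    q_top = 0.0                       # no top-down input on the first pass
    history = []
    for _ in range(n_iters):
        top_down = g * q_top          # expected top-down input to the block
        weights = [math.exp(b + top_down) for b in bottom_up]
        z = 1.0 + sum(weights)        # the 1.0 is the "all off" state
        q_units = [w / z for w in weights]
        q_pool = 1.0 - 1.0 / z        # probability that the pooling unit is on
        q_top = sigmoid(b_top + g * q_pool)
        history.append((q_units, q_pool, q_top))
    return history
```

For example, `mean_field([0.5, -0.2, 0.1, 0.0], g=2.0, b_top=-1.0)` returns five estimates; with a positive `g`, the pooling activation grows across iterations as the higher-layer unit feeds back into the block.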
3.7. Discussion
Our model used undirected connections between layers.
This approach contrasts with Hinton et al. 10, which used
undirected connections between the top two layers, and
top-down directed connections for the layers below. Hinton
et al. 10 proposed approximating the posterior distribution
using a single bottom-up pass. This feed-forward approach
can often effectively estimate the posterior when the image
contains no occlusions or ambiguities,g but the higher layers
cannot help resolve ambiguities in the lower layers, because
in feed-forward computation the lower layer activations are
not affected by the higher layer activations. Although Gibbs
sampling may estimate the posterior more accurately, applying
block Gibbs sampling to that model would be difficult because
the nodes in a given layer are not conditionally independent
of one another given the layers above and below. In contrast,
our treatment using undirected edges enables combining
bottom-up and top-down information more efficiently, as shown
in Section 4.5.
In our approach, probabilistic max-pooling helps to
address scalability by shrinking the higher layers. Moreover,
weight-sharing (convolutions) speeds up the algorithm further.
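The shrinking effect can be checked with a little arithmetic. The sketch below assumes a "valid" convolution, so that a visible layer of width n_v and a filter of width n_w give a detection layer of width n_v − n_w + 1, followed by non-overlapping C × C pooling blocks. The function name and the example sizes are hypothetical and chosen only for illustration.

```python
def cdbn_layer_sizes(image_size, layers):
    """Return (visible, detection, pooling) widths for each stacked layer.

    image_size: width of the (square) input image.
    layers:     list of (filter_size, pooling_ratio) pairs, bottom to top.
    """
    sizes = []
    n_v = image_size
    for filter_size, pooling_ratio in layers:
        n_h = n_v - filter_size + 1      # "valid" convolution
        n_p = n_h // pooling_ratio       # non-overlapping C x C blocks
        sizes.append((n_v, n_h, n_p))
        n_v = n_p                        # pooling layer feeds the next layer
    return sizes


if __name__ == "__main__":
    for n_v, n_h, n_p in cdbn_layer_sizes(200, [(10, 2), (10, 2)]):
        print(n_v, n_h, n_p)
```

With these example settings, the width of each group falls from 191 detection units per side in the first layer to 43 pooling units per side after the second, so each additional layer operates on a much smaller grid.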
f We found that a small number of mean-field iterations (e.g., five iterations)
sufficed.
g In our experiments, this feed-forward approximation scheme also resulted
in similar posteriors of the hidden units and classification performance in
most cases.