Naftali Tishby of the Hebrew University of Jerusalem believes the training processes in neural networks illustrate a branch of information theory that he helped develop two decades ago. He coined the term “information bottleneck” to describe the most efficient way that a system can find relationships between only the pieces of data that matter for a particular task and treat everything else within the sample as irrelevant noise.
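In its usual formulation (the notation here is assumed for illustration, not quoted from the article), the bottleneck asks for a compressed representation T of the input X that discards as much of X as possible while keeping what is relevant to the task variable Y:

```latex
% Information bottleneck objective (standard formulation; notation assumed):
% minimize what T retains about the raw input X while rewarding, weighted by beta,
% what T carries about the task-relevant variable Y.
\min_{p(t \mid x)} \; I(X;T) - \beta \, I(T;Y)
```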
Tishby’s hunch was that neural networks provide examples of the information bottleneck at work. He worked with colleague Ravid Shwartz-Ziv to build a simpler form of neural network able to demonstrate how the process works. First the network finds important connections by adjusting the weights that neurons use to determine which of their peers in the network should have the greatest influence. Then, the network optimizes during what Tishby calls the compression phase. Through this process, neurons adjust weights to disregard irrelevant inputs. These inputs might represent the backgrounds of images of animals presented to a network trained to classify breeds using visual features.
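The two phases were read off from how much information a hidden layer retains about the inputs and about the labels as training proceeds. A minimal numpy sketch of that kind of measurement, using binning-based mutual-information estimates (the function names, bin counts, and stand-in data below are assumptions, not the authors’ code), might look like this:

```python
import numpy as np

# Estimate how much information a hidden layer's activations T carry about the
# input X and the label Y at one point in training, via histogram binning.
# This mirrors the style of analysis behind the "fitting then compression" picture;
# the details are illustrative assumptions, not the original experiment.

def discretize(activations, bins=30):
    """Bin each hidden unit, then treat each example's tuple of bins as one symbol."""
    edges = np.linspace(activations.min(), activations.max(), bins + 1)
    codes = np.digitize(activations, edges)
    _, symbols = np.unique(codes, axis=0, return_inverse=True)
    return symbols.ravel()

def mutual_info_bits(a, b):
    """I(A;B) in bits for two paired vectors of discrete symbols."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)
    p = joint / joint.sum()
    pa = p.sum(axis=1, keepdims=True)
    pb = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (pa * pb)[nz])).sum())

# Stand-in data: x_ids and y would come from the training set, t from a hidden layer
# recorded after each epoch. Falling I(X;T) alongside steady I(T;Y) across epochs is
# the signature Tishby associates with the compression phase (random data here will
# not show it; this only demonstrates the estimator).
rng = np.random.default_rng(0)
x_ids = rng.integers(0, 64, size=4096)              # inputs reduced to discrete ids
y = rng.integers(0, 2, size=4096)                   # binary labels
t = discretize(rng.normal(size=(4096, 3)), bins=6)  # one snapshot of hidden activations
print(mutual_info_bits(x_ids, t), mutual_info_bits(t, y))
```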
However, an attempt last autumn by an independent team to replicate the results obtained by Tishby and Shwartz-Ziv using techniques employed by production neural networks failed to yield the expected compression phase consistently. Often, a neural network will achieve peak performance some time before it moves into the phase that Tishby refers to as compression, or may simply not follow the same pattern. Yet these networks exhibit the generalization capability that the information bottleneck concept predicts. “I think the information bottleneck may be wrong or, in any case, unable to explain the puzzles of deep nets,” Poggio says.
Poggio and colleagues approach the problem of understanding deep learning by treating training as a process of iterative optimization. In learning what is important from the training data, the network arranges itself to minimize an error function, an operation common to many optimization problems. In practice, the error functions for neural networks on a given set of training data seem to exhibit multiple “degenerate” minima, which appear to make it easier to find good solutions that generalize well. However, away from these wide valleys that lie toward the bottom of the error function’s landscape, there are countless local minima that could trap an optimizer in a poor solution.
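The error function in question is simply the average loss over the training set; in generic form (the notation is introduced here for illustration, not taken from Poggio’s memo):

```latex
% Empirical error minimized during training (generic form; notation assumed):
% f(x; w) is the network with weights w, \ell a per-example loss, (x_i, y_i) the data.
L(w) = \frac{1}{N} \sum_{i=1}^{N} \ell(f(x_i; w), y_i)
```

On this reading, the “degenerate” minima are large connected sets of weight vectors w that all achieve essentially the same minimal value of L, the wide valleys described above.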
The secret to deep learning’s success in avoiding the traps of poor local minima may lie in a decision taken primarily to reduce computation time. In principle, the backpropagation algorithm that tunes the weights each neuron will use on the next pass should analyze all of the training data before every update. Instead, stochastic gradient descent (SGD) computes each update from a much smaller random sample that is far easier to process. The simplification causes the process to follow a more random path toward the global minimum than full gradient descent would, and a result of this seems to be that SGD can often skip over poor local minima.
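The difference is easy to see in a sketch. Assuming a simple least-squares model purely for illustration (the data, step size, and batch size below are arbitrary choices), full-batch gradient descent touches every example for each update, while SGD draws a small random minibatch:

```python
import numpy as np

# Contrast full-batch gradient descent with stochastic (minibatch) gradient descent
# on a least-squares problem. The point is only that SGD bases each update on a
# small random sample instead of the whole training set.

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 20))
true_w = rng.normal(size=20)
y = X @ true_w + 0.1 * rng.normal(size=10_000)

def gradient(w, X_batch, y_batch):
    """Gradient of the mean squared error 0.5 * mean((X_batch @ w - y_batch)**2)."""
    return X_batch.T @ (X_batch @ w - y_batch) / len(y_batch)

# Full-batch gradient descent: every update analyzes all of the data.
w_gd = np.zeros(20)
for _ in range(200):
    w_gd -= 0.05 * gradient(w_gd, X, y)

# Stochastic gradient descent: each update uses a small random minibatch, which is
# far cheaper per step and makes the path toward the minimum noisier.
w_sgd = np.zeros(20)
for _ in range(200):
    idx = rng.integers(0, len(y), size=32)
    w_sgd -= 0.05 * gradient(w_sgd, X[idx], y[idx])

print("full-batch error:", np.linalg.norm(w_gd - true_w))
print("SGD error:       ", np.linalg.norm(w_sgd - true_w))
```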
“We are looking for a minimum that is most tolerant to perturbation in parameters or inputs,” says Poggio. “I don’t know if SGD is the best we can do now, but I find almost magical that it finds these degenerate solutions that work.”
For Soatto and his UCLA colleague Alessandro Achille, more clues as to how to make neural networks work better will come through studies that use information bottleneck theory to look at the interactions between different network architectures and the training data.
Says Soatto, “We believe [Tishby’s] ideas are substantially correct, but there are a few technical details that have to be worked out. The fact that we converged to similar ideas is remarkable because we started from completely independent premises.”
Although the work on the information bottleneck and on optimization theory is beginning to lead to a better understanding of how deep learning works, Soatto says, “Most of the field is still in the ‘let all the flowers bloom’ phase, where people propose different architectures and folks adopt them, or not. It is a gruesome trial-and-error process, also known as ‘graduate student descent’, or GSD for short. Together with SGD, these are the two battle-horses of modern deep learning.”
Further Reading
Shwartz-Ziv, R., and Tishby, N.
Opening the Black Box of Deep Neural Networks via Information. arXiv: https://arxiv.org/abs/1703.00810

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O.
Understanding Deep Learning Requires Rethinking Generalization. arXiv: https://arxiv.org/abs/1611.03530

Poggio, T., Liao, Q., Miranda, B., Rosasco, L., Boix, X., Hidary, J., and Mhaskar, H.
Theory of Deep Learning III: Explaining the Non-Overfitting Puzzle. CBMM Memo 073 (2017). https://cbmm.mit.edu/sites/default/files/publications/CBMM-Memo-073.pdf

Achille, A., Rovere, M., and Soatto, S.
Critical Learning Periods in Deep Neural Networks. UCLA-TR-170017. arXiv: https://arxiv.org/abs/1711.08856
Chris Edwards is a Surrey, U.K.-based writer who reports on electronics, IT, and synthetic biology.
© 2018 ACM 0001-0782/18/6 $15.00