to each edge's weight? The natural conjecture is that local search should terminate in a polynomial number of iterations, with high probability over the perturbation. This conjecture has been proved for graphs with maximum degree O(log n)21 and for the complete graph;4 for general graphs, the state-of-the-art is a quasi-polynomial-time guarantee (meaning n^{O(log n)} iterations).22
More ambitiously, it is tempting
to speculate that for every natural local search problem, local search terminates in a polynomial number of
iterations in the smoothed analysis
model (with high probability). Such a
result would be a huge success story for
smoothed analysis and beyond worst-case analysis more generally.
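To make the setting concrete, here is a minimal Python sketch (illustrative only, with a hypothetical random graph and helper names) of smoothed local search for the maximum cut problem under the single-vertex "flip" neighborhood: every edge weight is perturbed by independent Gaussian noise, and we count the improving moves the local search makes before reaching a local optimum.

```python
# Sketch of smoothed local search for max-cut (hypothetical example, not
# taken from the article): perturb each edge weight with Gaussian noise,
# then run the FLIP heuristic and count improving moves.
import itertools
import random

def flip_local_search(n, edges, sigma=0.1, seed=0):
    """Run FLIP local search for max-cut on perturbed edge weights.

    edges: dict mapping (u, v) with u < v to a base weight in [0, 1].
    sigma: standard deviation of the Gaussian perturbation.
    Returns the number of improving moves until a local optimum.
    """
    rng = random.Random(seed)
    # Smoothed-analysis perturbation: independent Gaussian noise per edge.
    w = {e: base + rng.gauss(0.0, sigma) for e, base in edges.items()}

    side = [rng.randrange(2) for _ in range(n)]  # arbitrary initial cut
    moves = 0
    improved = True
    while improved:
        improved = False
        for v in range(n):
            # Net change in cut weight if vertex v switches sides.
            gain = 0.0
            for (a, b), weight in w.items():
                if v in (a, b):
                    cut_now = side[a] != side[b]
                    gain += weight * (-1 if cut_now else 1)
            if gain > 1e-12:
                side[v] = 1 - side[v]
                moves += 1
                improved = True
    return moves

# Example: a sparse random graph on 30 vertices.
n = 30
rng = random.Random(1)
edges = {(u, v): rng.random()
         for u, v in itertools.combinations(range(n), 2)
         if rng.random() < 0.3}
print("improving moves until a local optimum:", flip_local_search(n, edges))
```

Each accepted move strictly increases the (perturbed) cut weight, so the procedure always terminates; the open question concerns how many such moves are needed.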
On Machine Learning
Much of the present and future of
research going beyond worst-case
analysis is motivated by advances in
machine learning.p The unreasonable effectiveness of modern machine
learning algorithms has thrown down
the gauntlet to algorithms researchers,
and there is perhaps no other problem
domain with a more urgent need for
the beyond worst-case approach.
To illustrate some of the challenges, consider a canonical supervised learning problem, where a learning algorithm is given a dataset of object-label pairs and the goal is to produce a classifier that accurately predicts the label of as-yet-unseen objects (for example, whether or not an image contains a cat). Over the past decade, aided by massive datasets and computational power, deep neural networks have achieved impressive levels of performance across a range of prediction tasks.25 Their empirical success flies in the face of conventional wisdom in multiple ways. First, most neural network training algorithms use first-order methods (that is, variants of gradient descent) to solve nonconvex optimization problems that had been written off as computationally intractable. Why do these algorithms so often converge quickly to a local optimum, or even to a global optimum?q
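The following toy sketch (not from the article; a one-dimensional stand-in for the nonconvex landscapes encountered in neural network training) shows the phenomenon in miniature: plain gradient descent from random starting points reliably reaches a global minimizer of a nonconvex objective in a handful of iterations.

```python
# Hypothetical illustration: gradient descent on a simple nonconvex function.
import random

def f(x):
    # Nonconvex: global minima at x = -1 and x = +1, a bad critical point at x = 0.
    return (x * x - 1.0) ** 2

def f_grad(x):
    return 4.0 * x * (x * x - 1.0)

def grad_descent(x0, step=0.05, iters=200):
    """Fixed-step gradient descent; returns the final iterate."""
    x = x0
    for _ in range(iters):
        x = x - step * f_grad(x)
    return x

rng = random.Random(0)
finals = [grad_descent(rng.uniform(-2.0, 2.0)) for _ in range(10)]
print([round(x, 3) for x in finals])      # iterates cluster near -1 and +1
print([round(f(x), 6) for x in finals])   # objective values essentially 0
```

Explaining why the same qualitative behavior persists in the vastly higher-dimensional, far less symmetric landscapes of real networks is the hard part.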
Second, modern neural networks are typically over-parameterized, meaning that the number of free parameters (weights and biases) is considerably larger than the size of the training dataset. Over-parameterized models are vulnerable to large generalization error (that is, overfitting), but state-of-the-art neural networks generalize shockingly well.40 How can we explain this? The answer likely hinges on special properties of both real-world datasets and the optimization algorithms used for neural network training (principally stochastic gradient descent).r
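As a back-of-the-envelope illustration of what "over-parameterized" means, the hypothetical sketch below counts the weights and biases of a modest fully connected network and compares that count with the size of a CIFAR-10-scale training set.

```python
# Hypothetical parameter count for a small fully connected network.
def mlp_param_count(layer_sizes):
    """Weights and biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

n_train = 50_000                                   # a CIFAR-10-sized training set
n_params = mlp_param_count([3072, 1024, 1024, 10])
print(n_params, n_train, round(n_params / n_train, 1))
# Roughly 4.2 million free parameters for 50,000 training examples: the
# over-parameterized regime in which modern networks still generalize well.
```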
Another interesting case study, this time in unsupervised learning, concerns topic modeling. The goal here is to process a large unlabeled corpus of documents and produce a list of meaningful topics and an assignment of each document to a mixture of topics. One computationally efficient approach to the problem is to use a singular value decomposition subroutine to factor the term-document matrix into two matrices, one that describes which words belong to which topics, and one indicating the topic mixture of each document.35
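A minimal sketch of this SVD route, on a hypothetical toy term-document matrix, is shown below; note that the resulting rank-k factors generally contain negative entries, the issue taken up in the next paragraph.

```python
# Hypothetical illustration: factor a tiny term-document matrix via SVD.
import numpy as np

# Rows = terms, columns = documents (toy counts); the third document
# mixes the two underlying topics.
A = np.array([
    [3.0, 0.0, 2.0],   # "dunk"
    [2.0, 0.0, 1.0],   # "rebound"
    [0.0, 4.0, 2.0],   # "ballot"
    [0.0, 3.0, 1.0],   # "senate"
])

k = 2  # number of topics
U, s, Vt = np.linalg.svd(A, full_matrices=False)
word_topic = U[:, :k] * s[:k]   # terms x topics factor
topic_doc = Vt[:k, :]           # topics x documents factor

print(np.round(word_topic, 2))
print(np.round(topic_doc, 2))
print("factors contain negative entries:",
      bool((word_topic < 0).any() or (topic_doc < 0).any()))
print("rank-2 reconstruction error:",
      round(float(np.linalg.norm(A - word_topic @ topic_doc)), 3))
```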
This approach can lead to negative entries in the matrix factors, which hinders interpretability. Restricting the matrix factors to be nonnegative yields a problem that is NP-hard in the worst case, but Arora et al.6 gave a practical factorization algorithm for topic modeling that runs in polynomial time under a reasonable assumption about the data. Their assumption states that each topic has at least one “anchor word,” the presence of which strongly indicates that the document is at least partly about that topic (such as the word “Durant” for the topic “basketball”). Formally articulating this property of data was an essential step in the development of their algorithm.
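The sketch below illustrates the separability idea on a hypothetical toy matrix, using a successive-projection-style greedy to locate the anchor words; it shows how the assumption can be exploited, and is not Arora et al.'s exact algorithm. Under the assumption, the row of an anchor word (normalized to a distribution over documents) is an extreme point of the convex hull of all normalized rows, so a greedy "pick the farthest-out row, project it away, repeat" procedure recovers the anchors.

```python
# Hypothetical illustration of anchor-word recovery under separability
# (a successive-projection-style greedy, not Arora et al.'s procedure).
import numpy as np

def find_anchor_rows(M, k):
    """Greedily pick k rows of M that act as anchors."""
    R = M / M.sum(axis=1, keepdims=True)   # each word becomes a distribution over documents
    anchors = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=1)
        a = int(np.argmax(norms))          # farthest remaining row = next anchor
        anchors.append(a)
        u = R[a] / np.linalg.norm(R[a])
        R = R - np.outer(R @ u, u)         # project out the chosen direction
    return anchors

# Toy term-document counts; "dunk" and "ballot" each occur in only one
# topic's documents, so they play the role of anchor words.
A = np.array([
    [5.0, 4.0, 0.0, 0.0],   # "dunk"    -> anchor for topic 0
    [3.0, 3.0, 2.0, 1.0],   # "game"    -> appears in both topics
    [0.0, 0.0, 6.0, 5.0],   # "ballot"  -> anchor for topic 1
    [1.0, 2.0, 4.0, 3.0],   # "vote"    -> appears in both topics
])
print("anchor rows:", find_anchor_rows(A, k=2))   # prints [0, 2]
```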
The beyond worst-case viewpoint
can also contribute to machine learning by
“stress-testing” the existing theory
p Arguably, even the overarching goal of research in beyond worst-case analysis—determining the best algorithm for an application-specific special case of a problem—is fundamentally a machine learning problem.26
q See Jin et al.28 and the references therein for recent progress on this question.
r See Neyshabur33 and the references therein for recent developments in this direction.