cal models that explain empirically observed phenomena about algorithm performance, and an engineering dimension, where the goals are to provide accurate guidance about which algorithm to use for a problem and to design new algorithms that perform particularly well on the relevant inputs.
One exemplary result in beyond worst-case analysis is due to Albers et al. [2], for the online paging problem described in the introduction. The key idea is to parameterize page request sequences according to how much locality of reference they exhibit, and then prove parameterized worst-case guarantees. Refining worst-case analysis in this way leads to dramatically more informative results.f
Locality of reference is quantified via the size of the working set of a page request sequence. Formally, for a function f : N → N, we say that a request sequence conforms to f if, in every window of w consecutive page requests, at most f(w) distinct pages are requested. For example, the identity function f(w) = w imposes no restrictions on the page request sequence. A sequence can only conform to a sublinear function like f(w) = ⌈√w⌉ or f(w) = ⌈1 + log₂ w⌉ if it exhibits locality of reference.g
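To make the definition concrete, here is a minimal Python sketch (not from the article; the function name conforms_to and the example sequences are illustrative assumptions) that tests whether a request sequence conforms to a given f by checking every window of consecutive requests.

import math

def conforms_to(requests, f):
    """Return True if, for every window length w, every window of w
    consecutive requests contains at most f(w) distinct pages."""
    n = len(requests)
    for w in range(1, n + 1):                 # window length
        for start in range(n - w + 1):        # window start position
            if len(set(requests[start:start + w])) > f(w):
                return False
    return True

# Example: f(w) = ceil(1 + log2(w)) from the text.
f = lambda w: math.ceil(1 + math.log2(w))
print(conforms_to([1, 1, 2, 1, 2, 2, 1, 3, 3, 1], f))  # True: strong locality
print(conforms_to(list(range(20)), f))                 # False: 20 distinct pages in a row

The quadratic scan over all windows is fine for an illustration; a more careful check would maintain sliding-window counts instead.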
The following worst-case guarantee is parameterized by a number α_f(k), between 0 and 1, that we discuss shortly; recall that k denotes the cache size. It assumes that the function f is “concave” in the sense that the number of inputs with value x under f (that is, |f⁻¹(x)|) is nondecreasing in x.
Theorem 1 (Albers et al. [2]).
(a) For every f and k and every deterministic cache replacement policy, the worst-case page fault rate (over sequences that conform to f) is at least α_f(k).
(b) For every f and k and every sequence that conforms to f, the page fault rate of the LRU policy is at most α_f(k).
(c) There exists a choice of f and k, and a page request sequence that conforms to f, such that the page fault rate of the FIFO policy is strictly larger than α_f(k).

f Parameterized guarantees are common in the analysis of algorithms. For example, the field of parameterized algorithms and complexity has developed a rich theory around parameterized running time bounds (see the book by Cygan et al. [16]). Theorem 1 employs an unusually fine-grained and problem-specific parameterization, and in exchange obtains unusually accurate and meaningful results.
g The notation ⌈x⌉ means the number x, rounded up to the nearest integer.
Parts (a) and (b) prove the worst-case optimality of the LRU policy in a strong sense, f-by-f and k-by-k. Part (c) differentiates LRU from FIFO, as the latter is suboptimal for some (in fact, many) choices of f and k.
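The contrast between parts (b) and (c) is easy to observe empirically. The following Python sketch (an illustration under assumed, arbitrary parameters, not the construction from Albers et al. [2]) simulates LRU and FIFO on a synthetic request sequence with locality of reference and prints their page fault rates.

from collections import OrderedDict, deque
import random

def fault_rate(requests, k, policy):
    """Simulate a cache of size k under the given eviction policy
    ('LRU' or 'FIFO') and return the fraction of requests that fault."""
    faults = 0
    if policy == "LRU":
        cache = OrderedDict()                  # keys kept in recency order
        for p in requests:
            if p in cache:
                cache.move_to_end(p)           # p is now most recently used
            else:
                faults += 1
                if len(cache) >= k:
                    cache.popitem(last=False)  # evict least recently used
                cache[p] = True
    elif policy == "FIFO":
        cache, order = set(), deque()
        for p in requests:
            if p not in cache:
                faults += 1
                if len(cache) >= k:
                    cache.remove(order.popleft())  # evict oldest insertion
                cache.add(p)
                order.append(p)
    else:
        raise ValueError(policy)
    return faults / len(requests)

# Synthetic request sequence with locality of reference: requests mostly
# revisit a small, slowly drifting set of pages (parameters are arbitrary).
random.seed(0)
requests, base = [], 0
for _ in range(100_000):
    if requests and random.random() < 0.5:
        requests.append(random.choice(requests[-4:]))  # re-reference a recent page
    else:
        requests.append(base + random.randrange(8))    # draw from the active set
    if random.random() < 0.01:
        base += 4                                       # the active set drifts

for policy in ("LRU", "FIFO"):
    print(policy, round(fault_rate(requests, k=6, policy=policy), 4))

Comparing the two printed rates while varying the amount of locality and the cache size gives a feel for the separation that part (c) establishes in the worst case.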
The guarantees in Theorem 1 are so good that they are meaningful even when taken at face value: for sublinear f's, α_f(k) goes to 0 reasonably quickly with k. For example, if f(w) = ⌈√w⌉, then α_f(k) scales with 1/√k. Thus, with a cache size of 10,000, the page fault rate is always at most 1%. If f(w) = ⌈1 + log₂ w⌉, then α_f(k) goes to 0 even faster with k, roughly as k/2^k.h

h See Albers et al. [2] for the precise closed-form formula for α_f(k) in general.

Stable Instances
Are point sets with meaningful clusterings easier to cluster than worst-case point sets? Here, we describe one way to define a “meaningful clustering,” due to Bilu and Linial [12]; for others, see Ackerman and Ben-David [1], Balcan et al. [9], Daniely et al. [18], Kumar and Kannan [29], and Ostrovsky et al. [34].

The maximum cut problem. Suppose you have a bunch of data points representing images of cats and images of dogs, and you would like to automatically discover these two groups. One approach is to reduce this task to the maximum cut problem, where the goal is to partition the vertices V of a graph G with edges E and nonnegative edge weights into two groups, while maximizing the total weight of the edges that have one endpoint in each group. The reduction forms a complete graph G, with vertices corresponding to the data points, and assigns a weight w_e to each edge e indicating how dissimilar its endpoints are. The maximum cut of G is a 2-clustering that tends to put dissimilar pairs of points in different clusters.
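As a naive but concrete illustration of this reduction, the sketch below builds the complete dissimilarity graph for a handful of points and finds its maximum cut by enumeration; the names, the toy data, and the squared-distance dissimilarity are assumptions for the example, and the brute-force search is only viable for a few points.

from itertools import combinations

def max_cut_2_clustering(points, dissimilarity):
    """Reduce 2-clustering to maximum cut: build the complete graph on the
    points with edge weight = pairwise dissimilarity, then find the cut of
    maximum total weight by brute force (exponential; illustration only)."""
    n = len(points)
    weight = {(i, j): dissimilarity(points[i], points[j])
              for i, j in combinations(range(n), 2)}
    best_value, best_side = -1.0, None
    # Enumerate the 2^(n-1) cuts, fixing point 0 on side A so that each
    # partition is counted once.
    for mask in range(2 ** (n - 1)):
        side_a = {0} | {i for i in range(1, n) if (mask >> (i - 1)) & 1}
        value = sum(w for (i, j), w in weight.items()
                    if (i in side_a) != (j in side_a))
        if value > best_value:
            best_value, best_side = value, side_a
    return best_side, set(range(n)) - best_side, best_value

# Toy data: six points on a line forming two obvious groups; dissimilarity
# is squared distance.  The maximum cut separates indices {0,1,2} from {3,4,5}.
points = [0.0, 0.2, 0.1, 5.0, 5.3, 4.9]
group_a, group_b, value = max_cut_2_clustering(points, lambda x, y: (x - y) ** 2)
print(group_a, group_b, value)

For these toy points the maximum cut recovers the two obvious groups, matching the intuition that dissimilar pairs end up on opposite sides.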
There are many ways to quantify “dissimilarity” between images, and different definitions might give different optimal 2-clusterings of the data points. One would hope that, for a range of reasonable measures of dissimilarity, the maximum cut in the example above would have all cats on one side and all dogs on the other. In other words, the maximum cut should be invariant under minor changes to the specification of the edge weights (Figure 3).
Definition 2 (Bilu and Linial [12]). An instance G = (V, E, w) of the maximum cut problem is γ-perturbation stable if, for all ways of multiplying the weight w_e of each edge e by a factor a_e ∈ [1, γ], the optimal solution remains the same.
A perturbation-stable instance has a “clearly optimal” solution (a uniqueness assumption on steroids), thus formalizing the idea of a “meaningful clustering.” In machine learning parlance, perturbation stability can be viewed as a type of “large margin” assumption.
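For tiny instances, Definition 2 can be checked directly by brute force. The sketch below is illustrative Python, not code from Bilu and Linial [12]; it relies on the observation that the optimal cut S survives every perturbation with factors in [1, γ] exactly when, for every competing cut T, the weight of the edges cut by S but not by T exceeds γ times the weight of the edges cut by T but not by S, since the worst perturbation for S against T multiplies the latter edges by γ and leaves the former at 1 (edges cut by both or by neither do not affect the comparison).

def cut_edges(side, edges):
    """Edges from the given list with exactly one endpoint in `side`."""
    return {(u, v) for (u, v) in edges if (u in side) != (v in side)}

def is_perturbation_stable(n, weight, gamma):
    """Brute-force check of gamma-perturbation stability for a maximum cut
    instance on vertices 0..n-1 (weight: dict mapping vertex pairs to
    nonnegative weights).  Checks, for every cut T other than the optimal
    cut S, that  w(edges cut by S only) > gamma * w(edges cut by T only)."""
    edges = list(weight)
    # Enumerate all cuts; fix vertex 0 on one side to avoid duplicates.
    cuts = []
    for mask in range(2 ** (n - 1)):
        side = {0} | {i for i in range(1, n) if (mask >> (i - 1)) & 1}
        value = sum(weight[e] for e in cut_edges(side, edges))
        cuts.append((value, side))
    best_side = max(cuts, key=lambda c: c[0])[1]
    s_edges = cut_edges(best_side, edges)
    for value, side in cuts:
        if side == best_side:
            continue
        t_edges = cut_edges(side, edges)
        s_only = sum(weight[e] for e in s_edges - t_edges)
        t_only = sum(weight[e] for e in t_edges - s_edges)
        if not s_only > gamma * t_only:
            return False
    return True

# Toy instance: two tight pairs {0,1} and {2,3} joined by heavy cross edges,
# so the maximum cut {0,1} vs. {2,3} is very pronounced (weights are made up).
weight = {(0, 1): 1, (2, 3): 1, (0, 2): 10, (0, 3): 10, (1, 2): 10, (1, 3): 10}
print(is_perturbation_stable(4, weight, gamma=2))    # True
print(is_perturbation_stable(4, weight, gamma=50))   # False: large enough perturbations break it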
The maximum cut problem is NP-hard in general. But what about the special case of γ-perturbation-stable instances?
Figure 3. In a perturbation-stable maximum cut instance, the optimal solution is invariant under small perturbations to the edges' weights.