DOI: 10.1145/1965724.1965748
Technical Perspective
Is Scale Your Enemy,
or Is Scale Your Friend?
By John Ousterhout
Although the nominal topic of the
following paper is managing crash
reports from an installed software
base, the paper’s greatest contributions are its insights about managing
large-scale systems. Kinshumann et
al. describe how the Windows error
reporting process became almost unmanageable as the scale of Windows
deployment increased. They then
show how an automated reporting
and management system (Windows
Error Reporting, or WER) not only
eliminated the existing problems, but
capitalized on the scale of the system
to provide features that would not be
possible at smaller scale. WER turned
scale from enemy to friend.
Scale has been the single most important force driving changes in system software over the last decade, and
this trend will probably continue for
the next decade. The impact of scale is
most obvious in the Web arena, where
a single large application today can
harness 1,000–10,000 times as many
servers as the largest pre-Web applications of 10–20 years ago and supports
1,000 times as many users. However,
scale also impacts developers outside
the Web; in this paper, scale comes
from the large installed base of Windows and the correspondingly large
number of error reports emanating
from the installed base.
Scale creates numerous problems
for system developers and managers.
Manual techniques that are sufficient
at small scale become unworkable at
large scale. Rare corner cases that are
unnoticeable at small scale become
common occurrences that impact
overall system behavior at large scale.
It would be easy to conclude that scale
offers nothing to developers except an
unending parade of problems to overcome.
Microsoft, like most companies,
originally used an error reporting process
with a significant manual component,
but it gradually broke down
as the scale of Windows deployment
increased. As the number of Windows
installations skyrocketed, so did the
rate of error reports. In addition, the
size and complexity of the Windows
system increased, making it more difficult
to track down problems. For
example, a buggy third-party device
driver could cause crashes that were
difficult to distinguish from problems
in the main kernel.
In any system of sufficiently large scale, automation is not only necessary, but it is cheap.
complete data enables the third and
fourth steps.
The third step is to use the data to
make better decisions. At this point the
scale of the system becomes an asset:
the more data, the better. For example,
WER analyzes error statistics to discover correlations with particular system
configurations (a particular error might
occur only when a particular device driver is present). WER also identifies the
buckets with the most reports so they
can be addressed first.
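
To make the third step concrete, here is a minimal sketch in Python of the two analyses just described: ranking buckets by report count, and measuring how strongly a bucket is associated with a particular driver. It is not WER's implementation; the report records, bucket names, driver names, and both function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical report records: (bucket_id, set of loaded drivers).
# The names and values are illustrative assumptions, not WER's schema.
REPORTS = [
    ("bucket_a", {"net.sys", "gfx.sys"}),
    ("bucket_a", {"gfx.sys"}),
    ("bucket_b", {"net.sys"}),
    ("bucket_a", {"gfx.sys", "usb.sys"}),
    ("bucket_c", {"usb.sys"}),
    ("bucket_b", {"net.sys", "usb.sys"}),
]


def rank_buckets(reports):
    """Return (bucket, count) pairs, most-reported bucket first."""
    return Counter(bucket for bucket, _ in reports).most_common()


def driver_share(reports, bucket, driver):
    """Fraction of the reports in `bucket` whose configuration includes `driver`."""
    in_bucket = [drivers for b, drivers in reports if b == bucket]
    if not in_bucket:
        return 0.0
    return sum(driver in drivers for drivers in in_bucket) / len(in_bucket)
```

With more reports, both the ranking and the per-driver shares become more trustworthy, which is the sense in which scale turns into an asset here.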
The fourth and final step is that processes change in fundamental ways to
capitalize on the level of automation
and data analysis. For example, WER
allows a bug fix to be associated with a
particular error bucket; when the same
error is reported in the future, WER
can offer the fix to the user at the time
the error happens. This allows fixes to
be disseminated much more rapidly,
which is crucial in situations such as
virus attacks.
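
The idea of attaching a fix to a bucket can be sketched as a simple lookup table. The Python below is only an illustration under assumed names (`FixRegistry`, `register`, `on_report` are hypothetical); the real system's interfaces are not described in this perspective.

```python
class FixRegistry:
    """Hypothetical in-memory map from error bucket to a known fix."""

    def __init__(self):
        self._fixes = {}

    def register(self, bucket, fix):
        """Associate a fix description (or download pointer) with a bucket."""
        self._fixes[bucket] = fix

    def on_report(self, bucket):
        """Called when a new report arrives for `bucket`.

        Returns the registered fix so it can be offered to the user
        immediately, or None if no fix is known yet.
        """
        return self._fixes.get(bucket)
```

Once a developer registers a fix, every later report that lands in the same bucket can be answered automatically, with no manual step in the loop.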
Other systems besides WER are also
taking advantage of scale. For example, Web search indexes initially kept
independent caches of index data in
the main memory of each server. As
the number of servers increased, their
operators discovered that the sum total of all
the caches was greater than the total
amount of index data; by reorganizing
their servers to eliminate duplication,
they were able to keep the entire index
in DRAM. This enabled higher performance and new features. Another example is that many large-scale Web
sites use an incremental release process to test new features on a small subset of users before exposing them to the
full user base.
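
One common way to implement such an incremental release is to hash each user into a stable slot and enable the feature only for slots below a threshold. The Python sketch below assumes this hashing approach; the function name `in_rollout` and its parameters are hypothetical, and real sites may select users quite differently.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Return True if `user_id` is in the first `percent` of users for `feature`.

    Each (feature, user) pair is hashed to one of 10,000 stable slots, so a
    user's answer does not change between requests, and raising `percent`
    only ever adds users to the exposed set.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    slot = int.from_bytes(digest[:8], "big") % 10_000
    return slot < percent * 100
```

At large scale even a 1% subset contains enough users to surface problems quickly, which is what makes this kind of staged exposure practical.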
I hope you enjoy reading this paper,
as I did, and that it will stimulate you
to think about scale as an opportunity,
not an obstacle.
John Ousterhout (http://www.stanford.edu/~ouster) is
Professor (Research) of CS at Stanford University.
© 2011 ACM 0001-0782/11/07 $10.00