DOI: 10.1145/1965724.1965749
Debugging in the (Very) Large:
Ten Years of Implementation
and Experience
Abstract
Windows Error Reporting (WER) is a distributed system
that automates the processing of error reports coming from
an installed base of a billion machines. WER has collected
billions of error reports in 10 years of operation. It collects
error data automatically and classifies errors into buckets,
which are used to prioritize developer effort and report fixes
to users. WER uses a progressive approach to data collection, which minimizes overhead for most reports yet allows
developers to collect detailed information when needed.
WER takes advantage of its scale to use error statistics as a
tool in debugging; this allows developers to isolate bugs that
cannot be found at smaller scale. WER has been designed
for efficient operation at large scale: one pair of database
servers records all the errors that occur on all Windows
computers worldwide.
1. INTRODUCTION
Debugging a single program run by a single user on a single
computer is a well-understood problem. It may be arduous,
but follows general principles: a user reports an error, the
programmer attaches a debugger to the running process or
a core dump and examines program state to deduce where
algorithms or state deviated from desired behavior. When
tracking particularly onerous bugs the programmer can
resort to restarting and stepping through execution with the
user’s data or providing the user with a version of the program
instrumented to provide additional diagnostic information.
Once the bug has been isolated, the programmer
fixes the code and provides an updated program.[a]
[a] We use the following definitions: error (noun): a single event in which program
behavior differs from that intended by the programmer; bug (noun): a
root cause, in program code, that results in one or more errors.
In 1999, we realized we could completely change our
model for debugging in the large by combining two tools
then under development into a new service called Windows
Error Reporting (WER). The Windows team devised a tool
to automatically diagnose a core dump from a system crash
to determine the most likely cause of the crash and identify
any known resolutions. Separately, the Office team devised
a tool to automatically collect a stack trace with a small
subset of heap memory on an application failure and upload
this minidump to servers at Microsoft. WER combines these
tools to form a new system which automatically generates
error reports from application and operating system failures, reports them to Microsoft, and automatically diagnoses them to point users at possible resolutions and to aid
programmers in debugging.
Beyond mere debugging from error reports, WER enables
a new form of statistics-based debugging. WER gathers all
error reports to a central database. In the large, programmers can mine the error report database to prioritize work,
spot trends, and test hypotheses. Programmers use data
from WER to prioritize debugging so that they fix the bugs
that affect the most users, not just the bugs hit by the loudest customers. WER data also aids in correlating failures to
co-located components. For example, WER can identify that
a collection of seemingly unrelated crashes all contain the
same likely culprit—say a device driver—even though its
code was not running at the time of failure.
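The two uses of the error database described above, ranking bugs by how many reports they account for and spotting a component common to seemingly unrelated crashes, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Report` record, the bucket labels, the module names, and the "count distinct buckets per module" heuristic are hypothetical and are not WER's actual schema or analysis.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """Hypothetical error report: a bucket label plus the modules loaded at failure."""
    bucket: str
    modules: frozenset


def rank_buckets(reports):
    """Return (bucket, report_count) pairs, most frequently hit bucket first."""
    return Counter(r.bucket for r in reports).most_common()


def buckets_per_module(reports):
    """Count, for each module, the number of distinct buckets it was loaded in.

    A module that shows up across many otherwise unrelated buckets is a
    candidate culprit worth a closer look (a crude heuristic, assumed here).
    """
    seen = defaultdict(set)
    for r in reports:
        for m in r.modules:
            seen[m].add(r.bucket)
    return {m: len(b) for m, b in seen.items()}


# Made-up sample data for illustration only.
reports = [
    Report("word.exe!save", frozenset({"word.exe", "gfx.sys"})),
    Report("word.exe!save", frozenset({"word.exe", "gfx.sys"})),
    Report("word.exe!save", frozenset({"word.exe", "net.dll"})),
    Report("excel.exe!calc", frozenset({"excel.exe", "gfx.sys"})),
    Report("paint.exe!draw", frozenset({"paint.exe", "gfx.sys"})),
]

top_bucket, top_count = rank_buckets(reports)[0]
shared = buckets_per_module(reports)
```

In this toy data the most-hit bucket would be fixed first, while the driver-like module that appears in every bucket stands out even though no bucket is named after it.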
Three principles account for the use of WER by every
Microsoft product team and by over 700 third-party companies to find thousands of bugs: automated error diagnosis
and progressive data collection, which enable error processing at global scales, and statistics-based debugging, which
harnesses that scale to help programmers more effectively
improve system quality.
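Progressive data collection can be pictured as a two-step exchange: the client first sends only a small bucket label, and uploads heavier data only if the service asks for it. The sketch below is a simplified illustration under that assumption; the `Server` class, the `wanted` set, and the `submit` function are hypothetical names and do not describe WER's real protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical collection service that counts reports per bucket."""
    wanted: set = field(default_factory=set)    # buckets flagged as needing more data
    counts: dict = field(default_factory=dict)  # bucket label -> number of reports

    def report(self, bucket: str) -> bool:
        """Record a minimal report; return True if more data is requested."""
        self.counts[bucket] = self.counts.get(bucket, 0) + 1
        return bucket in self.wanted


def submit(server, bucket, collect_dump):
    """Send the cheap bucket label first; build the dump only on request."""
    if server.report(bucket):
        return collect_dump()  # expensive step, skipped for most reports
    return None


server = Server(wanted={"app.exe!parse"})
cheap = submit(server, "app.exe!render", lambda: b"minidump-bytes")
detailed = submit(server, "app.exe!parse", lambda: b"minidump-bytes")
```

Passing the dump collector as a callable keeps the expensive work off the common path: it runs only when the service has asked for more detail on that bucket.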
WER is not the first system to automate the collection of
memory dumps. Postmortem debugging has existed since
the dawn of digital computing. In 1951, the Whirlwind I
system [2] dumped the contents of tube memory to a CRT in
octal when a program crashed. An automated camera took
a snapshot of the CRT on microfilm, delivered for debugging
the following morning. Later systems dumped core
to disk; used partial core dumps, which excluded shared
code, to minimize the dump size [5]; and eventually used
A previous version of this paper appeared in Proceedings
of the 22nd ACM Symposium on Operating Systems
Principles (SOSP ’09).