telecommunication networks to deliver core dumps to the
computer manufacturer. 4
WER is the first system to provide automatic error diagnosis, the first to use progressive data collection to reduce
overheads, and the first to automatically direct users to
available fixes based on automated error diagnosis. WER
remains unique in four aspects:
1. WER is the largest automated error-reporting system in
existence. Approximately one billion computers run WER
client code: every Windows system since Windows XP.
2. WER automates the collection of additional client-side
data for hard-to-debug problems. When initial error
reports provide insufficient data to debug a problem,
programmers can request that WER collect more data in
future error reports including: broader memory dumps,
environment data, log files, and program settings.
3. WER automatically directs users to solutions for corrected errors. For example, 47% of kernel crash reports
result in a direction to an appropriate software update
or workaround.
4. WER is general purpose. It is used for operating systems and applications, by Microsoft and non-Microsoft
programmers. WER collects error reports for crashes,
non-fatal assertion failures, hangs, setup failures,
abnormal executions, and hardware failures.
2. PROBLEM, SCALE, AND STRATEGY
The goal of WER is to allow us to diagnose and correct every
software error on every Windows system. We realized early on
that scale presented both the primary obstacle and the primary
solution to achieving the goals of WER. If we could remove
humans from the critical path and scale the error reporting
mechanism to admit millions of error reports, then we could
use the law of large numbers to our advantage. For example,
we did not need to collect all error reports, just a statistically
significant sample. And we did not need to collect complete
diagnostic samples for all occurrences of an error with the
same root cause, just enough samples to diagnose the problem and suggest correlation. Moreover, once we had enough
data to allow us to fix the most frequently occurring errors,
then their occurrence would decrease, bringing the remaining
errors to the forefront. Finally, even if we made some mistakes,
such as incorrectly diagnosing two errors as having the same
root cause, once we fixed the first then the occurrences of the
second would reappear and dominate future samples.
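As a toy illustration of this reasoning (not WER code), the sketch below ranks root causes by how often they appear in a sample of reports and shows how fixing the most frequent one brings the next one to the front. The function name `rank_root_causes`, the labels "A", "B", "C", and the counts are all invented for the example.

```python
from collections import Counter


def rank_root_causes(sampled_reports, fixed=frozenset()):
    """Rank root causes by frequency in a sample, ignoring already-fixed ones."""
    counts = Counter(cause for cause in sampled_reports if cause not in fixed)
    # most_common() sorts by descending count.
    return counts.most_common()


# Hypothetical sample: each entry names the root cause behind one sampled report.
sample = ["A"] * 60 + ["B"] * 30 + ["C"] * 10

before_fix = rank_root_causes(sample)
after_fix = rank_root_causes(sample, fixed={"A"})
```

Here `before_fix` is led by "A"; once "A" is treated as fixed, "B" leads `after_fix`, which mirrors how the remaining errors come to the forefront.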
Once we realized the value of scale, five strategies emerged as necessary components for achieving sufficient scale to produce
an effective system: automatic bucketing of error reports, collecting data progressively, minimizing human interaction,
preserving user privacy, and directing users to solutions.
2.1. Automatic bucketing
WER automatically aggregates error reports likely originating
from the same bug into a collection called a bucket.b If not,
WER data, naively collected with no filtering or organization,
would absolutely overwhelm programmers. The ideal
bucketing algorithm would map all error reports caused by
one bug into one unique bucket with no other bugs in
that bucket. Because we know of no such algorithm, WER
instead employs a set of bucketing heuristics in two phases.
First, errors are labeled: assigned to a first bucket based on
immediate evidence available at the client, with the goal
that each bucket contains error reports from just one bug.
Second, errors are classified at the WER service; they are
consolidated to new buckets as additional data is analyzed,
with the goal of minimizing programmer effort by placing
error reports from just one bug into just one final bucket.
b bucket (noun): a collection of error reports likely caused by the same bug;
bucket (verb): to triage error reports into buckets.
112 COMMUNICATIONS OF THE ACM | JULY 2011 | VOL. 54 | NO. 7
Bucketing enables automatic diagnosis and progressive
data collection. Good bucketing relieves programmers and
the system of the burden of processing redundant error
reports, helps prioritize programmer effort by bucket prevalence, and can be used to link users to updates when the
bug has been fixed. In WER, bucketing is progressive. As
additional data related to an error report is collected, such
as symbolic information to translate from an offset in a
module to a named function, the report is associated with a
new bucket. Although the design of optimal bucketing algorithms remains an open problem, the bucketing algorithms
used by WER are in practice quite effective.
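The two phases can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not WER's actual heuristics: the report fields (`app`, `version`, `module`, `offset`), the function names, the symbol-table layout, and the choice to drop the version in the second phase are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorReport:
    # Assumed fields; the real client-side evidence may differ.
    app: str
    version: str
    module: str
    offset: int


def client_label(report):
    """Phase 1: label a report using only evidence available on the client."""
    return (report.app, report.version, report.module, hex(report.offset))


def server_classify(report, symbols):
    """Phase 2: re-bucket on the server once symbol data is available.

    `symbols` maps a module name to a list of (start_offset, function_name)
    pairs sorted by start_offset (an assumed layout).
    """
    function = None
    for start, name in symbols.get(report.module, []):
        if start <= report.offset:
            function = name  # last function starting at or before the offset
        else:
            break
    if function is None:
        # No symbols for this location: keep the client-side label.
        return client_label(report)
    # Illustrative choice: bucket by named function rather than raw offset.
    return (report.app, report.module, function)


# Hypothetical data for illustration only.
symbols = {"render.dll": [(0x1000, "DrawText"), (0x2000, "DrawLine")]}
r1 = ErrorReport("editor.exe", "1.0", "render.dll", 0x1010)
r2 = ErrorReport("editor.exe", "1.1", "render.dll", 0x1040)
```

In this example `r1` and `r2` receive different client labels because their versions and offsets differ, but both fall inside the same named function, so the server-side step places them in the same bucket.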
2.2. Progressive data collection
WER uses a progressive data collection strategy to reduce the
cost of error reporting so that the system can scale to high
volume while providing sufficient detail for debugging. Most
error reports consist of no more than a simple bucket identifier, which just increments that bucket's count. If additional data is
needed, WER will next collect a minidump (an abbreviated
stack and memory dump) and the configuration of the faulting system into a compressed cabinet archive file (the CAB
file). If data beyond the minidump is required to diagnose the
error, WER can progress to collecting full memory dumps,
memory dumps from related programs, related files, or additional data queried from the reporting computer. Progressive
data collection reduces the scale of incoming data enough
that one pair of SQL servers can record every error on every
Windows system worldwide. Progressive data collection also
reduces the cost to users in time and bandwidth of reporting
errors, thus encouraging user participation.
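One way to picture this escalation is a per-bucket policy that counts every report and asks the client for more data only when programmers have requested it. The sketch below is a simplified model under assumptions: the names `DataLevel` and `CollectionPolicy` and the three-level split are invented here and do not describe the actual WER protocol.

```python
from enum import IntEnum


class DataLevel(IntEnum):
    BUCKET_ID = 0     # bucket identifier only; the server increments a count
    MINIDUMP_CAB = 1  # minidump plus system configuration in a CAB file
    EXTENDED = 2      # full memory dumps, related files, queried data


class CollectionPolicy:
    """Hypothetical server-side policy for progressive data collection."""

    def __init__(self):
        self.counts = {}     # bucket -> number of reports seen
        self.requested = {}  # bucket -> DataLevel requested by programmers

    def request(self, bucket, level):
        """Programmers ask for more data in future reports for a bucket."""
        self.requested[bucket] = DataLevel(level)

    def on_report(self, bucket):
        """Count the report and return the data level the client should send."""
        self.counts[bucket] = self.counts.get(bucket, 0) + 1
        return self.requested.get(bucket, DataLevel.BUCKET_ID)


policy = CollectionPolicy()
first = policy.on_report("bucket-1")       # default: identifier only
policy.request("bucket-1", DataLevel.MINIDUMP_CAB)
second = policy.on_report("bucket-1")      # now escalated to a minidump
```

Under this model the common case costs a single counter increment, and the heavier uploads happen only for buckets where someone has asked for them.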
2.3. Minimizing human interaction
WER removes users from all but the authorization step of error
reporting and removes programmers from initial error diagnosis. User interaction is reduced in most cases to a yes/no
authorization (see Figure 1). Users may permanently opt in or
out of future authorization requests. WER servers analyze each
error report automatically to direct users to existing fixes, or, as
needed, ask the client to collect additional data. Programmers
are notified only after WER determines that a sufficient number of error reports have been collected for an unresolved bug.
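The server-side flow described above might look something like the following sketch. It is an assumption-laden illustration: `TriageService`, the action names, and the `NOTIFY_THRESHOLD` value are hypothetical, and the branch that asks the client for additional data is omitted for brevity.

```python
NOTIFY_THRESHOLD = 3  # hypothetical; the source does not state a number


class TriageService:
    """Hypothetical automated triage with no human on the critical path."""

    def __init__(self, known_fixes):
        self.known_fixes = dict(known_fixes)  # bucket -> solution text
        self.counts = {}                      # bucket -> unresolved reports
        self.notified = set()                 # buckets already escalated

    def handle(self, bucket):
        """Return an (action, detail) pair for one incoming report."""
        if bucket in self.known_fixes:
            # A fix exists: direct the user to it without involving anyone.
            return ("direct_user", self.known_fixes[bucket])
        self.counts[bucket] = self.counts.get(bucket, 0) + 1
        if (self.counts[bucket] >= NOTIFY_THRESHOLD
                and bucket not in self.notified):
            # Enough reports for an unresolved bug: notify programmers once.
            self.notified.add(bucket)
            return ("notify_programmers", bucket)
        return ("record_only", bucket)


service = TriageService({"bucket-fixed": "install the available update"})
actions = [service.handle("bucket-new")[0] for _ in range(4)]
```

With these invented values, the first two reports for `bucket-new` are only recorded, the third triggers a single programmer notification, and later reports are recorded again.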
2.4. Preserving user privacy
We take considerable care to avoid knowingly collecting personal identifying information (PII). This encourages user
participation and reduces regulatory burden. For example,