Figure 1. Typical WER authorization dialog.
although WER collects hardware configuration information,
client code zeros serial numbers and other known unique
identifiers to avoid transmitting data that might identify the
sending computer. WER operates on an informed consent
policy with users. Errors are reported only with user consent.
All consent requests default to negative, thus requiring that
the user opt in before transmission. WER reporting can be
disabled on a per-error, per-program, or per-computer basis
by individual users or by administrators. Because WER does
not have sufficient metadata to locate and filter possible PII
from collected stack or heap data, we minimize the collection of heap data. Microsoft also enforces data-access policies that restrict the use of WER data strictly to debugging
and improving program quality.
2.5. Providing solutions to users
Many errors have known corrections. For example, users
running out-of-date software should install the latest service
pack. The WER service maintains a mapping from buckets
to solutions. A solution is the URL of a web page describing
steps a user should take to prevent recurrence of the error.
Solution URLs can link the user to a page hosting a patch for
a specific problem, to an update site where users can get the
latest version, or to documentation describing workarounds.
Individual solutions can be applied to one or more buckets
with a simple regular expression matching mechanism. For
example, all users who hit any problem with the original
release of Word 2003 are directed to a web page hosting the
latest Office 2003 service pack.
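As a minimal sketch of such a mapping, the following Python fragment applies regular expressions to bucket labels and returns the first matching solution URL. The label layout (fields separated by "|"), the version string, and the URLs are illustrative assumptions, not the actual WER data or schema.

```python
import re
from typing import Optional

# Hypothetical solution table: (bucket-label pattern, solution URL).
# The label layout "program|version|module|offset", the version string,
# and the URLs are assumptions made for this sketch.
SOLUTIONS = [
    (re.compile(r"winword\.exe\|11\.0\.0\.0\|.*"),
     "https://example.com/office2003-service-pack"),
    (re.compile(r".*\|olddriver\.sys\|.*"),
     "https://example.com/driver-update"),
]


def find_solution(bucket_label: str) -> Optional[str]:
    """Return the URL of the first solution whose pattern matches the label."""
    for pattern, url in SOLUTIONS:
        if pattern.fullmatch(bucket_label):
            return url
    return None  # no known solution for this bucket
```

Because the first pattern ignores the module and offset fields, one table entry covers every bucket produced by that program version, which mirrors the one-solution-to-many-buckets behavior described above.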
3. BUCKETING ALGORITHMS
The most important element of WER is its mechanism for
automatically assigning error reports to buckets. Conceptually,
WER bucketing heuristics can be divided along two axes. The
first axis describes where the bucketing code runs: heuristics performed on client computers attempt to minimize the
load on the WER servers and heuristics performed on servers
attempt to minimize the load on programmers. The second
axis describes the effect of the heuristic on the number of final
buckets presented to programmers from a set of incoming error
reports: expanding heuristics increase the number of buckets
so that no two bugs are assigned to the same bucket;
condensing heuristics decrease the number of buckets so that no two
buckets contain error reports from the same bug. Working in
concert, expanding and condensing heuristics should move
WER toward the desired goal of a one-to-one mapping between
bugs and buckets.
3.1. Client-side bucketing
When an error report is first generated, the client-side
bucketing heuristics attempt to produce a unique bucket
label using only local information, ideally a label likely to align
with other reports caused by the same bug. The client-side
heuristics are important because in most cases, the only data
communicated to the WER servers will be a bucket label. An
initial label contains the faulting program, module, and offset
of the program counter within the module. Additional heuristics apply under special conditions, such as when an error is
caused by a hung application. Programs can also apply custom
client-side bucketing heuristics through the WER APIs.
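The initial label can be illustrated with a short Python sketch that builds a label from the faulting program, the module containing the program counter, and the offset within that module. The `Module` type, the label syntax, and the function name are hypothetical and chosen only for illustration; the real client uses its own format.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Module:
    """A loaded module (hypothetical type for this sketch)."""
    name: str
    base: int  # address at which the module was loaded
    size: int  # size of the module image in bytes


def initial_label(program: str, modules: Sequence[Module], pc: int) -> str:
    """Build a label from program, faulting module, and offset in the module."""
    for m in modules:
        if m.base <= pc < m.base + m.size:
            # The offset is relative to the module base, so the label does
            # not depend on where the module happened to be loaded.
            return f"{program}!{m.name}+0x{pc - m.base:x}"
    # Program counter outside every known module (assumed fallback).
    return f"{program}!unknown+0x{pc:x}"
```

Using a module-relative offset rather than an absolute address means that two computers that load the same module at different addresses still produce the same label for the same faulting instruction.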
Most client-side heuristics are expanding heuristics,
intended to spread separate bugs into distinct buckets. For
example, the hang_wait_chain heuristic starts from the
program’s user-input thread and walks the chain of threads
waiting on synchronization objects held by other threads to
find the source of the hang. The few client-side condensing
heuristics were derived empirically for common cases where
a single bug produces many buckets. For example, the
unloaded_module heuristic condenses all errors where a
module has been unloaded prematurely due to an application reference counting bug.
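The wait-chain walk performed by hang_wait_chain can be sketched as follows. This is a simplified model under stated assumptions: threads and synchronization objects are plain strings, the two dictionaries stand in for information the client would read from the operating system, and the cycle check for deadlocks is an addition of this sketch rather than a documented detail of the heuristic.

```python
from typing import Dict, List


def walk_wait_chain(start: str,
                    waiting_on: Dict[str, str],
                    owner: Dict[str, str]) -> List[str]:
    """Follow thread -> awaited object -> owning thread until the chain ends.

    waiting_on maps a blocked thread to the object it waits on;
    owner maps an object to the thread that currently holds it.
    The last thread in the returned chain is the likely source of the hang.
    """
    chain = [start]
    seen = {start}
    current = start
    while current in waiting_on:
        holder = owner.get(waiting_on[current])
        if holder is None or holder in seen:
            break  # unowned object, or a deadlock cycle back to a seen thread
        chain.append(holder)
        seen.add(holder)
        current = holder
    return chain
```

For example, if the user-input thread waits on a lock held by a worker that in turn waits on a lock held by a second worker, the walk ends at the second worker, and the bucket label can be derived from that thread instead of from the blocked user-input thread.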
3.2. Server-side bucketing
Errors collected by WER clients are sent to the WER service.
The heuristics for server-side bucketing attempt to classify
error reports to maximize programmer effectiveness. While
the current server-side code base includes over 500 heuristics, the most important heuristics execute in an algorithm
that analyzes the memory dump to determine which thread
context and stack frame most likely caused the error. The
algorithm finds all thread context records in the memory
dump. It assigns each stack frame a priority from 0 to 5,
where a higher priority indicates a greater likelihood of being a root cause. The
frame with the highest priority is selected. Priority 1 is used
for core OS components, like the kernel, priority 2 for core
device drivers, priority 3 for other OS code like the shell, and
priority 4 for most other code. Priority 5, the highest priority,
is reserved for frames known to trigger an error, such as a
caller of assert. Priority 0, the lowest priority, is reserved
for functions known never to be the root cause of an error,
such as memcpy, memset, and strcpy.
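The priority scheme above can be sketched in Python for a single stack. The module-to-priority table, the `Frame` type, and the rule that ties go to the innermost frame are assumptions of this sketch; the text specifies only the priority levels and that the highest-priority frame is selected.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Frame:
    module: str
    function: str


# Priority 0: functions named in the text as never being the root cause.
NEVER_ROOT_CAUSE = {"memcpy", "memset", "strcpy"}
# Priorities 1 to 3: illustrative module classification (an assumption).
MODULE_PRIORITY = {"ntoskrnl.exe": 1, "disk.sys": 2, "explorer.exe": 3}
DEFAULT_PRIORITY = 4  # most other code


def frame_priority(frame: Frame, callee: Optional[Frame]) -> int:
    """Priority of a frame; callee is the frame it called, if any."""
    if frame.function in NEVER_ROOT_CAUSE:
        return 0
    if callee is not None and callee.function == "assert":
        return 5  # a caller of assert is known to trigger an error
    return MODULE_PRIORITY.get(frame.module, DEFAULT_PRIORITY)


def select_frame(stack: List[Frame]) -> Optional[Frame]:
    """Pick the highest-priority frame; stack is ordered innermost first.

    Ties go to the innermost frame (an assumption of this sketch).
    """
    best, best_priority = None, -1
    callee = None
    for frame in stack:
        priority = frame_priority(frame, callee)
        if priority > best_priority:
            best, best_priority = frame, priority
        callee = frame
    return best
```

With a stack whose innermost frame is memcpy, the sketch skips that frame and blames the application function that called it, which is the intent of reserving priority 0 for such functions.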
WER contains a number of server-side heuristics to filter
out error reports unlikely to be debuggable, such as applications executing corrupt binaries. Kernel dumps are placed
into special buckets if they contain evidence of out-of-date
device drivers, drivers known to corrupt the kernel heap, or
hardware known to cause memory or computation errors.
4. STATISTICS-BASED DEBUGGING
Perhaps the most important feature enabled by WER is
statistics-based debugging. With data from a sufficient percentage of all errors that occur on Windows systems worldwide, programmers can mine the WER database to prioritize
debugging effort, find hidden causes, test root cause hypotheses, measure deployment of solutions, and monitor for regressions. The amount of data in the WER database is enormous,
yielding opportunity for creative and useful queries.