Programmers sort their buckets and prioritize debugging
effort on the buckets with the largest volumes of error reports, thus
helping the most users per unit of work. Often, programmers
will aggregate error counts by function and then work through
the buckets for the function in order of decreasing bucket
count. This strategy tends to be effective as errors at different
locations in the same function often have the same root cause.
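This triage could be sketched roughly as follows; the bucket records and function names here are hypothetical, not drawn from the WER schema:

```python
from collections import defaultdict

# Hypothetical bucket records: (function, bucket_id, report_count).
buckets = [
    ("ntfs!NtfsRead", "B1", 5200),
    ("ntfs!NtfsRead", "B2", 1800),
    ("shell!DrawDesktop", "B3", 900),
]

# Aggregate report volume per function to decide which function to debug first.
per_function = defaultdict(int)
for func, _, count in buckets:
    per_function[func] += count

triage_order = sorted(per_function, key=per_function.get, reverse=True)

# Within the chosen function, work through its buckets by decreasing count.
def buckets_for(func):
    return sorted((b for b in buckets if b[0] == func),
                  key=lambda b: b[2], reverse=True)
```

Aggregating by function before sorting is what lets errors with different buckets but a shared root cause surface together.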
The WER database can help find root causes which are
not immediately obvious from memory dumps. For example,
in one instance we received a large number of error reports
with invalid pointer usage in the Windows event tracing
infrastructure. An analysis of the error reports revealed that
96% of the faulting computers were running a specific third-party device driver. Since that driver had well below 96% market share (based on all other error reports), we approached the vendor, who found a memory corruption bug in their code. By comparing expected and observed frequency distributions, we have similarly found hidden causes from specific combinations of third-party drivers and from buggy hardware. A related strategy is “stack sampling,” in which error reports for similar buckets are sampled to determine which functions, other than the first target, occur frequently on the thread stacks.
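The over-representation check behind the driver analysis reduces to comparing an observed fraction against a baseline; a minimal sketch, where the driver name, machine records, and baseline share are all invented for illustration:

```python
# Fraction of faulting machines in a bucket that run a given driver,
# compared to that driver's baseline share across all other error reports.
def overrepresentation(bucket_machines, baseline_share, driver):
    observed = sum(driver in m for m in bucket_machines) / len(bucket_machines)
    return observed / baseline_share  # a ratio >> 1 suggests a hidden cause

# 96 of 100 faulting machines run the driver; baseline share is 10%.
machines = [{"thirdparty.sys"}] * 96 + [set()] * 4
ratio = overrepresentation(machines, baseline_share=0.10, driver="thirdparty.sys")
```

In the incident described above, a ratio this far from 1 was enough to send the investigation to the driver vendor.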
WER can help test programmer hypotheses about the
root causes of errors. The basic strategy is to construct a test
function that can evaluate a hypothesis on a memory dump,
and then apply it to thousands of memory dumps in the
WER database to verify that the hypothesis is not violated.
For example, a Windows programmer debugging an error
related to a shared lock in the Windows I/O subsystem constructed a query to extract the current holder of the lock from
a memory dump and then ran the expression across 10,000
memory dumps to see how many reports had the same lock
holder. One outcome of the analysis was a bug fix; another
was the creation of a new server-side heuristic.
The WER database can measure how widely a software
update has been deployed. Deployment can be measured by absence: the decrease in reports of the errors fixed by the update. It can also be measured by presence: the increased occurrence of the new program or module version in error reports for other issues.
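Both measures reduce to simple ratios over report tallies; a sketch with invented numbers:

```python
# Absence: fractional decline in reports for the bucket the update fixed.
def absence_drop(reports_before, reports_after):
    return 1 - reports_after / reports_before

# Presence: share of the new module version among reports for other issues.
def presence_share(version_counts, new_version):
    return version_counts[new_version] / sum(version_counts.values())

drop = absence_drop(1_200_000, 100_000)          # fixed-bucket reports fell
share = presence_share({"1.0": 300, "1.1": 700}, # new version now dominates
                       "1.1")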
The WER database can be used to monitor for regressions. As with measuring deployment, we look at error report volumes over time to determine whether a software fix had the desired effect of reducing errors. We also
look at error report volumes around major software releases
to quickly identify and resolve new errors that may appear
with the new release.
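A rough sketch of such a volume check, assuming a list of daily report counts and an arbitrary alert threshold (both hypothetical):

```python
# Flag a regression if the average daily report volume after a release
# exceeds the pre-release average by more than a threshold factor.
def regressed(daily_counts, release_day, threshold=1.2):
    before = daily_counts[:release_day]
    after = daily_counts[release_day:]
    return (sum(after) / len(after)) > threshold * (sum(before) / len(before))
```

The same comparison, inverted, confirms that a fix actually reduced report volume.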
5. EVALUATION AND IMPACT
5.1. Scalability
WER collected its first million error reports within 8 months
of its deployment in 1999. Since then, WER has collected billions more. The WER service employs approximately 60 servers provisioned to process well over 100 million error reports
per day. From January 2003 to January 2009, the number of
error reports processed by WER grew by a factor of 30.
The WER service is overprovisioned to accommodate globally correlated events. For example, in February 2007, users of Windows Vista were attacked by the Renos malware.

Figure 2. Renos malware: number of error reports per day. The black bar shows when a fix was released through WU.
If installed on a client, Renos caused the Windows GUI shell,
explorer.exe, to crash when it tried to draw the desktop.
A user’s experience of a Renos infection was a continuous
loop in which the shell started, crashed, and restarted. While
a Renos-infected system was useless to a user, the system
booted far enough to allow reporting the error to WER—on
computers where automatic error reporting was enabled—
and to receive updates from Windows Update (WU).
As Figure 2 shows, the number of error reports from systems
infected with Renos rapidly climbed from 0 to almost 1.2 million
per day. On February 27, shown in black in the graph, Microsoft
released a Windows Defender signature for the Renos infection
via WU. Within 3 days enough systems had received the new
signature to drop reports to under 100,000 per day. Reports for
the original Renos variant became insignificant by the end of
March. The number of computers reporting errors was relatively small: a single computer (somehow) reported 27,000
errors, but stopped after being automatically updated.
5.2. Finding bugs
WER augments, but does not replace, other methods for
improving software quality. We continue to apply static
analysis and model-checking tools to find errors early in the development process.1 These tools are followed by extensive testing regimes before releasing software to users.
WER helps us to rank all bugs and to find bugs not exposed
through other techniques. The Windows Vista programmers fixed 5,000 bugs found by WER in beta deployments
after extensive static analysis, but before product release.
Compared to errors reported directly by humans, WER
reports are more useful to programmers. Analyzing data
sets from Windows, SQL, Excel, Outlook, PowerPoint, Word,
and Internet Explorer, we found that a bug reported by WER
is 4.5–5.1 times more likely to be fixed than a bug reported
directly by a human. This is because error reports from WER
document internal computation state whereas error reports
from humans document external symptoms.
Given finite programmer resources, WER helps focus
effort on the bugs that have the biggest impact on the
most users. Our experience across many application and
OS releases is that error reports follow a Pareto distribution with a small number of bugs accounting for most
error reports. As an example, the graphs in Figure 3 plot