[Figure 1. Bugs down over time = manager bonus. Axis labels: bad, time.]
had done, and two, how could we have
missed them the first time?
How do upgrades happen when
more bugs is no good? Companies independently settle on a small number
of upgrade models:
- Never. Guarantees “improvement”;
- Never before a release (where it would be most crucial). Counterintuitively happens most often in companies that believe the tool helps with release quality, in that they use it to “gate” the release;
- Never before a meeting. This is at least socially rational;
- Upgrade, then roll back. Seems to happen at least once at large companies; and
- Upgrade only checkers where they fix most errors. Common checkers include use-after-free, memory corruption, (sometimes) locking, and (sometimes) checkers that flag code contradictions. A representative use-after-free is sketched below.
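To make that last item concrete, here is the shape of defect a use-after-free checker reports. The code is an invented illustration, not drawn from the article or any customer code base; the pattern is simply memory freed on one path and dereferenced afterward.

    #include <stdio.h>
    #include <stdlib.h>

    /* Invented example: the pattern a use-after-free checker flags.
     * The error path frees the buffer, then the function reads it anyway. */
    static int log_and_classify(FILE *log, char *msg)
    {
        if (fprintf(log, "%s\n", msg) < 0)
            free(msg);       /* freed on the error path... */
        return msg[0];       /* ...but read unconditionally: use-after-free */
    }

A checker that reliably flags this kind of path is one users are happy to keep upgrading, because nearly every report it produces gets fixed.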
Do missed errors matter?
If people don’t fix all the bugs, do missed errors
(false negatives) matter? Of course not;
they are invisible. Well, not always.
Common cases: Potential customers
intentionally introduced bugs into the
system, asking “Why didn’t you find it?”
[Figure 2. No bonus. Axis labels: bad, time.]
Many check if you find important past bugs. The easiest sale is to a group whose
code you are checking that was horribly
burned by a specific bug last week, and
you find it. If you don’t find it? No matter that you find hundreds of other bugs, any of which may be the next important bug.
Here is an open secret known to bug
finders: The set of bugs found by tool
A is rarely a superset of those found by another tool B, even if A is much better than B. Thus,
the discussion gets pushed from “A is
better than B” to “A finds some things,
B finds some things” and does not help
the case of A.
Adding bugs can be a problem; losing already inspected bugs is always a
problem, even if you replace them with
many more new errors. While users
know in theory that the tool is “not a
verifier,” it’s very different when the tool
demonstrates this limitation, good and
hard, by losing a few hundred known errors after an upgrade.
The easiest way to lose bugs is to add
just one to your tool. A bug that causes
false negatives is easy to miss. One such
bug in how our early research tool’s
internal representation handled array
references meant the analysis ignored
most array uses for more than nine
months. In our commercial product,
blatant situations like this are prevented through detailed unit testing, but uncovering the effect of subtle bugs is still
difficult because customer source code
is complex and not available.
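To show how such a false negative stays hidden, here is a deliberately simplified sketch. The enum, struct, and walker below are invented for illustration and are not Coverity’s internal representation; the point is that a traversal with one missing case reports nothing for that case, so no message ever hints that anything is wrong.

    #include <stddef.h>

    /* Hypothetical, simplified expression IR used only to show how a
     * false-negative bug hides. */
    enum expr_kind { EXPR_VAR, EXPR_BINOP, EXPR_ARRAY_REF };

    struct expr {
        enum expr_kind kind;
        struct expr *left, *right;   /* operands; base/index for EXPR_ARRAY_REF */
    };

    /* Walk an expression, handing each node the checker should see to `check`. */
    static void visit(struct expr *e, void (*check)(struct expr *))
    {
        if (e == NULL)
            return;
        switch (e->kind) {
        case EXPR_BINOP:
            check(e);
            visit(e->left, check);
            visit(e->right, check);
            break;
        case EXPR_VAR:
            check(e);
            break;
        /* Bug: no case for EXPR_ARRAY_REF. Array references and everything
         * nested inside them are silently skipped, so checkers miss defects
         * involving array uses -- a pure false negative with no visible
         * symptom such as a crash or a spurious warning. */
        }
    }

Nothing crashes and nothing is mis-reported; the only symptom is reports that never appear, which is exactly why such a bug can survive for months.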
Churn
Users really want the same result from
run to run. Even if they changed their
code base. Even if they upgraded the tool.
Their model of error messages? Compiler warnings. Classic determinism states: the same input + same function = same result. What users want: different input (modified code base) + different function (tool version) = same result. As a result, we find upgrades to be a constant headache. Analysis changes can easily cause the set of defects found to shift. The new-speak term we use internally is “churn.” A big change from academia is that we spend considerable time and energy worrying about churn when modifying checkers. We try to cap churn at less than 5% per release. This goal means large classes of analysis tricks are disallowed, since they cannot obviously guarantee a minimal effect on the bugs found. Randomization is verboten, a tragedy given that it provides simple, elegant solutions to many of the exponential problems we encounter. Timeouts are also bad; we sometimes use them as a last resort, but they are never encouraged.
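The article does not define how churn is measured, so the following is only one plausible way to quantify it, assuming each defect report can be reduced to a reasonably stable fingerprint (say, checker name plus file plus a hash of nearby code) that survives unrelated edits: count the reports that appear or disappear between the old and new runs and compare that to the old total.

    #include <string.h>

    /* Hypothetical churn metric -- one possible definition, not the
     * article's: the fraction of defect reports that appear or disappear
     * between two runs, relative to the old run's total. Both arrays hold
     * defect fingerprints sorted with strcmp; stable fingerprinting across
     * runs is itself an assumption. */
    static double churn(const char **old_run, size_t n_old,
                        const char **new_run, size_t n_new)
    {
        size_t i = 0, j = 0, changed = 0;

        while (i < n_old && j < n_new) {
            int c = strcmp(old_run[i], new_run[j]);
            if (c == 0)      { i++; j++; }           /* same defect in both runs */
            else if (c < 0)  { changed++; i++; }     /* defect lost after upgrade */
            else             { changed++; j++; }     /* defect newly introduced */
        }
        changed += (n_old - i) + (n_new - j);        /* leftovers are all changes */

        return n_old ? (double)changed / (double)n_old : 0.0;
    }

Under a definition like this, the sub-5% cap would mean rejecting a checker change when more than one report in twenty shifts on the regression code bases.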