is not ’y’ on line 4. Since the value of the input is unknown,
one simple approach is to represent the unknown value
using a special abstract constant é. Now, programs may
have multiple sources of unknown values, all of which are
represented by é. Thus, é is not a particular unknown but
the set of all unknowns in the program. Hence, the predicates é = ’y’ (which should be read as: ’y’ is equal to
some element of values represented by é) and é ’y’
(which should be read as: ’y’ is not equal to some element
of values represented by é) are simultaneously satisfiable.
As a result, program paths where input is equal to ’y’ at
line ( 2), but not equal to ’y’ at line ( 4) (or vice versa) cannot be ruled out, and the analysis would erroneously report
an error.
A more precise alternative for reasoning about unknown
values is to name them using variables (called choice variables) that stand for a single, but unknown, value. Observe
that this strategy of introducing choice variables is a
refinement over the previous approach because two distinct environment choices are modeled by two distinct
choice variables, b and b¢. Thus, while a choice variable b
may represent any value, it cannot represent two distinct
values at the same time. For instance, if we introduce the
choice variable b for the unknown value of the result of the
call to get_user_input on line 1, the constraint characterizing the failure condition is b = y Ù b π y, which is unsatisfiable, establishing that the call to fopen is matched by
a call to fclose. The insight is that the use of choice variables allows the analysis to identify when two values arise
from the same environment choice without imposing any
restrictions on their values.
While this latter strategy allows for more precise reasoning, it leads to two difficulties—one theoretical and one
bool query_user(bool feature_enabled) {
A: if(!feature_enabled) return false;
B: char input = get_user_input();
C: if(input == ’y’) return true;
D: if(input == ’n’) return false;
E: printf(“Input must be y or n!
F: Please try again.\n”);
G: return query_user(true);
}
practical—that the simpler, but less precise, strategy does
not suffer from. Consider the following function:a
Suppose we want to know when query_user returns
true. The return value of get_user_input is statically
unknown; hence it is identified by a choice variable b. The
variable feature_enabled, however, is definitely not a
nondeterministic choice, as its value is determined by the
function’s caller. We represent feature_enabled by an
observable variable, a, provided by callers of this function.
The condition, P, under which query_user returns true
a While this function would typically be written using a loop, the same problem arises both for loops and recursive functions, and we use a recursive
function because it is easier to explain.
116 communicaTionS of The acm | auGust 2010 | Vol. 53 | no. 8
(abbreviated T) in any calling context, is then given by the
constraint:
P.b = (a = T) Ù (b = ’y’Ú (¬(b = ’n’) Ù P[T/a] = T)) (*)
This formula is read as follows. The term a = T captures that
the function returns true only if feature_enabled is true
(line A). Furthermore, the user input must either be ’y’ (term
b = ’y’ and line C) or it must not be ’n’ (term ¬(b = ’n’) and
line D) and the recursive call on line G must return true (term
P[T/a]). Observe that because the function is recursive, so is
the formula. In the term P[T/a], the substitution [T/a] models
that on the recursive call, the formal parameter a is replaced by
actual parameter true. Finally, the binding P.b reminds us
that b is a choice variable. When the equation is unfolded to perform the substitution [T/a] we must also make the environment
choice for b. The most general choice we can make is to replace
b with a fresh variable b¢, indicating that we do not know what
choice is made, but it is potentially different from any other
choice on subsequent recursive calls. Thus, P[T/a] unfolds to:
(T = T) Ù (b ¢ = ’y’Ú (¬( b ¢ = ’n’) Ù P[T/a]
While the equation (*) expresses the condition under
which query_user returns true, the recursive definition
means it is not immediately useful. Furthermore, it is easy
to see that there is no finite nonrecursive formula that is
a solution of the recursive equation (*) because repeated
unfolding of P [ T/a] introduces an infinite sequence of fresh
choice variables b ¢, b ², b ¢², . . . . Hence, it is not always possible to give a finite closed-form formula describing the exact
condition under which a program property holds.
On the practical side, real programs have many sources of
unknowns; for example, assuming we do not reason about
the internal state of the memory management system, every
call to malloc in a C program appears as a nondeterministic choice returning either NULL or newly allocated memory.
In practice, the number of choice variables grows rapidly
with the size of the program, overwhelming the constraint
solver and resulting in poor analysis scalability. Therefore,
it is important to avoid tracking choice variables whenever
they are unnecessary for proving a property.
Our solution to both the theoretical and the practical
problems can be understood only in the larger context of why
we want to perform static analysis in the first place. Choice
variables allow us to create precise models of how programs
interact with their environment, which is good because we
never know a priori which parts of the program are important to analyze precisely and so introducing unnecessary
imprecision anywhere in the model is potentially disastrous.
But the model has more information than needed to answer
most individual questions we care about; in fact, we are usually interested in only two kinds of 1-bit decision problems,
may and must queries. If one is interested in proving that a
program does not do something “bad” (so-called safety properties), then the analysis needs to ask may questions, such as
“May this program dereference NULL?” or “May this program
raise an exception?”. On the other hand, if one is interested in
proving that a program eventually does something good (so
called liveness properties), then the analysis needs to ask must
questions, such as “Must this memory be eventually freed?”.