ates repairs for several classes of defects in 11 production programs taken from multiple domains.
• Experiments to analyze how algorithm performance
scales up with problem size and the relative contribution
of different components of the evolutionary algorithm.
In the remaining sections of the paper we first give an
overview of our technical approach (Section 2), illustrating it
with a recent bug in Microsoft’s Zune media player (Section
3). In Section 4 we report results obtained for repairs of several benchmark programs and study how algorithm performance scales with problem size. We place the work in the
context of prior contributions in Section 5 and discuss our
experiences, caveats, and thoughts for future work in Section
6, concluding in Section 7.
2. technicaL aPPRoach
The core of our method is an evolutionary algorithm that
repairs programs by selectively searching through the space
of related program variants until it discovers one that avoids
known defects and retains key functionality. We use a novel
GP representation and make assumptions about the probable nature and location of the necessary repair, improving
search efficiency. Given a defective program, there are several issues to be addressed:
1. What is it doing wrong? We take as input a set of
negative test cases that characterizes a fault. The input program fails all negative test cases.
2. What is it supposed to do? We take as input a set of
positive test cases that encode functionality requirements.
The input program passes all positive test cases.
3. Where should we change it? We favor changing program locations visited when executing the negative
test cases and avoid changing program locations visited when executing the positive test cases.
4. how should we change it? We insert, delete, and swap
program statements and control flow using existing
program structure. We favor insertions based on the
existing program structure.
5. When are we finished? We call the first variant that
passes all positive and negative test cases a primary
repair. We minimize the differences between it and the
original input program to produce a final repair.
To present the repair process, we first describe our program representation (Section 2. 1) and fault localization
(Section 2. 2) choices. We then detail the GP-based repair
strategy (Section 2. 3), discussing the genetic operators
(Section 2. 4), which modify the representation, and the fitness function (Section 2. 5), which uses test cases to evaluate
the results of the modifications. Finally, a postprocessing
step is used to minimize the resulting repair (Section 2. 6).
2. 1. Representation
There are a number of commonly accepted structures for
representing programs, such as control-flow graphs (CFGs)
and abstract syntax trees (ASTs). We chose ASTs because
they can losslessly represent all structured programs and
tree operations are well studied in GP. ASTs can be expressed
at multiple levels of abstraction or granularity, and our program representation reflects the trade-off between expressive
power and scalability. For example, C programs contain both
statements, such as the conditional statement “if (!p) {x =
0;}” and expressions, such as “0” or “(!p)”. For scalability, we
treat the statement as the basic unit, or gene. Thus, we never
modify “(!p)” into “(p || error_flag)” because doing so
would involve changing the inner structure of an expression.
Instead, when manipulating compound statements, we operate on entire AST subtrees. For example, we might delete the
entire “if …” statement, including its then-branch and else-branch children. Finally, we never directly modify low-level
control-flow directives such as break, continue, or goto,
although statements around them can be modified.
2. 2. fault localization
We assume that software defects are local and that fixing one
does not require changing the entire program. This assumption narrows the search space by limiting code changes to
portions of the program likely to contain the defect. We bias
modifications toward statement nodes that were visited when
running the negative test cases but not visited when running
the positive test cases. We find this information by assigning
each statement a unique ID and instrumenting the program
to print out the ID of each statement visited. 14 This allows our
approach to scale to larger program sizes. For example, while
the atris program contains a total of 8068 statement nodes
(Table 1), we use this fault localization information to bias
the search toward 34 statement nodes that are likely to matter, a reduction of over two orders of magnitude.
Formally, each program variant is a pair containing:
1. An abstract syntax tree (AST) including all of the statements s in the program.
2. A weighted path through that program. The weighted
path is a list of pairs ás, wsñ, each containing a statement in the program visited on the negative test case
and the associated weight for that statement.
The default path weight of a statement is 1.0 if it is visited
in the negative test case but not on any positive test case. Its
weight is 0.1 if it is visited on both positive and negative test
cases. All other statements have weight 0.0. The weight represents an initial guess of how relevant the statement is to
the bug. This approach is related to the union/intersection
model of fault localization. 20 The weighted path length is the
sum of statement weights on the weighted path. This scalar
gives a rough estimate of the complexity of the search space
and is correlated with algorithm performance (Section 4).
We return to the issue of fault localization in Section 6.
2. 3. Genetic programming
We use GP to maintain a population of program variants.
Each variant, sometimes referred to as an individual, is represented as an AST annotated with a weighted path (fault
localization information). We modify variants using two
genetic operations, mutation and crossover. Mutation makes
random changes to the nodes along the weighted path, while