crossover exchanges subtrees between two ASTs (see below for
details). Each modification produces a new AST and weighted
program path. The fitness of each variant is evaluated by compiling the AST and running it on the test cases. Its final fitness
is a weighted sum of the positive and negative test cases it
passes. Once the fitnesses have been computed for each individual, a selection phase deletes the bottom-ranked 50% of the
population.a The new population is formed by first crossing
over the remaining high-fitness individuals with the original
program. Each crossover produces a single child. We add the
children to the population and retain the parents unchanged,
maintaining a constant population size. Finally, all surviving
individuals are mutated.
The repair process terminates either when it finds a candidate solution that passes all its positive and negative test
cases, or when it exceeds a preset number of generations.
The first variant to pass all test cases is the primary repair.
2. 4. Genetic operators
As mentioned above, we apply GP operators to a given variant to produce new program variants, thus exploring the
search space of possible repairs. A key operator is mutation,
which makes random changes to an individual. Because
the primitive unit (gene) of our representation is the statement, mutation is more complicated than the simple bit
flip used in other evolutionary algorithms. Only statements
on the weighted path are subject to the mutation operator.
Each location on the weighted path is considered for mutation with probability equal to its path weight multiplied by
a global mutation rate. A statement selected for mutation
is randomly subjected to either deletion (the entire statement and all its substatements are deleted: s ¬ {}), insertion
(another statement is inserted after it: s ¬ {s; ś;}), or swap
of (s ¬ ś while ś ¬ s). Note that a single mutation step in
our scheme might contain multiple statement-level mutation operations along the weighted path.
The second operation for manipulating variants is
crossover, which in GP exchanges subtrees chosen at random
between two individuals. Although our initial experiments
used a more complicated form of crossover, we have seen
that the results do not depend on the particular crossover
operator used. 8 During each generation, every surviving variant undergoes crossover.
Finally, there are a number of other C program components not touched by the GP operators, such as datatype definitions and local and global variable declarations. Because
these are never on the weighted path, they are never modified
by mutation or crossover. This potentially limits the expressive power of the repairs: If the best fix for a bug is to change
a data structure definition, GP will not discover that fix. For
example, some programs can be repaired either by reordering
the data structure fields, or by changing the program control
flow; our technique finds the second repair. Ignoring variable
declarations, on the other hand, can cause problems with
ill-formed variants. Because of the constraints on mutation
and crossover, GP never generates syntactically ill-formed
a We obtained results qualitatively similar to those reported here with a
more standard method known as tournament selection.
programs (e.g., it will never generate unbalanced parentheses). However, it could move the use of a variable outside of
its declared scope, leading to a semantically ill-formed variant that does not type check and thus does not compile.
2. 5. fitness function
In GP, the fitness function is an objective function used to evaluate variants. The fitness of an individual in a program repair
task should assess how well the program avoids the program
bug while still doing “everything else it is supposed to do.” We
use test cases to measure fitness. For our purposes, a test case
consists of input to the program (e.g., command-line arguments, data files read from the disk, etc.) and an oracle comparator function that encodes the desired response. A program P
is said to pass a test case T iff the oracle is satisfied with the
program’s output: Toracle(P(Tinput) ) = pass. Test cases may check
additional behavior beyond pure functional correctness (e.g.,
the program may be required to produce the correct answer
within a given time bound or otherwise avoid infinite loops).
Such testing accounts for as much as 45% of total software life-cycle costs, 17 and finding test cases to cover all parts of the program and all required behavior is a difficult but well-studied
problem in the field of software engineering.
We call the defect-demonstrating inputs and their anomalous outputs (i.e., the bug we want to fix) the negative test cases.
We use a subset of the program’s existing test inputs and
oracles to encode the core functionalities of the program, and
call them the positive test cases. Many techniques are available
for identifying bugs in programs, both statically (e.g., Ball and
Rajamani3 and Hovemeyer and Pugh10) and dynamically (e.g.,
Forrest et al. 7 and Liblit et al. 13). We assume that a bug has been
identified and associated with at least one negative test case.
The fitness function takes a program variant (genotype),
compiles the internal representation into an executable program, and runs it against the set of positive and negative test
cases. It returns the weighted sum of the test cases passed.
The sum is weighted so that passing the negative test cases
is worth at least as much as passing the positive test cases.
Intuitively, this weighting rewards the search for moving
toward a possible repair. Programs that do not compile are
assigned fitness zero.
2. 6. minimizing the repair
Because the GP may introduce irrelevant changes, we use
program analysis methods to trim unnecessary edits from
the primary repair. For example, in addition to the repair,
the GP might produce dead code (x = 3; x = 5;) or calls to
irrelevant functions. We use tree-structured difference algorithms and delta debugging techniques in a postprocessing
step to generate a simplified patch that, when applied to the
original program, causes it to pass all of the test cases.
Using tree-structured differencing, 1 we view the primary
repair as a set of changes against the original program. Each
change is a tree-structured operation such as “take the subtree of the AST rooted at position 4 and move it so that it
becomes the 5th child of the node at position 6”. We seek to
find a small subset of changes that produces a program that
still passes all of the test cases.
Let Cp = {c1, …, cn} be the set of changes associated with