The trial terminates if it discovers a primary repair. We
performed 100 trials for each program, memoizing fitnesses
such that within one trial, two individuals with different
ASTs but the same source code are not evaluated twice.
Similarly, individuals that are copied to the next generation
without change are not reevaluated.
Once a primary repair has been located, the process of
minimizing it to a final repair is quite rapid, requiring less
than 5 s on average. Final repairs, expressed in patch format, varied in size from four lines (e.g., zune: one insert,
one delete, and one context-location line for each edit) to 11
lines (e.g., look utx 4. 3).
Not all trials lead to a successful repair; in the repairs
shown here, an average of 60% of trials produce a primary
repair. The “Time” and “Fitness Evals” columns in Table 1
measure the effort taken for a successful trial. Since all trials
are independent, a number of trials can be run in parallel,
terminating when the first successful trial yields a primary
repair. In addition, the fitnesses of different variants and
the results of different test cases for a given variant can all
be evaluated independently, making our approach easy to
parallelize on multicore architectures. The measurement in
Table 1 was made on a quad-core 3 GHz machine.
Over half of the total time required to create a repair
is spent evaluating fitness by running compiled variants
through test cases. For programs with large test suites, this
cost can be considerable. A dominant factor in the scalability
of our approach is thus the number of such fitness evaluations that must be made to find a repair. The number of fitness evaluations required is related to the size of the search
space and the efficacy of the search strategy: each fitness
evaluation represents a single probe. We hypothesize that
the size of the weighted path is a good representation of the
search space size; recall that we only modify statements along
the path (Sections 2. 2 and 2. 4). Figure 1 shows the results of
an empirical investigation into this relationship, plotting the
average number of fitness evaluations required to produce
each of 18 repairs against the length of their weighted paths
(note log–log scale). Although more data points are needed
before strong conclusions can be drawn, the plot suggests
that the number of fitness evaluations, and thus the search
time, may scale as a power law of the form y = axb where b
is the slope of the best fit line (0.78). This suggests that the
time to find a repair scales nearly linearly with the size of the
weighted path for a fixed number of test cases.
Our approach has also failed to repair some defects,
including those that require many simultaneous edits or
changes that cannot be made directly at the statement level
(e.g., matmul(b, a) should be matmul(a, b) ). We return to
the issue of repair quality in Section 6.
5. ReLateD WoRK
Arcuri2 proposed the idea of using GP to repair software
bugs automatically, and Orlov and Sipper experimented
with evolving Java bytecode. 15 However, our work is the
first to report substantial experimental results on real programs with real bugs. The field of Search-Based Software
Engineering (SBSE) 9 uses evolutionary and related methods for software testing, e.g., to develop test suites, improve
figure 1. GP search time scales with weighted path size. Data are
shown for 18 programs successfully repaired by GP (gcd and zune
examples omitted; figure includes several additional programs to
those listed in table 1), with best linear fit. the x-axis is the base- 10
logarithm of the weighted path length, and the y-axis shows the
logarithm of the total number of fitness evaluations performed
before the primary repair is found (averaged over 100 runs).
4
log of avg fitness evals per repair
3
3. 5
2
2. 5
1
1. 5
f(x) = 0.78x + 0.08
R2 = 0.62
flex units
indent
ccrypt nullhttpd tiff
wu-ftpd
imagemagick
atris
php
openldap
uniq look utx
look svr leukocyte
deroff lighttpd
0.5
0.5
1
3 2. 5 2 1. 5
log of weighted path length
3. 5
4
software project management, and effort estimation find
safety violations and in some cases refactor or reengineer
large software bases. In SBSE, most innovations in the GP
technique involve new kinds of fitness functions, and there
has been less emphasis on novel representations and operators, such as those explored here.
Our approach automatically repairs programs without
specifications. In previous work, we developed an automatic
algorithm for soundly repairing programs with specifications. 23 However, formal specifications are not always available (e.g., there were no formal specifications available for
any of the programs repaired here), so the present work
focuses on test cases to check and ensure correctness.
Trace and fault localization, minimization, and explanation (e.g., Jones and Harrold11) projects also aim to elucidate
faults and ease debugging. These approaches typically narrow down an error symptom to a few lines (a potential cause).
Our work extends this work by proposing a concrete repair.
In addition, these other algorithms are usually limited to the
given trace or source code and will thus never localize the
“cause” of an error to a missing statement or suggest that a
statement be moved. Our approach can infer new code that
should be added, deleted, or swapped: 6 of the 11 repairs in
Table 1 required insertions or swaps.
Demsky et al. 5 present a technique for data structure
repair. Given a formal specification of data structure
consistency, they modify a program so that if the data
structures ever become inconsistent, they can be modified
back to a consistent state at runtime. Their technique does
not modify the program source code in a user-visible way.
Instead, it inserts runtime monitoring code that “patches