up” inconsistent state so that the buggy program can continue to execute. Thus, their programs continue to incur
the runtime overhead after the repair is effected. Another
difference from our work is that their data structure repair
requires formal specifications. Finally, their technique is
limited to data structures and does not address the full
range of logic errors. The gcd infinite loop in Section 3, for
example, is outside the scope of this technique. However,
this technique complements ours: in cases where runtime
data structure repair does not provide a viable long-term
solution, it may enable the program to continue to execute
while our technique searches for a long-term repair.
Clearview16 automatically detects and patches assem-bly-level errors in deployed software. Clearview monitors a
program at runtime, learns invariants that characterize normal behavior, and subsequently flags violations for repair.
Candidate patches that make the implicated invariant true
are generated and tested dynamically. Although the performance overhead of Clearview is high, it has successfully been
applied to buggy versions of Mozilla Firefox and evaluated
against a Red Team of hackers. However, Clearview can repair
only those errors that are relevant to selected monitors. Our
method is more generic, providing a single approach to repair
multiple classes of faults without the need for specific monitors, and we do not require continual runtime monitoring
(and the incumbent slowdown) to create and deploy repairs.
This body of work illustrates a burgeoning interest in
the problem of automated software repair and some of the
many possible approaches that might be tried. There are
several other recent but less mature proposals for automatically finding and repairing bugs in software, e.g., Dallmeier
et al., 4 suggesting that we can expect rapid progress in this
area over the next several years.
6. Discussion
The results reported here demonstrate that GP can be
applied to the problem of bug repair in legacy C programs.
However, there are some caveats.
Basic limitations. First, we assume that the defect is reproduc-
ible and that the program behaves deterministically on the
test cases. This limitation can be mitigated by running the test
cases multiple times, but ultimately if the program behavior is
random it will be difficult for our method to find or evaluate a
correct patch. We further assume that positive test cases can
encode program requirements. Test cases are much easier to
obtain than formal specifications or code annotations, but if
too few are used, a repair could sacrifice important functional-
ity. In practice, we are likely to have too many test cases rather
than too few, slowing down fitness evaluation and impeding
the search. We also assume that the path taken along the nega-
tive test case is different from the positive path. If they over-
lap completely, our weighted representation will not guide GP
modifications as effectively. Finally, we assume that the repair
can be constructed from statements already extant in the pro-
gram; in future work, we plan to extend our method to include
a library of repair templates.
evolution. One concern about our results to date is the role
of evolution. Most of our repairs result from one or two ran-
dom modifications to the program, and they are often found
within the first few generations or occasionally, not at all.
We have conducted some experiments using a brute force
algorithm (which applies simple mutation operations in a
predetermined order) and random search (which applies
mutation operations randomly without any selection or
inheritance of partial solutions). Both these simpler alterna-
tives perform as well or better than the GP on many, but not
all, of our benchmark programs. We do not fully understand
what characteristics, either of the program or the particular
bug, determine how easily a solution can be found through
random trial and error. However, thus far GP outperforms
the other two search strategies in cases where the weighted
path is long (i.e., where the fault is difficult to localize).
There are several interesting questions related to the design
of our GP algorithm, but the overall process proved so suc-
cessful initially that we have not experimented carefully with
parameter values, selection strategies, and operator design.
These all could almost certainly be improved.