FEBRUARY2016 | VOL. 59 | NO. 2 | COMMUNICATIONS OF THE ACM 117
MCMC sampling, rerank each based on their actual runtimes, and return the best result.
Finally, there is the implementation of MCMC sampling for
x86_ 64 optimization. Rewrites are represented as loop-free
sequences of instructions of length l, where a distinguished
token (UNUSED) is used to represent unused instruction
slots. This simplification to sequences of bounded length is
crucial, as it places a constant value on the dimensionality
of the resulting search space. 1 The proposal distribution q(·)
chooses from four possible moves: the first two minor and
the last two major:
• Opcode. An instruction is randomly selected, and its
opcode is replaced by a random opcode.
• Operand. An instruction is randomly selected and one
of its operands is replaced by a random operand.
• Swap. Two lines of code are randomly selected and
• Instruction. An instruction is randomly selected and
replaced either by a random instruction or the UNUSED
token. Proposing UNUSED corresponds to deleting an
instruction, and replacing UNUSED by an instruction
corresponds to inserting an instruction.
These moves satisfy the ergodicity property described in
Section 3: any code sequence can be transformed into any
other through repeated application of instruction moves.
These moves also satisfy the symmetry property, and allow
the use of Equation ( 5). To see why, note that the probabilities of performing all four moves are equal to the probabilities of undoing the transformations they produce using a
move of the same type: opcode and operand moves are constrained to sample from identical equivalence classes before
and after acceptance, and swap and instruction moves are
unconstrained in either direction.
5. SEARCH STRATEGIES
An early version of the implementation described above
was able to transformllvm −O0 code into the equiva-
lent of gcc −O3 code, but was unable to produce results
of between 1 and 10 million per second. Using this imple-
mentation, we define an optimized method for computing
eq(·), which achieves sufficient throughput to be useful for
Besides improved performance, Equation ( 10) has
two desirable properties. First, failed computations of
eq(·) will produce a counterexample test case4 that can
be used to refine τ. Although doing so changes the search
space defined by cost(·), in practice the number of failed
validations that are required to produce a robust set of
test cases that accurately predict correctness is quite low.
Second, as remarked above, it improves the search space
by smoothly interpolating between correct and incorrect
Similar considerations apply to the implementation
of the perf(·) term. Although it seems natural to JIT com-
pile both target and rewrite and compare runtimes, the
amount of time required to execute a code sequence suf-
ficiently many times to eliminate transient effects is pro-
hibitively expensive. To account for this, we define a simple
heuristic for approximating runtime which is based on a
static approximation of the average latencies (lat(·)), of
Figure 2 shows a reasonable correlation between this heuristic and actual runtimes for a representative corpus of code
sequences. Outliers are characterized either by disproportionately high instruction level parallelism at the micro-op
level or inconsistent memory access times. A more accurate
model of the higher order performance effects introduced
by a modern CISC processor is feasible if tedious to construct and would likely be necessary for more complex code
Regardless, this approximation is sufficient for the benchmarks that we consider (Section 8). Errors that result from
this approach are addressed by recomputing perf(·) using
the JIT compilation method as a postprocessing step. During
search, we record the n lowest cost programs produced by
Predicted runtime (unitless)
40 50 60 70
Figure 2. Predicted versus observed runtimes for selected
code sequences. Outliers are characterized by instruction level
parallelism and memory effects.
Figure 3. Search space for the Montgomery multiplication
benchmark: O0 and O3 codes are densely connected, whereas expert
code is reachable only by an extremely low probability path.