118 COMMUNICATIONS OF THE ACM | FEBRUARY 2016 | VOL. 59 | NO. 2
that were competitive with expert hand-written code. The
reason is suggested by Figure 3, which abstractly depicts
the search space for the Montgomery multiplication benchmark shown in Figure 1. For loop-free code sequences,
llvm −O0 and gcc −O3 differ primarily in stack use and
instruction selection, but otherwise produce algorithmically similar results. Compilers are generally designed to
compose many small local transformations: dead code
elimination deletes an instruction, constant propagation
changes a register to an immediate, and strength reduction replaces a multiplication with an addition. Sequences of
local optimizations such as these correspond to regions of
equivalent code sequences that are densely connected by
very short sequences of moves (often just one) and easily
traversed by local search methods. Beginning from llvm
−O0 code, MCMC sampling will quickly improve local inefficiencies one by one and hill climb its way to the equivalent of gcc −O3 code.
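The local transformations named above can be sketched on a toy three-address IR; the pass names, instruction encoding, and liveness handling here are illustrative stand-ins, not STOKE's or any compiler's actual implementation:

```python
# Toy instructions are tuples (op, dst, src_a, src_b).

def strength_reduce(inst):
    # Replace a multiplication by a power of two with a shift.
    op, dst, a, b = inst
    if op == "mul" and isinstance(b, int) and b > 0 and b & (b - 1) == 0:
        return ("shl", dst, a, b.bit_length() - 1)
    return inst

def dead_code_eliminate(code, live):
    # Walk backwards, dropping instructions whose result is never read.
    out = []
    for op, dst, a, b in reversed(code):
        if dst in live:
            out.append((op, dst, a, b))
            live.discard(dst)
            live.update(x for x in (a, b) if isinstance(x, str))
    return list(reversed(out))

code = [("mul", "t0", "x", 8), ("add", "t1", "t0", 1), ("mul", "t2", "x", 3)]
code = [strength_reduce(i) for i in code]
code = dead_code_eliminate(code, {"t1"})   # only t1 is live at the end
print(code)  # [('shl', 't0', 'x', 3), ('add', 't1', 't0', 1)]
```

Each pass changes at most one instruction at a time, which is exactly why such rewrites stay within a densely connected region of equivalent code sequences.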
The code discovered by STOKE occupies an entirely different region of the search space. As remarked earlier, it represents a completely distinct algorithm for implementing
the kernel at the assembly level. The only way for a local search procedure to produce code of this form from compiler-generated code is to traverse the extremely low-probability path that builds the new code in place next to the original (all the while increasing its cost), only to delete the original code at the very end.
Although MCMC sampling is guaranteed to traverse this path in the limit, the likelihood of it doing so in any reasonable amount of time is so low as to be useless in practice.
This observation motivates the division of cost minimization into two phases:
• A synthesis phase focused solely on correctness, which attempts to locate regions of equivalent code sequences that are distinct from the region occupied by the target.
• An optimization phase focused on performance, which searches for the fastest sequence within each of those regions.
These phases share the same implementation and differ
only in starting point and acceptance function. Synthesis
begins with a random code sequence, while optimization
begins from a code sequence that is known to be equivalent
to the target. Synthesis ignores the perf(·) term and uses
Equation (10) as its cost function, whereas optimization
uses both terms, which allows it to improve performance
while also experimenting with “shortcuts” that (temporarily) violate correctness.
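The shared implementation of the two phases might be sketched as follows. The `mcmc` driver, the cost callables, and the proposal function are hypothetical stand-ins: STOKE's real cost terms operate on x86-64 code and test cases, and `beta` is an assumed annealing constant.

```python
import math
import random

def mcmc(start, cost, propose, beta=1.0, iters=10000):
    # Generic Metropolis driver: both phases share this loop and
    # differ only in starting point and cost function.
    cur, cur_cost = start, cost(start)
    best, best_cost = cur, cur_cost
    for _ in range(iters):
        cand = propose(cur)
        cand_cost = cost(cand)
        # Accept improvements always; accept regressions with
        # probability exp(-beta * (cand_cost - cur_cost)).
        if cand_cost <= cur_cost or \
           random.random() < math.exp(-beta * (cand_cost - cur_cost)):
            cur, cur_cost = cand, cand_cost
        if cur_cost < best_cost:
            best, best_cost = cur, cur_cost
    return best

def synthesize(eq_cost, propose, random_rewrite):
    # Synthesis: random starting point, correctness term only.
    return mcmc(random_rewrite(), eq_cost, propose)

def optimize(eq_cost, perf_cost, propose, target):
    # Optimization: start from code equivalent to the target; adding
    # the perf term lets search trade temporary correctness for speed.
    return mcmc(target, lambda r: eq_cost(r) + perf_cost(r), propose)
```

As a toy usage, `synthesize(lambda x: abs(x - 42), lambda x: x + random.choice([-1, 1]), lambda: 0)` hill climbs an integer toward the unique zero-cost value 42, mirroring how synthesis descends toward a zero-cost (correct) rewrite.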
It is perhaps unintuitive that synthesis should be able to
produce a correct rewrite from such an enormous search
space in a tractable amount of time. In our experience, synthesis is effective precisely when it is possible to discover portions of a correct rewrite incrementally, rather than all at once. Figure 4 compares cost over time against the percentage of instructions that appear in the final rewrite for
the Montgomery multiplication benchmark. As synthesis
proceeds, the percentage of correct code increases in inverse
proportion to the value of the cost function.
While this is encouraging and there are many code
sequences that can be synthesized in pieces, there are
many that cannot. Fortunately, even when synthesis fails, optimization is still possible; it simply must use the region occupied by the target as its starting point.
6. SEARCH OPTIMIZATIONS
Equation (10) is sufficiently fast for MCMC sampling, but its performance can be improved further. As
described above, the eq*(·) term is computed by executing a proposal on test cases, noting the ratio in total cost with respect to the current rewrite, and then sampling a random variable to decide whether to accept the proposal. However, by first sampling p, and then computing the maximum ratio that the algorithm will accept for that value, it is possible to terminate the evaluation of test cases as soon as that bound is exceeded and the proposal is guaranteed to be rejected.
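This early-termination trick can be sketched as follows, assuming a Metropolis acceptance test over an exponentiated cost; the names `beta` and `run_testcase` are illustrative, not STOKE's API:

```python
import math
import random

def accepts_with_early_exit(testcases, current_cost, run_testcase, beta=1.0):
    # Sample the acceptance threshold p up front ...
    p = random.random()
    # ... then invert the Metropolis test: the proposal is accepted when
    # exp(-beta * (proposal_cost - current_cost)) >= p, i.e. when
    # proposal_cost <= current_cost - ln(p) / beta.
    max_cost = current_cost - math.log(p) / beta
    total = 0.0
    for t in testcases:
        total += run_testcase(t)  # per-test-case error contribution
        if total > max_cost:
            # Remaining test cases cannot lower the total; rejection
            # is already guaranteed, so stop evaluating here.
            return False, total
    return True, total
```

Because test-case costs only accumulate, crossing `max_cost` partway through the suite makes the remaining executions pointless, which is what makes the early exit sound.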
Figure 4. Cost over time versus percentage of instructions
that appear in the final zero-cost rewrite for the Montgomery
multiplication synthesis benchmark.
Figure 5. Proposals evaluated per second versus test cases evaluated
prior to early termination, for the Montgomery multiplication
synthesis benchmark. Cost function shown for reference.