Stochastic Program Optimization
By Eric Schkufza, Rahul Sharma, and Alex Aiken
The optimization of short, loop-free sequences of fixed-point
assembly code is an important problem in high-performance computing. However, the competing constraints
of transformation correctness and performance improvement often force even special-purpose compilers to produce sub-optimal code. We show that by encoding these
constraints as terms in a cost function, and using a Markov
Chain Monte Carlo sampler to rapidly explore the space of
all possible code sequences, we are able to generate aggressively optimized versions of a given target code sequence.
Beginning from binaries compiled by llvm −O0, we are able
to produce provably correct code sequences that either
match or outperform the code produced by gcc −O3, icc
−O3, and in some cases expert handwritten assembly.
1. INTRODUCTION
For many application domains there is considerable value
in producing the most performant code possible. However,
the traditional structure of a compiler’s optimization phase
is ill-suited to this task. Factoring the optimization problem
into a collection of small subproblems that can be solved
independently—although suitable for generating consistently good code—can lead to sub-optimal results. In many
cases, the best possible code can only be obtained through
the simultaneous consideration of mutually dependent
issues such as instruction selection, register allocation, and instruction scheduling.
Previous approaches to this problem have focused on the
exploration of all possibilities within some limited class of
code sequences. In contrast to a traditional compiler, which
uses performance constraints to drive the generation of a single sequence, these systems consider multiple sequences and
select the one that is best able to satisfy those constraints. An
attractive feature of this approach is completeness: if a code
sequence exists that meets the desired constraints it is guaranteed to be found. However, completeness also places practical limitations on the type of code that can be considered.
These techniques are limited either to sequences that are
shorter than the threshold at which many interesting optimizations take place, or to code written in simplified languages.
We overcome this limitation through the use of incomplete search: the competing requirements of correctness
and performance improvement are encoded as terms in
a cost function which is defined over the space of all loop-free x86_64 instruction sequences, and the optimization
task is formulated as a cost minimization problem. While
the search space is highly irregular and not amenable to
exact optimization techniques, we show that the common approach of employing a Markov Chain Monte Carlo
(MCMC) sampler to explore the function and produce low-cost samples is sufficient for producing high-quality code.
Although the technique sacrifices completeness, it produces dramatic improvements in the quality of the resulting code.
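The cost-minimization loop described above can be illustrated with a minimal sketch. The toy below is not STOKE itself: the "programs" are lists of invented arithmetic instructions, the correctness term counts test-case mismatches, the performance term is simply program length, and the proposal distribution mutates one instruction at a time. Acceptance follows the standard Metropolis criterion, exp(−β·Δcost).

```python
import math
import random

# Toy instruction set: (opcode, constant) pairs acting on a single value.
INSTRUCTIONS = [("mul", c) for c in range(4)] + [("add", c) for c in range(4)]

def run(prog, x):
    for op, c in prog:
        x = x * c if op == "mul" else x + c
    return x

TESTS = [0, 1, 2, 5, 9]

def target(x):
    return 2 * x + 1

def cost(prog):
    # Correctness term dominates (100 per failing test); length stands in
    # for the performance term (a real system would estimate latency).
    errors = sum(run(prog, x) != target(x) for x in TESTS)
    return 100 * errors + len(prog)

def mutate(prog):
    prog = list(prog)
    kind = random.randrange(3)
    if kind == 0 and prog:            # replace a random instruction
        prog[random.randrange(len(prog))] = random.choice(INSTRUCTIONS)
    elif kind == 1:                   # insert a random instruction
        prog.insert(random.randrange(len(prog) + 1), random.choice(INSTRUCTIONS))
    elif kind == 2 and prog:          # delete a random instruction
        del prog[random.randrange(len(prog))]
    return prog

def mcmc_optimize(start, iters=20000, beta=1.0, seed=0):
    random.seed(seed)
    cur, cur_cost = start, cost(start)
    best, best_cost = cur, cur_cost
    for _ in range(iters):
        cand = mutate(cur)
        cand_cost = cost(cand)
        delta = cand_cost - cur_cost
        # Metropolis acceptance: always take improvements; occasionally take
        # regressions, which lets the search escape local minima.
        if delta <= 0 or random.random() < math.exp(-beta * delta):
            cur, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = cur, cur_cost
    return best, best_cost

# Start from a correct but needlessly long program for 2x + 1.
start = [("add", 0), ("mul", 2), ("add", 1)]
best, best_cost = mcmc_optimize(start)
```

In the real system the correctness term is computed against test cases and backed by a formal equivalence check, and the performance term approximates runtime; both are stand-ins here.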
Figure 1 shows two versions of the Montgomery multiplication kernel used by the OpenSSL RSA encryption library.
Beginning from code compiled by llvm −O0 (116 lines, not
shown), our prototype stochastic optimizer STOKE produces
code (right) that is 16 lines shorter and 1.6 times faster than
the code produced by gcc −O3 (left), and even slightly faster
than the expert handwritten assembly included in the
OpenSSL repository.
2. RELATED WORK
Although techniques that preserve completeness are effective within certain domains, their general applicability
remains limited. The shortcomings are best highlighted in
the context of the code sequence shown in Figure 1.
The code does the following: two 32-bit values, ecx and
edx, are concatenated and multiplied by the 64-bit value
rsi to produce a 128-bit product. The 64-bit values in rdi
and r8 are added to that product, and the result is stored
in r8 and rdi. The version produced by gcc −O3 (left)
implements the 128-bit multiplication as four 64-bit multiplications and a summation of the results. In contrast, the
version produced by STOKE (right) uses a hardware intrinsic which requires that the inputs be permuted and moved
to distinguished register locations so that the multiplication
may be performed in a single step. The odd-looking move on
line 4 (right) produces the non-obvious but necessary side
effect of zeroing the upper 32 bits of rdx.
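The register-level specification just described is easy to check with plain arithmetic. The sketch below encodes it in Python, assuming (our reading; the assignment of halves is not stated explicitly above) that r8 receives the high 64 bits of the result and rdi the low 64 bits. Note that the full expression always fits in 128 bits: the largest possible value, (2^64 − 1)·(2^64 − 1) + 2·(2^64 − 1), is exactly 2^128 − 1.

```python
MASK64 = (1 << 64) - 1

def montmul_spec(ecx, edx, rsi, rdi, r8):
    """Reference semantics of the Figure 1 kernel.

    ecx and edx hold 32-bit values; rsi, rdi, and r8 hold 64-bit values.
    Returns the new (r8, rdi) pair as the (high, low) 64-bit halves.
    """
    # Concatenate ecx:edx, multiply by rsi, then add rdi and r8.
    prod = (((ecx << 32) | edx) * rsi) + rdi + r8   # 128-bit intermediate
    return (prod >> 64) & MASK64, prod & MASK64

# Maximal inputs exercise the full 128-bit width without overflow.
hi, lo = montmul_spec(0xFFFFFFFF, 0xFFFFFFFF, MASK64, MASK64, MASK64)
```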
Massalin’s superoptimizer [12] explicitly enumerates sequences
of code of increasing length and selects the first that behaves
identically to the input sequence on a set of test cases.
Massalin reports optimizing instruction sequences of up to
length 12, but only after restricting the set of enumerable
opcodes to between 10 and 15. In contrast, STOKE uses a large
subset of the nearly 400 x86_64 opcodes, some with over
20 variations, to produce the 11 instruction kernel shown in
Figure 1. It is unlikely that Massalin’s approach would scale
to an instruction set of this size.
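Massalin-style enumeration is simple to sketch, and the sketch makes the scaling problem concrete: with k opcodes, length-L enumeration visits k^L candidates. The toy below is not Massalin's system; the three-opcode set and the target function are invented for illustration. It returns the first, and hence shortest, sequence whose behavior matches the target on every test input.

```python
from itertools import product

# A deliberately tiny opcode set; real superoptimizers face ~400 x86_64 opcodes.
OPS = {
    "inc":    lambda x: x + 1,
    "dec":    lambda x: x - 1,
    "double": lambda x: 2 * x,
}

def run(seq, x):
    for op in seq:
        x = OPS[op](x)
    return x

def superoptimize(target, tests, max_len):
    """Return the first (hence shortest) opcode sequence matching target."""
    for length in range(1, max_len + 1):
        for seq in product(OPS, repeat=length):
            if all(run(seq, x) == target(x) for x in tests):
                return seq
    return None

# Find the shortest sequence computing 2x + 2: increment, then double.
found = superoptimize(lambda x: 2 * x + 2, tests=[0, 1, 5, 9], max_len=3)
```

Even at this scale the search is exponential (3^L candidates at length L), which is why Massalin restricted enumeration to between 10 and 15 opcodes.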
Denali [9] and Equality Saturation [17] achieve improved
scalability by only considering code sequences that are
equivalent to the input sequence; candidates are explored
through successive application of equality-preserving transformations. Because both techniques are goal-directed,
they dramatically increase both the number of primitive instructions and the length of sequences that can be considered in
practice. However, both also rely heavily on expert knowledge. It is unclear whether an expert would know to encode
the rules required to produce a transformation such as the one shown in Figure 1.
This work was supported by NSF grant CCF-0915766 and
the Army High Performance Computing Research Center.