merge sort to insertion sort during its
recursive decomposition whenever the
best algorithm for the current input
size changes.
Second, we illustrate the benefits to
program portability. Multi-core architectures have drastically increased the
processor design space resulting in a
large variety of processor designs currently on the market. Such variance can
significantly hinder efforts to port performance critical code. We optimized
our sort benchmark on three parallel architectures designed for a variety of purposes: Intel Core 2 Due mobile processor, Intel Xeon E7340 server processor,
and the Sun Fire T200 Niagara low power, high throughput server processor.
Table 1 illustrates the benefits of
targeted tuning on each architecture.
We found that programs trained on architectures different from the ones on
which they are run exhibit significant
slowdowns. For example, even though
they have the same number of cores,
the autotuned configuration file from
the Niagara machine results in a 2.35x
loss of performance when used on the
Xeon processor. On average we observed a slowdown of 1.68x across all
of the systems we tested.
Table 2 displays the optimal configurations for the sort benchmark after
running the same autotuning process
on the three architectures. It is interesting to note the dramatic differences
bet ween the choice of algorithms, composition switching points, and scalability. The Intel architectures (with
Figure 3: The flow for the compilation of a PetaBricks program with a single transform is shown. Additional transforms would cause the center part of the diagram
to be duplicated.
tation is the choice dependency graph,
which is the primary way that choices
are represented in PetaBricks. At a
high level, the information contained
in the choice dependency graph is similar to the dependency graph shown
for our example program in Figure 2.
However, the data is represented as an
“inverse” of that graph: data dependencies (previously represented by edges)
are represented by vertices, while rules
(previously represented by vertices) are
represented by graph hyperedges. Additionally, data may be split into multiple vertices in the choice dependency
graph if the transform contains rules
that operate on just subregions of that
data. The PetaBricks compiler uses
this graph to manage code choices and
to synthesize the outer control flow of
the rules.
The final phase of compilation generates an output binary and a training
information file containing static analysis information. These two outputs
are used by the autotuner to search the
space of possible algorithmic paths.
Autotuning creates a choice configuration file, which can either be used by
the output binary to run directly or can
be fed back into the compiler to allow
additional optimizations.
The autotuner follows a genetic al-
gorithm approach to search through
the available choice space. It maintains
a population of candidate algorithms
which it continually expands using a
fixed set of high level mutators, which
are generated through static code
analysis. The autotuner then prunes
the population in order to allow it to
evolve more optimal algorithms. The
input sizes used for testing during this
process grow exponentially, which nat-
urally exploits any optimal substruc-
ture inherent to most programs.
PERFORMANCE
In our prior work [ 1], we implemented
a suite of benchmarks to showcase the
benefits of using our compiler and autotuner. First, we illustrate the impact of
allowing greater algorithmic flexibility
on computational performance. Figures 4-6 show the performance of our
autotuned sort, eigenproblem, and 2D
Poisson’s equation solver algorithms,
respectively, compared to implementations that utilize a single algorithmic
choice. In all cases the autotuned algorithm obtained a significant speedup.
These results were gathered on a 8-way
(dual socket, quad core) Intel Xeon
E7340 system running at 2. 4 GHz using 64-bit CSAIL Debian 4.0 with Linux
kernel 2. 6. 18 and GCC 4. 1. 2.
Note that in some cases, such as
with our sort benchmark, the autotuned algorithm was able to significantly outperform the best available
single algorithm. This is because our
autotuned algorithm is able to leverage
multiple algorithms during a single execution. Our autotuned sort program
is able to switch from radix sort to
“Multi-core
architectures have
drastically increased
the processor design
space resulting in
a large variety of
processor designs
currently on the
market.”