Table 1: Slowdown for the sort benchmark on an input size of 100,000 when the
program is tuned on a machine different from the one it is run on.
Slowdowns are relative to training natively. Descriptions of abbreviated system
names can be found in Table 2.
Heterogeneous architectures. Recent trends point toward systems that combine multiple
types of cores in a single system; examples include coprocessors such
as GPGPUs and asymmetric multi-core
chips [3]. We are expanding our run-time and autotuning infrastructure to
support these types of processors so
PetaBricks programs can effectively
utilize heterogeneous architectures.
Distributed memory and clusters. An interesting future direction
is support for a distributed memory
back-end so that we can run unmodified PetaBricks programs on clusters.
Moving to clusters will add even more
choices for the compiler to analyze, as
it must decide both what algorithms to
use and where to run them. A key challenge in this area is autotuning the
management of data.
For readers who are more applications-oriented, please see an in-depth
example of our framework applied to a
variable-accuracy multigrid solver [2].
PORTABILITY AND PERFORMANCE
Getting consistent, scalable, and portable performance is difficult. The compiler has the daunting task of selecting
an effective optimization configuration from possibilities with drastically
different impacts on performance.
No single choice of parameters can
yield the best possible result as different algorithms may be required under
different circumstances. The high performance computing community has
always known that in many problem
domains, the best sequential algorithm is different from the best parallel algorithm. Varying problem size
and data sets will also require different algorithms. Currently there is no
viable way to incorporate all of these
algorithmic choices into a single program and produce portable programs
with consistently high performance.
We have introduced the first language that allows programmers to
naturally express algorithmic choice
explicitly so as to empower the compiler
to perform deeper optimization. We
have created a compiler and an autotuner that are not only able to compose a
complex program using fine-grained algorithmic choices but also to find the right
choices for many other parameters.
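To illustrate what a tuned choice configuration encodes, consider the following sketch (in Python, not PetaBricks syntax; the names and the cutoff values are our own invention, not a real tuned configuration). A program with tuned algorithmic choices amounts to a size-based dispatcher whose rules resemble the rows of Table 2:

```python
# Hypothetical sketch: a hybrid sort mimicking a tuned configuration.
# The (cutoff, algorithm) rules below are illustrative only.

def insertion_sort(a):
    # Plain insertion sort; fast for small inputs.
    for i in range(1, len(a)):
        key, j = a[i], i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a

def merge_sort_2way(a):
    # 2-way merge sort that re-enters the dispatcher on sub-problems,
    # so smaller sub-arrays may switch to a different algorithm.
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    left, right = hybrid_sort(a[:mid]), hybrid_sort(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

# "Algorithm choices with switching points": insertion sort for inputs
# up to size 75, 2-way merge sort for everything larger (cf. Table 2).
RULES = [(75, insertion_sort), (float("inf"), merge_sort_2way)]

def hybrid_sort(a):
    for cutoff, algorithm in RULES:
        if len(a) <= cutoff:
            return algorithm(list(a))
```

Because the merge step recurses through the dispatcher, the tuned rules apply at every level of the decomposition, which is exactly why a single tuned program can use several algorithms on one input.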
Trends show that programs have
lifetimes running into decades, while
architectures are much shorter-lived.
With the advent of multi-core processors, architectures are experiencing drastic changes at an even faster
rate. Under these circumstances, it
is a daunting task to write a program
that will perform well on not only today’s architectures but also those of
the future. We believe that PetaBricks
can give programs the portable performance needed to increase their effective lifetimes.
Jason Ansel and Cy Chan are PhD candidates in the
Computer Science and Artificial Intelligence Laboratory at
MIT. Ansel is advised by Professor Saman Amarasinghe and
Chan is advised by Professor Alan Edelman.
RECOMMENDED FURTHER READING
There is a large variety of work related to
PetaBricks' autotuning approach to optimizing programs. Please see our original PetaBricks paper [1] for a more detailed description of our framework and
a more comprehensive listing of related work.
1. Ansel, J., Chan, C., Wong, Y. L., Olszewski, M., Zhao, Q.,
Edelman, A., and Amarasinghe, S. 2009. PetaBricks:
A language and compiler for algorithmic choice. In
ACM SIGPLAN Conference on Programming Language
Design and Implementation, Dublin, Ireland.
2. Chan, C., Ansel, J., Wong, Y. L., Amarasinghe,
S., and Edelman, A. 2009. Autotuning multigrid
with PetaBricks. In ACM/IEEE Conference on
Supercomputing, Portland, Ore.
3. Kumar, R., Tullsen, D. M., Jouppi, N. P., and
Ranganathan, P. 2005. Heterogeneous chip
multiprocessors. Computer 38, 32–38.
© 2010 ACM 1528-4972/10/0900 $10.00
Table 2: Automatically tuned configuration settings for the sort benchmark on various architectures.

System: Algorithm choices (with switching points)
Core 2 Duo Mobile: IS (150) 8MS (600) 4MS (1295) 2MS (38400) QS (∞)
Xeon E7340 (2x4 core), 1-way: IS (75) 4MS (98) RS (∞)
Xeon E7340 (2x4 core), 8-way: IS (600) QS (1420) 2MS (∞)
Sun Fire T200 Niagara: 16MS (75) 8MS (1461) 4MS (2400) 2MS (∞)

Abbreviations: IS = insertion sort; QS = quick sort; RS = radix sort;
16MS, 8MS, 4MS, 2MS = 16-, 8-, 4-, and 2-way merge sort.

The numbers in parentheses next to each algorithm indicate the upper bound
on the range of input sizes handled by that algorithm; for example, in the Xeon
1-way configuration, insertion sort is used for sizes up to 75, 4MS for 76 to 98,
and RS for 99 and larger. Multiple algorithms are used as the problem is
recursively decomposed into smaller and smaller sub-problems.
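To convey how such switching points might be discovered, here is a deliberately naive autotuning sketch (all names, candidate values, and the timing methodology are our own simplifications; the actual PetaBricks tuner is far more sophisticated, tuning many choices at once and working up through progressively larger input sizes):

```python
# Sketch of a naive autotuner: time a hybrid sort at several candidate
# insertion-sort cutoffs and keep the fastest. This only conveys the
# measure-and-select idea behind autotuning switching points.
import random
import time

def insertion_sort(a):
    for i in range(1, len(a)):
        key, j = a[i], i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a

def hybrid_sort(a, cutoff):
    # Switch to insertion sort below the cutoff, 2-way merge sort above.
    if len(a) <= cutoff:
        return insertion_sort(list(a))
    mid = len(a) // 2
    left, right = hybrid_sort(a[:mid], cutoff), hybrid_sort(a[mid:], cutoff)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def autotune(candidates=(4, 16, 64, 256), size=20000, trials=3):
    # Measure each candidate cutoff on the same random input and
    # return the one with the lowest total wall-clock time.
    data = [random.random() for _ in range(size)]
    best_cutoff, best_time = None, float("inf")
    for cutoff in candidates:
        t0 = time.perf_counter()
        for _ in range(trials):
            hybrid_sort(data, cutoff)
        elapsed = time.perf_counter() - t0
        if elapsed < best_time:
            best_cutoff, best_time = cutoff, elapsed
    return best_cutoff
```

The winning cutoff depends on the machine the loop runs on, which is the point of Tables 1 and 2: a configuration tuned on one architecture is generally not the best configuration for another.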