[Figure 12: speedup vs. number of processors (8, 16, 32, 64, 128).]
working on cavities that span multiple partitions. However,
load balancing is more of a problem than in the baseline
approach. We also developed a scheduling framework that
gives programmers control over the scheduling policy used
by the Galois runtime.12
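To make the idea of a pluggable scheduling policy concrete, here is a minimal sketch in Java, the language of the Galois implementation described below. All names here (`SchedulingPolicy`, `LifoPolicy`, `FifoPolicy`) are hypothetical illustrations, not the actual Galois API: the point is only that the runtime pulls work items through an interface whose implementation decides the processing order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical pluggable scheduling policy: the runtime asks the
// policy for the next work item; the policy chooses the order.
interface SchedulingPolicy<T> {
    void add(T item);   // new work discovered during execution
    T next();           // null when no work remains
}

// Process newest work first (a depth-first-like order).
class LifoPolicy<T> implements SchedulingPolicy<T> {
    private final Deque<T> deque = new ArrayDeque<>();
    public void add(T item) { deque.addFirst(item); }
    public T next() { return deque.pollFirst(); }
}

// Process oldest work first (a breadth-first-like order).
class FifoPolicy<T> implements SchedulingPolicy<T> {
    private final Deque<T> deque = new ArrayDeque<>();
    public void add(T item) { deque.addLast(item); }
    public T next() { return deque.pollFirst(); }
}

public class PolicyDemo {
    public static void main(String[] args) {
        SchedulingPolicy<String> lifo = new LifoPolicy<>();
        lifo.add("a"); lifo.add("b"); lifo.add("c");
        System.out.println(lifo.next()); // c: newest first

        SchedulingPolicy<String> fifo = new FifoPolicy<>();
        fifo.add("a"); fifo.add("b"); fifo.add("c");
        System.out.println(fifo.next()); // a: oldest first
    }
}
```

For irregular applications such as mesh refinement, the choice of order can matter for both locality and conflict rates, which is why exposing it to the programmer is useful.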
We produced a new implementation of the Galois system in Java, which incorporates these changes. We then
evaluated an implementation of mesh refinement, using
an input mesh of 100,000 triangles. Figure 12 shows the
performance of this implementation on a 128-core Sunfire
system, normalized to a sequential implementation in
plain Java. We see that the Galois system is able to achieve
significant speedup up to 64 cores, beyond which load
imbalance and communication latency begin to dominate
performance.
5. CONCLUSION
In this paper, we described the Galois system, a fresh approach to the automatic parallelization of irregular applications. Rather than attempting to parallelize all programs, no matter how obscurely they are written, our system provides programming abstractions that programmers use to highlight opportunities for exploiting parallelism. The runtime system uses optimistic parallelization to exploit the opportunities highlighted by the programmer: it detects conflicts between concurrent computations and rolls back computations appropriately to preserve the sequential semantics of the program.
Experimental results for two real-world irregular applications, a Delaunay mesh refinement application and a
graphics application that performs agglomerative clustering, demonstrate that this approach is promising.
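The execute/detect/rollback cycle described above can be illustrated with a deliberately simplified, single-threaded Java sketch. All names here are our own, and conflict detection is simulated by a flag; the real Galois runtime detects conflicts through its concurrent data structure library rather than this toy undo log.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Simulates one optimistic iteration: mutate shared state through an
// undo log; if a conflict is detected, roll the mutations back and retry.
public class OptimisticDemo {
    static final Map<String, Integer> shared = new HashMap<>();

    // Records old values so a conflicting iteration can be undone.
    static final Deque<Runnable> undoLog = new ArrayDeque<>();

    static void write(String key, int value) {
        final Integer old = shared.get(key);   // capture for rollback
        undoLog.push(() -> {
            if (old == null) shared.remove(key); else shared.put(key, old);
        });
        shared.put(key, value);
    }

    static void rollback() {
        while (!undoLog.isEmpty()) undoLog.pop().run();
    }

    public static void main(String[] args) {
        shared.put("x", 1);

        // First attempt: pretend a conflict is detected mid-iteration.
        write("x", 99);
        boolean conflict = true;              // stand-in for real detection
        if (conflict) rollback();
        System.out.println(shared.get("x"));  // 1: iteration undone

        // Retry: no conflict this time, so the update commits.
        write("x", 2);
        undoLog.clear();                      // commit: discard undo entries
        System.out.println(shared.get("x"));  // 2
    }
}
```

The essential invariant is that a rolled-back iteration leaves the shared state exactly as it found it, which is what preserves the program's sequential semantics.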
The Galois approach should be viewed as a baseline parallel implementation for irregular applications.
Handwritten parallel versions of many irregular applications exploit particular kinds of structure in these applications to reduce parallel overheads. How do we identify
such opportunities for exploiting structure in irregular
programs? Can the relevant optimizations be performed
automatically by the compiler? How do we reduce run-time overheads? These are some of the exciting research
opportunities that lie ahead.
This work is supported in part by NSF grants 0833162,
0719966, 0702353, 0724966, 0739601, and 0615240, as well
as grants from IBM and Intel Corporation.
References
1. Burke, M., Carini, P., Choi, J.-D. Interprocedural Pointer Alias Analysis. Technical Report IBM RC 21055, IBM Yorktown Heights, 1997.
2. Chew, L.P. Guaranteed-quality mesh
generation for curved surfaces.
In SCG’93: Proceedings of the 9th
Annual Symposium on Computational
Geometry (1993), 274–280.
3. de Galas, J. The quest for more processing power: Is the single core CPU doomed? http://www.anandtech.com/cpuchipsets/showdoc.aspx?I=2377, February 2005.
4. Diniz, P. C., Rinard, M. C. Commutativity
analysis: a new analysis technique
for parallelizing compilers. ACM
Trans. Prog. Lang. Syst. 19, 6 (1997),
942–991.
5. Ghiya, R., Hendren, L. Is it a tree, a DAG, or a cyclic graph? A shape analysis for heap-directed pointers in C. In POPL, 1996.
6. Herlihy, M., Koskinen, E. Transactional boosting: A methodology for highly-concurrent transactional objects. In Principles and Practices of Parallel Programming (PPoPP), 2008.
7. Herlihy, M., Moss, J.E.B. Transactional memory: Architectural support for lock-free data structures. In ISCA ’93: Proceedings of the 20th Annual International Symposium on Computer Architecture (1993).
8. Hudson, B., Miller, G.L., Phillips, T. Sparse parallel Delaunay mesh refinement. In SPAA (2007).
9. Jefferson, D.R. Virtual time. ACM Trans. Prog. Lang. Syst. 7, 3 (1985), 404–425.
10. Kennedy, K., Allen, J., editors.
Optimizing Compilers for Modern
Architectures: A Dependence-Based
Approach. Morgan Kaufmann, 2001.
11. Kulkarni, M., Burtscher, M., Inkulu,
R., Pingali, K., Cascaval, C. How
much parallelism is there in irregular
applications? In Principles and
Practices of Parallel Programming
(PPoPP), 2009.
12. Kulkarni, M., Carribault, P., Pingali, K., Ramanarayanan, G., Walter, B., Bala, K., Chew, L.P. Scheduling strategies for optimistic parallel execution of irregular programs. In Symposium on Parallel Architectures and Algorithms (SPAA) (2008).
13. Kulkarni, M., Pingali, K., Ramanarayanan, G., Walter, B., Bala, K., Chew, L.P. Optimistic parallelism benefits from data partitioning. SIGARCH Comput. Archit. News 36, 1 (2008), 233–243.
Milind Kulkarni and Keshav Pingali ({milind,pingali}@cs.utexas.edu), University of Texas, Austin.
14. Kulkarni, M., Pingali, K., Walter, B., Ramanarayanan, G., Bala, K., Chew, L.P. Optimistic parallelism requires abstractions. SIGPLAN Not. (Proceedings of PLDI 2007) 42, 6 (2007), 211–222.
15. Larus, J., Rajwar, R. Transactional Memory (Synthesis Lectures on Computer Architecture). Morgan & Claypool Publishers, 2007.
16. Ni, Y., Menon, V., Adl-Tabatabai, A.-R., Hosking, A.L., Hudson, R., Moss, J.E.B., Saha, B., Shpeisman, T. Open nesting in software transactional memory. In Principles and Practices of Parallel Programming (PPoPP), 2007.
17. Tan, P.-N., Steinbach, M., Kumar, V., editors. Introduction to Data Mining. Pearson Addison Wesley, 2005.
18. Ponnusamy, R., Saltz, J., Choudhary, A. Runtime compilation techniques for data partitioning and communication schedule reuse. In Proceedings of the 1993 ACM/IEEE Conference on Supercomputing (1993).
19. Rauchwerger, L., Padua, D.A.
The LRPD test: speculative
run-time parallelization of loops
with privatization and reduction
parallelization. IEEE Trans. Parallel
Distrib. Syst. 10, 2 (1999), 160–180.
20. Sagiv, M., Reps, T., Wilhelm, R. Solving shape-analysis problems in languages with destructive updating. In Proceedings of the 23rd Annual ACM Symposium on the Principles of Programming Languages (St. Petersburg Beach, FL, January 1996).
21. Shewchuk, J.R. Triangle: Engineering a 2D quality mesh generator and Delaunay triangulator. In Applied Computational Geometry: Towards Geometric Engineering, volume 1148 of Lecture Notes in Computer Science. May 1996, 203–222.
22. Steffan, J.G., Colohan, C.B., Zhai, A., Mowry, T.C. A scalable approach to thread-level speculation. In ISCA ’00: Proceedings of the 27th Annual International Symposium on Computer Architecture (2000).
23. Walter, B., Fernandez, S., Arbree, A., Bala, K., Donikian, M., Greenberg, D. Lightcuts: A scalable approach to illumination. ACM Trans. Graphics (SIGGRAPH) 24, 3 (July 2005), 1098–1107.
24. Zhang, Y., Rauchwerger, L., Torrellas, J. Hardware for speculative run-time parallelization in distributed shared-memory multiprocessors. In HPCA ’98: Proceedings of the 4th International Symposium on High-Performance Computer Architecture (1998).

Bruce Walter, Ganesh Ramanarayanan, Kavita Bala, and L. Paul Chew (bjw@graphics.cornell.edu, {graman,kb,chew}@cs.cornell.edu), Cornell University, Ithaca, NY.