Dynamic analysis techniques help
programmers find the root cause of bugs
in large-scale parallel applications.
BY IGNACIO LAGUNA, DONG H. AHN, BRONIS R. DE SUPINSKI,
TODD GAMBLIN, GREGORY L. LEE, MARTIN SCHULZ,
SAURABH BAGCHI, MILIND KULKARNI, BOWEN ZHOU,
ZHEZHE CHEN, AND FENG QIN
BREAKTHROUGHS IN SCIENCE and engineering are
increasingly made with the help of high-performance
computing (HPC) applications. From understanding
the process of protein folding to estimating short- and
long-term climate patterns, large-scale parallel HPC
simulations are the tools of choice. The applications
can run detailed numerical simulations that model
the real world. Given the great public importance of
such scientific advances, the numerical correctness
and software reliability of these applications are major
concerns for scientists.
Debugging parallel programs is significantly more
difficult than debugging serial programs; human
cognitive abilities are overwhelmed when dealing
• Bugs in parallel HPC applications
are difficult to debug because errors
propagate quickly among compute nodes,
programmers must debug thousands of
nodes or more, and bugs might manifest
only at large scale.
• Although conventional approaches like
testing, verification, and formal analysis
can detect a variety of bugs, they struggle
at massive scales and do not always
account for important dynamic properties
of program execution.
• Dynamic analysis tools and algorithms that
scale with an application’s input and number
of compute nodes can help programmers
pinpoint the root cause of bugs.