Developing high-performance CUDA kernels revolves around efficient use of several memory systems and exploiting all available data parallelism. Although the GPU provides tremendous computational resources, this capability comes at the cost of limitations in the number of per-thread registers, the size of per-block shared memory, and the size of constant memory. With hundreds of processing units, it is impractical for GPUs to provide a thread-local stack. Local variables that would normally be placed on the stack are instead allocated from the thread’s registers, so recursive kernel functions are not supported.
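A minimal kernel sketch can make these memory spaces concrete; the names, sizes, and computation below are hypothetical rather than drawn from any particular application:

```cuda
// Constant memory: a small read-only table, broadcast efficiently
// when all threads in a warp read the same entry.
__constant__ float coeffs[16];

__global__ void scale_elements(const float *in, float *out, int n) {
    // Shared memory: one small tile per thread block, visible to
    // all threads in that block.
    __shared__ float tile[256];

    // Automatic variables such as idx and x live in per-thread
    // registers; with no per-thread stack, recursion is unavailable.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float x = (idx < n) ? in[idx] * coeffs[idx & 15] : 0.0f;

    tile[threadIdx.x] = x;
    __syncthreads();   // every thread in the block reaches this barrier

    if (idx < n)
        out[idx] = tile[threadIdx.x];
}
```

The shared-memory staging here is only illustrative; a real kernel would use the tile to let neighboring threads within a block exchange data.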
Analyzing applications for GPU acceleration potential. The first step in analyzing an application to determine its suitability for any acceleration technology is to profile the CPU time consumed by its constituent routines on representative test cases. With profiling results in hand, one can determine to what extent Amdahl’s Law limits the benefit obtainable by accelerating only a handful of functions in an application. Applications that concentrate their runtime in a few key algorithms or functions are usually the best candidates for acceleration.
As an example, if profiling shows that an application spends 10% of its runtime in its most time-consuming function, and the remaining runtime is scattered among several tens of unrelated functions of no more than 2% each, such an application would be a difficult target for an acceleration effort, since the best performance increase achievable with moderate effort would be a mere 10%. A much more attractive case would be an application that spends 90% of its execution time running a single algorithm implemented in one or two functions.
Once profiling analysis has identified the subroutines that are worth accelerating, one must evaluate whether they can be reimplemented with data-parallel algorithms. The scale of parallelism required for peak execution efficiency on the GPU is usually on the order of 100,000 independent computations. The GPU provides extremely fine-grained parallelism, with hardware support for multiplexing and scheduling massive numbers of threads onto the pool of processing units. This makes it possible for CUDA to extract parallelism at a level of granularity that is orders of magnitude finer than is usually practical with other parallel computing approaches.
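To make that scale concrete, here is a sketch of the conventional one-thread-per-element launch pattern; the array length, kernel body, and device pointers are hypothetical:

```cuda
__global__ void saxpy(int n, float a, const float *x, float *y) {
    // Each thread performs exactly one independent computation,
    // identified by its global index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Host-side launch: an n on the order of 100,000 produces hundreds
// of blocks, giving the hardware scheduler enough threads to hide
// memory latency.
//   int n = 100000;
//   int threadsPerBlock = 256;
//   int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // 391
//   saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, d_x, d_y);
```

With roughly 100,000 threads in flight, the hardware can keep its processing units busy even while many threads are stalled on memory accesses.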
GPU-accelerated clusters for HPC. Given the potential for significant acceleration provided by GPUs, there has been a growing interest in incorporating GPUs into large HPC clusters.3,6,8,19,21,24 As a result of this interest, Nvidia now makes high-density rack-mounted GPU accelerators specifically designed for use in such clusters. Because the GPUs are housed in an external case with its own independent power supply, they can be attached to blade or 1U rackmount servers that lack the power and cooling capacity for GPUs to be installed internally. In addition to increasing performance, GPU-accelerated clusters also have the potential to provide better power efficiency than traditional CPU clusters.
In a recent test on the AC GPU cluster (http://iacat.uiuc.edu/resources/cluster/) at the National Center for Supercomputing Applications (NCSA, http://www.ncsa.uiuc.edu/), a NAMD simulation of STMV (satellite tobacco mosaic virus) was used to measure the performance increase provided by GPUs, as well as the increase in performance per watt. In a small-scale test on a single node with four CPU cores and four GPUs (HP xw9400 workstation with a Tesla S1070 attached), the four Tesla GPUs provided a factor of 7.1 speedup over the four CPU cores by themselves. The GPUs provided a factor of 2.71 increase in performance per watt relative to computing only on the CPU cores. These gains in performance, space efficiency, and power and cooling efficiency have led to the construction of large GPU clusters at supercomputer centers such as NCSA and the Tokyo Institute of Technology. The NCSA Lincoln cluster (http://www.ncsa.illinois.tel64TeslaCluster/TechSummary/), containing 384 GPUs and 1,536 CPU cores, is shown in Figure 1.
Despite the relatively recent introduction of general-purpose GPU programming toolkits, a variety of biomolecular
modeling applications have begun to
take advantage of GPUs.
Molecular Dynamics. One of the most
compelling and successful applications
for GPU acceleration has been molecular dynamics simulation, which is
dominated by N-body atomic force calculation. One of the early successes with