Figure 1: In recent years, improvements
in CPU clock speed have leveled off
while the number of cores per processor has increased. This trend is generating increased interest in parallel
programming.
Intel x86 Processors
10 GHz 1 GHz
8 cores
6
100 MHz
4
10 MHz
2
1 MHz
1970 1980 1990 2000
1 core
2010
focused on clusters of multi-socket,
multi-node systems. With such systems, the overhead of communication
and synchronization often dominates
the improved performance obtained
through parallelism, leading to poor
scalability. Now that parallel processors are integrated into monolithic
silicon devices that share the same
off-chip memory, the cost of communication and synchronization has been
reduced by orders of magnitude. This
broadens the scope of applications
that can benefit from parallelism.
Second, the economics behind parallel processing have changed. Parallel processing used to require a large
investment. If you wanted a machine
with twice the parallelism, you needed
to pay twice as much (or sometimes
more) for the extra processors. Consequently, in order for parallelism to be
seen as successful, applications needed
to show linear or near-linear scalability in order to recoup the investment
made in the hardware. Nowadays, virtually every processor on the market is
parallel, including embedded processors for smartphones and other mobile
devices. Parallelism is coming along
essentially for free. It is ubiquitous and
does not require a large investment or
perfect scalability to be profitable.
Finally, this time around, we have
no alternative. Earlier attempts to capitalize on parallel processing were less
successful because sequential processors were improving so rapidly. If an
application wasn’t performing quickly
enough, one could just wait for a year
until a faster generation of sequential
processors came out. Now that sequential processor performance increases
have slowed, we don’t have the option
of waiting for sequential processors to
catch up with our performance needs.
Instead, it’s time to examine our applications for parallelism and find efficient parallel implementations.
The good news is that parallelism
can actually be fairly easy to find and
extract if one uses the correct tools and
mental models. The push to parallel
software, while a change, is far from an
impossible task. Parallelism is abundant, often derived from the properties of the computations we’re trying
to perform, which model, analyze, and
synthesize large data sets consisting
of many objects. In this article, we’ll
show how parallelism is abundant and
accessible in many applications, examine ongoing work into frameworks for
making parallel programming more
productive, and discuss one promising approach for building these frameworks: Selective, Embedded Just-in-Time Specialization.
PARALLELISM EVERYWHERE
Before we discuss methods and tools
for exploiting parallelism, we need to
convince ourselves that parallelism
can be found in many of today’s computationally intensive applications, in
quantities and types that can be practically useful. Anyone who has tried to
incrementally parallelize a large code
base knows how difficult it can be to
untangle a sequential program into a
set of independent tasks.
Having your program multithreaded is not enough, since contention over
shared data structures can cause your
program to execute sequentially, as
one thread after the next acquires and
releases locks and semaphores. Instead, we need to find parallelizations
and algorithms that enable efficient
parallel execution.
The PALLAS group, part of the Paral-
lel Computing Lab at the University of
California-Berkeley, has been investi-
gating several applications to discover
algorithmic approaches that show that
the widespread parallelism we see in
many applications can result in scal-
able, high-performance parallel applica-
tions. Rather than restricting ourselves
to throwing more threads at a computa-
tion in an attempt to parallelize it, we
explore both algorithmic changes and
implementation choices to produce ef-
ficient, scalable applications.
PARALLEL APPLICATIONS
Support vector machines are a classification technique widely used in artificial intelligence, machine learning,
and computer vision. Using a 128-way
parallel GPU, we improved training
performance by an average of 20x over
a sequential processor, and classification performance by an average of 22x
[ 4]. Parallelism in SVMs comes from
the large amount of data points being
used to train a classifier, or the large
amount of data points which need to
be classified.
On a high-end image contour detector, which is the foundation for some
highly accurate image segmentation
and classification routines, we brought
contour detection time from 4 minutes to 1. 8 seconds by using a 240-way
parallel GPU, an improvement of 120x
over a sequential processor [ 3]. This
improvement came from changing the
algorithms to be more parallel and efficient, deriving parallelism from the
individual pixels in the image, and
from a large parallel task graph inherent in the algorithm. Coupling the
image contour detector with other rou-