forces programmers to create threaded code from day
one, not as a revision of the code base.
Parallelizing code must be as much of
a priority as writing correct code, or
achieving a certain time to market.
A more realistic view of the future
is somewhere between these two
extremes. Parallelizing legacy code
is widely viewed as a dead-end, but
building compelling add-ons to existing applications that take advantage of multicore, and then “bolting
on” these features to legacy codes is
possible. One does not need to change
the entire code base of a word processor, for example, in order to bolt on
a speech recognition engine that exploits multicore. Furthermore, some
applications that drive sales of new
machines, such as interactive video
games, have ample data parallelism
that is relatively easy to extract with
stream-based programming.
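For instance, the kind of per-pixel or per-object data parallelism found in game workloads can be sketched in a few lines. The following C++ fragment is illustrative only (the frame buffer and the brighten_chunk helper are invented for this example): because each pixel is transformed independently, the frame splits cleanly across however many cores are available, which is the essence of the stream-based approach.

```cpp
// Minimal sketch of stream-style data parallelism: one pure function
// is applied independently to every element of a frame, so the work
// splits cleanly across cores. All names here are illustrative.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

void brighten_chunk(std::vector<uint8_t>& pixels,
                    std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        pixels[i] = static_cast<uint8_t>(std::min(255, pixels[i] + 16));
}

int main() {
    std::vector<uint8_t> frame(1920 * 1080, 100);  // one grayscale frame
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;                             // fallback if unknown
    std::vector<std::thread> workers;
    std::size_t chunk = frame.size() / n;
    for (unsigned t = 0; t < n; ++t) {
        std::size_t b = t * chunk;
        std::size_t e = (t + 1 == n) ? frame.size() : b + chunk;
        workers.emplace_back(brighten_chunk, std::ref(frame), b, e);
    }
    for (auto& w : workers) w.join();
}
```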
Finally, programmers will end up
writing parallel software without realizing that is what they are doing. For
example, programmers who utilize
SQL databases will see their application’s performance improve just by
virtue of some other developer’s effort
spent on parallelizing the database engine itself. Extending this idea further,
building parallel frameworks that fit
various application classes (business,
Web services, games, and so on) will
enable programmers to more easily
exploit multicore processors without
having to bite off the whole complexity
of parallel programming.
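As a hedged sketch of what such a framework buys the application programmer, the fragment below wraps all thread management inside a hypothetical parallel_for helper; the caller writes what looks like an ordinary loop body and never touches a thread, which is exactly the division of labor the framework approach promises.

```cpp
// Sketch of the framework idea: the framework author writes the
// threading logic once; application code only supplies a loop body.
// parallel_for is a hypothetical stand-in for what a real framework
// (a database engine, a game engine) would provide.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

void parallel_for(std::size_t n, const std::function<void(std::size_t)>& body) {
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([=, &body] {
            for (std::size_t i = w; i < n; i += workers)  // strided split
                body(i);
        });
    }
    for (auto& t : pool) t.join();
}

int main() {
    std::vector<double> prices(100000, 9.99);
    // Application code looks sequential; the parallelism lives below it.
    parallel_for(prices.size(), [&](std::size_t i) { prices[i] *= 1.05; });
}
```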
Part II: The Architecture Research Community
Given this technology environment,
what do computer architects currently
research? To answer this question, it is
best to look back over the last decade
and understand what we thought were
important research problems, and
what happened to them.
The memory wall. A workshop, held
in conjunction with the 1997 International Symposium on Computer Architecture (ISCA), focused on the memory
wall and the research occurring on proposed solutions to it. The memory wall
is the problem that accesses to main
memory are significantly slower than
computation. There are two aspects to
it, a high latency to memory (hundreds
of times the latency of a basic ALU operation inside a CPU) and a constrained
bandwidth. Excitement at the time was
over solutions that proposed placing
computational logic in the DRAM.11,19,20,29,32 Such solutions never achieved
broad acceptance in the marketplace
because they required programmers to
alter their software and they required
DRAM manufacturers to restructure
their business models. DRAM is a commodity, and businesses compete on
cost. Adding logic to DRAM makes the
devices expensive and system specific.
While technically feasible, it is a
different business that DRAM manufacturers chose not to enter. However,
less radical solutions, such as prefetching, stream buffers,18 and ever larger
on-chip caches,22 did take hold commercially. Moreover, programmers became more amenable to tuning their
applications to the memory hierarchy
that architects provide them. Cache-conscious data structures and algorithms
are an effective, yet burdensome, way
to achieve performance.
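What “cache-conscious” means in practice can be shown with a small, hypothetical sketch: the same particle data laid out two ways. The names are invented for the example; the point is that the structure-of-arrays form makes every byte fetched into the cache useful to the hot loop.

```cpp
// Two layouts for particle positions and colors; only positions are
// updated in the hot loop. The struct names are illustrative.
#include <cstddef>
#include <vector>

// Array-of-structs: each particle's unused color data is dragged
// through the cache alongside the position being updated.
struct ParticleAoS { float x, y, z; float r, g, b, a; };

// Struct-of-arrays: the update loop streams through a dense array
// of positions, so every cache line fetched is fully used.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> r, g, b, a;
};

void step_aos(std::vector<ParticleAoS>& p, float dt) {
    for (auto& q : p) q.x += dt;      // ~4 of 28 bytes per particle useful
}

void step_soa(ParticlesSoA& p, float dt) {
    for (std::size_t i = 0; i < p.x.size(); ++i)
        p.x[i] += dt;                 // every fetched byte is useful
}
```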
The memory wall is still with us.
Accessing DRAM continues to require hundreds more cycles than performing a basic ALU operation. While the
drop in the growth of processor clock
speed means that memory latency is
less of a growing concern, the switch to
multicore actually presents new challenges with bandwidth and consistency. Having all these CPU cores on a single die means they will need a Moore’s
Law growth in bandwidth to memory in
order to operate efficiently. At the moment, we are not pin-limited in providing this bandwidth, but we quickly will
be; so we can expect a host of future
research that looks at the memory wall
again, but this time from a bandwidth rather than a latency perspective.
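A triad-style kernel, in the spirit of the STREAM benchmark, suggests why bandwidth becomes the binding constraint (the code is a sketch, not taken from any cited work): it performs only two floating-point operations for every 24 bytes moved, so adding cores speeds it up only until the path to memory saturates.

```cpp
// A triad-style kernel: two floating-point operations per 24 bytes of
// memory traffic, so its speed is set almost entirely by memory
// bandwidth, not by the ALUs. Adding cores helps only until the
// memory pins saturate.
#include <cstddef>
#include <vector>

void triad(std::vector<double>& a, const std::vector<double>& b,
           const std::vector<double>& c, double s) {
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = b[i] + s * c[i];
}
```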
Along with memory performance
is the evolution of the memory model.
In the past, it was thought that providing
a sequentially consistent system with
reasonable performance was not possible. Hence, we devised a range of
relaxed consistency approaches.30
It is natural for programmers to assume multicore systems are sequentially consistent, however, and recent
work8 suggests that architectures can
use speculation16 to provide it. Looking forward, as much as can be done must be done to make programming parallel systems as easy as possible.
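The gap between sequential consistency and the relaxed models can be made concrete with the classic store-buffering litmus test, here written with C++ atomics as a stand-in for the hardware memory model (the variable names are the conventional ones from the literature, not from the cited papers):

```cpp
// A litmus test for memory consistency. Under sequential consistency
// the outcome r1 == 0 && r2 == 0 is impossible; with relaxed atomics
// (modeling a relaxed hardware memory model) it is allowed, which is
// exactly the kind of surprise programmers should be spared.
// Sketch only; run it many times to have a chance of observing the
// reordering.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    std::thread t1([] {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join(); t2.join();
    // Replacing memory_order_relaxed with memory_order_seq_cst
    // restores the sequentially consistent guarantee.
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```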
This author believes this will push