chip. We don’t see much progress happening on that.
PH Right. Don’t fool yourself that this problem will be
solved.
TD Really? When Seymour Cray was building the fastest
computers in the world, it was precisely by addressing
that problem, by making memory buses that were enormously wide paths to memory.
PH Let me tell you why my intuition is that the problem
won’t go away. If you look at the cost of computing,
it’s about communication. That’s where all the power
goes. It’s hard and expensive to provide that bandwidth.
Assuming the most expensive part is usually well engineered, you try to do the best job you can with the parts
of the system that matter. People are working as hard as
they can at making communication costs lower. The low-hanging fruit is to take the problem away from being one
involving communication to one that doesn’t involve
your most expensive resource.
Our programming environments have to be more
aware of communication. Let’s say every time you said
“equal sign,” you thought 1,000 times more power was
being exerted than when you said “multiply.”
Bill Dally [chair of the Stanford University computer
science department] has this great number, just to put
this in context. If you build a 32-bit floating-point unit,
it takes a picojoule to do the floating-point operation.
If you execute a 32-bit floating-point instruction on a processor, it takes a nanojoule, 1,000 times more energy.
The actual computing part was free, but sending the
data to the floating-point unit, reading it back, putting it
in the cache, and trying to put it onto the bus uses 1,000
times more energy. You’re just fighting physics. Physics
tells you communication is expensive, and your programming model has to revolve around the communication if
it is going to be efficient. So, that problem is not going to
go away—there’s just no way to defeat physics.
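The arithmetic behind the picojoule and nanojoule figures quoted above can be worked through in a few lines. This is only a sketch: the two per-operation energies are the round numbers from the interview, treated as order-of-magnitude estimates, and the workload size `NUM_OPS` is a hypothetical value chosen for illustration.

```python
# Round figures quoted in the interview; order-of-magnitude estimates only.
FP_UNIT_PJ = 1          # picojoules for the floating-point arithmetic alone
INSTRUCTION_PJ = 1_000  # 1 nanojoule = 1,000 picojoules for the whole instruction
PJ_PER_JOULE = 10**12


def total_joules(num_ops, pj_per_op):
    """Energy in joules for num_ops operations at pj_per_op picojoules each."""
    return num_ops * pj_per_op / PJ_PER_JOULE


# How much more the full instruction costs than the arithmetic inside it.
ratio = INSTRUCTION_PJ // FP_UNIT_PJ

# Fraction of the instruction's energy that is not the arithmetic itself.
overhead_share = (INSTRUCTION_PJ - FP_UNIT_PJ) / INSTRUCTION_PJ

NUM_OPS = 10**9  # hypothetical workload: one billion operations
arithmetic_only = total_joules(NUM_OPS, FP_UNIT_PJ)
full_instructions = total_joules(NUM_OPS, INSTRUCTION_PJ)

print(ratio, overhead_share, arithmetic_only, full_instructions)
```

Under these assumptions, a billion operations cost about a millijoule of arithmetic and about a joule in total, so nearly all of the energy goes to everything surrounding the computation.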
KA The way to minimize communication is by coherence, by having like things happen in like space and
like time. Parallel processors, SIMD (single instruction,
multiple data), are just a way of establishing execution
coherence; putting in cache memory is a way to create
locality, but it’s a very general way.
Again, the CPU people gave us a really pleasant
abstraction. But in a C program, that equal sign might
be a nanojoule or it might be a millijoule, depending on
what actually happens. There’s no visibility into that to a
C programmer. It’s really hard to look at a C program and
detect that 1,000:1 difference in the cost of that equality,
an assignment operator.
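One way to see why the same assignment can have very different costs is to count cache misses for two traversal orders over the same array. The sketch below uses a toy direct-mapped cache; `count_misses`, the cache geometry, and the 64 x 64 array are all hypothetical choices for illustration and do not model any particular CPU.

```python
def count_misses(addresses, num_lines=64, line_size=16):
    """Count misses in a toy direct-mapped cache.

    addresses are element indices; each cache line holds line_size
    consecutive elements, and a block maps to slot (block % num_lines).
    """
    tags = [None] * num_lines
    misses = 0
    for addr in addresses:
        block = addr // line_size
        slot = block % num_lines
        if tags[slot] != block:
            tags[slot] = block  # fetch the block, evicting whatever was there
            misses += 1
    return misses


ROWS = COLS = 64  # a 64 x 64 array stored row by row

# Both loops perform the same statement, a[r * COLS + c] = value,
# exactly once per element; only the visiting order differs.
row_major = [r * COLS + c for r in range(ROWS) for c in range(COLS)]
col_major = [r * COLS + c for c in range(COLS) for r in range(ROWS)]

row_misses = count_misses(row_major)
col_misses = count_misses(col_major)
print(row_misses, col_misses)
```

In this toy model the row-by-row traversal misses once per cache line, while the column-by-column traversal misses on every access, even though the source-level assignment is identical in both loops.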
On the other hand, in a parallel-programming environment
(a fairly crude one today), it’s quite visible to
you because you’re handed something that’s data-parallel,
and you deal with the fact that, roughly speaking, the
same thing is happening to similar data all at the same
time. By being willing to deal with that, you’ve been
able to get this huge increase in coherence that allows
the performance to happen for a reasonable amount of
power or a reasonable amount of communication. So
the question is, what are some abstractions we can find
that aren’t onerous to program to but that allow those
things that matter to perform and to become more visible
to programmers so that they can make more reasonable
choices, or abstract them away so that choices are made
automatically?
But, again, the fact that a modern CPU has so much
of its die area dedicated to cache is expensive. It’s saving
power, but it costs a lot of die area and power to save
the power. You can always do better if you move more
responsibility to a higher level.
TD The idea, then, is moving the work to where the data
is instead of moving the data to where the work is.
KA It’s both. The important thing is having them be near
each other. The original graphics pipeline was this gorgeous example of that: do a bunch of work here, move
the data to something right next door and do a bunch
more work, and then move it to something right next
door and do a bunch more work.
If texture mapping hadn’t come along, your argument
that graphics systems would be worthless for general-purpose computing would be true. Texture mapping is this
awful sort of incoherent thing. It has some coherence, but
as you put it into a shader and allow people to generate
texture addresses, eventually you can completely destroy
the coherence.
TD It’s incoherence, but it’s a scatter/gather kind of
incoherence.
KA It’s a gather mostly, the way GPUs deal with it. The
point is, as they’ve dealt with that more and more, the
communications have gotten a lot richer. That ability to
gather is a huge distinction from the old pipeline that
really had no communication between the elements.
Dealing with that lower level of coherence has made
the machine much more general purpose. It turns out
that you can do that by caching a lot more cheaply than
you can with the general-purpose caching on a CPU,
so it’s not all the way to that extreme. But it’s a lot less
coherent than the non-texture-mapped pipelines that I
started with. They were almost perfectly coherent.
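The gather and scatter patterns mentioned in this exchange can be sketched as follows. A texture fetch behaves like a gather: each output element computes an address and reads from it. The function names and sample data here are hypothetical and chosen only for illustration.

```python
def gather(src, indices):
    """out[i] = src[indices[i]]: data-dependent reads, sequential writes."""
    return [src[j] for j in indices]


def scatter(src, indices, size):
    """out[indices[i]] = src[i]: sequential reads, data-dependent writes."""
    out = [None] * size
    for i, j in enumerate(indices):
        out[j] = src[i]
    return out


texels = ["a", "b", "c", "d"]

# Gather: several outputs may read the same source element (index 0 twice).
gathered = gather(texels, [2, 0, 3, 0])

# Scatter: each input is written to a computed location; indices here are
# a permutation, so no two writes collide.
scattered = scatter(texels, [2, 0, 3, 1], 4)

print(gathered, scattered)
```

With gather, every output location is written exactly once and in order, so only the reads are irregular. With scatter, two inputs could target the same output location, which is part of why it is harder to run in parallel.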
TD Pat, a couple of years ago, one of your former students
gave a talk here about the future of computing on GPUs.