interview

chip. We don’t see much progress happening on that.

PH Right. Don’t fool yourself that this problem will be solved.

TD Really? When Seymour Cray was building the fastest computers in the world, it was precisely by addressing that problem, by making memory buses that were enormously wide paths to memory.

PH Let me tell you why my intuition is that the problem won’t go away. If you look at the cost of computing, it’s about communication. That’s where all the power goes. It’s hard and expensive to provide that bandwidth. Assuming the most expensive part is usually well engineered, you try to do the best job you can with the parts of the system that matter. People are working as hard as they can at making communication costs lower. The low-hanging fruit is to take the problem away from being one involving communication to one that doesn’t involve your most expensive resource.

Our programming environments have to be more aware of communication. Let’s say every time you said “equal sign,” you thought 1,000 times more power was being exerted than when you said “multiply.”

Bill Dally [chair of the Stanford University computer science department] has this great number, just to put this in context. If you build a 32-bit floating-point unit, it takes a picojoule to do the floating-point operation. If you execute a 32-bit floating-point instruction on a processor, it takes a nanojoule, 1,000 times more power.
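To make the arithmetic behind those two figures concrete, here is a minimal sketch in Python. It uses only the quoted numbers (about a picojoule for the bare floating-point operation, about a nanojoule for the full instruction) and works in whole picojoules so the arithmetic stays exact. The variable names are illustrative and do not come from the interview.

```python
# Back-of-the-envelope energy budget using the two figures quoted above.
# Values are in whole picojoules so the arithmetic stays exact.
FP_OP_PJ = 1            # bare 32-bit floating-point operation: about 1 pJ
INSTRUCTION_PJ = 1000   # same operation issued as a CPU instruction: about 1 nJ

ratio = INSTRUCTION_PJ // FP_OP_PJ         # how many times more energy
overhead_pj = INSTRUCTION_PJ - FP_OP_PJ    # energy not spent on the arithmetic
overhead_fraction = overhead_pj / INSTRUCTION_PJ

print(f"instruction / bare op: {ratio}x")
print(f"overhead: {overhead_pj} pJ ({overhead_fraction:.1%} of the instruction)")
```

Under these figures, roughly 99.9% of the energy of one instruction goes to everything other than the arithmetic itself.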

The actual computing part was free, but sending the data to the floating-point unit, reading it back, putting it in the cache, and trying to put it onto the bus uses 1,000 times more power. You’re just fighting physics. Physics tells you communication is expensive, and your programming model has to revolve around the communication if it is going to be efficient. So, that problem is not going to go away; there’s just no way to defeat physics.

KA The way to minimize communication is by coherence, by having like things happen in like space and like time. Parallel processors, SIMD (single instruction, multiple data), are just a way of establishing execution coherence; putting in cache memory is a way to create locality, but it’s a very general way.

Again, the CPU people gave us a really pleasant abstraction. But in a C program, that equal sign might be a nanojoule or it might be a millijoule, depending on what actually happens. There’s no visibility into that to a C programmer. It’s really hard to look at a C program and detect that 1,000:1 difference in the cost of that equality, an assignment operator.
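One way to picture that hidden difference is to count how many distinct cache lines a run of assignments touches. The sketch below is a crude proxy written in Python, not a measurement. The 16-element line size and the function name `cache_lines_touched` are assumptions made for illustration, and real costs also depend on which level of the memory hierarchy serves each access.

```python
def cache_lines_touched(indices, elems_per_line=16):
    """Count the distinct cache lines covered by a sequence of element indices.

    elems_per_line is an assumed line size (16 four-byte values = 64 bytes).
    """
    return len({i // elems_per_line for i in indices})


n = 1024
sequential = range(n)             # models dst[i] = src[i]
strided = range(0, n * 16, 16)    # models dst[i] = src[16 * i]

# Both loops would execute the same n assignments in the source code,
# yet they pull in very different amounts of memory.
print(cache_lines_touched(sequential))  # 64 lines
print(cache_lines_touched(strided))     # 1024 lines
```

Each loop performs the same 1,024 assignments, but the strided one touches 16 times as many lines. The source text of the assignment does not reveal that difference.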

On the other hand, in a parallel-programming environment (a fairly crude one today), it’s quite visible to you because you’re handed something that’s data-parallel, and you deal with the fact that, roughly speaking, the same thing is happening to similar data all at the same time. By being willing to deal with that, you’ve been able to get this huge increase in coherence that allows the performance to happen for a reasonable amount of power or a reasonable amount of communication. So the question is, what are some abstractions we can find that aren’t onerous to program to but that allow those things that matter to perform and to become more visible to programmers so that they can make more reasonable choices, or abstract them away so that choices are made automatically?

But, again, the fact that a modern CPU has so much of its die area dedicated to cache is expensive. It’s saving power, but it costs a lot of die area and power to save the power. You can always do better if you move more responsibility to a higher level.

TD The idea, then, is moving the work to where the data is instead of moving the data to where the work is.

KA It’s both. The important thing is having them be near each other. The original graphics pipeline was this gorgeous example of that: do a bunch of work here, move the data to something right next door and do a bunch more work, and then move it to something right next door and do a bunch more work.

If texture mapping hadn’t come along, your argument that graphics systems would be worthless for general-purpose computing would be true. Texture mapping is this awful sort of incoherent thing. It has some coherence, but as you put it into a shader and allow people to generate texture addresses, eventually you can completely destroy the coherence.

TD It’s incoherence, but it’s a scatter/gather kind of incoherence.

KA It’s a gather mostly, the way GPUs deal with it. The point is, as they’ve dealt with that more and more, the communications have gotten a lot richer. That ability to gather is a huge distinction from the old pipeline that really had no communication between the elements.

Dealing with that lower level of coherence has made the machine much more general purpose. It turns out that you can do that by caching a lot more cheaply than you can with the general-purpose caching on a CPU, so it’s not all the way to that extreme. But it’s a lot less coherent than the non-texture mapped pipelines that I started with. They were almost perfectly coherent.

TD Pat, a couple of years ago, one of your former students gave a talk here about the future of computing on GPUs.
