tivity, but we did realize a bunch of incremental gains from object-oriented
programming, improved integrated
development environments, and the
emergence of better symbolic debugging and checker tools that looked for
memory leaks. All of that has helped us
incrementally improve our productivity
and increase our ability to manage complexity.
I think we’re seeing much the same
thing happen with parallelism. That
is, whereas the earliest Photoshop synchronization code was written in terms
of “enter critical section, leave critical
section,” we now have tools such as
Boost threads and OpenGL, which essentially are programming languages,
to help us deal with those problems. If
you look at Pixel Bender [the Adobe library for expressing the parallel computations that can be run on GPUs], you’ll
find it’s at a much higher level and so
requires much less synchronization
work of the person who’s coding the algorithm.
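To make that shift in level concrete, here is a minimal C++ sketch (not Photoshop's actual code; the mutex and the tile-update functions are invented for illustration). The explicit enter/leave pairs of the early synchronization style give way to a scoped guard, as provided by Boost threads or the standard library, so the unlock can no longer be forgotten on an early return:

    #include <mutex>

    std::mutex g_tile_mutex;  // hypothetical lock protecting shared tile state

    // Early style: explicit enter/leave calls that must be balanced by hand.
    void update_tile_old() {
        g_tile_mutex.lock();    // "enter critical section"
        // ... modify shared state ...
        g_tile_mutex.unlock();  // "leave critical section" -- easy to miss on early returns
    }

    // Higher-level style: a scoped guard releases the lock automatically.
    void update_tile_new() {
        std::lock_guard<std::mutex> guard(g_tile_mutex);
        // ... modify shared state ...
    }  // lock released here, even if an exception propagates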
COLE: The key is that you try to go to a
higher level each time so you have less
and less of the detail to deal with. If we
can automate more of what happens
below that, we’ll manage to become
more efficient. You also mentioned that
we have better tools now than we did before. Does that suggest we’ll need even
better tools to take our next step? If so,
what are we missing?
WILLIAMS: Debugging multithreaded
programs at all remains really hard.
Debugging GPU-based programming,
whether in OpenGL or OpenCL, is still
in the Stone Age. In some cases you run
it and your system blue-screens, and
then you try to figure out what just happened.
COLE: That much we’re aware of.
We’ve tried to build stronger libraries so
that programmers don’t have to worry
about a lot of the low-level things anymore. We’re also creating better libraries of primitives, such as open source
TBB (Threading Building Blocks). Do
you see those as the kinds of things developers are looking to suppliers and
the research community for?
WILLIAMS: Those things are absolutely
a huge help. We’re taking a long hard
look at TBB right now. Cross-platform
tools are also essential. When somebody comes out with something that’s
Windows only, that’s a nonstarter for
us—unless there is an exact-equivalent
technology on the Mac side as well. The
creation of cross-platform tools such as
Boost or TBB is hugely important to us.
The more we can hide under more
library layers, the better off we are.
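As a rough sketch of what hiding work under such a library layer looks like in practice, here is the kind of loop TBB’s parallel_for can take over (the image-brightening example is invented for illustration, not drawn from Photoshop):

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <vector>

    // Brighten an image buffer in parallel; TBB decides how to split the
    // range into chunks and schedule them across the available cores.
    void brighten(std::vector<unsigned char>& pixels, int delta) {
        tbb::parallel_for(
            tbb::blocked_range<size_t>(0, pixels.size()),
            [&](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i) {
                    int v = pixels[i] + delta;
                    pixels[i] = static_cast<unsigned char>(v > 255 ? 255 : v);
                }
            });
    }

Because the same code builds on both Windows and Mac OS, a library like this also meets the cross-platform requirement Williams describes.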
The one thing that ends up limiting
the benefit of those libraries, though,
is Amdahl’s Law. For example, say that
as part of some operation we need to
transform the image into the frequency
domain. There’s a parallel implementation of FFT (Fast Fourier Transform) we
can just call, and maybe we even have a
library on top of that to decide whether
or not it makes any sense to ship that
all down to the GPU to do a GPU implementation of FFT before sending
it back. But that’s just one step in our
algorithm. Maybe there’s a parallel library for the next step, but getting from
the FFT step to the step where we call
the parallel library is going to require
some messing around. It’s with all that
inter-step setup that Amdahl’s Law just
kills you. Even if you’re spending only
10% of your time doing that stuff, that
can be enough to keep you from scaling
beyond 10 processors.
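The arithmetic behind that 10% figure follows directly from Amdahl’s Law: if a fraction s of the work is serial, speedup on N processors is bounded by 1 / (s + (1 − s)/N), which can never exceed 1/s. A small self-contained sketch of the calculation:

    #include <cstdio>

    // Amdahl's Law: serial fraction s, N processors.
    double amdahl_speedup(double s, int n) {
        return 1.0 / (s + (1.0 - s) / n);
    }

    int main() {
        // With 10% of the time spent on serial inter-step setup, speedup
        // flattens out just below 10x no matter how many processors you add.
        for (int n : {2, 4, 8, 16, 64, 1024}) {
            std::printf("%5d processors -> %.2fx speedup\n", n, amdahl_speedup(0.10, n));
        }
        return 0;
    }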
Still, the library approach is fabulous, and every parallel library implementation of common algorithms we
can lay our hands on is greatly appreciated. Like many of the techniques we
have available to us today, however, it
starts to run out of steam at about eight
to 16 processors. That doesn’t mean
it isn’t worth doing. We’re definitely
headed down the library path ourselves
because it’s the only thing we can even
imagine working if we’re to scale to
eight to 16 processors.
For the engineers on the Photoshop
development team, the scaling limitations imposed by Amdahl’s Law have
become all too familiar over the past
few years. Although the application’s
current parallelism scheme has scaled
well on two- and four-processor systems, experiments with systems featuring eight or more processors indicate
performance improvements that are far
less encouraging. That’s partly because
as the number of cores increases, the
image chunks being processed, called
tiles, end up getting sliced into a greater
number of smaller pieces, resulting in
increased synchronization overhead.
Another issue is that in between each