traditional core requires faster access. As a result, what is needed is a memory hierarchy that reduces interference among the different cores yet deals efficiently with the different requirements of each.

Designing such a hierarchy is far from easy, especially considering that, besides performance issues, the memory system is a nontrivial source of power consumption. This challenge is the subject of intensive research in industry and academia. Moreover, we are coming close to the era of nonvolatile memory. How can it best be used? Note the heterogeneity in memory modules: SRAM for caches, DRAM for volatile main memory, and MRAM, STT-RAM, PCM, ReRAM, and many more technologies for nonvolatile memory.
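The performance cost of a deep hierarchy can be reasoned about with the classic average memory access time (AMAT) recurrence. The sketch below is a simplification; all latencies and miss rates in it are invented for illustration, not measurements of any real system:

```python
# AMAT for a multi-level hierarchy, computed from the outermost level inward:
# AMAT = hit_time_L1 + miss_rate_L1 * (hit_time_L2 + miss_rate_L2 * (... + DRAM))
# All cycle counts and miss rates below are illustrative assumptions.

def amat(levels, memory_latency):
    """levels: list of (hit_time_cycles, miss_rate) ordered from L1 outward."""
    total = memory_latency
    for hit_time, miss_rate in reversed(levels):
        total = hit_time + miss_rate * total
    return total

# Hypothetical three-level cache hierarchy in front of a 200-cycle DRAM.
hierarchy = [(4, 0.10), (12, 0.40), (40, 0.50)]
effective_latency = amat(hierarchy, 200)  # far below the raw DRAM latency
```

Even with the pessimistic miss rates assumed here, the hierarchy brings the effective latency well under the raw DRAM latency — which is exactly why contention for shared cache levels among dissimilar cores matters so much.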
Another challenge at the hardware level is the interconnect: How should we connect the different cores and memory-hierarchy modules? Thick wires dissipate less power but result in lower bandwidth because they take more on-chip space. There is a growing body of research in optical interconnect. The topology (ring, torus, mesh), material (copper, optical), and control (network-on-chip protocols) are hot topics of research at the chip level, at the board level, and across boards.
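One simple way to compare candidate topologies is average hop count between node pairs. The brute-force sketch below (a deliberate simplification that ignores link width, routing policy, and contention) contrasts a 16-node ring with a 4x4 mesh:

```python
# Average shortest-path hop count: bidirectional ring vs. 2D mesh.
# A crude topology comparison -- ignores bandwidth, routing, and contention.

def ring_avg_hops(n):
    """Average hops between distinct nodes on an n-node bidirectional ring."""
    dists = [min((b - a) % n, (a - b) % n)
             for a in range(n) for b in range(n) if a != b]
    return sum(dists) / len(dists)

def mesh_avg_hops(rows, cols):
    """Average Manhattan distance between distinct nodes on a 2D mesh."""
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    dists = [abs(r1 - r2) + abs(c1 - c2)
             for (r1, c1) in nodes for (r2, c2) in nodes
             if (r1, c1) != (r2, c2)]
    return sum(dists) / len(dists)

# For 16 nodes, the mesh needs noticeably fewer hops on average than the ring.
ring16, mesh4x4 = ring_avg_hops(16), mesh_avg_hops(4, 4)
```

Lower average hop count is only one axis of the trade-off: a ring is far cheaper to wire and control, which is why topology choice remains an open design question rather than a solved one.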
Yet another challenge is distributing the workload among the different cores to get the best performance with the lowest power consumption. The answer to this question must be found across the whole computing stack, from algorithms to process technology.
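As a toy illustration of this mapping problem, consider a greedy scheduler that places each thread on the core type where its characteristics yield the best performance per watt. Every number below — the core types' affinity weights, power draws, and the thread profiles — is an assumption invented for illustration:

```python
# Hypothetical greedy thread-to-core mapper. Scores model "performance per
# watt" as (affinity to control complexity + affinity to data parallelism)
# divided by power draw. All weights are made up for illustration.

CORE_TYPES = {
    # core type: ((control weight, data-parallel weight), power draw)
    "fat_superscalar": ((1.00, 0.30), 3.0),
    "small_core":      ((0.25, 0.25), 1.0),
    "gpu":             ((0.05, 1.00), 2.5),
}

def assign(threads):
    """threads: dict name -> (control_complexity, data_parallelism), both in [0, 1]."""
    def score(core, ctrl, dpar):
        (w_ctrl, w_dpar), power = CORE_TYPES[core]
        return (w_ctrl * ctrl + w_dpar * dpar) / power
    return {name: max(CORE_TYPES, key=lambda c: score(c, ctrl, dpar))
            for name, (ctrl, dpar) in threads.items()}

threads = {
    "pointer_chase": (0.9, 0.1),  # complicated control flow, little parallelism
    "tiny_worker":   (0.2, 0.2),  # one of many small, simple threads
    "matrix_mul":    (0.1, 0.9),  # highly data parallel
}
placement = assign(threads)
```

Under these assumed weights, the pointer-chasing thread lands on the fat superscalar core, the small thread on a small core, and the data-parallel kernel on the GPU — but a real scheduler must discover such characteristics at runtime, which is a large part of the difficulty.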
The move from a single board to multiboard systems and on to high-performance computers also means a move from shared memory to distributed memory. This makes the interconnect and workload-distribution challenges even harder.
At the software level, the situation is also very challenging. How are we going to program these beasts? Sequential programming is hard. Parallel programming is harder. Parallel programming of heterogeneous machines is extremely challenging if we care about performance and power efficiency. There are several considerations: how much hardware to reveal to the programmer, the measures of success,
tions, however, and are not as versatile as the ones mentioned earlier. Brain-inspired neuromorphic chips, such as IBM's TrueNorth chip, are starting an era of cognitive computing.
Cognitive computing, championed by IBM's Watson and TrueNorth, is now used in medical applications, after the impressive performance of the AI computer system Watson on "Jeopardy!," and other areas are being explored. It is a bit early, however, to compare it with the other, more general-purpose cores.
The rest of this article considers only traditional cores (with different capabilities), GPU, FPGA, and AP. The accompanying figure shows the big picture of a heterogeneous computing system, even though, because of the cost of programmability, finding a system with the level of heterogeneity shown in the figure is unlikely. A real system will have only a subset of these types.
What is the advantage of having this variety of computing nodes? The answer lies in performance and energy efficiency. Suppose you have a program with many small threads. The best choice in this case is a group of small cores. If you have very few complicated threads (for example, complicated control-flow graphs with pointer chasing), then sophisticated cores (for example, fat superscalar cores) are the way to go. If you assign the complicated threads to simple cores, the result is poor performance. If you assign the simple threads to the sophisticated cores, you consume more power than needed. GPUs have very good performance-power efficiency for applications with data parallelism. What is needed is a general-purpose machine that can execute different flavors of programs with high performance-power efficiency. The only way to do this is to have a heterogeneous machine.
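A back-of-the-envelope energy calculation makes the second mismatch concrete. The numbers below are invented for illustration: suppose a fat core finishes a simple thread twice as fast as a small core but draws five times the power.

```python
# Energy = power x time. All figures are illustrative assumptions:
# the fat core is 2x faster on this simple thread but draws 5x the power.

def energy_joules(power_watts, time_seconds):
    return power_watts * time_seconds

fat_time, fat_power = 1.0, 5.0      # seconds, watts (assumed)
small_time, small_power = 2.0, 1.0  # seconds, watts (assumed)

fat_energy = energy_joules(fat_power, fat_time)
small_energy = energy_joules(small_power, small_time)
```

Under these assumptions the fat core finishes sooner yet spends 2.5x the energy on the same simple thread — exactly the waste a heterogeneous machine avoids by routing such threads to small cores.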
Most machines now, from laptops to tablets to smartphones, have heterogeneous architectures (several cores and a GPU), and more heterogeneity is expected in the (very) near future. How should we deal with this paradigm shift from homogeneity to heterogeneity?
Several challenges exist at the hardware level. The first is the memory hierarchy. The memory system is one of the main performance bottlenecks in any computer system. While processors had been following Moore's Law until a few years ago, making good leaps in performance, memory systems have not. Thus, there is a large performance gap between processor speed and memory speed. This problem has existed since the single-core era. What makes it more challenging in this case is the shared memory hierarchy (several levels of cache memory followed by the main memory). Who shares each level of cache? Each of the computational cores discussed here targets a program (or thread or process) with characteristics different from those targeted by the other computational cores. For example, a GPU requires higher bandwidth, while a
[Figure: Generic heterogeneous system.]