maintain hardware cache coherence.
The programmability advantages of
the Bulk Multicore do not come at the
expense of performance. On the contrary, the Bulk Multicore enables high
performance because the processor
hardware is free to aggressively reorder and overlap the memory accesses
of a program within chunks without
risk of breaking their expected behavior in a multiprocessor environment.
Moreover, in an advanced Bulk Multicore design where the compiler observes the chunks, the compiler can
further improve performance by heavily optimizing the instructions within
each chunk. Finally, the Bulk Multicore organization decreases hardware
design complexity by freeing processor designers from having to worry
about many corner cases that appear
when designing multiprocessors.
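
The reordering claim above can be illustrated with a toy software model. This is a minimal sketch under stated assumptions, not the Bulk Multicore hardware: the names `run_chunk` and `observable_states` are hypothetical, memory is a plain dictionary, and the chunk's stores all target distinct addresses, so no same-address ordering is involved. The chunk buffers its stores and publishes them in a single commit step, so an outside observer sees the same states whatever internal order the stores were performed in.

```python
import itertools


def run_chunk(memory, stores, order):
    """Toy model: perform a chunk's stores in the given internal order,
    buffer them, then commit them all at once.

    Returns the list of memory states an outside observer could see.
    Assumption: every store in `stores` targets a distinct address.
    """
    visible_states = [dict(memory)]      # state before the chunk commits
    write_buffer = {}
    for i in order:
        addr, value = stores[i]
        write_buffer[addr] = value       # buffered, not yet visible
    memory.update(write_buffer)          # the whole chunk commits in one step
    visible_states.append(dict(memory))  # state after the chunk commits
    return visible_states


def observable_states(initial, stores):
    """Collect the distinct observer-visible histories over every
    possible internal ordering of the chunk's stores."""
    histories = set()
    for order in itertools.permutations(range(len(stores))):
        states = run_chunk(dict(initial), stores, order)
        histories.add(tuple(tuple(sorted(s.items())) for s in states))
    return histories


initial = {"x": 0, "y": 0, "flag": 0}
stores = [("x", 1), ("y", 2), ("flag", 1)]  # hypothetical chunk contents

# All six internal orderings yield one and the same visible history.
assert len(observable_states(initial, stores)) == 1
```

In this model an observer can never see `flag` set while `x` is still 0, because no intermediate state is ever published. That property is what leaves the model free to perform the stores in any order inside the chunk.
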
The Bulk Multicore architecture eliminates one of the traditional tenets of
processor architecture, namely the
need to commit instructions in order,
providing the architectural state of the
processor after every single instruction. Having to provide such state in
a multiprocessor environment—even
if no other processor or unit in the
machine needs it—contributes to the
complexity of current system designs.
This is because, in such an environment, memory-system accesses take
many cycles, and multiple loads and
stores from both the same and different processors overlap their execution.
In the Bulk Multicore, the default
execution mode of a processor is to
commit chunks of instructions at a
time [2]. A chunk is a group of dynamically contiguous instructions (such as
2,000 instructions). Such a “chunked”
mode of execution and commit is a
hardware-only mechanism, invisible
to the software running on the processor. Moreover, its purpose is not to
parallelize a thread, since the chunks
in a thread are not distributed to other
processors. Rather, the purpose is to