The XMT Processor
The XMT processor (see the figure here) includes a master thread control unit (MTCU); processing clusters, each comprising several thread-control units (TCUs); a high-bandwidth, low-latency interconnection network3 and its extension to a globally asynchronous, locally synchronous (GALS-style) design incorporating asynchronous logic;15,18 memory modules (MMs), each comprising on-chip cache and off-chip memory; prefix-sum (PS) unit(s); and global registers. The shared-memory-modules block (bottom left of the figure) suppresses the sharing of a memory controller by several MMs. The processor alternates between serial mode (in which only the MTCU is active) and parallel mode. The MTCU has a standard private data cache (used in serial mode) and a standard instruction cache. The TCUs, which lack a write data cache, share the MMs with the MTCU.
The overall XMT design is guided by a general design ideal I call no-busy-wait finite-state-machines, or NBW FSM, meaning the FSMs comprising the parallel machine, including processors, memories, functional units, and interconnection networks, never cause one another to busy-wait. It is an ideal because no parallel machine can operate that way; nontrivial parallel processing demands the exchange of results among FSMs. The NBW FSM ideal represents my aspiration to minimize busy-waits among the various FSMs comprising a machine.
Here, I cite the example of how the MTCU orchestrates the TCUs to demonstrate the NBW FSM ideal. The MTCU is an advanced serial microprocessor that also executes XMT instructions (such as spawn and join). Typical program execution flow, as in Figure 3, can also be extended through nesting of sspawn commands. The MTCU uses the following XMT extension to the standard von Neumann apparatus of the program counter and stored program: upon encountering a spawn command, the MTCU broadcasts the instructions in the parallel section, starting with that spawn command and ending with a join command, on a bus connecting to all TCU clusters.
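To make the broadcast concrete, here is a minimal XMTC-style parallel section (XMTC is C extended with spawn and the thread-ID symbol $); the array names A and B and the bound n are illustrative and not taken from this sidebar. The instructions between the spawn and the join are what the MTCU broadcasts to all TCU clusters.

    /* Minimal XMTC-style parallel section (a sketch; A, B, and n are
     * illustrative).  On reaching the spawn, the MTCU broadcasts the
     * enclosed instructions to all TCU clusters; $ is the ID of the
     * virtual thread a TCU is currently executing. */
    spawn(0, n - 1) {
        B[$] = 2 * A[$];   /* each of the n virtual threads handles one element */
    }
    /* The end of the spawn block corresponds to the join: once threads
     * 0..n-1 have all completed, the MTCU resumes in serial mode. */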
The largest ID number of a thread the current spawn command must execute, Y, is also broadcast to all TCUs. The ID (index) of the largest executing thread is stored in a global register X. In parallel mode, a TCU executes one thread at a time. Upon executing a thread to completion (reaching a join command), the TCU does a prefix-sum using the PS unit to increment global register X. In response, the TCU gets the ID of the thread it could execute next; if that ID is ≤ Y, the TCU executes the thread with this ID. Otherwise, the TCU reports to the MTCU that it has finished executing. When all TCUs report they have finished, the MTCU continues in serial mode.
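The per-TCU allocation loop can be modeled in plain C, as in the sketch below. All names (X, Y, first_id, run_thread, tcu_loop) are illustrative, and an atomic fetch-and-add stands in for the hardware PS unit; the sketch models a single TCU in isolation rather than the full machine.

    #include <stdatomic.h>
    #include <stdio.h>

    static void run_thread(int id)            /* stub for one virtual thread */
    {
        printf("thread %d done\n", id);
    }

    /* X holds the ID of the largest thread handed out so far; Y is the
     * largest ID the current spawn must execute. */
    static void tcu_loop(atomic_int *X, int Y, int first_id)
    {
        int id = first_id;                    /* thread assigned at spawn time */
        while (id <= Y) {
            run_thread(id);                   /* run the thread to its join */
            /* prefix-sum on X: the old value is returned; old + 1 is this
             * TCU's next thread ID */
            id = atomic_fetch_add(X, 1) + 1;
        }
        /* id > Y: this TCU reports to the MTCU that it has finished */
    }

    int main(void)
    {
        atomic_int X;
        atomic_init(&X, 0);                   /* one TCU starts on thread 0 */
        tcu_loop(&X, 9, 0);                   /* runs threads 0..9, then stops */
        return 0;
    }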
The broadcast operation is essential to XMT's ability to start all TCUs at once in the same time it takes to start one TCU. The PS unit allows allocation of new threads to the TCUs that just became available within the same time it takes to allocate one thread to one TCU. This dynamic allocation provides runtime load-balancing of threads coming from an XMTC program.
We are now ready to connect with the NBW FSM ideal. Consider an XMT program derived from the workflow. From the moment the MTCU starts executing a spawn command until each TCU terminates the threads allocated to it, no TCU can cause any other TCU to busy-wait for it. An unavoidable busy-wait ultimately occurs when a TCU terminates and begins waiting for the next spawn command.
TCUs, with their own local registers, are simple in-order pipelines, including fetch, decode, execute/memory-access, and write-back stages. The FPGA computer has 64 TCUs in four clusters of 16 TCUs each. XMT designers and evangelists aspire to develop a machine with 1,024 TCUs in 64 clusters. A cluster includes functional units shared by several TCUs and one load/store port to the interconnection network shared by all its TCUs.
The global memory address space is evenly partitioned into the MMs through a form of hashing. The XMT design eliminates the cache-coherence problem, a challenge in terms of bandwidth and scalability. In principle, there are no local caches at the TCUs. Within each MM, the order of operations to the same memory location is preserved.
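As a rough illustration of this even partitioning, the C sketch below maps a global address to a memory module by hashing its cache-line index. The line size, MM count, and multiplicative (Fibonacci) hash are assumptions chosen for the sketch; the sidebar does not specify the actual XMT hash function.

    #include <stdint.h>
    #include <stdio.h>

    enum { LINE_BITS = 5, NUM_MMS = 64 };            /* 32-byte lines, 64 MMs (assumed) */

    unsigned mm_of_address(uint64_t addr)
    {
        uint64_t line = addr >> LINE_BITS;           /* drop the within-line offset */
        line *= 0x9E3779B97F4A7C15ull;               /* scramble the line index */
        return (unsigned)(line >> 32) % NUM_MMS;     /* spread lines over the MMs */
    }

    int main(void)
    {
        /* consecutive lines land on different MMs, spreading TCU traffic */
        for (uint64_t a = 0; a < 4 * 32; a += 32)
            printf("address %3llu -> MM %u\n",
                   (unsigned long long)a, mm_of_address(a));
        return 0;
    }

Because consecutive lines scatter across different MMs, concurrent accesses from many TCUs tend to spread over all memory modules rather than contend for one.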
For performance enhancements (such as data prefetch) incorporated into the XMT hardware, along with more on the architecture, see Wen and Vishkin28; for more on compiler and runtime scheduling methods for nested parallelism, see Tzannes et al.23; and for prefetching methods, see Caragea et al.6
Figure: Block diagram of the XMT architecture, showing the TCU clusters (each with TCUs, read buffers, a TCU i-cache, register file, PS network, shared functional units, and a load/store unit with hashing function), the PS unit and global register, the instruction broadcast bus, the cluster-memory interconnection network, the shared memory modules (each with L1 and L2 cache), and the master TCU with its functional units, register file, and private L1 d-cache and i-cache.