improve programmability and performance.
Each chunk executes on the processor atomically and in isolation.
Atomic execution means that none of
the chunk’s actions are made visible
to the rest of the system (processors or
main memory) until the chunk completes and commits. Execution in isolation means that if the chunk reads a
location and (before it commits) a second chunk in another processor that
has written to the location commits,
then the local chunk is squashed and
must re-execute.
To execute chunks atomically and
in isolation inexpensively, the Bulk
Multicore introduces hardware address signatures.³ A signature is a
register of ≈ 1,024 bits that accumulates hash-encoded addresses. Figure
1 outlines a simple way to generate a
signature (see the sidebar “Signatures
and Signature Operations in Hardware” for a deeper discussion). A signature, therefore, represents a set of addresses.

Signatures and Signature Operations in Hardware

Figure 1 in the main text shows a simple implementation of a signature. The bits of an incoming address go through a fixed permutation to reduce collisions and are then separated in bit-fields Ci. Each field is decoded and accumulated into a bit-field Vj in the signature. Much more sophisticated implementations are also possible.

A module called the Bulk Disambiguation Module contains several signature registers and simple functional units that operate efficiently on signatures. These functional units are invisible to the instruction-set architecture. Note that, given a signature, we can recover only a superset of the addresses originally encoded into the signature. Consequently, the operations on signatures produce conservative results.

The figure here outlines five signature functional units: intersection, union, test for null signature, test for address membership, and decoding (δ). Intersection finds the addresses common to two signatures by performing a bit-wise AND of the two signatures. The resulting signature is empty if, as shown in the figure, any of its bit-fields contains all zeros. Union finds all addresses present in at least one signature through a bit-wise OR of the two signatures. Testing whether an address a is present (conservatively) in a signature involves encoding a into a signature, intersecting the latter with the original signature and then testing the result for a null signature.

Decoding (δ) a signature determines which cache sets can contain addresses belonging to the signature. The set bitmask produced by this operation is then passed to a finite-state machine that successively reads individual lines from the sets in the bitmask and checks for membership to the signature. This process is used to identify and invalidate all the addresses in a signature that are present in the cache.

Overall, the support described here enables low-overhead operations on sets of addresses.³

[Figure: Operations on signatures. Intersection (S1 ∩ S2), union (S1 ∪ S2), test for null signature (S = ∅), test for address membership (a ∈ S), and decoding δ(S) into a cache set bitmask.]
In the Bulk Multicore, the hardware automatically accumulates the
addresses read and written by a chunk
into a read (R) and a write (W) signature, respectively. These signatures
are kept in a module in the cache hierarchy. This module also includes
simple functional units that operate
on signatures, performing such operations as signature intersection (to
find the addresses common to two
signatures) and address membership
test (to find out whether an address
belongs to a signature), as detailed in
the sidebar.
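The signature operations described in the sidebar can be sketched in software as follows. This is only an illustrative model, not the hardware design: the 32-bit address width, the split into four 8-bit fields Ci (giving 4 × 256 = 1,024 signature bits), the particular bit permutation `PERM`, and all class and method names are assumptions made for the example.

```python
from dataclasses import dataclass

FIELD_BITS = 8                         # assumed width of each address bit-field C_i
NUM_FIELDS = 4                         # assumed number of fields: 4 * 2**8 = 1,024 bits
ADDR_BITS = FIELD_BITS * NUM_FIELDS    # assumed 32-bit addresses

# Hypothetical fixed permutation: permuted bit i is taken from address bit PERM[i].
# Multiplying by 7 modulo 32 is a bijection because 7 and 32 are coprime.
PERM = [(i * 7) % ADDR_BITS for i in range(ADDR_BITS)]


def permute(addr: int) -> int:
    """Apply the fixed bit permutation to an address."""
    out = 0
    for i, src in enumerate(PERM):
        out |= ((addr >> src) & 1) << i
    return out


@dataclass(frozen=True)
class Signature:
    # Each V_j is modeled as a Python int used as a 2**FIELD_BITS-bit vector.
    fields: tuple = (0,) * NUM_FIELDS

    @staticmethod
    def encode(addr: int) -> "Signature":
        """Permute the address, split it into fields C_i, decode each into V_j."""
        p = permute(addr)
        mask = (1 << FIELD_BITS) - 1
        return Signature(tuple(1 << ((p >> (j * FIELD_BITS)) & mask)
                               for j in range(NUM_FIELDS)))

    def union(self, other: "Signature") -> "Signature":
        """Bit-wise OR: addresses present in at least one signature."""
        return Signature(tuple(a | b for a, b in zip(self.fields, other.fields)))

    def intersect(self, other: "Signature") -> "Signature":
        """Bit-wise AND: addresses (conservatively) common to both signatures."""
        return Signature(tuple(a & b for a, b in zip(self.fields, other.fields)))

    def is_empty(self) -> bool:
        """A signature is null if any of its bit-fields is all zeros."""
        return any(v == 0 for v in self.fields)

    def insert(self, addr: int) -> "Signature":
        """Accumulate one more address into the signature."""
        return self.union(Signature.encode(addr))

    def may_contain(self, addr: int) -> bool:
        """Membership test: encode, intersect, then test for null."""
        return not self.intersect(Signature.encode(addr)).is_empty()


r = Signature().insert(0x1000).insert(0x2040)
print(r.may_contain(0x1000))   # True: the address was inserted
```

Because membership is decided field by field, the test can report addresses that were never inserted, which is the superset behavior the sidebar describes. For instance, under the layout assumed here, a signature holding only addresses 0x00000000 and 0xFFFFFFFF also reports a hit for any address whose permuted fields mix the all-zeros and all-ones patterns.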
Atomic chunk execution is supported by buffering the state generated by the chunk in the L1 cache.
No update is propagated outside the
cache while the chunk is executing.
When the chunk completes or when a
dirty cache line with address in the W
signature must be displaced from the
cache, the hardware proceeds to commit the chunk. A successful commit
involves sending the chunk’s W signature to the subset of sharer processors indicated by the directory² and
clearing the local R and W signatures.
The latter operation erases any record
of the updates made by the chunk,
though the written lines remain dirty
in the cache.
The W signature carries enough
information to both invalidate stale
lines from the other coherent caches
(using the δ signature operation on W,
as discussed in the sidebar) and enforce that all other processors execute
their chunks in isolation. Specifically,
to enforce that a processor executes a
chunk in isolation when the processor
receives an incoming signature Winc,
its hardware intersects Winc against
the local Rloc and Wloc signatures. If either
of the two intersections is not null, it
means (conservatively) that the local
chunk has accessed a data element
written by the committing chunk.
Consequently, the local chunk is
squashed and then restarted.
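The commit and squash rules above can be illustrated with a small software model. It is a sketch under several simplifying assumptions, none of which come from the article: signatures use four 8-bit fields with no bit permutation, the committed signature is sent to every other processor rather than only to the sharers named by the directory, and a dictionary called `memory` stands in for the globally visible state. The `Processor` class and its method names are hypothetical.

```python
FIELDS, FIELD_BITS = 4, 8          # assumed layout: 4 bit-fields of 2**8 bits each
EMPTY = (0,) * FIELDS


def encode(addr):
    """Hash one address into a signature (bit permutation omitted for brevity)."""
    mask = (1 << FIELD_BITS) - 1
    return tuple(1 << ((addr >> (j * FIELD_BITS)) & mask) for j in range(FIELDS))


def union(s1, s2):
    return tuple(a | b for a, b in zip(s1, s2))


def intersects(s1, s2):
    """True if the bit-wise AND is non-null, i.e. every bit-field is non-zero."""
    return all(a & b for a, b in zip(s1, s2))


class Processor:
    def __init__(self, name):
        self.name = name
        self.r_sig, self.w_sig = EMPTY, EMPTY
        self.buffered = {}          # speculative writes held back until commit
        self.squashes = 0

    def read(self, addr, memory):
        self.r_sig = union(self.r_sig, encode(addr))
        return self.buffered.get(addr, memory.get(addr, 0))

    def write(self, addr, value):
        self.w_sig = union(self.w_sig, encode(addr))
        self.buffered[addr] = value  # not visible to others while the chunk runs

    def receive(self, w_inc):
        """Intersect the incoming W against local R and W; squash on any overlap."""
        if intersects(w_inc, self.r_sig) or intersects(w_inc, self.w_sig):
            self.buffered.clear()
            self.r_sig, self.w_sig = EMPTY, EMPTY
            self.squashes += 1       # the chunk would now re-execute
            return True
        return False

    def commit(self, memory, others):
        """Make the chunk's writes visible, broadcast W, clear local signatures."""
        w_inc = self.w_sig
        memory.update(self.buffered)  # simplification of the coherence actions
        self.buffered.clear()
        self.r_sig, self.w_sig = EMPTY, EMPTY
        return [p.name for p in others if p.receive(w_inc)]


# Scenario loosely following the text: P0 writes B and C while P1 reads them.
B, C, D = 0x100, 0x140, 0x9999      # hypothetical addresses
memory = {}
p0, p1, p2 = Processor("P0"), Processor("P1"), Processor("P2")
p0.write(B, 1)
p0.write(C, 2)
old_b = p1.read(B, memory)          # P1 sees the pre-commit value
p1.read(C, memory)
p2.read(D, memory)                  # unrelated address
squashed = p0.commit(memory, [p1, p2])
print(squashed)
```

In this scenario P1 is squashed because its R signature overlaps the committed W signature, while P2 is left alone. Since signatures encode a superset of the addresses, a real system may also squash chunks that did not truly conflict; that costs performance but not correctness.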
Figure 2 outlines atomic and isolated execution. Thread 0 executes
a chunk that writes variables B and
C, and no invalidations are sent out.
Signature W0 receives the hashed addresses of B and C. At the same time,
Thread 1 issues reads for B and C,
which (by construction) load the non-