ally had complete test coverage of all of
the processor error-handling code in
Solaris, something that had been lacking prior to this work. (The injection of
correctable and uncorrectable memory
errors is discussed later.)
The device driver used the diagnostic facilities of the UltraSPARC-II
processor to inject the errors into the
e-cache. (Similar diagnostic facilities
were used by the cache scrubber.) Before I explain how that worked, it will
help to understand the following:
• The UltraSPARC-II uses a 64-byte cache line.
• A cache line is moved between memory and the e-cache in 8-byte chunks.
• Each of these chunks is protected in memory by eight bits of ECC (error-correcting code) that can correct any single-bit error and detect any double-bit error (SEC-DED).
• Each byte of data is protected by a single parity bit when in the e-cache.
• There are two UDB (UltraSPARC Data Buffer) chips in parallel between the e-cache and main memory, and each UDB converts eight bytes of ECC-protected data at a time to eight bytes of parity-protected data (and vice versa). When a 64-byte cache line is moved from memory into the e-cache or vice versa, each UDB processes four chunks.
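
The arithmetic behind these facts can be checked with a small user-space sketch in C. It is only an illustration, not driver code: the function names are invented here, and the choice of even parity is an assumption, since the text does not say which parity sense the e-cache uses.

```c
#include <assert.h>
#include <stdint.h>

enum {
    LINE_BYTES  = 64,  /* e-cache line size (from the text) */
    CHUNK_BYTES = 8,   /* transfer unit between memory and the e-cache */
    NUM_UDBS    = 2    /* UDB chips working in parallel */
};

/* Number of 8-byte chunks in one cache line. */
int chunks_per_line(void)
{
    return LINE_BYTES / CHUNK_BYTES;
}

/* Chunks each UDB converts when one full line is moved. */
int chunks_per_udb(void)
{
    assert(chunks_per_line() % NUM_UDBS == 0);
    return chunks_per_line() / NUM_UDBS;
}

/* ASSUMPTION: even parity, i.e. the bit is the XOR of the eight data bits. */
uint8_t byte_parity(uint8_t b)
{
    b ^= (uint8_t)(b >> 4);
    b ^= (uint8_t)(b >> 2);
    b ^= (uint8_t)(b >> 1);
    return (uint8_t)(b & 1u);
}

/* Check bits per line: eight ECC bits per chunk in memory... */
int ecc_bits_per_line(void)
{
    return chunks_per_line() * 8;
}

/* ...versus one parity bit per byte in the e-cache. */
int parity_bits_per_line(void)
{
    return LINE_BYTES;
}
```

Under these numbers a line carries the same count of check bits in either form; what differs is that ECC can correct a single-bit error, while per-byte parity can only detect one.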
The interface between the processor
and the e-cache is 16 bytes wide. The
processor’s LSU (load/store unit) contains a control register that includes a
16-bit field called the force mask (FM).
Each bit in the FM corresponds to one
byte of the 16-byte interface between
the CPU and the e-cache. When a bit is
zero, a store of the corresponding byte
is done with good parity. When a bit is
one, a store of the corresponding byte
is done with bad parity. The FM bits
do not affect the checking of parity on
loads from the e-cache.
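
The effect of the force mask can be modeled in a few lines of C. This is a software simulation only, not a register definition. The mapping of FM bit i to byte i of the interface and the use of even parity are both assumptions made for illustration, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IFACE_BYTES 16  /* width of the CPU/e-cache interface (from the text) */

/* ASSUMPTION: even parity over the eight data bits. */
uint8_t fm_parity_of(uint8_t b)
{
    b ^= (uint8_t)(b >> 4);
    b ^= (uint8_t)(b >> 2);
    b ^= (uint8_t)(b >> 1);
    return (uint8_t)(b & 1u);
}

/*
 * Model one 16-byte store. For each byte, a zero FM bit yields good
 * parity and a one bit yields inverted (bad) parity.
 * ASSUMPTION: FM bit i corresponds to byte i.
 */
void fm_store(uint16_t fm, const uint8_t data[IFACE_BYTES],
              uint8_t parity_out[IFACE_BYTES])
{
    assert(data != NULL && parity_out != NULL);
    for (int i = 0; i < IFACE_BYTES; i++) {
        uint8_t good = fm_parity_of(data[i]);
        parity_out[i] = ((fm >> i) & 1u) ? (uint8_t)(good ^ 1u) : good;
    }
}

/* Loads always check parity; the FM plays no part in the check. */
int fm_count_errors(const uint8_t data[IFACE_BYTES],
                    const uint8_t parity[IFACE_BYTES])
{
    int errors = 0;
    for (int i = 0; i < IFACE_BYTES; i++) {
        if (fm_parity_of(data[i]) != parity[i])
            errors++;
    }
    return errors;
}
```

In this model, storing with a mask of zero produces no parity errors, a mask of all ones corrupts every byte, and a mask with a single bit set corrupts exactly one byte.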
Injecting a parity error into the e-cache is fairly straightforward. The
physical memory address of the desired byte is determined, and the following steps are performed:
1. Using its physical address, load
the desired byte into a register; this has
the side effect of bringing it into the e-cache if it isn’t there already.
2. Disable interrupts.
3. Set LSU.FM to all ones.
4. Store the desired byte back to its
physical address. (If for some reason
the containing cache line got displaced
from the cache after the load, then this
will bring it back into the cache.) The
targeted byte will be written back into
the cache line with bad parity.
5. Reset LSU.FM to zero.
6. Reenable interrupts.
Now that the desired byte is in the
e-cache with bad parity, the latent error can be triggered via several mechanisms: data load in user or kernel
mode, instruction fetch in user or kernel mode, displacement flush to cause
a write-back, access from another CPU
to cause a copy-back, and so on.
Interrupts must be disabled for as
long as LSU.FM is nonzero;
otherwise, if an interrupt occurs and
the interrupt handler (or any code it invokes) performs a store, then undesired
parity errors will be introduced into the
cache and triggered unpredictably.
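
The six-step sequence can be sketched as a user-space simulation in C. Nothing here touches real hardware. The toy cache, the `sim_` names, the even-parity choice, and the mapping of FM bits to byte lanes are all assumptions made for illustration; the actual driver uses privileged SPARC instructions that are not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_CACHE_BYTES 256u  /* hypothetical toy size, not the real e-cache */

/* Simulated machine state; every name is invented for this sketch. */
uint8_t  sim_data[SIM_CACHE_BYTES];
uint8_t  sim_parity[SIM_CACHE_BYTES];
uint16_t sim_lsu_fm;               /* models the LSU.FM field */
bool     sim_intr_enabled = true;  /* models the interrupt-enable state */

/* ASSUMPTION: even parity over the eight data bits. */
uint8_t sim_parity_of(uint8_t b)
{
    b ^= (uint8_t)(b >> 4);
    b ^= (uint8_t)(b >> 2);
    b ^= (uint8_t)(b >> 1);
    return (uint8_t)(b & 1u);
}

/* A store writes good parity unless the FM bit for its byte lane is set. */
void sim_store(size_t paddr, uint8_t value)
{
    size_t idx = paddr % SIM_CACHE_BYTES;
    unsigned lane = (unsigned)(paddr % 16u);  /* ASSUMPTION: bit i = lane i */
    uint8_t p = sim_parity_of(value);
    if ((sim_lsu_fm >> lane) & 1u)
        p ^= 1u;
    sim_data[idx] = value;
    sim_parity[idx] = p;
}

/* A load always checks parity; the FM has no effect here. */
uint8_t sim_load(size_t paddr, bool *parity_error)
{
    size_t idx = paddr % SIM_CACHE_BYTES;
    *parity_error = sim_parity_of(sim_data[idx]) != sim_parity[idx];
    return sim_data[idx];
}

/* The six steps from the text, applied to the simulated state. */
void sim_inject(size_t paddr)
{
    bool err;
    uint8_t v = sim_load(paddr, &err);  /* 1. load the byte              */
    (void)err;
    sim_intr_enabled = false;           /* 2. disable interrupts         */
    sim_lsu_fm = 0xFFFFu;               /* 3. set LSU.FM to all ones     */
    assert(!sim_intr_enabled);          /* no stray stores may run here  */
    sim_store(paddr, v);                /* 4. store back with bad parity */
    sim_lsu_fm = 0;                     /* 5. reset LSU.FM to zero       */
    sim_intr_enabled = true;            /* 6. reenable interrupts        */
}
```

After `sim_inject(addr)`, the next `sim_load(addr, &err)` returns the original data value with `err` set, which mirrors how a latent error stays dormant until something reads the byte. Any `sim_store` issued between steps 3 and 5 would also be written with bad parity, which is why the window is kept closed to interrupts.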
This six-step sequence is used to inject e-cache parity errors at locations
corresponding to specific physical
memory addresses, kernel virtual addresses, or user virtual addresses. (Virtual
addresses are translated to their
corresponding physical addresses by
the memtest device driver.) To simulate bit flips caused by background
radiation, however, we would like to
inject an e-cache parity error at an arbitrary e-cache offset, without regard
to the physical memory address corresponding to the e-cache line.
Fortunately, the LSU.FM field also
applies to stores to the e-cache using
diagnostic accesses. Unfortunately,
diagnostic loads and stores work only
with 8-byte quantities, not with single
bytes. In order to affect just a single
byte, we must set only the one bit in
LSU.FM that corresponds to the byte
we want to change. The sequence in
this case then becomes:
1. Disable interrupts.
2. Fool the instruction prefetcher
3. Set the desired bit in LSU.FM to one.
4. Load the containing eight bytes
into a register with a diagnostic load.
5. Store the containing eight bytes
back into the e-cache with a diagnostic store.
6. Reset LSU.FM to zero.
7. Reenable interrupts.
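
The offset-based variant can be sketched the same way, as a simulation rather than driver code. The `toy_` names, the toy cache size, even parity, and the rule that FM bit i covers the byte at offset i modulo 16 are all assumptions for illustration. Step 2 (the instruction prefetcher) is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ECACHE_BYTES 128u  /* hypothetical toy size */

uint8_t  toy_data[TOY_ECACHE_BYTES];
uint8_t  toy_parity[TOY_ECACHE_BYTES];
uint16_t toy_fm;                   /* models LSU.FM */
bool     toy_intr_enabled = true;

/* ASSUMPTION: even parity over the eight data bits. */
uint8_t toy_parity_of(uint8_t b)
{
    b ^= (uint8_t)(b >> 4);
    b ^= (uint8_t)(b >> 2);
    b ^= (uint8_t)(b >> 1);
    return (uint8_t)(b & 1u);
}

/* ASSUMPTION: FM bit i covers the byte whose offset is i modulo 16. */
uint16_t toy_fm_bit(size_t offset)
{
    return (uint16_t)(1u << (offset % 16u));
}

/* Diagnostic accesses move whole, aligned 8-byte words. */
void toy_diag_load8(size_t offset, uint8_t out[8])
{
    size_t base = (offset % TOY_ECACHE_BYTES) & ~(size_t)7;
    for (size_t i = 0; i < 8; i++)
        out[i] = toy_data[base + i];
}

void toy_diag_store8(size_t offset, const uint8_t in[8])
{
    size_t base = (offset % TOY_ECACHE_BYTES) & ~(size_t)7;
    for (size_t i = 0; i < 8; i++) {
        uint8_t p = toy_parity_of(in[i]);
        if ((toy_fm >> ((base + i) % 16u)) & 1u)
            p ^= 1u;               /* forced bad parity for this byte only */
        toy_data[base + i] = in[i];
        toy_parity[base + i] = p;
    }
}

/* Count bytes in the whole toy cache whose parity no longer matches. */
int toy_count_bad(void)
{
    int bad = 0;
    for (size_t i = 0; i < TOY_ECACHE_BYTES; i++)
        bad += toy_parity_of(toy_data[i]) != toy_parity[i];
    return bad;
}

void toy_inject_at_offset(size_t offset)
{
    uint8_t word[8];
    assert(offset < TOY_ECACHE_BYTES);
    toy_intr_enabled = false;          /* 1. disable interrupts          */
                                       /* 2. prefetcher: not modeled     */
    toy_fm = toy_fm_bit(offset);       /* 3. set only the one FM bit     */
    toy_diag_load8(offset, word);      /* 4. diagnostic load of 8 bytes  */
    toy_diag_store8(offset, word);     /* 5. diagnostic store of 8 bytes */
    toy_fm = 0;                        /* 6. reset LSU.FM to zero        */
    toy_intr_enabled = true;           /* 7. reenable interrupts         */
}
```

Because only one FM bit is set, the other seven bytes of the word are rewritten with good parity and the data itself is unchanged; in this model `toy_count_bad()` reports exactly one corrupted byte after a single injection.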