Error-detection and correction features
are only as good as our ability to test them.
By Steve Chessin
errors for fun
“That which isn’t tested is broken.” —Author unknown
“Well, everything breaks, don’t it, Colonel.”
—Monty Python’s Flying Circus
It is an unfortunate fact of life that anything with
moving parts eventually wears out and malfunctions,
and electronic circuitry is no exception. In this case,
of course, the moving parts are electrons. In addition
to the wear-out mechanisms of electromigration (the
moving electrons gradually push the metal atoms out
of position, causing wires to thin, thus increasing their
resistance and eventually producing open circuits)
and dendritic growth (the voltage difference between
adjacent wires causes the displaced metal atoms to
migrate toward each other, just as magnets will attract
each other, eventually causing shorts), electronic
circuits are also vulnerable to background radiation.
The fast-moving charged particles in that radiation knock electrons
out of their orbits, leaving ionized trails in their wake.
Until those electrons find their way
back home, a conductive path exists
where there once was none.
If the path is between the two plates
of a capacitor used to store a bit, the capacitor discharges, and the bit can flip
from one to zero or from zero to one.
Once the capacitor discharges, the displaced electrons return home, and the
part appears to have healed itself with
no permanent damage, except perhaps
to the customer’s data. For this reason,
memory is usually protected with some
level of redundancy, so flipped bits can
be detected and perhaps corrected. Of
course, the error-detection and correction circuitry itself must be tested, and
that is the main topic of this article.
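
To make "detected and perhaps corrected" concrete, here is a minimal sketch of a single-error-correcting code, assuming a textbook Hamming(7,4) layout. It is an illustration only, not the scheme used by any processor discussed in this article; the function names `hamming74_encode` and `hamming74_decode` are hypothetical.

```python
def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit codeword (bit positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 is unused
    # Data bits occupy positions 3, 5, 6, 7; check bits occupy 1, 2, 4.
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]      # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]      # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]      # covers positions 4, 5, 6, 7
    return sum(code[pos] << (pos - 1) for pos in range(1, 8))


def hamming74_decode(word: int) -> tuple:
    """Return (nibble, syndrome). A nonzero syndrome is the 1-based
    position of the single flipped bit, which is corrected before the
    data bits are extracted."""
    code = [0] + [(word >> (pos - 1)) & 1 for pos in range(1, 8)]
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                    # XOR of the positions holding a 1
    if syndrome:
        code[syndrome] ^= 1                    # flip the bad bit back
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, syndrome


# Simulate a particle strike: flip one stored bit, then read the word back.
stored = hamming74_encode(0b1011)
struck = stored ^ (1 << 5)                     # bit index 5 is position 6
nibble, position = hamming74_decode(struck)
assert nibble == 0b1011 and position == 6
```

Flipping a stored bit on purpose and checking that the decoder notices is, in miniature, the kind of error injection the rest of this article is concerned with.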
(If the path is between a current
source and ground, then it cannot heal
until power is removed. This is called
single event latchup, which simulates a
hard failure, at least until the power is
turned off, such as when preparing to remove and replace the apparently failing
part. The returned part, of course, will
test out as “no trouble found,” frustrating everyone involved. Single event latchup is difficult for software to deal with
and will not be discussed further here.)
In addition to the causes of errors
mentioned here, transmission lines
are subject to noise-induced errors, so
transmitted signals are also often protected with redundancy.
As the density of circuits increases,
features get smaller; as frequencies increase, voltages get lower. These trends
combine to reduce the amount of charge
used to represent a bit, increasing the
sensitivity of memory to background radiation. For example, the original UltraSPARC-I processor ran at 143MHz and
had a 256KB e-cache (external cache).
The cache design used simple byte parity to protect the data, which was sufficient as the amount of charge used
to hold a bit was large enough that an
ionizing particle would drain off only a
small amount, not enough to flip a bit.
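
As a rough illustration of what byte parity can and cannot do, the sketch below stores one even-parity bit per byte. The choice of even parity and the helper names `parity_bit`, `write_byte`, and `parity_ok` are assumptions made for the example, not details of the actual cache design.

```python
def parity_bit(byte: int) -> int:
    """Even parity: 1 if the byte holds an odd number of 1 bits, else 0."""
    return bin(byte & 0xFF).count("1") & 1


def write_byte(byte: int) -> tuple:
    """Model a store: keep the data byte together with its parity bit."""
    return byte & 0xFF, parity_bit(byte)


def parity_ok(byte: int, stored_parity: int) -> bool:
    """Model a load: recompute parity and compare with the stored bit."""
    return parity_bit(byte) == stored_parity


data, parity = write_byte(0x5A)
one_flip = data ^ 0x08     # a single bit flipped
two_flips = data ^ 0x18    # two bits flipped in the same byte

assert parity_ok(data, parity)           # clean data passes
assert not parity_ok(one_flip, parity)   # a single flip is detected
assert parity_ok(two_flips, parity)      # a double flip goes unnoticed
```

Parity can only report that an odd number of bits in the byte changed; it cannot say which bit, so it detects errors but cannot correct them.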
When this design was scaled up in
the UltraSPARC-II processor to run at
400MHz with an 8MB e-cache, however,
the amount of charge used to hold a bit