DOI: 10.1145/1897816.1897843
Technical Perspective
DRAM Errors in the Wild
By Norman P. Jouppi
In an era of mobile devices used as windows into services provided by computing in the cloud, the cost and reliability of the large warehouse-scale computers¹ behind those services are paramount. These warehouse-scale computers are built from racks of servers, each typically containing one or two processor chips but many memory chips. Even with a crash-tolerant application layer, understanding the sources and types of errors in server memory systems is still very important.
Similarly, as we look forward to exascale performance in more traditional supercomputing applications, even memory errors that are correctable through traditional error-correcting codes can have an outsized impact on total system performance.³ This is because, in many systems, execution on a node with hardware-corrected errors that are logged in software runs significantly slower than on nodes without errors. Since execution of bulk synchronous parallel applications is only as fast as the slowest local computation, in a million-node computation the slowdown of one node from memory errors can delay the entire million-node system.
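To see the scale effect, consider a back-of-the-envelope sketch (not from the paper; the per-node degradation probability and the slowdown factor below are invented purely for illustration): if each node independently has some small chance of running in a degraded, error-logging mode, and a bulk-synchronous step finishes only when its slowest node does, the chance that a step is delayed grows rapidly with node count.

# Illustrative only: expected time of one bulk-synchronous step when any
# single degraded node delays all the others. The per-node degradation
# probability (1e-6) and slowdown factor (3x) are assumptions, not
# measurements from the paper.
def expected_step_time(num_nodes, p_degraded=1e-6, slow_factor=3.0):
    # Probability that at least one node is degraded during this step.
    p_any_slow = 1.0 - (1.0 - p_degraded) ** num_nodes
    # Step time relative to an error-free step.
    return (1.0 - p_any_slow) * 1.0 + p_any_slow * slow_factor, p_any_slow

for n in (1_000, 1_000_000):
    t, p = expected_step_time(n)
    print(f"{n:>9} nodes: P(step delayed) = {p:.3f}, expected step time = {t:.2f}x")

Under these assumed numbers, a thousand-node job is essentially unaffected, while a million-node job sees most of its steps delayed.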
At the system level, low-end PCs have historically not provided any error detection or correction capability, while servers have used error-correcting codes (ECC) that enable correction of a single error per codeword. This worked especially well when a different memory chip was used for each bit read or written by a memory bus (such as when using "x1" memory chips). However, in the last 15 years, as memory buses have become wider, more bits on the bus need to be read or written from each memory chip, leading to the use of memory chips that provide four ("x4") or more bits at a time to a memory bus. Unfortunately, this increases the probability of errors correlated across multiple bits, such as when part of a chip's address circuit fails. In order to handle cases where an entire chip's contribution to a memory bus is corrupted, chipkill-correct error-correcting codes have been developed.²
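As a minimal sketch of the arithmetic involved (assuming a common 72-bit codeword of 64 data bits plus 8 check bits; real DIMM organizations and chipkill codes vary), one can see why a single failed x4 chip defeats a code that corrects only one bit per codeword:

# Illustrative only: how many bits of one 72-bit ECC codeword (64 data +
# 8 check bits, a common organization assumed here) a single failed DRAM
# chip corrupts. SEC-DED ECC corrects one bit per codeword; chipkill-correct
# codes are designed to tolerate the loss of an entire chip's contribution.
CODEWORD_BITS = 72

for chip_width in (1, 4):                      # "x1" vs "x4" DRAM chips
    chips_per_codeword = CODEWORD_BITS // chip_width
    bits_lost_to_one_chip = chip_width         # a dead chip corrupts all its bits
    print(f"x{chip_width}: {chips_per_codeword} chips supply the codeword; "
          f"one failed chip corrupts {bits_lost_to_one_chip} bit(s); "
          f"single-error correction suffices: {bits_lost_to_one_chip <= 1}")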
Since the introduction of DRAMs in the mid-1970s, there has been much work on improving the reliability of individual DRAM devices. Some of the classic problems addressed were tolerance of radiation from either impurities in the package or cosmic sources. In contrast, very little information has been published on the reliability of memory at the system level. There are several reasons for this. First, much of the industrial data is specific to particular memory or CPU vendors, and it typically focuses on configurations that are particularly problematic. Therefore, neither DRAM, CPU, nor system vendors find it in their best interest to publish this data.
Nevertheless, in order to advance the field, knowledge of the types of memory errors, their frequencies, and the conditions that do or do not correlate with higher error rates is of critical importance. To fill this gap, Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber analyzed measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. They collected data on multiple DRAM capacities, technologies, and vendors (suitably anonymized), totaling millions of DIMM days. I hope the following paper will motivate the collection and publication of even more large-scale system memory reliability data.
References
1. Barroso, L. and Hölzle, U. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2009.
2. Dell, T.J. A white paper on the benefits of chipkill-correct ECC for PC server main memory. IBM Microelectronics, 1997.
3. Yelick, K. Ten Ways to Waste a Parallel Computer; http://isca09.cs.columbia.edu/ISCA09-WasteParallelComputer.pdf
Norman P. Jouppi (Norm.Jouppi@hp.com) is a Senior Fellow and Director of Hewlett-Packard's Intelligent Infrastructure Lab in Palo Alto, CA.
© 2011 ACM 0001-0782/11/0200 $10.00