DOI: 10.1145/1897816.1897844
DRAM Errors in the Wild:
A Large-Scale Field Study
Abstract
Errors in dynamic random access memory (DRAM) are a
common form of hardware failure in modern compute
clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of
work exists on DRAM in laboratory conditions, little has been
reported on real DRAM failures in large production clusters.
In this paper, we analyze measurements of memory errors in a
large fleet of commodity servers over a period of 2.5 years. The
collected data covers multiple vendors, DRAM capacities and
technologies, and comprises many millions of dual in-line
memory module (DIMM) days.
The goal of this paper is to answer questions such as the
following: How common are memory errors in practice? What
are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology,
and DIMM age?
We find that DRAM error behavior in the field differs
in many key aspects from commonly held assumptions.
For example, we observe DRAM error rates that are orders
of magnitude higher than previously reported, with 25,000–
70,000 errors per billion device hours per Mb and more than
8% of DIMMs affected by errors per year. We provide strong
evidence that memory errors are dominated by hard errors,
rather than soft errors, which previous work suspects to be the
dominant error mode. We find that temperature, known to
strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field when all other factors are taken into account. Finally, contrary to common fears, we do not observe any indication that newer generations of DIMMs have worse error behavior.
1. Introduction
Errors in dynamic random access memory (DRAM) devices
have been a concern for a long time.3, 11, 15–17, 22 A memory error
is an event that leads to the logical state of one or multiple
bits being read differently from how they were last written.
Memory errors can be caused by electrical or magnetic interference (e.g., due to cosmic rays), can be due to problems with
the hardware (e.g., a bit being permanently damaged), or can
be the result of corruption along the data path between the
memories and the processing elements. Memory errors can
be classified into soft errors, which randomly corrupt bits but
do not leave physical damage; and hard errors, which corrupt
bits in a repeatable manner because of a physical defect.
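As a rough illustration of this distinction (a hedged sketch in Python, not the classification procedure used in this study), a monitoring system could flag an error as likely hard when the same physical address reports errors repeatedly; the log format and repeat threshold below are illustrative assumptions.

    from collections import Counter

    def classify_error_addresses(error_log, repeat_threshold=2):
        # error_log: physical addresses that reported a corrected error.
        # An address that faults repeatedly suggests a physical defect (hard
        # error); an address seen only once is more consistent with a
        # transient (soft) error.
        counts = Counter(error_log)
        return {addr: ("hard (repeatable)" if n >= repeat_threshold
                       else "soft (transient)")
                for addr, n in counts.items()}

    # Hypothetical log: address 0x7f3a00 faults twice, the others once.
    print(classify_error_addresses([0x7f3a00, 0x12bc40, 0x7f3a00, 0x99d210]))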
The consequence of a memory error is system-dependent.
In systems using memory without support for error correction
and detection, a memory error can lead to a machine crash or
applications using corrupted data. Most memory systems in
server machines employ error correcting codes (ECC),6 which
allow the detection and correction of one or multiple bit
errors. If an error is uncorrectable, i.e., the number of affected bits exceeds the limit of what the ECC can correct, a machine shutdown is typically forced. In many production environments, including ours, a single uncorrectable error (UE) is
considered serious enough to replace the dual in-line memory
module (DIMM) that caused it.
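Server ECC schemes typically operate over wider words (e.g., single-error-correct/double-error-detect codes over 64 data bits). Purely as a hedged illustration of the principle, and not the code used in the hardware studied here, the following minimal Hamming(7,4) sketch in Python corrects any single flipped bit in a 4-bit word.

    def hamming74_encode(d):
        # Encode 4 data bits as a 7-bit codeword (positions 1..7).
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4      # parity over positions 1, 3, 5, 7
        p2 = d1 ^ d3 ^ d4      # parity over positions 2, 3, 6, 7
        p3 = d2 ^ d3 ^ d4      # parity over positions 4, 5, 6, 7
        return [p1, p2, d1, p3, d2, d3, d4]

    def hamming74_correct(c):
        # Recompute the parities; the syndrome gives the flipped position.
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
        pos = s1 + 2 * s2 + 4 * s3   # 0 means no single-bit error detected
        if pos:
            c = c[:]
            c[pos - 1] ^= 1          # flip the offending bit back
        return [c[2], c[4], c[5], c[6]], pos

    codeword = hamming74_encode([1, 0, 1, 1])
    codeword[5] ^= 1                    # simulate a single-bit upset
    print(hamming74_correct(codeword))  # recovers [1, 0, 1, 1], flags position 6

If two bits flip, the syndrome points at the wrong position; production SEC-DED codes add an extra parity bit so that such double-bit errors are at least detected and reported as the uncorrectable errors (UEs) discussed above.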
Memory errors are costly in terms of the system failures
they cause and the repair costs associated with them. In
production sites running large-scale systems, memory
component replacements rank near the top of component
replacements19 and memory errors are one of the most
common hardware problems to lead to machine crashes.18
There is also a fear that advancing densities in DRAM technology might lead to increased memory errors, exacerbating
this problem in the future.3, 12, 13
Despite the practical relevance of DRAM errors, very little
is known about their prevalence in real production systems. Existing studies (for example, Baumann, Borucki et al., Johnston, May and Woods, Normand, and Ziegler and Lanford3, 4, 9, 11, 16, 22) are mostly based on lab experiments using accelerated testing, where DRAM is exposed to
extreme conditions (such as high temperature) to artificially
induce errors. It is not clear how such results carry over
to real production systems. The few prior studies that are
based on measurements in real systems are small in scale,
such as recent work by Li et al.,10 who report on DRAM errors
in 300 machines over a period of 3–7 months. Moreover,
existing work is not always conclusive in its results. Li et
al. cite error rates in the 200–5000 FIT per Mb range from
previous lab studies, and themselves found error rates of < 1
FIT per Mb.
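A FIT is one failure per billion (10^9) device hours. To put these rates in perspective, the back-of-the-envelope conversion below (a sketch in Python; the 1 GB DIMM capacity is an assumed example, not a figure from the measurements) translates a FIT-per-Mb rate into expected errors per DIMM per year.

    def errors_per_dimm_year(fit_per_mb, dimm_gb=1):
        # FIT/Mb = failures per 10^9 device hours per megabit of capacity.
        megabits = dimm_gb * 8 * 1024   # GB -> Mb
        hours_per_year = 24 * 365
        return fit_per_mb * megabits * hours_per_year / 1e9

    # Lab-study range cited by Li et al. vs. their own field measurement:
    print(errors_per_dimm_year(200), errors_per_dimm_year(5000))  # ~14 and ~359 errors/year
    print(errors_per_dimm_year(1))                                # ~0.07 errors/year
    # The field rates reported here (25,000-70,000 FIT/Mb) are far higher:
    print(errors_per_dimm_year(25000))                            # ~1,800 errors/year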
This paper provides the first large-scale study of DRAM
memory errors in the field. It is based on data collected from
Google’s server fleet over a period of more than 2 years, comprising many millions of DIMM days. The DRAM in our study
covers multiple vendors, DRAM densities and technologies
(DDR1, DDR2, and FBDIMM).
The goal of this paper is to answer the following questions:
How common are memory errors in practice? How are they
affected by external factors, such as temperature, and system
utilization? How do they vary with chip-specific factors, such
as chip density, memory technology, and DIMM age? What
are their statistical properties?
The original version of this paper was published in
Proceedings of ACM SIGMETRICS, June 2009.