2. BACKGROUND AND DATA
Our data covers the majority of machines in Google’s fleet and
spans nearly 2.5 years, from January 2006 to June 2008. Each
machine comprises a motherboard with some processors and
memory DIMMs. We study six different hardware platforms,
where a platform is defined by the motherboard and memory
generation. We refer to these platforms as platforms A to F
throughout the paper.
The memory in these systems covers a wide variety of the
most commonly used types of DRAM. The DIMMs come from
multiple manufacturers and models, with three different
capacities (1GB, 2GB, 4GB), and cover the three most common DRAM technologies: Double Data Rate 1 (DDR1), Double
Data Rate 2 (DDR2), and Fully-Buffered (FBDIMM).
Most memory systems in use in servers today are protected by error detection and correction codes. Typical error
codes today fall in the single error correct double error detect
(SECDED) category. That means they can reliably detect and
correct any single-bit error, but they can only detect and not
correct multiple bit errors. More powerful codes can correct
and detect more error bits in a single memory word. For
example, a code family known as chipkill [7] can correct up to
four adjacent bits at once, thus being able to work around
a completely broken 4-bit wide DRAM chip. In our systems,
Platforms C, D, and F use SECDED, while Platforms A, B, and
E rely on error protection based on chipkill. We use the terms
correctable error (CE) and uncorrectable error (UE) in this
paper to generalize away the details of the actual error codes
used. Our study relies on data collected by low-level daemons
running on all our machines that directly access hardware
counters on the machine to obtain counts of correctable and
uncorrectable DRAM errors.
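
To make the SECDED behavior concrete, the following is a minimal sketch of an extended Hamming (8,4) code: it corrects any single flipped bit and detects, but cannot correct, any two flipped bits. This is a toy illustration only. Server memory typically protects 64-bit words with wider codes, and nothing here is taken from the platforms in our study; the function names and the status strings ("ok", "corrected", "uncorrectable") are hypothetical, chosen to mirror the CE/UE terminology.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 form Hamming(7,4)
    with parity bits at positions 1, 2, 4 and data bits at 3, 5, 6, 7.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1s even
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data); data is None when the word is uncorrectable."""
    # The syndrome is the XOR of the positions of all set bits; it is 0 for a
    # valid codeword and equals the flipped position after a single bit flip.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_odd = sum(code) % 2 == 1

    word = list(code)
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flips: treat as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        word[syndrome] ^= 1
        status = "corrected"  # would be counted as a CE
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None  # would be counted as a UE

    data = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, data


if __name__ == "__main__":
    cw = encode(0b1011)
    cw[5] ^= 1                 # one flipped bit
    print(decode(cw))
    cw[2] ^= 1                 # a second flipped bit in the same word
    print(decode(cw))
```

Note that a code of this kind silently miscorrects when three or more bits flip, which is one reason stronger codes such as chipkill are attractive for wide failures.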
If done well, the handling of correctable memory errors is
largely invisible to application software. In contrast, UEs typically lead to a catastrophic failure. Either there is an explicit
response action (such as a machine reboot), or there is risk of
a data-corruption-induced failure, such as a kernel panic. In
the systems we study, all UEs are considered serious enough
to shut down the machine and replace the DIMM at fault.
Memory errors can be soft errors, which randomly
corrupt bits, but do not leave any physical damage; or hard
errors, which corrupt bits in a repeatable manner because
of a physical defect (e.g., stuck bits). Our measurement
infrastructure captures both hard and soft errors, but does
not allow us to reliably distinguish these types of errors. All
our numbers include both hard and soft errors.
3. HOW COMMON ARE ERRORS?
The analysis of our data shows that CEs are not rare events:
We find that about a third of all machines in Google’s fleet,
and over 8% of individual DIMMs, saw at least one CE per
year. Figure 1 (left) shows the average number of CEs across
all DIMMs in our study per year of operation broken down by
hardware platform. Figure 1 (middle) shows the fraction of
DIMMs per year that experience at least one CE. Consistently
across all platforms, errors occur at a significant rate, with a
fleet-wide average of nearly 4,000 errors per DIMM per year.
The fraction of DIMMs that experience CEs varies from around
3% (for Platforms C, D and F) to around 20% (for Platforms A
and B). Our per-DIMM rates of CEs translate to an average of
25,000–75,000 FIT (failures in time, i.e., failures per billion hours of operation) per Mb and a median FIT range of 778–25,000 per Mb
(median for DIMMs with errors). We note that this rate is significantly higher than the 200–5,000 FIT per Mb reported in
previous studies and will discuss later in the paper reasons for
the differences in results.
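
As a rough check on the unit conversion, the sketch below turns a per-DIMM annual CE count into FIT per Mbit. It assumes a 365-day year, binary capacities (1 GB = 8,192 Mbit), and errors spread evenly over a DIMM's capacity. The function name and these conventions are chosen for illustration and are not necessarily the ones used to compute the ranges above.

```python
HOURS_PER_YEAR = 24 * 365      # assumption: leap days ignored
FIT_BASE_HOURS = 1e9           # FIT counts failures per billion device hours
MBIT_PER_GB = 1024 * 8         # assumption: binary GB, 8 bits per byte


def fit_per_mbit(errors_per_dimm_year: float, dimm_capacity_gb: float) -> float:
    """Convert errors per DIMM per year into FIT per Mbit of DRAM."""
    errors_per_hour = errors_per_dimm_year / HOURS_PER_YEAR
    fit_per_dimm = errors_per_hour * FIT_BASE_HOURS
    return fit_per_dimm / (dimm_capacity_gb * MBIT_PER_GB)


if __name__ == "__main__":
    # The three DIMM capacities in the study, at ~4,000 CEs per DIMM per year.
    for capacity_gb in (1, 2, 4):
        print(capacity_gb, "GB:", round(fit_per_mbit(4000, capacity_gb)))
```

Under these assumptions, 4,000 CEs per DIMM per year corresponds to roughly 56,000 FIT per Mbit for a 1GB DIMM and roughly 14,000 for a 4GB DIMM, the same order of magnitude as the averages quoted above.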
We also analyzed the rate of UEs and found that across
the entire fleet 1.3% of machines are affected by UEs per year,
with some platforms seeing as many as 2%–4% of machines
affected. Figure 1 (right) shows the fraction of DIMMs that see
UEs in a given year, broken down by hardware platform.
We note that, while the rate of CEs was comparable across
platforms (recall Figure 1 (left)), the incidence of UEs is much
more variable, ranging from 0.05% to 0.4%. In particular,
Platforms C and D have a 3–6 times higher probability of seeing a UE than Platforms A and E.
Figure 1. Frequency of errors: the average number of correctable errors (CEs) per year per DIMM (left), the fraction of DIMMs that see at least
one CE in a given year (middle), and the fraction of DIMMs that see at least one uncorrectable error (UE) in a given year (right). Platforms C, D,
and F use SECDED, while Platforms A, B, and E rely on error protection based on chipkill.
[Figure 1 y-axes: number of CEs / year / DIMM (left); DIMMs affected by CEs / year (middle); DIMMs affected by UEs / year (right).]