Figure 8. Correlations between correctable and uncorrectable
errors: The graph shows the UE probability in a month depending on
whether there were CEs earlier in the same month (three left-most
bars) or in the previous month (three right-most bars). The numbers
on top of the bars give the increase in UE probability compared to a
month without CEs (three left-most bars) and the case where there
were no CEs in the previous month (three right-most bars).
[Figure: y-axis is UE probability (%); bar groups are "CE same month" and "CE previous month".]
8× lower than if the same number of CEs had happened in
the same month, but still significantly higher than in a random month.
Given the above observations, one might want to use CEs as
an early warning sign for impending UEs. Another interesting
view is therefore what fraction of UEs are actually preceded
by a CE, either in the same month or the previous month. We
find that 65%–80% of UEs are preceded by a CE in the same
month. Nearly 20%–40% of UEs are preceded by a CE in the
previous month. These probabilities are significantly higher
than those in an average month.
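As a rough illustration of how such fractions could be computed, the following sketch counts, over all DIMM-months with at least one UE, how many also saw CEs in the same or the previous month. The record layout, the function name, and the toy data are assumptions made for this example and are not taken from our measurement infrastructure. Note that monthly aggregates cannot tell whether a CE actually occurred before the UE within the same month, so the same-month figure here is only an approximation of "preceded by."

```python
from collections import defaultdict


def fraction_ues_with_ce(records):
    """Fraction of UE DIMM-months that also saw CEs.

    records: iterable of (dimm_id, month_index, ce_count, ue_count),
    one tuple per DIMM-month (hypothetical schema, not the paper's).
    Returns (same_month_fraction, previous_month_fraction).
    """
    ce_count = defaultdict(int)
    ue_months = set()
    for dimm, month, ces, ues in records:
        ce_count[(dimm, month)] += ces
        if ues > 0:
            ue_months.add((dimm, month))
    if not ue_months:
        return 0.0, 0.0
    # .get() avoids creating empty entries in the defaultdict.
    same = sum(1 for d, m in ue_months if ce_count.get((d, m), 0) > 0)
    prev = sum(1 for d, m in ue_months if ce_count.get((d, m - 1), 0) > 0)
    return same / len(ue_months), prev / len(ue_months)


# Toy data, invented purely for illustration.
toy_records = [
    ("a", 1, 5, 0),
    ("a", 2, 3, 1),  # UE with CEs in the same and the previous month
    ("b", 2, 0, 1),  # UE with no CEs at all
    ("c", 1, 2, 0),
    ("c", 2, 0, 1),  # UE with CEs only in the previous month
    ("d", 2, 4, 1),  # UE with CEs only in the same month
]
same_frac, prev_frac = fraction_ues_with_ce(toy_records)
print(f"same month: {same_frac:.0%}, previous month: {prev_frac:.0%}")
```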
The above observations lead to the idea of early replacement policies, where a DIMM is replaced once it experiences
a significant number of CEs, rather than waiting for the first
UE. However, while UE probabilities are greatly increased
after observing CEs, the absolute probabilities of a UE are still
relatively low (e.g., 1.7%–2.3% in the case of Platform C and
Platform D, see Figure 8).
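The consequence of these low absolute probabilities can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, as a simplification, that replacing a DIMM after CEs fully avoids the UE it would otherwise have had in that month, and it looks only one month ahead. The function names and the cost figures are hypothetical; only the 1.7% and 2.3% probabilities come from the text above.

```python
def replacements_per_avoided_ue(p_ue):
    """Expected number of DIMMs replaced for each UE avoided.

    p_ue: probability that a DIMM with CEs has a UE in the month.
    Simplifying assumption: a replacement always prevents that UE.
    """
    return 1.0 / p_ue


def early_replacement_pays_off(p_ue, cost_replacement, cost_ue):
    """True if the expected UE cost avoided exceeds the replacement cost.

    Both costs are in the same (arbitrary, hypothetical) unit.
    """
    return p_ue * cost_ue > cost_replacement


for p in (0.017, 0.023):
    n = replacements_per_avoided_ue(p)
    print(f"p_ue={p:.1%}: about {n:.0f} replacements per avoided UE, "
          f"{1 - p:.1%} of replaced DIMMs would not have had a UE")

# Hypothetical costs: a UE must cost more than 1/p_ue replacements
# before early replacement breaks even under this simple model.
print(early_replacement_pays_off(0.02, cost_replacement=100, cost_ue=10_000))
print(early_replacement_pays_off(0.02, cost_replacement=100, cost_ue=4_000))
```

Under this simple model, early replacement breaks even only when a single UE costs more than roughly 43 to 59 DIMM replacements.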
We also experimented with more sophisticated methods
for predicting UEs, including CART (classification and regression trees) models based on parameters such as the number
of CEs in the same and previous month, CEs and UEs in other
DIMMs in the machine, DIMM capacity and model, but were
not able to achieve significantly better prediction accuracy.
Hence, replacing DIMMs solely based on CEs might be worth
the price only in environments where the cost of downtime is
high enough to outweigh the cost of the relatively high rate of
false positives.
The evidence of correlations between errors presented above, at
both short and longer time scales, might also shed some light on
the common nature of errors. In simple terms, our results indicate that once a
DIMM starts to experience errors it is likely to continue to
have errors. This observation makes it more likely that most
of the observed errors are due to hard errors, rather than soft
errors. The occurrence of hard errors would also explain the
correlation between utilization and errors that we observed in
Section 4.1.
6. Summary and Discussion
This paper studied the incidence and characteristics of DRAM
errors in a large fleet of commodity servers. Our study is based
on data collected over more than 2 years and covers DIMMs of
multiple vendors, generations, technologies, and capacities.
Below, we briefly summarize our results and discuss their
implications.
Conclusion 1: We found the incidence of memory errors and
the range of error rates across different DIMMs to be much
higher than previously reported.
A third of machines and over 8% of DIMMs in our fleet saw
at least one CE per year. Our per-DIMM rates of CEs translate
to an average of 25,000–75,000 FIT (failures in time per billion hours of operation) per Mb, while previous studies report
200–5,000 FIT per Mb. The number of CEs per DIMM is highly
variable, with some DIMMs experiencing a huge number of
errors compared to others. The annual incidence of UEs was
1.3% per machine and 0.22% per DIMM.
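To make the FIT unit concrete, the sketch below converts between an error count and FIT per Mb, using the definition of FIT as failures per billion device-hours. It assumes that Mb denotes megabits; the 1 GB DIMM capacity (8,192 Mb) and the function names are hypothetical choices for illustration only.

```python
HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def fit_per_mbit(errors, hours, capacity_mbit):
    """FIT per Mb: errors per 10**9 hours of operation, per megabit."""
    return errors / (hours * capacity_mbit) * 1e9


def errors_per_year(fit_mbit, capacity_mbit):
    """Inverse conversion: expected errors per DIMM-year at a given FIT/Mb."""
    return fit_mbit * HOURS_PER_YEAR * capacity_mbit / 1e9


# Hypothetical 1 GB DIMM = 8 * 1024 megabits.
capacity = 8 * 1024
for fit in (25_000, 75_000):
    print(f"{fit} FIT/Mb on a 1 GB DIMM: "
          f"{errors_per_year(fit, capacity):.0f} errors per year")
```

Because the number of CEs per DIMM is highly variable, such an average is dominated by a small number of DIMMs with very many errors and does not describe a typical DIMM.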
Conclusion 2: More powerful error codes (chip-kill versus
SECDED) can reduce the rate of UEs by a factor of 3–8.
We observe that platforms with more powerful error codes
(chip-kill versus SECDED) were able to significantly reduce
the rate of UEs (from 0.25%–0.4% per DIMM per year for
SECDED-based platforms, to 0.05%–0.08% for chip-kill-based
platforms). Nonetheless, the remaining incidence of UEs
makes a crash-tolerant application layer indispensable for
large-scale server farms.
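The arithmetic behind these figures can be sketched as follows. One reading of the factor of 3–8, offered here as an interpretation rather than a statement of how it was derived, is that it compares the extremes of the two ranges of per-DIMM UE rates. The fleet size of 100,000 DIMMs is a hypothetical number chosen only to show why some UEs remain even with chip-kill.

```python
def reduction_factor_range(secded_pct, chipkill_pct):
    """Smallest and largest ratio between two (low, high) rate ranges."""
    s_low, s_high = secded_pct
    c_low, c_high = chipkill_pct
    return s_low / c_high, s_high / c_low


def expected_ues_per_year(num_dimms, ue_rate_pct):
    """Expected number of DIMMs with a UE per year at a given annual rate (%)."""
    return num_dimms * ue_rate_pct / 100.0


# Annual UE rates per DIMM (%) quoted in the text.
secded = (0.25, 0.4)
chipkill = (0.05, 0.08)

low, high = reduction_factor_range(secded, chipkill)
print(f"reduction factor between {low:.1f} and {high:.1f}")

# Hypothetical fleet size, for illustration only.
fleet = 100_000
for rate in chipkill:
    print(f"{rate}% per DIMM-year: about "
          f"{expected_ues_per_year(fleet, rate):.0f} DIMMs with a UE per year")
```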
Conclusion 3: There is no evidence that newer generation
DIMMs have worse error behavior (even when controlling for
DIMM age). There is also no evidence that one technology
(DDR1, DDR2, FB-DIMM) or one manufacturer consistently
outperforms the others.
There has been much concern that advancing densities in
DRAM technology will lead to higher rates of memory errors
in future generations of DIMMs. We study DIMMs in six different platforms, which were introduced over a period of several years, and observe no evidence that CE rates increase with
newer generations. In fact, the DIMMs used in the three most
recent platforms exhibit lower CE rates than the two older
platforms, despite generally higher DIMM capacities. This
indicates that improvements in technology are able to keep
up with adversarial trends in DIMM scaling.
Conclusion 4: Within the range of temperatures our
production systems experience in the field, temperature has
a surprisingly low effect on memory errors.
Temperature is well known to increase error rates. In fact,
artificially increasing the temperature is a commonly used
tool for accelerating error rates in lab studies. Interestingly,