we find that differences in temperature in the range in which they
arise naturally in our fleet’s operation (a difference of around 20°C
between the 1st and 9th temperature deciles) seem to have only a
marginal impact on the incidence of memory errors when
controlling for other factors, such as utilization.
Conclusion 5: Error rates are strongly correlated with utilization.
We find that DIMMs in machines with high levels of utilization, as measured by CPU utilization and the amount of memory allocated, see on average 4–10 times higher rates of CEs,
even when controlling for other factors, such as temperature.
Conclusion 6: DIMM capacity tends to be correlated with CE
and UE incidence.
When considering DIMMs of the same type (manufacturer
and hardware platform), that only differ in their capacity, we
see a trend of increased CE and UE rates for higher capacity
DIMMs. Based on our data we do not have conclusive results
on the effect of chip size and chip density, but we are in the
process of conducting a more detailed study that includes these factors.
Conclusion 7: The incidence of CEs increases with age.
Given that DRAM DIMMs are devices without any mechanical components, unlike, for example, hard drives, we see a
surprisingly strong and early effect of age on error rates. For
all DIMM types we studied, aging in the form of increased CE
rates sets in after only 10–18 months in the field.
Conclusion 8: Memory errors are strongly correlated.
We observe strong correlations among CEs within the
same DIMM. A DIMM that sees a CE is 13–228 times more
likely to see another CE in the same month, compared to a
DIMM that has not seen errors. Correlations exist at short
time scales (days) and long time scales (up to 7 months).
We also observe strong correlations between CEs and UEs.
Most UEs are preceded by one or more CEs, and the presence
of prior CEs greatly increases the probability of later UEs. Still,
the absolute probabilities of observing a UE following a CE are
relatively small, between 0.1% and 2.3% per month, so replacing a DIMM solely based on the presence of CEs would be
attractive only in environments where the cost of downtime
is high enough to outweigh the cost of the expected high rate
of false positives.
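The trade-off described above can be made concrete with a little arithmetic. The following is a minimal sketch and not part of the study itself: the function names and the cost figures are hypothetical, and only the 0.1% and 2.3% monthly probabilities come from the text. It treats "replace every DIMM that shows a CE" as a predictor of a UE in the following month and asks how many replacements are spent per UE avoided, and when the expected downtime cost saved exceeds the cost of a replacement.

```python
def replacements_per_ue_avoided(p_ue_given_ce: float) -> float:
    """Expected number of DIMMs replaced for each UE actually prevented."""
    if not 0.0 < p_ue_given_ce <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return 1.0 / p_ue_given_ce


def false_positive_fraction(p_ue_given_ce: float) -> float:
    """Fraction of replaced DIMMs that would not have seen a UE that month."""
    return 1.0 - p_ue_given_ce


def replacement_pays_off(p_ue_given_ce: float,
                         cost_replacement: float,
                         cost_ue_downtime: float) -> bool:
    """True if the expected downtime cost avoided exceeds the replacement cost.

    Assumption: a single-month horizon and one fixed cost per UE; both are
    simplifications chosen for illustration only.
    """
    return p_ue_given_ce * cost_ue_downtime > cost_replacement


# The two endpoints of the per-month range reported in the text.
for p in (0.001, 0.023):
    print(f"P(UE | CE) = {p:.1%}: "
          f"about {replacements_per_ue_avoided(p):.0f} replacements per UE avoided, "
          f"{false_positive_fraction(p):.1%} false positives")
```

Under these assumptions, a policy with hypothetical costs of 100 per replacement and 10,000 per UE would pay off at the 2.3% end of the range but not at the 0.1% end, which illustrates why the approach is attractive only where downtime is very expensive.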
Conclusion 9: Error rates are unlikely to be dominated by soft errors.
The strong correlation between errors in a DIMM at both
short and long time scales, together with the correlation
between utilization and errors, leads us to believe that a large
fraction of the errors are due to hard errors.
Conclusion 9 is an interesting observation, since much previous
work has assumed that soft errors are the dominating
error mode in DRAM. Some earlier work estimates hard errors
to be orders of magnitude less common than soft errors [21] and
to make up about 2% of all errors [1]. Conclusion 9 might also
explain the significantly higher rates of memory errors we
observe compared to previous studies.
We would like to thank Luiz Barroso, Urs Hoelzle, Chris
Johnson, Nick Sanders, and Kai Shen for their feedback on
drafts of this paper. We would also like to thank those who
contributed directly or indirectly to this work: Kevin Bartz,
Bill Heavlin, Nick Sanders, Rob Sprinkle, and John Zapisek.
Special thanks to the System Health Infrastructure team for
providing the data collection and aggregation mechanisms.
Finally, the first author would like to thank the System Health
Group at Google for hosting her during the summer of 2008.
1. Mosys adds soft-error protection,
correction. Semiconductor Business
News (28 Jan. 2002).
2. Al-Ars, Z., van de Goor, A.J., Braun, J.,
Richter, D. Simulation based analysis
of temperature effect on the faulty
behavior of embedded DRAMs.
In ITC’01: Proceedings of the 2001
IEEE International Test Conference
3. Baumann, R. Soft errors in advanced
computer systems. IEEE Design Test
Comput. (2005), 258–266.
4. Borucki, L., Schindlbeck, G., Slayman, C.
Comparison of accelerated DRAM soft
error rates measured at component and
system level. In Proceedings of 46th
Annual International Reliability Physics
5. Chang, F., Dean, J., Ghemawat, S.,
Hsieh, W. C., Wallach, D.A., Burrows,
M., Chandra, T., Fikes, A., Gruber, R.E.
Bigtable: A distributed storage system
for structured data. In Proceedings of
6. Chen, C., Hsiao, M. Error-correcting
codes for semiconductor memory
applications: A state-of-the-art review.
IBM J. Res. Dev. 28, 2 (1984), 124–134.
7. Dell, T.J. A white paper on the benefits
of chipkill-correct ECC for PC server
main memory. IBM Microelectronics
8. Hamamoto, T., Sugiura, S., Sawada,
S. On the retention time distribution
of dynamic random access memory
(DRAM). IEEE Trans. Electron Dev. 45,
6 (1998), 1300–1309.
9. Johnston, A.H. Scaling and
technology issues for soft error
rates. In Proceedings of the 4th Annual
Conference on Reliability (2000).
10. Li, X., Shen, K., Huang, M., Chu, L. A
memory soft error measurement on
production systems. In Proceedings of
USENIX Annual Technical Conference
11. May, T.C., Woods, M.H. Alpha-particle-induced soft errors in dynamic
memories. IEEE Trans. Electron
Dev. 26, 1 (1979).
12. Messer, A., Bernadat, P., Fu, G., Chen, D.,
Dimitrijevic, Lie, D., Mannaru, D. D.,
Riska, R., Milojicic, D. Susceptibility of
commodity systems and software to
memory soft errors.
Bianca Schroeder (email@example.com.edu),
Computer Science Department,
University of Toronto, Toronto, Canada.
Eduardo Pinheiro, Google Inc., Mountain View, CA.
Wolf-Dietrich Weber, Google Inc.,
Mountain View, CA.
© 2011 ACM 0001-0782/11/0200 $10.00