Figure 6. Memory errors and DIMM capacity: the graph shows, for different platform-manufacturer pairs, the factor increase in CE rates, CE probabilities, and UE probabilities when doubling the capacity of a DIMM. (Y-axis: factor increase when doubling capacity (GB); series: CE Prob, CE Rate, UE Prob; x-axis pairs: A-1, B-1, B-2, D-6, E-1, E-2, F-1.)
also looked for correlations of chip size with the incidence of CEs and UEs. Again we observe no clear trends. We also
repeated the study of the chip size effect without taking information on the manufacturer and/or age into account, again without any clear trends emerging.
The best we can conclude therefore is that any chip size
effect is unlikely to dominate error rates given that the trends
are not consistent across various other confounders, such as
age and manufacturer.
5. A CLOSER LOOK AT CORRELATIONS
The goal of this section is to study correlations between
errors. Understanding correlations might help identify when
a DIMM is likely to produce a large number of errors in the
future and replace it before it starts to cause serious problems.
We begin by looking at correlations between CEs within
the same DIMM. Figure 7 shows the probability of seeing a CE
in a given month, depending on whether there were CEs in the
same month (group of bars on the left) or the previous month
(group of bars on the right). As the graph shows, for each platform the monthly CE probability increases dramatically in
the presence of prior errors. In more than 85% of the cases
a CE is followed by at least one more CE in the same month.
Depending on the platform, this corresponds to an increase
in probability of between 13× and more than 90×, compared to an
average month. Seeing CEs in the previous month also significantly increases the probability of seeing a CE: the probability
increases by factors of 35× to more than 200×, compared to
the case when the previous month had no CEs.
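The previous-month comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: the function name `previous_month_factor`, the input layout (a mapping from a DIMM identifier to a list of monthly CE counts), and the toy data are all assumptions made for the example.

```python
from typing import Dict, List


def previous_month_factor(monthly_ces: Dict[str, List[int]]) -> float:
    """Return P(CE in month t | CE in month t-1) / P(CE in month t | no CE in month t-1).

    `monthly_ces` maps a (hypothetical) DIMM id to its CE counts per month.
    """
    hits_after_ce = months_after_ce = 0
    hits_after_clean = months_after_clean = 0
    for counts in monthly_ces.values():
        # Walk over consecutive (previous month, current month) pairs.
        for prev, cur in zip(counts, counts[1:]):
            if prev > 0:
                months_after_ce += 1
                hits_after_ce += int(cur > 0)
            else:
                months_after_clean += 1
                hits_after_clean += int(cur > 0)
    if months_after_ce == 0 or hits_after_clean == 0:
        raise ValueError("not enough data to form both conditional probabilities")
    p_after_ce = hits_after_ce / months_after_ce
    p_after_clean = hits_after_clean / months_after_clean
    return p_after_ce / p_after_clean


# Made-up toy data, for illustration only.
TOY_COUNTS = {
    "dimm0": [0, 0, 0, 0, 0],
    "dimm1": [0, 3, 5, 0, 0],
    "dimm2": [0, 0, 0, 1, 2],
    "dimm3": [0, 0, 0, 0, 0],
}

factor = previous_month_factor(TOY_COUNTS)
```

On the toy data, 2 of the 3 months that follow a month with CEs also contain a CE, against 2 of the 13 months that follow a clean month, so the factor is (2/3)/(2/13) = 13/3.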
We also study correlations over time periods longer than
a month and correlations between the number of errors
in 1 month and the next, rather than just the probability of
occurrence. Our study of the autocorrelation function for the
number of errors observed per DIMM per month shows that
even at lags of up to 7 months the level of correlation is still
significant. When looking at the number of errors observed
per month, we find that the larger the number of errors experienced in a month, the larger the expected number of errors in
the following month. For example, in the case of Platform C, if
the number of CEs in a month exceeds 100, the expected number of CEs in the following month is more than 1,000. This is
a 100× increase compared to the CE rate for a random month.
Graphs illustrating the above findings and more details are
included in the full paper [20].

Figure 7. Correlations between correctable and uncorrectable errors: the graph shows the probability of seeing a CE in a given month, depending on whether there were previously CEs observed in the same month (three left-most bars) or in the previous month (three right-most bars). The numbers on top of each bar show the factor increase in probability compared to the CE probability in a random month (three left-most bars) and compared to the CE probability when there was no CE in the previous month (three right-most bars). (Y-axis: CE probability (%); bars for Platform A, Platform C, and Platform D; factor labels as extracted: 13×, 64×, 91× for "CE same month" and 35×, 158×, 228× for "CE previous month".)
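To make the autocorrelation analysis mentioned in this section concrete, the sketch below computes the standard sample autocorrelation of a monthly error-count series at a given lag. The text does not say which estimator was used or how DIMMs were pooled, so this is an assumption; the function name `autocorrelation` and the example series are hypothetical.

```python
from typing import Sequence


def autocorrelation(series: Sequence[float], lag: int) -> float:
    """Sample autocorrelation of `series` at `lag` (standard biased estimator)."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(series)")
    mean = sum(series) / n
    # Sum of squared deviations (n times the sample variance).
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Sum of products of deviations that are `lag` months apart.
    numer = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return numer / denom


# Made-up monthly error counts for one DIMM, for illustration only.
example_series = [1, 2, 3, 4, 5]
lag1 = autocorrelation(example_series, 1)
```

For the example series, the deviations from the mean are -2, -1, 0, 1, 2, so the lag-1 value is 4/10 = 0.4. Evaluating the function at lags 1 through 7 on real per-DIMM series would give the kind of curve the text refers to.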
While the above observations let us conclude that CEs are
predictive of future CEs, maybe the more interesting question
is how CEs affect the probability of future uncorrectable errors.
Since UEs are simply multiple bit corruptions (too many for
the ECC to correct), one might suspect that the presence of
CEs increases the probability of seeing a UE in the future. This
is the question we focus on next.
The three left-most bars in Figure 8 show how the probability of experiencing a UE in a given month increases if there
are CEs in the same month. For all platforms, the probability
of a UE is significantly larger in a month with CEs compared
to a month without CEs. The increase in the probability of a
UE ranges from a factor of 27× (for Platform A) to more than
400× (for Platform D). While not quite as strong, the presence
of CEs in the preceding month also affects the probability of
UEs. The three right-most bars in Figure 8 show that the probability of seeing a UE in a month following a month with at
least one CE is larger by a factor of 9× to 47× than if the previous month had no CEs.
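One way to write down the two factors quoted above is as ratios of conditional probabilities. The notation is introduced here only for illustration and does not appear in the original: $\mathrm{UE}_t$ denotes the event that a DIMM sees at least one UE in month $t$, and $N^{\mathrm{CE}}_t$ denotes its number of CEs in month $t$.

```latex
F_{\text{same}} =
  \frac{P\left(\mathrm{UE}_t \mid N^{\mathrm{CE}}_t \ge 1\right)}
       {P\left(\mathrm{UE}_t \mid N^{\mathrm{CE}}_t = 0\right)},
\qquad
F_{\text{prev}} =
  \frac{P\left(\mathrm{UE}_t \mid N^{\mathrm{CE}}_{t-1} \ge 1\right)}
       {P\left(\mathrm{UE}_t \mid N^{\mathrm{CE}}_{t-1} = 0\right)}
```

Under this reading, the reported values are $F_{\text{same}}$ between 27 and more than 400, and $F_{\text{prev}}$ between 9 and 47, depending on the platform.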
We find that not only the presence, but also the rate of
observed CEs in the same month affects the probability of a
later UE. Higher rates of CEs translate to a higher probability
of UEs. We see similar, albeit somewhat weaker trends when
plotting the probability of UEs as a function of the number
of CEs in the previous month. The UE probabilities are about
Footnote: Throughout this section, when we say “in the same month” we mean within a
30-day period, rather than a calendar month.