The differences in the rates of UEs between platforms raise the question of what factors impact the frequency of UEs. We investigated a number of factors that might explain the difference in memory error rates across platforms, including temperature, utilization, DIMM age, capacity, DIMM manufacturer, and memory technology (detailed tables are included in the full paper20). While some of these affect the frequency of errors, they are not sufficient to explain the differences we observe between platforms.
While we cannot be certain about the cause of the
differences between platforms, we hypothesize that the
differences in UEs are due to differences in the error correction
codes in use. In particular, Platforms C, D, and F are the only
platforms that do not use a form of chip-kill.7 Chip-kill is a
more powerful code that can correct certain types of multiple
bit errors, while the codes in Platforms C, D, and F can only
correct single-bit errors.
While the above discussion focused on descriptive
statistics, we also studied the statistical distribution of errors
in detail. We observe that for all platforms the distribution of
the number of CEs per DIMM per year is highly variable. For
example, when looking only at those DIMMs that had at least
one CE, there is a large difference between the mean and the
median number of errors: the mean ranges from 20,000 to
140,000, while the median numbers are between 42 and 167.
When plotting the distribution of CEs over DIMMs (see
Figure 2), we find that for all platforms the top 20% of DIMMs
with errors make up over 94% of all observed errors. The shape
of the distribution curve provides evidence that it follows
a power-law distribution. Intuitively, the skew in the distribution means that a DIMM that has seen a large number of
errors is likely to see more errors in the future. This is an interesting observation as this is not a property one would expect
for soft errors (which should follow a random pattern) and
might point to hard (or intermittent) errors as a major source
of errors. This observation motivates us to take a closer look at
correlations in Section 5.
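The two summary statistics above (mean far above median, top 20% of DIMMs accounting for over 94% of errors) are exactly what a heavy-tailed distribution produces. As a minimal illustration with simulated data (not the paper's measurements), the sketch below draws per-DIMM error counts from a Pareto distribution and recomputes both statistics; the parameters `alpha` and `xmin` are arbitrary choices for the illustration.

```python
import random

random.seed(0)

# Illustrative sketch (simulated, not the paper's data): draw per-DIMM
# CE counts from a heavy-tailed Pareto distribution and reproduce the
# two summary statistics discussed in the text.
def pareto_counts(n, alpha=1.1, xmin=40):
    # Inverse-CDF sampling of a Pareto distribution with minimum xmin.
    return [int(xmin / (1 - random.random()) ** (1 / alpha)) for _ in range(n)]

counts = sorted(pareto_counts(10_000), reverse=True)

mean = sum(counts) / len(counts)
median = counts[len(counts) // 2]

# Share of all errors contributed by the top 20% of DIMMs with errors.
top20 = sum(counts[: len(counts) // 5]) / sum(counts)

print(f"mean={mean:.0f}  median={median}  top-20% share={top20:.2%}")
```

With a tail exponent close to 1, the sample mean is pulled far above the median by a few extreme DIMMs, and the top quintile dominates the total count, mirroring the skew reported above.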
Figure 2. The distribution of correctable errors over DIMMs: the graph plots the fraction Y of all errors that is made up by the fraction X of DIMMs with the largest number of errors. (X-axis: fraction of DIMMs with correctable errors, log scale from 10^−5 to 10^0; Y-axis: fraction of correctable errors.)
4. IMPACT OF EXTERNAL FACTORS
In this section, we study the effect of various factors, including DIMM capacity, temperature, utilization, and age. We
consider all platforms, except for Platform F, for which we do
not have enough data to allow for a fine-grained analysis, and
Platform E, for which we do not have data on CEs.
4.1. Temperature
Temperature is widely considered to (negatively) affect the reliability of many hardware components due to the physical changes it induces in materials. In the case of memory chips, high temperature is expected to increase leakage current,2,8 which in turn leads to a higher likelihood of flipped
bits in the memory array. In the context of large-scale production systems, understanding the exact impact of temperature
on system reliability is important, since cooling is a major cost
factor. There is a trade-off to be made between increased cooling costs and increased downtime and maintenance costs
due to higher failure rates.
To investigate the effect of temperature on memory errors,
we plot in Figure 3 (left) the monthly rate of CEs as a function
of temperature, as measured by a temperature sensor on the
motherboard of each machine. Since temperature information is considered confidential, we report relative temperature
values, where a temperature of x on the X-axis means the temperature was x°C higher than the lowest temperature observed
for a given platform. For better readability of the graphs, we
normalize CE error rates for each platform by the platform’s
average CE rate, i.e., a value of y on the Y-axis refers to a CE rate
that was y times higher than the average CE rate.
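The two normalizations just described (temperature relative to the platform minimum, CE rate relative to the platform average) can be sketched in a few lines. The monthly values below are made up for illustration; only the transform follows the text.

```python
# Sketch of the plotting transform described above (hypothetical values):
# x = degrees C above the platform's lowest observed temperature,
# y = multiple of the platform's average CE rate.
monthly = [  # (avg_temperature_C, ce_rate) for one hypothetical platform
    (24.0, 120.0), (26.5, 150.0), (28.0, 210.0), (31.0, 260.0), (33.5, 300.0),
]

t_min = min(t for t, _ in monthly)
avg_rate = sum(r for _, r in monthly) / len(monthly)

points = [(t - t_min, r / avg_rate) for t, r in monthly]
print(points)
```

By construction the coldest month maps to x = 0 and the normalized rates average to 1, so curves from platforms with very different absolute temperatures and error rates can share one set of axes.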
Figure 3 (left) shows that for all platforms higher temperatures are correlated with higher CE rates. For all platforms,
the CE rate increases by at least a factor of 2 for an increase of
temperature by 20°C; for some it nearly triples.
It is not clear whether this correlation indicates a causal
relationship, i.e., higher temperatures inducing higher error
rates. Higher temperatures might just be a proxy for higher
system utilization, i.e., the utilization increases leading independently to higher error rates and higher temperatures. In
Figure 3 (right), we therefore isolate the effects of temperature
from the effects of utilization. We divide the utilization measurements (CPU utilization) into deciles and report for each
decile the observed error rate when temperature was “high”
(above median temperature) or “low” (below median temperature). We observe that when controlling for utilization,
the effects of temperature vanish. We also repeated these
experiments with higher differences in temperature, e.g., by
comparing the effect of temperatures above the 9th decile to
temperatures below the 1st decile. In all cases, for the same
utilization levels the error rates for high versus low temperature are very similar.
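The decile analysis above can be sketched as follows. The data here is synthetic and deliberately constructed so that utilization alone drives errors while temperature merely tracks utilization, mimicking the confounding the text describes; all parameters are invented for the illustration.

```python
from collections import defaultdict
from statistics import mean, median
import random

random.seed(1)

# Synthetic monthly samples (utilization, temperature, CE count) in which
# temperature is only a proxy for utilization, as hypothesized in the text.
samples = []
for _ in range(5000):
    util = random.random()
    temp = 20 + 15 * util + random.gauss(0, 2)       # temp tracks utilization
    ces = random.expovariate(1 / (1 + 100 * util))   # errors depend on util only
    samples.append((util, temp, ces))

temp_median = median(t for _, t, _ in samples)

# Bucket samples into utilization deciles, then compare CE rates for
# "high" vs "low" temperature (above vs below the median) within each decile.
buckets = defaultdict(lambda: {"high": [], "low": []})
for util, temp, ces in samples:
    decile = min(int(util * 10), 9)
    key = "high" if temp > temp_median else "low"
    buckets[decile][key].append(ces)

for d in sorted(buckets):
    hi, lo = buckets[d]["high"], buckets[d]["low"]
    if hi and lo:
        print(f"decile {d}: high-temp rate {mean(hi):7.1f}  low-temp rate {mean(lo):7.1f}")
```

Because errors in this synthetic data depend only on utilization, the high- and low-temperature rates within each decile come out nearly identical, which is the signature the study observed when controlling for utilization.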
The results presented above were achieved by correlating the number of errors observed in a given month with
the average temperature in that month. In our analysis, we
also experimented with different measures of temperature,
including temperatures averaged over different time scales
(ranging from 1 h, to 1 day, to 1 month, to a DIMM's lifetime),
variability in temperature, and number of temperature excursions (i.e., number of times the temperature went above some