Figure 3. The effect of temperature: The left graph shows the normalized monthly rate of experiencing a correctable error (CE) as a function
of the monthly average temperature, in deciles. The right graph shows the monthly rate of experiencing a CE as a function of CPU utilization,
depending on whether the temperature was high (above median temperature) or low (below median temperature). We observe that when
isolating the effects of temperature by controlling for utilization, it has much less of an effect.
[Figure 3 panels: (left) normalized monthly CE rate vs. normalized temperature for Platforms A–D; (right) normalized monthly CE rate vs. normalized CPU utilization for high and low temperature.]
threshold). We could not find significant correlation
between errors and any of the above temperature measures when controlling for utilization.
4.2. Utilization
The observations in Section 4.1 point to system utilization
as a major contributing factor in the observed memory error
rates. Ideally, we would like to study specifically the impact
of memory utilization (i.e., number of memory accesses).
Unfortunately, obtaining data on memory utilization requires
the use of hardware counters, which our measurement infrastructure does not collect. Instead, we study two signals that
we believe provide indirect indication of memory activity:
CPU utilization and memory allocated. CPU utilization is the
load activity on the CPU(s), measured instantaneously as the percentage of available CPU cycles that are in use,
and is averaged per machine for each month. For
lack of space, we include here only results for CPU utilization.
Results for memory allocated are similar and provided in the
full paper [20].
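
The per-machine monthly averaging described above can be sketched as follows. This is a minimal illustration, not the measurement infrastructure used in the study: the function name, the record layout (machine ID, month label, utilization percentage), and the sample values are all hypothetical.

```python
from collections import defaultdict


def monthly_mean_utilization(samples):
    """Average instantaneous CPU utilization samples per machine and month.

    `samples` is an iterable of (machine_id, month, percent) tuples.
    This record layout is an assumption made for illustration only.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (machine, month) -> [sum, count]
    for machine, month, percent in samples:
        acc = totals[(machine, month)]
        acc[0] += percent
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up samples, not data from the study.
samples = [
    ("m1", "2008-01", 20.0),
    ("m1", "2008-01", 40.0),
    ("m2", "2008-01", 10.0),
]
monthly = monthly_mean_utilization(samples)
```

Each resulting value is one machine-month data point of the kind that the utilization analysis below relies on.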
Figure 4 (left) shows the normalized monthly rate of CEs
as a function of CPU utilization. We observe clear trends of
increasing CE rates with increasing CPU utilization. Averaging
across all platforms, the CE rates grow roughly logarithmically
as a function of utilization levels (based on the roughly linear
increase of error rates in the graphs, which have log scales on
the X-axis).
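
To make the "roughly logarithmic" reading concrete: a curve that looks linear on a plot with a log-scale X-axis has the form rate ≈ a + b · log10(utilization). The sketch below fits that form by ordinary least squares. The function name and the data points are hypothetical and chosen only to illustrate the relationship; they are not values from the study.

```python
import math


def fit_log_linear(utilization, ce_rate):
    """Least-squares fit of ce_rate ~ a + b * log10(utilization)."""
    xs = [math.log10(u) for u in utilization]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ce_rate) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ce_rate))
    b = sxy / sxx              # rate increase per tenfold increase in utilization
    a = mean_y - b * mean_x    # fitted rate at utilization == 1
    return a, b


# Made-up points: the rate rises by 0.5 each time utilization doubles.
utilization = [1.0, 2.0, 4.0, 8.0]
ce_rate = [1.0, 1.5, 2.0, 2.5]
a, b = fit_log_linear(utilization, ce_rate)
```

In this toy data, every doubling of utilization adds the same absolute amount to the rate, which is the signature of logarithmic growth.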
One might ask whether utilization is just a proxy for temperature,
where higher utilization leads to higher system temperatures,
which then cause higher error rates. In Figure 4
(right), we therefore isolate the effects of utilization from
those of temperature. We divide the observed temperature
values into deciles and report for each range the observed
error rates when utilization was “high” or “low.” High utilization
means the utilization (CPU utilization and allocated
memory, respectively) is above median, and low means the
utilization is below median. We observe that even when
keeping temperature fixed and focusing on one particular
temperature decile, there is still a huge difference in the error
rates, depending on the utilization. For all temperature levels,
the CE rates are higher by a factor of 2–3 for high utilization
compared to low utilization.
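
The stratified comparison above can be sketched as follows, assuming one record per machine-month with a temperature, a utilization value, and a flag for whether a CE occurred. The record layout, the function names, the rank-based decile assignment, and the synthetic data are all assumptions for illustration; this is not the analysis code used in the study.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import median


def decile_index(value, sorted_values):
    """Return 0..9 according to the rank of `value` within `sorted_values`."""
    rank = bisect_left(sorted_values, value)  # number of values below `value`
    return min(9, rank * 10 // len(sorted_values))


def ce_rate_ratio_by_temperature_decile(records):
    """Per temperature decile, return CE rate (high util) / CE rate (low util).

    `records` holds (temperature, utilization, had_ce) per machine-month.
    Utilization above the median counts as high; at or below counts as low.
    """
    records = list(records)
    temps = sorted(t for t, _, _ in records)
    util_median = median(u for _, u, _ in records)
    # decile -> group -> [machine-months with a CE, total machine-months]
    counts = defaultdict(lambda: {"high": [0, 0], "low": [0, 0]})
    for temp, util, had_ce in records:
        group = "high" if util > util_median else "low"
        cell = counts[decile_index(temp, temps)][group]
        cell[0] += int(had_ce)
        cell[1] += 1
    ratios = {}
    for decile, groups in sorted(counts.items()):
        (ce_hi, n_hi), (ce_lo, n_lo) = groups["high"], groups["low"]
        if n_hi and n_lo and ce_lo:  # skip deciles where the ratio is undefined
            ratios[decile] = (ce_hi / n_hi) / (ce_lo / n_lo)
    return ratios


# Synthetic records (not study data): at every temperature, half of the
# high-utilization machine-months and a quarter of the low ones see a CE.
records = []
for t in range(10):
    for i in range(4):
        records.append((float(t), 0.9, i < 2))
        records.append((float(t), 0.1, i < 1))
ratios = ce_rate_ratio_by_temperature_decile(records)
```

Holding the temperature decile fixed and comparing only across utilization groups is what separates the utilization effect from a possible temperature effect.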
4.3. Aging
Age is one of the most important factors in analyzing the
reliability of hardware components, since increased error
rates due to early aging/wear-out limit the lifetime of a device.
As such, we look at changes in error behavior over time for our
DRAM population, breaking it down by age, platform, technology, and correctable and uncorrectable errors (CEs and UEs).
Figure 5 shows normalized CE rates as a function of age for
all platforms that have been in production for long enough to
study age-related effects. We find that age clearly affects the
CE rates for all platforms, and we observe similar trends also
if we break the data further down by platform, manufacturer,