Figure 4. The effect of utilization: the normalized monthly CE rate as a function of CPU utilization (left), and while controlling for
temperature (right).
[Plot omitted. Y-axis (both panels): normalized monthly CE rate. X-axes: normalized CPU utilization; normalized temperature. Legend entries: Platform A, Platform B, Platform C, Platform D; CPU high, CPU low.]
and capacity (graphs included in full paper [20]).
For a more fine-grained view of the effects of aging and to
identify trends, we study the mean cumulative function (MCF)
of errors. While our full paper [20] includes several MCF plots, for
lack of space we only summarize the results here. In short, we
find that age severely affects CE rates: the incidence of errors increases as DIMMs get older, but only up to a
certain point, after which the incidence becomes almost constant
(few DIMMs start to have CEs at very old ages). The age at which
errors first start to increase and the steepness of the increase
vary by platform, manufacturer, and DRAM technology, but
the onset is generally in the 10–18 month range. We also note the lack
of infant mortality for almost all populations. We attribute
this to the weeding out of bad DIMMs that happens during
the burn-in of DIMMs prior to putting them into production.
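To make the MCF concrete, the following is a minimal sketch of a standard nonparametric estimate: at each age with at least one error, the curve grows by the number of errors at that age divided by the number of DIMMs still under observation. The estimator actually used in the full paper is not stated here, so the function name, the input layout, and the assumption that every DIMM is observed from age 0 are illustrative only.

```python
from collections import Counter


def mean_cumulative_function(event_ages, observed_until):
    """Nonparametric MCF estimate (illustrative sketch, not the authors' code).

    event_ages: dict mapping DIMM id -> list of ages (months) at which an error occurred
    observed_until: dict mapping DIMM id -> last age (months) at which the DIMM was observed
    Assumes every DIMM is observed from age 0 up to observed_until[id].
    Returns a list of (age, mcf) pairs, one per age with at least one error.
    """
    # Count errors per age, ignoring any recorded after a DIMM left observation.
    events_at = Counter()
    for dimm, ages in event_ages.items():
        for age in ages:
            if age <= observed_until[dimm]:
                events_at[age] += 1

    mcf = 0.0
    curve = []
    for age in sorted(events_at):
        # DIMMs still under observation at this age.
        at_risk = sum(1 for end in observed_until.values() if end >= age)
        mcf += events_at[age] / at_risk
        curve.append((age, mcf))
    return curve


# Hypothetical example: three DIMMs, made-up error ages.
example_events = {"d1": [12, 14], "d2": [12], "d3": []}
example_observed = {"d1": 30, "d2": 20, "d3": 10}
example_curve = mean_cumulative_function(example_events, example_observed)
```

A flat stretch in such a curve corresponds to an age range in which few new errors appear, and a steep stretch to an age range in which errors accumulate quickly.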
4.4. DIMM capacity and chip size
Since the amount of memory used in typical server systems
keeps growing from generation to generation, a commonly
asked question when projecting for future systems is how
an increase in memory affects the frequency of memory
errors. In this section, we focus on one aspect of this question:
how error rates change when the capacity
of individual DIMMs increases.
To answer this question, we consider all DIMM types (type
being defined by the combination of platform and manufacturer) that exist in our systems in two different capacities.
Typically, the capacities of these DIMM pairs are either 1GB
and 2GB, or 2GB and 4GB. Figure 6 shows, for each of these
pairs, the factors by which the monthly probability of CEs, the
CE rate, and the probability of UEs change when capacity
is doubled.
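As a rough illustration of the quantity plotted for each pair, the sketch below divides each metric of the larger-capacity DIMM by the same metric of the smaller one. The function name, the metric keys, and all numbers are hypothetical and are not taken from the study. Metrics with no data for either capacity are skipped.

```python
def doubling_factors(small, large):
    """Ratio large/small for each metric available for both capacities.

    small, large: dicts mapping metric name -> value (None if no data).
    A factor above 1 means the metric is worse for the larger DIMM.
    """
    factors = {}
    for metric, small_value in small.items():
        large_value = large.get(metric)
        # Skip metrics without data or with a zero baseline.
        if small_value is None or large_value is None or small_value == 0:
            continue
        factors[metric] = large_value / small_value
    return factors


# Made-up values for one hypothetical DIMM type in 1GB and 2GB capacities.
metrics_1gb = {"ce_prob": 0.05, "ce_rate": 200.0, "ue_prob": None}
metrics_2gb = {"ce_prob": 0.10, "ce_rate": 200.0, "ue_prob": 0.003}
example_factors = doubling_factors(metrics_1gb, metrics_2gb)
```

In this made-up case the CE probability doubles (factor 2) while the CE rate is unchanged (factor 1), and no UE factor is reported because the smaller DIMM has no UE data.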
Figure 6 indicates a trend toward worse error behavior for
increased capacities, although this trend is not consistent.
While in some cases the doubling of capacity has a clear negative effect (factors larger than 1 in the graph), in others it has
hardly any effect (factor close to 1 in the graph). For example,
for Platform A, Mfg-1, doubling the capacity increases UEs,
but not CEs. Conversely, for Platform D, Mfg-6, doubling the
capacity affects CEs, but not UEs.

Figure 5. The effect of age: the normalized monthly rate of
experiencing a CE as a function of age, by platform.
[Plot omitted. Y-axis: normalized monthly CE rate. X-axis: age (months), 0 to 35. Legend entries: Platform A, Platform B, Platform C, Platform D.]

The difference in how scaling capacity affects errors might
be due to differences in how larger DIMM capacities are
built, since a given DIMM capacity can be achieved in multiple ways. For example, a 1GB DIMM with ECC can be manufactured with 36 256-Mb chips, with 18 512-Mb chips, or with 9
1-Gb chips.
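The arithmetic behind these three configurations can be checked with a short sketch. It assumes the common ECC layout in which 64 of every 72 stored bits are data and 8 are check bits; that ratio is an assumption of this example and is not stated in the text.

```python
def dimm_capacity(num_chips, chip_mbit, data_bits=64, total_bits=72):
    """Return (raw megabits, usable data megabytes) for a chip configuration.

    data_bits/total_bits is the assumed ECC layout: 64 data bits per 72 stored bits.
    """
    raw_mbit = num_chips * chip_mbit
    data_mbit = raw_mbit * data_bits // total_bits
    data_mbyte = data_mbit // 8  # 8 bits per byte
    return raw_mbit, data_mbyte


# The three configurations named in the text: (number of chips, chip size in Mb).
configurations = [(36, 256), (18, 512), (9, 1024)]
capacities = [dimm_capacity(n, size) for n, size in configurations]
```

All three configurations hold 9216 Mb in total, of which 8192 Mb (1024 MB, that is, 1GB) is data under the assumed layout, so they differ only in the number and size of chips.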
We studied the effect of chip sizes on CEs and
UEs, controlling for capacity, platform (DIMM technology),
and age. The results are mixed. When two chip configurations were available within the same platform, capacity, and
manufacturer, we sometimes observed an increase in average
CE rates and sometimes a decrease. This indicates either that
chip size does not play a dominant role in influencing CEs or that
there are other, stronger confounders in our data that we did
not control for.
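One way to picture this kind of controlled comparison is to group DIMMs into strata that share platform, manufacturer, and capacity, and then compare average CE rates across chip sizes only within a stratum. The sketch below is a simplified illustration with hypothetical field names and made-up records. It is not the authors' analysis, and it omits the control for age, which could be added as a further component of the stratum key.

```python
from collections import defaultdict
from statistics import mean


def ce_rate_by_chip_size(records):
    """Average CE rate per chip size within each (platform, manufacturer, capacity) stratum.

    records: iterable of dicts with the hypothetical keys
             platform, manufacturer, capacity_gb, chip_mbit, ce_rate.
    Only strata containing at least two chip sizes are returned,
    since a comparison needs two configurations.
    """
    strata = defaultdict(lambda: defaultdict(list))
    for r in records:
        key = (r["platform"], r["manufacturer"], r["capacity_gb"])
        strata[key][r["chip_mbit"]].append(r["ce_rate"])
    return {
        key: {chip: mean(rates) for chip, rates in by_chip.items()}
        for key, by_chip in strata.items()
        if len(by_chip) >= 2
    }


# Made-up records: one stratum with two chip sizes, one with a single chip size.
example_records = [
    {"platform": "A", "manufacturer": "Mfg-1", "capacity_gb": 1, "chip_mbit": 256, "ce_rate": 10.0},
    {"platform": "A", "manufacturer": "Mfg-1", "capacity_gb": 1, "chip_mbit": 256, "ce_rate": 20.0},
    {"platform": "A", "manufacturer": "Mfg-1", "capacity_gb": 1, "chip_mbit": 512, "ce_rate": 12.0},
    {"platform": "D", "manufacturer": "Mfg-6", "capacity_gb": 2, "chip_mbit": 512, "ce_rate": 5.0},
]
example_comparison = ce_rate_by_chip_size(example_records)
```

Comparing only within a stratum keeps differences between platforms or manufacturers from being mistaken for an effect of chip size.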
In addition to a correlation of chip size with error rates, we
Some bars are omitted, as we do not have data on UEs for Platform B and
data on CEs for Platform E.