conditions that could proactively fail
an HDD. Making the trip levels more
sensitive (trip at lower levels) runs the
risk of failing HDDs with a few errors
that really aren’t progressing to the
point of failure. The HDD may simply
have had a series of reallocations, say,
that went smoothly, mapping out the
problematic area of the HDD. Integrators must assess the HDD manufacturer’s implementation of SMART and
see if there are other more instructive
calculations. Integrators must at least
understand the SMART data collection
and analysis process at a very low level,
then assess their specific usage pattern
to decide whether the implementation
of SMART is adequate or whether the
SMART decisions need to be moved up
to the system (RAID group) level.
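As a rough sketch of what moving the decision up to the system level can look like, the following Python fragment (assuming smartctl from the smartmontools package is installed; the attribute ID, device names, and threshold are illustrative site choices, not any vendor's trip level) reads one SMART attribute from each member of a RAID group and applies a group-wide policy.

import subprocess

# Illustrative, site-chosen policy -- not the manufacturer's SMART trip level.
REALLOCATED_SECTORS_ATTR = 5     # SMART attribute ID for reallocated sectors
GROUP_REALLOC_LIMIT = 50         # flag a drive well before the vendor threshold

def reallocated_count(device: str) -> int:
    """Return the raw reallocated-sector count reported by smartctl -A."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0] == str(REALLOCATED_SECTORS_ATTR):
            return int(fields[-1])          # RAW_VALUE is the last column
    return 0

def assess_raid_group(devices: list[str]) -> list[str]:
    """Return the members of the group that exceed the site policy."""
    return [d for d in devices if reallocated_count(d) > GROUP_REALLOC_LIMIT]

if __name__ == "__main__":
    suspects = assess_raid_group(["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"])
    if suspects:
        print("schedule proactive rebuild/replacement for:", suspects)

The point of the sketch is that the decision to act is made across the RAID group, where a rebuild can be scheduled gracefully, rather than waiting for an individual drive to cross the manufacturer's internal threshold.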
Head games and electronics. Most
head failures result from changes in
the magnetic properties, not electrical characteristics. ESD (electrostatic
discharge), high temperatures, and
physical impact from particles affect
magnetic properties. As with any highly integrated circuit, ESD can leave the
read heads in a degraded mode. Subsequent moderate to low levels of heat
may be sufficient to fail the read heads
magnetically. A recent publication
from Google didn’t find a significant
correlation between temperature and
reliability.6 In my conversations with numerous engineers from all the major HDD manufacturers, none has said that temperature does not affect head reliability, but none has published a
transfer function relating head life to
time and temperature. The read element is physically hidden and difficult
to damage, but heat can be conducted
from the shields to the read element,
affecting its magnetic properties, especially if it is already weakened by ESD.
The electronics on an HDD are complex. Failed DRAM and cracked chip
capacitors have been known to cause
HDD failure. As the HDD capacities
increase, the buffer sizes increase and
more RAM is required to cache writes.
Is RAID at the RAM level required to ensure the reliability of this ever-increasing amount of solid-state memory?
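The "RAID at the RAM level" question refers to protecting a drive's write cache the way RAID protects disks. Below is a toy illustration of the underlying parity arithmetic only, not any manufacturer's implementation: single parity across equally sized buffers lets one lost buffer be rebuilt from the survivors.

from functools import reduce

def parity(buffers: list[bytes]) -> bytes:
    """XOR parity across equally sized RAM buffers (RAID-4/5 style)."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), buffers)

def reconstruct(surviving: list[bytes], parity_block: bytes) -> bytes:
    """Rebuild the single missing buffer from the survivors plus parity."""
    return parity(surviving + [parity_block])

if __name__ == "__main__":
    bufs = [b"writeA__", b"writeB__", b"writeC__"]
    p = parity(bufs)
    # Simulate losing the second buffer to a DRAM failure, then recover it.
    recovered = reconstruct([bufs[0], bufs[2]], p)
    assert recovered == bufs[1]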
Operational Failure Data
In a number of studies on disk failure rates, all mean times between failures disagree with the manufacturers' specifications.1–3, 6, 7, 10, 11 More disconcerting are the realizations that the
failure rates are rarely constant; there
are significant differences across suppliers, and great differences within a
specific HDD family from a single supplier. These inconsistencies are further complicated by unexpected and
uncontrolled lot-to-lot differences.
In a population of HDDs that are all
the same model from a single manufacturer, there can be statistically significant subpopulations, each having
a different time-to-failure distribution
with different parameters. Analyses of
HDD data indicate these subpopulations are so different that they should
not be grouped together for analyses
because the failure causes and modes
are different. HDDs are a technology
that defies the idea of “average” failure rate or MTBF; inconsistency is
synonymous with variability and unpredictability.
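To make concrete why a single "average" rate misleads, here is a small simulation sketch (parameters invented for illustration, not measured HDD data; assumes NumPy) that pools an infant-mortality subpopulation with a wear-out subpopulation and compares the pooled mean with what each subpopulation actually does in the first year of service.

import numpy as np

rng = np.random.default_rng(0)
HOURS_PER_YEAR = 8760

# Illustrative parameters only: a weak 10% subpopulation (infant mortality,
# shape < 1) mixed with a healthy 90% subpopulation (wear-out, shape > 1).
weak    = rng.weibull(0.7, 1_000) * 20_000      # scale ~20k hours
healthy = rng.weibull(2.0, 9_000) * 300_000     # scale ~300k hours
pooled  = np.concatenate([weak, healthy])

print(f"pooled 'MTBF'       : {pooled.mean():>10.0f} h")
print(f"weak subpop mean    : {weak.mean():>10.0f} h")
print(f"healthy subpop mean : {healthy.mean():>10.0f} h")
print(f"share of first-year failures from weak subpop: "
      f"{(weak < HOURS_PER_YEAR).sum() / (pooled < HOURS_PER_YEAR).sum():.0%}")

With these made-up numbers the pooled mean looks comfortably large, yet nearly all of the failures a customer sees in the first year come from the small weak subpopulation, which is exactly the kind of behavior a single MTBF figure hides.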
The following are examples of unpredictability that existed to such an
extent that at some point in the product’s life, these subpopulations dominated the failure rate:
• Airborne contamination. Particles within the enclosure tend to fail HDDs early (scratches and head damage). This can give the appearance of an increasing failure rate. After all the contaminated HDDs fail, the failure rate often decreases (a simulation sketch of this effect follows the list).
• Design changes. Manufacturers
periodically find it necessary to reduce
cost, resolve a design issue discovered late in the test phase, or improve
yields. Often the change improves field reliability, but it can also create more problems than it solves.
For example, one design change had
an immediately positive effect on reliability, but after two years another failure mode began to dominate and the
HDD reliability became significantly
worse.
• Yield changes. HDD manufacturers are constantly tweaking their processes to improve yield. Unfortunately, HDDs are so complex that these
yield enhancements can inadvertently
reduce reliability. Continuous tweaks
can result in one month’s production being highly reliable and another
month being measurably worse.
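The airborne-contamination case above can be illustrated with a short, hypothetical simulation: a small contaminated fraction fails early, so the month-by-month empirical hazard rises and then falls once that subpopulation is exhausted. The lot sizes and Weibull parameters below are invented for illustration only.

import numpy as np

rng = np.random.default_rng(1)
HOURS_PER_MONTH = 730

# Hypothetical lot: 5% of drives shipped with particle contamination and fail
# early; the rest follow a long wear-out distribution.
contaminated = rng.weibull(1.8, 500)   * 4_000
clean        = rng.weibull(1.2, 9_500) * 400_000
lifetimes = np.concatenate([contaminated, clean])

# Empirical hazard per month: failures during the month / drives still running.
for month in range(1, 13):
    start, end = (month - 1) * HOURS_PER_MONTH, month * HOURS_PER_MONTH
    at_risk = (lifetimes >= start).sum()
    failed  = ((lifetimes >= start) & (lifetimes < end)).sum()
    print(f"month {month:2d}: hazard ~ {failed / at_risk:.4f}")

The printed hazard climbs for the first few months and then drops once the contaminated drives are gone, which is how a subpopulation can masquerade as a population-wide trend in field data.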
The net impact of variability in reli-
figure 2: Weibull time-to-failure plot for three very different populations (HDD #1, HDD #2, HDD #3); probability of failure vs. time to failure (hrs), with Weibull shape (β) and characteristic life (η) scales.
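For readers who want to reproduce the kind of Weibull analysis behind figure 2, the following sketch (simulated failure times with invented β and η; assumes NumPy and SciPy) fits the shape parameter β and characteristic life η, and computes the linearized probability-plot coordinates whose slope equals β.

import numpy as np
from scipy.stats import weibull_min

rng = np.random.default_rng(2)

# Simulated failure times (hours) standing in for one of the three populations.
TRUE_BETA, TRUE_ETA = 1.6, 50_000
times = weibull_min.rvs(TRUE_BETA, scale=TRUE_ETA, size=200, random_state=rng)

# Fit shape (beta) and characteristic life (eta); location pinned at zero.
beta, loc, eta = weibull_min.fit(times, floc=0)
print(f"fitted beta (shape) = {beta:.2f}, eta (characteristic life) = {eta:,.0f} h")

# Coordinates for a Weibull probability plot like figure 2:
# ln(t) vs. ln(-ln(1 - F)) is a straight line with slope beta when the model holds.
t = np.sort(times)
F = (np.arange(1, len(t) + 1) - 0.3) / (len(t) + 0.4)   # median-rank estimate of F
x, y = np.log(t), np.log(-np.log(1 - F))
print("slope estimate:", np.polyfit(x, y, 1)[0])

When real field data from a mixed population are forced onto a single plot like this, the points bend instead of falling on one line, which is the graphical signature of the distinct subpopulations discussed above.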