conditions that could proactively fail an HDD. Making the trip levels more sensitive (tripping at lower levels) runs the risk of failing HDDs with a few errors that really aren’t progressing to the point of failure. The HDD may simply have had, say, a series of reallocations that went smoothly and mapped out the problematic area of the HDD. Integrators must assess the HDD manufacturer’s implementation of SMART and see whether other calculations would be more instructive. At a minimum, integrators must understand the SMART data collection and analysis process at a very low level, then assess their specific usage pattern to decide whether the implementation of SMART is adequate or whether the SMART decisions need to be moved up to the system (RAID group) level.
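One way to picture a system-level decision of this kind is to look at how fast reallocations are growing rather than only at their total. The following is a minimal sketch under stated assumptions: the `Sample` record, the function `should_fail_drive`, and every limit in it are hypothetical illustrations, not part of SMART or of any manufacturer's implementation.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    hours: float       # power-on hours when the sample was taken
    reallocated: int   # cumulative reallocated-sector count at that time


def should_fail_drive(samples, count_limit=500, rate_limit=2.0, window_hours=168.0):
    """Return True if the drive should be proactively failed.

    Trips on a high absolute count, or on sustained recent growth
    (reallocations per hour over the trailing window). All limits are
    illustrative assumptions, not manufacturer values.
    `samples` must be ordered by increasing `hours`.
    """
    if not samples:
        return False
    latest = samples[-1]
    if latest.reallocated >= count_limit:
        return True
    # Keep only the samples that fall inside the trailing window.
    window = [s for s in samples if latest.hours - s.hours <= window_hours]
    oldest = window[0]
    elapsed = latest.hours - oldest.hours
    if elapsed <= 0:
        return False  # not enough history inside the window to estimate a rate
    rate = (latest.reallocated - oldest.reallocated) / elapsed
    return rate >= rate_limit
```

With these assumed limits, a drive that reallocated 40 sectors early and then stayed flat is left in service, while a drive whose count keeps climbing inside the window is failed before it reaches the absolute limit.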

Head games and electronics. Most head failures result from changes in the magnetic properties, not electrical characteristics. ESD (electrostatic discharge), high temperatures, and physical impact from particles affect magnetic properties. As with any highly integrated circuit, ESD can leave the read heads in a degraded mode. Subsequent moderate to low levels of heat may be sufficient to fail the read heads magnetically. A recent publication from Google didn’t find a significant correlation between temperature and reliability [6]. In my conversations with numerous engineers from all the major HDD manufacturers, none has said that temperature does not affect head reliability, but none has published a transfer function relating head life to time and temperature. The read element is physically hidden and difficult to damage, but heat can be conducted from the shields to the read element, affecting its magnetic properties, especially if it is already weakened by ESD.

The electronics on an HDD are complex. Failed DRAM and cracked chip capacitors have been known to cause HDD failure. As HDD capacities increase, buffer sizes increase and more RAM is required to cache writes. Is RAID at the RAM level required to assure reliability of the ever-increasing solid-state memory?

Operational Failure Data

In a number of studies on disk failure rates, all mean times between failures disagree with the manufacturers’ specification [1–3, 6, 7, 10, 11]. More disconcerting are the realizations that the failure rates are rarely constant, that there are significant differences across suppliers, and that there are great differences within a specific HDD family from a single supplier. These inconsistencies are further complicated by unexpected and uncontrolled lot-to-lot differences.

In a population of HDDs that are all the same model from a single manufacturer, there can be statistically significant subpopulations, each having a different time-to-failure distribution with different parameters. Analyses of HDD data indicate these subpopulations are so different that they should not be grouped together for analyses because the failure causes and modes are different. HDDs are a technology that defies the idea of “average” failure rate or MTBF; inconsistency is synonymous with variability and unpredictability.
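To make "different parameters" concrete, recall the standard two-parameter Weibull distribution, whose shape β and scale η are the quantities labeled in Figure 2. Its cumulative probability of failure is F(t) = 1 − exp(−(t/η)^β) and its hazard (instantaneous failure rate) is h(t) = (β/η)(t/η)^(β−1). The sketch below evaluates both; the three parameter sets are hypothetical values chosen only for illustration and are not read from the figure.

```python
import math


def weibull_cdf(t, beta, eta):
    """Probability that a unit has failed by time t (t >= 0)."""
    return 1.0 - math.exp(-((t / eta) ** beta))


def weibull_hazard(t, beta, eta):
    """Instantaneous failure rate at time t (t > 0)."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)


# Hypothetical subpopulations: (label, shape beta, scale eta in hours).
populations = [
    ("decreasing rate", 0.7, 100_000.0),
    ("constant rate", 1.0, 100_000.0),
    ("increasing rate", 1.5, 100_000.0),
]

for label, beta, eta in populations:
    early = weibull_hazard(1_000.0, beta, eta)
    late = weibull_hazard(50_000.0, beta, eta)
    print(f"{label}: h(1,000 h) = {early:.3e}, h(50,000 h) = {late:.3e}")
```

Only the β = 1 case has a constant failure rate, which is the only case in which a single MTBF number fully describes the population. With β below 1 the rate falls over time, and with β above 1 it rises.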

The following are examples of unpredictability in which, at some point in the product’s life, a subpopulation dominated the failure rate:

Airborne contamination. Particles within the enclosure tend to fail HDDs early (scratches and head damage). This can give the appearance of an increasing failure rate. After all the contaminated HDDs fail, the failure rate often decreases.

Design changes. Manufacturers periodically find it necessary to reduce cost, resolve a design issue discovered late in the test phase, or improve yields. Often, the change creates an improvement in field reliability, but can create more problems than it solves. For example, one design change had an immediately positive effect on reliability, but after two years another failure mode began to dominate and the HDD reliability became significantly worse.

Yield changes. HDD manufacturers are constantly tweaking their processes to improve yield. Unfortunately, HDDs are so complex that these yield enhancements can inadvertently reduce reliability. Continuous tweaks can result in one month’s production being highly reliable and another month being measurably worse.
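The rise and fall of the failure rate described for a contaminated subpopulation can be reproduced with a two-component Weibull mixture. This is a minimal sketch, assuming a hypothetical 5 percent of drives that fail early (β = 3, η = 2,000 hours) mixed with a main population that has a constant failure rate (β = 1, η = 1,000,000 hours). These numbers are invented for illustration and do not come from field data.

```python
import math


def weibull_survival(t, beta, eta):
    """Probability that a unit is still working at time t."""
    return math.exp(-((t / eta) ** beta))


def weibull_pdf(t, beta, eta):
    """Failure density at time t (t > 0)."""
    return (beta / eta) * (t / eta) ** (beta - 1.0) * weibull_survival(t, beta, eta)


def mixture_hazard(t, fraction, weak, main):
    """Hazard of a population in which `fraction` of units follow `weak`.

    `weak` and `main` are (beta, eta) tuples. The mixture hazard is the
    mixture density divided by the mixture survival probability.
    """
    density = fraction * weibull_pdf(t, *weak) + (1.0 - fraction) * weibull_pdf(t, *main)
    survival = (fraction * weibull_survival(t, *weak)
                + (1.0 - fraction) * weibull_survival(t, *main))
    return density / survival


# Hypothetical parameters (assumptions for illustration only).
CONTAMINATED = (3.0, 2_000.0)      # fails early
MAIN = (1.0, 1_000_000.0)          # constant failure rate
FRACTION_CONTAMINATED = 0.05

for hours in (100.0, 2_000.0, 10_000.0):
    rate = mixture_hazard(hours, FRACTION_CONTAMINATED, CONTAMINATED, MAIN)
    print(f"t = {hours:>8.0f} h  hazard = {rate:.3e} per hour")
```

Under these assumptions the population failure rate climbs while the contaminated drives are failing, then drops back toward the main population's rate once they are gone, even though neither subpopulation has a rate that first rises and then falls.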

The net impact of variability in reli-

Figure 2: Weibull time to failure plot for three very different populations (HDD #1, HDD #2, HDD #3). Axes: Probability of Failure versus Time to Failure, hrs.
References:
