Figure 1: Fault tree for HDD read failures. The top event, "cannot read data," branches through an OR gate into operational failures (cannot find data) and latent failures (data missing). Lower-level events shown in the figure: bad servo track, bad electronics, bad read head, can’t stay on track, SMART limit exceeded, error during writing, written but destroyed, bad media, inherent bit errors, high-fly write, thermal asperities, corrosion, scratched media.

Two major categories of HDD failure can prevent access to data: those that fail the entire HDD and those that leave the HDD functioning but corrupt the data. Each of these modes has significantly different causes, probabilities, and effects. The first type of failure, which I term operational, is rather easy to detect but occurs less often than the data corruptions, or latent defects, that are not discovered until the data is read. Figure 1 is a fault tree for the inability to read data (the topmost event in the tree), showing the two basic reasons that data cannot be read.

Operational failures: Cannot find data

Operational failures occur in two ways: first, data cannot be written to the HDD; second, after data is written correctly and is still present on the HDD uncorrupted, electronic or mechanical malfunction prevents it from being retrieved.

Bad servo track. Servo data is written at regular intervals on every data track of every disk surface. The servo data is used to control the positioning of the read/write heads. Servo data is required for the heads to find and stay on a track, whether executing a read, write, or seek command. Servo-track information is written only during the manufacturing process and can be neither reconstructed using RAID nor rewritten in the field. Media defects in the servo wedges cause the HDD to lose track of the heads’ locations or where to move the head for the next read or write. Faulty servo tracks result in the inability to access data, even though the data is written and uncorrupted. Particles, contaminants, scratches, or thermal asperities can damage servo data.

Can’t stay on track. Tracks on an HDD are not perfectly circular; some are actually spiral. The head position is continuously measured and compared with where it should be. A PES (position error signal) repositions the head over the track. This repeatable run-out is all part of normal HDD head positioning control. NRRO (nonrepeatable run-out) cannot be corrected by the HDD firmware since it is nonrepeatable. Caused by mechanical tolerances from the motor bearings, actuator arm bearings, noise, vibration, and servo-loop response errors, NRRO can make the head positioning take too long to lock onto a track and ultimately produce an error. This mode can be induced by excessive wear and is exacerbated by high rotational speeds. It affects both ball and fluid-dynamic bearings. The insidious aspect of this type of problem is that it can be intermittent. Specific HDD usage conditions may cause a failure while reading data in a system, but under test conditions the problem might not recur.
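The distinction between repeatable and nonrepeatable run-out can be illustrated with a small sketch. It uses synchronous averaging, a common signal-processing way to separate a once-per-revolution pattern from everything else; it is not a description of what any drive's firmware actually does. The function names, the sample layout (one PES sample per servo sector), and the synthetic numbers are all assumptions made for illustration.

```python
import math
import random


def split_runout(pes, sectors_per_rev):
    """Split PES samples into repeatable (RRO) and nonrepeatable (NRRO) parts.

    pes: position-error samples, one per servo sector (hypothetical layout).
    Only whole revolutions are used; a trailing partial revolution is ignored.
    """
    revs = len(pes) // sectors_per_rev
    if revs == 0:
        raise ValueError("need at least one full revolution of samples")
    # RRO estimate: the mean PES seen at each sector index across revolutions.
    rro = [
        sum(pes[r * sectors_per_rev + s] for r in range(revs)) / revs
        for s in range(sectors_per_rev)
    ]
    # NRRO: whatever remains after the repeating pattern is subtracted.
    nrro = [pes[i] - rro[i % sectors_per_rev] for i in range(revs * sectors_per_rev)]
    return rro, nrro


def rms(values):
    """Root-mean-square magnitude of a sequence."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Synthetic example: a sinusoidal once-per-revolution pattern plus random noise.
rng = random.Random(0)
SECTORS, REVS = 32, 50  # made-up values
samples = [
    math.sin(2 * math.pi * (i % SECTORS) / SECTORS) + rng.gauss(0.0, 0.1)
    for i in range(SECTORS * REVS)
]
rro_est, nrro_est = split_runout(samples, SECTORS)
print(f"RRO rms = {rms(rro_est):.3f}, NRRO rms = {rms(nrro_est):.3f}")
```

The repeatable part is predictable from one revolution to the next, which is why a servo loop can compensate for it; the residual has no such pattern to exploit.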

Two very interesting examples of inability to stay on track are caused by audible noise. A video available on YouTube shows a member of Sun’s Fishworks team yelling at his disk drives and monitoring the latency in disk operations. [5] The vibrations from his yelling induce sufficient NRRO that the actuator cannot settle for over 520 ms. While most (some) of us don’t yell at our HDDs, vibrations induced by thermal alarms (warning buzzers) have also been noted to induce NRRO and cause excessive latency and time-outs.

SMART limits exceeded. Today’s HDDs collect and analyze functional and performance data to predict impending failure using SMART (self-monitoring, analysis, and reporting technology). In general, sector reallocations are expected, and many spare sectors are available on each HDD. If an excessive number of reallocations occurs in a specific time interval, however, the HDD is deemed unreliable and is failed out.
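The "too many reallocations in a time interval" rule can be sketched as a sliding-window counter. This is only an illustration of the idea: the class name, the threshold, and the window length are hypothetical, and the actual criteria used by drive firmware are vendor-specific and not given here.

```python
from collections import deque


class ReallocationMonitor:
    """Flags a drive when too many sector reallocations fall inside one time window.

    max_events and window_hours are illustrative parameters, not real SMART limits.
    """

    def __init__(self, max_events, window_hours):
        self.max_events = max_events
        self.window_hours = window_hours
        self._events = deque()  # timestamps (in hours) of recent reallocations

    def record(self, t_hours):
        """Record one reallocation at time t_hours.

        Returns True if the count inside the window now exceeds max_events.
        """
        self._events.append(t_hours)
        # Discard reallocations that have aged out of the window.
        while self._events and self._events[0] <= t_hours - self.window_hours:
            self._events.popleft()
        return len(self._events) > self.max_events


monitor = ReallocationMonitor(max_events=3, window_hours=24)
# Widely spaced reallocations are tolerated; a burst within 24 hours is not.
for t in (0, 100, 200, 300, 301, 302, 303):
    if monitor.record(t):
        print(f"drive would be failed out at t = {t} h")
```

Isolated reallocations spread over weeks never accumulate in the window, while the same number arriving in a burst does, which matches the rate-based criterion described above.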

SMART isn’t really that smart. One trade-off that HDD manufacturers face during design is the amount of RAM available for storing SMART data and the frequency and method for calculating SMART parameters. When the RAM containing SMART data becomes full, is it purged, then refilled with new data? Or is the most recent x% of the data preserved and the oldest (100 - x)% purged? The former method means that a rate calculation such as read-error-rate can be erroneous if the memory fills up during an event that produces many errors. The errors before filling RAM may not be sufficient to trigger a SMART event, nor may the errors after the purge, but had the purge not occurred, the error conditions may easily have resulted in a SMART trip.
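The effect of the purge policy can be shown with a toy simulation. The sketch below assumes a fixed-capacity log of read outcomes and a trip rule of "errors currently in the log reach a threshold"; the function name, capacity, threshold, and error pattern are all invented for illustration and do not represent any manufacturer's SMART implementation.

```python
def trips(outcomes, capacity, threshold, keep_fraction):
    """Return True if the error count in the log ever reaches threshold.

    outcomes: iterable of booleans, True meaning a read error.
    keep_fraction = 0.0 models purging everything when the log is full;
    a value between 0 and 1 keeps that fraction of the newest records.
    """
    log = []
    for is_error in outcomes:
        if len(log) == capacity:
            keep = int(capacity * keep_fraction)
            log = log[capacity - keep:]  # newest `keep` records; empty if keep == 0
        log.append(is_error)
        if sum(log) >= threshold:
            return True
    return False


# Hypothetical burst of 8 errors that straddles the moment the log fills up.
reads = [False] * 6 + [True] * 8 + [False] * 4

print("purge all when full:", trips(reads, capacity=10, threshold=6, keep_fraction=0.0))
print("keep newest 50%:    ", trips(reads, capacity=10, threshold=6, keep_fraction=0.5))
```

In this made-up scenario the full purge splits the burst into four errors before and four after, neither of which reaches the threshold of six, whereas retaining the newest half of the log keeps enough of the burst together to trip.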

In general, the SMART thresholds are set very low, missing numerous
