[Figure 1: Fault tree for HDD read failures. The top event, "cannot read data," is an OR of two branches: operational failures ("cannot find data": bad servo track, can't stay on track, SMART limit exceeded, bad electronics, bad read head) and latent failures ("data missing": errors during writing, such as high-fly writes, or data written but destroyed by bad media, thermal asperities, inherent bit errors, corrosion, or scratched media).]
Two major categories of HDD failure can prevent access to data: those
that fail the entire HDD and those that
leave the HDD functioning but corrupt the data. Each of these modes has
significantly different causes, probabilities, and effects. The first type
of failure, which I term operational,
is rather easy to detect but occurs at lower rates than the data corruptions, or latent defects, that are not discovered until the data is read. Figure 1 is a fault tree for the inability to read data (the topmost event in the tree), showing the two basic reasons that data cannot be read.
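For readers who think in code, the tree's structure can be written down directly. The following Python sketch is mine, not part of any drive specification; it encodes Figure 1's OR gates and shows that a single basic event anywhere in the tree is enough to produce the top event.

# A minimal sketch (not from the article): Figure 1's fault tree as nested
# OR gates. A leaf fires when its basic event is active; an OR gate fires
# when any of its children fires.

from dataclasses import dataclass, field

@dataclass
class OrNode:
    name: str
    children: list = field(default_factory=list)

    def occurs(self, active_events: set) -> bool:
        if not self.children:
            return self.name in active_events
        return any(child.occurs(active_events) for child in self.children)

cannot_read_data = OrNode("cannot read data", [
    OrNode("cannot find data (operational)", [
        OrNode("bad servo track"),
        OrNode("can't stay on track"),
        OrNode("SMART limit exceeded"),
        OrNode("bad electronics"),
        OrNode("bad read head"),
    ]),
    OrNode("data missing (latent)", [
        OrNode("error during writing", [OrNode("high-fly write")]),
        OrNode("written but destroyed", [
            OrNode("bad media"), OrNode("thermal asperities"),
            OrNode("inherent bit errors"), OrNode("corrosion"),
            OrNode("scratched media"),
        ]),
    ]),
])

# A single latent defect suffices to make the top event occur.
print(cannot_read_data.occurs({"corrosion"}))  # True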
Operational Failures: Cannot Find Data
Operational failures occur in two ways: either data cannot be written to the HDD at all, or the data is written correctly and remains uncorrupted on the HDD, but an electronic or mechanical malfunction prevents it from being retrieved.
Bad servo track. Servo data is written at regular intervals on every data
track of every disk surface. The servo
data is used to control the positioning
of the read/write heads. Servo data is
required for the heads to find and stay
on a track, whether executing a read,
write, or seek command. Servo-track
information is written only during
the manufacturing process and can
be neither reconstructed using RAID
nor rewritten in the field. Media defects in the servo wedges cause the HDD to lose track of where the heads are or where to move them for the next read or write. Faulty servo
tracks result in the inability to access
data, even though the data is written
and uncorrupted. Particles, contaminants, scratches, or thermal asperities
can damage servo data.
Can’t stay on track. Tracks on an
HDD are not perfectly circular; some
are actually spiral. The head position
is continuously measured and compared with where it should be, and the resulting PES (position error signal) is used to bring the head back over the track. Correcting this repeatable
run-out is all part of normal HDD head
positioning control. NRRO (nonrepeatable run-out), by contrast, cannot be corrected by the HDD firmware precisely because it is not repeatable. Caused by mechanical tolerances in the motor bearings and actuator-arm bearings, by noise and vibration, and by servo-loop response errors, NRRO can
make the head positioning take too
long to lock onto a track and ultimately produce an error. This mode can be
induced by excessive wear and is exacerbated by high rotational speeds.
It affects both ball and fluid-dynamic
bearings. The insidious aspect of this
type of problem is that it can be intermittent. Specific HDD usage conditions may cause a failure while reading
data in a system, but under test conditions the problem might not recur.
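To see how NRRO defeats the servo loop, consider this toy simulation. All of the constants are invented for illustration and are not real drive parameters: the firmware can model and subtract the repeatable run-out, but each NRRO sample is random, so heavy NRRO keeps the head from collecting enough consecutive in-window samples to declare itself on-track.

# Illustrative sketch only; constants are invented, not drive specifications.
# The servo loop learns and follows repeatable run-out (RRO), but NRRO is
# random, so strong NRRO (e.g., from vibration) keeps knocking the head out
# of the settle window until the operation times out.

import math
import random

SETTLE_WINDOW = 0.05   # tolerated off-track error, in track pitches
ON_TRACK_COUNT = 5     # consecutive in-window samples needed to settle
TIMEOUT_STEPS = 500    # servo samples before the drive reports an error

def settle_time(rro_amplitude, nrro_sigma, gain=0.5, seed=1):
    rng = random.Random(seed)
    position = 1.0                      # start one track pitch off target
    in_window = 0
    for step in range(TIMEOUT_STEPS):
        rro = rro_amplitude * math.sin(2 * math.pi * step / 100)  # predictable
        nrro = rng.gauss(0.0, nrro_sigma)                         # not predictable
        pes = position + rro + nrro     # measured position error signal
        # Firmware subtracts its RRO model; the remainder must be chased.
        error = pes - rro
        in_window = in_window + 1 if abs(error) < SETTLE_WINDOW else 0
        if in_window >= ON_TRACK_COUNT:
            return step                 # head declared on-track
        position -= gain * error        # feedback correction (NRRO leaks in)
    return None                         # settle timeout: read/write error

print("quiet drive:", settle_time(0.1, 0.005))   # settles within a few samples
print("vibrating drive:", settle_time(0.1, 0.2)) # likely None (timed out)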
Two very interesting examples of
inability to stay on track are caused
by audible noise. A video available on YouTube shows a member of Sun's Fishworks team yelling at his disk drives and monitoring the resulting latency in disk operations.[5] The vibrations from his yelling induce sufficient NRRO that the actuator cannot settle for over 520 ms. While most of us don't yell at our HDDs, vibrations induced by
thermal alarms (warning buzzers) have
also been noted to induce NRRO and
cause excessive latency and time-outs.
SMART limits exceeded. Today’s
HDDs collect and analyze functional
and performance data to predict impending failure using SMART (self-monitoring, analysis, and reporting technology). In general, sector reallocations
are expected, and many spare sectors
are available on each HDD. If an excessive number occurs in a specific time
interval, however, the HDD is deemed
unreliable and is failed out.
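Readers who want to watch these counters themselves can do so with smartmontools. The sketch below is a rough illustration (the alarm threshold is my own choice, not a vendor policy): it shells out to smartctl -A, which prints the SMART attribute table, and pulls the raw reallocated-sector count, attribute 5 on most ATA drives.

# A minimal sketch, assuming Linux with smartmontools installed; running
# smartctl typically requires root. The alarm level is illustrative only.

import subprocess

REALLOC_ALARM = 50   # illustrative: spares consumed before we worry

def reallocated_sectors(device):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    for line in out.splitlines():
        parts = line.split()
        # Table rows look like: ID# ATTRIBUTE_NAME FLAG VALUE ... RAW_VALUE
        if len(parts) >= 10 and parts[0] == "5":
            return int(parts[9])
    return None

count = reallocated_sectors("/dev/sda")
if count is not None and count > REALLOC_ALARM:
    print(f"warning: {count} reallocated sectors; drive may be failing")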
SMART isn’t really that smart. One
trade-off that HDD manufacturers
face during design is the amount of
RAM available for storing SMART data
and the frequency and method for calculating SMART parameters. When
the RAM containing SMART data becomes full, is it purged, then refilled
with new data? Or is the most recent x% of the data preserved and the oldest (1–x)% purged? The former
method means that a rate calculation
such as read-error-rate can be erroneous if the memory fills up during an
event that produces many errors. The
errors before filling RAM may not be
sufficient to trigger a SMART event,
nor may the errors after the purge, but
had the purge not occurred, the error
conditions may easily have resulted in
a SMART trip.
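A toy model makes the hazard concrete. In the sketch below, where the buffer size, trip level, and error burst are all invented, a burst of read errors that straddles a full purge is split in two and never trips SMART, while a purge that preserves the newest half of the samples sees the burst whole and trips.

# Toy model (all sizes and thresholds invented) of the purge trade-off.
# Error counts stream into a fixed-size RAM buffer; SMART trips when the
# errors recorded in the buffer reach a threshold. A burst that straddles
# a full purge is split in two and never trips, while a policy that keeps
# the newest x% of samples sees the whole burst.

CAPACITY = 100      # samples the SMART RAM can hold
TRIP_LEVEL = 40     # errors within one buffer's worth of samples
KEEP_FRACTION = 0.5 # x: newest fraction preserved by the partial purge

def smart_trips(samples, keep_fraction):
    buffer = []
    for errors in samples:
        if len(buffer) == CAPACITY:
            keep = int(CAPACITY * keep_fraction)
            # Guard: buffer[-0:] would keep everything, so purge explicitly.
            buffer = buffer[-keep:] if keep else []
        buffer.append(errors)
        if sum(buffer) >= TRIP_LEVEL:
            return True
    return False

# 60 consecutive error events, positioned so the buffer fills mid-burst.
stream = [0] * 70 + [1] * 60 + [0] * 70

print("purge everything:", smart_trips(stream, 0.0))          # False: burst split
print("keep newest 50%:", smart_trips(stream, KEEP_FRACTION)) # True: burst intact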
In general, the SMART thresholds
are set very low, missing numerous