reading of the data, the corruption is
corrected using the parity disk and
no data is lost. If one HDD, however,
has experienced an operational failure
and the RAID group is in the process of
reconstruction when the latent defect
is discovered, that data is lost. Since
latent defects persist until discovered
(read) and corrected, their rate of occurrence is an extremely important aspect of RAID reliability.
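To make the mechanism concrete, the following minimal sketch (Python, purely illustrative and not any vendor's implementation) shows single-parity recovery by XOR: any one lost or corrupted block in a stripe can be rebuilt from the surviving blocks plus the parity, but a latent defect encountered while a failed drive is being reconstructed leaves two blocks of the same stripe missing, and the data cannot be recovered.

    from functools import reduce

    def xor_parity(blocks):
        # The parity block of a stripe is the byte-wise XOR of its data blocks.
        return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks))

    def rebuild_missing(survivors, parity):
        # XOR of the surviving data blocks and the parity reproduces the one missing block.
        return xor_parity(survivors + [parity])

    d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\xaa\x55"    # one stripe on a 3+1 group
    p = xor_parity([d0, d1, d2])
    assert rebuild_missing([d0, d2], p) == d1             # one loss: recoverable
    # If d1 sits on the failed HDD and d2 holds a latent defect, two blocks of the
    # stripe are missing and single parity cannot reconstruct either of them.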
One study concludes that the BER (bit-error rate) is fairly inconsequential in terms of creating corrupted data,⁴ while another claims the rate of data corruption is five times the rate of HDD operational failures.⁸ Analyses of corrupted
data identified by specific SCSI error
codes and subsequent detailed failure analyses show that the rate of data
corruption for all causes is significant
and must be included in the reliability
model.
NetApp (Network Appliance) completed a study in late 2004 on 282,000
HDDs used in RAID architecture.
The RER (read-error rate) over three months was 8×10⁻¹⁴ errors per byte read. At the same time, another analysis of 66,800 HDDs showed an RER of approximately 3.2×10⁻¹³ errors per byte. A more recent analysis of 63,000 HDDs over five months showed a much-improved 8×10⁻¹⁵ errors per byte
read. In these studies, data corruption
is verified by the HDD manufacturer
as an HDD problem and not a result of
the operating system controlling the
RAID group.
While Jim Gray of Microsoft Research asserted that it is reasonable to transfer 4.32×10¹² bytes/day/HDD, the study of 63,000 HDDs read 7.3×10¹⁷ bytes of data in five months, an approximate read rate of 2.7×10¹¹ bytes/day/HDD.⁴ Using combinations of the RERs and the number of bytes read yields the hourly read failure rates shown in the accompanying table.
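That combination is a straightforward product: hourly read-failure rate = (read errors per byte) × (bytes read per hour). The short Python sketch below simply reproduces those products for the rates quoted above and in the accompanying table.

    # Hourly read-failure rate = RER (errors per byte read) * bytes read per hour.
    read_error_rates = {"low": 8.0e-15, "medium": 8.0e-14, "high": 3.2e-13}   # err/byte
    bytes_per_hour = {"low rate": 1.35e9, "high rate": 1.35e10}               # bytes/hr

    for rer_name, rer in read_error_rates.items():
        for rate_name, bph in bytes_per_hour.items():
            print(f"RER {rer_name:>6}, {rate_name}: {rer * bph:.2e} err/hr")
    # e.g., 8.0e-14 err/byte * 1.35e10 bytes/hr = 1.08e-3 err/hr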
Latent defects do not occur at a constant rate, but in bursts or at adjacent physical (not logical) locations. Although some latent defects are created by wear-out mechanisms, data is not available to discern wear-out from defects that occur randomly at a constant rate. Latent-defect rates are between 2 and 100 times greater than the rates for operational failures.
Potential Value of Data Scrubbing
Latent defects (data corruptions) can
occur during almost any HDD activity:
reading, writing, or simply spinning. If
not corrected, these latent defects will
result in lost data when an operational
failure occurs. They can be eliminated, however, by background scrubbing, which is essentially preventive
maintenance on data errors. During
scrubbing, which occurs during times
of idleness or low I/O activity, data is
read and compared with the parity. If
they are consistent, no action is taken.
If they are inconsistent, the corrupted
data is recovered and rewritten to the
HDD. If the media is defective, the recovered data is written to new physical
sectors on the HDD and the bad blocks
are mapped out.
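In outline, a scrub pass amounts to the loop sketched below. This is a toy in-memory model written only to mirror the steps just described (read, compare with parity, recover, rewrite, or remap); the stripe representation is an assumption for illustration, not a real controller interface.

    from functools import reduce

    def xor(blocks):
        # A stripe's parity is the byte-wise XOR of its data blocks, so XOR of the
        # surviving blocks plus the parity reproduces any single missing block.
        return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks))

    def scrub_pass(stripes):
        # Toy model: each stripe is {"data": [block, ...], "parity": block, "media_ok": bool}.
        # A block read back as None stands for a latent defect; in a real array the
        # drive's ECC or a sector checksum identifies which block is bad.
        for s in stripes:
            bad = [i for i, blk in enumerate(s["data"]) if blk is None]
            if not bad and xor(s["data"]) == s["parity"]:
                continue                                  # consistent: no action taken
            if len(bad) != 1:
                continue                                  # cases beyond this sketch (or unrecoverable)
            i = bad[0]
            survivors = [blk for j, blk in enumerate(s["data"]) if j != i]
            recovered = xor(survivors + [s["parity"]])    # recover the corrupted data
            if not s["media_ok"]:
                s["media_ok"] = True                      # defective media: map out the bad blocks
            s["data"][i] = recovered                      # rewrite (to spare sectors if remapped)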
If scrubbing does not occur, the period of time to accumulate latent defects starts when the HDD begins operation in the system. Since scrubbing
requires reading and writing data, it
can act as a time-to-failure accelerator
for HDD components with usage-dependent time-to-failure mechanisms.
The optimal scrub pattern, rate, and time of scrubbing are HDD-specific and must be determined in conjunction with the HDD manufacturer to assure that operational failure rates are not increased.
Frequent scrubbing can affect performance, but too-infrequent scrubbing makes the n+1 RAID group highly susceptible to double disk failures. Scrubbing, as with full HDD data reconstruction, has a minimum time to cover the entire HDD. The time to complete the scrub is a random variable that depends on HDD capacity and I/O activity. The operating system may impose a maximum time to complete scrubbing.
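As a rough illustration of that minimum time, a full scrub must read every byte at least once, so its duration is bounded below by capacity divided by the bandwidth the scrub is allowed to use. The figures in this sketch are placeholders, not measured values.

    # Lower bound on one full scrub pass: every byte must be read at least once.
    # Capacity and allowed background bandwidth are illustrative placeholders.
    capacity_bytes = 1.0e12               # a 1TB HDD
    scrub_bytes_per_second = 10e6         # bandwidth left for scrubbing during low I/O
    min_scrub_hours = capacity_bytes / scrub_bytes_per_second / 3600
    print(f"minimum time for a full scrub: {min_scrub_hours:.1f} hours")   # ~27.8 hours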
Range of average read error rates.

                                  Bytes read per hour
Read errors per byte per HDD      Low rate (1.35×10⁹)     High rate (1.35×10¹⁰)
Low (8.0×10⁻¹⁵)                   1.08×10⁻⁵ err/hr        1.08×10⁻⁴ err/hr
Medium (8.0×10⁻¹⁴)                1.08×10⁻⁴ err/hr        1.08×10⁻³ err/hr
High (3.2×10⁻¹³)                  4.32×10⁻⁴ err/hr        4.32×10⁻³ err/hr
Future Technology and Trade-offs
How are those failure modes going to
impact future HDDs that have more
than one-terabyte capacity? Certainly,
all the failure mechanisms that occur
in the 1TB drive will persist in higher
density drives that use perpendicular
magnetic recording (PMR) technology. PMR uses a “thick,” somewhat soft underlayer, making the media susceptible to scratching and gouging. The materials that cause media damage include softer metals and compositions that were not as great a problem in older, longitudinal magnetic recording. Future higher-density drives are likely to be even more susceptible to scratching because the track width will be narrower.
Another PMR problem that will
persist as density increases is side-track erasure. Changing the direction
of the magnetic grains also changes
the direction of the magnetic fields.
PMR has a return field that is close to
the adjacent tracks and can potentially erase data in those tracks. In general, the track spacing is wide enough
to mitigate this mechanism, but if a
particular track is written repeatedly,
the probability of side-track erasure
increases. Some applications are optimized for performance and keep the
head in a static position (few tracks).
This increases the chances not only of lube buildup (and the resulting high fly writes) but also of erasures.
One concept being developed to increase bit density is heat-assisted magnetic recording (HAMR).⁹ This technology requires a laser within the write head to heat a very small area on the media to enable writing. High-stability media using iron-platinum alloys allow bits to be recorded on much