reading of the data, the corruption is corrected using the parity disk and no data is lost. If one HDD, however, has experienced an operational failure and the RAID group is in the process of reconstruction when the latent defect is discovered, that data is lost. Since latent defects persist until discovered (read) and corrected, their rate of occurrence is an extremely important aspect of RAID reliability.

One study concludes that the BER is fairly inconsequential in terms of creating corrupted data,⁴ while another claims the rate of data corruption is five times the rate of HDD operating failures.⁸ Analyses of corrupted data identified by specific SCSI error codes and subsequent detailed failure analyses show that the rate of data corruption for all causes is significant and must be included in the reliability model.

NetApp (Network Appliance) completed a study in late 2004 on 282,000 HDDs used in RAID architectures. The RER (read-error rate) over three months was 8 × 10⁻¹⁴ errors per byte read. At the same time, another analysis of 66,800 HDDs showed an RER of approximately 3.2 × 10⁻¹³ errors per byte. A more recent analysis of 63,000 HDDs over five months showed a much-improved 8 × 10⁻¹⁵ errors per byte read. In these studies, data corruption is verified by the HDD manufacturer as an HDD problem and not a result of the operating system controlling the RAID group.
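To give a feel for these magnitudes, the sketch below estimates how many read errors to expect when a fixed number of bytes is read at each of the three reported RERs. The 1TB read size, the function names, and the Poisson-style assumption that errors are independent are illustrative assumptions, not part of the studies; the result is a rough indication only.

```python
import math

# Read-error rates (errors per byte read) reported by the three studies.
RER_BY_STUDY = {
    "282,000 HDDs": 8e-14,
    "66,800 HDDs": 3.2e-13,
    "63,000 HDDs": 8e-15,
}

# Assumption for illustration: one full pass over a 1TB drive.
BYTES_READ = 1e12


def expected_errors(rer, bytes_read):
    """Mean number of read errors for a given RER and byte count."""
    return rer * bytes_read


def prob_at_least_one(rer, bytes_read):
    """Chance of one or more errors, assuming independent (Poisson) errors."""
    return -math.expm1(-rer * bytes_read)


for study, rer in RER_BY_STUDY.items():
    mean = expected_errors(rer, BYTES_READ)
    p = prob_at_least_one(rer, BYTES_READ)
    print(f"{study}: expected errors = {mean:.3f}, P(at least one) = {p:.3f}")
```

Under these assumptions, a single full read of such a drive at the middle rate would see about 0.08 errors on average.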

While Jim Gray of Microsoft Research asserted that it is reasonable to transfer 4.32 × 10¹² bytes/day/HDD, the study of 63,000 HDDs read 7.3 × 10¹⁷ bytes of data in five months, an approximate read rate of 2.7 × 10¹¹ bytes/day/HDD.⁴ Combining the RERs with the number of bytes read yields the hourly read failure rates shown in the table here.
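Each entry in the table is the product of a read-error rate (errors per byte) and a read rate (bytes per hour). A minimal sketch of that arithmetic follows; the dictionary and function names are illustrative, and the input values are the ones listed in the table.

```python
# Bytes read per hour per HDD: the low and high cases from the table.
BYTES_PER_HOUR = {"low": 1.35e9, "high": 1.35e10}

# Read errors per byte read: the low, medium, and high cases from the table.
RER = {"low": 8.0e-15, "medium": 8.0e-14, "high": 3.2e-13}


def hourly_read_error_rates():
    """Return {(rer_case, read_rate_case): errors per hour}."""
    return {
        (rer_case, rate_case): rer * bytes_per_hour
        for rer_case, rer in RER.items()
        for rate_case, bytes_per_hour in BYTES_PER_HOUR.items()
    }


rates = hourly_read_error_rates()
for (rer_case, rate_case), errors_per_hour in rates.items():
    print(f"RER {rer_case:<6} / read rate {rate_case:<4}: {errors_per_hour:.2e} err/hr")
```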

Latent defects do not occur at a constant rate, but in bursts or in adjacent physical (not logical) locations. Although some latent defects are created by wear-out mechanisms, data is not available to distinguish those caused by wear-out from those that occur randomly at a constant rate. The read failure rates in the table are between 2 and 100 times greater than the rates for operational failures.

Potential Value of Data Scrubbing

Latent defects (data corruptions) can occur during almost any HDD activity: reading, writing, or simply spinning. If not corrected, these latent defects will result in lost data when an operational failure occurs. They can be eliminated, however, by background scrubbing, which is essentially preventive maintenance on data errors. During scrubbing, which occurs during times of idleness or low I/O activity, data is read and compared with the parity. If they are consistent, no action is taken. If they are inconsistent, the corrupted data is recovered and rewritten to the HDD. If the media is defective, the recovered data is written to new physical sectors on the HDD and the bad blocks are mapped out.

If scrubbing does not occur, the period of time to accumulate latent defects starts when the HDD begins operation in the system. Since scrubbing requires reading and writing data, it can act as a time-to-failure accelerator for HDD components with usage-dependent time-to-failure mechanisms. The optimal scrub pattern, rate, and time of scrubbing are HDD-specific and must be determined in conjunction with the HDD manufacturer to assure that operational failure rates are not increased.

Frequent scrubbing can affect performance, but too-infrequent scrubbing makes the n+1 RAID group highly susceptible to double disk failures. Scrubbing, as with full HDD data reconstruction, has a minimum time to cover the entire HDD. The time to complete the scrub is a random variable that depends on HDD capacity and I/O activity. The operating system may invoke a maximum time to complete scrubbing.
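The scrub pass described in this section (read the data, compare it with the parity, recover and rewrite what is corrupted) can be sketched for a single stripe. This is a simplified illustration, not any vendor's implementation: it assumes single XOR parity (an n+1 group), it assumes the drive reports an unreadable block, modeled here as None, and the names xor_blocks and scrub_stripe are hypothetical.

```python
from functools import reduce
from typing import List, Optional, Tuple


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub_stripe(
    data: List[Optional[bytes]], parity: bytes
) -> Tuple[List[bytes], Optional[int]]:
    """Scrub one stripe; return (data blocks, index repaired or None).

    A None entry stands for a block the drive reported as unreadable
    (an assumption of this sketch).
    """
    bad = [i for i, block in enumerate(data) if block is None]
    if len(bad) > 1:
        raise ValueError("single parity cannot recover more than one block")
    if len(bad) == 1:
        # Recover the missing block from the surviving blocks plus parity.
        survivors = [block for block in data if block is not None]
        repaired = list(data)
        repaired[bad[0]] = xor_blocks(survivors + [parity])
        return repaired, bad[0]
    if xor_blocks(data) != parity:
        # XOR alone shows that the stripe is inconsistent, not which block is bad.
        raise ValueError("parity mismatch with no reported read error")
    return list(data), None


# Example: three data blocks, the middle one has become unreadable.
blocks = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(blocks)
repaired, index = scrub_stripe([blocks[0], None, blocks[2]], parity)
print(index, repaired == blocks)
```

In a real system the recovered block would then be rewritten to the HDD, to new physical sectors if the media is defective; that step is outside this sketch.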

Range of average read error rates.

Read errors per byte per HDD    Bytes read per hour
                                Low rate (1.35 × 10⁹)    High rate (1.35 × 10¹⁰)
Low (8.0 × 10⁻¹⁵)               1.08 × 10⁻⁵ err/hr       1.08 × 10⁻⁴ err/hr
Medium (8.0 × 10⁻¹⁴)            1.08 × 10⁻⁴ err/hr       1.08 × 10⁻³ err/hr
High (3.2 × 10⁻¹³)              4.32 × 10⁻⁴ err/hr       4.32 × 10⁻³ err/hr

Future Technology and Trade-Offs

How are those failure modes going to impact future HDDs that have more than one-terabyte capacity? Certainly, all the failure mechanisms that occur in the 1TB drive will persist in higher-density drives that use perpendicular magnetic recording (PMR) technology. PMR uses a “thick,” somewhat soft underlayer, making it susceptible to media scratching and gouging. The materials that cause media damage include softer metals and compositions that were not as great a problem in older, longitudinal magnetic recording. Future higher-density drives are likely to be even more susceptible to scratching because the track width will be narrower.

Another PMR problem that will persist as density increases is side-track erasure. Changing the direction of the magnetic grains also changes the direction of the magnetic fields. PMR has a return field that is close to the adjacent tracks and can potentially erase data in those tracks. In general, the track spacing is wide enough to mitigate this mechanism, but if a particular track is written repeatedly, the probability of side-track erasure increases. Some applications are optimized for performance and keep the head in a static position (few tracks). This increases the chances of not only lube buildup (high fly writes) but also erasures.

One concept being developed to increase bit density is heat-assisted magnetic recording (HAMR).⁹ This technology requires a laser within the write head to heat a very small area on the media to enable writing. High-stability media using iron-platinum alloys allow bits to be recorded on much

References:
