smaller areas than today’s standard media without being limited by superparamagnetism. Controlling the amount and location of the heat is, of course, a significant concern.

RAID is designed to accommodate corrupted data from scratches, smears, pits, and voids. The data is re-created from the parity disk and the corrupted data is reconstructed and rewritten. Depending on the size of the media defect, this may be a few blocks or hundreds of blocks. As the areal density of the HDDs increases, the same physical size of the defect will affect more blocks or tracks and require more time for re-creation of data. One trade-off is the amount of time spent recovering corrupted data. A desktop HDD (most ATA drives) is optimized to find the data no matter how long it takes. In a desktop there is no redundancy and it is (correctly) assumed that the user would rather wait 30–60 seconds and eventually retrieve the data than to have the HDD give up and lose data.

Each HDD manufacturer has a proprietary set of recovery algorithms it employs to recover data. If the data cannot be found, the servo controller will move the heads a little to one side of the nominal center of the track, then to the other side. This off-track reading may be performed several times at different off-track distances. This is a very common process used by all HDD manufacturers, but how long can a RAID group wait for this recovery?
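The off-track reread sequence, and the option of cutting it short, can be sketched as follows. This is only an illustration: the function `read_with_offtrack_retries`, the `read_at_offset` callback, the offset values, and the one-second budget are all hypothetical, and real recovery algorithms are proprietary firmware that differs by manufacturer.

```python
import time
from typing import Callable, Optional, Sequence


def read_with_offtrack_retries(
    read_at_offset: Callable[[float], Optional[bytes]],
    offsets: Sequence[float] = (0.0, 0.05, -0.05, 0.10, -0.10),
    deadline_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[bytes]:
    """Try the nominal track center, then positions to either side of it.

    read_at_offset is a hypothetical callback that returns the data, or
    None when the read fails at that head offset. Offsets are illustrative
    fractions of a track width, not values from any real drive.

    Returns the data from the first successful read, or None if every
    offset fails or the time budget is exhausted. A RAID layer could then
    rebuild the block from redundancy instead of waiting any longer.
    """
    start = clock()
    for offset in offsets:
        if clock() - start > deadline_s:
            return None  # truncated recovery: give up before trying more offsets
        data = read_at_offset(offset)
        if data is not None:
            return data
    return None
```

A desktop drive would correspond to a long list of offsets and a generous budget; a drive in a RAID group could use a much shorter budget, because the group can re-create the data from parity.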

Some RAID integrators may choose to truncate these steps with the knowledge that the HDD will be considered failed even though it is not an operational failure. On the other hand, how long can a RAID group response be delayed while one HDD is trying to recover data that is readily recoverable using RAID? Also consider what happens when a scratch is encountered. The process of recovery for a large number of blocks, even if the process is truncated, may result in a time-out condition. The HDD is off recovering data or the RAID group is reconstructing data for so long that the performance comes to a halt; a time-out threshold is exceeded and the HDD is considered failed.

One option is to quickly call the offending HDD failed, copy all the data to a spare HDD (even the corrupted data), and resume recovery. A copy command is much quicker than reconstructing the data based on parity, and if there are no defects, little data will be corrupted. This means that reconstruction of this small amount of data will be fast and not result in the same time-out condition. The offending HDD can be (logically) taken out of the RAID group and undergo detailed diagnostics to restore the HDD and map out bad sectors.
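A minimal sketch of this copy-first strategy, assuming a single-parity (XOR) group and assuming that the blocks the suspect drive could not read are already known (marked `None` here). The names `xor_blocks` and `copy_then_reconstruct` are hypothetical and do not come from any RAID implementation.

```python
from functools import reduce
from typing import List, Optional

Block = Optional[bytes]  # None marks a block the suspect drive could not read


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks (single-parity arithmetic)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def copy_then_reconstruct(suspect: List[Block], peers: List[List[bytes]]) -> List[bytes]:
    """Build the contents of the spare drive.

    suspect: blocks from the drive being retired, indexed by stripe.
    peers:   every other drive in the group (data and parity), each a
             list of blocks aligned with the same stripe numbering.
    """
    spare = []
    for stripe, block in enumerate(suspect):
        if block is not None:
            spare.append(block)  # fast path: straight copy
        else:
            # slow path, taken only for the few unreadable blocks:
            # XOR of all peer blocks in the stripe re-creates the missing one
            spare.append(xor_blocks([drive[stripe] for drive in peers]))
    return spare
```

The slow path runs once per damaged block rather than once per block on the drive, which is why the overall operation can finish well inside a time-out that a full parity rebuild might exceed.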

In fact, a recent analysis shows the true impact of latent defects on the frequency of double disk failures [1].

Early RAID papers stated that the only failures of concern were operational failures because, once written, data does not change except by bit-rot.

Improving Reliability

Hard-disk drives don’t just fail catastrophically. They may also silently corrupt data. Unless checked or scrubbed, these data corruptions result in double disk failures if a catastrophic failure also occurs. Data loss resulting from these events is the dominant mode of failure for an n+1 RAID group. If the reliability of RAID groups is to increase, or even keep up with technology, the effects of undiscovered data corruptions must be mitigated or eliminated. Although scrubbing is one clear answer, other creative methods to deal with latent defects should be explored.
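To see why undiscovered corruptions matter so much, consider the moment one drive in an n+1 group fails operationally: the rebuild succeeds only if none of the n surviving drives holds a latent defect. The sketch below works through that arithmetic under a simplifying assumption that is not from the article: each surviving drive independently holds at least one undiscovered defect with the same probability `p_latent`. The function name is hypothetical.

```python
def p_data_loss_on_rebuild(n_surviving: int, p_latent: float) -> float:
    """Probability that at least one surviving drive holds a latent defect.

    Assumes, purely for illustration, that latent defects occur
    independently and with equal probability on every drive.
    """
    if n_surviving < 0:
        raise ValueError("n_surviving must be non-negative")
    if not 0.0 <= p_latent <= 1.0:
        raise ValueError("p_latent must be a probability")
    # 1 minus the chance that all n surviving drives are clean
    return 1.0 - (1.0 - p_latent) ** n_surviving
```

Under this model the risk grows with group size and shrinks as `p_latent` falls, which is the effect scrubbing aims for: finding and repairing defects before an operational failure exposes them.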

Multi-terabyte capacity drives using perpendicular recording will be available soon, increasing the probability of both correctable and uncorrectable errors by virtue of the narrowed track widths, lower flying heads, and susceptibility to scratching by softer particle contaminants. One mitigation factor is to turn uncorrectable errors into correctable errors through greater error-correcting capability on the drive (4KB blocks rather than 512- or 520-byte blocks) and by using the complete set of recovery steps. These will decrease performance, so RAID architects must address this trade-off.

Operational failure rates are not constant. It is necessary to analyze field data, determine failure modes and mechanisms, and implement corrective actions for those that are most problematic. The operating system should consider optimizations around these high-probability events and their effects on the RAID operation.
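One common way to express a failure rate that changes with age is the Weibull hazard function. It is used here only as an illustration; the article does not say which distribution fits any particular drive population, and the function name and parameters below are hypothetical.

```python
def weibull_hazard(t_hours: float, shape: float, scale_hours: float) -> float:
    """Instantaneous failure rate h(t) = (shape/scale) * (t/scale)**(shape - 1).

    shape < 1: failure rate falls with age (early-life failures dominate)
    shape = 1: constant failure rate (the classic exponential assumption)
    shape > 1: failure rate rises with age (wear-out)
    """
    if t_hours <= 0 or shape <= 0 or scale_hours <= 0:
        raise ValueError("all arguments must be positive")
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1.0)
```

Fitting the shape parameter to field data for a given drive model is one way to check whether a constant-rate assumption, and any MTTF figure derived from it, is reasonable for that population.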

Only when these high-probability events are included in the optimization of the RAID operation will reliability improve. Failure to address them is a recipe for disaster.

 

Related articles on queue.acm.org

You Don’t Know Jack about Disks

Dave Anderson

http://queue.acm.org/detail.cfm?id=864058

CTO Roundtable: Storage

http://queue.acm.org/detail.cfm?id=1466452

A Conversation with Jim Gray

http://queue.acm.org/detail.cfm?id=864078

References

1. Elerath, J.G. Reliability model and assessment of redundant arrays of inexpensive disks (RAID) incorporating latent defects and non-homogeneous Poisson process events. Ph.D. dissertation, Department of Mechanical Engineering, University of Maryland, 2007.

2. Elerath, J.G. and Pecht, M. Enhanced reliability modeling of RAID storage systems. In Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, (Edinburgh, UK, June 2007).

3. Elerath, J.G. and Shah, S. Server class disk drives: How reliable are they? In Proceedings of the Annual Reliability and Maintainability Symposium, (January 2004), 151–156.

4. Gray, J. and van Ingen, C. Empirical measurements of disk failure rates and error rates. Microsoft Research Technical Report, MSR-TR-2005-166, December 2005.

5. Gregg, B. Shouting in the datacenter, 2008; http://www.youtube.com/watch?v=tDacjrsceq4.

6. Pinheiro, E., Weber, W.-D., and Barroso, L.A. Failure trends in a large disk drive population. In Proceedings of the Fifth Usenix Conference on File and Storage Technologies (FAST), (February 2007).

7. Schroeder, B. and Gibson, G. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proceedings of the Fifth Usenix Conference on File and Storage Technologies (FAST), (February 2007).

8. Schwarz, T.J.E., et al. Disk scrubbing in large archival storage systems. In Proceedings of the IEEE Computer Society Symposium (2004), 1161–1170.

9. Seigler, M. and McDaniel, T. What challenges remain to achieve heat-assisted magnetic recording? Solid State Technology (Sept. 2007); http://www.solid-state.com/display_article/304597/5/ARTCL/none/none/What-challenges-remain-to-achieve-heat-assisted-magnetic-recording?/.

10. Shah, S. and Elerath, J.G. Disk drive vintage and its affect on reliability. In Proceedings of the Annual Reliability and Maintainability Symposium, (January 2004), 163–167.

11. Sun, F. and Zhang, S. Does hard-disk drive failure rate enter steady-state after one year? In Proceedings of the Annual Reliability and Maintainability Symposium, IEEE, (January 2007).

 

Jon Elerath is a staff reliability engineer at SolFocus. He has focused on hard-disk drive reliability for more than half his 35-plus-year career, which includes positions at NetApp, General Electric, Tegal, Tandem Computers, Compaq, and IBM.

© 2009 ACM 0001-0782/09/0600 $10.00
