smaller areas than today's standard media without being limited by superparamagnetism. Controlling the amount and location of the heat is, of course, a significant concern.
RAID is designed to accommodate data corrupted by scratches, smears, pits, and voids. The affected data is reconstructed from the parity disk and rewritten. Depending on the size of the media defect, this may involve a few blocks or hundreds of blocks. As the areal density of HDDs increases, a defect of the same physical size will affect more blocks or tracks and require more time to re-create the data. One trade-off is the amount of time spent recovering corrupted data.
A desktop HDD (most ATA drives) is
optimized to find the data no matter
how long it takes. In a desktop there is
no redundancy and it is (correctly) assumed that the user would rather wait
30–60 seconds and eventually retrieve
the data than have the HDD give up and lose it.
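To make the parity-based rebuild concrete, the following is a minimal sketch of single-parity (XOR) reconstruction in an n+1 group; the block size, stripe width, and the block marked as lost are assumptions made purely for illustration.

import os

BLOCK_SIZE = 4096  # assumed block size in bytes

def xor_blocks(blocks):
    """XOR equal-sized byte blocks together."""
    result = bytearray(BLOCK_SIZE)
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

def compute_parity(data_blocks):
    """The parity block is the XOR of every data block in the stripe."""
    return xor_blocks(data_blocks)

def reconstruct_missing(surviving_blocks, parity_block):
    """Re-create the single missing or corrupted block from the survivors plus parity."""
    return xor_blocks(surviving_blocks + [parity_block])

if __name__ == "__main__":
    stripe = [os.urandom(BLOCK_SIZE) for _ in range(4)]   # four data disks
    parity = compute_parity(stripe)
    lost = stripe[2]                                       # pretend a defect hit disk 2
    survivors = stripe[:2] + stripe[3:]
    assert reconstruct_missing(survivors, parity) == lost
    print("missing block rebuilt from parity")

The same XOR applied over the surviving blocks plus parity returns the lost block, which is why rebuild time grows directly with the number of blocks a defect touches.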
Each HDD manufacturer employs its own proprietary set of data-recovery algorithms. If the data
cannot be found, the servo controller
will move the heads a little to one side
of the nominal center of the track, then
to the other side. This off-track reading may be performed several times at
different off-track distances. This is a
very common process used by all HDD
manufacturers, but how long can a
RAID group wait for this recovery?
Some RAID integrators may choose
to truncate these steps with the knowledge that the HDD will be considered
failed even though it has not suffered an operational failure. On the other hand, how long can a RAID group's response be delayed while one HDD is trying to recover data that is readily recoverable using RAID? Also consider what happens when a scratch is encountered.
Recovering a large number of blocks, even if the process is truncated, may result in a time-out condition: the HDD spends so long recovering data, or the RAID group spends so long reconstructing it, that performance comes to a halt; a time-out threshold is exceeded and the HDD is declared failed.
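One way an array can bound this behavior is to cap how long any single drive may spend in its internal recovery steps before the array falls back to parity. The sketch below illustrates only that idea; the deadline value, the Drive stub, and the reconstruct_from_parity helper are assumptions for illustration, not any vendor's interface.

import time
import random

DRIVE_RECOVERY_DEADLINE_S = 7.0  # assumed cap on in-drive recovery time

class Drive:
    """Stand-in for a disk whose reads sometimes need lengthy retries."""
    def try_read(self, lba):
        time.sleep(0.05)                      # pretend one retry pass took 50 ms
        return None if random.random() < 0.9 else b"data"

def reconstruct_from_parity(lba):
    """Stand-in for rebuilding the block from the surviving disks plus parity."""
    return b"reconstructed"

def read_block(drive, lba, deadline_s=DRIVE_RECOVERY_DEADLINE_S):
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        data = drive.try_read(lba)            # one bounded attempt at the drive level
        if data is not None:
            return data
    # Deadline exceeded: stop waiting on the drive and rebuild from parity instead.
    return reconstruct_from_parity(lba)

if __name__ == "__main__":
    print(read_block(Drive(), lba=123456))

The design question the article raises is exactly where to set that deadline: too short and healthy drives get declared failed; too long and the whole group stalls behind one drive's heroics.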
One option is to quickly declare the offending HDD failed, copy all the data to a spare HDD (even the corrupted data), and resume recovery. A copy command is much quicker than reconstructing the data from parity, and unless there are many defects, little of the copied data will be corrupted. Reconstructing this small amount of data will be fast and will not trigger the same time-out condition. The offending HDD can be (logically) taken out of the RAID group and undergo detailed diagnostics to restore the HDD and map out bad sectors.
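A sketch of that copy-then-repair idea follows, assuming a simple model in which readable blocks are copied verbatim and only unreadable ones are rebuilt from parity; the block map and helper names are illustrative, not a description of any product.

def evacuate_to_spare(source_blocks, spare, rebuild_from_parity):
    """Copy every readable block; queue only unreadable blocks for parity rebuild.

    source_blocks: dict lba -> bytes, or None marking an unreadable block
    spare:         dict receiving the copied/rebuilt blocks
    rebuild_from_parity: callable lba -> bytes
    """
    needs_rebuild = []
    for lba, data in source_blocks.items():
        if data is not None:
            spare[lba] = data            # fast path: plain copy
        else:
            needs_rebuild.append(lba)    # slow path: defer to parity
    for lba in needs_rebuild:
        spare[lba] = rebuild_from_parity(lba)
    return len(needs_rebuild)

if __name__ == "__main__":
    source = {0: b"a" * 512, 1: None, 2: b"c" * 512}   # block 1 hit by a defect
    spare = {}
    rebuilt = evacuate_to_spare(source, spare, lambda lba: b"rebuilt".ljust(512, b"\0"))
    print(f"{rebuilt} block(s) reconstructed from parity; {len(spare)} blocks on spare")

The point of the split is that the expensive parity path is exercised only for the handful of blocks the defect actually destroyed.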
In fact, a recent analysis shows the
true impact of latent defects on the
frequency of double disk failures.
Early RAID papers stated that the only
failures of concern were operational
failures because, once written, data
does not change except by bit-rot.
Hard-disk drives don’t just fail catastrophically. They may also silently
corrupt data. Unless checked or
scrubbed, these data corruptions result in double disk failures if a catastrophic failure also occurs. Data loss
resulting from these events is the
dominant mode of failure for an n+1
RAID group. If the reliability of RAID
groups is to increase, or even keep
up with technology, the effects of undiscovered data corruptions must be
mitigated or eliminated. Although
scrubbing is one clear answer, other
creative methods to deal with latent
defects should be explored.
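Scrubbing amounts to periodically reading every stripe in the background and checking it against parity so that latent defects are found while the redundancy needed to repair them still exists. Below is a minimal sketch of that loop; the stripe representation and the repair_stripe callback are assumptions for illustration.

from functools import reduce

def xor(blocks):
    """XOR equal-sized byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def scrub(stripes, repair_stripe):
    """stripes: iterable of (data_blocks, parity_block); returns number of repairs."""
    repaired = 0
    for idx, (data_blocks, parity_block) in enumerate(stripes):
        if xor(data_blocks) != parity_block:   # latent defect detected
            repair_stripe(idx)                 # rewrite from the surviving good copies
            repaired += 1
    return repaired

if __name__ == "__main__":
    good = ([b"\x01" * 8, b"\x02" * 8], b"\x03" * 8)   # parity consistent
    bad  = ([b"\x01" * 8, b"\x02" * 8], b"\xff" * 8)   # silently corrupted
    print(scrub([good, bad], repair_stripe=lambda i: print(f"repairing stripe {i}")))

The trade-off, as with recovery time, is that scrub reads compete with host I/O, so the scan rate has to be throttled against foreground performance.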
Multi-terabyte capacity drives using
perpendicular recording will be available soon, increasing the probability of both correctable and uncorrectable errors by virtue of the narrowed
track widths, lower flying heads, and
susceptibility to scratching by softer
particle contaminants. One mitigating approach is to turn uncorrectable
errors into correctable errors through
greater error-correcting capability on
the drive (4KB blocks rather than 512-
or 520-byte blocks) and by using the
complete set of recovery steps. These
will decrease performance, so RAID
architects must address this trade-off.
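The benefit of the larger block format comes from pooling the ECC budget: eight 512-byte sectors each correct only their own small share of bad symbols, while one 4KB sector with a comparably sized combined code can absorb a single long defect that would overwhelm any one small sector. The numbers below are invented solely to illustrate that pooling effect, not actual drive ECC parameters.

SMALL_SECTORS = 8                    # eight 512-byte sectors cover the same 4KB of data
T_SMALL = 20                         # assumed correctable symbols per 512-byte sector
T_LARGE = SMALL_SECTORS * T_SMALL    # same total ECC budget pooled in one 4KB sector

def survives_small(defect_per_sector):
    """Each small sector must correct its own share of the defect."""
    return all(d <= T_SMALL for d in defect_per_sector)

def survives_large(defect_per_sector):
    """The large sector only needs the total damage to stay within its budget."""
    return sum(defect_per_sector) <= T_LARGE

# A scratch that wipes out 60 symbols inside a single 512-byte sector:
defect = [60, 0, 0, 0, 0, 0, 0, 0]
print("8 x 512B sectors recover:", survives_small(defect))   # False: one sector is overwhelmed
print("1 x 4KB sector recovers: ", survives_large(defect))   # True: pooled ECC absorbs it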
Operational failure rates are not
constant. It is necessary to analyze
field data, determine failure modes
and mechanisms, and implement corrective actions for those that are most
problematic. The operating system
should consider optimizations around
these high-probability events and their
effects on the RAID operation.
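As one hedged illustration of what "not constant" means, field-reliability analyses often model failure rates with a Weibull hazard, which can fall early in life (infant mortality) or rise later (wear-out) instead of sitting at a single MTTF-derived constant. The shape and scale parameters below are invented to show the behavior, not fitted to any real drive population.

import math

def weibull_hazard(t_hours, shape, scale_hours):
    """h(t) = (shape/scale) * (t/scale)^(shape-1); shape < 1 falls over time, shape > 1 rises."""
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1)

if __name__ == "__main__":
    for months in (1, 6, 12, 24, 36):
        t = months * 730.0   # rough hours per month
        infant = weibull_hazard(t, shape=0.7, scale_hours=200_000)   # early-life failures
        wearout = weibull_hazard(t, shape=2.0, scale_hours=200_000)  # wear-out behavior
        print(f"month {months:>2}: falling h(t)={infant:.2e}/hr, rising h(t)={wearout:.2e}/hr")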
Only when these high-probability
events are included in the optimization of the RAID operation will reliability improve. Failure to address
them is a recipe for disaster.
References
1. Elerath, J.G. Reliability model and assessment of redundant arrays of inexpensive disks (RAID) incorporating latent defects and non-homogeneous Poisson process events. Ph.D. dissertation, Department of Mechanical Engineering, University of Maryland.
2. Elerath, J.G. and Pecht, M. Enhanced reliability modeling of RAID storage systems. In Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (Edinburgh, U.K., June 2007).
3. Elerath, J.G. and Shah, S. Server class disk drives: How reliable are they? In Proceedings of the Annual Reliability and Maintainability Symposium (January 2004).
4. Gray, J. and van Ingen, C. Empirical measurements of disk failure rates and error rates. Microsoft Research Technical Report MSR-TR-2005-166 (December 2005).
5. Gregg, B. Shouting in the datacenter, 2008; http://
6. Pinheiro, E., Weber, W.-D., and Barroso, L.A. Failure trends in a large disk drive population. In Proceedings of the Fifth Usenix Conference on File and Storage Technologies (FAST) (February 2007).
7. Schroeder, B. and Gibson, G. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proceedings of the Fifth Usenix Conference on File and Storage Technologies (FAST) (February 2007).
8. Schwarz, T.J.E. et al. Disk scrubbing in large archival storage systems. In Proceedings of the IEEE Computer Society Symposium (2004), 1161–1170.
9. Seigler, M. and McDaniel, T. What challenges remain to achieve heat-assisted magnetic recording? Solid State Technology (Sept. 2007); http://www.
10. Shah, S. and Elerath, J.G. Disk drive vintage and its effect on reliability. In Proceedings of the Annual Reliability and Maintainability Symposium (January 2004).
11. Sun, F. and Zhang, S. Does hard-disk drive failure rate enter steady-state after one year? In Proceedings of the Annual Reliability and Maintainability Symposium, IEEE (January 2007).
Jon Elerath is a staff reliability engineer at SolFocus. He has focused on hard-disk drive reliability for more than half of his 35-plus-year career, which includes positions at NetApp, General Electric, Tegal, Tandem Computers, Compaq, and IBM.