practice
DOI: 10.1145/1516046.1516059
Article development led by queue.acm.org

Hard-Disk Drives: The Good, the Bad, and the Ugly
By Jon Elerath

New drive technologies and increased capacities create new categories of failure modes that will influence system designs.

Illustration by Superbrothers

Hard-disk drives (HDDs) are like the bread in a peanut butter and jelly sandwich: seemingly unexciting pieces of hardware necessary to hold the software. They are simply a means to an end. HDD reliability, however, has always been a significant weak link, perhaps the weak link, in data storage. In the late 1980s people recognized that HDD reliability was inadequate for large data storage systems, so redundancy was added at the system level with some brilliant software algorithms, and RAID (redundant array of independent disks) became a reality. RAID moved the reliability requirements from the HDD itself to the system of data disks. Commercial implementations of RAID include n+1 configurations such as mirroring, RAID-4, and RAID-5, and the n+2 configuration, RAID-6, which increases storage system reliability using two redundant disks (dual parity). Additionally, reliability at the RAID group level has been favorably enhanced because HDD reliability has been improving as well.

Several manufacturers produce one-terabyte HDDs, and higher capacities are being designed. With higher areal densities (also known as bit densities), lower fly-heights (the distance between the head and the disk media), and perpendicular magnetic recording technology, can HDD reliability continue to improve? The new technology required to achieve these capacities is not without concern. Are the failure mechanisms or the probability of failure any different from predecessors? Not only are there new issues to address stemming from the new technologies, but also failure mechanisms and modes vary by manufacturer, capacity, interface, and production lot.

How will these new failure modes affect system designs? Understanding failure causes and modes for HDDs using technology of the current era and the near future will highlight the need for design alternatives and trade-offs that are critical to future storage systems. Software developers and RAID architects can not only better understand the effects of their decisions, but also know which HDD failures are outside their control and which they can manage, albeit with possible adverse performance or availability consequences. Based on technology and design, where must the developers and architects place the efforts for resiliency?

This article identifies significant HDD failure modes and mechanisms, their effects and causes, and relates them to system operation. Many failure mechanisms for new HDDs remain unchanged from the past, but the insidious undiscovered data corruptions (latent defects) that have plagued all HDD designs to one degree or another will continue to worsen in the near future as areal densities increase.