clusion about the relative rates of capacity and throughput growth for hard
drives of all types—there’s obviously
no exponential law governing hard-drive throughput. By dividing capacity
by throughput, we can compute the
amount of time required to fully scan
or populate a drive. It is this duration
that dictates how long a RAID group is
operating without full parity protection. Figure 5 shows the duration such
an operation would take for the various drive types over the years.
When RAID systems were developed in the 1980s and 1990s, reconstruction times were measured in
minutes. The trend for the past 10
years is quite clear regardless of the
drive speed or its market segment: the
time to perform a RAID reconstruction is increasing exponentially as capacity far outstrips throughput. At the
extreme, rebuilding a fully populated
2TB 7200-RPM SATA disk—today’s capacity champ—after a failure would
take four hours operating at the theoretical optimal throughput. It is rare
to achieve those data rates in practice;
in the context of a heavily used system
the full bandwidth can’t be dedicated
exclusively to RAID repair without
adversely affecting performance. If
one assumes that only 10%–50% of
the total system throughput is available for reconstruction, the minutes-long RAID rebuild times of the 1990s
balloon to multiple hours or days in
practice. RAID systems operate in this
degraded state for far longer than they
once did and as a consequence are at
higher risk for data loss.
Figure 6. Projected relative reliability of single- and double-parity RAID. (Plot of annual probability of data loss for RAID-5 and RAID-6, 2009–2019.)
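The rebuild-time arithmetic described here can be sketched in a few lines of Python. This is only an illustration: the function name `rebuild_hours` is hypothetical, and the throughput figure is back-derived from the four-hour, 2TB example in the text rather than taken from any drive specification.

```python
def rebuild_hours(capacity_bytes, throughput_bytes_per_s, share=1.0):
    """Hours needed to read or write an entire drive.

    `share` is the fraction of the drive's throughput that the system
    can spare for reconstruction (1.0 means all of it).
    """
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    seconds = capacity_bytes / (throughput_bytes_per_s * share)
    return seconds / 3600


# 2TB drive, as in the text.
CAPACITY = 2e12
# Assumption: throughput chosen so that a full-speed rebuild takes the
# four hours quoted in the text (roughly 139 MB/s).
THROUGHPUT = CAPACITY / (4 * 3600)

for share in (1.0, 0.5, 0.1):
    hours = rebuild_hours(CAPACITY, THROUGHPUT, share)
    print(f"{share:>4.0%} of bandwidth: {hours:.0f} hours")
```

Under these assumptions, a drive that rebuilds in four hours at full speed needs about eight hours at a 50% share and about 40 hours at a 10% share, which is the "multiple hours or days" range described above.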
A Classification for Triple-Parity RAID
None of the existing RAID classifications apply for triple-parity RAID. One option
would be to extend the existing RAID-6 definition, but this could be confusing, as many
RAID-6 systems exist today. The next obvious choice is RAID-7, but rather than applying
the designation merely to RAID with triple-parity protection, RAID-7 should be a catchall for any RAID technique that can be extended to an arbitrary number of parity disks.
Specific techniques or deployments that fix the number of parity disks at n should use
the RAID-7.n nomenclature, with RAID-7.3 referring to triple-parity RAID, and RAID-5
and RAID-6 effectively as the degenerate forms RAID-7.1 and RAID-7.2, respectively.
Figure 7. Projected relative reliability of single-, double-, and triple-parity RAID. (Plot of annual probability of data loss for RAID-5, RAID-6, and RAID-7.3, 2009–2019.)
Latent data on hard drives can acquire defects over time, a process
blithely referred to as bit rot. To mitigate this, RAID systems typically perform background scrubbing in which
data is read, verified, and corrected
as needed to eradicate correctable
failures before they become uncorrectable.⁵ The phenomenon of scrubbing data necessarily impacts system
performance, but the time required
for a full scrub is a significant component of the reliability of the total
system. A natural tension results between how priorities are assigned to
scrubbing versus other system activity. As throughput is dwarfed by capacity, either the percentage of resources dedicated to scrubbing must
increase, or the time for a complete
scrub must increase. With the trends
noted previously, storage pools will
easily take weeks or months for a full
scrub regardless of how high a priority
scrubbing is given, further reducing
the reliability of the total system as it
becomes more likely that RAID reconstructions will encounter latent data
corruption.
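The tension between scrub priority and scrub duration can be made concrete with a small sketch. The function names and the pool figures below are hypothetical, chosen only to illustrate the trade-off; they do not describe any particular system.

```python
SECONDS_PER_DAY = 86400


def scrub_days(pool_bytes, pool_throughput_bytes_per_s, share):
    """Days for one full scrub when `share` of pool bandwidth is used."""
    seconds = pool_bytes / (pool_throughput_bytes_per_s * share)
    return seconds / SECONDS_PER_DAY


def scrub_share_needed(pool_bytes, pool_throughput_bytes_per_s, target_days):
    """Fraction of pool bandwidth needed to finish a scrub in target_days."""
    full_speed_seconds = pool_bytes / pool_throughput_bytes_per_s
    return full_speed_seconds / (target_days * SECONDS_PER_DAY)


# Assumptions for illustration only: a 500TB pool, 1 GB/s of aggregate
# throughput, and 5% of that bandwidth reserved for scrubbing.
POOL_BYTES = 500e12
POOL_THROUGHPUT = 1e9
SCRUB_SHARE = 0.05

print(f"scrub time now: {scrub_days(POOL_BYTES, POOL_THROUGHPUT, SCRUB_SHARE):.0f} days")
# If capacity doubles while throughput stays flat, either the scrub takes
# twice as long at the same share, or the share must double to keep pace.
print(f"after capacity doubles: "
      f"{scrub_days(2 * POOL_BYTES, POOL_THROUGHPUT, SCRUB_SHARE):.0f} days")
```

The two functions are inverses of one another: fixing the share gives the scrub duration, and fixing the target duration gives the share that must be taken from other system activity.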
Given the growing disparity between
the capacity growth of hard drives and
improvements to their performance,
the long-term prospects of RAID-6
must be reconsidered. The time to repair a failed drive is increasing, and at
the same time the lengthening duration of a scrub means that errors are
more likely to be encountered during
the repair. In Figure 6, we have chosen
reasonable values for the bit error rate
and annual failure rate, and a relatively modest rate of capacity growth
(doubling every three years). This is
meant to approximate the behavior
of low-cost, high-density, 7200-RPM
drives. Different values would change
the precise position of the curves, but
not their relative shapes.
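To give a feel for how capacity growth alone drives these curves, here is a deliberately simplified sketch. It is not the model behind Figure 6, whose parameters are not given in the text. It estimates only the chance that a single-parity rebuild hits at least one unreadable bit on the surviving drives. The bit error rate of 1 in 10^14, the group of seven surviving drives, and the 2TB starting capacity in 2009 are all assumptions; the three-year doubling period is from the text.

```python
import math


def capacity_bytes(year, base_year=2009, base_bytes=2e12, doubling_years=3.0):
    """Drive capacity, assuming it doubles every `doubling_years` years."""
    return base_bytes * 2 ** ((year - base_year) / doubling_years)


def p_read_error_during_rebuild(drive_bytes, surviving_drives, ber=1e-14):
    """Probability of at least one bit error while reading every
    surviving drive in full, assuming independent errors at rate `ber`.

    Computes 1 - (1 - ber)**bits using log1p/expm1 so that the tiny
    per-bit probability is not lost to floating-point rounding.
    """
    bits = drive_bytes * 8 * surviving_drives
    return -math.expm1(bits * math.log1p(-ber))


# Assumption: an 8-drive single-parity group, so 7 drives must be read.
for year in range(2009, 2020, 2):
    p = p_read_error_during_rebuild(capacity_bytes(year), surviving_drives=7)
    print(f"{year}: {p:.3f}")
```

Changing the assumed error rate or group width shifts the numbers, but with a fixed bit error rate the probability rises toward 1 as capacity grows, consistent with the observation that the curves keep their relative shapes.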
RAID-5 reached a threshold 15 years
ago at which it no longer provided adequate protection. The answer then
was RAID-6. Today RAID-6 is quickly
approaching that same threshold. In
about 10 years, RAID-6 will provide only
the level of protection that we get from