This behavior is precisely what we want
from an archival storage system: it can
gracefully handle many failure events
without losing data. Even though we
captured fewer data points for the triple inter-parity configuration, we believe the reported MTTDL is a reasonable approximation.”39
Although the Pergamum team’s effort to obtain “a reasonable approximation” to the MTTDL of its system
is praiseworthy, there are a number of
reasons to believe it overestimates the
reliability of the system in practice:
• The model draws its failures from exponential distributions (a toy simulation
in this spirit is sketched after this list). The team thus assumes that both disk
and sector failures are uncorrelated, although all observations of actual failures5,42
report significant correlations. Correlated failures greatly increase the probability
of data loss.6,13
• Other than a small reduction in
disk lifetime from each power-on
event, the Pergamum team assumes
that failure rates observed in always-on
disk usage translate to the mostly off
environment. A study43 published after
the Pergamum paper reports a quantitative accelerated life test of data retention in almost-always-off disks. It
shows that some of the 3.5-inch disks
anticipated by the Pergamum team
have data life dramatically worse in
this usage mode than 2.5-inch disks
using the same head and platter technology.
• The team assumes that disk and
sector failures are the only failures
contributing to the system failures,
although a study17 shows that other
hardware components contribute significantly.
• It assumes that its software is bug-free, despite several studies of file and
storage implementations14, 20, 31 that
uniformly report finding bugs capable of causing data loss in all systems
studied.
• It also ignores all other threats
to stored data34 as possible causes of
data loss. Among these are operator error, insider abuse, and external attack.
Each of these has been the subject of
anecdotal reports of actual data loss.
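For readers unfamiliar with this class of model, the following toy Monte Carlo simulation (a minimal sketch with arbitrary parameters, not the Pergamum team's simulator) estimates an MTTDL for a two-disk mirror by drawing independent, exponentially distributed failure and repair times. Independence is baked into every draw, which is exactly the assumption criticized in the first point above.

    import random

    def simulated_mttdl_years(mtbf_years=50.0, repair_days=7.0, trials=1000):
        # Toy MTTDL estimate for a two-disk mirror. All failure and repair
        # times are independent exponential draws -- correlated failures
        # simply cannot occur in this model.
        repair_years = repair_days / 365.0
        total = 0.0
        for _ in range(trials):
            t = 0.0
            while True:
                # Wait for the first of the two disks to fail.
                t += random.expovariate(2.0 / mtbf_years)
                # Data is lost if the survivor fails before repair completes.
                time_to_repair = random.expovariate(1.0 / repair_years)
                survivor_failure = random.expovariate(1.0 / mtbf_years)
                if survivor_failure < time_to_repair:
                    t += survivor_failure
                    break
                t += time_to_repair
            total += t
        return total / trials

    print(simulated_mttdl_years())  # roughly MTBF^2 / (2 * MTTR), ~65,000 years

Replacing the independent draws with correlated ones (for example, batches of disks failing together) would drive the estimate down sharply, which is the point of the first criticism.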
What can such models tell us?
Their results depend on both of the
following:
• The details of the simulation of
the system being studied, which, one
hopes, accurately reflect its behavior.
• The data used to drive the simulation, which, one hopes, accurately
reflects the behavior of the system’s
components.
Under certain conditions, it is reasonable to use these models to compare different storage-system technologies. The most important condition is
that the models of the two systems use
the same data. A claim that modeling
showed system A to be more reliable
than system B when the data used to
model system A had much lower failure rates for components such as disk
drives would not be credible.
These models may well be the best
tools available to evaluate different
techniques for preventing data loss,
but they aren’t good enough to answer our question. We need to know
the maximum rate at which data will
be lost. The models assume things,
such as uncorrelated errors and bug-free software, that all real-world studies show are false. The models exclude
most of the threats to which stored
data is subject. In those cases where
similar claims, such as those for disk
reliability,30,35 have been tested, they
have been shown to be optimistic. The
models thus provide an estimate of the
minimum data loss rate to be expected.
Metrics
Even if we believed the models, the
MTTDL number does not tell us how
much data was lost in the average data-loss event. Is petabyte system A with an
MTTDL of 10^6 years better than a similar-size system B with an MTTDL of
10^3 years? If the average data-loss event
in system A loses the entire petabyte,
whereas the average data-loss event in
system B loses a kilobyte, it would be
easy to argue that system B was 10^9
times better.
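To make the arithmetic behind that claim explicit, the sketch below divides the data lost in an average event by the MTTDL to get an expected annual loss rate for each hypothetical system; the figures are only the ones from the example above, not measurements of any real system.

    # Expected loss rate ~= (data lost per average event) / MTTDL,
    # using the hypothetical figures from the example above.
    PETABYTE = 10**15   # bytes
    KILOBYTE = 10**3    # bytes

    loss_a = PETABYTE / 10**6   # system A: 10^9 bytes lost per year
    loss_b = KILOBYTE / 10**3   # system B: 1 byte lost per year
    print(loss_a / loss_b)      # 10^9 -- system B comes out 10^9 times better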
Mean time to data loss is not a useful metric for how well a system stores
bits through time, because it relates to
time but not to bits. Nor is the UBER
(unrecoverable bit error rate) typically
quoted by disk manufacturers; it is the
probability that a bit will be read incorrectly regardless of how long it has
been sitting on the disk. It relates to
bits but not to time. Thus, we see that
we lack even the metric we would need
to answer our question.
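To see why UBER by itself is no better, the sketch below computes the expected number of unrecoverable errors from reading back a full petabyte at a representative UBER of one error per 10^15 bits (an assumed figure, not one quoted in the text); the answer is the same whether the data has been on the disk for a day or a decade.

    # Expected unrecoverable errors when reading back one petabyte,
    # assuming a representative UBER of 1 error per 10^15 bits read.
    UBER = 1e-15                 # probability that any given bit is read incorrectly
    bits_read = 8 * 10**15       # one petabyte = 8 * 10^15 bits
    print(UBER * bits_read)      # 8.0 -- independent of how long the bits were stored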
Let us oversimplify the problem to