get a clearer picture. Suppose we had
eliminated all possible sources of correlated data loss, from operator error
to excess heat. All that remained would
be bit rot, a process that randomly flips
the bits the system stores with a constant small probability per unit time.
In this model we can treat bits as radioactive atoms, so that the time after
which there is a 50% probability that a
bit will have flipped is the bit half-life.
The requirement of a 50% chance
that a petabyte will survive for a century translates into a bit half-life of 8×10^17 years. The current estimate of the age of the universe is 1.4×10^10 years, so this is a bit half-life approximately 6×10^7 times the age of the universe.
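The arithmetic behind both figures can be checked in a few lines; the Python sketch below assumes a petabyte of 10^15 bytes (8×10^15 bits) and the exponential-decay model just described.

    import math

    bits = 8e15        # bits in a petabyte (10^15 bytes)
    years = 100        # required survival time
    p_survive = 0.5    # target probability that every bit survives

    # Work in logarithms: the per-bit survival probability is so close to 1
    # that computing it directly would be lost to rounding in double precision.
    log_p_bit = math.log(p_survive) / bits    # ln of per-bit survival probability

    # Exponential decay: log_p_bit = -lambda * years
    lam = -log_p_bit / years                  # per-bit flip rate, per year
    half_life = math.log(2) / lam             # required bit half-life, in years

    print(f"required bit half-life: {half_life:.1e} years")               # ~8e17
    print(f"ratio to the age of the universe: {half_life / 1.4e10:.0e}")  # ~6e7

The detour through logarithms is deliberate: the per-bit survival probability differs from 1 by less than 10^-16, which is below the resolution of double-precision arithmetic.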
This bit half-life requirement clearly
shows the high degree of difficulty of
the problem we have set for ourselves.
Suppose we want to know whether a
system we are thinking of buying is
good enough to meet the 50% chance
of keeping a petabyte for a century.
Even if we are sublimely confident that
every source of data loss other than
bit rot has been totally eliminated, we
still have to run a benchmark of the
system’s bit half-life to confirm it is
longer than 6×10^7 times the age of the
universe. And this benchmark has to
be complete in a year or so; it can’t take
a century.
So we take 10^3 systems just like the
one we want to buy, write a petabyte of
data into each so we have an exabyte of
data altogether, wait a year, read the exabyte back, and check it. If the system
is just good enough, we might see five
bit flips. Or, because bit rot is a random
process, we might see more, or fewer. We
would need either a lot more than an
exabyte of data or a lot more than a year
to be reasonably sure the bit half-life
was long enough for the job. But even
an exabyte of data for a year costs 10^3 times as much as the system we want to buy.
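A rough Poisson sketch makes the difficulty concrete. It assumes the 8×10^17-year half-life derived above; the figure of five flips quoted earlier corresponds to approximating the per-year flip probability as 0.5 divided by the half-life rather than ln 2 divided by it.

    import math

    half_life = 8e17                      # years, from the calculation above
    bits = 8e18                           # one exabyte (10^18 bytes) under test
    flip_rate = math.log(2) / half_life   # per-bit flips per year

    # Expected flips in one exabyte-year: about 7 with ln 2, about 5 with the
    # simpler 0.5 / half_life approximation.
    expected = bits * flip_rate
    print(f"expected flips in one exabyte-year: {expected:.1f}")

    # Bit rot is modeled as a Poisson process, so the observed count is noisy:
    # its standard deviation is the square root of the mean, roughly 2 to 3 here.
    # Observing 3 flips or 10 flips is entirely plausible either way, so one
    # exabyte-year cannot distinguish a system that just meets the target from
    # one that misses it by a factor of two.
    print(f"standard deviation of the count: {math.sqrt(expected):.1f}")

Shrinking the relative uncertainty to a useful level means accumulating tens to hundreds of expected events, which is why the experiment needs far more data, far more time, or both.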
What this thought experiment tells us is that we are now dealing with such
large numbers of bits for such a long
time that we are never going to know
whether the systems we use are good
enough:
˲ The known causes of data loss are too various and too highly correlated for models to produce credible projections.
˲ Even if we ignore all those causes,
the experiments that would be needed
to be reasonably sure random bit rot
is not significant are too expensive, or
take too long, or both.
Measuring Failures
It wasn’t until 2007 that researchers
started publishing studies of the reliability that actual large-scale storage
systems were delivering in practice.
Enterprises such as Google9 and institutions such as the Sloan Digital Sky Survey37 and the Large Hadron Collider8 were collecting petabytes of data with long-term value that had to remain online to be useful. The annual cost of keeping a petabyte online was more than $1 million.27 It is easy to see why
questions of the economics and reliability of storage systems became the
focus of researchers’ attention.
Papers at the 2007 File and Storage
Technologies (FAST) conference used
data from NetApp35 and Google30 to
study disk-replacement rates in large
storage farms. They showed that the
manufacturers' MTTF (mean time to failure) numbers were optimistic. Subsequent analysis of the NetApp data17 showed that other components of the storage system also contributed to failures, and that:
“Interestingly, [the earlier studies] found disks are replaced much more frequently (2–4 times) than vendor-specified [replacement rates]. But as this study indicates, there are other storage subsystem failures besides disk failures that are treated as disk faults and lead to unnecessary disk replacements.”17
Two studies, one at CERN (European Organization for Nuclear Research)18 and one using data from NetApp,5 greatly improved on earlier work using data from the Internet Archive.6,36 They
studied silent data corruption—events
in which the content of a file in storage
changes with no explanation or recorded errors—in state-of-the-art storage
systems.
The NetApp study looked at the incidence of silent storage corruption in
individual disks in RAID arrays. The
data was collected over 41 months
from NetApp's filers in the field, covering more than 1.5×10^6 drives. The study found more than 4×10^5 silent corruption incidents. More than 3×10^4
of them were not detected until RAID
restoration and could thus have caused