Table 2. The four cases of message digest comparison.

              Digest unchanged          Digest changed
  Match       data OK                   deliberate alteration
  No match    data bad                  data and/or digest bad
scratch is infeasible. This approach will not scale to the forthcoming generation:

"…it is anticipated that exascale systems will experience various kinds of faults many times per day. It is also anticipated that the current approach for resilience, which relies on automatic or application-level checkpoint-restart, will not work because the time for checkpointing and restarting will exceed the mean time to failure of a full system. …

"Some projections estimate that, with the current technique, the time to checkpoint and restart may exceed the mean time to interrupt of top supercomputers before 2015. This not only means that a computation will do little progress; it also means that fault-handling protocols have to handle multiple errors—current solutions are often designed to handle single errors."7
Just as with storage, the number of components and interconnections is so large that failures are frequent, and the available bandwidths are so low, relative to that scale, that recovery is slow enough that multiple-failure situations must be handled. There is no practical, affordable way to mask these failures from applications. Application programmers will need to pay much more attention to detecting and recovering from errors in their environment. To do so, they will need both the APIs and the system environments implementing them to become much more failure-aware.
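The arithmetic behind that projection can be sketched with the classic first-order checkpoint model (often attributed to Young and Daly): with checkpoint time C and mean time to interrupt M, the best achievable checkpoint interval is roughly sqrt(2CM), and once C approaches M the useful fraction of machine time collapses toward zero. The Python below simply evaluates that approximation; the specific checkpoint times and MTTI values are invented for illustration and do not come from the article or the report it quotes.

```python
import math

def efficiency(checkpoint_time, mtti):
    """First-order (Young/Daly) estimate of the fraction of wall-clock
    time spent on useful work, given the time C to write one checkpoint
    and the machine's mean time to interrupt M.

    Approximations used:
      optimal checkpoint interval  tau ~ sqrt(2 * C * M)
      overhead fraction            ~ C / tau + tau / (2 * M)
    """
    tau = math.sqrt(2 * checkpoint_time * mtti)
    overhead = checkpoint_time / tau + tau / (2 * mtti)
    return max(0.0, 1.0 - overhead)

# Illustrative (assumed) numbers, in minutes:
for c in (5, 30, 60):          # time to write one checkpoint
    for m in (60, 30):         # mean time to interrupt
        print(f"C={c:3d} min  MTTI={m:3d} min  useful fraction ~ {efficiency(c, m):.2f}")
```

Once C is comparable to M, the estimated overhead exceeds the whole interval, which is exactly the "little progress" regime the quotation describes.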
cases, as shown in Table 2, depending on whether the digest (b) is unchanged or not. The four cases illustrate two problems:
• The bits forming the digest are no different from the bits forming the data; neither is magically incorruptible. A malign or malfunctioning service could return bad data with a digest in the ETag header that matched the data but was not the digest originally computed. Applications need to know whether the digest has been changed. A system for doing so without incorruptible storage is described in Haber et al.15
• Given the pricing structure of cloud storage services such as Amazon S3, it is too expensive to extract the entire data at intervals to confirm it is being stored correctly. Some method in which the service computes the digest of the data is needed, but simply asking the service to return the digest of a stored object is not an adequate check.33 The service must be challenged to prove its copy of the object is good. The simplest way to do this is to ask the service to compute the digest of a nonce (a random string of bits) and the object; because the service cannot predict the nonce, a correct response requires access to the data after the request is received. Systems using this technique are described in Maniatis et al.21 and Shah et al.38 A sketch of such a challenge follows this list.
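To make the challenge concrete, here is a minimal sketch in Python: at ingest, while the auditor still holds the data, it precomputes the answers to a batch of nonce challenges; later it spends one challenge per audit. The `service` object and its `digest_with_nonce` method are assumptions for illustration only; this is not the protocol of the systems cited above, which are considerably more elaborate.

```python
import hashlib
import os

def precompute_challenges(data: bytes, count: int = 10):
    """At ingest time, while the auditor still holds the data, generate
    nonces and the digests the service must later reproduce."""
    challenges = []
    for _ in range(count):
        nonce = os.urandom(32)                       # unpredictable random string
        expected = hashlib.sha256(nonce + data).hexdigest()
        challenges.append((nonce, expected))
    return challenges

def audit(service, object_id: str, challenges: list) -> bool:
    """Spend one unused challenge. 'service' is a hypothetical client whose
    digest_with_nonce(object_id, nonce) asks the remote service to return
    SHA-256(nonce || stored object); the service cannot answer correctly
    without reading the object after it sees the nonce."""
    nonce, expected = challenges.pop()
    response = service.digest_with_nonce(object_id, nonce)
    return response == expected
```

Each precomputed challenge can be used only once, since a cheating service could replay an answer it has already given; when the supply runs out, the auditor needs either a replica of the data or a fresh read to generate more.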
Early detection is a good thing: the shorter the time between detection and repair, the smaller the risk that a second error will compromise the repair. But detection is only part of the solution; the system also has to be able to repair the damaged data. It can do so only if it has replicated the data elsewhere, and only if some deduplication layer has not optimized away this replication.
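As a sketch of what that repair loop might look like, assuming a set of hypothetical replica stores with simple `get`/`put` calls and a digest recorded at write time:

```python
import hashlib

def detect_and_repair(key, replicas, known_digest):
    """Audit every replica of 'key' against the digest recorded at write
    time, then overwrite any damaged copies from a good one.
    'replicas' is a list of hypothetical store objects exposing
    get(key) -> bytes and put(key, data); repair is possible only if at
    least one undamaged, non-deduplicated copy survives."""
    good_data = None
    damaged = []
    for store in replicas:
        data = store.get(key)
        if hashlib.sha256(data).hexdigest() == known_digest:
            if good_data is None:
                good_data = data
        else:
            damaged.append(store)
    if good_data is None:
        return False               # every copy is bad; nothing to repair from
    for store in damaged:
        store.put(key, good_data)  # rewrite the damaged replicas
    return True
```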
API Enhancements
Storage APIs are starting to move in this direction. Recent interfaces to storage services2 allow the application's write call to provide not just a pointer to the data and a length, but also, optionally, the application's message digest of the data. This allows the storage system to detect whether the data was damaged on its journey from the application to the device, while it was sitting in the storage device, or while being copied back to the application. Recent research has shown that the memory buffers44 and data paths17 between the application and the storage devices contribute substantially to errors.
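A hypothetical Python rendering of such an interface is below; the `store.put(key, data, digest=...)` signature is an assumption for illustration rather than any particular vendor's API (Amazon S3's PUT, for example, serves the same end by accepting a client-computed MD5 in the Content-MD5 header). The point is that the application computes the digest before the data leaves its hands and keeps it for later read-back checks.

```python
import hashlib

class DigestMismatch(Exception):
    """Raised when the digest of the bytes read back does not match
    the digest the application computed at write time."""

def put_with_digest(store, key: str, data: bytes) -> str:
    """Write 'data' under 'key', passing along the application's own
    digest so the storage system can verify what it received.
    'store.put(key, data, digest=...)' is a hypothetical interface."""
    digest = hashlib.sha256(data).hexdigest()   # computed before the data leaves the app
    store.put(key, data, digest=digest)         # store re-hashes and rejects on mismatch
    return digest                               # keep this for later read-back checks

def get_with_check(store, key: str, expected_digest: str) -> bytes:
    """Read the object back and verify it end to end in the application."""
    data = store.get(key)
    if hashlib.sha256(data).hexdigest() != expected_digest:
        raise DigestMismatch(key)
    return data
```

Returning the digest to the caller matters: unless a copy of the digest is held outside the storage system, the read-back check proves nothing, as the earlier discussion of corruptible digests makes clear.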
Conclusion
It would be nice to end on an upbeat note, describing some technological fix that would allow applications to ignore the possibility of failures in their environment, and specifically in long-term storage. Unfortunately, in the real world, failures are inevitable. As systems scale up, failures become more frequent. Even throwing money at the problem can only reduce the incidence of failures, not eliminate them entirely.