like that—which is to say they basically
try to hide that latency since they know
the system underneath isn’t really all
that great.
The guys who built Gmail went to a
multihomed model, so if one instance
of your Gmail account got stuck, you
would basically just get moved to another data center. Actually, that capability was needed anyway just to ensure
availability. Still, part of the motivation
was that they wanted to hide the GFS
problems.
MCKUSICK: I think it’s fair to say that,
by moving to a distributed-master file
system, you’re definitely going to be able
to attack some of those latency issues.
QUINLAN: That was certainly one of our
design goals. Also, BigTable itself is a
very failure-aware system that tries to respond to failures far more rapidly than
we were able to before. Using that as our
metadata storage helps with some of
those latency issues as well.
The engineers who worked on the earliest versions of GFS weren’t particularly
shy about departing from traditional
choices in file-system design whenever
they felt the need to do so. It just so happens that the approach taken to consistency is one of the aspects of the system
where this is particularly evident.
Part of this, of course, was driven by
necessity. Since Google’s plans rested
largely on massive deployments of
commodity hardware, failures and
hardware-related faults were a given.
Beyond that, according to the original
GFS paper, there were a few compatibility
issues. “Many of our disks claimed
to the Linux driver that they supported
a range of IDE protocol versions but
in fact responded reliably only to the
more recent ones. Since the protocol
versions are very similar, these drives
mostly worked but occasionally the
mismatches would cause the drive and
the kernel to disagree about the drive’s
state. This would corrupt data silently
due to problems in the kernel. This
problem motivated our use of checksums
to detect data corruption.”
That didn’t mean just any checksumming,
however, but instead rigorous
end-to-end checksumming, with an
eye to everything from disk corruption
to TCP/IP corruption to machine backplane
corruption.
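As a rough illustration of what block-level checksumming can look like, here is a minimal Python sketch. It is not GFS code: the 64 KB block size, the choice of CRC-32, and the function names `block_checksums` and `find_corrupt_blocks` are assumptions made for this example. The idea is that the writer records one checksum per block and the reader recomputes them after the data has crossed disk, network, and backplane, so corruption introduced anywhere along the path shows up as a mismatch.

```python
import zlib

# Assumption for this sketch: data is checksummed in fixed-size blocks.
BLOCK_SIZE = 64 * 1024


def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return one CRC-32 value per block of `data`."""
    return [
        zlib.crc32(data[offset:offset + block_size])
        for offset in range(0, len(data), block_size)
    ]


def find_corrupt_blocks(
    data: bytes, expected: list[int], block_size: int = BLOCK_SIZE
) -> list[int]:
    """Return indices of blocks whose checksum differs from `expected`.

    A block that is missing on either side is also reported as corrupt.
    """
    actual = block_checksums(data, block_size)
    corrupt = []
    for index in range(max(len(actual), len(expected))):
        missing = index >= len(actual) or index >= len(expected)
        if missing or actual[index] != expected[index]:
            corrupt.append(index)
    return corrupt
```

A reader that receives the data together with the writer's checksums would call `find_corrupt_blocks` and, on a non-empty result, fetch the affected blocks from another replica rather than return them to the application.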
Interestingly, for all that checksumming
vigilance, the GFS engineering
team also opted for an approach to
consistency that’s relatively loose by
file-system standards. Basically, GFS
simply accepts that there will be times
when people will end up reading slightly stale data. Since GFS is used mostly
as an append-only system as opposed
to an overwriting system, this generally means those people might end up
missing something that was appended
to the end of the file after they’d already
opened it. To the GFS designers, this
seemed an acceptable cost (although
it turns out that there are applications
for which this proves problematic).
Also, as Gobioff explained, “The risk
of stale data in certain circumstances is
just inherent to a highly distributed architecture
that doesn’t ask the master
to maintain all that much information.
We definitely could have made things a
lot tighter if we were willing to dump a
lot more data into the master and then
have it maintain more state. But that
just really wasn’t all that critical to us.”
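The stale-read behavior described above can be modeled with a toy Python sketch. This is only an illustration of the trade-off, not how GFS clients work internally: the classes `AppendOnlyFile` and `SnapshotReader` and the `refresh` method are hypothetical names invented for this example. The reader caches the file length when it opens the file, so anything appended later stays invisible until the reader asks for fresh state.

```python
class AppendOnlyFile:
    """Toy append-only file: records can be added but never overwritten."""

    def __init__(self) -> None:
        self._records: list[str] = []

    def append(self, record: str) -> None:
        self._records.append(record)

    def length(self) -> int:
        return len(self._records)

    def read(self, end: int) -> list[str]:
        return self._records[:end]

    def open_reader(self) -> "SnapshotReader":
        return SnapshotReader(self)


class SnapshotReader:
    """Reader that caches the file length seen at open time."""

    def __init__(self, file: AppendOnlyFile) -> None:
        self._file = file
        self._known_length = file.length()  # cached, may go stale

    def read_all(self) -> list[str]:
        # Only returns what existed when the length was last cached.
        return self._file.read(self._known_length)

    def refresh(self) -> None:
        # Application-level remedy: explicitly re-fetch the current length.
        self._known_length = self._file.length()
```

For example, if a reader is opened after record "a" is appended and record "b" arrives afterward, `read_all()` returns only "a" until `refresh()` is called. Pushing that refresh decision to the application is the kind of complexity the designers chose not to absorb in the master.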
Perhaps an even more important issue
here is that the engineers making
this decision owned not just the file system
but also the applications intended
to run on the file system. According
to Gobioff, “The thing is that we controlled
both the horizontal and the
vertical: the file system and the application.
So we could be sure our applications
would know what to expect from
the file system. And we just decided to
push some of the complexity out to the
applications to let them deal with it.”
Still, there are some at Google who
wonder whether that was the right call,
if only because people can sometimes
obtain different data in the course of
reading a given file multiple times,
which is strongly at odds
with their whole notion of how data
storage is supposed to work.
MCKUSICK: Let’s talk about consistency.
The issue seems to be that it presumably
takes some amount of time to get everything fully written to all the replicas. I
think you said something earlier to the
effect that GFS essentially requires that
this all be fully written before you can
continue.
QUINLAN: That’s correct.
MCKUSICK: If that’s the case, then how
can you possibly end up with things that
aren’t consistent?