QUINLAN: Client failures have a way of
fouling things up. Basically, the model
in GFS is that the client just continues
to push the write until it succeeds. If the
client ends up crashing in the middle of
an operation, things are left in a bit of
an indeterminate state.
Early on, that was sort of considered
to be OK, but over time, we tightened
the window for how long that inconsistency could be tolerated, and then
we slowly continued to reduce that.
Otherwise, whenever the data is in that
inconsistent state, you may get different lengths for the file. That can lead to
some confusion. We had to have some
backdoor interfaces for checking the
consistency of the file data in those instances. We also have something called
RecordAppend, which is an interface
designed for multiple writers to append
to a log concurrently. There the consistency was designed to be very loose. In
retrospect, that turned out to be a lot
more painful than anyone expected.
MCKUSICK: What exactly was loose? If the primary replica picks what the offset is for each write and then makes sure that actually occurs, I don’t see where the inconsistencies are going to come from.
QUINLAN: What happens is that
primary will try. It will pick an offset, it
will do the writes, but then one of them
won’t actually get written. Then the primary might change, at which point it
can pick a different offset. RecordAppend does not offer any replay protection either. You could end up getting the
data multiple times in the file.
There were even situations where you
could get the data in a different order.
It might appear multiple times in one
chunk replica, but not necessarily in all
of them. If you were reading the file, you
could discover the data in different ways
at different times. At the record level,
you could discover the records in different orders depending on which chunks
you happened to be reading.
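The loose RecordAppend semantics just described (duplicate records, and different apparent orders depending on which replica you read) were typically absorbed on the read side. Here is a minimal sketch of one common workaround, assuming the application stamps each record with its own unique ID; the function and names are illustrative, not GFS interfaces.

```python
# Reader-side workaround for at-least-once RecordAppend semantics:
# a retried append can land the same record in the file more than once,
# so the application tags each record with a unique ID and the reader
# deduplicates. All names here are illustrative, not GFS APIs.

def read_records(raw_records):
    """Yield each logical record's payload once, skipping duplicates.

    raw_records: iterable of (record_id, payload) pairs as read back
    from the appended file, possibly with repeats.
    """
    seen = set()
    for record_id, payload in raw_records:
        if record_id in seen:
            continue  # duplicate produced by a client retry
        seen.add(record_id)
        yield payload

# A failed-then-retried append leaves "r1" in the file twice.
stream = [("r1", b"alpha"), ("r2", b"beta"), ("r1", b"alpha")]
assert list(read_records(stream)) == [b"alpha", b"beta"]
```

A reader that also needs one canonical order would additionally sequence records by an application-level key, since replicas may expose them in different orders.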
MCKUSICK: Was this done by design?
QUINLAN: At the time, it must have seemed like a good idea, but in retrospect I think the consensus is that it proved to be more painful than it was worth. It just doesn’t meet the expectations people have of a file system, so they end up getting surprised. Then they had to figure out work-arounds.
MCKUSICK: In retrospect, how would
you handle this differently?
QUINLAN: I think it makes more sense
to have a single writer per file.
MCKUSICK: All right, but what happens
when you have multiple people wanting
to append to a log?
QUINLAN: You serialize the writes through a single process that can ensure the replicas are consistent.
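The single-writer scheme can be pictured as a funnel: many producers hand their records to one appender, which alone touches the file, so there is a single well-defined order and length. This is a toy in-process sketch using a queue and one writer thread, not the actual Google implementation.

```python
# Toy sketch of "serialize the writes through a single process":
# multiple producers enqueue records; exactly one appender thread
# writes them, so the log has one well-defined order and length.
# The queue-and-thread design is illustrative, not GFS code.
import os
import queue
import tempfile
import threading

def serialized_appender(path, q):
    """Drain q and append each record to path in arrival order."""
    with open(path, "ab") as f:
        while True:
            record = q.get()
            if record is None:  # sentinel: shut down the writer
                break
            f.write(record + b"\n")
            f.flush()

log_path = os.path.join(tempfile.mkdtemp(), "app.log")
q = queue.Queue()
writer = threading.Thread(target=serialized_appender, args=(log_path, q))
writer.start()
for i in range(3):  # stand-ins for multiple appending clients
    q.put(b"record-%d" % i)
q.put(None)
writer.join()
```

In a real system the appender would be a separate service rather than a thread, but the consistency argument is the same: one writer means every replica sees the same ordered stream.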
MCKUSICK: There’s also this business where you essentially snapshot a chunk. Presumably, that’s something you use when you’re essentially replacing a replica, or whenever some chunkserver goes down and you need to replace some of its files.
QUINLAN: Actually, two things are going on there. One, as you suggest, is the
recovery mechanism, which definitely
involves copying around replicas of the
file. The way that works in GFS is we basically revoke the lock so the client can’t
write it anymore, and this is part of that
latency issue we were talking about.
There’s also a separate issue, which
is to support the snapshot feature of
GFS. GFS has the most general-purpose
snapshot capability you can imagine.
You could snapshot any directory somewhere, and then both copies would be
entirely equivalent. They would share
the unchanged data. You could change
either one and you could further snapshot either one. So it was really more of
a clone than what most people think of
as a snapshot. It’s an interesting thing,
but it makes for difficulties—especially
as you try to build more distributed systems and you want potentially to snapshot larger chunks of the file tree.
I also think it’s interesting that the
snapshot feature hasn’t been used
more since it’s actually a very powerful feature. That is, from a file-system
point of view, it really offers a pretty
nice piece of functionality. But putting
snapshots into file systems, as I’m sure
you know, is a real pain.
MCKUSICK: I know. I’ve done it. It’s excruciating—especially in an overwriting file system.
QUINLAN: Exactly. This is a case where
we didn’t cheat, but from an implementation perspective, it’s hard to create true snapshots. Still, it seems that
in this case, going the full deal was the
right decision. Just the same, it’s an interesting contrast to some of the other
decisions that were made early on in
terms of the semantics.
All in all, the report card on GFS nearly
10 years later seems positive. There
have been problems and shortcomings, to be sure, but there’s surely no arguing with Google’s success, and GFS has without a doubt played an important role in that. What’s more, its staying power has been nothing short of
remarkable given that Google’s operations have scaled orders of magnitude
beyond anything the system had been
designed to handle, while the application mix Google currently supports is
not one that anyone could have possibly imagined back in the late 1990s.
Still, there’s no question that GFS
faces many challenges now. For one
thing, the awkwardness of supporting
an ever-growing fleet of user-facing,
latency-sensitive applications on top
of a system initially designed for batch-system throughput is something that’s
obvious to all.
The advent of BigTable has helped
somewhat in this regard. As it turns out,
however, BigTable isn’t actually all that
great a fit for GFS. In fact, it just makes
the bottleneck limitations of the system’s single-master design more apparent than would otherwise be the case.
For these and other reasons, engineers at Google have been working for
much of the past two years on a new distributed master system designed to take
full advantage of BigTable to attack some
of those problems that have proved particularly difficult for GFS.
Accordingly, it now seems that beyond all the adjustments made to ensure
the continued survival of GFS, the newest branch on the evolutionary tree will
continue to grow in significance over the
years to come.