be used to handle key-value pairs and
that sort of thing, how is that different
from Bigtable?
QUINLAN: The main difference is that SSTables are immutable, while Bigtable provides mutable key-value storage, and a whole lot more. Bigtable itself is actually built on top of logs and SSTables.
Initially, it stores incoming data into
transaction log files. Then it gets
compacted—as we call it—into a series of
SSTables, which in turn get compacted
together over time. In some respects,
it’s reminiscent of a log-structured file
system. Anyway, as you’ve observed, logs
and SSTables do seem to be the two data
structures underlying the way we structure most of our data. We have log files
for mutable stuff as it’s being recorded.
Then, once you have enough of that, you
sort it and put it into this structure that
has an index.
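The log-then-sorted-table pattern Quinlan describes can be sketched in a few lines. This toy store (all names hypothetical, no relation to the actual Bigtable code) appends mutations to a log, flushes them into immutable sorted runs, and looks keys up with a binary search over each run's sorted index:

```python
import bisect

class MiniStore:
    """Toy sketch of the log-plus-SSTable pattern: mutations append to a
    log; a flush sorts them into an immutable, index-searchable table."""

    def __init__(self):
        self.log = []        # append-only log of (key, value) mutations
        self.sstables = []   # immutable sorted runs, newest first

    def put(self, key, value):
        self.log.append((key, value))  # cheap sequential write

    def flush(self):
        # Sort the accumulated log by key (last write per key wins); the
        # resulting sorted, immutable run plays the role of an SSTable.
        latest = dict(self.log)
        self.sstables.insert(0, sorted(latest.items()))
        self.log = []

    def get(self, key):
        # Check the unflushed log first (newest wins), then runs newest-first.
        for k, v in reversed(self.log):
            if k == key:
                return v
        for table in self.sstables:
            keys = [k for k, _ in table]
            i = bisect.bisect_left(keys, key)  # binary search on the sorted run
            if i < len(keys) and keys[i] == key:
                return table[i][1]
        return None

    def compact(self):
        # Merge all runs into one; values from newer runs overwrite older ones.
        merged = {}
        for table in reversed(self.sstables):  # oldest first
            merged.update(dict(table))
        self.sstables = [sorted(merged.items())]
```

Compaction here is the same idea Quinlan mentions: many small sorted runs get merged, over time, into fewer larger ones.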
Even though GFS does not provide
a POSIX interface, it still has a pretty generic file-system interface, so people
are essentially free to write any sort of
data they like. It’s just that, over time,
the majority of our users have ended up
using these two data structures. We also
have something called protocol buffers,
which is our data description language.
The majority of data ends up being protocol buffers in these two structures.
Both provide for compression and
checksums. Even though there are
some people internally who end up reinventing these things, most people
are content just to use those two basic building blocks.
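As a rough illustration of layering checksums and compression under a record stream (this framing is hypothetical, not Google's actual record format or the protocol buffer wire encoding), each record can carry a length and a CRC over its compressed payload:

```python
import struct
import zlib

def frame_record(payload: bytes) -> bytes:
    """Frame one record: 4-byte little-endian length, 4-byte CRC32 of the
    compressed payload, then the zlib-compressed payload itself."""
    body = zlib.compress(payload)
    return struct.pack("<II", len(body), zlib.crc32(body)) + body

def read_record(buf: bytes) -> bytes:
    """Parse one framed record, verifying the checksum before decompressing."""
    length, crc = struct.unpack_from("<II", buf, 0)
    body = buf[8:8 + length]
    if zlib.crc32(body) != crc:
        raise ValueError("corrupt record: checksum mismatch")
    return zlib.decompress(body)
```

Checking the CRC before decompressing means corruption is caught at read time rather than surfacing as mysteriously garbled data downstream.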
Because GFS was designed initially to
enable a crawling and indexing system,
throughput was everything. In fact, the original paper written about the system makes this quite explicit: “High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response-time requirements for an individual read or write.”
But then Google either developed or
embraced many user-facing Internet
services for which this is most definite-
ly not the case.
One GFS shortcoming that this immediately exposed had to do with the
original single-master design. A single
point of failure may not have been a disaster for batch-oriented applications,
but it was certainly unacceptable for
latency-sensitive applications, such as
video serving. The later addition of automated failover capabilities helped,
but even then service could be out for
up to a minute.
MCKUSICK: It’s well documented that
the initial emphasis in designing GFS
was on batch efficiency as opposed to
low latency. Now that has come back to
cause you trouble, particularly in terms
of handling things such as videos. How
are you handling that?
QUINLAN: The GFS design model
from the get-go was all about achieving
throughput, not about the latency at
which that might be achieved. To give
you a concrete example, if you’re writing a file, it will typically be written in
triplicate—meaning you’ll actually be
writing to three chunkservers. Should
one of those chunkservers die or hiccup
for a long period of time, the GFS master will notice the problem and schedule what we call a pullchunk, which
means it will basically replicate one of
those chunks. That will get you back up
to three copies, and then the system will
pass control back to the client, which
will continue writing.
When we do a pullchunk we limit it to
something on the order of 5MB–10MB
a second. So, for 64MB, you’re talking
about 10 seconds for this recovery to
take place. There are lots of other things
like this that might take 10 seconds to a
minute, which works just fine for batch-type operations. If you’re doing a large
MapReduce operation, you’re OK just
so long as one of the items is not a real
straggler, in which case you’ve got yourself a different sort of problem. Still,
generally speaking, a hiccup on the order of a minute over the course of an
hour-long batch job doesn’t really show
up. If you are working on Gmail, however, and you’re trying to write a mutation
that represents some user action, then
getting stuck for a minute is really going
to mess you up.
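That 10-second figure falls straight out of the pullchunk throttle: chunk size divided by copy rate. A quick sketch (the function name is ours, and 6.4MB/s is simply a rate within the quoted 5MB–10MB-a-second band that makes a 64MB chunk come out to 10 seconds):

```python
def pullchunk_seconds(chunk_mb: float = 64.0, rate_mb_s: float = 6.4) -> float:
    """Back-of-envelope time to re-replicate one chunk: chunk size divided
    by the throttled copy rate (Quinlan cites 5MB-10MB a second)."""
    return chunk_mb / rate_mb_s

# A 64MB chunk at 6.4MB/s takes about 10 seconds to re-replicate;
# at the 5MB/s low end it is closer to 13 seconds.
```

Invisible inside an hour-long MapReduce, but, as Quinlan notes, fatal for an interactive write path like Gmail's.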
We’ve had similar issues with our
master failover. Initially, GFS had no
provision for automatic master failover.
It was a manual process. Although it
didn’t happen a lot, whenever it did, the
cell might be down for an hour. Even
our initial master-failover implementation required on the order of minutes.
Over the past year, however, we’ve taken
that down to something on the order of
tens of seconds.