figuring out how we could combine some
number of underlying objects into
larger files. In the case of the logs, that
wasn’t exactly rocket science, but it did
require a lot of effort.
MCKUSICK: That sounds like the old
days when IBM had only a minimum
disk allocation, so it provided you with
a utility that let you pack a bunch of files
together and then create a table of contents for that.
QUINLAN: Exactly. For us, each application essentially ended up doing that
to varying degrees. That proved to be
less burdensome for some applications
than for others. In the case of our logs,
we hadn’t really been planning to delete
individual log files. It was more likely
that we would end up rewriting the logs
to anonymize them or do something
else along those lines. That way, you
don’t get the garbage-collection problems that can come up if you delete only
some of the files within a bundle.
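To make that packing pattern concrete, here is a minimal sketch, assuming a simple hypothetical layout: member data concatenated up front, a JSON table of contents after it, and an 8-byte footer recording where the table of contents begins. Every name and format detail here is illustrative, not anything GFS applications actually used.

```python
import json
import struct

def pack(bundle_path, files):
    """Pack {name: bytes} into one bundle with a table of contents."""
    toc = {}
    with open(bundle_path, "wb") as out:
        for name, data in files.items():
            toc[name] = (out.tell(), len(data))   # (offset, length)
            out.write(data)
        toc_offset = out.tell()
        out.write(json.dumps(toc).encode())
        out.write(struct.pack("<Q", toc_offset))  # fixed-size footer

def read_member(bundle_path, name):
    """Random access to one member via the table of contents."""
    with open(bundle_path, "rb") as f:
        f.seek(-8, 2)                             # footer sits at the end
        (toc_offset,) = struct.unpack("<Q", f.read(8))
        f.seek(toc_offset)
        toc = json.loads(f.read()[:-8])           # drop the footer bytes
        offset, length = toc[name]
        f.seek(offset)
        return f.read(length)
```

Note that deleting one member of such a bundle merely leaves a dead region until the whole bundle is rewritten, which is the garbage-collection problem Quinlan alludes to; rewriting the logs wholesale, for example to anonymize them, sidesteps it.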
For some other applications, however, the file-count problem was more
acute. Many times, the most natural design for some application just wouldn’t
fit into GFS—even though at first glance
you would think the file count would
be perfectly acceptable, it would turn
out to be a problem. When we started
using more shared cells, we put quotas
on both file counts and storage space.
The limit that people have ended up
running into most has been, by far, the
file-count quota. In comparison, the underlying storage quota rarely proves to
be a problem.
MCKUSICK: What longer-term strategy
have you come up with for dealing with
the file-count issue? Certainly, it doesn’t
seem that a distributed master is really
going to help with that—not if the master still has to keep all the metadata in
memory, that is.
QUINLAN: The distributed master certainly allows you to grow file counts,
in line with the number of machines
you’re willing to throw at it. That certainly helps.
One of the appeals of the distributed
multimaster model is that if you scale everything up by two orders of magnitude,
then getting down to a 1MB average file
size is going to be a lot different from
having a 64MB average file size. If you
end up going below 1MB, then you’re
also going to run into other issues that
you really need to be careful about. For
example, if you end up having to read
10,000 10KB files, you're going to be doing a lot more seeking than if you're just
reading 100 1MB files.
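A quick back-of-envelope calculation shows why. The disk numbers below, roughly 10ms per seek and 100MB/s of sequential bandwidth, are assumptions for illustration, not figures from the interview:

```python
SEEK_SECONDS = 0.010      # assumed average seek + rotational delay
BYTES_PER_SECOND = 100e6  # assumed sequential disk bandwidth

def read_time(n_files, bytes_per_file):
    # One seek per file, then a sequential transfer of its contents.
    return n_files * SEEK_SECONDS + n_files * bytes_per_file / BYTES_PER_SECOND

print(read_time(10_000, 10_000))    # 10,000 x 10KB: ~101 seconds
print(read_time(100, 1_000_000))    # 100 x 1MB:     ~2 seconds
```

Both cases read the same 100MB, but under these assumptions the small-file case spends roughly 100 seconds seeking against about 1 second for the large-file case.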
With the recent emergence within
Google of BigTable, a distributed storage system for managing structured
data, one potential remedy for the file-count problem—albeit perhaps not the
very best one—is now available.
The significance of BigTable goes
far beyond file counts, however. Specifically, it was designed to scale into
the petabyte range across hundreds or
thousands of machines, as well as to
make it easy to add more machines to
the system and automatically start taking advantage of those resources without reconfiguration. For a company
predicated on the notion of employing
the collective power, potential redundancy, and economies of scale inherent
in a massive deployment of commodity
hardware, these rate as significant advantages indeed.
Accordingly, BigTable is now used in
conjunction with a growing number of
Google applications. Although it represents a departure of sorts from the past,
it also must be said that BigTable was
built on GFS, runs on GFS, and was consciously designed to remain consistent
with most GFS principles. Consider it,
therefore, as one of the major adaptations made along the way to help keep
GFS viable in the face of rapid and widespread change.
MCKUSICK: You now have this thing called BigTable. Do you view that as an application in its own right?
QUINLAN: From the GFS point of view,
it’s an application, but it’s clearly more
of an infrastructure piece.
MCKUSICK: If I understand this correctly, BigTable is essentially a lightweight relational database.
QuInLAn: It’s not really a relational database. I mean, we’re not doing SQL and
it doesn’t really support joins and such.
But BigTable is a structured storage system that lets you have lots of key-value
pairs and a schema.
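For readers who want a feel for that model: the published BigTable paper describes a sparse, sorted map in which a cell is addressed by a row key, a column within a declared column family, and a timestamp. The toy class below sketches only that addressing scheme, in memory; it is a simplification for illustration, not BigTable's implementation or API.

```python
import time

class TinyTable:
    """Toy (row, column-family:qualifier, timestamp) -> value map."""

    def __init__(self, column_families):
        self.families = set(column_families)  # the lightweight "schema"
        self.rows = {}                        # row -> column -> {ts: value}

    def put(self, row, column, value, ts=None):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family {family!r}")
        cells = self.rows.setdefault(row, {}).setdefault(column, {})
        cells[time.time() if ts is None else ts] = value

    def get(self, row, column):
        cells = self.rows[row][column]
        return cells[max(cells)]              # newest version wins

table = TinyTable(column_families=["contents", "anchor"])
table.put("com.example.www", "contents:html", b"<html>...</html>")
print(table.get("com.example.www", "contents:html"))
```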
MCKUSICK: Who are the real clients of BigTable?
QUINLAN: BigTable is increasingly being used within Google for crawling
and indexing systems, and we use it a
lot within many of our client-facing applications. The truth of the matter is
that there are tons of BigTable clients.
Basically, any app with lots of small
data items tends to use BigTable. That's
especially true wherever there's fairly structured data.
MCKUSICK: I guess the question I'm really trying to pose here is: Did BigTable
just get stuck into a lot of these applications as an attempt to deal with the
small-file problem, basically by taking
a whole bunch of small things and then
aggregating them together?
QUINLAN: That has certainly been one
use case for BigTable, but it was actually
intended for a much more general sort
of problem. If you're using BigTable in
that way—that is, as a way of fighting
the file-count problem where you might
have otherwise used a file system to
handle that—then you would not end
up employing all of BigTable's functionality by any means. BigTable isn't really
ideal for that purpose in that it requires
resources for its own operations that are
nontrivial. Also, it has a garbage-collection policy that’s not super-aggressive,
so that might not be the most efficient
way to use your space. I’d say that the
people who have been using BigTable
purely to deal with the file-count problem probably haven’t been terribly happy, but there’s no question that it is one
way for people to handle that problem.
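In that file-count-fighting use case, an application stores each small item as a row rather than as a separate GFS file, trading master metadata pressure for BigTable's own overheads. A hypothetical sketch, reusing the TinyTable toy from above (the helper names are invented for illustration):

```python
small_items = TinyTable(column_families=["data"])

def write_item(path, contents):
    # One row per former "file"; no per-item GFS file-count cost.
    small_items.put(path, "data:contents", contents)

def read_item(path):
    return small_items.get(path, "data:contents")

write_item("/logs/part-00042", b"...")
print(read_item("/logs/part-00042"))
```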
mCKuSICK: What I’ve read about GFS
seems to suggest that the idea was to
have only two basic data structures: logs
and SSTables (Sorted String Tables).
Since I’m guessing the SSTables must