single master was actually one of the
very first decisions, mostly just to simplify the overall design problem. That is,
building a distributed master right from
the outset was deemed too difficult and
would take too much time. Also, by going with the single-master approach, the
engineers were able to simplify a lot of
problems. Having a central place to control replication and garbage collection
and many other activities was definitely
simpler than handling it all on a distributed basis. So the decision was made to
centralize that in one machine.
MCKUSICK: Was this mostly about being able to roll out something within a
reasonably short time frame?
QUINLAN: Yes. In fact, some of the engineers who were involved in that early
effort later went on to build BigTable,
a distributed storage system, but that
effort took many years. The decision to
build the original GFS around the single
master really helped get something out
into the hands of users much more rapidly than would have otherwise been
possible.
Also, in sketching out the use cases
they anticipated, it didn’t seem the single-master design would cause much of
a problem. The scale they were thinking
about back then was framed in terms of
hundreds of terabytes and a few million
files. In fact, the system worked just fine
to start with.
MCKUSICK: But then what?
QUINLAN: Problems started to occur
once the size of the underlying storage
increased. Going from a few hundred
terabytes up to petabytes, and then up
to tens of petabytes…that really required
a proportionate increase in the amount
of metadata the master had to maintain. Also, operations such as scanning
the metadata to look for recoveries all
scaled linearly with the volume of data.
So the amount of work required of the
master grew substantially. The amount
of storage needed to retain all that information grew as well.
In addition, this proved to be a bottleneck for the clients, even though the
clients issue few metadata operations themselves (for example, a client talks
to the master whenever it does an open). When you have thousands of clients all
talking to the master at the same time, given that the master is capable of
doing only a few thousand operations a second, the average client isn’t able to
command all that many operations per second. Also bear in mind that there are
applications such as MapReduce, where you might suddenly have a thousand
tasks, each wanting to open a number of files. Obviously, it would take a
long time to handle all those requests, and the master would be under a fair
amount of duress.
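
A rough back-of-the-envelope calculation makes this bottleneck concrete. The sketch below is illustrative only: the interview speaks of "a few thousand operations a second" and "thousands of clients" without giving exact figures, so the numbers used here (3,000 operations per second, 3,000 clients, 1,000 tasks opening 30 files each) and the function names are assumptions, not measurements from GFS.

```python
def per_client_share(master_ops_per_sec: float, clients: int) -> float:
    """Average metadata operations per second available to each client
    when all clients contend for a single master at once."""
    if clients <= 0:
        raise ValueError("clients must be positive")
    return master_ops_per_sec / clients


def burst_drain_seconds(tasks: int, opens_per_task: int,
                        master_ops_per_sec: float) -> float:
    """Time for a single master to work through a burst of open requests,
    assuming it does nothing else and never queues beyond its capacity."""
    if master_ops_per_sec <= 0:
        raise ValueError("master_ops_per_sec must be positive")
    return tasks * opens_per_task / master_ops_per_sec


# Assumed, illustrative figures (not taken from the interview).
MASTER_OPS_PER_SEC = 3_000
share = per_client_share(MASTER_OPS_PER_SEC, clients=3_000)
drain = burst_drain_seconds(tasks=1_000, opens_per_task=30,
                            master_ops_per_sec=MASTER_OPS_PER_SEC)
print(f"per-client share: {share} ops/s")   # 1.0 ops/s
print(f"burst drain time: {drain} s")       # 10.0 s
```

Under these assumed figures each client averages about one metadata operation per second, and a job that starts a thousand tasks at once keeps the master fully occupied for roughly ten seconds before any other client is served.
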
MCKUSICK: Now, under the current
schema for GFS, you have one master
per cell, right?
QUINLAN: That’s correct.
MCKUSICK: And historically you’ve had
one cell per data center, right?
QUINLAN: That was initially the goal,
but it didn’t work out like that to a large
extent, partly because of the limitations
of the single-master design and
partly because isolation proved to be
difficult. As a consequence, people generally
ended up with more than one cell
per data center. We also ended up doing
what we call a multi-cell approach,
which basically made it possible to
put multiple GFS masters on top of a
pool of chunkservers. That way, the
chunkservers could be configured to
have, say, eight GFS masters assigned
to them, and that would give you at least
one pool of underlying storage, with
multiple master heads on it, if you will.
Then the application was responsible
for partitioning data across those different
cells.
MCKUSICK: Presumably each application
would then essentially have its own
master that would be responsible for
managing its own little file system. Was
that basically the idea?
QUINLAN: Well, yes and no. Applications
would tend to use either one master
or a small set of the masters. We also
have something we called Name Spaces,
which are just a very static way of partitioning
a namespace that people can
use to hide all of this from the actual
application. The Logs Processing System
offers an example of this approach:
once logs exhaust their ability to use
just one cell, they move to multiple GFS
cells; a namespace file describes how
the log data is partitioned across those
different cells and basically serves to
hide the exact partitioning from the application.
But this is all fairly static.
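
The static partitioning described above can be pictured as a fixed lookup table from path prefixes to cells. The interview does not describe the format of the namespace file or how lookups are performed, so everything in the sketch below (the table contents, the cell names, the longest-prefix rule, and the `resolve_cell` helper) is a hypothetical illustration rather than the actual GFS mechanism.

```python
# Hypothetical static namespace table: path prefix -> GFS cell name.
# In this sketch the table is fixed ahead of time and never rebalanced,
# which mirrors the "fairly static" nature of the partitioning.
NAMESPACE_MAP = {
    "/logs/web": "cell-a",
    "/logs/ads": "cell-b",
    "/logs/ads/archive": "cell-c",
}


def resolve_cell(path: str, mapping: dict = NAMESPACE_MAP) -> str:
    """Return the cell responsible for `path`, using the longest prefix
    in the table that matches on a whole path-component boundary."""
    best_prefix = None
    for prefix in mapping:
        if path == prefix or path.startswith(prefix + "/"):
            if best_prefix is None or len(prefix) > len(best_prefix):
                best_prefix = prefix
    if best_prefix is None:
        raise KeyError(f"no cell configured for {path!r}")
    return mapping[best_prefix]


# The application asks for a path and never sees which cell serves it.
print(resolve_cell("/logs/web/2009-01-01.log"))      # cell-a
print(resolve_cell("/logs/ads/archive/2008.log"))    # cell-c
```

Because the table is fixed, moving data between cells in a scheme like this means editing the table and relocating files by hand, which is one way to read the remark that the approach is fairly static.
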
It could be argued that managing to get
GFS ready for production in record time
constituted a victory in its own right and
that, by speeding Google to market, this
ultimately contributed mightily to the
company’s success. A team of three was
responsible for all of that—for the core
of GFS—and for the system being readied for deployment in less than a year.
But then came the price that so often
befalls any successful system—that is,
once the scale and use cases have had
time to expand far beyond what anyone could have possibly imagined. In
Google’s case, those pressures proved
to be particularly intense.
Although organizations don’t make
a habit of exchanging file-system statistics, it’s safe to assume that GFS is
the largest file system in operation (in
fact, that was probably true even before Google’s acquisition of YouTube).
Hence, even though the original architects of GFS felt they had provided adequately for at least a couple of orders
of magnitude of growth, Google quickly
zoomed right past that.
In addition, the number of applications GFS was called upon to support soon ballooned. In an interview
conducted just prior to his untimely death in early 2008,
Howard Gobioff, one of the original GFS architects,
recalled, “The original consumer of
all our earliest GFS versions was basically this tremendously large crawling