major cost for Mesa and decompression performance on read/query significantly outweighs the compression performance on write, we emphasize the compression ratio and read/decompression times over write/compression times when choosing the compression algorithm.
Mesa also stores an index file corresponding to each
data file. (Recall that each data file is specific to a higher-level table index.) An index entry contains the short key
for the row block, which is a fixed-size prefix of the first
key in the row block, and the offset of the compressed row
block in the data file. A naïve algorithm for querying a
specific key is to perform a binary search on the index file
to find the range of row blocks that may contain a short
key matching the query key, followed by a binary search
on the compressed row blocks in the data files to find the
desired key.
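To make the two-level search concrete, here is a minimal sketch of the naïve lookup in Python. The on-disk details are assumptions for illustration: index entries are (short key, offset) pairs, row blocks are zlib-compressed, and each block decodes to a sorted list of (key, value) pairs, with pickle standing in for Mesa's actual row format. Because the short key is only a prefix, a block whose short key matches may still not contain the query key, so each candidate block must be decompressed and searched.

```python
import bisect
import pickle
import zlib
from typing import NamedTuple, Optional

class IndexEntry(NamedTuple):
    short_key: bytes  # fixed-size prefix of the first key in the row block
    offset: int       # byte offset of the compressed row block in the data file

PREFIX_LEN = 8  # hypothetical short-key length

def lookup(query_key: bytes, index: list[IndexEntry],
           data: bytes) -> Optional[bytes]:
    """Naive point lookup over one data file: binary-search the index for row
    blocks whose short key may match the query key, then binary-search the
    rows of each candidate block after decompressing it."""
    prefix = query_key[:PREFIX_LEN]
    short_keys = [e.short_key for e in index]
    lo = bisect.bisect_left(short_keys, prefix)
    hi = bisect.bisect_right(short_keys, prefix)
    # The block preceding the first exact-prefix match may also hold the key,
    # since its first key sorts strictly below the query key.
    for i in range(max(lo - 1, 0), hi):
        end = index[i + 1].offset if i + 1 < len(index) else len(data)
        rows = pickle.loads(zlib.decompress(data[index[i].offset:end]))
        j = bisect.bisect_left(rows, query_key, key=lambda r: r[0])
        if j < len(rows) and rows[j][0] == query_key:
            return rows[j][1]  # aggregated value for this key
    return None
```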
3. MESA SYSTEM ARCHITECTURE
Mesa is built using common Google infrastructure and services, including BigTable3 and Colossus.7 Mesa runs in multiple datacenters, each of which runs a single instance. We
start by describing the design of an instance. Then we discuss how those instances are integrated to form a full multi-datacenter Mesa deployment.
3.1. Single datacenter instance
Each Mesa instance is composed of two subsystems: update/
maintenance and querying. These subsystems are decoupled, allowing them to scale independently. All persistent
metadata is stored in BigTable and all data files are stored in
Colossus. No direct communication is required between the
two subsystems for operational correctness.
Update/maintenance subsystem. The update and maintenance subsystem performs all necessary operations to ensure the data in the local Mesa instance is correct, up to date,
and optimized for querying. It runs various background operations such as loading updates, performing table compaction, applying schema changes, and running table checksums. These operations are managed and performed by a
collection of components known as the controller/worker
framework, illustrated in Figure 4.
The controller determines the work that needs to be
done and manages all table metadata, which it persists
in the metadata BigTable. The table metadata consists of detailed state and operational metadata for each table.
The controller can be viewed as a large-scale table metadata cache, work scheduler, and work queue manager. The controller does not perform any actual table data manipulation work; it only schedules work and updates the metadata. At startup, the controller loads table metadata from a BigTable, which includes entries for all tables assigned to the local Mesa instance. For every known table, it subscribes to a metadata feed to listen for table updates. This subscription is dynamically updated as tables are added and dropped from the instance. New update metadata received on this feed is validated and recorded. The controller is the exclusive writer of the table metadata in the BigTable.
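The following sketch illustrates this startup and feed-handling flow. It is not Mesa's implementation: a plain dict stands in for the metadata BigTable, the field names are invented, and validation is reduced to a trivial check.

```python
class ControllerMetadata:
    """Sketch of the controller as the exclusive writer of table metadata."""

    def __init__(self, metadata_bigtable: dict, instance: str):
        self.metadata_bigtable = metadata_bigtable
        # At startup, load entries for all tables assigned to this instance.
        self.tables = {name: meta for name, meta in metadata_bigtable.items()
                       if meta["instance"] == instance}
        # One metadata-feed subscription per known table.
        self.subscribed = set(self.tables)

    def on_table_added(self, name: str, meta: dict):
        self.tables[name] = meta
        self.subscribed.add(name)  # subscriptions track the table set

    def on_table_dropped(self, name: str):
        self.tables.pop(name, None)
        self.subscribed.discard(name)

    def on_feed_update(self, name: str, update: dict) -> bool:
        # Validate new update metadata before recording it; recording is safe
        # because no other component writes table metadata in the BigTable.
        if name not in self.subscribed or "version" not in update:
            return False  # trivial stand-in for real validation
        self.metadata_bigtable[name].setdefault("updates", []).append(update)
        return True
```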
The controller maintains separate internal queues for
different types of data manipulation work (e.g., incorporating updates, delta compaction, schema changes,
and table checksums). For operations specific to a single
Mesa instance, such as incorporating updates and delta
compaction, the controller determines what work to
queue. Work that requires globally coordinated application or global synchronization, such as schema changes
and table checksums, is initiated by other components
that run outside the context of a single Mesa instance. For
these tasks, the controller accepts work requests by RPC
and inserts these tasks into the corresponding internal
work queues.
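As a rough illustration of this split, the sketch below keeps one internal queue per task type; instance-local work is enqueued by the controller's own scheduling pass, while globally coordinated work arrives through an RPC entry point. The task tuples and method names are invented for the example.

```python
from collections import deque
from enum import Enum, auto

class TaskType(Enum):
    UPDATE = auto()         # incorporate updates (instance-local)
    COMPACTION = auto()     # delta compaction (instance-local)
    SCHEMA_CHANGE = auto()  # globally coordinated, submitted via RPC
    CHECKSUM = auto()       # globally coordinated, submitted via RPC

class ControllerQueues:
    """Sketch of the controller's per-task-type internal work queues."""

    def __init__(self):
        self.queues = {t: deque() for t in TaskType}

    def schedule_local_work(self, tables_with_updates, tables_to_compact):
        # For instance-local operations, the controller itself decides what
        # work to queue based on the table metadata it manages.
        for table in tables_with_updates:
            self.queues[TaskType.UPDATE].append(("incorporate_updates", table))
        for table in tables_to_compact:
            self.queues[TaskType.COMPACTION].append(("compact_deltas", table))

    def accept_rpc_work(self, task_type: TaskType, request):
        # Globally coordinated operations (schema changes, table checksums)
        # are initiated outside the instance and handed over by RPC.
        assert task_type in (TaskType.SCHEMA_CHANGE, TaskType.CHECKSUM)
        self.queues[task_type].append(request)

    def poll(self, task_type: TaskType):
        # Idle workers poll for work of their associated task type.
        q = self.queues[task_type]
        return q.popleft() if q else None
```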
Worker components are responsible for performing
the data manipulation work within each Mesa instance.
Mesa has a separate set of worker pools for each task
type, allowing each worker pool to scale independently.
Mesa uses an in-house worker pool scheduler that scales
the number of workers based on the percentage of idle
workers available. A worker can process a large task using
MapReduce.9
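The text specifies only the scheduler's input signal (the fraction of idle workers), so this sketch shows just the general shape of such a policy; the thresholds, step size, and bounds are invented for illustration.

```python
def scale_pool(current_size: int, idle_workers: int,
               low: float = 0.1, high: float = 0.5,
               step: int = 5, min_size: int = 1, max_size: int = 1000) -> int:
    """Idle-fraction-based scaling: grow the pool when almost no workers are
    idle, shrink it when many are, and otherwise leave it alone."""
    idle_fraction = idle_workers / max(current_size, 1)
    if idle_fraction < low:
        return min(current_size + step, max_size)   # pool saturated: grow
    if idle_fraction > high:
        return max(current_size - step, min_size)   # pool underused: shrink
    return current_size
```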
Each idle worker periodically polls the controller to
request work for the type of task associated with its worker
type until valid work is found. Upon receiving valid work, the
worker validates the request, processes the retrieved work,
and notifies the controller when the task is completed. Each
task has an associated maximum ownership time and a
periodic lease renewal interval to ensure that a slow or dead
worker does not hold on to the task forever. The controller
is free to reassign the task if either of these conditions is violated; this is safe because the controller will only
accept the task result from the worker to which it is assigned.
This ensures that Mesa is resilient to worker failures. A
garbage collector runs continuously to delete files left behind
due to worker crashes.
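A minimal sketch of this ownership protocol is below, with invented lease and ownership durations. The key property is that an expired lease or exceeded maximum ownership time lets the controller reassign the task, and a result is accepted only from the worker currently holding the assignment.

```python
import time
from dataclasses import dataclass
from typing import Optional

MAX_OWNERSHIP_SECS = 3600.0  # hypothetical maximum ownership time
LEASE_SECS = 60.0            # hypothetical lease renewal interval

@dataclass
class Task:
    task_id: int
    owner: Optional[str] = None
    lease_expiry: float = 0.0  # absolute time the current lease lapses
    deadline: float = 0.0      # absolute time the ownership period lapses

class TaskLedger:
    """Controller-side sketch of lease-based task ownership."""

    def assign(self, task: Task, worker: str,
               now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        if task.owner and now < task.lease_expiry and now < task.deadline:
            return False  # still validly owned by another worker
        task.owner = worker
        task.lease_expiry = now + LEASE_SECS
        task.deadline = now + MAX_OWNERSHIP_SECS
        return True

    def renew(self, task: Task, worker: str,
              now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        if task.owner != worker or now >= task.deadline:
            return False  # slow or dead worker: controller may reassign
        task.lease_expiry = now + LEASE_SECS
        return True

    def complete(self, task: Task, worker: str,
                 now: Optional[float] = None) -> bool:
        # Results are accepted only from the currently assigned worker within
        # its valid ownership window, making reassignment after failures safe.
        now = time.time() if now is None else now
        return (task.owner == worker and
                now < task.lease_expiry and now < task.deadline)
```

For example, if a worker crashes after assign(), its lease expires within LEASE_SECS, a later assign() from another worker succeeds, and any stale complete() from the original worker is rejected.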
Since the controller/worker framework is only used for
update and maintenance work, these components can
restart without impacting external users. Also, the controller
itself is sharded by table, allowing the framework to scale.
In addition, the controller is stateless: all state information is maintained consistently in the BigTable. This ensures that a restarted controller can recover its state directly from the metadata BigTable.
Figure 4. Mesa’s controller/worker framework. [The figure shows controllers, receiving updates, dispatching to update workers, compaction workers, schema change workers, checksum workers, and a garbage collector, with metadata in BigTable and Mesa data on Colossus.]