We have built a prototype 21-node FAWN cluster using
500 MHz embedded CPUs. Each node can serve up to 1300
256-byte queries/s, exploiting nearly all of the raw I/O capability of its attached flash device, and consumes under
5 W when network and support hardware are taken into
account. The FAWN cluster achieves 330 queries/J, two
orders of magnitude better than traditional disk-based
clusters.
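As a back-of-the-envelope check on these figures (the 4 W per-node draw below is an assumed round number standing in for "under 5 W", not a measured value):

```python
# Sanity check: queries/J = (queries/s) / Watts.
# 1300 queries/s is from the text; 4.0 W per node is an assumption.
queries_per_sec = 1300
watts_per_node = 4.0
queries_per_joule = queries_per_sec / watts_per_node
print(round(queries_per_joule))  # 325, in line with the reported 330 queries/J
```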
2. WHY FAWN?
The FAWN approach to building well-matched cluster systems has the potential to achieve high performance and
be fundamentally more energy-efficient than conventional architectures for serving massive-scale I/O and data-intensive workloads. We measure system performance in
queries per second and measure energy efficiency in queries
per Joule (equivalently, queries per second per Watt). FAWN
is inspired by several fundamental trends:
Increasing CPU-I/O gap: Over the past several decades,
the gap between CPU performance and I/O bandwidth has
continually grown. For data-intensive computing workloads,
storage, network, and memory bandwidth bottlenecks often
cause low CPU utilization.
FAWN approach: To efficiently run I/O-bound data-intensive, computationally simple applications, FAWN uses
wimpy processors selected to reduce I/O-induced idle cycles
while maintaining high performance. The reduced processor speed then benefits from a second trend.
CPU power consumption grows super-linearly with
speed: Higher frequencies require more energy, and techniques to mask the CPU-memory bottleneck come at the
cost of energy efficiency. Branch prediction, speculative
execution, out-of-order execution and large on-chip caches
all require additional die area; modern processors dedicate as much as half their die to L2/3 caches.9 These techniques do not increase the speed of basic computations,
but do increase power consumption, making faster CPUs
less energy efficient.
FAWN approach: A FAWN cluster’s slower CPUs dedicate proportionally more transistors to basic operations.
These CPUs execute significantly more instructions per
Joule than their faster counterparts: Multi-GHz superscalar
quad-core processors can execute approximately 100 million instructions/J, assuming all cores are active and avoid
stalls or mispredictions. Lower-frequency in-order CPUs,
in contrast, can provide over 1 billion instructions/J—an
order of magnitude more efficient while running at 1/3 the
frequency.
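The same gap can be restated as energy per instruction, which makes the comparison concrete (both efficiency figures are the ones quoted above):

```python
# Energy per instruction implied by the quoted efficiency figures.
fast_ipj = 100e6   # instructions/J: multi-GHz superscalar quad-core
wimpy_ipj = 1e9    # instructions/J: lower-frequency in-order CPU
fast_nj = 1e9 / fast_ipj     # nanojoules per instruction, fast CPU
wimpy_nj = 1e9 / wimpy_ipj   # nanojoules per instruction, wimpy CPU
print(fast_nj, wimpy_nj, wimpy_ipj / fast_ipj)  # 10.0 1.0 10.0
```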
Worse yet, running fast processors below their full capacity
draws a disproportionate amount of power.
Dynamic power scaling on traditional systems is surprisingly inefficient: A primary energy-saving benefit of
dynamic voltage and frequency scaling (DVFS) was its ability to reduce voltage as it reduced frequency, but modern
CPUs already operate near minimum voltage at the highest
frequencies.
102 Communications of the ACM | July 2011 | Vol. 54 | No. 7
Even if processor energy were completely proportional
to load, non-CPU components such as memory, motherboards, and power supplies have begun to dominate energy
consumption,2 requiring that all components be scaled back
with demand. As a result, a computer may consume over 50%
of its peak power when running at only 20% of its capacity.20
Despite improved power scaling technology, systems remain
most energy efficient when operating at peak utilization.
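The efficiency penalty of low utilization follows directly from these numbers: work scales with load, but power does not. A minimal illustration using the figures above:

```python
# Energy per unit of work at low load vs. peak, using the text's
# illustrative figures (>= 50% of peak power at 20% of capacity).
power_at_low_load = 0.5   # fraction of peak power drawn
work_at_low_load = 0.2    # fraction of peak throughput delivered
relative_energy_per_op = power_at_low_load / work_at_low_load
print(relative_energy_per_op)  # 2.5: each operation costs 2.5x the energy
```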
A promising path to energy proportionality is turning
machines off entirely.6 Unfortunately, these techniques do
not apply well to FAWN-KV’s target workloads: Key-value
systems must often meet service-level agreements for query
throughput and latency of hundreds of milliseconds; the
inter-arrival time and latency bounds of the requests prevent shutting machines down (and taking many seconds to
wake them up again) during low load.2
Finally, energy proportionality alone is not a panacea:
Systems should be both proportional and efficient at 100%
load. FAWN specifically addresses efficiency, and cluster techniques that improve proportionality should apply
universally.
3. DESIGN AND IMPLEMENTATION
We describe the design and implementation of the system
components from the bottom up: a brief overview of flash
storage (Section 3.2), the per-node FAWN-DS datastore
(Section 3.3), and the FAWN-KV cluster key-value lookup system (Section 3.4), including replication and consistency.
3.1. Design overview
Figure 1 gives an overview of the entire FAWN system.
Client requests enter the system at one of several front ends.
The front-end nodes forward the request to the back-end
FAWN-KV node responsible for serving that particular key.
The back-end node serves the request from its FAWN-DS
datastore and returns the result to the front end (which in
turn replies to the client). Writes proceed similarly.
The large number of back-end FAWN-KV storage nodes
is organized into a ring using consistent hashing. As in systems such as Chord,18 keys are mapped to the node that follows the key in the ring (its successor). To balance load and
reduce failover times, each physical node joins the ring as a
small number (V) of virtual nodes, each virtual node representing a virtual ID (“VID”) in the ring space. Each physical
node is thus responsible for V different (noncontiguous) key
ranges. The data associated with each virtual ID is stored on
flash using FAWN-DS.
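The ring organization above can be sketched in a few lines. This is a minimal illustration of consistent hashing with virtual nodes, not FAWN-KV's actual implementation; the hash function, 32-bit ring width, and V = 2 are all assumptions made for the example.

```python
import hashlib
from bisect import bisect_left

def ring_hash(s: str) -> int:
    """Map a string to a 32-bit position on the ring (illustrative choice)."""
    return int.from_bytes(hashlib.sha1(s.encode()).digest()[:4], "big")

class Ring:
    """Consistent-hashing ring: each physical node joins as V virtual IDs."""
    def __init__(self, v: int = 2):
        self.v = v
        self.vids = []  # sorted (position, physical node) pairs

    def join(self, node: str):
        for i in range(self.v):
            # Each VID is a distinct pseudo-random point on the ring.
            self.vids.append((ring_hash(f"{node}/vid{i}"), node))
        self.vids.sort()

    def successor(self, key: str) -> str:
        """Owner of a key: the first VID at or clockwise after the key's hash."""
        i = bisect_left(self.vids, (ring_hash(key), ""))
        return self.vids[i % len(self.vids)][1]  # wrap around the ring
```

Because each physical node appears at V pseudo-random positions, it owns V noncontiguous key ranges; when a node fails, its ranges fall to several different successors rather than a single neighbor, spreading the failover load.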
Figure 1. FAWN-KV architecture. [Figure: client requests and responses pass through a switch to front-end nodes, which forward them to back-end FAWN-DS nodes; the back ends hold virtual IDs A1, A2, B1, B2, D1, D2, E1, E2, F1, and F2 on the ring.]