in Figure 5 and occurs at p = 95 nodes. The measured values of the speedup differ from the original USL prediction, not because the USL is wrong but because there is now more information available than previously. Moreover, this confirms the key USL prediction that superlinear speedup reaches a maximum value and then rapidly declines into the payback region.
Based on USL analysis, the scalability curve is expected to cross the linear bound at p× nodes, given by equation 3:

p× = |σ| / κ    (3)

For the dashed curve in Figure 7, the crossover occurs at p× = 65 nodes, whereas for the solid curve it occurs at p× = 99 nodes. As with the Smax prediction, the difference between the two p× values comes from the difference in the amount of information contained in the two sets of measurements.
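Equation 3 and the USL itself are easy to check numerically. The following Python sketch does so with placeholder coefficients chosen only for illustration; they are not the fitted values behind Figure 7.

def usl_speedup(p, sigma, kappa):
    # Universal Scalability Law: S(p) = p / (1 + sigma*(p-1) + kappa*p*(p-1)).
    # A negative sigma corresponds to the superlinear (capacity-boost) regime.
    return p / (1.0 + sigma * (p - 1) + kappa * p * (p - 1))

def crossover(sigma, kappa):
    # Equation 3: the node count where a superlinear curve recrosses the
    # linear bound, p_x = |sigma| / kappa (meaningful only when sigma < 0).
    return abs(sigma) / kappa

sigma, kappa = -0.0288, 0.000447       # placeholder USL coefficients
print(crossover(sigma, kappa))         # ~64.4 nodes for these values
print(usl_speedup(60, sigma, kappa))   # ~67.9: still superlinear at p = 60
print(usl_speedup(70, sigma, kappa))   # ~59.7: payback region at p = 70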
Hunting the Superlinear Snark
After the TeraSort data was validated against the USL model, a deeper performance analysis was needed to determine the cause of superlinearity. Let’s start with a closer examination of the actual runtime measurements for each EC2 cluster configuration.
Runtime data analysis. To make a statistical determination of the error in the runtime measurements, we performed some runs with a dozen repetitions per node configuration. From that sample size a reasonable estimate of the uncertainty can be calculated based on the standard error, or the relative error derived from it.
For each of the runtimes in Table 2, the number before the ± sign is the sample mean, while the error term following the ± sign is derived from the sample variance. The relative error (r.e.) is the ratio of the standard error to the mean value, reported as a percentage.
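For concreteness, this is how the Table 2 statistics are computed. The runtimes list in the following Python sketch is made up for illustration, standing in for the dozen measured repetitions of one cluster configuration.

import math
import statistics

# Hypothetical runtimes (seconds) for one cluster configuration.
runtimes = [1120, 1450, 1210, 1580, 1190, 1480,
            1260, 1530, 1150, 1400, 1340, 1275]

mean = statistics.mean(runtimes)
std_err = statistics.stdev(runtimes) / math.sqrt(len(runtimes))  # standard error
rel_err = 100.0 * std_err / mean                                 # r.e. in percent

print(f"{mean:.0f} ± {std_err:.0f} s (r.e. {rel_err:.1f}%)")
# prints something like: 1332 ± 45 s (r.e. 3.3%)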
larger, this latent IO constraint becomes less severe since the amount
of data per node that must be written
to disk is reduced in proportion to the
number of nodes. Successive cluster
sizes therefore exhibit runtimes that
are shorter than the single-node case,
and that results in the superlinear
speedup values shown in Figure 8a.
Conversely, although BigDisk has
a smaller amount of physical mem-
ory per node, it has quad disks per
A surprising hypothesis. The physical cluster configuration used by the Yahoo benchmark team comprised nodes with two quad-core Xeon processors (that is, a total of eight cores per node) and four SATA disks. This is very similar to the BigDisk EC2 configuration in Table 1. We therefore repeated the TeraSort scalability measurements on the BigDisk cluster. The results for cluster sizes of p = 2, 3, 5, and 10 are compared in Figure 8.
Consistent with Figure 5, the BigMem speedup values in Figure 8a are superlinear, whereas the BigDisk nodes in Figure 8b unexpectedly exhibit speedup values that are either linear or sublinear. The superlinear effect has essentially been eliminated by increasing the number of local spindles from one to four per cluster node. In other words, increasing nodal IO bandwidth leads to the counterintuitive result that scalability is degraded from superlinear to sublinear.
In an attempt to explain why the superlinear effect has diminished, we formed a working hypothesis by identifying the key performance differences between BigMem and BigDisk. BigMem has the larger memory configuration, which possibly provides more CentOS buffer caching for the TeraSort data; that could be thought of as the source of the capacity boost associated with the negative USL contention coefficient. Incremental memory growth in proportion to cluster size is a common explanation for superlinear speedup.
Increasing memory size, however, is probably not the source of the capacity boost in Hadoop TeraSort. If the buffer cache fills to the point where it needs to be written to disk, the write will take longer because there is only a single local disk per node on BigMem. A single-disk DataNode in Figure 3 implies that all disk IO is serialized. In this sense, when disk writes (including replications) occur, TeraSort is IO bound, most particularly in the single-node case. As the cluster configuration gets larger, this latent IO constraint becomes less severe because the amount of data per node that must be written to disk is reduced in proportion to the number of nodes. Successive cluster sizes therefore exhibit runtimes that are shorter than linear scaling of the single-node case would predict, and that results in the superlinear speedup values shown in Figure 8a.
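This working hypothesis can be made concrete with a toy model: each node does CPU work on its share of the data, plus a disk-spill term for whatever overflows the buffer cache, serialized across the node's local disks. The following Python sketch is for illustration only; every parameter value is an assumption, not a measurement.

def model_runtime(p, disks, data=1e12, cache=2e11,
                  cpu_rate=2e8, disk_rate=1e8):
    # Each of p nodes handles data/p bytes at cpu_rate bytes/sec; anything
    # beyond the buffer cache spills to disk, striped over `disks` spindles
    # at disk_rate bytes/sec each. All parameter values are assumptions.
    per_node = data / p
    spill = max(0.0, per_node - cache)
    return per_node / cpu_rate + spill / (disks * disk_rate)

for disks in (1, 4):
    speedups = [model_runtime(1, disks) / model_runtime(p, disks)
                for p in (2, 3, 5, 10)]
    print(disks, [round(s, 1) for s in speedups])
# 1 disk : [2.4, 4.3, 13.0, 26.0]  -> strongly superlinear
# 4 disks: [2.2, 3.5, 7.0, 14.0]   -> spill penalty cut 4x, boost much weaker

The model captures only the direction of the effect: quadrupling the spindles shrinks the single-node spill penalty and, with it, the superlinear boost. The actual BigDisk measurements go further, collapsing to linear or sublinear speedup.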
Conversely, although BigDisk has a smaller amount of physical memory per node, it has quad disks per DataNode.
Figure 8. Hadoop TeraSort superlinearity is eliminated by increasing nodal disk IO bandwidth.