constraints), then any of these steps
could be helpful in ameliorating superlinear effects.
The large number of controlled measurements performed by running
Hadoop TeraSort on Amazon EC2 exposed the underlying cause of superlinearity that would otherwise be difficult to resolve in the field. Fitting our
speedup data to the USL performance
model produced a negative contention coefficient as a telltale sign of superlinearity on BigMem clusters.
The subtractive effect of negative σ
introduces a point of inflection in the
convex superlinear curve that causes
it ultimately to become concave, thus
crossing over the linear bound at p× in
equation 3. At that point, Hadoop TeraSort superlinear scalability returns to
being sublinear in the payback region.
The cluster size p× provides an estimate of the minimal node capacity needed to ameliorate superlinear speedup on BigMem clusters.
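The crossover can be read directly from the USL formula. A minimal sketch in Python, using illustrative coefficients rather than the fitted TeraSort values: with negative σ the denominator of the USL dips below 1, and solving S(p) = p shows the curve recrossing the linear bound at p× = -σ/κ.

```python
def usl_speedup(p, sigma, kappa):
    """USL speedup model: S(p) = p / (1 + sigma*(p-1) + kappa*p*(p-1))."""
    return p / (1 + sigma * (p - 1) + kappa * p * (p - 1))

# Illustrative coefficients only (not the measured TeraSort fit):
sigma, kappa = -0.05, 0.001

# Setting S(p) = p gives (p - 1) * (sigma + kappa*p) = 0, so the
# superlinear curve returns to the linear bound at p_x = -sigma/kappa.
p_x = -sigma / kappa
print(p_x)  # 50.0

for p in (2, 10, 50, 100):
    s = usl_speedup(p, sigma, kappa)
    print(p, round(s, 2), "superlinear" if s > p else "at/below linear")
```

Beyond p×, the quadratic coherency term κp(p-1) dominates the negative contention term and the speedup falls back into the sublinear payback region.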
Although superlinearity is a bona
fide phenomenon, just like perpetual
motion it is ultimately a performance
illusion. For TeraSort on BigMem, the
apparent capacity boost can be traced
to successively relaxing the latent IO
bandwidth constraint per node as
the cluster size grows. This IO bottleneck induces stochastic failures of
the HDFS pipeline in the Reduce task.
That causes the Hadoop framework
to restart the Reduce task file-write,
which stretches the measured runtimes. If runtime stretching is greatest for T1, then successive speedup
measurements will be superlinear.
Increasing the IO bandwidth per
node, as we did with BigDisk clusters,
diminishes or eliminates superlinear
speedup by reducing T1 stretching.
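The T1-stretching mechanism can be illustrated with a toy calculation (hypothetical runtimes, not our measurements): if pipeline restarts inflate only the single-node runtime, every computed speedup is inflated by the same factor and therefore appears superlinear.

```python
# Toy illustration of T1 stretching (hypothetical runtimes, not measured data).
true_T1 = 100.0           # single-node runtime without restart overhead
stretch = 1.5             # inflation of T1 caused by HDFS pipeline restarts
measured_T1 = true_T1 * stretch

for p in (2, 4, 8):
    Tp = true_T1 / p                # assume the p-node runs scale linearly
    speedup = measured_T1 / Tp      # = stretch * p, i.e., superlinear
    print(p, speedup)               # 3.0, 6.0, 12.0
```

Conversely, shrinking the stretch factor toward 1, as the BigDisk configuration does, collapses the measured speedups back onto the linear bound.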
This USL analysis suggests superlinear scalability is not peculiar to TeraSort on Hadoop but may arise with any MapReduce application. Superlinear speedup has also been observed in relational database systems.2,12 For high-performance computing applications, however, superlinear speedup may arise differently from the explanation presented here.4,14,20
Superlinearity aside, the more important takeaway for many readers may be the following. Unlike most software-engineering projects, Hadoop applications require only a fixed development effort. Once an application is demonstrated to work on a small cluster, the Hadoop framework facilitates scaling it out to an arbitrarily large number of nodes with no additional effort. For many MapReduce applications, scale-out may be driven more by the need for disk storage than compute power, as the growth in data volume necessitates more Maps. The unfortunate term flat scalability has been used to describe this fixed development effort.
Although flat scalability may be a
reasonable assumption for the initial
development process, it does not guarantee that performance goals—such
as batch windows, traffic capacity, or
service-level objectives—will be met
without significant additional effort.
The unstated assumption behind the
flat-scalability precept is that Hadoop
applications scale linearly (Figure
2a) or near-linearly (Figure 2b). Any
shuffle-exchange processing, however, will induce a peak somewhere
in the scalability profile (Figure 2d).
The Hadoop cluster size at which the
peak occurs can be predicted by applying the USL to small-cluster measurements. The performance-engineering
effort needed to temper that peak will typically far exceed what the flat-scalability precept assumes. As this article has endeavored to show, the USL provides a
valuable tool for the software engineer
to analyze Hadoop scalability.
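The peak prediction follows directly from the fitted USL coefficients. A minimal sketch, with hypothetical coefficients standing in for a small-cluster fit: when both σ and κ are positive, the speedup maximum lies near p* = sqrt((1 - σ)/κ).

```python
import math

def usl_peak(sigma, kappa):
    """Cluster size that maximizes USL speedup: p* = sqrt((1 - sigma)/kappa)."""
    return math.sqrt((1 - sigma) / kappa)

# Hypothetical coefficients standing in for a small-cluster USL fit:
print(usl_peak(0.02, 0.0001))  # ~99 nodes
```

Scaling a cluster much beyond the predicted p* buys storage but no additional throughput, which is exactly the trade-off the performance engineer needs to surface before procurement.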
We thank Comcast Corporation for
supporting the acquisition of Hadoop
data used in this work.
1. Apache Whirr; https://whirr.apache.org.
2. Calvert, C. and Kulkarni, D. Essential LINQ. Pearson Education, Boston, MA, 2009.
3. Cloudera Hadoop; http://www.cloudera.com/content/
4. Eijkhout, V. Introduction to High Performance
Scientific Computing. Lulu.com, 2014.
5. Feynman, R. P. Papp perpetual motion engine; http://
6. Gunther, N.J. A simple capacity model of massively
parallel transaction systems. In Proceedings of
International Computer Measurement Group
7. Gunther, N.J. A general theory of computational scalability based on rational functions, 2008;
8. Gunther, N.J. Guerrilla Capacity Planning: A Tactical Approach to Planning for Highly Scalable Applications and Services. Springer, New York, NY, 2007.
9. Gunther, N.J. Performance and scalability models for a hypergrowth e-commerce Web site. Performance Engineering. R.R. Dumke, C. Rautenstrauch, A. Schmietendorf, and A. Scholz, eds. Lecture Notes in Computer Science 2047. Springer-Verlag, 2001, 267-282.
10. Gunther, N.J. PostgreSQL scalability analysis
deconstructed. The Pith of Performance, 2012; http://
11. Gunther, N.J., Subramanyam, S. and Parvu, S. Hidden scalability gotchas in memcached and friends. Velocity Web Performance and Operations
12. Haas, R. Scalability, in graphical form, analyzed, 2011;
13. Hadoop Log Tools; https://github.com/melrief/Hadoop-Log-Tools.
14. Hennessy, J.L. and Patterson, D.A. Computer
Architecture: A Quantitative Approach. Second edition.
Morgan Kaufmann, Waltham, MA, 1996.
15. O’Malley, O. TeraByte sort on Apache Hadoop, 2008;
16. O’Malley, O. and Murthy, A.C. Winning a 60-second dash with a yellow elephant, 2009; http://sortbenchmark.
17. Performance Dynamics Company. How to quantify
scalability, 2014; http://www.perfdynamics.com/
18. Schwartz, B. Is VoltDB really as scalable as they
claim? Percona MySQL Performance Blog; http://
19. sFlow. SDN analytics and control using sFlow
standard — Superlinear; http://blog.sflow.
20. Stackoverflow. Where does superlinear speedup come from?
21. Sun Fire X2270 M2 superlinear scaling of Hadoop
TeraSort and CloudBurst benchmarks, 2010; https://
22. Sutter, H. Going superlinear. Dr. Dobb’s J. 33,
3 (2008); http://www.drdobbs.com/cpp/going-
23. Sutter, H. Super linearity and the bigger machine.
Dr. Dobb’s J. 33, 4 (2008); http://www.drdobbs.
24. White, T. Hadoop: The Definitive Guide, third edition.
O’Reilly Media, 2012.
25. Yahoo! Hadoop Tutorial; https://developer.yahoo.com/
Neil J. Gunther (http://perfdynamics.blogspot.com; tweets as @DrOz) is a researcher and teacher at Performance Dynamics, where he developed the USL and the PDQ open source performance analyzer.
Paul Puglia (firstname.lastname@example.org) has been working in IT for more than 20 years doing Python programming, system administration, and performance testing. He has authored an R package, SATK, for fitting performance data to the USL, and contributed to the PDQ open source performance analyzer.
Kristofer Tomasette (email@example.com) is a senior software engineer on the Platforms & APIs team at Comcast Corporation. He has built software systems involving warehouse management, online banking, telecom, and most recently cable TV.
Copyright held by authors.
Publication rights licensed to ACM. $15.00.