performed on, the platform organizes the computation around a set of programming abstractions substantially different from those of the normal desktop environment. Analysts trained on the desktop environment have to learn these new abstractions and plan their computation around them, often facing a new set of engineering trade-offs and failure modes.
An entirely new part of designing a data analysis for the cloud is planning for the economic impact of design choices. With cloud computing, nearly every choice about computation, uploading/downloading data, and storage has a direct dollar cost. In addition, each choice affects how long a job will take to execute. Planning and monitoring these costs is unfamiliar and poorly supported for end users, and mistakes can be quite expensive. Many of these decisions must be made before the first byte is uploaded to the cloud and before the first line of code is written.
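To make the economics concrete, here is a minimal back-of-envelope sketch in Python. The function name and all unit prices are hypothetical assumptions for illustration; real rates vary by provider, region, and service tier.

    # Hypothetical unit prices -- real rates vary by provider and region.
    PRICE_PER_VM_HOUR = 0.50      # dollars per VM per hour
    PRICE_PER_GB_TRANSFER = 0.10  # dollars per GB uploaded or downloaded
    PRICE_PER_GB_MONTH = 0.05     # dollars per GB stored per month

    def estimate_job_cost(num_vms, hours, gb_transferred, gb_stored, months_stored):
        """Rough dollar cost of one analysis job: compute + transfer + storage."""
        compute = num_vms * hours * PRICE_PER_VM_HOUR
        transfer = gb_transferred * PRICE_PER_GB_TRANSFER
        storage = gb_stored * months_stored * PRICE_PER_GB_MONTH
        return compute + transfer + storage

    # Example: 8 VMs for 6 hours, 200 GB moved in and out,
    # 500 GB kept in cloud storage for 2 months.
    print(estimate_job_cost(8, 6, 200, 500, 2))  # -> 94.0

Even this crude model makes the point: storage that lingers for months can dominate the cost of the computation itself.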
Cloud platforms offer a broad selection of VMs; for more computation, you simply pay more, buying either more machines or larger ones. Doubling the memory or processor speed of a machine, however, does not double the speed of a computation; scaling up can impose non-linear costs as communication overhead, storage, and other aspects change. For example, in certain systems, a developer who rents a larger-scale VM gets access to lower levels of the machine and better guarantees of performance; smaller VMs do not get this access.
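Why scaling does not pay off linearly can be seen with a toy runtime model: a fixed serial portion, a parallel portion that shrinks as machines are added, and a coordination overhead that grows with them. All constants below are illustrative assumptions, not measurements of any real system.

    # Toy model: runtime on n VMs = serial part + parallel part / n
    # + communication overhead that grows with the number of VMs.
    def runtime_hours(n_vms, serial=1.0, parallel=16.0, comm_per_vm=0.2):
        return serial + parallel / n_vms + comm_per_vm * n_vms

    for n in (1, 2, 4, 8, 16):
        t = runtime_hours(n)
        print(n, round(t, 2), round(runtime_hours(1) / t, 2))  # VMs, hours, speedup

In this model, going from 1 VM to 16 yields only about a 3.3x speedup, and the speedup actually peaks near 8 VMs before communication overhead starts to erase the gains.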
There is no support for estimating the cost or the duration of a computation before performing it. Programmers end up iteratively re-running their application to adjust the number of VMs, the size of queues, and so on, incurring ever larger bills while empirically searching for the right time-cost balance point. Worse, VMs in a shared cloud may not even have stable performance characteristics: the same experiment, repeated, can run at different speeds.
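In the absence of platform support, a cheap (if crude) alternative to repeated paid runs is to sweep an analytic model of the job offline. The sketch below combines the toy runtime model above with a hypothetical hourly price to tabulate the time-cost trade-off; it is an assumption-laden stand-in for real tooling, and any model like it still needs validation against at least one real run.

    # Reuses the toy runtime model; the price is again a hypothetical rate.
    PRICE_PER_VM_HOUR = 0.50

    def runtime_hours(n_vms, serial=1.0, parallel=16.0, comm_per_vm=0.2):
        return serial + parallel / n_vms + comm_per_vm * n_vms

    def job_cost(n_vms):
        # Total bill = number of machines x wall-clock hours x hourly rate.
        return n_vms * runtime_hours(n_vms) * PRICE_PER_VM_HOUR

    # Tabulate the trade-off offline instead of paying for repeated real runs.
    for n in (1, 2, 4, 8, 16):
        print(n, round(runtime_hours(n), 2), round(job_cost(n), 2))

Under these assumptions, 4 VMs cut the runtime from 17.2 to 5.8 hours for a modest increase in cost, while 16 VMs are both slower and far more expensive than 8, which is exactly the kind of balance point that is hard to find by trial and error on a metered platform.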