then eliminating random look-ups
in the user table can improve performance greatly. Although this inevitably requires much more storage and,
more importantly, more data to be
read from disk in the course of the
analysis, the advantage gained by doing all data access in sequential order
is often enormous.
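As a rough illustration (a minimal Python sketch; the file name "events_denorm.csv" and the "country" column are invented for the example), once the relevant user attributes have been copied redundantly into every event record, the analysis itself reduces to a single front-to-back scan with no per-record look-ups:

```python
import csv
from collections import Counter

# Minimal sketch, assuming a hypothetical file "events_denorm.csv" in which
# the relevant user attributes (here a "country" column) have already been
# copied into every event record. The file is bigger than its normalized
# counterpart, but the analysis is now one strictly sequential scan, with no
# random look-ups into a separate user table.

def events_per_country(path="events_denorm.csv"):
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):   # read front to back, one row at a time
            counts[row["country"]] += 1
    return counts
```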
Hard Limits
Another major challenge for data
analysis is exemplified by applications with hard limits on the size of
data they can handle. Here, one is
dealing mostly with the end-user analytical applications that constitute
the last stage in analysis. Occasionally the limits are relatively arbitrary;
consider the 256-column, 65,536-row
bound on worksheet size in all versions of Microsoft Excel prior to the
most recent one. Such a limit might
have seemed reasonable in the days
when main RAM was measured in
megabytes, but it was clearly obsolete
by 2007 when Microsoft updated Excel to accommodate up to 16,384 columns and one million rows. Enough
for anyone? Excel is not targeted at users crunching truly huge datasets, but
the fact remains that anyone working
with a one million-row dataset (a list
of customers along with their total
purchases for a large chain store, perhaps) is likely to face a two million-row dataset sooner or later, and Excel
has placed itself out of the running
for the job.
In designing applications to handle
ever-increasing amounts of data, developers would do well to remember
that hardware specs are improving
too, and keep in mind the so-called
ZOI (zero-one-infinity) rule, which
states that a program should “allow
none of foo, one of foo, or any number
of foo.”11 That is, limits should not be
arbitrary; ideally, one should be able
to do as much with software as the
hardware platform allows.
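A toy sketch of the difference (the 65,536 figure and the function names are invented purely for illustration):

```python
# Toy illustration of the ZOI rule; MAX_ROWS and the function names are made up.
MAX_ROWS = 65536

def load_rows_limited(path):
    # An arbitrary ceiling baked into the code -- the kind of limit ZOI warns against.
    rows = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= MAX_ROWS:
                raise ValueError(f"more than {MAX_ROWS} rows not supported")
            rows.append(line.rstrip("\n"))
    return rows

def load_rows_zoi(path):
    # Zero, one, or any number of rows: the only bound is what the hardware
    # (here, available memory) can actually hold.
    with open(path) as f:
        return [line.rstrip("\n") for line in f]
```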
Of course, hardware—chiefly
memory and CPU limitations—is often a major factor in software limits
on dataset size. Many applications are
designed to read entire datasets into
memory and work with them there;
a good example of this is the popular
statistical computing environment
R.7 Memory-bound applications naturally
exhibit higher performance than
disk-bound ones (at least insofar as
the data-crunching they carry out advances beyond single-pass, purely sequential processing), but requiring
all data to fit in memory means that
if you have a dataset larger than your
installed RAM, you’re out of luck. On
most hardware platforms, there’s a
much harder limit on memory expansion than disk expansion: the motherboard has only so many slots to fill.
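The alternative is to keep the computation single-pass and sequential, so that only a handful of records are ever in memory at once and dataset size is bounded by disk rather than by installed RAM. A minimal sketch (file and column names are hypothetical):

```python
import csv

# A single-pass, disk-bound aggregation: never holds more than one record in
# memory, so the dataset can be arbitrarily larger than RAM.

def column_mean(path="measurements.csv", column="value"):
    total, count = 0.0, 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):   # one row at a time, in file order
            total += float(row[column])
            count += 1
    return total / count if count else float("nan")
```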
The problem often goes further
than this, however. Like most other
aspects of computer hardware, maximum memory capacities increase with
time; 32GB is no longer a rare configuration for a desktop workstation,
and servers are frequently configured
with far more than that. There is no
guarantee, however, that a memory-bound application will be able to use
all installed RAM. Even under modern
64-bit operating systems, many applications today (for example, R under
Windows) have only 32-bit executables and are limited to 4GB address
spaces—this often translates into a 2-
or 3GB working set limitation.
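The pointer width of a running process is easy to check; a small sketch, using Python purely as a stand-in for any interpreter:

```python
import struct
import sys

# A 32-bit build reports 4-byte pointers and is confined to a 4GB address
# space no matter how much RAM the machine has installed.

print(f"pointer size: {struct.calcsize('P') * 8}-bit")
print(f"sys.maxsize:  {sys.maxsize:,}")   # 2,147,483,647 on a 32-bit build
```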
Finally, even where a 64-bit binary
is available—removing the absolute
address space limitation—all too often relics from the age of 32-bit code
still pervade software, particularly in
the use of 32-bit integers to index array elements. Thus, for example, 64-bit
versions of R (available for Linux and
Mac) use signed 32-bit integers to represent lengths, limiting data frames
to at most 2³¹–1, or about two billion
rows. Even on a 64-bit system with sufficient RAM to hold the data, therefore,
a 6.75-billion-row dataset such as the
earlier world census example ends up
being too big for R to handle.
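The arithmetic is easy to check:

```python
# Back-of-the-envelope check of the indexing limit described above.
max_len_32bit = 2**31 - 1            # largest signed 32-bit integer
world_census_rows = 6_750_000_000    # roughly one row per person alive

print(f"32-bit length limit: {max_len_32bit:,} rows")       # 2,147,483,647
print(f"world census:        {world_census_rows:,} rows")
print(f"over the limit by a factor of {world_census_rows / max_len_32bit:.1f}")  # ~3.1
```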
Distributed Computing as a Strategy for Big Data
Any given computer has a series of absolute and practical limits: memory
size, disk size, processor speed, and
so on. When one of these limits is exhausted, we lean on the next one, but
at a performance cost: an in-memory
database is faster than an on-disk one,
but a PC with 2GB RAM cannot store a
100GB dataset entirely in memory; a
server with 128GB RAM can, but the
data may well grow to 200GB before
the next generation of servers with