to be fewer in number than the total
world population. Even in scientific
datasets, a practical limit on cardinalities is often set by such factors as the
number of available sensors (a state-of-the-art neurophysiology dataset,
for example, might reflect 512 channels of recording5) or simply the number of distinct entities that humans
have been able to detect and identify
(the largest astronomical catalogs,
for example, include several hundred
million objects8).
What makes most big data big is
repeated observations over time and/
or space. The Web log records millions of visits a day to a handful of
pages; the cellphone database stores
time and location every 15 seconds for
each of a few million phones; the retailer has thousands of stores, tens of
thousands of products, and millions
of customers but logs billions and
billions of individual transactions in
a year. Scientific measurements are
often made at a high time resolution
(thousands of samples a second in
neurophysiology, far more in particle
physics) and really start to get huge
when they involve two or three dimensions of space as well; fMRI neuroimaging studies can generate hundreds
or even thousands of gigabytes in a
single experiment. Imaging in general is the source of some of the biggest
big data out there, but the problems
of large image data are a topic for an
article by themselves; I won’t consider
them further here.
The fact that most large datasets
have inherent temporal or spatial
dimensions, or both, is crucial to
understanding one important way
that big data can cause performance
problems, especially when databases
are involved. It would seem intuitively
obvious that data with a time dimension, for example, should in most
cases be stored and processed with
at least a partial temporal ordering to
preserve locality of reference as much
as possible when data is consumed in
time order. After all, most nontrivial
analyses will involve at the very least
an aggregation of observations over
one or more contiguous time intervals. One is more likely, for example,
to be looking at the purchases of a
randomly selected set of customers
over a particular time period than of
a “contiguous range” of customers
(however defined) at a randomly selected set of times.
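To make the locality argument concrete, here is a minimal sketch in Python, with invented record layouts and sizes and an in-memory list standing in for real storage: when records are kept sorted by timestamp, the common query shape, a set of customers over a contiguous time period, reduces to a single contiguous slice that can be scanned sequentially.

```python
import bisect
from datetime import datetime, timedelta

# Hypothetical transaction log, kept sorted by timestamp (time-ordered storage).
# Each record is (timestamp, customer_id, amount).
start = datetime(2009, 1, 1)
transactions = [
    (start + timedelta(minutes=i), i % 1000, float(i % 50))
    for i in range(100_000)
]

def purchases_in_period(records, customers, t_begin, t_end):
    """Sum purchases for a set of customers over a contiguous time interval.

    Because records are sorted by timestamp, binary search finds the
    interval's boundaries and the scan touches one contiguous block of
    storage: good locality of reference.
    """
    timestamps = [r[0] for r in records]   # a real system keeps this as an index or column
    lo = bisect.bisect_left(timestamps, t_begin)
    hi = bisect.bisect_right(timestamps, t_end)
    return sum(amount
               for ts, cust, amount in records[lo:hi]
               if cust in customers)

total = purchases_in_period(
    transactions,
    customers={17, 42, 99},
    t_begin=start + timedelta(days=10),
    t_end=start + timedelta(days=20),
)
print(total)
```

The reverse query shape, a "range" of customers at scattered points in time, would touch small pieces of storage all over the file; that asymmetry is exactly why the physical ordering matters.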
The point is even clearer when we
consider the demands of time-series
analysis and forecasting, which aggregate data in an order-dependent
manner (for example, cumulative and
moving-window functions, lead and
lag operators, among others). Such
analyses are necessary for answering
most of the truly interesting questions
about temporal data, broadly: “What
happened?” “Why did it happen?”
“What’s going to happen next?”
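To illustrate what "order-dependent" means in practice, here is a small sketch in plain Python over invented data: a cumulative sum, a three-point moving average, and a one-step lag. Each output value depends on a row's position relative to its neighbors, so the computation is meaningful only once an ordering of the data has been fixed.

```python
# Order-dependent aggregations over a time-ordered series (invented data).
values = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0]   # one observation per time step

# Cumulative sum: each output depends on all earlier rows.
cumulative = []
running = 0.0
for v in values:
    running += v
    cumulative.append(running)

# Moving-window average (window of 3): depends on a row's neighbors.
window = 3
moving_avg = [
    sum(values[i - window + 1 : i + 1]) / window
    for i in range(window - 1, len(values))
]

# Lag operator: pairs each row with the previous row's value.
lagged = [None] + values[:-1]

print(cumulative)   # [3.0, 8.0, 12.0, 18.0, 26.0, 33.0]
print(moving_avg)   # [4.0, 5.0, 6.0, 7.0]
print(lagged)       # [None, 3.0, 5.0, 4.0, 6.0, 8.0]
```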
The prevailing database model
today, however, is the relational database, and this model explicitly ignores the ordering of rows in tables.1
Database implementations that follow this model, eschewing the idea of
an inherent order on tables, will inevitably end up retrieving data in a nonsequential fashion once it grows large
enough that it no longer fits in memory. As the total amount of data stored
in the database grows, the problem
only becomes more significant. To
achieve acceptable performance for
highly order-dependent queries on truly
large data, one must be willing to consider abandoning the purely relational
database model for one that recognizes the concept of inherent ordering
of data down to the implementation
level. Fortunately, this point is slowly
starting to be recognized in the analytical database sphere.
Not only in databases, but also in
application programming in general,
big data greatly magnifies the performance impact of suboptimal access patterns. As dataset sizes grow,
it becomes increasingly important to
choose algorithms that exploit the efficiency of sequential access as much
as possible at all stages of processing. Aside from the obvious point that
a 10:1 increase in processing time
(which could easily result from a high
proportion of nonsequential accesses) is far more painful when the units
are hours than when they are seconds,
increasing data sizes mean that data
access becomes less and less efficient.
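A small, admittedly machine- and runtime-dependent sketch gives a feel for the effect: the loop below touches the same array once in sequential order and once in random order. In an interpreted language the gap is muted by interpreter overhead; in compiled code over a flat array, the random traversal is commonly several times slower once the working set outgrows the processor cache, and the ratio grows again when the data spills out of memory.

```python
import random
import time

# Touch the same data sequentially and in random order; sizes are arbitrary.
n = 10_000_000
data = list(range(n))

sequential_order = list(range(n))
random_order = list(range(n))
random.shuffle(random_order)

def traverse(indices):
    total = 0
    for i in indices:
        total += data[i]
    return total

t0 = time.perf_counter()
traverse(sequential_order)
t1 = time.perf_counter()
traverse(random_order)
t2 = time.perf_counter()
print(f"sequential: {t1 - t0:.2f}s  random: {t2 - t1:.2f}s")
```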
The penalty for inefficient access patterns increases disproportionately
as the limits of successive stages of
hardware are exhausted: from processor cache to memory, memory to local