the distributions than to embark on a
full-table sort.
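By way of illustration, here is a minimal sketch of that distribution-first approach, written in Python against SQLite rather than PostgreSQL; the people table, its columns, and the choice of a median as the target statistic are illustrative assumptions, not details taken from the experiment.

# Build a compact frequency distribution with GROUP BY, then derive a
# summary statistic from the counts, rather than sorting every row.
# The table and data here are invented stand-ins.
import sqlite3
from collections import defaultdict

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE people (country TEXT, age INTEGER);
    INSERT INTO people VALUES ('US', 34), ('US', 34), ('US', 71),
                              ('JP', 45), ('JP', 52), ('JP', 52);
""")

# One row per distinct (country, age) pair -- a few thousand rows at most,
# no matter how many billions of observations stand behind them.
dist = defaultdict(list)
for country, age, n in db.execute("""
        SELECT country, age, COUNT(*) FROM people
        GROUP BY country, age ORDER BY country, age"""):
    dist[country].append((age, n))

def median_from_counts(pairs):
    # Walk the cumulative counts until we pass the middle observation.
    total = sum(n for _, n in pairs)
    running = 0
    for value, n in pairs:
        running += n
        if running * 2 >= total:
            return value

print({c: median_from_counts(p) for c, p in dist.items()})

The GROUP BY result is tiny (one row per distinct country and age pair), so the statistic is computed from a small summary rather than from billions of sorted rows.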
PostgreSQL’s difficulty here was in analyzing the stored data, not in storing it. The database didn’t blink at loading or maintaining a table of a billion records; presumably there would have been no difficulty storing the entire 6.75-billion-row, 10-column table had I had sufficient free disk space.
Here’s the big truth about big data
in traditional databases: it’s easier to
get the data in than out. Most DBMSs
are designed for efficient transaction
processing: adding, updating, searching for, and retrieving small amounts
of information in a large database.
Data is typically acquired in a transactional fashion: imagine a user logging into a retail Web site (account
data is retrieved; session information
is added to a log), searching for products (product data is searched for and
retrieved; more session information
is acquired), and making a purchase
(details are inserted in an order database; user information is updated). A
fair amount of data has been added
effortlessly to a database that—if it’s
a large site that has been in operation
for a while—probably already constitutes “big data.”
There is no pathology here; this story is repeated in countless ways, every
second of the day, all over the world.
The trouble comes when we want to
take that accumulated data, collected
over months or years, and learn something from it—and naturally we want
the answer in seconds or minutes!
The pathologies of big data are primarily those of analysis. This may be a
slightly controversial assertion, but I
would argue that transaction processing and data storage are largely solved
problems. Short of LHC-scale science,
few enterprises generate data at such
a rate that acquiring and storing it
pose major challenges today.
In business applications, at least,
data warehousing is ordinarily regarded as the solution to the database
problem (data goes in but doesn’t
come out). A data warehouse has been
classically defined as “a copy of transaction data specifically structured for
query and analysis,”[4] and the general
approach is commonly understood
to be bulk extraction of the data from
an operational database, followed by
reconstitution in a different database
in a form that is more suitable for
analytical queries (the so-called “
extract, transform, load,” or sometimes
“extract, load, transform” process).
Merely saying, “We will build a data
warehouse” is not sufficient when
faced with a truly huge accumulation
of data.
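The mechanics of the extract, transform, load flow are easy to sketch; the hard part, as discussed below, is making the resulting analytical store answer interesting questions at scale. The following Python sketch uses in-memory SQLite databases and an invented retail schema purely for illustration.

import sqlite3

# Tiny stand-in for the operational (OLTP) database.
oltp = sqlite3.connect(":memory:")
oltp.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT,
                         product_id INTEGER, quantity INTEGER);
    INSERT INTO products VALUES (42, 'kettle', 19.99);
    INSERT INTO orders VALUES (1, '2009-07-01', 42, 2);
""")

# Separate analytical database (the "warehouse").
warehouse = sqlite3.connect(":memory:")
warehouse.execute("""
    CREATE TABLE sales_fact (order_id INTEGER, order_date TEXT,
                             product_name TEXT, unit_price REAL,
                             quantity INTEGER, revenue REAL)
""")

# Extract: bulk-read from the transactional schema.
rows = oltp.execute("""
    SELECT o.id, o.order_date, p.name, p.price, o.quantity
    FROM orders o JOIN products p ON p.id = o.product_id
""").fetchall()

# Transform: denormalize and precompute the measure analysts will query.
facts = [(oid, date, name, price, qty, price * qty)
         for oid, date, name, price, qty in rows]

# Load: append to the warehouse in one batch.
warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?)", facts)
warehouse.commit()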
How must data be structured for
query and analysis, and how must
analytical databases and tools be designed to handle it efficiently? Big
data changes the answers to these
questions, as traditional techniques
such as RDBMS-based dimensional
modeling and cube-based OLAP (
online analytical processing) turn out
to be either too slow or too limited to
support asking the really interesting
questions about warehoused data.
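For concreteness, this is the kind of dimensional-model query the preceding paragraph has in mind: a fact table joined to small dimension tables and aggregated along their attributes. The star schema and rows below are invented; the point is only the shape of the query, which is the pattern that becomes too slow or too limited once the fact table reaches billions of rows.

import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INT, month INT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (date_key INT, product_key INT, revenue REAL);
    INSERT INTO dim_date    VALUES (1, 2009, 7), (2, 2009, 8);
    INSERT INTO dim_product VALUES (10, 'books'), (11, 'music');
    INSERT INTO fact_sales  VALUES (1, 10, 25.0), (1, 11, 9.5), (2, 10, 14.0);
""")

# The canonical analytical query: join the fact table to its dimensions,
# then group and aggregate along the dimension attributes.
query = """
    SELECT d.year, d.month, p.category, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key    = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.year, d.month, p.category
"""
for year, month, category, revenue in dw.execute(query):
    print(year, month, category, revenue)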
To understand how to avoid the pathologies of big data, whether in the
context of a data warehouse or in the
physical or social sciences, we need to
consider what really makes it “big.”
Dealing with Big Data
Data means “things given” in Latin—
although we tend to use it as a mass
noun in English, as if it denotes a
substance—and ultimately, almost
all useful data is given to us either
by nature, as a reward for careful observation of physical processes, or by
other people, usually inadvertently
(consider logs of Web hits or retail
transactions, both common sources
of big data). As a result, in the real
world, data is not just a big set of
random numbers; it tends to exhibit
predictable characteristics. For one
thing, as a rule, the largest cardinalities of most datasets—specifically,
the number of distinct entities about
which observations are made—are
small compared with the total number of observations.
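A toy example of that skew, in Python with made-up data: a million observations, but only a handful of distinct entities being observed.

import random

countries = ["US", "CN", "IN", "BR", "DE"]          # 5 distinct entities
observations = [(random.choice(countries), random.randint(0, 100))
                for _ in range(1_000_000)]          # one million observations

distinct_entities = len({country for country, _ in observations})
print(len(observations), "observations about", distinct_entities, "entities")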
This is hardly surprising. Human beings are making the observations, or being observed as the case
may be, and there are no more than
6.75 billion of them at the moment,
which sets a rather practical upper
bound. The objects about which we
collect data, if they are of the human
world—Web pages, stores, products,
accounts, securities, countries, cities,
houses, phones, IP addresses—tend