Figure 1. Calculating the median age by sex and country over the entire world population in a matter of minutes.
Record layout (16 bytes per record):
    7b   age
    1b   sex
    32b  income
    13b  ethnicity
    13b  language
    13b  religion
    1b   hat
    8b   country
    16b  place
    24b  locator

6.75 billion rows
To find median age by sex and country:

int age, sex, ctry;
int cnt[2][256][128];
int tot, acc;
byte r[16];
fill cnt with 0;
do
    read 16 bytes into r;
    age  = r[0] & 01111111b;
    sex  = (r[1] & 10000000b) >> 7;
    ctry = r[11] & 11111111b;
    cnt[sex][ctry][age] += 1;
until end of file;
for sex = 0 to 1 do
    for ctry = 0 to 255 do
        output ctry, sex;
        tot = sum of cnt[sex][ctry][0..127];
        acc = 0;
        for age = 0 to 127 do
            acc += cnt[sex][ctry][age];
            if (acc >= tot/2)
                output age;
                go to next ctry;
            end if;
        next age;
    next ctry;
next sex;
median age by using a counting strategy: simply create 65,536 buckets—
one for each combination of age, sex,
and country—and count how many
records fall into each. We find the
median age by determining, for each
sex and country group, the cumulative
count over the 128 age buckets: the
median is the bucket where the count
reaches half of the total. In my tests,
this algorithm was limited primarily
by the speed at which data could be
fetched from disk: a little over 15 minutes for one pass through the data at a typical 90MB/s sustained read speed,[9] shamefully underutilizing the CPU the whole time.
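In runnable form, the counting pass might look something like the following C sketch. It assumes the packed 16-byte layout of Figure 1 and borrows the same byte offsets as the figure's pseudocode; the input file name is purely illustrative.

/* Minimal, self-contained C version of the counting approach of Figure 1.
 * Byte offsets for age, sex, and country follow the figure's pseudocode;
 * the input file name is illustrative. */
#include <stdio.h>
#include <string.h>

static unsigned long long cnt[2][256][128];   /* [sex][country][age] */

int main(void)
{
    unsigned char r[16];
    FILE *f = fopen("world.bin", "rb");        /* hypothetical packed census file */
    if (!f) { perror("world.bin"); return 1; }

    memset(cnt, 0, sizeof cnt);                /* fill cnt with 0 */

    /* One sequential pass: bucket every record by (sex, country, age). */
    while (fread(r, 1, 16, f) == 16) {
        unsigned age  = r[0] & 0x7F;           /* 7-bit age */
        unsigned sex  = (r[1] >> 7) & 1;       /* 1-bit sex */
        unsigned ctry = r[11];                 /* 8-bit country code */
        cnt[sex][ctry][age]++;
    }
    fclose(f);

    /* For each (sex, country) group, accumulate over the 128 age buckets;
     * the median is the bucket at which the running count reaches half
     * of the group total. */
    for (unsigned sex = 0; sex < 2; sex++) {
        for (unsigned ctry = 0; ctry < 256; ctry++) {
            unsigned long long tot = 0, acc = 0;
            for (unsigned age = 0; age < 128; age++)
                tot += cnt[sex][ctry][age];
            if (tot == 0)
                continue;                      /* no records for this group */
            for (unsigned age = 0; age < 128; age++) {
                acc += cnt[sex][ctry][age];
                if (acc >= (tot + 1) / 2) {
                    printf("country %u, sex %u: median age %u\n",
                           ctry, sex, age);
                    break;
                }
            }
        }
    }
    return 0;
}

The entire cnt array occupies only 2 x 256 x 128 counters, so the work is one sequential scan of the file followed by a trivial in-memory pass over 65,536 buckets.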
In fact, our table of “all the people
in the world” will fit in the memory of
a single, $15K Dell server with 128GB
RAM. Running off in-memory data,
my simple median-age-by-sex-and-country program completed in less
than a minute. By such measures, I
would hesitate to call this “big data,”
particularly in a world where a single
research site, the LHC (Large Hadron
Collider) at CERN (European Organization for Nuclear Research), is expected to produce 150,000 times as
much raw data each year.[10]
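To make the in-memory claim concrete: 6.75 billion records at 16 bytes each come to

    6.75 x 10^9 records x 16 bytes/record ≈ 1.08 x 10^11 bytes ≈ 108GB,

comfortably under the server's 128GB of RAM.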
For many commonly used applications, however, our hypothetical
6.75-billion-row dataset would in fact
pose a significant challenge. I tried
loading my fake 100GB world census
into a commonly used enterprise-grade database system (PostgreSQL[6])
running on relatively hefty hardware
(an eight-core Mac Pro workstation
with 20GB RAM and two terabytes of
RAID 0 disk), but had to abort the bulk
load process after six hours as the database storage had already reached
many times the size of the original
binary dataset, and the workstation’s
disk was nearly full. (Part of this, of
course, was a result of the “
unpacking” of the data. The original file
stored fields bit-packed rather than
as distinct integer fields, but subsequent tests revealed that the database
was using three to four times as much
storage as would be necessary to store
each field as a 32-bit integer. This sort
of data “inflation” is typical of a traditional RDBMS and shouldn’t necessarily be seen as a problem, especially
to the extent that it is part of a strategy to improve performance. After all,
disk space is relatively cheap.)
I was able to load subsets of up to one billion rows consisting of just three columns: country (8 bits, 256 possible values), age (7 bits, 128 possible values), and sex (1 bit, two values). This was only 2% of the
raw data, although it ended up consuming more than 40GB in the DBMS.
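Loading the packed binary requires first unpacking it into a row format the DBMS can ingest. A converter along the following lines (file and table names are purely illustrative, not the procedure actually used) shows the kind of unpacking step such a load implies: it reads the 16-byte records, using the same byte offsets as Figure 1, and emits tab-separated country, age, and sex values suitable for PostgreSQL's COPY.

/* Hypothetical converter: packed 16-byte census records -> tab-separated
 * (country, age, sex) rows for bulk loading, e.g. with
 *   COPY people (country, age, sex) FROM '/tmp/people.tsv';
 * Byte offsets follow the pseudocode in Figure 1; file names are
 * illustrative only. */
#include <stdio.h>

int main(void)
{
    unsigned char r[16];
    FILE *in  = fopen("world.bin", "rb");      /* packed input (hypothetical) */
    FILE *out = fopen("people.tsv", "w");      /* text output for COPY */
    if (!in || !out) { perror("open"); return 1; }

    while (fread(r, 1, 16, in) == 16) {
        unsigned age  = r[0] & 0x7F;           /* 7-bit age */
        unsigned sex  = (r[1] >> 7) & 1;       /* 1-bit sex */
        unsigned ctry = r[11];                 /* 8-bit country code */
        fprintf(out, "%u\t%u\t%u\n", ctry, age, sex);
    }
    fclose(in);
    fclose(out);
    return 0;
}

For scale, even stored as plain 32-bit integers these three columns over a billion rows would come to only 3 x 4 bytes x 10^9 = 12GB, so the 40GB-plus observed in the DBMS is consistent with the three-to-fourfold inflation noted above.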
I then tested the following query, es-
sentially the same computation as the
left side of Figure 1:
SELECT country,age,sex,count(*)
FROM people GROUP BY
country,age,sex;
This query ran in a matter of seconds on small subsets of the data, but
execution time increased rapidly as
the number of rows grew past 1 million (see Figure 2). Applied to the entire billion rows, the query took more
than 24 hours, suggesting that PostgreSQL was not scaling gracefully to this
big dataset, presumably because of a
poor choice of algorithm for the given
data and query. Invoking the DBMS’s
built-in EXPLAIN facility revealed
the problem: while the query planner
chose a reasonable hash table-based
aggregation strategy for small tables,
on larger tables it switched to sorting
by grouping columns—a viable, if suboptimal, strategy given a few million
rows, but a very poor one when facing
a billion. PostgreSQL tracks statistics
such as the minimum and maximum
value of each column in a table (and I
verified that it had correctly identified
the ranges of all three columns), so it
could have chosen a hash-table strategy with confidence. It’s worth noting, however, that even if the table’s
statistics had not been known, on a
billion rows it would take far less time
to do an initial scan and determine
Figure 2. PostgreSQL performance on the query SELECT country,age,sex,count(*) FROM people GROUP BY country,age,sex. Curves of linear, linearithmic, and quadratic growth are shown for comparison. [Plot of query time (seconds) versus number of rows.]