SQL
front-end
XQuery
front-end
SPARQL
front-end
BAT algebra
BAT “name”
0 0 John Wayne\0
1
11 Roger Moore\0
2
23 Bob Fosse\0
3
33 Will Smith\0
BAT “age”
0 1907
1 1927
2 1927
3 1968
01
12
select(age,1927)
(memory mapped) simple memory array
(virtual) dense surrogates MonetDB back-end
columns using memory-mapped files. It is optimized for
the typical situation that the surrogate column is a densely
ascending numerical identifier (0, 1, 2,…); in which case the
head array is omitted, and surrogate lookup becomes a fast
array index read in the tail. In effect, this use of arrays in virtual memory exploits the fast in-hardware address to disk-block mapping implemented by the memory management
unit (MMU) in a CPU to provide an O( 1) positional database
lookup mechanism. From a CPU overhead point of view this
compares favorably to B-tree lookup into slotted pages—the
approach traditionally used in database systems for “fast”
record lookup.
The Join and Select operators of the relational algebra take an arbitrary Boolean expression to determine the
tuples to be joined and selected. The fact that this Boolean
expression is specified at query time only, means that the
RDBMS must include some expression interpreter in the critical runtime code-path of these operators. Traditional database systems implement each relational algebra operator as
an iterator class with a next() method that returns the next
tuple; database queries are translated into a pipeline of such
iterators that call each other. As a recursive series of method
calls is performed to produce a single tuple, computational
interpretation overhead is significant. Moreover, the fact
that the next() method of all iterators in the query plan is
executed for each tuple, causes a large instruction cache footprint, which can lead to strong performance degradation
due to instruction cache misses.
1
In contrast, each BAT algebra operator has zero degrees of
freedom: it does not take complex expressions as parameter.
Rather, complex expressions are broken into a sequence of
BAT algebra operators that perform one simple operation on
an entire column of values (“bulk processing”). This allows
the implementation of the BAT algebra to forsake an expression interpreting engine; rather all BAT algebra operations
in the implementation map onto simple array operations.
For instance, the BAT algebra expression
R:bat[:oid, :oid]:=select(B:bat[:oid,:int], V:int)
80 communications of the acm | dEcEmbEr 2008 | voL. 51 | No. 12
can be implemented at the C code level like:
for (i = j = 0; i <n; i++)
if (B.tail[i] == V) R.tail[j++] = i;
The BAT algebra operators have the advantage that tight
for-loops create high instruction locality which eliminates
the instruction cache miss problem. Such simple loops are
amenable to compiler optimization (loop pipelining, blocking, strength reduction), and CPU out-of-order speculation.
A potential danger of bulk processing is that it
materializes intermediate results which in some cases may lead to
excessive RAM consumption. Although RAM sizes increase
quickly as well, there remain cases that we hit their limit as
well. In the MonetDB/X100 project3 it was shown how partial
column-wise execution can be integrated into (
nonmaterial-izing) pipelined query processing.
We can conclude that the MonetDB architecture for realizing database system functionality is radically different from
many contemporary product designs, and the reasons for its
design are motivated by opportunities for better exploiting
modern hardware features.
4. cache-conscious Joins
Among the relational algebra operators, the Join operator,
which finds all matching pairs between all tuples from
two relations according to some Boolean predicate, is the
most expensive operator—its complexity in the general
case is quadratic in input size. However, for equality join
predicates, fast (often linear) algorithms are available,
such as hash-Join, where the outer relation is scanned
sequentially and a hash-table is used to probe the inner
relation.
4. 1. Partitioned hash-join
The very nature of the hashing algorithm implies that the
access pattern to the inner relation (plus hash-table) is random. In case the randomly accessed data is too large for the
CPU caches, each tuple access will cause cache misses and
performance degrades.
Shatdal et al.
19 showed that a main-memory variant of
Grace Hash-Join, in which both relations are first partitioned on hash-number into h separate clusters, that each
fit into the L2 memory cache, performs better than normal
bucket-chained hash-join. However, the clustering operation itself can become a cache problem: their straightforward clustering algorithm that simply scans the relation
to be clustered once and inserts each tuple in one of the
clusters, creates a random access pattern that writes into
h separate locations. If h is too large, there are two factors
that degrade performance. First, if h exceeds the number of
TLB entriesd each memory reference will become a TLB miss.
Second, if h exceeds the number of available cache lines (L1
d If the relation is very small and fits the total number of TLB entries times
the page size, multiple clusters will fit into the same page and this effect will
not occur.