With these, we can calculate the number of cache misses that occur when executing patterns $P_1, \ldots, P_p \in \mathcal{P}$, $p > 1$, sequentially, given an initial cache state $s_0$, as
$$M_i(s_0^i, \oplus(P_1, \ldots, P_p)) = \sum_{q=1}^{p} M_i(s_{q-1}^i, P_q),$$
where $s_{q-1}^i$ is the state of cache level $i$ after executing pattern $P_{q-1}$.
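The sequential combination rule can be illustrated with a toy simulation (a sketch only: the actual model estimates misses analytically rather than by replaying accesses; the LRU policy, the pattern-as-list-of-line-ids representation, and all function names below are our own illustrative assumptions):

```python
from collections import OrderedDict

def run_pattern(state, pattern, capacity):
    """Replay one access pattern against an LRU cache state.
    'pattern' is a list of cache-line ids; returns (misses, new_state)."""
    misses = 0
    for line in pattern:
        if line in state:
            state.move_to_end(line)        # hit: refresh LRU position
        else:
            misses += 1                    # miss: fetch the line
            state[line] = True
            if len(state) > capacity:
                state.popitem(last=False)  # evict least-recently-used line
    return misses, state

def misses_sequential(s0, patterns, capacity):
    """Sequential rule: each pattern starts from the cache state
    its predecessor left behind, and the per-pattern misses add up."""
    total, state = 0, OrderedDict(s0)
    for p in patterns:
        m, state = run_pattern(state, p, capacity)
        total += m
    return total
```

Threading the cache state through the patterns is what captures reuse: a pattern that revisits lines its predecessor loaded incurs fewer misses than it would from a cold cache.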
5.2.2. Concurrent Execution
When executing patterns concurrently, we additionally have to consider that they compete for the same cache. We model the impact of cache interference between concurrent patterns by dividing the cache among all patterns. Each pattern $P$ gets a fraction $0 < \nu < 1$ of the cache according to its footprint size $F(P)$, i.e., the number of cache lines that it potentially revisits. The detailed formulas for $F(P)$ with $P \in \mathcal{P}$ are given in Manegold. We use $M^{\nu}$ to denote the number of misses with only a fraction $0 < \nu < 1$ of the total cache size available.
With these tools at hand, we calculate the cache misses for concurrent execution of patterns $P_1, \ldots, P_p \in \mathcal{P}$ ($p > 1$), given an initial cache state $s_0$, as
$$M_i(s_0^i, \odot(P_1, \ldots, P_p)) = \sum_{q=1}^{p} M_i^{\nu_q}(s_0^i, P_q).$$
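The cache-division idea can likewise be sketched in a few lines (again a hedged illustration: taking the footprint as the count of distinct lines, splitting the cache proportionally, and rounding each share to an integer are our own simplifications, and all names are hypothetical):

```python
from collections import OrderedDict

def lru_misses(s0, pattern, capacity):
    """Count misses when replaying 'pattern' (a list of cache-line ids)
    against an LRU cache of 'capacity' lines, starting from state s0."""
    state, misses = OrderedDict(s0), 0
    for line in pattern:
        if line in state:
            state.move_to_end(line)        # hit: refresh LRU position
        else:
            misses += 1                    # miss: fetch the line
            state[line] = True
            if len(state) > capacity:
                state.popitem(last=False)  # evict least-recently-used line
    return misses

def misses_concurrent(s0, patterns, capacity):
    """Concurrent rule: every pattern starts from the same state s0,
    but only sees a share nu_q = F(P_q) / sum(F) of the cache,
    with the footprint F(P) taken as the number of distinct lines."""
    total_fp = sum(len(set(p)) for p in patterns)
    total = 0
    for p in patterns:
        nu = len(set(p)) / total_fp        # cache fraction nu_q
        share = max(1, int(nu * capacity))  # this pattern's private share
        total += lru_misses(s0, p, share)
    return total
```

Note the contrast with the sequential case: here the patterns do not see each other's state at all; interference is modeled purely through the reduced cache share.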
For our radix-partitioned hash-join algorithm, Figures 4d and 8 compare the cost predicted by our cost model to the measured execution times on an Athlon PC. An exhaustive experimental validation of our models is presented in Manegold.
5.2.3. Query Execution Plans
With the techniques discussed, we have the basic tools at
hand to estimate the number and kind of cache misses of
complete query plans, and hence can predict their memory access costs.

Figure 8: Sample cost model validation. Execution time in seconds (points: measured, lines: modeled) against the number of radix-bits, for the curves T1, T2, L1, and L2.

(Communications of the ACM, December 2008, Vol. 51, No. 12, p. 84)

The various operators in a query plan are combined in the same way the basic patterns are combined to form compound patterns. Basically, the query plan
describes which operators are executed one after the other
and which are executed concurrently. We view pipelining as
concurrent execution of data-dependent operators. Hence,
we can derive the complex memory access pattern of a query
plan by combining the compound patterns of the operators
as discussed above. Keeping track of the caches' states, as introduced before, takes care of handling data dependencies.
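As an illustration, a plan's compound pattern can be assembled recursively from its operators; the sketch below (the tuple encoding and operator names such as P_build are our own hypothetical choices) renders operators executed one after the other with the sequential combinator ⊕, and pipelined, data-dependent sections with the concurrent combinator ⊙:

```python
# Leaf: ("op", name) is an operator's basic access pattern. Internal nodes
# combine children either sequentially ("seq", one after the other) or
# concurrently ("conc", pipelined operators competing for the cache).

def plan_pattern(node):
    """Recursively render a plan tree as a compound-pattern expression."""
    kind, children = node[0], node[1:]
    if kind == "op":
        return children[0]
    symbol = " ⊕ " if kind == "seq" else " ⊙ "
    return "(" + symbol.join(plan_pattern(c) for c in children) + ")"

# Hypothetical example: build a hash table first, then run a pipelined
# scan -> select -> probe section, whose operators execute concurrently.
plan = ("seq",
        ("op", "P_build"),
        ("conc", ("op", "P_scan"), ("op", "P_select"), ("op", "P_probe")))
```

Evaluating the resulting expression bottom-up with the sequential and concurrent rules then yields the predicted miss count for the whole plan.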
6. RELATED WORK
The growing mismatch between the way database systems
are engineered versus hardware evolution was first brought
to light in a number of workload studies. An early study15
already showed database workloads, compared with scientific computation, to exhibit significantly more instruction
cache misses (due to a large code footprint) and more (L2)
data cache misses.
Instruction cache misses are particularly prevalent in
transaction processing workloads. The STEPS10 approach
therefore organizes multiple concurrent queries into execution teams, and evaluates each query processing operator for
all members of the team one after another, while its code is
hot. Another proposal in this direction, aimed at analysis queries, splits query plans into sections whose code fits the instruction cache, putting a so-called "Buffer" operator on the section boundary.23 The Buffer operator repeatedly invokes the query subsection below it, buffering the resulting
tuples without passing them on yet, such that the operators in
the subsection are executed multiple times when hot, amortizing instruction cache misses. The high locality of the BAT
algebra operators in MonetDB and materialization of results
can be seen as an extreme form of this latter strategy.
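The buffering idea can be sketched with a generator (a toy sketch only: the actual Buffer operator sits in a Volcano-style plan interpreter, and batch_size is a hypothetical parameter):

```python
def buffer_operator(child, batch_size):
    """Drain the plan subsection 'child' in bursts: keep invoking it while
    its code is hot in the instruction cache, buffering up to batch_size
    tuples before any of them are passed on to the operators above."""
    batch = []
    for tup in child:
        batch.append(tup)
        if len(batch) == batch_size:
            yield from batch   # emit the buffered tuples, in order
            batch = []
    yield from batch           # flush whatever remains at end-of-input
```

Because the subsection below the Buffer runs many iterations back-to-back, the cost of loading its code into the instruction cache is amortized over a whole batch instead of being paid per tuple.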
In the area of index structures, Cache-Sensitive B+ Trees17 ensure that internal nodes match the cache-line size, optimizing the number of cache-line references, and introduce highly optimized in-node search routines.
The MonetDB work2, 12, 13 showed vertical data fragmentation (DSM8) to benefit analysis queries, due to reduced memory traffic and an increased spatial locality. Column-stores
have since received much attention for use in data warehousing environments (e.g., C-Store20 and the CWI follow-up
system MonetDB/X1003), introducing column-store specific
compression and query evaluation techniques.
Considering hash-join, cache-sized partitioning was
first proposed in Shatdal19 and subsequently improved2 as summarized in Section 4. The Radix-Cluster algorithm was later supplemented with an inverse Radix-Decluster algorithm,14 which allows performing arbitrary data
permutations in a cache-efficient manner (this can be used
for sorting, as well as for postponing the propagation of join
columns to after the join phase).
An alternative hash-join approach uses software prefetching, exploiting the explicit memory-to-cache prefetching
instructions offered by modern CPUs. Group prefetching was
shown in Chen6 to perform better than cache-partitioning
and was also shown to be more resistant to interference by
other programs. Prefetching was also successfully applied
in B-tree access7 to increase the width of the nodes without