than 30 s due to excessive L1, L2, and TLB misses, but if we
join 2^11 = 2048 clusters of around 4000 tuples each (i.e., each
cluster fits into the Athlon’s L1 cache), performance improves
around 10-fold. The lines T2, L2, T1, and L1 indicate the clustering degree after which the inner relation (plus hash table) fits, respectively, in the level-2 TLB, the L2 data cache, the level-1 TLB, and the L1 data cache of this Athlon processor. However, Figure
4(b) shows that the straightforward clustering algorithm
degrades significantly due to L1 and TLB misses after B = 8, as it then fills 256 clusters with only 256 L1 cache lines (on this
Athlon), while for similar reasons L2 cache misses become a
serious problem after 12 bits. To keep clustering efficient, we
should therefore use multipass Radix-Cluster, as shown in
Figure 4(c). Since using more clusters improves Partitioned
Hash-Join yet degrades Radix-Cluster, the overall results in
Figure 4(d) show a sweet spot at B = 12 (two passes).
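The multipass Radix-Cluster can be sketched as follows; the function name `radix_cluster`, the (key, payload) tuple layout, and the even split of the B bits over passes are illustrative assumptions, not MonetDB's actual implementation:

```python
def radix_cluster(tuples, bits, passes):
    # Partition tuples on the lowest `bits` bits of their integer key,
    # spending roughly bits/passes bits per pass so that each pass
    # writes to few enough clusters to stay cache- and TLB-resident.
    per_pass = [bits // passes + (1 if p < bits % passes else 0)
                for p in range(passes)]
    shift = 0
    clusters = [list(tuples)]
    for b in per_pass:
        mask = (1 << b) - 1
        nxt = []
        for cluster in clusters:
            buckets = [[] for _ in range(1 << b)]
            for t in cluster:
                buckets[(t[0] >> shift) & mask].append(t)
            nxt.extend(buckets)
        clusters = nxt
        shift += b
    return clusters  # 2**bits clusters, grouped on the low `bits` key bits
```

With bits = 11 and passes = 2, for instance, each pass creates at most 2^6 = 64 output clusters, which keeps the number of concurrently written cache lines and TLB entries small.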
When a user submits a query to a running database server,
its query optimizer determines a physical plan, choosing the
right order of the operators as well as choosing the physical
algorithm to use. For instance, it may compare Sort-Merge with Hash-Join. Additionally, in the case of Hash-Join, the optimizer must now also determine how many partitions h, and thus how many radix-bits B, to use. On the one hand, it needs crucial
parameters of the unified hardware model (i.e., the cache
configurations) as derived by Calibrator (see Section 2.1);
e.g., at DBMS startup. On the other hand, it should model
the memory access cost of query processing operators given
a value distribution estimate and tuning parameters (such
as B). The lines in Figure 4(d) represent the cost prediction
of our model for Partitioned Hash-Join, indicating that the
techniques described in Section 5 can be quite accurate.
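As a sketch of how an optimizer might use the calibrated cache parameters, the hypothetical helper below picks the smallest B such that each cluster of the inner relation fits a given cache (or TLB-covered) capacity; a real optimizer would instead minimize the modeled cost of Section 5 across all levels:

```python
def pick_radix_bits(inner_bytes, cache_bytes):
    # Smallest B such that each of the 2**B clusters of the inner
    # relation (plus its hash table) fits the given capacity.
    b = 0
    while inner_bytes / (1 << b) > cache_bytes:
        b += 1
    return b
```

For an 8 MB inner relation and a 64 KB L1 data cache, this yields B = 7, i.e., 128 clusters of about 62 KB each.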
5. Modeling Memory Access Costs
Cache-conscious database algorithms, such as the radix-partitioned hash-join, achieve their optimal performance only if
they are carefully tuned to the hardware specifics. Predictive
and accurate cost models provide the cornerstones to automate this tuning task. We model the data access behavior in
terms of a combination of basic access patterns using the
unified hardware model from Section 2.1.
5.1. Memory access cost
Memory access cost can be modeled by estimating the number of cache misses M and scoring them with their respective
miss latency l.13 Akin to detailed I/O cost models, we distinguish between random and sequential access. However, we
now have multiple cache levels with varying characteristics.
Hence, the challenge is to predict the number and kind of
cache misses for all cache levels. Our approach is to treat all
cache levels individually, though equally, and calculate the
total cost as the sum of the cost for all levels:
T_Mem = \sum_{i=1}^{N} \left( M_i^s \cdot l_i^s + M_i^r \cdot l_i^r \right).
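In code, the formula above is a straightforward sum over the calibrated cache levels; representing each level as a 4-tuple (sequential misses, sequential latency, random misses, random latency) is an assumed encoding for illustration:

```python
def total_memory_cost(levels):
    # T_Mem = sum_i (M_i^s * l_i^s + M_i^r * l_i^r):
    # per cache level i, sequential and random misses are scored
    # with their respective (calibrated) miss latencies.
    return sum(Ms * ls + Mr * lr for (Ms, ls, Mr, lr) in levels)
```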
This leaves the challenge to properly estimate the number
and kind of cache misses per cache level for various database
algorithms. The task is similar to estimating the number and
kind of I/O operations in traditional cost models. However, our goal is to provide a generic technique for predicting cache miss rates, sacrificing as little accuracy as possible.

82 Communications of the ACM | December 2008 | Vol. 51 | No. 12
The idea is to abstract data structures as data regions and
model the complex data access patterns of database algorithms in terms of simple compounds of a few basic data
access patterns. For these basic patterns, we then provide
cost functions to estimate their cache misses. Finally, we
present rules to combine basic cost functions and to derive
the cost functions of arbitrarily complex patterns.
5.1.1. Basic Access Patterns
Data structures are modeled using a set of data regions ℛ. A data region R ∈ ℛ consists of |R| data items of size R̲ (in bytes). We call |R| the length of region R, R̲ its width, and ||R|| = |R| · R̲ its size.
A database table is hence represented by a region R with |R| being the table's cardinality and R̲ being the tuple size (width). Similarly, more complex structures like trees are modeled by regions with |R| representing the number of nodes and R̲ representing the size (width) of a node.
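A minimal sketch of a data region, with the hypothetical attribute names `length`, `width`, and `size` standing for |R|, R̲, and ||R||:

```python
class DataRegion:
    # A region R: |R| data items (length) of R_ bytes each (width).
    def __init__(self, length, width):
        self.length = length  # |R|: e.g., a table's cardinality
        self.width = width    # R_: e.g., the tuple size in bytes

    @property
    def size(self):
        return self.length * self.width  # ||R|| = |R| * R_
```

For example, a table of one million 16-byte tuples is a region of length 1,000,000, width 16, and size 16 MB.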
The following basic access patterns occur in the majority of relational algebra implementations.
A single sequential traversal s_trav(R) sweeps over R,
accessing each data item in R exactly once (cf., Figure 5).
A single random traversal r_trav(R) visits each data item
in R exactly once. However, the data items are not accessed
in storage order, but chosen randomly (cf., Figure 6).
A repetitive sequential traversal rs_trav(r, d, R) performs r sequential traversals over R. d = uni (unidirectional)
indicates that all traversals sweep over R in the same direction. d = bi (bidirectional) indicates that subsequent traversals go in alternating directions.
A repetitive random traversal rr_trav(r, R) performs r
random traversals over R. Assuming that the permutation
orders of two subsequent traversals are independent, there is
no point in discriminating uni- and bidirectional accesses.
A random access r_acc(r, R) hits r randomly chosen data items in R, one after another. The choices are independent of
each other. Each data item may be hit more than once. Even
Figure 5: Single sequential traversal: s_trav(R).
Figure 6: Single random traversal: r_trav(R).
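The access patterns of Figures 5 and 6 can be made concrete as the index sequences they generate over a region of n items; these hypothetical generators only illustrate the definitions above, not how the patterns are represented in the cost model:

```python
import random

def s_trav(n):
    # s_trav(R): each of the n items of R exactly once, in storage order.
    return list(range(n))

def r_trav(n, seed=0):
    # r_trav(R): each item exactly once, in a random permutation order.
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order

def r_acc(r, n, seed=0):
    # r_acc(r, R): r independent random hits; an item may be hit
    # more than once, or not at all.
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(r)]
```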