for accessing the L2 cache. Analogously, if the data is not yet
available in L2, an L2 miss occurs, further delaying the access
by the memory access latency (l_Mem) needed to finally load the
data from main memory. Hence, the total latency to access data
that is in neither cache is l_Mem + l_L2 + l_L1. As mentioned above,
all current hardware actually transfers multiple consecutive
words, i.e., a complete cache line, during this time.
Bandwidth (b) is a metric for the data volume (in megabytes)
that can be transferred to the CPU per second. The different
bandwidths are referred to as L2 access bandwidth (b_L2)
and memory access bandwidth (b_Mem), respectively. Memory
bandwidth used to be simply the cache-line size divided by
the memory latency. Modern multiprocessor systems provide
excess bandwidth capacity b′ ≥ b. To exploit this, caches
need to be nonblocking, i.e., they need to allow more than one
outstanding memory load at a time. CPUs that support out-
of-order instruction execution can generate multiple con-
current loads, as the CPU does not block on a cache miss,
but continues executing (independent) instructions. The
number of outstanding memory requests is typically limited
inside the CPU. The highest bandwidth in modern hardware
is achieved if the access pattern is sequential, in which case
the automatic memory prefetcher built into modern CPUs
is activated. The difference between sequential access
bandwidth (b_s = b′) and the respective (nonprefetched) random
access bandwidth (b_r = Z/l_r) can be a factor of 10, which means
that DRAM has truly become a block device, very similar to
magnetic disk.
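The magnitude of this gap can be observed with a simple micro-benchmark. The following C sketch is an illustration only (the buffer size and the use of a shuffled index array are arbitrary choices): it reads the same buffer once sequentially and once in random order.

/* Illustrative sketch: read the same buffer sequentially and in random
 * order; sizes and shuffling are arbitrary example choices. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (8 * 1024 * 1024)            /* 8M longs, ~64 MB: well beyond L2 */

int main(void) {
    long   *buf = malloc(N * sizeof(long));
    size_t *idx = malloc(N * sizeof(size_t));
    size_t i, j, t;
    long sum = 0;
    double t0, t_seq, t_rnd;

    for (i = 0; i < N; i++) { buf[i] = (long)i; idx[i] = i; }
    for (i = N - 1; i > 0; i--) {      /* shuffle the index array */
        j = (((size_t)rand() << 15) | (size_t)rand()) % (i + 1);
        t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }

    t0 = (double)clock();
    for (i = 0; i < N; i++) sum += buf[i];        /* sequential: prefetcher helps */
    t_seq = ((double)clock() - t0) / CLOCKS_PER_SEC;

    t0 = (double)clock();
    for (i = 0; i < N; i++) sum += buf[idx[i]];   /* random: a new cache line per access */
    t_rnd = ((double)clock() - t0) / CLOCKS_PER_SEC;

    printf("sequential: %.3f s   random: %.3f s   (checksum %ld)\n",
           t_seq, t_rnd, sum);
    free(buf); free(idx);
    return 0;
}

Both loops read exactly the same data volume, yet the random variant can be expected to run roughly an order of magnitude slower; the extra sequential reads of the index array only make the comparison conservative.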
Translation lookaside buffer (TLB). A special kind of cache,
the TLB is part of the virtual memory support built into mod-
ern CPUs: it remembers the latest translations of logical into
physical page addresses (e.g., the 64 most recent ones). Each memory load/store
needs address translation; if the page address is in the TLB
(a TLB hit), there is no additional cost. If not, a more com-
plex lookup in a mapping table is needed; thus a TLB miss
implies a penalty. Moreover, the lookup in the (memory-
resident) TLB mapping table might generate additional CPU
cache misses. Note that with a typical page size of 4KB and
64 entries in the TLB, on many systems TLB delay already
comes into play for random access to data structures (e.g.,
hash-tables) larger than 256KB.
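This effect can be made tangible with a page-granular access pattern. The sketch below is purely illustrative and assumes 4KB pages and a 64-entry TLB: it touches one cache line per page over a region four times larger than the assumed 256KB TLB reach, so nearly every access needs a fresh address translation even though the touched cache lines themselves fit comfortably in L1.

/* Illustrative sketch: stride at page granularity over a region larger
 * than the TLB reach (assumed here: 64 entries x 4 KB = 256 KB). */
#include <stdio.h>
#include <stdlib.h>

#define PAGE  4096                     /* assumed page size                     */
#define PAGES 256                      /* 1 MB region: 4x the assumed TLB reach */

int main(void) {
    char *buf = calloc(PAGES, PAGE);
    long  i, rounds, sum = 0;

    for (rounds = 0; rounds < 10000; rounds++)
        for (i = 0; i < PAGES; i++)
            /* a different cache line in every page, so the ~256 touched
             * lines fit in L1 and mainly the address translations miss */
            sum += buf[i * PAGE + (i % 64) * 64];

    printf("%ld\n", sum);              /* keep the loop from being optimized away */
    free(buf);
    return 0;
}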
Unified hardware model. Summarizing the above discus-
sion, we describe a computer’s memory hardware as a cas-
cading hierarchy of N levels of caches (including TLBs).
An index i ∈ {1, . . . , N} identifies the respective value of a
specific level. Exploiting the dualism that an access to level
i + 1 is caused by a miss on level i allows some simplification
of the notation. Introducing the miss latency l_i = λ_{i+1} and the
respective miss bandwidth b_i = β_{i+1} yields l_i = Z_i/b_i. Each cache
level is characterized by the parameters given in Table 1.b We
point out that these parameters also cover the cost-relevant
characteristics of disk accesses. Hence, viewing main mem-
ory (e.g., a database system’s buffer pool) as cache (level
N + 1) for I/O operations, it is straightforward to include disk
access in this hardware model.
b Costs for L1 accesses are assumed to be included in the CPU costs, i.e., λ_1
and β_1 are not used and hence undefined.
Table 1: Cache parameters per level (i ∈ {1, . . . , N})

Description                                    Unit       Symbol
cache name (level)                             –          L_i
cache capacity                                 bytes      C_i
cache-line size                                bytes      Z_i
number of cache lines                          –          #_i = C_i/Z_i
cache associativity                            –          A_i
Sequential (x = s) and random (x = r) access:
miss latency                                   ns         l^x_i
access latency                                 ns         λ^x_{i+1} = l^x_i
miss bandwidth                                 bytes/ns   b^x_i = Z_i/l^x_i
access bandwidth                               bytes/ns   β^x_{i+1} = b^x_i
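To make the model concrete, the C sketch below encodes the per-level parameters of Table 1 together with the derived quantities #_i = C_i/Z_i and b^x_i = Z_i/l^x_i. The struct and its field names are our own illustration, not an interface of any existing system; the numerical values are examples matching the sample measurements shown at the end of this section.

/* Illustrative encoding of the unified hardware model of Table 1. */
#include <stdio.h>

typedef struct {
    const char *name;    /* L_i: cache name (level)                     */
    long   C;            /* C_i: capacity in bytes                      */
    int    Z;            /* Z_i: cache-line size in bytes               */
    int    A;            /* A_i: associativity                          */
    double l_s, l_r;     /* l^s_i, l^r_i: seq./random miss latency (ns) */
} cache_level;

static long   lines(const cache_level *c) { return c->C / c->Z; }   /* #_i = C_i/Z_i     */
static double b_s(const cache_level *c)   { return c->Z / c->l_s; } /* b^s_i = Z_i/l^s_i */
static double b_r(const cache_level *c)   { return c->Z / c->l_r; } /* b^r_i = Z_i/l^r_i */

int main(void) {
    cache_level lvl[] = {
        { "L1", 32L * 1024,        64,  4,  0.91,  4.74 },
        { "L2",  4L * 1024 * 1024, 128, 4, 31.07, 76.74 },
    };
    for (int i = 0; i < 2; i++)
        printf("%s: %ld lines, b_s = %.2f bytes/ns, b_r = %.2f bytes/ns\n",
               lvl[i].name, lines(&lvl[i]), b_s(&lvl[i]), b_r(&lvl[i]));
    return 0;
}

Viewing main memory as level N + 1 for I/O, a disk would simply be added as one more entry with its own (much larger) latency and transfer unit.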
We developed a system-independent C program called
Calibratorc to automatically measure these parameters
on any computer hardware. The Calibrator uses carefully
designed memory access patterns to generate cache misses
in a controlled way. Comparing execution times of runs with
and without cache misses, it can derive the cache parameters and
latencies listed in Table 1. A detailed description of the Calibrator
is given in Manegold.11, 12 Sample results for a PC with a 2.4GHz
Intel Core 2 Quad Q6600 CPU look as follows:
CPU loop + L1 access: 1.25 ns = 3 cy

Caches:
Level  Size   Linesize  Asso.  Seq-miss-latency   Rand-miss-latency
1      32 KB  64 byte   4-way   0.91 ns =  2 cy     4.74 ns =  11 cy
2       4 MB  128 byte  4-way  31.07 ns = 75 cy    76.74 ns = 184 cy

TLBs:
Level  #entries  Pagesize  Miss-latency
1      256       4 KB      9.00 ns = 22 cy
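The underlying measurement idea can be illustrated with a dependent pointer chase: time the same number of loads once over an array small enough to stay cached and once over one far larger than L2, and divide the difference by the number of accesses. The sketch below only illustrates that principle and is not the Calibrator itself; array sizes and step count are arbitrary examples.

/* Illustrative sketch: estimate the random miss penalty by comparing a
 * cached and an uncached dependent pointer chase (not the Calibrator). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static volatile size_t sink;            /* keeps the chase from being optimized away */

static double chase_ns_per_access(size_t n, long steps) {
    size_t *next = malloc(n * sizeof(size_t));
    size_t i, j, t;

    for (i = 0; i < n; i++) next[i] = i;
    for (i = n - 1; i > 0; i--) {        /* Sattolo's algorithm: one single cycle, */
        j = (((size_t)rand() << 15) | (size_t)rand()) % i;  /* so the chase visits  */
        t = next[i]; next[i] = next[j]; next[j] = t;         /* all n slots          */
    }

    clock_t t0 = clock();
    for (j = 0, i = 0; i < (size_t)steps; i++)
        j = next[j];                     /* each load depends on the previous one */
    double ns = 1e9 * (double)(clock() - t0) / CLOCKS_PER_SEC / (double)steps;

    sink = j;
    free(next);
    return ns;
}

int main(void) {
    long   steps = 50L * 1000 * 1000;
    double hit  = chase_ns_per_access(2048, steps);            /* 16 KB: stays in L1    */
    double miss = chase_ns_per_access(8 * 1024 * 1024, steps); /* 64 MB: misses L1 + L2 */
    printf("cached: %.2f ns  uncached: %.2f ns  penalty: ~%.2f ns per access\n",
           hit, miss, miss - hit);
    return 0;
}

The reported difference approximates the cost of a random access that misses both caches and should be comparable to the rand-miss-latency figures above.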
3. MONETDB ARCHITECTURE
The storage model deployed in MonetDB is a significant
deviation from traditional database systems. It uses the
decomposed storage model (DSM),8 which represents relational
tables using vertical fragmentation, storing each column
in a separate <surrogate,value> table, called a binary
association table (BAT). The left column, often the surrogate
or object-identifier (oid), is called the head, and the right
column the tail. MonetDB executes a low-level relational algebra
called the BAT algebra. Data in execution is always stored in
(intermediate) BATs, and even the result of a query is a
collection of BATs.
Figure 2 shows the design of MonetDB as a back-end that
acts as a BAT algebra virtual machine, with a variety of
front-end modules on top that support popular data models and
query languages (SQL for relational data, XQuery for XML).
BAT storage takes the form of two simple memory arrays,
one for the head and one for the tail column (variable-width
types are split into two arrays, one with offsets, and the other
with all concatenated data). Internally, MonetDB stores
c http://www.cwi.nl/~manegold/Calibrator/calibrator.shtml