has profoundly influenced the database area and indeed our
work on MonetDB.
Another facet is that predictable array-wise processing
models have been strongly favored in a string of recent CPU
architectural innovations. While the rule “make the common
case fast” was exploited time and time again to design and
construct ever more complex CPUs, the difference in performance efficiency achieved by optimized code and intended
use (e.g., “multimedia applications”) versus nonoptimized
code and nonintended use (e.g., “legacy database applications”) has become very significant. A concrete example is
the evolution of CPUs from executing a single instruction
per clock cycle to multi-issue CPUs that use deeply pipelined execution, sometimes splitting instructions into more
than 30 dependent stages. Program code that has a high
degree of independence and predictability (multimedia or
matrix calculations) fills the pipelines of modern CPUs perfectly, while code with many dependencies (e.g., traversing a
hash-table or B-tree) and unpredictable if-then-else checks
leaves many holes in the CPU pipelines, achieving much
lower throughput.
2.1. The Memory Hierarchy
The main memory of computers consists of dynamic random
access memory (DRAM) chips. While CPU clock-speeds have
been increasing rapidly, DRAM access latency has hardly
improved in the past 20 years. Reading DRAM memory took
1–2 cycles in the early 1980s; currently it can take more than
300 cycles. Since typically one in three program instructions
is a memory load/store, this “memory wall” can in the worst
case reduce the efficiency of modern CPUs by two orders of magnitude. Typical system monitoring tools (top, or the Windows
Task Manager) do not provide insight into this performance
aspect: a 100% busy CPU could be 95% memory stalled.
To hide the high DRAM latency, the memory hierarchy has been extended with cache memories (cf., Figure 1),
typically located on the CPU chip itself.

Figure 1: Hierarchical memory architecture.

78 Communications of the ACM | December 2008 | Vol. 51 | No. 12

The fundamental
principle of all cache architectures is reference locality, i.e.,
the assumption that at any time the CPU repeatedly accesses
only a limited amount of data that fits in the cache. Only the
first access is “slow,” as the data has to be loaded from main
memory, i.e., a compulsory cache miss. Subsequent accesses
(to the same data or memory addresses) are then “fast” as
the data is then available in the cache. This is called a cache
hit. The fraction of memory accesses that can be fulfilled
from the cache is called cache hit rate.
Cache memories are organized in multiple cascading levels between the main memory and the CPU. They become
faster, but smaller, the closer they are to the CPU. In the
remainder we assume a typical system with two cache levels
(L1 and L2). However, the discussion can easily be generalized to an arbitrary number of cascading cache levels in a
straightforward manner.
In practice, cache memories keep not only the most
recently accessed data, but also the instructions that are currently being executed. Therefore, almost all systems nowadays implement two separate L1 caches, a read-only one for
instructions and a read-write one for data. The L2 cache,
however, is usually a single “unified” read-write cache used
for both instructions and data.
A number of fundamental characteristics and parameters
of cache memories are relevant for the sequel:
Capacity (C). A cache’s capacity defines its total size in bytes.
Typical cache sizes range from 32KB to 4MB.
Line size (Z). Caches are organized in cache lines, which represent the smallest unit of transfer between adjacent cache
levels. Whenever a cache miss occurs, a complete cache line
(i.e., multiple consecutive words) is loaded from the next
cache level or from main memory, transferring all bits in the
cache line in parallel over a wide bus. This exploits spatial
locality, increasing the chances of cache hits for future references to data that is “close to” the reference that caused a
cache miss. The typical cache-line size is 64 bytes.
Associativity (A). An A-way set associative cache allows loading a line into one of A different positions. If A > 1, some
cache replacement policy chooses one from the A candidates.
Least recently used (LRU) is the most common replacement
algorithm. In case A = 1, the cache is called directly mapped.
This organization causes the least (virtually no) overhead in
determining the cache-line candidate. However, it also offers
the least flexibility and may cause a lot of so-called conflict
misses. The other extreme case is fully associative caches.
Here, each memory address can be loaded to any line in the
cache (A = ∞). This avoids conflict misses, and only so-called
capacity misses occur as the cache capacity gets exceeded.
However, determining the cache-line candidate in this strategy causes a relatively high overhead that increases with
the cache size. Hence, it is feasible only for smaller caches.
Current PCs and workstations typically implement two- to
eight-way set associative caches.
Latency (l) is the time span from issuing a data access
until the result is available in the CPU. Accessing data that
is already available in the L1 cache causes L1 access latency
(l1), which is typically rather small (1 or 2 CPU cycles). In
case the requested data is not found in L1, an L1 miss occurs,
additionally delaying the data access by L2 access latency (l2)