controller. Using the configuration tables stored in DRAM,
the driver tells the Flash controller what error correction
strength and cell density to use for each access. In this section we provide more details on how the cache operates and
how its settings are reconfigured on the fly.
Theory of operation: In this section we summarize how
the operating system and controller interact (see Kgil¹¹ for
a full description). The concepts are similar to those of ordinary disk
caches, except that the cache now has two levels. The first level
of cache resides in DRAM, and the second level consists of
Flash memory. In addition, the Flash portion of the cache
has to be reconfigured on the fly to maximize performance
and reliability. The DRAM, with fast, uniform read and
write latency, no wear-out and no density modes, is easier
to handle.
When a file read is performed, the OS searches for the
file in the primary disk cache located in DRAM. If the page
is found in DRAM, the file content is accessed directly
from the primary disk cache, and no access to Flash-related
data structures is required. Otherwise, the OS determines
whether the requested file currently resides in the secondary (Flash) disk cache. If it is found there,
a Flash read is performed and the Flash content is transferred to DRAM.
If the data is not found in Flash, we look for an empty
Flash page in the read cache. If no empty Flash
page is available, we first select a block for eviction to disk,
freeing Flash pages for the newly read data. The block
chosen is usually the “least recently used” (LRU) block, because its
data is the least likely to be needed again. A later access to evicted data would have to
go all the way to disk, increasing program execution time, so
choosing the LRU block reduces the likelihood of this happening.
Concurrently, a hard disk drive access is scheduled using the
device driver interface. The hard disk drive content is copied
to the primary disk cache in DRAM and also the read cache
in Flash.
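The read path described above can be sketched as a toy model. This is only an illustration under simplifying assumptions, not the implementation evaluated in the paper: the class and method names are hypothetical, the DRAM cache is unbounded, the disk is a plain dictionary, and eviction works on single pages rather than on Flash blocks.

```python
from collections import OrderedDict


class TwoLevelCache:
    """Toy model of the DRAM (primary) + Flash (secondary) read path."""

    def __init__(self, flash_capacity, disk):
        self.dram = {}                  # primary disk cache (unbounded for brevity)
        self.flash = OrderedDict()      # Flash read cache, least recently used first
        self.flash_capacity = flash_capacity
        self.disk = disk                # dict standing in for the hard disk drive

    def read(self, page_id):
        # 1. Primary cache hit: no Flash data structures are touched.
        if page_id in self.dram:
            return self.dram[page_id], "dram"
        # 2. Secondary cache hit: read Flash and copy the content into DRAM.
        if page_id in self.flash:
            self.flash.move_to_end(page_id)    # mark as most recently used
            self.dram[page_id] = self.flash[page_id]
            return self.dram[page_id], "flash"
        # 3. Miss in both levels: make room in Flash, then fetch from disk.
        if len(self.flash) >= self.flash_capacity:
            self.flash.popitem(last=False)     # evict the least recently used entry
        data = self.disk[page_id]
        self.dram[page_id] = data              # copy into the primary cache
        self.flash[page_id] = data             # and into the Flash read cache
        return data, "disk"
```

The second element of the returned tuple names the level that served the request, which makes it easy to count hits per level when replaying a trace.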
When a file is written, we typically update the page in
the primary disk cache. This page is periodically scheduled to be written back to the secondary disk cache and is later
written back to the disk drive. When writing a page
back to Flash, we first determine whether the page already exists in
Flash. If it is found in the write cache, we update the page by
doing an out-of-place write to the write cache. If it is found in
the read cache, we move it to the write cache. If it is not
found in Flash, we allocate a page in the write cache.
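The three write-back cases can be summarized in a short sketch. The function name and the use of plain dictionaries for the read and write caches are assumptions made for illustration; a real controller would also invalidate the old physical page so that garbage collection can reclaim it.

```python
def write_back_to_flash(page_id, data, read_cache, write_cache):
    """Place a dirty DRAM page into the Flash write cache.

    read_cache and write_cache are dicts mapping page id -> data.
    Returns a label naming the case that was taken.
    """
    if page_id in write_cache:
        # Out-of-place update: the mapping now points at the new copy.
        write_cache[page_id] = data
        return "updated"
    if page_id in read_cache:
        # The page leaves the read cache and now lives in the write cache.
        del read_cache[page_id]
        write_cache[page_id] = data
        return "moved"
    # Not present in Flash at all: allocate a fresh write-cache page.
    write_cache[page_id] = data
    return "allocated"
```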
In the background, garbage collection is triggered
when the Flash-based disk cache starts to run out of space.
The cached data is also periodically flushed back to disk if it
has been modified. Concurrent with the normal cache operation, the reliability management algorithms continuously
try to adapt the Flash configuration to provide maximum
benefit. We have already seen that the configuration changes
with the application software. The next section describes the
configuration policies enforced to achieve this.
Reconfiguring the Flash memory controller: The Flash
Page Status Table (FPST) specifies the reliability control
settings for each page of Flash. When the OS reads from or
writes to the Flash controller, it also sends configuration bits specifying the various modes for the Flash page.
104 Communications of the ACM | April 2009 | Vol. 52 | No. 4
Configuration policies are applied to select those modes,
maximizing performance as the application demands
change and the Flash eventually develops faulty bits.
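To make the role of the FPST concrete, one possible shape for a per-page entry is sketched below. The field names, the unit of ECC strength, and the bit layout are all hypothetical; the text states only that the table holds reliability settings per page and that configuration bits accompany each access.

```python
from dataclasses import dataclass
from enum import Enum


class Density(Enum):
    SLC = 1   # single-level cell mode
    MLC = 2   # multi-level cell mode


@dataclass
class FPSTEntry:
    ecc_strength: int = 1            # hypothetical unit: correctable bits
    density: Density = Density.MLC
    read_count: int = 0              # reads seen by this page


def config_bits(entry):
    """Pack the mode settings into one integer (hypothetical layout).

    Bit 0 holds the density mode; the remaining bits hold the ECC strength.
    """
    density_bit = 0 if entry.density is Density.SLC else 1
    return (entry.ecc_strength << 1) | density_bit
```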
There are two main triggers for an ECC strength or density mode change: (1) an increase in the number of
faulty bits and (2) a change in access (read) frequency. Each
trigger is explained below.
When new bit errors are observed and recur consistently
because of wear-out, we reconfigure the page, either
by enforcing a stronger ECC or by reducing cell density from
MLC to SLC mode. We choose the option with the minimum
increase in latency using some simple heuristics. These heuristics take
into account how active that particular page is, to determine
its impact on the system as a whole, as well as the current level of wear-out for the page.
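The choice between the two options can be sketched as a comparison of estimated latency increases. The function below is a hedged illustration: its name and parameters are hypothetical, and the two latency estimates are supplied by the caller, since the heuristics that produce them (page activity, current wear-out) are not spelled out here.

```python
def choose_reconfiguration(can_raise_ecc, can_lower_density,
                           ecc_latency_increase, density_latency_increase):
    """Pick the reconfiguration with the smaller estimated latency increase.

    Returns "stronger_ecc", "mlc_to_slc", or None when neither option
    remains available for the page.
    """
    if can_raise_ecc and can_lower_density:
        if ecc_latency_increase <= density_latency_increase:
            return "stronger_ecc"
        return "mlc_to_slc"
    if can_raise_ecc:
        return "stronger_ecc"
    if can_lower_density:
        return "mlc_to_slc"
    return None   # the caller must handle a page with no options left
```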
Some heavily accessed pages will benefit from being in
SLC storage simply because of its lower latency. If a page is
in MLC mode and the entry in the FPST field that keeps track
of the number of read accesses to a page reaches a limit, we
migrate that Flash page to a new empty page in SLC mode.
If there is no empty page available, a Flash block is evicted
and erased using our wear-level aware replacement policy.
Reassigning a frequently accessed page from MLC mode to
SLC mode improves performance by improving hit latency.
Because accesses to files in a server platform often follow a
long-tailed (Zipf) distribution with hot and cold data, improving the hit latency of frequently accessed (hot) Flash pages
improves overall performance despite the minor reduction
in Flash capacity.
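The read-frequency trigger can be sketched as a counter check on each Flash read. The entry layout, the default limit of 1000 reads, and the counter reset after migration are assumptions made for illustration; `evict_block` is a hypothetical stand-in for the wear-level aware replacement policy and must return at least one freed page.

```python
def on_flash_read(entry, free_pages, evict_block, read_limit=1000):
    """Count a read and migrate a hot MLC page to an empty page in SLC mode.

    entry: dict with keys "mode" ("MLC" or "SLC"), "reads", and "location".
    free_pages: list of empty physical page ids.
    evict_block: callable that evicts and erases one block and returns the
        list of page ids it freed.
    Returns True if the page was migrated.
    """
    entry["reads"] += 1
    if entry["mode"] != "MLC" or entry["reads"] < read_limit:
        return False
    if not free_pages:
        # No empty page: evict and erase a block to obtain free pages.
        free_pages.extend(evict_block())
    entry["location"] = free_pages.pop()
    entry["mode"] = "SLC"
    entry["reads"] = 0   # assumption: the counter restarts after migration
    return True
```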
If a Flash page reaches the ECC strength limit and has
already been set to SLC mode, the block containing it is permanently retired and is never again considered when allocating pages for the disk cache.
4. METHODOLOGY
We evaluated the Flash memory controller and Flash device
using a full system simulator called M5.² The M5 simulation
infrastructure was used to generate access profiles, which we combined
with published access energy data to estimate system memory and disk drive power consumption. We developed a
separate Flash disk cache simulator for reliability and disk
cache miss rate experiments where very long traces are necessary, because full system simulators are slow. Given the
limitations of our simulation infrastructure, a server workload with a large working set of hundreds to thousands of gigabytes
cannot easily be evaluated. We therefore scaled down our benchmarks, system memory size, Flash size, and disk drive size
to run on our simulation infrastructure.
We also generated micro-benchmark disk traces to model
synthetic disk access behavior. They represent typical access
distributions and approximate real disk usage. To properly
stress the system, some micro-benchmarks with uniformly
random and exponential distributions were also generated.
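A synthetic trace generator of this kind might look like the sketch below. It is not the generator used for the experiments: the function name, the parameters, and the use of a Zipf distribution as the "typical" access pattern are assumptions, and only the uniform and exponential cases are named explicitly in the text.

```python
import random


def make_trace(num_accesses, num_pages, distribution,
               seed=0, zipf_s=1.0, exp_scale=0.1):
    """Return a list of page ids drawn from the named access distribution."""
    rng = random.Random(seed)   # fixed seed keeps traces reproducible
    if distribution == "uniform":
        return [rng.randrange(num_pages) for _ in range(num_accesses)]
    if distribution == "zipf":
        # The page with popularity rank r gets weight 1 / r**zipf_s.
        weights = [1.0 / (rank ** zipf_s) for rank in range(1, num_pages + 1)]
        return rng.choices(range(num_pages), weights=weights, k=num_accesses)
    if distribution == "exponential":
        # Exponentially decaying popularity, clamped to the valid page range.
        mean = exp_scale * num_pages
        return [min(int(rng.expovariate(1.0 / mean)), num_pages - 1)
                for _ in range(num_accesses)]
    raise ValueError(f"unknown distribution: {distribution}")
```

Replaying such a trace through a cache model gives a quick estimate of hit rates for each access pattern before running the much slower full system simulation.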
We used disk traces from the University of Massachusetts
Trace Repository²⁰ to model the disk behavior of enterprise-level
applications such as web servers, database servers, and
web search. To measure performance and power, we used
dbt2 (OLTP) and SPECWeb99, which generate representative disk/disk cache traffic.