means that legacy OS improvement efforts in the storage space are still considered worthwhile.
With the advent of multicore, enhancing concurrency is a clear approach to improving performance.
Many legacy OS storage subsystems
realize concurrency and asynchrony
through kernel-based queues serviced
by worker threads, typically allocated one per processor core.
Software queues can be used to manage the mapping between application
threads running on specific cores
and the underlying hardware queues
available on the IO device. This flexibility was introduced into the Linux
3.13 kernel in 2014,5 providing greatly
improved IO scaling for multicore
and multi-queue systems. The Linux
kernel block IO architecture is aimed
at providing good performance in
the “general” case. As new IO devices
(both network and storage) reach the
realms of tens of millions of IOPS, the
generalized architecture and layering
of the software stack begin to strain.
Even state-of-the-art work on improving kernel IO performance has met with only limited success.15 Furthermore, even though the block IO layer itself may scale well, the layering of protocol stacks and file systems typically increases serialization and locking, and thus impacts overall performance.
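As a rough illustration of the software-to-hardware queue mapping described above, the following conceptual sketch (in C) maps one software queue per CPU onto a smaller set of hardware queues. The queue counts, names, and functions are invented for exposition and do not correspond to the kernel's actual blk-mq code.

/* Conceptual sketch: one software submission queue per CPU, each mapped
 * onto one of a smaller number of hardware queues exposed by the device.
 * Queue counts and names are illustrative only. */
#include <stdio.h>

#define NR_CPUS      44   /* hardware threads in the example system */
#define NR_HW_QUEUES 16   /* queues exposed by a hypothetical device */

static int sw_to_hw_queue[NR_CPUS];

static void map_queues(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sw_to_hw_queue[cpu] = cpu % NR_HW_QUEUES;
}

/* The submission path picks the hardware queue assigned to the submitting
 * CPU, so cores do not contend on a single shared queue lock. */
static int submit_io(int cpu)
{
    int hwq = sw_to_hw_queue[cpu];
    /* ... enqueue the request on hardware queue 'hwq' ... */
    return hwq;
}

int main(void)
{
    map_queues();
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        printf("cpu %2d -> hw queue %d\n", cpu, submit_io(cpu));
    return 0;
}

When the device exposes at least as many hardware queues as there are cores, the mapping becomes one-to-one and per-core submission requires no cross-core coordination.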
To help understand the relationship between storage IO throughput
and CPU demand, Figure 3 shows IOPS
scaling for the Linux Ext4 file system.
This data is captured with the fio micro-benchmarking tool configured to perform random writes of 4K blocks (random-read performance is similar). No file sharing is performed (each thread's workload is independent). The experimental system is an Intel E5-2699 v4 two-socket server platform with 512GB of DRAM main memory. Each processor has 22 cores (44 hardware threads) and the system contains 24 Samsung 172Xa NVMe SSD 1.5TB PCIe devices. Total IO throughput capacity is ∼6.5M IOPS (25GB/s). Each device attaches at PCIe Gen 3 x8 (7.8GB/s), a single QPI (processor interconnect) link provides ∼19.2GB/s, and each processor has 40 PCIe Gen 3.0 lanes (39.5GB/s).
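For illustration, the following is a minimal single-threaded sketch (in C) of the kind of random-write loop involved. The measurements themselves were taken with fio across many worker threads; the file path, region size, and IO count here are placeholders.

/* Single-threaded 4K random-write loop, roughly approximating one fio
 * worker.  Assumes a pre-allocated file on the Ext4 file system; the
 * path and sizes are placeholders. */
#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCK  4096ULL         /* 4K IO size, matching the benchmark */
#define SPAN   (1ULL << 30)    /* 1GiB region; file must be pre-allocated */
#define NR_IOS 100000ULL

int main(void)
{
    int fd = open("/mnt/ext4/fio-test.bin", O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, BLOCK, BLOCK)) return 1;   /* O_DIRECT needs alignment */
    memset(buf, 0, BLOCK);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < NR_IOS; i++) {
        uint64_t off = (rand() % (SPAN / BLOCK)) * BLOCK;   /* random 4K-aligned offset */
        if (pwrite(fd, buf, BLOCK, off) != (ssize_t)BLOCK) { perror("pwrite"); return 1; }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.0f IOPS, %.1f MB/s\n", NR_IOS / secs, NR_IOS * BLOCK / secs / 1e6);
    free(buf);
    close(fd);
    return 0;
}

fio layers job control, asynchronous queue depth, and multi-threaded aggregation on top of this basic pattern.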
The maximum throughput achieved is 3.2M IOPS (12.21GB/s). This is realized at a load of ∼26 threads (one per device) and 30% of total CPU capacity. Adding threads from 17 to 26 gives negligible scaling. Beyond 26 worker threads, performance begins to degrade and become unpredictable, although CPU utilization remains linear for some time.
File systems and kernel IO processing also add latency. Figure 4 shows latency data for direct device access (using Micron's kernel-bypass UNVMe framework) and the stock Ext4 file system. This data is from a single Intel Optane P4800X SSD. The file system and kernel latency (mean 13.92μsec) is approximately double the raw latency of the device (mean 6.25μsec). For applications where synchronous performance is paramount and latency is difficult to hide through pipelining, this performance gap can be significant.
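As a rough back-of-the-envelope illustration, a fully synchronous workload issuing one 4K IO at a time would be limited to roughly 1/13.92μsec ≈ 72K IOPS per thread through the Ext4 path, versus roughly 1/6.25μsec ≈ 160K IOPS against the raw device, so the added kernel and file-system latency translates directly into lost throughput whenever it cannot be overlapped with other work.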
Application-specific IO subsystems.
An emerging paradigm is to enable
customization and tailoring of the IO
stack by “lifting” IO functions into user
space. This approach protects system stability where custom IO processing is introduced (that is, a custom stack can crash without bringing down the rest of the system) and allows developers
Figure 3. Ext4 file system scaling on software RAID0.
Figure 4. Ext4 vs. raw latency comparison (rand-write on Ext4 vs. rand-write via uNVMe).