While conceptually Filesystem in User Space (FUSE) technology could be used to integrate user-space file systems into the kernel-based file system hierarchy, the performance advantages would be lost because control must still pass into the kernel. Evolution of the POSIX API is needed to support hybrid kernel and user IO. "Pure" user-space file systems are still not broadly available.
˲ Legacy file systems and protocol stacks incorporate complex software that has taken years of development and debugging. In some cases, this software can be integrated through "wrappers"; in general, however, this is challenging, and redeveloping the software from the ground up is more economical.
Integration of NVDIMMs
Non-Volatile Dual Inline Memory
Modules (NVDIMMs) attach non-volatile memory directly to the memory
bus, opening the possibility of application programs accessing persistent
storage via load/store instructions.
This requires additional libraries
and/or programming language extensions5, 9 to support the coexistence
of both volatile and non-volatile
memory. The fundamental building
blocks needed are persistent memory
management (for example, pool and
heap allocators), cache management,
transactions, garbage collection, and
data structures that can operate with
persistent memory (for example, support recovery and reinstantiation).
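As an illustration of this load/store model, the sketch below maps a persistent pool with libpmem (one of the libraries from the pmem.io project discussed next) and makes an update durable; the pool path and size are placeholders and error handling is abbreviated.

    #include <libpmem.h>
    #include <stdio.h>
    #include <string.h>

    #define POOL_SIZE (64 * 1024 * 1024)    /* illustrative 64MB pool */

    int main(void)
    {
        size_t mapped_len;
        int is_pmem;

        /* Map a file on a DAX-capable file system directly into the
           address space; the path is a placeholder. */
        char *base = pmem_map_file("/mnt/pmem/example.pool", POOL_SIZE,
                                   PMEM_FILE_CREATE, 0666,
                                   &mapped_len, &is_pmem);
        if (base == NULL) {
            perror("pmem_map_file");
            return 1;
        }

        /* Persistent data is updated with ordinary store instructions... */
        strcpy(base, "hello, persistent world");

        /* ...and then made durable. pmem_persist flushes CPU caches when
           the mapping is true persistent memory; otherwise fall back to
           msync semantics. */
        if (is_pmem)
            pmem_persist(base, strlen(base) + 1);
        else
            pmem_msync(base, strlen(base) + 1);

        pmem_unmap(base, mapped_len);
        return 0;
    }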
Today, two prominent open source projects are pushing forward the development of software support for persistent memory. These are pmem.io (http://pmem.io/), driven primarily by Intel Corporation in conjunction with SNIA, and The Machine project (https://www.labs.hpe.com/the-machine) from HP Labs. These projects are working to build tools and libraries that support access and management of NVDIMMs. Key challenges that are being explored by these projects and others, 3, 8, 17, 20 include:
˲ Cross-heap pollution: Pointers to volatile data structures should not "leak" into the non-volatile heap. New programming language semantics are needed to explicitly avoid programming errors that lead to dangling pointers.
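To make the cross-heap hazard concrete, the fragment below stores a pointer to volatile memory inside a persistent object; pm_alloc and pm_persist are hypothetical stand-ins for a persistent allocator and flush primitive, not the API of any particular library.

    #include <stdlib.h>

    /* Hypothetical persistent-heap primitives, modeled with malloc and a
       no-op flush purely so the sketch compiles. */
    static void *pm_alloc(size_t size)               { return malloc(size); }
    static void  pm_persist(const void *a, size_t n) { (void)a; (void)n; }

    struct record {
        long  id;
        char *note;    /* raw pointer embedded in a persistent object */
    };

    void cross_heap_pollution(void)
    {
        struct record *r = pm_alloc(sizeof *r);   /* non-volatile heap */

        r->id   = 42;
        r->note = malloc(64);                      /* volatile heap */
        pm_persist(r, sizeof *r);

        /* After a restart, r->id can be recovered, but r->note is a
           dangling pointer: the volatile allocation (and its
           address-space layout) no longer exists. Language or library
           support is needed to detect or forbid such cross-heap
           references. */
    }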
Memory flushing in Linux. To optimize write-through to storage, it is also
necessary to track dirty pages, so that
only those that have been modified are
flushed out to storage. If a page has only
been read during its active mapping,
there is no need to write it back out to
storage. From the kernel’s perspective,
this function can be easily achieved by
checking the page’s dirty bit in its corresponding page table entry. However, as
noted earlier, accessing the page table
from user space is problematic. In our
own work, we have used two different
approaches to address this problem.
The first is to use a CRC checksum over the memory to identify dirty pages. Both Intel x86 and IBM Power architectures have CRC32 accelerator instructions that can compute a checksum over a 4KB page in under ∼1,000 cycles. Note that optimizations such as performing the CRC32 on 1,024-byte blocks and "short circuiting" the dirty-page identification can further reduce the cost of CRC in this context.
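A sketch of the checksum approach on x86 follows, using the SSE4.2 CRC32 intrinsic (compile with -msse4.2); the page size and the bookkeeping around recorded checksums are application choices, and a block-wise variant with early exit corresponds to the short-circuit optimization noted above.

    #include <nmmintrin.h>    /* _mm_crc32_u64, SSE4.2 */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096

    /* Checksum a 4KB page with the hardware CRC32 instruction, eight
       bytes at a time. */
    static uint64_t page_crc(const void *page)
    {
        const uint64_t *words = page;
        uint64_t crc = ~0ULL;
        for (size_t i = 0; i < PAGE_SIZE / sizeof(uint64_t); i++)
            crc = _mm_crc32_u64(crc, words[i]);
        return crc;
    }

    /* A page is treated as dirty when its current checksum differs from
       the one recorded when the page was mapped; only such pages are
       flushed back to storage. */
    static bool page_is_dirty(const void *page, uint64_t recorded_crc)
    {
        return page_crc(page) != recorded_crc;
    }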
An alternative approach is to use a
kernel module to collect dirty page information on request from an application.
This, of course, incurs an additional
system call and page table walk. Consequently, this approach performs well
with small page tables, but is less performant than CRC when traversal across
many page table entries is needed.
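The kernel-module approach might be driven from user space roughly as below; the device node /dev/dirty_tracker and the DIRTY_PAGES_QUERY ioctl are hypothetical, since the interface of such a module is specific to each implementation.

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* Hypothetical request: the module walks the page table for the
       given range and sets one bit per dirty page in the user-supplied
       bitmap. */
    struct dirty_query {
        void     *base;      /* start of the mapped region */
        size_t    length;    /* length in bytes */
        uint64_t *bitmap;    /* one bit per page, filled by the module */
    };

    #define DIRTY_PAGES_QUERY _IOWR('d', 1, struct dirty_query)  /* hypothetical */

    static int query_dirty_pages(void *base, size_t length, uint64_t *bitmap)
    {
        int fd = open("/dev/dirty_tracker", O_RDWR);   /* hypothetical device */
        if (fd < 0)
            return -1;

        struct dirty_query q = { .base = base, .length = length, .bitmap = bitmap };
        int rc = ioctl(fd, DIRTY_PAGES_QUERY, &q);  /* extra syscall + page-table walk */

        close(fd);
        return rc;
    }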
Legacy integration. Designing
around a kernel bypass architecture is
a significant paradigm shift for application development. Consequently,
there are some practical limitations to its adoption in legacy systems.
˲ Integration with existing applications based on a blocking threading model requires either considerable rewriting to adhere to an asynchronous/polling model, or shims to bridge the two (a sketch of such a shim appears after this list). The latter reduces the potential performance benefits.
˲ Sharing storage devices between multiple processes. Network devices handle this well via SR-IOV, but SR-IOV support has only recently been added to the NVMe specification. Hence, sharing NVMe devices across multiple processes must be done through software.
˲ Integration with the existing file system structures is difficult.
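As referenced above, a shim that preserves a blocking interface on top of an asynchronous, polled back end might look like the following sketch; storage_submit_read and storage_poll_completions are hypothetical placeholders for a kernel-bypass driver API (for example, a user-level NVMe library), not calls from any specific SDK.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical asynchronous back end (placeholders for a user-level,
       polled driver interface). Completions are delivered by a callback
       invoked from the polling routine. */
    typedef void (*io_cb_t)(void *arg, int status);
    int  storage_submit_read(uint64_t lba, void *buf, size_t len,
                             io_cb_t cb, void *arg);
    void storage_poll_completions(void);

    struct wait_ctx {
        volatile bool done;
        int           status;
    };

    static void on_complete(void *arg, int status)
    {
        struct wait_ctx *w = arg;
        w->status = status;
        w->done   = true;
    }

    /* Blocking facade: submit the asynchronous read, then spin on the
       completion poller until it finishes. Legacy read()-style callers
       keep working, but much of the latency advantage is given back. */
    int blocking_read(uint64_t lba, void *buf, size_t len)
    {
        struct wait_ctx w = { .done = false, .status = 0 };

        if (storage_submit_read(lba, buf, len, on_complete, &w) != 0)
            return -1;
        while (!w.done)
            storage_poll_completions();
        return w.status;
    }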