the controller. In enterprise SSDs, large SRAM is often used for executing the SSD firmware, and both user data and internal SSD metadata are cached in external DRAM.

Interestingly, SSDs generally have a far larger aggregate internal bandwidth than the bandwidth supported by host I/O interfaces (such as SAS and PCIe). Figure 2 outlines an example of a conventional storage system that leverages a plurality of NVM Express (NVMe)b SSDs; 64 of them are connected to 16 PCIe switches that are mounted to a host machine via 16 lanes of PCIe Gen3. While this storage architecture provides a commodity solution for high-capacity storage, the maximum data transfer rate between the host machine and the SSDs is bound by the PCIe interface speed (see Figure 2a), which is approximately 16GB/sec, regardless of the number of SSDs accessed in parallel. There is thus an 8x throughput gap between the host interface and the total aggregated SSD bandwidth, which could be up to ~2GB/sec per SSDc X 64 SSDs = ~128GB/sec (see Figure 2b). More interestingly, this gap grows further if the internal SSD performance is considered. A modern enterprise-level SSD usually consists of 16 or 32 flash channels, as outlined in Figure 2. Since each flash channel can sustain ~500MB/sec, internally each SSD can deliver up to ~500MB/sec per channel X 32 channels = ~16GB/sec (see Figure 2d), and the total aggregated in-SSD performance would be ~16GB/sec per SSD X 64 SSDs = ~1TB/sec (see Figure 2c), a 64x gap. Making SSDs programmable would thus allow systems to fully leverage this abundant bandwidth.

b A device interface for accessing non-volatile memory attached via a PCI Express (PCIe) bus.
c Practical sequential-read bandwidth of a commodity PCIe SSD.
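The gap arithmetic is simple enough to recompute directly. The short C program below is only a sketch that reproduces the back-of-the-envelope figures, using the per-lane, per-SSD, and per-channel numbers annotated in Figure 2 as approximate constants (~1GB/sec per PCIe Gen3 lane is assumed); it prints 16GB/sec, 128GB/sec (8x), and 1,024GB/sec (64x).

    #include <stdio.h>

    /* Back-of-the-envelope bandwidth figures from Figure 2, all in GB/sec. */
    int main(void) {
        double host_bw     = 16.0 * 1.0;      /* (a) 16 PCIe Gen3 lanes x ~1GB/sec  */
        double ssd_aggr    = 64.0 * 2.0;      /* (b) 64 SSDs x ~2GB/sec each        */
        double per_ssd     = 32.0 * 0.5;      /* (d) 32 flash channels x ~500MB/sec */
        double in_ssd_aggr = 64.0 * per_ssd;  /* (c) 64 SSDs x ~16GB/sec each       */

        printf("host interface   : %7.1f GB/sec\n", host_bw);
        printf("aggregate SSD    : %7.1f GB/sec (%.0fx host)\n",
               ssd_aggr, ssd_aggr / host_bw);
        printf("aggregate in-SSD : %7.1f GB/sec (%.0fx host)\n",
               in_ssd_aggr, in_ssd_aggr / host_bw);
        return 0;
    }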
In-Storage Programming
Modern SSDs combine processing (an embedded processor) and storage components (SRAM, DRAM, and flash memory) to carry out the routine functions required for managing the SSD. These computing resources present interesting opportunities to run general user-defined programs. In 2013, Do et al.6,17 explored such opportunities for the first time in the context of running selected database operations inside a Samsung SAS flash SSD. They wrote simple selection and aggregation operators that were compiled into the SSD firmware and extended the execution framework of Microsoft SQL Server 2012 to develop a working prototype in which simple selection and aggregation queries could be run end-to-end.
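A minimal sketch of what such an offloaded operator might look like is shown below in C: it scans fixed-size records staged in in-SSD DRAM, applies a selection predicate, and keeps a running sum and count. The record layout, the scan_page name, and the predicate are hypothetical illustrations, not code from the prototype, whose operators were compiled into the SSD firmware itself.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical fixed-size record layout for a table scanned inside the SSD. */
    typedef struct {
        int32_t key;
        int32_t value;
    } record_t;

    /* Running aggregate maintained by the in-SSD operator. */
    typedef struct {
        int64_t sum;
        int64_t count;
    } agg_t;

    /* Selection plus aggregation over one page of records already staged in
     * SSD DRAM: keep records whose key lies in [lo, hi] and sum their values.
     * In real firmware, page_buf and page_bytes would come from the I/O path. */
    static void scan_page(const uint8_t *page_buf, size_t page_bytes,
                          int32_t lo, int32_t hi, agg_t *agg)
    {
        const record_t *rec = (const record_t *)page_buf;
        size_t n = page_bytes / sizeof(record_t);
        for (size_t i = 0; i < n; i++) {
            if (rec[i].key >= lo && rec[i].key <= hi) {  /* selection predicate */
                agg->sum   += rec[i].value;              /* aggregation         */
                agg->count += 1;
            }
        }
    }

    int main(void) {
        /* Host-side stand-in: a fake "page" holding four records. */
        record_t page[4] = { {1, 10}, {5, 20}, {9, 30}, {12, 40} };
        agg_t agg = { 0, 0 };
        scan_page((const uint8_t *)page, sizeof(page), 2, 10, &agg);
        printf("sum=%lld count=%lld\n", (long long)agg.sum, (long long)agg.count);
        return 0;
    }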
That work demonstrated a several-fold improvement in performance and energy efficiency by offloading database operations onto the SSD, and it highlighted a number of challenges that would need to be overcome to broadly adopt programmable SSDs:
First, the computing capabilities available inside the SSD are limited by design. The low-performance embedded processor, the absence of L1/L2 caches, and the high latency to the in-SSD DRAM all demand extra care to run user code in the SSD without creating a performance bottleneck.
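One common way to program around an uncached, high-latency DRAM path, sketched here as a hypothetical C fragment (the dram_read_bulk helper and the 4KB staging-buffer size are assumptions standing in for vendor-specific firmware primitives), is to stream data through a small SRAM buffer in a few large transfers instead of touching DRAM a few bytes at a time.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define SRAM_CHUNK 4096   /* hypothetical size of an SRAM staging buffer */

    /* Stand-in for a firmware-specific bulk copy (or DMA) from SSD DRAM into
     * on-chip SRAM; a real controller would expose a vendor primitive here. */
    static void dram_read_bulk(void *sram_dst, const void *dram_src, size_t len) {
        memcpy(sram_dst, dram_src, len);
    }

    /* Sum a byte array held in SSD DRAM by pulling it through an SRAM-sized
     * staging buffer in large sequential transfers, rather than issuing many
     * small, latency-bound reads over the uncached DRAM path. */
    uint64_t sum_bytes(const uint8_t *dram_buf, size_t len) {
        static uint8_t sram_buf[SRAM_CHUNK];   /* would be placed in SRAM */
        uint64_t total = 0;
        for (size_t off = 0; off < len; off += SRAM_CHUNK) {
            size_t chunk = (len - off < SRAM_CHUNK) ? (len - off) : SRAM_CHUNK;
            dram_read_bulk(sram_buf, dram_buf + off, chunk);
            for (size_t i = 0; i < chunk; i++)
                total += sram_buf[i];
        }
        return total;
    }

The point of the sketch is only that access granularity, not just algorithmic complexity, dictates performance on such hardware.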
Moreover, the embedded software-development process is complex, which makes programming and debugging very challenging. To maximize performance, Do et al. had to carefully plan the layout of the data structures used by the code running inside the SSD to avoid spilling out of the SRAM. Likewise, Do et al. used a hardware-debugging tool to debug programs running inside the SSD.
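To make the SRAM-layout concern concrete, the fragment below shows one way such planning can be encoded, with made-up numbers: the 512KB budget and the group-by table fields are assumptions, not details from Do et al. The operator state is declared with fixed capacities and checked against the budget at compile time, so a layout change that would spill into the slower DRAM fails the build rather than surfacing as a runtime performance cliff.

    #include <stdint.h>
    #include <assert.h>

    /* Hypothetical SRAM budget reserved for operator state (illustrative only). */
    #define OPERATOR_SRAM_BUDGET (512 * 1024)

    #define MAX_GROUPS 8192   /* fixed capacity chosen to fit the budget */

    /* Fixed-capacity group-by table: every field is statically sized so the
     * whole structure can be placed in SRAM and its footprint audited. */
    typedef struct {
        int32_t group_key[MAX_GROUPS];
        int64_t group_sum[MAX_GROUPS];
        int32_t group_count[MAX_GROUPS];
        int32_t used;
    } groupby_state_t;

    /* Fail the build (C11) if a layout change would spill the state out of
     * the reserved SRAM region. */
    static_assert(sizeof(groupby_state_t) <= OPERATOR_SRAM_BUDGET,
                  "group-by state no longer fits the reserved SRAM region");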
Figure 1. Internal architecture of a modern flash SSD. (The SSD controller integrates a host-interface controller, an embedded processor with SRAM, and a DRAM controller attached to external DRAM, plus flash controllers that drive the flash channels connecting to the flash storage media.)
Figure 2. Example conventional storage server architecture with multiple NVMe SSDs. (The host CPU, DRAM, and PCIe root complex connect through PCIe switches to the flash SSDs of the SSD storage system. Annotations: (a) 16 lanes of PCIe = 16GB/sec; (b) 64 SSDs X 2GB/sec = ~128GB/sec; (c) 64 SSDs X 16GB/sec = ~1TB/sec; (d) 32 channels X ~500MB/sec = ~16GB/sec.)