flexible with enterprise-level capabilities and resources. It comprises
a main board and a storage board.
The main board contains an ARMv8
processor, 16GB of RAM, and various
on-chip hardware accelerators (such
as 20Gbps compression/decompression, 20Gbps SEC-crypto, and 10Gbps
RegEx engines). It also provides
NVMe connectivity via four PCIe Gen3
lanes, and 4x10Gbps Ethernet that
supports remote direct memory
access (RDMA) over Converged Ethernet (RoCE) protocol. It supports two
different storage boards that connect
via 2x4 PCIe Gen3 lanes: One type of
board (see Figure 5a) includes an embedded storage controller and four
memory slots where flash or other
forms of NVM can be installed; and
the second (see Figure 5b) is an adapter
that hosts two M.2 SSDs.
The ARM SoC inside the board runs
a full-fledged Ubuntu Linux, so programming the board is very similar
to programming any other Linux device. For instance, software can leverage the Linux container technology
(such as Docker) to provide isolated
environments inside the board. To
create applications for the board, a software development kit (SDK) is provided, containing GNU tools for building ARM applications and user/kernel-mode libraries for accessing the on-chip hardware accelerators, allowing a high level of programmability. The DFC can also serve as a block
device, just like regular SSDs. For this
purpose, the device is shipped with a
flash translation layer (FTL) that runs
on the main board.
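To give a rough sense of what an FTL does, the following toy C++ sketch shows a page-level logical-to-physical mapping; all names here are hypothetical, and a real FTL (including the DFC's) would also handle garbage collection, wear leveling, and power-loss recovery:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Toy page-level FTL: maps logical block addresses (LBAs) to physical
// flash pages and remaps on every overwrite, since NAND pages cannot
// be rewritten in place. Purely illustrative, not the DFC's FTL API.
class ToyFtl {
public:
    explicit ToyFtl(uint32_t physical_pages)
        : next_free_(0), total_(physical_pages) {}

    // Writing an LBA allocates a fresh physical page and updates the
    // map; the previously mapped page becomes stale (GC is not modeled).
    uint32_t write(uint32_t lba) {
        assert(next_free_ < total_ && "out of free pages; GC not modeled");
        map_[lba] = next_free_;
        return next_free_++;
    }

    // Reads consult the mapping table; -1 signals an unwritten LBA.
    int64_t translate(uint32_t lba) const {
        auto it = map_.find(lba);
        return it == map_.end() ? -1 : static_cast<int64_t>(it->second);
    }

private:
    std::unordered_map<uint32_t, uint32_t> map_;
    uint32_t next_free_;
    uint32_t total_;
};
```

The key point the sketch captures is the indirection itself: because the mapping lives on the device, overwrites and relocation stay invisible to the host, which simply sees a block device.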
The SSD industry is also moving
toward bringing compute to SSDs so
data can be processed without leaving
the place where it is originally stored.
For instance, in 2017 NGD Systemsl
announced an SSD called Catalina2
capable of running applications directly on the device. Catalina2 uses
TLC 3D NAND flash (up to 24TB),
which is connected to the onboard
ARM SoC that runs an embedded
Linux and modules for error-correcting code (ECC) and FTL. On the host
server, a tunnel agent (with C/C++
libraries) runs to talk to the device
through the NVMe protocol. As another example, ScaleFluxm uses a Xilinx
FPGA (combined with terabytes of
TLC 3D NAND flash) to process data
for data-intensive applications. The
host server runs a software module
that provides API access to the device
and is responsible for the FTL and
flash-management functionality.
Academia and industry are working to establish a compelling value
proposition by demonstrating application scenarios for each of the three
pillars outlined in Figure 4. Among
them we are initially focused on exploring the benefits and challenges
of moving compute closer to storage (see Figure 4b) in the context of
big data analytics, examining large
amounts of data to uncover hidden
patterns and insights.
Big data analytics within a programmable SSD. To demonstrate our approach, we have implemented a C++
reader that runs on a DFC card (see
Figure 5) for Apache Optimized Row
Columnar (ORC) files. The ORC file format is designed for fast processing and
high storage efficiency of big data analytic workloads, and has been widely
adopted in the open source community
and industry. The reader running inside the SSD reads large chunks of ORC
streams, decompresses them, and then
evaluates query predicates to find only
the necessary values. Due to the server-like
development environment (Ubuntu
and a general-purpose ARM processor) we easily ported a reference implementation of the ORC readern to the
ARM SoC environment (with only a few
lines of code changes) and incorporat-
l http://www.ngdsystems.com
m http://www.scaleflux.com
n https://github.com/apache/orc
putation on the data.
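The device-side filtering step described above can be sketched as follows; this is an illustrative sketch rather than the actual DFC reader, and the function `scan_chunk` and its signature are hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// In-device scan over an already-decompressed column chunk: the query
// predicate is evaluated next to the data, so only qualifying values
// need to cross the NVMe link back to the host.
std::vector<int64_t> scan_chunk(
    const std::vector<int64_t>& decompressed_column,
    const std::function<bool(int64_t)>& predicate) {
    std::vector<int64_t> hits;
    for (int64_t v : decompressed_column)
        if (predicate(v))          // predicate pushdown happens here
            hits.push_back(v);
    return hits;                   // typically far smaller than the chunk
}
```

The benefit comes from the asymmetry the sketch makes visible: the full chunk is read and decompressed inside the device, while only the (usually much smaller) result set consumes host bandwidth.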
SSDs with their powerful compute
capabilities can form a trusted domain for doing secure computation
on encrypted data, leveraging their internal hardware cryptographic engine
and secure boot mechanisms for this
purpose. Cryptographic keys can be
stored inside the SSD, allowing arbitrary compute to be carried out on the
stored data—after decryption if needed—while enforcing that data cannot
leave the device in cleartext form. This
allows a new, flexible, easily programmable, near-data realization of trusted
hardware in the cloud. Compared to
currently proposed solutions such as Intel
SGX enclaves,j which are protected, isolated
areas of execution in the host server
memory, this solution protects orders
of magnitude more data.
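As a conceptual sketch only (not a proposed design), the following C++ toy models the guarantee described above: a trivial XOR transform stands in for the SSD's hardware crypto engine, the key never leaves the "device" function just as keys never leave the drive, and only an aggregate result is returned rather than cleartext:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-in for in-device secure computation. The XOR "cipher" is
// NOT real cryptography; it only models the data flow: ciphertext in,
// decryption and compute inside the device, aggregate out, and the
// cleartext bytes never leave this function.
int64_t sum_encrypted_in_device(const std::vector<uint8_t>& ciphertext,
                                uint8_t device_key) {
    int64_t sum = 0;
    for (uint8_t c : ciphertext)
        sum += static_cast<uint8_t>(c ^ device_key);  // decrypt, then add
    return sum;  // only the aggregate crosses the device boundary
}
```

A real realization would replace the XOR with the drive's SEC-crypto engine and rely on secure boot to attest the code running inside the device.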
Programmable SSDs
While the concept of in-storage processing on SSDs was proposed more
than six years ago,6 experimenting with
SSD programming has been limited
by the availability of real hardware on
which a prototype can be built to demonstrate what is possible. The recent
emergence of prototyping boards available for both research and commercial
purposes has opened new opportunities for application developers to take
ideas from conception to action.
Figure 5 shows one such prototype
device, the Dragon Fire Card
(DFC),k,3,5 designed and manufactured by Dell EMC and NXP for research. The card is powerful and
j https://software.intel.com/en-us/sgx
k https://github.com/DFC-OpenSource
Figure 6. Preliminary results using a programmable SSD yield approximately 5x speedups
for full scans of ZLIB-compressed ORC files within the device, compared to native ORC
readers running on x86 architecture.
[Bar chart: Throughput (million rows/second). x86 (Intel Xeon 2.3GHz): 55.4; Programmable SSD (ARM 1.8GHz + decompression offload): 272.47.]