ed library APIs into the reader, enabling
reading data from flash and offloading
the decompression work to the ARM
SoC hardware accelerator.
Figure 6 shows preliminary bandwidth results of scanning a ZLIB-compressed, single-column integer dataset (one billion rows) through the C++ ORC reader running on a host x86 server vs. inside the DFC card. (To compare data-processing capability fairly, only a single core was used in each case: an Intel Xeon core on the x86 host vs. an ARM core plus the decompression accelerator in the device.) As the figure shows, we achieved approximately 5x faster scan performance inside the device compared to running on the host server. Given that this is single-device performance, we should be able to achieve even greater improvements by increasing the number of programmable SSDs used in parallel.
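To make concrete what is being offloaded in this experiment, the following minimal C++ sketch shows the host-side form of such a scan: inflate a ZLIB-compressed block of integers, then run a trivial aggregate over it. In the in-device configuration, it is the inflate step that moves to the ARM SoC's decompression accelerator. The sketch uses plain zlib and synthetic data; it is illustrative only and is not the DFC SDK or the ORC reader code.

// Host-side sketch of a compressed-column scan (build with -lz).
// In the offloaded version, the uncompress() call is what the device's
// hardware accelerator performs instead of the host CPU.
#include <cstdint>
#include <iostream>
#include <vector>
#include <zlib.h>

int main() {
    // Build a small compressed block in memory so the example is self-contained.
    std::vector<int64_t> values(1 << 20, 42);
    uLong srcLen = values.size() * sizeof(int64_t);
    uLongf dstLen = compressBound(srcLen);
    std::vector<Bytef> compressed(dstLen);
    if (compress(compressed.data(), &dstLen,
                 reinterpret_cast<const Bytef*>(values.data()), srcLen) != Z_OK)
        return 1;

    // Scan path: decompress the block, then aggregate the decoded integers.
    std::vector<int64_t> decoded(values.size());
    uLongf outLen = srcLen;
    if (uncompress(reinterpret_cast<Bytef*>(decoded.data()), &outLen,
                   compressed.data(), dstLen) != Z_OK)
        return 1;

    int64_t sum = 0;
    for (int64_t v : decoded) sum += v;  // the predicate/aggregate work itself is cheap
    std::cout << "rows=" << decoded.size() << " sum=" << sum << "\n";
    return 0;
}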
In addition, scanning, filtering, and aggregating large volumes of data at high throughput rates by offloading part of the computation directly to the storage has been explored. In 2016, Jo et al.12 built a prototype that performs very early filtering of data through a combination of an ARM processor and a hardware pattern-matching engine available inside a programmable SSD equipped with a flow-based programming model described by Gu et al.7 When a query is given, the query planner determines whether early filtering is beneficial for the query and chooses a candidate table as the target if the estimated filtering ratio is sufficiently high. Early filtering is then performed against the target table inside the device, and only the filtered data is fetched to the host for residual computation. This early filtering inside the device turns out to be highly effective for analytic queries: when running all 22 TPC-H queries on a MariaDB server with the programmable device prototyped on a commodity NVMe SSD, Jo et al. achieved a 3.6x speedup compared to a system with the same SSD without the programmability.
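The decision and data flow just described can be sketched as follows; the device interface and the planner threshold are hypothetical stand-ins, not the prototype's actual API.

// Early-filtering sketch: push the filter down to the SSD only when the
// estimated filtering ratio is high enough; the host then runs the residual
// computation over the (much smaller) fetched result.
#include <functional>
#include <string>
#include <vector>

struct Row { long id; std::string payload; };
using Predicate = std::function<bool(const Row&)>;

// Stand-in for the in-device ARM + pattern-matching path: only matching rows
// ever leave the device.
std::vector<Row> device_early_filter(const std::vector<Row>& table, const Predicate& p) {
    std::vector<Row> survivors;
    for (const auto& r : table)
        if (p(r)) survivors.push_back(r);
    return survivors;
}

std::vector<Row> run_scan(const std::vector<Row>& table,
                          const Predicate& pushdown_pred,
                          const Predicate& residual_pred,
                          double estimated_filtering_ratio) {
    constexpr double kOffloadThreshold = 0.9;  // assumed planner cutoff
    std::vector<Row> fetched;
    if (estimated_filtering_ratio >= kOffloadThreshold)
        fetched = device_early_filter(table, pushdown_pred);  // filter next to the data
    else
        fetched = table;                                      // ordinary full-table fetch

    std::vector<Row> result;
    for (const auto& r : fetched)                  // residual computation on the host
        if (pushdown_pred(r) && residual_pred(r))  // re-checking the pushed filter is harmless
            result.push_back(r);
    return result;
}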
Alternatively, an FPGA-based prototype design for near-data processing inside a storage node for database engines was studied by István et al. in 2017. In this prototype, each storage node, which could be accessed over the network through a simple key-value store interface, provided fault tolerance through replication and application-specific processing (such as predicate evaluation, substring matching, and decompression) at line rate.
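The interface shape can be illustrated with a toy, in-memory stand-in; the class and method names here are hypothetical and are not the prototype's API.

// Key-value node sketch: get() carries a condition (a substring match) that is
// evaluated next to the data, and put() replicates to peer nodes for fault
// tolerance. A real implementation would run these paths in FPGA logic at line rate.
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class StorageNode {
public:
    explicit StorageNode(std::vector<StorageNode*> replicas = {})
        : replicas_(std::move(replicas)) {}

    // Write locally, then mirror the update to the replica set.
    void put(const std::string& key, const std::string& value) {
        store_[key] = value;
        for (StorageNode* r : replicas_) r->store_[key] = value;
    }

    // Conditional get: a value that does not contain the pattern is filtered
    // out inside the node and never crosses the network.
    std::optional<std::string> get_if_contains(const std::string& key,
                                               const std::string& pattern) const {
        auto it = store_.find(key);
        if (it == store_.end()) return std::nullopt;
        if (it->second.find(pattern) == std::string::npos) return std::nullopt;
        return it->second;
    }

private:
    std::unordered_map<std::string, std::string> store_;
    std::vector<StorageNode*> replicas_;
};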
Each application running in cloud datacenters has its own unique requirements, making it difficult to design server nodes with the proper balance of compute, memory, and storage. To cope with such complexity, an approach of physically decoupling resources was proposed by Han et al.9 in 2013 to allow replacing, upgrading, or adding individual resources instead of the entire node. With the availability of fast interconnect technologies (such as InfiniBand, RDMA, and RoCE), it is already common in today's large-scale cloud datacenters to disaggregate storage from compute, significantly reducing the total cost of ownership and improving the efficiency of storage utilization. However, storage disaggregation is a challenge15 as storage-media access latencies are heading toward the single-digit-microsecond level (for example, a 3D XPoint access can take roughly 5–10 µs, while NVMe SSD and disk accesses take roughly 50–100 µs and 10 ms, respectively); unlike a disk's millisecond latency, such latencies are no longer much larger than the overhead of a fast network. It is likely that, in the next few years, network latency will become a bottleneck as new, emerging non-volatile memories with extremely low latencies become available.
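A quick back-of-the-envelope calculation shows why. Assuming an illustrative 5-µs round trip on an RDMA-class fabric (an assumed figure, not one cited above), the network adds well under 1% to a disk access, a few percent to an NVMe SSD access, but roughly two-thirds of the cost of a 3D XPoint access.

// Relative network overhead for the media latencies cited above, using an
// assumed 5-us fabric round trip purely for illustration.
#include <cstdio>

int main() {
    const double network_rtt_us = 5.0;  // assumed round-trip time of a fast fabric
    struct Medium { const char* name; double latency_us; };
    const Medium media[] = {
        {"disk",      10000.0},  // ~10 ms
        {"NVMe SSD",     75.0},  // ~50-100 us
        {"3D XPoint",     7.5},  // ~5-10 us
    };
    for (const Medium& m : media) {
        double overhead_pct = 100.0 * network_rtt_us / m.latency_us;
        std::printf("%-9s: %8.1f us media + %.1f us network -> +%.2f%% overhead\n",
                    m.name, m.latency_us, network_rtt_us, overhead_pct);
    }
    return 0;
}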
This challenge of storage disaggregation can be overcome by using programmable storage, enabling a fully programmable storage substrate that is decoupled from the host substrate, as outlined in Figure 7. This view of storage as a programmable substrate allows application developers not only to leverage very low storage-medium access latency by running programs inside the storage device but also to access any remote storage device without involving the remote host server where the device is physically attached (see Figure 7) by