GPUs employ single-program, multiple-data (SPMD) parallelism, where the same program is used to concurrently process many different parts of the input data. GPU programs typically use tens of thousands of lightweight threads running similar or identical code with little control-flow variation. Conventional operating system services (such as the POSIX file system API) were not built with such an execution environment in mind. In developing GPUfs, we had to adapt both the API semantics and its implementation to support such massive parallelism, allowing thousands of threads to efficiently invoke open, close, read, or write calls simultaneously.
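To make the scale concrete, the CUDA sketch below shows one way such a massively parallel file API might be used from inside a kernel: each thread block opens a file once, its threads read disjoint chunks in parallel, and the block closes the file when done. The gpufs_open, gpufs_pread, and gpufs_close declarations are illustrative placeholders, not the actual GPUfs interface.

#include <cstddef>

// Placeholder declarations for a GPU-callable file API; these are
// illustrative stand-ins, not the real GPUfs calls.
__device__ int gpufs_open(const char *path, int flags);
__device__ int gpufs_pread(int fd, void *buf, size_t count, size_t offset);
__device__ int gpufs_close(int fd);

#define GPUFS_RDONLY 0   // hypothetical flag value

// Each thread block opens the file once; its threads then read and scan
// disjoint chunks in parallel, so thousands of threads issue file I/O
// without thousands of redundant open/close calls.
__global__ void count_marked_chunks(const char *path, size_t chunk_size,
                                    int *hits)
{
    __shared__ int fd;                    // one descriptor shared per block
    if (threadIdx.x == 0)
        fd = gpufs_open(path, GPUFS_RDONLY);
    __syncthreads();

    char buf[256];
    size_t offset =
        ((size_t)blockIdx.x * blockDim.x + threadIdx.x) * chunk_size;
    int n = gpufs_pread(fd, buf, sizeof(buf), offset);  // pread-style read
    if (n > 0 && buf[0] == '#')           // trivial per-chunk check
        atomicAdd(hits, 1);

    __syncthreads();                      // wait for all reads before closing
    if (threadIdx.x == 0)
        gpufs_close(fd);
}

Amortizing the open and close across a thread block rather than issuing them per thread is one plausible example of the kind of semantic adjustment described above: the call pattern stays POSIX-like, but its granularity is matched to the SPMD execution model.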
To feed their voracious appetites for data, high-end GPUs usually have their own dedicated DRAM storage. A massively parallel memory interface to GPU memory offers high bandwidth for local access by GPU code, but GPU access to the CPU’s system memory is an order of magnitude slower, as it requires communication over a bandwidth-constrained, higher-latency PCI Express bus. In the increasingly common case of systems with multiple discrete GPUs (standard in Apple’s Mac Pro), each GPU has its own local memory, and accessing a GPU’s own memory can be an order of magnitude more efficient than accessing a sibling GPU’s memory. GPUs thus exhibit a particularly extreme non-uniform memory access (NUMA) property, making it performance-critical for the operating system to optimize for access locality in data placement and reuse across CPU and GPU memories; for example, GPUfs distributes its buffer cache across all CPU and GPU memories to enable idioms like process pipelines that produce and consume files across multiple processors.
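As a rough sketch of that pipeline idiom (the file name, kernel, and GPU-side calls are all hypothetical), the host-side code below has a CPU stage write an intermediate file with ordinary POSIX calls and then launches a GPU kernel that consumes the same file through a GPUfs-like API. With a buffer cache distributed across CPU and GPU memories, the pages the kernel touches can stay resident in GPU memory for reuse by later kernels, rather than crossing the PCI Express bus on every access.

#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>

// GPU consumer stage: a kernel that would open and read the file through
// GPU-callable calls like those sketched earlier (definition omitted; the
// name and interface are illustrative only).
__global__ void consume_kernel(const char *path);

int main(void)
{
    // CPU producer stage: write the intermediate file with plain POSIX calls.
    int fd = open("pipeline.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    const char produced[] = "...output of the CPU stage...";
    write(fd, produced, sizeof(produced));
    close(fd);

    // Copy the path into GPU memory so the kernel can name the file.
    const char path[] = "pipeline.dat";
    char *d_path;
    cudaMalloc(&d_path, sizeof(path));
    cudaMemcpy(d_path, path, sizeof(path), cudaMemcpyHostToDevice);

    // GPU consumer stage: the kernel reads the file on the GPU, so the
    // programmer does not have to read it on the CPU and copy the bytes
    // across the bus by hand; cached pages can be reused by later kernels.
    consume_kernel<<<64, 256>>>(d_path);
    cudaDeviceSynchronize();

    cudaFree(d_path);
    return 0;
}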
To highlight the benefits of bringing file system abstractions to GPUs,