5.4. Multi-GPU experiments
Although our approaches leverage both coarse- and fine-grained parallelism, there is still more available parallelism than can be handled by a single GPU. Our methods easily extend to multiple GPUs as well as to multiple nodes. We extend the algorithm by distributing a subset of roots to each GPU. Since each root can be processed independently in parallel, we should expect close to perfect scaling if each GPU has a sufficient (and evenly distributed) amount of work.
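To make the distribution concrete, the following minimal sketch (our illustration, not the exact implementation; the helper name is hypothetical) block-partitions the root vertices so that each GPU receives a contiguous, evenly sized chunk of independent work:

    // Illustrative sketch only: block-partition root vertices across GPUs.
    // roots_for_gpu() is a hypothetical helper, not part of the actual code.
    #include <algorithm>
    #include <cstdio>
    #include <utility>

    // Returns the half-open range [begin, end) of roots assigned to one GPU.
    std::pair<size_t, size_t> roots_for_gpu(size_t num_roots, int gpu, int num_gpus)
    {
        size_t chunk = (num_roots + num_gpus - 1) / num_gpus;  // ceiling division
        size_t begin = std::min(num_roots, static_cast<size_t>(gpu) * chunk);
        size_t end   = std::min(num_roots, begin + chunk);
        return {begin, end};
    }

    int main()
    {
        // Example: 4096 sampled roots spread over 8 GPUs.
        for (int g = 0; g < 8; ++g) {
            std::pair<size_t, size_t> range = roots_for_gpu(4096, g, 8);
            std::printf("GPU %d processes roots [%zu, %zu)\n",
                        g, range.first, range.second);
        }
        return 0;
    }

Each GPU then runs the single-GPU algorithm over its own chunk of roots, accumulating partial BC scores locally.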
Since the local data structures for each root are independent (and thus only need to reside on one GPU), we replicate the data representing the graph itself across all GPUs to eliminate communication bottlenecks. Once each GPU has its local copy of the BC scores, these local copies are accumulated across all of the GPUs on each node. Finally, the node-level scores are reduced into the global BC scores by a simple call to MPI_Reduce().
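The sketch below illustrates this reduction step; the vertex count and buffer names are hypothetical assumptions rather than our exact code, but the call shape is standard MPI:

    // Sketch of the final reduction: sum each node's partial BC scores into
    // the global scores on rank 0. Buffer names and sizes are illustrative.
    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        const int n = 1 << 20;                   // number of vertices (example)
        std::vector<float> local_bc(n, 0.0f);    // this node's accumulated scores
        std::vector<float> global_bc(n, 0.0f);   // final scores, valid on rank 0

        // ... per-GPU BC computation and on-node accumulation into local_bc ...

        // Element-wise sum of every node's partial scores; result lands on rank 0.
        MPI_Reduce(local_bc.data(), global_bc.data(), n,
                   MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }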
[Figure 5. Comparison of work-efficient and sampling methods. The vertical axis gives speedup over the edge-parallel method (Jia et al.) for the af_shell, caida, cnr, amazon, del20, gowalla, luxem, and small graphs.]

[Figure 4. Scaling by problem size for three different types of graphs: (a) rgg, (b) delaunay, and (c) kron. Each panel plots log2(Time) against log2(Number of vertices) for GPU-FAN, GPU-FAN extrapolated, Edge-parallel (Jia et al.), and Sampling.]

Figure 6 shows how well our algorithm scales out to multiple GPUs for the delaunay, rgg, and kron graphs. It shows that linear speedup is easily achievable if the problem size is sufficiently large (i.e.,
if there is sufficient work for each GPU). Linear speedups are achievable at even smaller graph scales for denser network structures. For instance, using 64 nodes provides about a 35× speedup over a single node for a scale-16 delaunay graph, whereas using the same number of nodes at the same scale for the rgg and kron graphs provides over 40× and 50× speedups, respectively. The scaling behavior seen in Figure 6 is not unique to these graphs because of the vast amount of coarse-grained parallelism offered by the algorithm. For graphs of sufficiently large size, this scalability can be obtained independently of network structure.
6. CONCLUSION
In this article we have discussed various methods for computing Betweenness Centrality on the GPU. Leveraging information about the structure of the graph, we present several approaches that choose between two methods of parallelism: edge-parallel and work-efficient. For high-diameter graphs, using asymptotically optimal algorithms is paramount to obtaining good performance, whereas for low-diameter graphs it is preferable to maximize memory throughput, even if unnecessary work is completed. In addition, our methods are more scalable and general than existing implementations. Finally, we run our algorithm on a cluster of 192 GPUs, showing that speedup scales almost linearly with the number of GPUs, regardless of network structure. Overall, our single-GPU approaches perform 2.71× faster on average than the best previous GPU approach.
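One way the diameter-based choice described above could be expressed is sketched below; the estimate_diameter() helper and the threshold shown are purely illustrative assumptions, not the exact rule used by our implementation:

    // Illustrative heuristic only: pick a parallelization strategy from an
    // estimated graph diameter. The threshold and any diameter estimator
    // (e.g., a few sampled BFS traversals) are hypothetical placeholders.
    #include <cmath>

    enum class Strategy { WorkEfficient, EdgeParallel };

    Strategy choose_strategy(long estimated_diameter, long num_vertices)
    {
        // High-diameter graphs (long, thin BFS frontiers) favor the
        // asymptotically optimal work-efficient approach; low-diameter graphs
        // (wide frontiers) favor the throughput-oriented edge-parallel approach.
        const double threshold =
            4.0 * std::log2(static_cast<double>(num_vertices));
        return (estimated_diameter > threshold) ? Strategy::WorkEfficient
                                                : Strategy::EdgeParallel;
    }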
For future work we would like to efficiently map additional graph analytics to parallel architectures. The importance of robust, high-performance primitives cannot be
overstated for the implementation of more complicated
parallel algorithms. Ideally, GPU kernels should be modular and reusable; fortunately, packages such as Thrust9 and CUB (CUDA Unbound)18 are beginning to bridge this
gap. A software environment in which users have access
to a suite of high-performance graph analytics on the GPU