For the Delaunay mesh graphs shown in Figure 4b, we
can see that the edge-parallel method and the sampling
approach both outperform GPU-FAN at all scales. The
edge-parallel approach even outperforms the sampling approach for graphs containing fewer than 10,000
vertices; however, these differences in timing are trivial, as they are on the order of
milliseconds. As the graph size increases, the sampling
method clearly becomes dominant, and the speedup it
achieves grows with the scale of the graph. Finally, we
compare the sampling approach to GPU-FAN for kron in
Figure 4c. Although GPU-FAN is marginally faster than
the sampling approach for the smallest scale of graph,
the sampling approach is best at the next
scale, and the trend shows that its margin of victory grows with scale. Furthermore,
neither of the previous implementations could support
this type of graph at larger scales, whereas the sampling
method can.
5.3. Benchmarks
Figure 5 provides a comparison of the various parallelization methods discussed in this article to the edge-parallel method from Jia et al.10 For road networks and
meshes (af_shell, del20, luxem), all of the methods outperform the edge-parallel method by about 10×. The
amount of unnecessary work performed by the edge-parallel method for these graphs is severe. For the
remaining graphs (scale-free and small-world graphs),
the work-efficient method alone performs more slowly
than the edge-parallel method, whereas the sampling
method is either the same or slightly better. In these
cases we see the advantage of choosing the method of
parallelization based on the structure of the graph.
In the most extreme case, the edge-parallel approach
requires more than two and a half days to process the
af_shell9 graph, while the sampling approach cuts this
time to under five hours. Similarly, the edge-parallel approach takes over 48 min to process the
luxembourg.osm road network, whereas the sampling approach
requires just 6 min. Overall, sampling performs 2.71×
faster on average than the edge-parallel approach.
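To make the work-efficiency gap concrete, the sketch below (an illustrative cost model in our own notation, not the authors' CUDA implementation) counts edge inspections for a single BFS. An edge-parallel traversal assigns work to every edge at every BFS level, while a work-efficient frontier-based traversal touches only edges incident to discovered vertices, so on a high-diameter graph the gap is roughly a factor of the diameter.

```python
from collections import deque

def bfs_levels(adj, s):
    """Standard BFS computing the level (depth) of each reachable vertex."""
    level = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                q.append(v)
    return level

def edge_parallel_inspections(adj, s):
    # Edge-parallel style: every edge is inspected once per BFS level,
    # whether or not it touches the current frontier.
    level = bfs_levels(adj, s)
    depth = max(level.values())
    m = sum(len(nbrs) for nbrs in adj.values())  # directed edge count
    return m * (depth + 1)

def work_efficient_inspections(adj, s):
    # Work-efficient style: only edges leaving discovered vertices are
    # inspected, so each edge is touched a constant number of times.
    level = bfs_levels(adj, s)
    return sum(len(adj[u]) for u in level)

# A path graph (high diameter, like a road network) makes the gap obvious.
n = 64
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
```

On this 64-vertex path the edge-parallel count exceeds the work-efficient count by a factor equal to the number of BFS levels, which is consistent with the roughly 10× gap observed for the high-diameter road and mesh graphs above.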
Single-node experiments were run on an NVIDIA GeForce
GTX Titan that has 14 SMs and a base clock of 837
MHz. The Titan has 6 GB of GDDR5 memory and is a CUDA
compute capability 3.5 ("Kepler") GPU.
Multi-node experiments were run on the Keeneland
Initial Delivery System (KIDS).24 Each KIDS node has two Intel Xeon
X5660 CPUs running at 2.8 GHz and three Tesla M2090
GPUs. Nodes are connected by an InfiniBand
Quadruple Data Rate (QDR) network. The Tesla M2090
has 16 SMs, a clock frequency of 1.3 GHz, 6 GB of GDDR5
memory, and is a CUDA compute capability 2.0 ("Fermi") GPU.
We compare our techniques to both GPU-FAN22 and Jia et
al.10 when possible, using their implementations that have
been provided online. The graphs used for these comparisons are shown in Table 1. These graphs were taken from
the 10th DIMACS Challenge,1 the University of Florida
Sparse Matrix Collection,7 and the Stanford Network Analysis
Platform (SNAP).12 These benchmarks contain both real-world and randomly generated instances of graphs that
correspond to a wide variety of practical applications and
network structures. We focus our attention on the exact
computation of BC, noting that our techniques can be trivially adjusted for approximation.
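The adjustment for approximation amounts to running Brandes' algorithm from a sampled subset of source vertices rather than all of them, scaling the accumulated dependencies to keep the estimate unbiased. A minimal sequential Python sketch (names and structure are our own illustration, not the authors' GPU code):

```python
import random
from collections import deque

def betweenness(adj, sources=None):
    """Brandes' algorithm; restricting `sources` to a sample yields the
    approximation, scaled by n / |sources| to remain unbiased."""
    n = len(adj)
    if sources is None:
        sources = list(adj)  # all sources: exact BC
    bc = {v: 0.0 for v in adj}
    for s in sources:
        # Forward phase: BFS from s, recording shortest-path counts
        # (sigma) and shortest-path predecessors.
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        pred = {v: [] for v in adj}
        order, q = [], deque([s])
        while q:
            u = q.popleft(); order.append(u)
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1; q.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]; pred[v].append(u)
        # Backward phase: accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for v in reversed(order):
            for u in pred[v]:
                delta[u] += sigma[u] / sigma[v] * (1 + delta[v])
            if v != s:
                bc[v] += delta[v]
    scale = n / len(sources)
    return {v: c * scale for v, c in bc.items()}

# Exact BC uses all sources; the approximation samples a subset.
g = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
exact = betweenness(g)
approx = betweenness(g, sources=random.sample(list(g), 3))
```

The GPU versions parallelize within each source's traversal; the sampling knob above only changes how many traversals are launched, which is why the adjustment is trivial.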
5.2. Scaling
First we compare how well our algorithm scales with
graph size for three different types of graphs. Since the
implementation of Jia et al. cannot read graphs that contain isolated vertices, we were unable to obtain results
using this reference implementation for the random
geometric (rgg) and simple Kronecker (kron) graphs.
Additionally, since GPU-FAN ran out of memory at the
larger scales, we extrapolated the expected results from
those at smaller scales (denoted by dotted lines). Note that from
one scale to the next the number of vertices and number
of edges both double.
Noting the log-log scale on the axes, we can see from
Figure 4a that the sampling approach outperforms the
algorithm from GPU-FAN by over 12× for all scales of rgg.
Interestingly, the sampling approach takes only slightly
more time than GPU-FAN even when processing a graph four times as large.
Table 1. Graph datasets used for this study.
Graph Vertices Edges Max degree Diameter Description
af_shell9 504,855 8,542,010 39 497 Sheet metal forming
caidaRouterLevel 192,244 609,066 1,071 25 Internet router-level
cnr-2000 325,527 2,738,969 18,236 33 Web crawl
com-amazon 334,863 925,872 549 46 Amazon product
delaunay_n20 1,048,576 3,145,686 23 444 Random triangulation
kron_g500-logn20 1,048,576 44,619,402 131,503 6 Kronecker
loc-gowalla 196,591 1,900,654 29,460 15 Geosocial
luxembourg.osm 114,599 119,666 6 1,336 Road map
rgg_n_2_20 1,048,576 6,891,620 36 864 Random geometric
smallworld 100,000 499,998 17 9 Small world