AUGUST 2018 | VOL. 61 | NO. 8 | COMMUNICATIONS OF THE ACM 87
particular iteration (but will be unnecessarily inspected during every iteration). Finally, the bottom portion of Figure 2
shows a work-efficient traversal iteration where each vertex
in the frontier is assigned a thread. In this case, only useful work is conducted, although a load imbalance may exist
among threads.
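As a point of reference, the work-efficient, frontier-based traversal described above can be sketched serially (a minimal Python sketch; the graph, vertex labels, and function name are illustrative and not taken from the original implementation):

```python
def bfs_work_efficient(adj, source):
    """Level-synchronous BFS that inspects only the outgoing edges of
    vertices in the current frontier. On the GPU, each vertex in the
    frontier would be assigned to a thread; here the same units of
    work are processed serially."""
    dist = {v: -1 for v in adj}
    dist[source] = 0
    frontier = [source]
    while frontier:
        next_frontier = []
        for v in frontier:          # one "thread" per frontier vertex
            for w in adj[v]:        # only useful edges are inspected
                if dist[w] == -1:
                    dist[w] = dist[v] + 1
                    next_frontier.append(w)
        frontier = next_frontier
    return dist
```

Since each vertex enters the frontier at most once, the total work across all iterations is O(n + m); the residual load imbalance noted above corresponds to frontier vertices having different out-degrees.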
3.2. GPU-FAN
The GPU-FAN package from Shi and Zhang was designed for
the analysis of biological networks representing protein
communications or genetic interactions.22 Similar to the
implementation from Jia et al., GPU-FAN uses the edge-parallel
method for load balancing across threads. The GPU-FAN
package, however, focuses only on fine-grained parallelism,
using all threads from all thread blocks to traverse edges
in parallel for one source vertex of the BC computation at a
time. In contrast, the implementation from Jia et al. uses the
threads within a block to traverse edges in parallel, while separate thread blocks each handle independent roots of
the BC computation.
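The difference between the two schedules can be illustrated with a small sketch (hypothetical Python; the function names, worker counts, and source/edge indices are chosen for illustration only):

```python
def gpu_fan_style(num_sources, num_edges, num_workers):
    """GPU-FAN: sources are processed one at a time, and ALL workers
    (threads from all blocks) cooperatively split that source's edges."""
    schedule = []
    for s in range(num_sources):
        schedule.append({w: list(range(w, num_edges, num_workers))
                         for w in range(num_workers)})
    return schedule

def block_per_source_style(num_sources, num_edges, num_blocks, threads_per_block):
    """Jia et al.: each thread block takes its own source (coarse-grained
    parallelism), and the threads within that block split the source's
    edges (fine-grained parallelism)."""
    schedule = {}
    for b in range(num_blocks):
        for s in range(b, num_sources, num_blocks):  # blocks round-robin over sources
            schedule[(b, s)] = {t: list(range(t, num_edges, threads_per_block))
                                for t in range(threads_per_block)}
    return schedule
```

In the first schedule all parallelism is spent on one source at a time; in the second, sources proceed concurrently while each enjoys less fine-grained parallelism per block.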
4. METHODOLOGY
4.1. Work-efficient approach
Taking note of the issues mentioned in the previous section,
we now present the basis for our work-efficient implemen-
tation of betweenness centrality on the GPU. Our approach
leverages optimizations from the literature in addition to

separate kernel, or function that executes on the device. GPU
threads have access to many registers (typically up to 255 per thread),
a small amount (typically 48KB) of programmer managed
shared memory unique to each SM, and a larger global mem-
ory that can be accessed by all SMs.
3. PRIOR GPU IMPLEMENTATIONS
Two well-known GPU implementations of Brandes’s algorithm have been published within the last few years. Jia et
al.10 compare two types of fine-grained parallelism, showing
that one is preferable over the other because it exhibits better memory bandwidth on the GPU. Shi and Zhang present
GPU-FAN22 and report a slight speedup over Jia et al. by using
a different distribution of threads to units of work. Both
methods focus their optimizations on scale-free networks.
3.1. Vertex and edge parallelism
Jia et al. discussed two distributions of threads to graph entities: vertex-parallel and edge-parallel.10 The vertex-parallel
approach assigns a thread to each vertex of the graph and
that thread traverses all of the outgoing edges from that
vertex. In contrast, the edge-parallel approach assigns a
thread to each edge of the graph and that thread traverses
that edge only. In practice, the number of vertices and edges
in a graph tends to exceed the number of available
threads, so each thread sequentially processes multiple vertices or edges.
For both the shortest path calculation and the dependency accumulation stages, the number of edges traversed
per thread by the vertex-parallel approach depends on the
out-degree of the vertex assigned to each thread. The difference in out-degrees between vertices causes a load imbalance between threads. For scale-free networks this load
imbalance can be a tremendous issue, since the distribution
of out-degrees follows a power law in which a small number of
vertices have a substantial number of edges to traverse.2
The edge-parallel approach solves this problem by assigning
edges to threads directly. Both the vertex-parallel and edge-parallel approaches from Jia et al. use an inefficient O(n² + m)
graph traversal that checks if each vertex being processed
belongs to the current depth of the search.
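That traversal strategy can be sketched as follows (a serial Python sketch of the check-every-vertex pattern, not the authors' code):

```python
def bfs_check_all_vertices(adj, source):
    """Level-synchronous BFS in the style criticized above: every
    iteration assigns a thread to each vertex and checks whether it
    belongs to the current depth of the search. Over d iterations this
    costs O(n*d) membership checks on top of the O(m) useful edge
    work, i.e., O(n^2 + m) in the worst case."""
    n = len(adj)
    dist = [-1] * n
    dist[source] = 0
    depth, updated = 0, True
    while updated:
        updated = False
        for v in range(n):            # every vertex inspected every iteration
            if dist[v] == depth:      # most checks fail: wasted work
                for w in adj[v]:
                    if dist[w] == -1:
                        dist[w] = depth + 1
                        updated = True
        depth += 1
    return dist
```

Contrast this with a frontier-based traversal, which touches only the vertices discovered in the previous iteration and therefore does O(n + m) total work.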
Figure 2 illustrates the distribution of threads to work
for the vertex-parallel and edge-parallel methods. Using the
same graph as shown in Figure 1, consider a Breadth-First
Search starting at vertex 4. During the second iteration of the
search, vertices 1, 3, 5, and 6 are in the vertex frontier, and
hence their edges need to be inspected. The vertex-parallel
method, shown in the top portion of Figure 2, distributes
one thread to each vertex of the graph even though the edges
connecting most of the vertices in the graph do not need to
be traversed, resulting in wasted work. Also note that each
thread is responsible for traversing a different number of
edges (denoted by the small squares beneath each vertex),
leading to workload imbalances. The edge-parallel method,
shown in the middle portion of Figure 2, does not have the
issue of load imbalance because each thread has one edge
to traverse. However, this assignment of threads also results
in wasted work because the edges that do not originate from
vertices in the frontier do not need to be inspected in this
[Figure 2 appears here. Legend: vertices belonging/not belonging to the current frontier; outgoing edges that do/do not need to be inspected.]
Figure 2. Illustration of the distribution of threads to units of work.
Top: Vertex-parallel. Middle: Edge-parallel. Bottom: Work-efficient.