our own novel techniques. The most important distinction between our approach and prior work is that we use explicit queues for graph traversal. Since levels of the graph are processed in parallel, we use two queues to distinguish vertices that are in the current level of the search (Qcurr) from vertices that are to be processed during the next level of the search (Qnext). For the dependency accumulation stage we initialize S and its length. In this case we need to keep track of vertices at all levels of the search, and hence we use only one data structure to store these vertices. To distinguish the sections of S that correspond to each level of the search we use the ends array, where endslen = 1 + max_{v∈V}{d[v]} at the end of the traversal. Vertices corresponding to depth i of the traversal are located from index ends[i] to index ends[i + 1] − 1 (inclusive) of S. This usage of the ends and S arrays is analogous to the arrays used to store the graph in CSR format.

A work-efficient shortest path calculation stage is shown in Algorithm 1. The queue Qcurr is initialized to contain only the source vertex. Iterations of the while loop correspond to the traversal of successive depths of the graph. The parallel for loop on Line 3 assigns one thread to each element in the queue, so that edges from other portions of the graph are not unnecessarily traversed. The atomic Compare and Swap (CAS) operation on Line 5 prevents multiple insertions of the same vertex into Qnext. This restriction allows us to safely allocate O(n) memory for Qnext instead of the O(m) that would be required if duplicate queue entries were allowed. Since we require only one thread for each element in Qcurr, rather than one thread for every vertex or edge in the graph, this atomic operation experiences limited contention and thus does not significantly reduce performance.

The conditional on Line 11 checks whether the queue containing vertices for the next depth of the search is empty; if so, the search is complete and we break from the outermost while loop. Otherwise, we transfer vertices from Qnext to Qcurr, add these vertices to the end of S for the dependency accumulation, and do the appropriate bookkeeping to set the lengths of these arrays.

Algorithm 2 shows a work-efficient dependency accumulation. We are able to eliminate the use of atomics by checking the successors rather than the predecessors of each vertex. Rather than having multiple vertices that are processed in parallel atomically update the dependency of their common ancestor, the ancestor can update itself based on its successors without the need for atomic operations.14

Note that the parallel for loop on Line 3 of Algorithm 2 assigns threads only to vertices that need to accumulate their dependency values; this is where the bookkeeping done to keep track of separate levels of the graph traversal in the ends array comes to fruition. Rather than naïvely assigning a thread to each vertex or edge and checking whether that vertex or edge belongs to the current depth, we can instantly extract the vertices of that depth, since they form a consecutive block of entries within S. This strategy again avoids the unnecessary branch overhead and global memory accesses incurred by previous implementations. For further implementation details we refer the reader to the associated conference paper.16
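To make the control flow of the shortest path stage concrete, the following is a minimal sequential Python sketch of the parallel Stage 1. The function name, the adjacency-dict input, and the use of growable Python lists in place of preallocated GPU arrays are illustrative choices of ours, not the paper's implementation; in serial code the atomicCAS and atomicAdd operations degenerate to plain updates.

```python
from math import inf

def shortest_path_stage(adj, n, s):
    """Sequential sketch of Stage 1 (Algorithm 1): level-synchronous BFS.

    adj maps each vertex to its neighbor list (CSR details are elided).
    Returns d (depths), sigma (shortest-path counts), S (vertices grouped
    by depth in visitation order), and the ends offsets into S.
    """
    d = [inf] * n
    sigma = [0] * n
    d[s], sigma[s] = 0, 1
    Qcurr = [s]
    S = [s]
    ends = [0, 1]              # depth i occupies S[ends[i] : ends[i+1]]
    while True:
        Qnext = []
        for v in Qcurr:        # 'do in parallel' on the GPU
            for w in adj[v]:
                if d[w] == inf:           # plays the role of atomicCAS
                    d[w] = d[v] + 1
                    Qnext.append(w)       # w enqueued exactly once
                if d[w] == d[v] + 1:      # plays the role of atomicAdd
                    sigma[w] += sigma[v]
        if not Qnext:          # next frontier empty: search complete
            break
        Qcurr = Qnext
        S.extend(Qnext)        # bookkeeping for dependency accumulation
        ends.append(ends[-1] + len(Qnext))
    return d, sigma, S, ends
```

For example, on the four-vertex diamond graph {0–1, 0–2, 1–3, 2–3} with source 0, this sketch yields d = [0, 1, 1, 2], sigma = [1, 1, 1, 2], and ends = [0, 1, 3, 4], so each BFS level is a contiguous block of S, exactly as the ends array is meant to record.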
4.2. Rationale for hybrid methods
The major drawback of the approach outlined in the previous section is the potential for significant load imbalance between threads. Although our approach efficiently assigns
1  Stage 1: Shortest Path Calculation
2  while true do
3    for v ∈ Qcurr do in parallel
4      for w ∈ neighbors(v) do
5        if atomicCAS(d[w], ∞, d[v] + 1) = ∞ then
6          t ← atomicAdd(Qnext_len, 1)
7          Qnext[t] ← w
8        if d[w] = d[v] + 1 then
9          atomicAdd(σ[w], σ[v])
10   barrier()
11   if Qnext_len = 0 then
12     depth ← d[S[Slen − 1]]
13     break
14   else
15     for tid ← 0 … Qnext_len − 1 do in parallel
16       Qcurr[tid] ← Qnext[tid]
17       S[tid + Slen] ← Qnext[tid]
18     barrier()
19     ends[endslen] ← ends[endslen − 1] + Qnext_len
20     endslen ← endslen + 1
21     Qcurr_len ← Qnext_len
22     Slen ← Slen + Qnext_len
23     Qnext_len ← 0
24     barrier()
Algorithm 1: Work-efficient betweenness centrality shortest path calculation.

1  Stage 2: Dependency Accumulation
2  while depth > 0 do
3    for tid ← ends[depth] … ends[depth + 1] − 1 do in parallel
4      w ← S[tid]
5      dsw ← 0
6      sw ← σ[w]
7      for v ∈ neighbors(w) do
8        if d[v] = d[w] + 1 then
9          dsw ← dsw + (sw/σ[v])(1 + δ[v])
10     δ[w] ← dsw
11   barrier()
12   depth ← depth − 1

Algorithm 2: Work-efficient betweenness centrality dependency accumulation.
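The dependency accumulation can likewise be sketched sequentially in Python (names and the adjacency-dict input are again illustrative choices of ours). The per-level parallel loop becomes an ordinary loop over the contiguous block S[ends[depth] : ends[depth + 1]], and because each vertex w writes only its own δ[w] from its successors' already-final values, no atomic operations appear.

```python
def dependency_stage(adj, d, sigma, S, ends):
    """Sequential sketch of Stage 2 (Algorithm 2): successor-based
    dependency accumulation, from the deepest level up to the source."""
    delta = [0.0] * len(d)
    depth = len(ends) - 2                  # index of the deepest level
    while depth > 0:
        # Vertices of this depth are a consecutive block of S, so no
        # per-vertex depth check is needed ('do in parallel' on the GPU).
        for tid in range(ends[depth], ends[depth + 1]):
            w = S[tid]
            dsw = 0.0
            sw = sigma[w]
            for v in adj[w]:
                if d[v] == d[w] + 1:       # v is a successor of w
                    dsw += (sw / sigma[v]) * (1.0 + delta[v])
            delta[w] = dsw                 # w updates itself: no atomics
        depth -= 1
    return delta
```

On the diamond graph {0–1, 0–2, 1–3, 2–3} with the Stage 1 outputs d = [0, 1, 1, 2], sigma = [1, 1, 1, 2], S = [0, 1, 2, 3], and ends = [0, 1, 3, 4], this returns δ = [0.0, 0.5, 0.5, 0.0]: vertices 1 and 2 each lie on half of the shortest paths from the source to vertex 3.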