Efficient Parallelization Using Rank Convergence in Dynamic Programming
By Saeed Maleki, Madanlal Musuvathi, and Todd Mytkowicz
This paper proposes an efficient parallel algorithm for an
important class of dynamic programming problems that
includes Viterbi, Needleman–Wunsch, Smith–Waterman,
and Longest Common Subsequence. In dynamic programming, the subproblems that do not depend on each other, and
thus can be computed in parallel, form stages, or wavefronts.
The algorithm presented in this paper provides additional
parallelism allowing multiple stages to be computed in parallel despite dependences among them. The correctness and
the performance of the algorithm rely on rank convergence
properties of matrix multiplication in the tropical semiring,
formed with plus as the multiplicative operation and max as
the additive operation.
This paper demonstrates the efficiency of the parallel algorithm by showing significant speedups on a variety of important dynamic programming problems. In particular, the
parallel Viterbi decoder is up to 24× faster (with 64 processors)
than a highly optimized commercial baseline.
1. INTRODUCTION
Dynamic programming2 is a method to solve a variety of important optimization problems in computer science, economics,
genomics, and finance. Figure 1 describes two such examples: Viterbi,23 which finds the most-likely path through a hidden-Markov model for a sequence of observations, and LCS,
which finds the longest common subsequence between two
input strings. Dynamic programming algorithms proceed
by recursively solving a series of sub-problems, usually represented as cells in a table as shown in the figure. The solution to
a subproblem is constructed from solutions to an appropriate
set of subproblems, as shown by the respective recurrence relation in the figure.
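As a concrete illustration of such a recurrence, the following is a minimal sketch of the LCS table computation, where c[i][j] = max(c[i-1][j-1] + d[i][j], c[i-1][j], c[i][j-1]) and d[i][j] is 1 when the i-th and j-th characters match and 0 otherwise (the function name and table layout here are illustrative, not from the paper):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b.

    Each cell c[i][j] is a subproblem; its solution is built from the
    three neighboring subproblems per the recurrence in Figure 1.
    """
    n, m = len(a), len(b)
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 1 if a[i - 1] == b[j - 1] else 0  # match score
            c[i][j] = max(c[i - 1][j - 1] + d, c[i - 1][j], c[i][j - 1])
    return c[n][m]
```

For example, `lcs_length("ABCBDAB", "BDCABA")` returns 4 (one longest common subsequence is "BCBA").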
These data-dependences naturally group subproblems
into stages whose solutions do not depend on each other. For
example, all subproblems in a column form a stage in Viterbi
and all subproblems in an anti-diagonal form a stage in LCS.
A predominant method for parallelizing dynamic programming is wavefront parallelization,15 which computes all subproblems within a stage in parallel.
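To make the stage structure concrete, the sketch below (an illustration, not the paper's code) groups the cells of an n-by-m LCS table by anti-diagonal: cells on one anti-diagonal depend only on earlier anti-diagonals, so each group can be computed in parallel once its predecessors are done.

```python
def wavefront_stages(n, m):
    """Group 1-based cell indices (i, j) of an n-by-m table into stages.

    Cells with the same anti-diagonal index i + j form one stage; they
    have no data-dependences on each other, only on earlier stages.
    """
    stages = []
    for s in range(2, n + m + 1):  # anti-diagonal index i + j
        stages.append([(i, s - i)
                       for i in range(max(1, s - m), min(n, s - 1) + 1)])
    return stages
```

For a 2-by-2 table this yields three stages: `[(1, 1)]`, then `[(1, 2), (2, 1)]`, then `[(2, 2)]`; wavefront parallelization processes the stages sequentially but the cells within each stage concurrently.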
In contrast, this paper breaks data-dependences across
stages and fixes up incorrect values later in the algorithm.
Therefore, this approach exposes parallelism for a class of
dynamic programming algorithms we call linear-tropical
dynamic programming (LTDP). An LTDP computation can be viewed as performing a sequence of matrix multiplications in the tropical semiring, where the semiring is formed with addition as the multiplicative operator and max as the additive operator. This paper demonstrates that several important optimization problems such as Viterbi, LCS, Smith–Waterman, and Needleman–Wunsch (the latter two are used in bioinformatics for sequence alignment) belong to LTDP. To efficiently break data-dependences across stages, the algorithm uses rank convergence, a property by which the rank of a sequence of matrix products in the tropical semiring is likely to converge to 1.
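The following sketch illustrates these two notions (the matrix values are illustrative, and the helper names are not from the paper): matrix multiplication in the (max, +) semiring, and a rank-1 check, where a tropical matrix has rank 1 when its columns differ only by constant offsets.

```python
def trop_matmul(A, B):
    """Matrix product in the tropical semiring:
    C[i][j] = max_k (A[i][k] + B[k][j]),
    i.e., + plays the role of multiplication and max of addition."""
    n, k, m = len(A), len(B), len(B[0])
    return [[max(A[i][t] + B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def is_rank_one(M):
    """True when every column of M is the first column shifted by a
    constant -- the tropical analogue of rank 1."""
    offsets = [[M[i][j] - M[i][0] for i in range(len(M))]
               for j in range(len(M[0]))]
    return all(max(col) == min(col) for col in offsets)
```

For instance, with `A = [[2, 0], [1, 3]]`, the matrix `A` itself is not rank 1, but the repeated tropical product `A · A · A · A · A` is: its columns differ by a constant offset, which is the convergence behavior the parallel algorithm exploits.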
A key advantage of our parallel algorithm is its ability to
simultaneously use both the coarse-grained parallelism
across stages and the fine-grained wavefront parallelism
within a stagea. Moreover, the algorithm can reuse existing
highly optimized implementations that exploit wavefront parallelism with little modification. As a consequence, our implementation achieves multiplicative speedups over existing
implementations. For instance, the parallel Viterbi decoder is
up to 24× faster with 64 cores than a state-of-the-art commercial baseline.18 This paper demonstrates similar speedups for
other LTDP instances.
The original version of this paper is entitled “Parallelizing
Dynamic Programming through Rank Convergence” and
was published in ACM SIGPLAN Notices – PPoPP '14,
August 2014, ACM.
Figure 1. Dynamic programming examples with dependences between stages: (a) the Viterbi algorithm, with recurrence p_{i,j} = max_k (p_{i-1,k} · t_{k,j}), and (b) the LCS algorithm, with recurrence c_{i,j} = max(c_{i-1,j-1} + d_{i,j}, c_{i-1,j}, c_{i,j-1}).
a The definition of wavefront parallelism used here is more general and includes the common usage where a wavefront performs computations across logical iterations, as in the LCS example in Figure 1b.