Cells react to stimulation by propagating signals from sensory proteins to a set of
target proteins. Given a signaling pathway of a well-studied species, it is interesting
to query it within networks of less well-studied species. QPath is an exact path query
method that is based on color coding. It extends the basic color coding formulation
by allowing one to query a weighted network for inexact matches. That is, each edge
of the network has a confidence associated with it, and the goal is to find high-scoring subnetworks that are similar to the query while allowing some flexibility in the
matches (see figure 5b). Specifically, there could be up to Nins insertions of vertices
to the match that are not aligned against the query’s proteins, and up to Ndel deletions of vertices from the query that are not aligned against vertices in the matching
subnetwork. The algorithm aims at finding an inexact match that optimizes a scoring
function that combines: (i) sequence similarity—every pair of matched proteins (q, v)
contributes a sequence similarity term s(q, v); (ii) interaction confidence—every edge
( u, v) on the matched path contributes a weight term w( u, v); (iii) insertion penalty—
cins per insertion; and (iv) deletion penalty—cdel per deletion. The relative importance
of each of these terms is learned automatically by QPath; for clarity, we assume that
all terms are equally important in the description below.
for a given coloring of the network by k + Nins colors, QPath employs a dynamic programming formulation to find the highest scoring match with up to Nins insertions and
Ndel deletions. Denote the color of a vertex v by c(v). Denote by W( i, v, S, qdel) the score of an
optimal alignment of the first i nodes of the query that ends at a vertex v in the network,
induces qdel deletions, and visits nodes of each color in S. Then,
Finding a Hairpin
in a Haystack
The best scoring path is obtained using a standard dynamic programming back-
tracking starting at
.
sponds to a pair of potentially matching
subnetworks. NetworkBLAST scores
such a subnetwork by the density of its
corresponding intra-species subnetworks versus the chance that they arise
at random, assuming a random model
that preserves the node degrees.
After constructing the alignment
graph, the algorithm proceeds to iden-
tify high-scoring subnetworks. This is
done by starting with a seed of at most
four nodes, and applying a local search
to expand it. Each node serves as the
center of a seed, along with at most
three of its neighbors. The search it-
eratively adds or removes a node that
contributes most to the score, as long as
the score increases (and up to an upper
bound of 15 nodes). The effectiveness
of this search strategy can be quantified
by comparing to an exhaustive search
when such is possible. Figure 3(a) pres-
ents such a comparison when analyzing
a single (yeast) network from Yu et al.,
39
searching the best cluster containing
each of the network’s proteins. It can be
seen that the greedy heuristic produces
near-optimal clusters (up to 20% devia-
tion in score) in about 75% of the cases,
with an average of merely 13% deviation
from the optimal score. Notably, Net-
workBLAST requires only a few minutes
to run, while the exhaustive (ILP-based)
approach took several hours while lim-
iting the solver to five minutes per seed.
For seven out of a total of 326 seeds, the
solver could not find an optimal solu-
tion within the allotted time.