paths in a network. QPath extends the
basic color coding formulation by querying weighted networks for inexact
matches (see the accompanying sidebar “Finding a Hairpin in a Haystack”).
The algorithm takes minutes to run
with length- 7 queries and up to three
insertions (unaligned match vertices)
and three deletions (unaligned query
vertices). Efficient heuristics to color
the network can be used to reduce
its time even further.
9, 21 In a follow-up work,
9 the QPath approach was
extended to handle queries of bounded
treewidth and a heuristic solution was
offered for general queries.
Our last highlighted method uses
an ILP formulation to optimally solve
a different variant of the network
querying problem, where the topology
of the query is not known. This scenario
is very common when querying for
protein complexes, where the under-
lying interaction structure is rarely
available39 but the member proteins
are assumed to be connected. Hence,
instead of searching for a particular
interaction pattern, the goal is to find
a matching subgraph that is connected.
In Bruckner et al.,
7 an ILP solution to
this problem is given. The main chal-
lenge in formulating the problem as
an ILP is to express the connectivity of
the solution sought. The Torque algo-
rithm7 solves this problem by model-
ing the solution subgraph as a flow
network, where one of its nodes is
arbitrarily designated as a sink, capa-
ble of draining k − 1 units of flow, and
all the other nodes are set as sources,
each generating one unit of flow. A
set of constraints requires that the
total flow in the system is preserved.
The detailed program is given in the
accompanying sidebar “Querying via
an Integer Linear Program.”
Notably, there is also a parameter-
ized approach to this querying prob-
lem. The approach is based on the
observation that a connected sub-
graph can be represented by its span-
ning tree, so the querying problem
translates to that of finding a tree of
k distinct vertices. The latter problem
can be solved using the color coding
technique.
6, 7 Interestingly, for most
instances, the dynamic programming
approach is empirically slower than
running the ILP formulation through a
solver, as demonstrated in Figure 3(b).
the Power of Comparative
Network Analysis
The successful application of comparative network analysis approaches
depends not only on their computational attributes but also on their
biological relevance. Here, we give
examples for the applications of several of the reviewed approaches and the
biological insights they have enabled.
We demonstrate the power of comparative network analysis approaches by
comparing their performance with that
of methods that are either sequence-based and, thus, cannot exploit the
network information, or single-species-based and as a result are more prone to
noise in the network data.
The most intuitive use of comparative network analysis is to gather
support for computational predictions from multiple species. A prime
example for this use is the inference
of protein complexes or pathways
from PPI data. For instance, in Sharan
et al.,
31 a cross-species analysis is used
to identify yeast-bacterial conserved
complexes. By comparing the inferred
complexes to known complexes in
yeast, it is shown that the comparative
analysis increases the specificity of the
predictions (compared to a yeast-only
analysis) by a significant margin, albeit
at the price of reduced sensitivity.
In Sharan et al.,
32 the local alignment
of three networks (yeast, worm, and fly)
was used to identify their conserved
protein machineries (Figure 4a); those
were used in a systematic manner to
infer protein function (Figure 4b) and
interaction (Figure 4c), showing supe-
rior performance compared to that of
a sequence-based analysis. In brief,
NetworkBLAST was used to identify
high-scoring triplets of matching sub-
networks across the three networks.
Whenever the proteins in a conserved
subnetwork were enriched for a certain
function and at least half the proteins
in the subnetwork were annotated
with that function, the rest of the sub-
network’s proteins were predicted to
have that function as well. This predic-
tion strategy yielded 58%–63% accuracy
across the three species (in a cross-val-
idation test), versus 37%–53% accuracy
for a sequence-based method that pre-
dicts the function of a protein based
on its most sequence-similar protein
in the other species. The conserved
subnetworks were further used to pre-
dict novel PPIs in the following man-
ner: a pair of proteins were predicted
to interact if two sequence-similar pro-
teins were known to interact in another
species (directly or via a common net-
work neighbor) and, additionally, if the
four proteins co-occurred in one of the
conserved subnetworks. Remarkably,
this strategy yielded >99% specificity in
cross-validation. Experimental valida-
tion of 65 of these predictions gave a
success rate in the range of 40%–52%.
In comparison, sequence-based pre-
dictions that do not use the conserved
subnetwork information yielded suc-
cess rates in the range of 16%–31%.
18, 32
Conclusion
The explosion of molecular data in
the last two decades is revolutionizing biological and, consequently,
computational research. The arising computational problems require