information with a weighted average of
the similarity scores of their neighbors.
Thus,
where a is a parameter balancing
the network-based and sequence-based
terms. R can be found efficiently using
the power method algorithm, which
iteratively updates R according to the
above equation and converges to the
analytical solution R = (I − aA)− 1( 1 − a)B.
variants of the network querying prob-
lem are amenable to fixed parameter
approaches, as the query subnetworks
can be often assumed to have a small,
bounded size. The other methodology
we demonstrate is based on reformu-
lating the problem at hand as an inte-
ger linear program and applying an
industrial solver such as CPLEX14 to
optimize it. While arriving at a solution
in a timely fashion is not guaranteed
(as integer programming is NP-hard11),
in practice, on current networks, many
of these formulations are solved in
reasonable time.
In the biological domain, where one wishes to query known protein machineries in the
network of another species, oftentimes the topology of the query is not known (only the
identity of the member proteins is known). one possible way to tackle this scenario,
applied by Torque,
7 is to assume that the query and, hence, the sought matches are
connected. Torque is based on an IlP, which expresses the connectivity requirement
by simulating a flow network, where an arbitrary node serves as a sink draining the
flow generated by all the other nodes that participate in the solution. In detail, the
IlP formulation uses the following variables: (i) a binary variable cv for each note v,
denoting whether it participates in the solution; (ii) a binary variable euv for each edge
(u, v), denoting whether it participates in the solution subnetwork; (iii) a pair of rational
variables fuv, fvu for each edge (u, v), representing both the magnitude and direction of
the flow going through it; (iv) a binary variable rv that marks the sink node; and (v) a
binary variable gvq for every pair of sequence-similar network and query nodes (v, q),
denoting whether q is matched with v.
The following set of constraints is immediately derived from the model and expresses
the requirements that (i) the solution should span k nodes; (ii) only one node may serve as
a sink; and (iii) an edge is part of the solution only when its two endpoints are:
Querying via an Integer
Linear Program
let Q denote the set of query proteins, |Q| = k, and let Φ(v) ⊆ Q denote the (possibly
empty) subset of query proteins that are sequence similar to v. To obtain an adequate
match between the solution and query proteins, the following constraints are added
to ensure that (i) a network protein may match at most one query protein; (ii) all query
proteins are matched with exactly one network protein (assuming, for simplicity, that
there are no insertions or deletions); and (iii) only nodes that are part of the solution
may be matched.
exact Approaches
In contrast to the heuristic methods
highlighted here, which do not provide any guarantee of the quality of the
obtained solution, exact approaches
guarantee optimality at the cost of
speed. Two general methodologies
for efficient, yet exact, solutions have
been common in network analysis:
fixed parameter and ILP formulations.
Fixed parameter tractable problems
can be solved in time that is, typically,
exponential in some carefully chosen parameter of the problem and
polynomial in the size of the input
networks. As we will describe, many
To maintain a legal flow, one also needs to ensure that (i) the pair of flow variables
associated with a given edge agrees on its direction and magnitude; (ii) the flow may
only pass through edges that participate in the solution; and (iii) source nodes generate
flow that is drained by the sink. These conditions are formulated by the following
constraints:
Together, the above constraints restrict the solutions to take the form of a
connected subnetwork spanning exactly k nodes that are sequence similar to their
respective matches in the query. finally, denoting the weight of edge (u, v) by w (u, v),
the objective is to maximize the weight of the solution subnetwork: