2.4. Problem 4: efficient training
Last, but not least, we need a training algorithm that finds the optimal w solving the quadratic program in (6). Since there is a constraint for every incorrect label ȳ, we cannot enumerate all constraints and simply give the optimization problem to a standard QP solver. Instead, we propose to use the cutting-plane Algorithm 1 (or a similar algorithm for slack-rescaling). The key idea is to iteratively construct a working set of constraints W that is equivalent to the full set of constraints in (6) up to a specified precision ε. Starting with an empty W and w = 0, Algorithm 1 iterates through the training examples. For each example, the argmax in Line 5 finds the most violated constraint of the quadratic program in (6). If this constraint is violated by more than ε (Line 6), it is added to the working set W in Line 7, and a new w is computed by solving the quadratic program over the new W (Line 8). The algorithm stops and returns the current w if W did not change between iterations.
It is obvious that for any desired ε, the algorithm only terminates when it has found an ε-accurate solution, since it verifies in Line 5 that none of the constraints of the quadratic program in (6) is violated by more than ε. But how long does it take to terminate? It can be shown²⁷ that Algorithm 1 always terminates in a polynomial number of iterations that is independent of the cardinality of the output space Y. In fact, a refined version of Algorithm 1¹³,¹⁴ always terminates after adding at most O(Cε⁻¹) constraints to W (typically |W| << 1000). Note that the number of constraints is not only independent of |Y|, but also independent of the number of training examples n, which makes it an attractive training algorithm even for conventional SVMs.¹³
While the number of iterations is small, the argmax in
Line 5 might be expensive to compute. In general, this is
true, but note that this argmax is closely related to the argmax for computing a prediction h(x). It is therefore called
the “loss-augmented” inference problem, and often the
prediction algorithm can be adapted to efficiently solve the
loss-augmented inference problem as well.
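As a concrete illustration, consider sequence labeling under a per-position (Hamming) loss, where the loss decomposes over positions. The Viterbi decoder used for prediction can then solve the loss-augmented argmax as well, simply by adding the per-position loss to the node scores before running the same recursion. The following Python sketch is our own illustration and not part of the paper; the names viterbi, node_scores, and trans_scores are hypothetical.

import numpy as np

def viterbi(node_scores, trans_scores, y_true=None):
    # node_scores[t, k]: score w . psi of label k at position t (T x K)
    # trans_scores[j, k]: score of moving from label j to label k (K x K)
    # If y_true is given, per-position Hamming loss is folded into the node
    # scores, so the same recursion returns the loss-augmented argmax
    #   argmax_y { Delta(y_true, y) + w . Psi(x, y) }.
    T, K = node_scores.shape
    scores = node_scores.copy()
    if y_true is not None:
        scores += 1.0                          # +1 loss for every label ...
        scores[np.arange(T), y_true] -= 1.0    # ... except the correct one
    best = np.empty((T, K))
    back = np.zeros((T, K), dtype=int)
    best[0] = scores[0]
    for t in range(1, T):
        cand = best[t - 1][:, None] + trans_scores   # cand[j, k]: come from j
        back[t] = cand.argmax(axis=0)
        best[t] = cand.max(axis=0) + scores[t]
    y = np.zeros(T, dtype=int)
    y[-1] = best[-1].argmax()
    for t in range(T - 1, 0, -1):
        y[t - 1] = back[t, y[t]]
    return y

Prediction corresponds to calling viterbi without y_true; the loss-augmented problem only changes the node scores, so no new inference machinery is needed.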
Note that Algorithm 1 is rather generic and refers to the output structure only through a narrow interface. In particular, to apply the algorithm to a new structured prediction problem, one merely needs to supply implementations of Ψ(x, y), Δ(y, ȳ), and argmax_{ȳ∈Y} {Δ(yᵢ, ȳ) + w · Ψ(xᵢ, ȳ)}. This makes the algorithm general and easily transferable to new applications. Indeed, all applications discussed in the following used the general SVMstruct implementation of Algorithm 1 and supplied these three functions via an API.
2.5. Related approaches
The structural SVM method is closely related to other approaches, namely Conditional Random Fields¹⁶ and Maximum Margin Markov Networks (M³N).²⁵ The crucial link to probabilistic methods is to interpret the joint feature map as sufficient statistics in a conditional exponential family model. This means we define a conditional probability of outputs given inputs by

   P(y|x) = (1/Z(x)) exp[w · Ψ(x, y)]   (7)

where Z(x) is a normalization constant. Basically, Equation (7) is a generalization of the well-known logistic regression, where Ψ(x, y) = yΦ(x) for y ∈ {−1, 1}. The compatibility function used in structural SVMs directly governs this conditional probability. The probabilistic semantics lends itself to statistical inference techniques like penalized (conditional) maximum likelihood, where one maximizes Σ_{i=1}^n log P(yᵢ|xᵢ) − λ||w||². Unlike in Algorithm 1, however, here one usually must compute the expected sufficient statistics Σ_y P(y|x) Ψ(x, y), for instance, in order to compute a gradient direction on the conditional log-likelihood. There are additional algorithmic challenges involved with learning CRFs (cf. Sha and Pereira²²), and one of the key advantages of structural SVMs is that they allow an efficient dual formulation. This enables the use of kernels instead of explicit feature representations, as well as a sparse expansion of the solution (cf. Hofmann et al.¹⁰). A simpler, but usually less accurate, learning algorithm that can be generalized to the structured output setting is the perceptron algorithm, as first suggested by Collins.⁵
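To make the expected sufficient statistics concrete, the following brute-force Python sketch (our own illustration; conditional_distribution and psi_all are hypothetical names, and enumerating Y is only feasible for tiny output spaces) computes P(y|x) as in Equation (7) and the expectation Σ_y P(y|x) Ψ(x, y) that appears in the CRF gradient.

import numpy as np

def conditional_distribution(w, psi_all):
    # psi_all: |Y| x d matrix whose rows are Psi(x, y) for every y in Y.
    scores = psi_all @ w            # w . Psi(x, y) for all y
    scores -= scores.max()          # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()              # normalization by Z(x), Equation (7)

def expected_sufficient_statistics(w, psi_all):
    # E_{P(y|x)}[Psi(x, y)] = sum_y P(y|x) Psi(x, y)
    return conditional_distribution(w, psi_all) @ psi_all

The gradient of log P(yᵢ|xᵢ) with respect to w is then Ψ(xᵢ, yᵢ) minus this expectation; real CRF implementations compute it by dynamic programming (e.g., forward-backward) rather than by enumeration.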
The M³N approach also takes the probabilistic view as its starting point, but instead of maximizing a likelihood-based criterion, it aims at solving Equation (6). More specifically, M³Ns exploit the clique-based representation of P(y|x) over a dependency graph to efficiently reparametrize the dual problem of (6), effectively replacing the margin constraints over pairs of outputs by constraints involving functions of clique configurations. This is an interesting alternative to our cutting-plane algorithm, but it has a somewhat more limited range of applicability.
Algorithm 1 for training structural SVMs (margin-rescaling).
1: Input: S = ((x₁, y₁), ..., (xₙ, yₙ)), C, ε
2: W ← ∅, w ← 0, ξᵢ ← 0 for all i = 1, ..., n
3: repeat
4:   for i = 1, ..., n do
5:     ŷ ← argmax_{ŷ∈Y} {Δ(yᵢ, ŷ) + w · Ψ(xᵢ, ŷ)}
6:     if w · [Ψ(xᵢ, yᵢ) − Ψ(xᵢ, ŷ)] < Δ(yᵢ, ŷ) − ξᵢ − ε then
7:       W ← W ∪ {w · [Ψ(xᵢ, yᵢ) − Ψ(xᵢ, ŷ)] ≥ Δ(yᵢ, ŷ) − ξᵢ}
8:       (w, ξ) ← argmin_{w, ξ≥0} ½ w·w + (C/n) Σ_{i=1}^n ξᵢ  s.t. W
9:     end if
10:   end for
11: until W has not changed during iteration
12: return (w, ξ)
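To connect the pseudocode to the interface described above, here is a minimal Python sketch of Algorithm 1. It is our own illustration, not the SVMstruct implementation: the function names are assumptions, and the working-set QP in Line 8 is solved with SciPy's generic SLSQP routine rather than a specialized QP solver.

import numpy as np
from scipy.optimize import minimize

def train_structural_svm(X, Y, psi, delta, loss_augmented_argmax,
                         C=1.0, eps=1e-3):
    # psi(x, y): joint feature vector; delta(y, ybar): loss;
    # loss_augmented_argmax(w, x, y): most violated constraint (Line 5).
    n = len(X)
    d = len(psi(X[0], Y[0]))
    w, xi = np.zeros(d), np.zeros(n)
    W = []                                          # working set: (i, dpsi, loss)
    changed = True
    while changed:                                  # Line 3: repeat
        changed = False
        for i, (x, y) in enumerate(zip(X, Y)):      # Line 4
            ybar = loss_augmented_argmax(w, x, y)   # Line 5
            dpsi = psi(x, y) - psi(x, ybar)
            if w @ dpsi < delta(y, ybar) - xi[i] - eps:     # Line 6
                W.append((i, dpsi, delta(y, ybar)))         # Line 7
                w, xi = solve_working_set_qp(W, n, d, C)    # Line 8
                changed = True
    return w                                        # Line 12 (xi omitted)

def solve_working_set_qp(W, n, d, C):
    # argmin_{w, xi >= 0} 0.5 w.w + (C/n) sum_i xi_i  subject to W
    def objective(z):
        w, xi = z[:d], z[d:]
        return 0.5 * w @ w + (C / n) * xi.sum()
    constraints = [{'type': 'ineq',
                    'fun': lambda z, i=i, dpsi=dpsi, loss=loss:
                           z[:d] @ dpsi - loss + z[d + i]}
                   for i, dpsi, loss in W]
    bounds = [(None, None)] * d + [(0, None)] * n
    res = minimize(objective, np.zeros(d + n), method='SLSQP',
                   bounds=bounds, constraints=constraints)
    return res.x[:d], res.x[d:]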
3. Applications
As outlined above, one needs to design and supply the following four functions when applying a structural SVM to a new problem:

   Ψ(x, y)   (8)
   Δ(y, ȳ)   (9)
   argmax_{ȳ∈Y} {w · Ψ(xᵢ, ȳ)}   (10)
   argmax_{ȳ∈Y} {Δ(yᵢ, ȳ) + w · Ψ(xᵢ, ȳ)}   (11)
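As a toy illustration (our own, not from the paper), these four functions might look as follows for simple K-class classification with 0/1 loss, where Ψ stacks the input vector into the block belonging to the chosen class; the names and constants are hypothetical.

import numpy as np

K, D = 3, 5                           # number of classes, input dimension (illustrative)

def psi(x, y):                        # (8): place x in the block of class y
    out = np.zeros(K * D)
    out[y * D:(y + 1) * D] = x
    return out

def delta(y, ybar):                   # (9): 0/1 loss
    return 0.0 if y == ybar else 1.0

def predict(w, x):                    # (10): highest-scoring class
    return max(range(K), key=lambda ybar: w @ psi(x, ybar))

def loss_augmented_argmax(w, x, y):   # (11): most violated constraint
    return max(range(K), key=lambda ybar: delta(y, ybar) + w @ psi(x, ybar))

Structured outputs differ only in that Y is too large to enumerate, so (10) and (11) must be computed by problem-specific algorithms such as the Viterbi sketch shown earlier.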
The following three sections illustrate how this can be
done for learning retrieval functions that provide diversified results, for aligning amino-acid sequences into