figure 1. examples of structured output prediction tasks: predicting trees in natural language parsing (left), predicting the structure of
proteins (middle), and predicting an equivalence relation over noun phrases (right).
The dog chased the cat.
S
NP
VP
N Det NP V
Det N
APPGEAYLQPGEAYLQV
[Obama]running
in the [presidental
election] has
mobilized [many
young voters].
[His][position] on
[climate change]
was well received
by [this group].
Obama
presidential election
many young voters
His
position
climate change
this group
figure 2. structured output prediction as a multiclass problem.
S
NP
N Det
Class 1
The dog chased the cat.
VP
NP V
Det N
S
NP
N Det
Class 2
VP
PP V
IN N
S
Class 3
NP
VP
NP
N Det
V
NP
N
CC
…
S
Class k
N
S
CC
VP NP
VN
S
VP NP
V
The first problem is related to finding a compact representation for large output spaces. If we allowed even just
one parameter for each class, we would already have more
parameters than we could ever hope to have enough training
data for. Second, just making a single prediction on a new
example is a computationally challenging problem, since
sequentially enumerating all possible classes may be infeasible. Third, we need a more refined notion of what constitutes a prediction error. Clearly, predicting a parse tree that
is almost correct should be treated differently from predicting a tree that is completely wrong. And, last but not least,
we need efficient training algorithms that have a run-time
complexity sublinear in the number of classes.
In the following, we will tackle these problems one by
one, starting with the formulation of the structural SVM
method.
2. 1. Problem 1: structural sVm formulation
As mentioned above, we start the derivation of the structural
SVM from the multiclass SVM.
6 These multiclass SVMs use
one weight vector wy for each class y. Each input x now has
a score for each class y via f (x, y) º wy×F(x). Here F (x) is a
vector of binary or numeric features extracted from x. Thus,
every feature will have an additively weighted influence in
98 communications of the acm | noVeMber 2009 | vol. 52 | no. 11
the modeled compatibility between inputs x and classes y.
To classify x, the prediction rule h(x) then simply chooses the
highest-scoring class
h(x) ≡ argmax f (x, y)
y ∈ Y
( 1)
as the predicted output. This will result in the correct prediction y for input x provided the weights w = (w1, …, wk) have
been chosen such that the inequalities f (x, y –) < f (x, y) hold for
all incorrect outputs y – ¹ y.
For a given training sample (x1, y1), …, (xn, yn), this leads
directly to a (hard-) margin formulation of the learning problem by requiring a fixed margin (= 1) separation of all training examples, while using the norm of w as a regularizer:
mi
w
n
1_
2
|| w||
2, s.t. f (xi, yi) - f (xi, y –) ≥ 1 ("i, y – ¹ yi) ( 2)
For a k-class problem, the optimization problem has a
total of n(k − 1) inequalities that are all linear in w, since one
can expand f (xi, yi) - f (xi, y –) = (wyi - wy –) × F(xi). Hence, it is a convex quadratic program.
The first challenge in using ( 2) for structured outputs is
that, while there is generalization across inputs x, there is
no generalization across outputs. This is due to having a separate weight vector wy for each class y. Furthermore, since
the number of possible outputs can become very large (or
infinite), naively reducing structured output prediction to
multiclass classification leads to an undesirable blowup in
the overall number of parameters.
The key idea in overcoming these problems is to extract
features from input–output pairs using a so-called joint feature map Y(x, y) instead of F(x). This yields compatibility
functions with contributions from combined properties of
inputs and outputs. These joint features will allow us to generalize across outputs and to define meaningful scores even
for outputs that were never actually observed in the training
data. At the same time, since we will define compatibility
functions via f (x, y) º w × Y(x, y), the number of parameters
will simply equal the number of features extracted via Y,
which may not depend on |Y |. One can then use the formulation in ( 2) with the more flexible definition of f via Y to
arrive at the following (hard-margin) optimization problem
for structural SVMs.