\[
\min_{w}\ \tfrac{1}{2}\,\|w\|^2
\quad \text{s.t.} \quad
w \cdot \Psi(x_i, y_i) - w \cdot \Psi(x_i, \bar{y}) \;\geq\; 1
\quad (\forall i,\ \bar{y} \neq y_i)
\qquad (3)
\]
In words, find a weight vector w of an input–output compatibility function f that is linear in some joint feature map Ψ, so that on each training example it scores the correct output higher by a fixed margin than every alternative output, while having low complexity (i.e., small norm ||w||). Note that the number of linear constraints is still n(|Y| − 1), but we will suggest ways to cope with this efficiently in Section 2.4.

The design of the features Ψ is problem-specific, and it is a strength of the developed methods that they allow a great deal of flexibility in how to choose it. In the parsing example above, the features extracted via Ψ may consist of counts of how many times each production rule of the underlying grammar has been applied in the derivation described by a pair (x, y). For other applications, the features can be derived from graphical models as proposed in Lafferty et al. and Taskar et al.,16, 25 but more general features can be used as well.
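As a concrete illustration of such a feature map, the following sketch counts how often each production rule fires in a parse tree. The nested-tuple tree encoding, the toy rule inventory, and the function names are hypothetical choices made only for this example, not part of the methods described here.

```python
from collections import Counter

# Hypothetical rule inventory; in practice it would be read off the grammar.
RULES = ["S -> NP VP", "NP -> Det N", "VP -> V NP", "Det -> the",
         "N -> dog", "N -> cat", "V -> chased"]

def production_rules(tree):
    """Yield the production rules used in a parse tree.

    A tree is a nested tuple (label, child_1, ..., child_k); a leaf child
    is a plain string, i.e., a word of the input sentence x.
    """
    label, *children = tree
    rhs = " ".join(c if isinstance(c, str) else c[0] for c in children)
    yield f"{label} -> {rhs}"
    for c in children:
        if not isinstance(c, str):
            yield from production_rules(c)

def joint_feature_map(x, y):
    """Psi(x, y): vector of production-rule counts for the parse y of sentence x.

    The sentence x enters only through the words at the leaves of y.
    """
    counts = Counter(production_rules(y))
    return [counts.get(rule, 0) for rule in RULES]

# Example pair (x, y): "the dog chased the cat" with one possible parse.
y = ("S", ("NP", ("Det", "the"), ("N", "dog")),
          ("VP", ("V", "chased"), ("NP", ("Det", "the"), ("N", "cat"))))
x = "the dog chased the cat".split()
print(joint_feature_map(x, y))  # [1, 2, 1, 2, 1, 1, 1]
```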
2.2. Problem 2: efficient prediction
Before even starting to address the efficiency of solving quadratic programs of the form (3) for large n and |Y|, we need to make sure that we can actually solve the inference problem of computing a prediction h(x). Since the number of possible outputs |Y| may grow exponentially in the length of the representation of outputs, a brute-force exhaustive search over Y may not always be feasible. In general, we require some sort of structural matching between the (given) compositional structure of the outputs y and the (designed) joint feature map Ψ.

For instance, if we can decompose Y into nonoverlapping independent parts, Y = Y1 × ··· × Ym, and if the feature extraction Ψ respects this decomposition such that no feature combines properties across parts, then we can maximize the compatibility for each part separately and compose the full output by simple concatenation. A significantly more general case can be captured within the framework of Markov networks as explored by Lafferty et al. and Taskar et al.16, 25 If we represent outputs as a vector of (random) variables y = (y1, …, ym), then for a fixed input x, we can think of Ψ(x, y) as sufficient statistics in a conditional exponential model of P(y|x). Maximizing f(x, y) then corresponds to finding the most probable output, ŷ = argmax_y P(y|x). Modeling the statistical dependencies between output variables via a dependency graph, one can use efficient inference methods such as the junction tree algorithm for prediction. This assumes the clique structure of the dependency graph induced by a particular choice of Ψ is tractable (e.g., the state space of the largest cliques remains sufficiently small). Other efficient algorithms, for instance based on dynamic programming or minimal graph cuts, may be suitable for other prediction problems.

In our example of natural language parsing, the suggested feature map will lead to a compatibility function in which parse trees are scored by associating a weight with each production rule and by simply combining all weights additively. Therefore, a prediction can be computed efficiently using the CKY parsing algorithm (cf. Tsochantaridis et al.27).
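To make the decomposition argument concrete for the simplest case, the following sketch computes h(x) = argmax_y w·Ψ(x, y) by dynamic programming (Viterbi) for sequence labeling, assuming Ψ decomposes into per-position emission features and adjacent-pair transition features. The function name, the array layout, and the toy scores are illustrative assumptions, not the specific model used in the parsing example.

```python
import numpy as np

def viterbi_decode(emission_scores, transition_scores):
    """argmax_y of sum_t emission[t, y_t] + sum_t transition[y_{t-1}, y_t].

    emission_scores:   (T, K) array of w . Psi_emission(x, t, label)
    transition_scores: (K, K) array of w . Psi_transition(label_prev, label)
    Returns the highest-scoring label sequence of length T.
    """
    T, K = emission_scores.shape
    best = np.zeros((T, K))             # best[t, k]: best score of a prefix ending in label k
    back = np.zeros((T, K), dtype=int)  # backpointers for recovering the argmax
    best[0] = emission_scores[0]
    for t in range(1, T):
        # cand[j, k]: score of extending the best prefix ending in j with label k at position t
        cand = best[t - 1][:, None] + transition_scores + emission_scores[t][None, :]
        back[t] = cand.argmax(axis=0)
        best[t] = cand.max(axis=0)
    # follow backpointers from the best final label
    y = [int(best[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        y.append(int(back[t, y[-1]]))
    return y[::-1]

# Toy usage: T = 4 positions, K = 3 labels, random scores standing in for w . Psi.
rng = np.random.default_rng(0)
print(viterbi_decode(rng.normal(size=(4, 3)), rng.normal(size=(3, 3))))
```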
2.3. Problem 3: inconsistent training data
So far we have tacitly assumed that the optimization problem in Equation 3 has a solution, i.e., that there exists a weight vector that simultaneously fulfills all margin constraints. In practice this may not be the case, either because the training data is inconsistent or because our model class is not powerful enough. If we allow for mistakes, though, we must be able to quantify the degree of mismatch between a prediction and the correct output, since different incorrect predictions usually vary in quality. This is exactly the role played by a loss function, formally Δ: Y × Y → ℝ, where Δ(y, ȳ) is the loss (or cost) for predicting ȳ when the correct output is y. Ideally we are interested in minimizing the expected loss of a structured output classifier, yet, as is common in machine learning, one often uses the empirical or training loss $\frac{1}{n}\sum_{i=1}^{n} \Delta(y_i, h(x_i))$ as a proxy for the (inaccessible) expected loss. Like the choice of Y, defining Δ is problem-specific. If we predict a set or sequence of labels, a natural choice for Δ may be the number of incorrect labels (a.k.a. Hamming loss). In the above parsing problem, a quality measure like F1 may be the preferred way to quantify the similarity between two labeled trees in terms of common constituents (and then one may set Δ ≡ 1 − F1).
Now we will utilize the idea of loss functions to refine our initial problem formulation. Several approaches have been suggested in the literature, all of which are variations of the so-called soft-margin SVM: instead of strictly enforcing the constraints, one allows for violations, yet penalizes them in the overall objective function. A convenient approach is to introduce slack variables that capture the actual violation,
\[
f(x_i, y_i) - f(x_i, \bar{y}) \;\geq\; 1 - \xi_{i\bar{y}}
\quad (\forall \bar{y} \neq y_i)
\qquad (4)
\]
so that by choosing $\xi_{i\bar{y}} > 0$ a smaller (even negative) separation margin is possible. All that remains to be done then is to define a penalty for the margin violations. A popular choice is to use $\frac{1}{n}\sum_{i=1}^{n} \max_{\bar{y} \neq y_i} \{\Delta(y_i, \bar{y})\, \xi_{i\bar{y}}\}$, thereby penalizing violations more heavily for constraints associated with outputs ȳ that have a high loss with regard to the observed $y_i$. Technically, one can convert this back into a quadratic program as follows:
\[
\min_{w,\,\xi}\ \tfrac{1}{2}\,\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i
\quad \text{s.t.} \quad
w \cdot \Psi(x_i, y_i) - w \cdot \Psi(x_i, \bar{y}) \;\geq\; 1 - \frac{\xi_i}{\Delta(y_i, \bar{y})}
\quad (\forall i,\ \bar{y} \neq y_i)
\qquad (5)
\]
Here C > 0 is a constant that trades off constraint violations with the geometric margin (effectively given by 1/||w||). This has been called the slack-rescaling formulation. As an alternative, one can also rescale the margin to upper-bound the training loss, as first suggested in Taskar et al.25 and discussed in Tsochantaridis et al.27 This leads to the following quadratic program
\[
\min_{w,\,\xi}\ \tfrac{1}{2}\,\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i
\quad \text{s.t.} \quad
w \cdot \Psi(x_i, y_i) - w \cdot \Psi(x_i, \bar{y}) \;\geq\; \Delta(y_i, \bar{y}) - \xi_i
\quad (\forall i,\ \bar{y} \neq y_i)
\qquad (6)
\]
Since it is slightly simpler, we will focus on this margin-rescaling formulation in the following.
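To connect the margin-rescaling constraints in (6) back to the training loss, the following sketch computes, for one training example and a given w, the smallest slack ξi compatible with all constraints, namely the maximum over ȳ ≠ yi of Δ(yi, ȳ) − (w·Ψ(xi, yi) − w·Ψ(xi, ȳ)). The tiny enumerable output set, the feature map, and the function names are assumptions made purely for illustration; in practice this maximization is carried out with the efficient inference machinery discussed in Section 2.2.

```python
import numpy as np

def min_feasible_slack(w, x_i, y_i, candidate_outputs, psi, loss):
    """Smallest xi_i compatible with the margin-rescaling constraints in (6):

        xi_i >= Delta(y_i, y_bar) - (w . Psi(x_i, y_i) - w . Psi(x_i, y_bar))
        for all y_bar != y_i.

    Enumerating candidate_outputs is only viable for toy problems; in general
    the maximization over y_bar requires efficient inference (Section 2.2).
    """
    correct_score = np.dot(w, psi(x_i, y_i))
    return max(loss(y_i, y_bar) - (correct_score - np.dot(w, psi(x_i, y_bar)))
               for y_bar in candidate_outputs if y_bar != y_i)

# Toy usage with hypothetical ingredients: outputs are label pairs, Psi records
# which label appears at each position, and the loss is the Hamming loss.
labels = ["A", "B"]
candidate_outputs = [(a, b) for a in labels for b in labels]
psi = lambda x, y: np.array([y[0] == "A", y[0] == "B",
                             y[1] == "A", y[1] == "B"], dtype=float)
loss = lambda y, y_bar: sum(a != b for a, b in zip(y, y_bar))
w = np.array([1.0, -1.0, 1.0, -1.0])   # prefers label "A" at both positions

print(min_feasible_slack(w, x_i=None, y_i=("A", "A"),
                         candidate_outputs=candidate_outputs, psi=psi, loss=loss))
# -1.0: every constraint already holds with room to spare, so no positive slack is needed
```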