For all sequence pairs x in the training, validation, and test
sets, both structures are known, and we use the CE structural
alignment program23 to generate “correct” alignments y.
The red bars in Figure 6 show the Q4-scores (i.e., 100
minus Q4-loss) of the five alignment models described
above. By carefully introducing features and considering
their interactions, we can build highly complex alignment
models (with hundreds of thousands of parameters) with
very good alignment performance. Note that a conventional
substitution matrix has only 20² = 400 parameters.
SSALN20 (blue bar) uses a generative alignment model for
parameter estimation, and is trained using the same training set and features as the structural SVM algorithm. The
structural SVM model Window substantially outperforms
SSALN on Q4-score. Incorporating profile information makes
the SVM model Profile perform even better, illustrating the
benefit of being able to easily add additional features in discriminative training. The result from BLAST (green bar) is
provided as a baseline.
To get a plausible upper bound for further improvements,
we check to what extent the CE alignments we used as ground
truth agree with the structural alignments computed by
TM-Align.33 TM-Align achieves a Q4-score of 85.45 when evaluated against
the CE alignments as ground truth. This
provides a reasonable upper bound on the alignment accuracy we might achieve on this noisy dataset. However, it also
shows that there is significant room for further research and improvement beyond our current alignment models.
3.3. Binary classification with unbalanced classes
Even binary classification problems can become structured
output problems in some cases. Consider, for example, the
case of learning a binary text classification rule with the
classes “machine learning” and “other topics.” Like most text
classification tasks, it is highly unbalanced, and we might only
have, say, 1% machine-learning documents in our collection.
In such a situation, prediction error typically becomes meaningless as a performance measure, since the default classifier that always predicts “other topics” already has a great
error rate that is hard to beat.

Figure 6. Q4-score of various structural SVM alignment models compared to two baseline models. The structural SVM using Window or Profile features significantly outperforms the SSALN baseline. The number in brackets is the number of features of the corresponding alignment model. [Bar chart of Q4-score for Simple (1020), Anova2 (49634), Tensor (203280), Window (447016), and Profile (448642), with SSALN and BLAST shown for comparison.]

To overcome this problem, the
Information Retrieval (IR) community has designed other performance measures, like the F1-Score and the Precision/Recall
Breakeven Point (PRBEP) (see e.g., Baeza-Yates and Ribeiro-Neto1), that are meaningful even with unbalanced classes.
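To make the imbalance problem concrete, here is a small sketch in Python with made-up numbers (1% positive documents): the trivial classifier that always predicts "other topics" looks excellent under error rate but useless under F1. The helper names error_rate and f1_score are purely illustrative.

```python
def error_rate(y_true, y_pred):
    """Fraction of examples that are misclassified."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=+1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical collection: 1% "machine learning" (+1), 99% "other topics" (-1).
y_true = [+1] * 10 + [-1] * 990
y_default = [-1] * 1000            # trivial classifier: always "other topics"

print(error_rate(y_true, y_default))  # 0.01 -- looks excellent
print(f1_score(y_true, y_default))    # 0.0  -- reveals the classifier is useless
```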
What does this mean for learning? Instead of optimizing some variant of error rate during training, which is what
conventional SVMs and virtually all other learning algorithms do, it seems like a natural choice to have the learning
algorithm directly optimize, for example, the F1-Score (i.e.,
the harmonic mean of precision and recall). This is the point
where our binary classification task needs to become a structured output problem, since the F1-Score (as well as many
other IR measures) is not a function of individual examples
(like error rate), but a function of a set of examples. In particular, we arrive at the structured output problem of predicting an array
of labels y = (y1, …, yn), yi ∈ {−1, +1}, for an array
of feature vectors x = (x1, …, xn), xi ∈ ℝ^N. Each possible array
of labels ȳ now has an associated F1-Score F1(y, ȳ) w.r.t. the
true labeling y, and optimizing F1-Score on the training set
becomes a well-defined problem.
The problem of predicting an array of binary labels y for
an array of input vectors x while optimizing a loss function Δ(y, ȳ)
fits neatly into the structural SVM framework. Again, we only
need to define Ψ(x, y) and Δ(y, ȳ) appropriately, and then find
algorithms for the two argmax problems. The choice of loss
function is obvious: when optimizing F1-Score, which ranges
from 0 (worst) to 1 (best), we will use

Δ(y, ȳ) = 1 − F1(y, ȳ).
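Written over whole label arrays, this loss is a one-line wrapper around an F1 computation; a minimal sketch, reusing the hypothetical f1_score helper from the sketch above:

```python
def delta(y_true, y_bar):
    """Structured loss: 1 - F1(y, y_bar), defined on entire label arrays."""
    return 1.0 - f1_score(y_true, y_bar)

# The loss is 0 only when y_bar attains F1 = 1 on the true labeling y.
print(delta([+1, +1, -1, -1], [+1, +1, -1, -1]))  # 0.0
print(delta([+1, +1, -1, -1], [+1, -1, -1, -1]))  # 1 - 2/3 ≈ 0.33
```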
For the joint feature mapping, using

Ψ(x, y) = Σi=1…n yi xi

can be shown to be a natural choice, since it makes the structural SVM equivalent to a conventional SVM if error rate is
used as the loss Δ(y, ȳ). It also makes computing a prediction
very efficient, with

h(x) = argmaxy {w · Ψ(x, y)} = (sign(w · x1), …, sign(w · xn)).
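Since w · Ψ(x, y) decomposes as Σi yi (w · xi), each yi in the argmax can be chosen independently, which yields exactly the element-wise sign rule above. A minimal sketch of this prediction step, using plain-Python dot products; the names dot and predict are illustrative:

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def predict(w, xs):
    """argmax_y w·Ψ(x, y): since w·Ψ(x, y) = Σ_i y_i (w·x_i),
    each y_i is chosen independently as the sign of w·x_i."""
    return [+1 if dot(w, x) >= 0 else -1 for x in xs]

w = [0.5, -1.0]
xs = [[2.0, 0.5], [0.1, 1.0], [-1.0, -2.0]]
print(predict(w, xs))  # [1, -1, 1]
```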
Computing the loss-augmented argmax necessary for training is a bit more complicated and depends on the particular
loss function used, but it can be done in time at most O(n²)
for any loss function that can be computed from the contingency table of prediction errors.12 SVMperf is an implementation of the method for various performance measures.
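As an illustration of the quadratic-time construction, the following sketch assumes the margin-rescaling form of the loss-augmented argmax, argmaxȳ [Δ(y, ȳ) + w · Ψ(x, ȳ)], with Ψ as above and Δ = 1 − F1, and works from precomputed scores si = w · xi. The function names are hypothetical, and SVMperf's actual implementation may differ in its details.

```python
def f1_from_table(tp, fp, fn):
    """F1-score from a contingency table (tn does not enter F1)."""
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def loss_augmented_argmax(scores, y_true):
    """Maximize Delta(y, y_bar) + sum_i y_bar_i * s_i over all label arrays y_bar.

    Delta = 1 - F1 depends on y_bar only through the contingency table, so it
    suffices to try all O(n^2) tables; for a fixed table, the linear term is
    largest when the highest-scoring positives stay +1 and the highest-scoring
    negatives are flipped to +1. Prefix sums over sorted scores make each table O(1)."""
    pos = sorted((i for i, y in enumerate(y_true) if y == +1),
                 key=lambda i: scores[i], reverse=True)
    neg = sorted((i for i, y in enumerate(y_true) if y == -1),
                 key=lambda i: scores[i], reverse=True)

    def prefix(idx):
        sums = [0.0]
        for i in idx:
            sums.append(sums[-1] + scores[i])
        return sums  # sums[k] = total score of the k highest-scoring examples

    psum, nsum = prefix(pos), prefix(neg)
    best = None
    for tp in range(len(pos) + 1):        # true positives labeled +1
        for fp in range(len(neg) + 1):    # true negatives labeled +1
            fn = len(pos) - tp
            # linear term: +s_i for examples labeled +1, -s_i for those labeled -1
            linear = (2 * psum[tp] - psum[-1]) + (2 * nsum[fp] - nsum[-1])
            value = (1.0 - f1_from_table(tp, fp, fn)) + linear
            if best is None or value > best[0]:
                best = (value, tp, fp)

    _, tp, fp = best
    y_bar = [-1] * len(y_true)
    for i in pos[:tp] + neg[:fp]:
        y_bar[i] = +1
    return y_bar

scores = [0.9, -0.2, 0.1, -0.8]   # s_i = w · x_i for four examples
y_true = [+1, +1, -1, -1]
print(loss_augmented_argmax(scores, y_true))  # [1, -1, 1, -1], the most violating labeling
```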
Experiments: Table 1 shows the prediction performance of
the structural SVM on four benchmark datasets, described
in more detail in Joachims.12 In particular, it compares the
structural SVM that directly optimizes the measure it is evaluated on (here F1, PRBEP, and ROCArea) to a conventional
SVM that accounts for unbalanced classes with a linear cost
model. For both methods, parameters were selected to optimize the evaluation measure via cross validation. In most
cases, the structural SVM outperforms the conventional
SVM, and it never does substantially worse.
Beyond these performance gains, the structural SVM
approach has an attractive simplicity to it—direct optimization