figure 4. subtopic loss comparison when retrieving five documents.
structural sVm performance is superior with 95% confidence (using
signed rank test).
0.5
0.450.30.25 0.469 0.472 0.471 0.434 0.349 0.4
Loss
0.35
Random
Okapi
Unweighted
model
Essential pages Structural SVM
labeled subtopics. The second representation learns a
word weighting function, with goal of having the representations agree on the best solution. This setting is very
general and can be applied to other domains beyond subtopic retrieval.
3. 2. Predicting protein alignments
Proteins are sequences of 20 different amino acids joined
together by peptide bonds. They fold into various stable
structures under normal cell environments. A central question in biology is to understand how proteins fold, as their
structures relate to their functions.
One of the more successful methods for predicting protein structure from an amino acid sequence is homology
modeling. It approximates the structure of an unknown
protein by comparing it against a database of proteins with
experimentally determined structures. An important intermediate step is to align the unknown protein sequence with
known protein sequences in the database, and this is a difficult problem when the sequences have diverged too far (e.g.,
less than 20% sequence identity).
To align protein sequences accurately for homology
modeling, we need to have a good substitution cost matrix
as input to the Smith–Waterman algorithm (a dynamic programming algorithm). A common approach is to learn the
substitution matrix from a set of known good alignments
between evolutionarily related proteins.
Structural SVMs are particularly suitable for learning the substitution matrix, since they allow incorporating all available information (e.g., secondary structures,
solvent accessibility, and other physiochemical properties) in a principled way. When predicting the alignment
y = ( y1, y2, …) for two given protein sequences x = (sa, sb),
each sequence location is described by a vector of features.
The discriminative approach of structural SVMs makes it
easy to incorporate these extra features without having to
make unjustified independence assumptions as in generative models. As explained below, this enables learning a
“richer” substitution function w × F (x, yi) instead of a fixed
substitution matrix.
The Smith–Waterman algorithm uses a linear function
for scoring alignments, which allows us to decompose the
102 communications of the acm | noVeMber 2009 | vol. 52 | no. 11
joint feature vector into a sum of feature vectors for individual alignment operations (match, insertion, or deletion):
F (x, yi) is the feature vector for the ith alignment operation
in the alignment y of the sequence pair x. Below we describe
several alignment models represented by F, focusing on the
scoring of matching two amino acids (see Figure 5 for an
illustration of the features used):
(i) Simple: we only consider the substitution cost of single
features, e.g., the cost of matching a position with
amino acid “M” with another position in a loop region.
(ii) Anova2: we consider pairwise interaction of features,
e.g., the cost of matching a position with amino acid
“M” in an alpha helix with another position with
amino acid “V” in a loop region.
(iii) Tensor: we consider three-way interaction of features
among amino acid, secondary structure, and solvent
accessibility.
(iv) Window: on top of three alignment models above, we
add neighborhood features using a sliding window,
which takes into account the hydrophobicity and secondary structure in the neighborhood of a position.
(iv) Profile: on top of the alignment model Window, we
add PSI-BLAST profile scores as features.
As the loss function D(y, y –), we use Q4-loss. It is the percentage of matched amino acids in the correct alignment y that
are aligned more than four positions apart in the predicted
alignment y –. The linearity of the Q4-loss allows us to use
the Smith–Waterman algorithm also for solving the loss-augmented inference problem during training. We refer to
Yu et al.
29 for more details.
experiments: We tested our algorithm with the training and
validation sets developed in Qiu and Elber,
20 which contain
4542 and 4587 pairwise alignments of evolutionarily-related
proteins with high structural similarites. For the test set we
selected 4185 structures deposited in the Protein Data Bank
from June 2005 to June 2006, leading to 29345 test pairs.
figure 5. aligning a new protein sequence with a known protein
structure. features at the aligned positions are used to construct
substitution matrices under different alignment models.
Know protein (Position 10) amino acid: Methionine (M) 2nd structure: alpha helix solvent access: Exposed
New protein (Position 11)
amino acid: Valine (V)
2nd structure (predicted): loop
solvent access (predicted): Buried
Known
GWNAYIDN-LMAD
SWQTYVDTNLVCT
SQY
QGF