homologous protein structures, and for optimizing nonstandard performance measures in binary classification.
3.1. Optimizing diversity in search engines
State of the art retrieval systems commonly use machine
learning to optimize their ranking functions. While effective
to some extent, conventional learning methods score each
document independently and, therefore, cannot account for
information diversity (e.g., not presenting multiple redundant results). Indeed, several recent studies in information retrieval emphasized the need for diversity (e.g., Swaminathan et al.24 and Zhai et al.32). In particular, they stressed
modeling interdocument dependencies, which is fundamentally a structured prediction problem. Given a dataset
of queries and documents labeled according to information
diversity, we used structural SVMs to learn a general model of how to diversify.31
What is an example (x, y) in this learning problem? For
some query, let x = {x1, …, xn} be the candidate documents
that the system needs to rank. Our ground truth labeling for
x is a set of subtopics T = { T1, …, Tn}, where Ti denotes the
subtopics covered by document xi. The prediction goal is to
find a subset y ⊂ x of size K (e.g., K = 10 for search) maximizing subtopic coverage, and therefore maximizing the information presented to the user. We define our loss function
D(T, y) to be the (weighted) fraction of subtopics not covered
by y (more weight for popular subtopics).a
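As a concrete sketch, the loss D(T, y) described above can be computed directly from the subtopic sets; the function and its argument names are hypothetical illustrations, not code from the paper:

```python
def diversity_loss(subtopics, y, weight=None):
    """Weighted fraction of subtopics NOT covered by the predicted set y.

    subtopics: dict mapping each candidate document to the set of
    subtopics it covers (the ground-truth labeling T).
    weight: optional per-subtopic weights (e.g., subtopic popularity);
    defaults to uniform weights.
    """
    all_topics = set().union(*subtopics.values())
    covered = set().union(*(subtopics[d] for d in y)) if y else set()
    if weight is None:
        weight = {t: 1.0 for t in all_topics}
    total = sum(weight[t] for t in all_topics)
    missed = sum(weight[t] for t in all_topics - covered)
    return missed / total
```

A perfectly diverse selection covering every subtopic incurs zero loss; a selection covering half the (uniformly weighted) subtopics incurs loss 0.5.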
Even if the ground truth subtopics were known, computing the best y can be difficult. Since documents overlap in
subtopics (i.e., ∃i, j : Ti ∩ Tj ≠ ∅), this is a budgeted maximum coverage problem. The standard greedy method achieves a (1 − 1/e)-approximation bound,15 and typically yields good
solutions. The greedy method also naturally produces a
ranking of the top documents.
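The greedy method just described can be sketched as follows, assuming the subtopic sets are known (a minimal illustration; the function and argument names are hypothetical):

```python
def greedy_coverage(docs, topics_of, K, weight=None):
    """Greedily pick K documents maximizing weighted subtopic coverage.

    Classic (1 - 1/e)-approximation for budgeted maximum coverage:
    at each step, take the document whose subtopics add the most
    uncovered weight.  The selection order is a natural ranking.
    """
    if weight is None:
        weight = {}  # weight.get(t, 1.0) then gives uniform weights
    covered, chosen = set(), []
    for _ in range(min(K, len(docs))):
        best = max(
            (d for d in docs if d not in chosen),
            key=lambda d: sum(weight.get(t, 1.0)
                              for t in topics_of[d] - covered),
        )
        chosen.append(best)
        covered |= topics_of[best]
    return chosen
```

Note the marginal-gain computation `topics_of[d] - covered`: a document is valued only for subtopics not yet covered, which is exactly what makes greedy selections collectively diverse.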
Figure 3 depicts an abstract visualization of our prediction problem. The shaded regions represent candidate documents x of a query, and the area covered by each region is the
“information” (represented as subtopics T) covered by that
document. If T was known, we could use a greedy method
to find a solution with high subtopic diversity. For K = 3, the
optimal solution in Figure 3 is y = {D1, D3, D5}.
In general, however, the subtopics of new queries are
unknown. One can think of subtopic labels as a manual partitioning of the total information of a query. We do, however,
have access to an “automatic” representation of information
diversity: the words contained in the documents. Intuitively,
covering more (distinct) words should result in covering
more subtopics. Since some words are more informative
than others, a weighting scheme is required.
Following the approach of Essential Pages,24 we measure
word importance primarily using relative word frequencies
within x. For example, the word “computer” is generally informative, but conveys almost no information for the query “ACM”
since it likely appears in most of the candidate documents. We
learn a word weighting function using labeled training data,
whereas Swaminathan et al.24 use a predefined function.
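To make the intuition concrete, here is a rough IDF-style weighting computed within the candidate set x; it is an illustrative stand-in, not the learned weighting nor the exact Essential Pages formula:

```python
import math

def word_importance(word, docs):
    """Down-weight words that appear in most of the candidate set x.

    docs: list of candidate documents, each a set of words.  A word
    appearing in every document (like "computer" for the query "ACM")
    gets weight ~0, since it conveys no distinguishing information
    within this candidate set.
    """
    df = sum(1 for d in docs if word in d)  # document frequency in x
    return math.log(len(docs) / df) if df else 0.0
```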
a The loss function (Equation 9) can be defined using any ground truth format, not necessarily the same format as the output space.
Figure 3. Different documents (shaded regions) cover different information (subtopics) of a query. Here, the set {D2, D3, D4} contains the three documents that individually cover the most information, but {D1, D3, D5} collectively covers more information.
We now present a simple example of the joint feature representation Y. Let f (v, x) denote the feature vector describing the frequency of a word v amongst documents in x. For
example, we can construct f(v, x) as a vector of frequency indicators, e.g., f(v, x) = (1[v appears in at least 10% of x], 1[v appears in at least 20% of x], …).
Let V(y) denote the union of words contained in the documents of the predicted subset y. We can write Y as
Y(x, y) = Σ_{v ∈ V(y)} f(v, x).
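A minimal sketch of this construction follows; the frequency thresholds inside f are hypothetical choices for illustration, and documents are represented simply as sets of words:

```python
def phi(v, x_docs):
    """Per-word feature vector f(v, x): indicators of how frequently
    word v occurs among the candidate documents (thresholds are
    illustrative assumptions)."""
    frac = sum(1 for d in x_docs if v in d) / len(x_docs)
    return [1.0 if frac >= th else 0.0 for th in (0.1, 0.2, 0.4)]

def joint_feature(x_docs, y_idx):
    """Y(x, y): sum of f(v, x) over the distinct words V(y) contained
    in the predicted subset y (given as indices into x_docs)."""
    V = set().union(*(x_docs[i] for i in y_idx))
    out = [0.0, 0.0, 0.0]
    for v in V:
        out = [a + b for a, b in zip(out, phi(v, x_docs))]
    return out
```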
Given a model vector w, the benefit of covering word v in x is
w · f(v, x), and is realized when a document in y contains v, i.e., v ∈ V(y). Because documents overlap in words, this is also
a budgeted maximum coverage problem. Thus both prediction (Equation 10) and loss-augmented inference (Equation
11) can be solved effectively via the greedy method. Despite
finding an approximate most violated constraint, we can
still bound the precision of our solution.8 Practical applications require more sophisticated Y, and we refer to Yue and Joachims31 for more details.
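Since each word's benefit w · f(v, x) is a fixed score once w is given, prediction reduces to the same greedy coverage routine, now over words rather than subtopics. A sketch, assuming the per-word benefits have been precomputed into a dictionary:

```python
def predict_diverse(x_docs, benefit, K):
    """Greedy prediction: select K documents (by index) whose newly
    covered words add the largest total benefit w . f(v, x).

    x_docs: candidate documents, each a set of words.
    benefit: precomputed map from word to its benefit within x.
    """
    covered, chosen = set(), []
    for _ in range(min(K, len(x_docs))):
        gains = [
            (sum(benefit.get(v, 0.0) for v in doc - covered), i)
            for i, doc in enumerate(x_docs) if i not in chosen
        ]
        _, i = max(gains)
        chosen.append(i)
        covered |= x_docs[i]
    return chosen
```

Loss-augmented inference has the same budgeted-coverage structure, which is why the greedy method serves for both prediction and constraint generation during training.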
Experiments: We tested the effectiveness of our method using
the TREC 6–8 Interactive Track queries. Relevant documents
are labeled using subtopics. For example, query 392 asked
human judges to identify applications of robotics in the world
today, and they identified 36 subtopics among the results
such as nanorobots and using robots for space missions.
Additional details can be found in Yue and Joachims.
31
We compared against Okapi,21 Essential Pages,24 random selection, and unweighted word selection (all words have equal
benefit). Okapi is a conventional retrieval function which does
not account for diversity. Figure 4 shows the loss on the test set
for retrieving K = 5 documents. The gains of the structural SVM
over Essential Pages are significant at the 95% level, and only the structural SVM and Essential Pages outperformed random.
This approach can accommodate other diversity criteria,
such as diversity of clicks gathered from clickthrough logs.
We can also accommodate very rich feature spaces based on,
e.g., anchor text, URLs, and titles.
Abstractly, we are learning a mapping between two
representations of the same set covering problem. One
representation defines solution quality using manually