ily be solved once we have a score matrix assessing for each paper-reviewer pair how well they are matched.f We have described a range of techniques from information retrieval and machine learning that can produce such a score matrix. The notion of profiles (of reviewers as well as papers) is useful here, as it turns a heterogeneous matching problem into a homogeneous one. Such profiles can be formulated against a fixed vocabulary (bag-of-words) or against a small set of topics. Although it is fashionable in machine learning to treat such topics as latent variables that can be learned from data, we have found stability issues with latent topic models (that is, adding a few documents to a collection can completely change the learned topics) and have started to experiment with handcrafted topics (for example, encyclopedia or Wikipedia entries) that extend keywords by allowing their own bag-of-words representations.
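To make the allocation step concrete, the following is a minimal sketch rather than the method of any particular reviewing system: it assumes a score matrix S in which S[i, j] rates how well paper i and reviewer j are matched, and solves the resulting assignment problem with the Hungarian algorithm from SciPy, replicating rows and columns to model papers that need several reviews and reviewers with a load cap.

    # Minimal sketch: allocate papers to reviewers from a score matrix S.
    # S[i, j] is the match quality of paper i and reviewer j (higher is better).
    # Review demand and reviewer load caps are modeled by replicating rows
    # and columns before running the Hungarian algorithm.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def allocate(S, reviews_per_paper=3, max_load=5):
        n_papers, n_reviewers = S.shape
        rows = np.repeat(np.arange(n_papers), reviews_per_paper)
        cols = np.repeat(np.arange(n_reviewers), max_load)
        cost = -S[np.ix_(rows, cols)]        # negate: the solver minimizes cost
        r_idx, c_idx = linear_sum_assignment(cost)
        assignment = {}                      # paper index -> list of reviewer indices
        for r, c in zip(r_idx, c_idx):
            assignment.setdefault(int(rows[r]), []).append(int(cols[c]))
        return assignment

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        print(allocate(rng.random((10, 8))))  # 10 papers, 8 reviewers

Note that this naive replication does not, by itself, prevent the same reviewer from being matched twice to one paper; as footnote f indicates, such additional constraints complicate the allocation problem and call for a more careful formulation.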
A perhaps less commonly studied area where progress has nevertheless been achieved concerns the interpretation and calibration of the intermediate output of the peer reviewing process: the aspects of the reviews that feed into the decision-making process. In their simplest form these are scores on an ordinal scale that are often simply averaged. However, averaging assessments from different assessors (which is common in other areas as well, for example, grading coursework) is fraught with difficulties, as it makes the unrealistic assumption that each assessor scores on the same scale. It is possible to adjust for differences between individual reviewers, particularly when a reviewing history is available that spans multiple conferences. Such a global reviewing system that builds up persistent reviewer (and author) profiles is something we support in principle, although many details need to be worked out before this is viable.
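As an illustration of the kind of adjustment meant here, the sketch below standardizes each reviewer's scores against that reviewer's own history before averaging; this per-reviewer z-scoring is only one simple calibration scheme, and the scores and identifiers are placeholders.

    # Minimal sketch: calibrate ordinal review scores per reviewer before averaging,
    # so a habitually harsh and a habitually lenient reviewer contribute on a
    # comparable scale. Scores and identifiers below are illustrative only.
    from collections import defaultdict
    from statistics import mean, pstdev

    reviews = [  # (reviewer, paper, raw score)
        ("r1", "p1", 7), ("r1", "p2", 5), ("r1", "p3", 6),
        ("r2", "p1", 3), ("r2", "p2", 2), ("r2", "p3", 4),
    ]

    history = defaultdict(list)
    for reviewer, _, score in reviews:
        history[reviewer].append(score)

    def calibrate(reviewer, score):
        mu = mean(history[reviewer])
        sigma = pstdev(history[reviewer]) or 1.0  # guard against zero spread
        return (score - mu) / sigma

    per_paper = defaultdict(list)
    for reviewer, paper, score in reviews:
        per_paper[paper].append(calibrate(reviewer, score))

    for paper, scores in sorted(per_paper.items()):
        print(paper, round(mean(scores), 2))      # calibrated average per paper

A reviewing history spanning multiple conferences simply makes each reviewer's history longer, and the estimated mean and spread correspondingly more reliable.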
We also believe it would be beneficial if the role of individual reviewers shifted away from being an ersatz judge attempting to answer the ques-
f This holds for the simple version stated earlier, but further constraints might complicate the allocation problem.
authors of accepted papers.1
Another example of computational support for assembling a balanced set of reviewers comes not from conferences but from a U.S. funding agency, the National Science Foundation (NSF). The NSF presides over a budget of over $7.7 billion (FY 2016) and receives 40,000 proposals per year, with large competitions attracting 500–1,500 proposals; peer review is part of the NSF's core business. Approximately a decade ago, the NSF developed Revaide, a data-mining tool to help them find proposal reviewers and to build panels with expertise appropriate to the subjects of received proposals.22
In constructing profiles of potential reviewers, the NSF decided against using bibliographic databases like Citeseer or Google Scholar, for the same reasons we discussed earlier. Instead, they took a closed-world approach by restricting the set of potential reviewers to authors of past (single-author) proposals that had been judged 'fundable' by the review process. This ensured the availability of a UID for each author and reliable metadata, including the author's name and institution, which facilitated conflict-of-interest detection. Reviewer profiles were constructed from the text of their past proposal documents (including references and résumés) as a vector of the top 20 terms with the highest tf-idf scores. Such documents were known to be all of similar length and style, which improved the relevance of the resultant tf-idf scores. The same is also true of the proposals to be reviewed, and so profiles of the same type were constructed for these.
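A rough impression of such profiles can be given in a few lines of code; the sketch below uses scikit-learn's TfidfVectorizer on placeholder documents and is not the NSF's actual implementation, whose preprocessing details are not described here.

    # Minimal sketch: represent each author's past proposal text as a profile
    # consisting of the 20 terms with the highest tf-idf weights.
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = {  # placeholder proposal texts keyed by author
        "author_A": "machine learning methods for matching papers to reviewers ...",
        "author_B": "combinatorial optimization and graph algorithms for scheduling ...",
    }

    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(documents.values())    # one row per document
    terms = vectorizer.get_feature_names_out()

    profiles = {}
    for author, row in zip(documents, X.toarray()):
        top = row.argsort()[::-1][:20]                  # indices of the top-20 weights
        profiles[author] = {terms[i]: float(row[i]) for i in top if row[i] > 0}

    print(profiles["author_A"])

Because proposals to be reviewed are represented in the same way, proposal and reviewer profiles live in the same term space and can be compared directly.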
For a machine learning researcher, an obvious next step toward forming panels with appropriate coverage for the topics of the submissions would be to cluster the profiles of received proposals and use the resultant clusters as the basis for panels, for example, matching potential reviewers against a prototypical member of the cluster. Indeed, prior to Revaide the NSF had experimented with the use of automated clustering for panel formation, but those attempts had proved unsuccessful for a number of reasons: the sizes of clusters tended to be uneven; clusters exhibited poor stability as new proposals arrived incrementally; there was a lack of alignment of panels with the NSF organizational structure; and, similarly, no alignment with specific competition goals, such as increasing participation of underrepresented groups or creating results of interest to industry. So, eschewing clustering, Revaide instead supported the established manual process by annotating each proposal with its top 20 terms as a practical alternative to manually supplied keywords.
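For completeness, here is a sketch of the clustering route that was tried and abandoned, assuming proposal profile vectors of the kind built above; the cluster count, dimensions, and data are made up, and the point is only to show the kind of output (cluster sizes, a prototypical proposal per cluster) one would inspect.

    # Minimal sketch: cluster proposal profile vectors into candidate panels,
    # report cluster sizes (unevenness was one reported problem), and pick a
    # prototypical proposal per cluster to match potential reviewers against.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    proposals = rng.random((500, 200))        # placeholder: 500 proposals, 200 terms

    k = 20
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(proposals)
    print("cluster sizes:", np.bincount(km.labels_, minlength=k))

    # Prototype = the proposal closest to its cluster's centroid.
    dists = np.linalg.norm(proposals - km.cluster_centers_[km.labels_], axis=1)
    prototypes = [int(np.argmin(np.where(km.labels_ == c, dists, np.inf)))
                  for c in range(k)]
    print("prototypical proposal per cluster:", prototypes)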
Other ideas for tool support in panel formation were considered. Inspired by conference peer review, the NSF experimented with bidding but found that reviewers had strong preferences toward well-known researchers and that this approach failed to ensure there were reviewers from all contributing disciplines of a multidisciplinary proposal, a particular concern for the NSF. Again, manual processes won out.
out. However, Revaide did find a valuable role for clustering techniques
as a way of checking manual assignments of proposals to panels. To do
this, Revaide calculated an “average”
vector for each panel, by taking the
central point of the vectors of its panel members, and then compared each
proposal’s vector against every panel.
If a proposal’s assigned panel is not
its closest panel then the program director is warned. Using this method,
Revaide proposed better assignments
for 5% of all proposals. Using the same
representation, Revaide was also used
to classify orphaned proposals, suggesting a suitable panel. Although
the classifier was only 80% accurate,
which is clearly not good enough
for a fully automated assignment, it
played a valuable role within the NSF
workflow: so, instead of each program
director having to sift through, say,
1,000 orphaned proposals they received an initial assignment of, say,
100 of which they would need to reassign around 20 to other panels.
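The consistency check and the orphan classifier just described use the same ingredients, so a single sketch can cover both; the panel names, vectors, and the choice of cosine similarity are assumptions for illustration rather than details taken from Revaide.

    # Minimal sketch: average each panel's member vectors into a centroid,
    # warn when a proposal's assigned panel is not its closest panel, and
    # suggest the closest panel for an orphaned (unassigned) proposal.
    import numpy as np

    def panel_centroids(members):
        # members: {panel_name: array of shape (n_members, n_terms)}
        return {panel: vecs.mean(axis=0) for panel, vecs in members.items()}

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def closest_panel(vec, centroids):
        return max(centroids, key=lambda panel: cosine(vec, centroids[panel]))

    def check_assignments(proposal_vecs, assigned, centroids):
        # proposal_vecs: {proposal_id: vector}; assigned: {proposal_id: panel_name}
        warnings = {}
        for pid, vec in proposal_vecs.items():
            best = closest_panel(vec, centroids)
            if best != assigned[pid]:
                warnings[pid] = best      # suggested reassignment for the program director
        return warnings

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        centroids = panel_centroids({"panel_A": rng.random((5, 50)),
                                     "panel_B": rng.random((5, 50))})
        print("orphan goes to:", closest_panel(rng.random(50), centroids))
        print(check_assignments({"prop_1": rng.random(50)},
                                {"prop_1": "panel_A"}, centroids))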
Conclusion and Outlook
We have demonstrated that state-of-the-art tools from machine learning and artificial intelligence are making inroads to automate and improve parts of the peer review process. Allocating papers (or grant proposals) to reviewers is an area where much progress has been made. The combinatorial allocation problem can eas-