The current generation of CMSs does not offer computational support for the formation of a balanced program committee; these systems assume the list of potential reviewers already exists and instead concentrate on supporting the administrative workflow of issuing and accepting invitations.
Expert finding. This lack of tool support is surprising considering the body of relevant work in the long-established field of expert finding [2, 11, 15, 34, 47]. Over the years since the first Text Retrieval Conference (TREC) in 1992, the task of finding experts on a particular topic has featured regularly in this long-running conference series and is now an active subfield of the broader text information retrieval discipline. Expert finding has a degree of overlap with the fields of bibliometrics, the quantitative analysis of academic publications and other research-related literature [21, 38], and scientometrics, which extends the scope to include grants, patents, discoveries, data outputs and, in the U.K., more abstract concepts such as 'impact' [5]. Expert finding tends to be more profile-based (for example, based on the text of documents) than link-based (for example, based on cross-references between documents), although content analysis is an active area of bibliometrics in particular and has been used in combination with citation properties to link research topics to specific authors [11].
Even though scientometrics encompasses additional measures compared with bibliometrics, in practice the dominant approach in both domains is citation analysis of the academic literature. Citation analysis measures the properties of networks of citations among publications and has much in common with hyperlink analysis on the Web; both employ similar graph-theoretic methods designed to model reputation, with notable examples including Hubs and Authorities, and PageRank.
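To make the graph-theoretic connection concrete, the sketch below runs a PageRank-style computation directly on a small citation graph; the toy graph, damping factor, and iteration count are illustrative assumptions rather than settings taken from any cited system.

```python
# A minimal sketch of PageRank-style reputation scoring on a citation
# graph, via power iteration. Graph and parameters are illustrative.

def pagerank(citations, damping=0.85, iters=50):
    """citations maps each paper to the list of papers it cites."""
    papers = list(citations)
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in papers}
        for p, cited in citations.items():
            if cited:  # distribute p's rank over the papers it cites
                share = damping * rank[p] / len(cited)
                for q in cited:
                    new[q] += share
            else:      # dangling node: spread its rank uniformly
                for q in papers:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Toy example: paper C is cited by both A and B, so it ranks highest.
toy = {"A": ["C"], "B": ["C", "A"], "C": []}
print(sorted(pagerank(toy).items(), key=lambda kv: -kv[1]))
```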
Citation-graph analysis using a particle-swarm algorithm has been used to suggest potential reviewers for a paper, on the premise that the subject of a paper is characterized by the authors it cites [39].
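The particle-swarm algorithm itself is beyond a short sketch, but the premise it rests on can be illustrated in a few lines: score each candidate reviewer by how often the submission's reference list cites their papers. The input structures and names below are hypothetical, and this simple tally is a baseline in the spirit of the premise, not the method of [39].

```python
from collections import Counter

# Hedged sketch: rank candidate reviewers by how often the submission
# cites papers they authored. `authors_of` maps a cited paper id to its
# author list (an assumed input).

def suggest_reviewers(cited_papers, authors_of, top_k=3):
    votes = Counter()
    for paper in cited_papers:
        for author in authors_of.get(paper, []):
            votes[author] += 1        # one vote per citation to an author
    return [a for a, _ in votes.most_common(top_k)]

refs = ["p1", "p2", "p3"]             # the submission's reference list
authors = {"p1": ["Ada"], "p2": ["Ada", "Bob"], "p3": ["Cleo"]}
print(suggest_reviewers(refs, authors))   # ['Ada', 'Bob', 'Cleo']
```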
Harvard's Profiles Research Networking Software (RNS)d exploits both graph-based and text-based methods.
d http://profiles.catalyst.harvard.edu
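A system combining the two families of signals can be caricatured as a weighted blend of a text-similarity score and a graph-centrality score. The bag-of-words profiles, the cosine measure, and the 0.5 weight below are assumptions for illustration, not the actual Profiles RNS scoring.

```python
import math

# Hedged sketch: blend a text score (cosine over bag-of-words profiles)
# with a precomputed graph score (for example, PageRank) to rank experts.

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_experts(topic_vec, profiles, graph_score, w=0.5):
    # profiles: expert -> bag-of-words of their publications
    # graph_score: expert -> centrality in the citation graph
    return sorted(
        profiles,
        key=lambda e: w * cosine(topic_vec, profiles[e])
                      + (1 - w) * graph_score.get(e, 0.0),
        reverse=True,
    )

topic = {"peer": 1.0, "review": 1.0}
profs = {"ada": {"peer": 2.0, "review": 1.0}, "bob": {"systems": 3.0}}
print(rank_experts(topic, profs, {"ada": 0.1, "bob": 0.4}))  # ['ada', 'bob']
```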
Reviewer scores themselves need calibrating. To get information about a reviewer's score bias (do they tend to err on the accepting side or rather on the rejecting side?) and spread (do they tend to score more or less confidently?), we need a representative sample of papers with a reasonable distribution in quality. This is often problematic for single conferences, as the number of papers m reviewed by a single reviewer is too small to be representative, and there can be considerable variation in the quality of papers among different batches that should not be attributed to reviewers. It is, however, possible to get more information about reviewer bias and confidence by leveraging the fact that papers are reviewed by several reviewers. For SIGKDD'09 we used a generative probabilistic model proposed by colleagues at Microsoft Research Cambridge, with latent (unobserved) variables that can be inferred by message-passing techniques such as Expectation Propagation [35]. The latent variables include the true paper quality, the numerical score assigned by the reviewer, and the thresholds this particular reviewer uses to convert the numerical score to the observed recommendation on the seven-point scale. The calibration process is described in more detail in Flach et al. [18].
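The model of [35] infers these latent variables with Expectation Propagation; as a much cruder illustration of the same calibration idea, the sketch below fits an additive model (observed score = paper quality + reviewer bias) by alternating averages. The data layout and toy scores are assumptions for illustration, not the SIGKDD'09 setup.

```python
# A deliberately crude stand-in for the latent-variable calibration model:
# alternately re-estimate paper quality and per-reviewer bias from a
# table of (paper, reviewer, score) triples.

def calibrate(scores, iters=20):
    papers = {p for p, _, _ in scores}
    reviewers = {r for _, r, _ in scores}
    quality = {p: 0.0 for p in papers}
    bias = {r: 0.0 for r in reviewers}
    for _ in range(iters):
        for p in papers:     # quality: mean of bias-corrected scores
            vals = [s - bias[rev] for pap, rev, s in scores if pap == p]
            quality[p] = sum(vals) / len(vals)
        for r in reviewers:  # bias: mean residual over papers scored
            vals = [s - quality[pap] for pap, rev, s in scores if rev == r]
            bias[r] = sum(vals) / len(vals)
    return quality, bias

# Toy data: reviewer r2 scores every paper one point higher than r1.
data = [("a", "r1", 5), ("a", "r2", 6), ("b", "r1", 3), ("b", "r2", 4)]
print(calibrate(data)[1])  # r2's bias comes out about one point above r1's
```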
An interesting manifestation of reviewer variance came to light through an experiment with NIPS reviewing in 2014 [27]. The PC chairs decided to have one-tenth (166) of the submitted papers reviewed twice, each by three reviewers and one area chair. It turned out that the accept/reject recommendations of the two area chairs differed in about one quarter of the cases (43). Given an overall acceptance rate of 22.5%, roughly 38 of the 166 double-reviewed papers were accepted following the recommendation of one of the area chairs; about 22 of these would have been rejected if the recommendation of the other area chair had been followed instead (assuming the disagreements were uniformly distributed over the two possibilities), which suggests that more than half (57%) of the accepted papers would not have made it to the conference if reviewed a second time.
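The arithmetic behind these figures can be made explicit (the uniform-split assumption is the one stated above):

```python
# Reproducing the arithmetic of the NIPS experiment as reported above.
submitted_twice = 166
disagreements = 43                        # ~1/4 of 166 papers
accept_rate = 0.225

accepted = accept_rate * submitted_twice  # 37.35 -> "roughly 38"
flipped = disagreements / 2               # 21.5  -> "about 22", uniform split
print(f"{flipped / accepted:.0%}")        # ~57-58%, i.e. "more than half"
```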
What can be concluded from what came to be known as the "NIPS experiment" beyond these basic numbers is up for debate. It is worth pointing out that, while the peer review process eventually leads to a binary accept/reject decision, paper quality most certainly is not binary: while a certain fraction of papers clearly deserves to be accepted, and another fraction clearly deserves to be rejected, the remaining papers have pros and cons that can be weighed up in different ways. So if two reviewers assign different scores to a paper, this doesn't mean that one of them is wrong, but rather that they picked up on different aspects of the paper in different ways.
We suggest a good way forward is to think of the reviewer's job as "profiling" the paper in terms of its strong and weak points, and to separate the reviewing job proper from the eventual accept/reject decision. One could imagine a situation where a submitted paper could go to a number of venues (including the 'null' venue), and the reviewing task is to help decide which of these venues is the most appropriate one. This would turn the peer review process into a matching process, where publication venues have a distinct profile (whether a venue accepts theoretical or applied papers, whether it puts more value on novelty or on technical depth, among others) to be matched against the submission's profile as established by the peer review process. Indeed, some conferences already have a separate journal track, which implies some form of reviewing process to decide which venue is the most suitable one.c
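As a toy rendering of such a matching process, papers and venues could be profiled over shared dimensions and matched by a simple score; the dimensions, weights, and dot-product scoring below are illustrative assumptions, not a proposal from the literature.

```python
# Hedged sketch: match a paper's peer-review profile to venue profiles.
# Profiles are scores over shared dimensions; "null" is the reject venue.

def best_venue(paper_profile, venues):
    def score(venue_profile):
        return sum(paper_profile[d] * venue_profile.get(d, 0)
                   for d in paper_profile)
    return max(venues, key=lambda v: score(venues[v]))

venues = {
    "theory_conf":  {"novelty": 0.3, "depth": 0.7},
    "applied_conf": {"novelty": 0.5, "application": 0.5},
    "null": {},                       # rejection: matches nothing well
}
paper = {"novelty": 0.6, "depth": 0.2, "application": 0.7}
print(best_venue(paper, venues))      # -> applied_conf
```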
Assembling Peer Review Panels
The formation of a pool of reviewers, whether for conferences, journals, or funding competitions, is a non-trivial process that seeks to balance a range of objective and subjective factors. In practice, the actual process by which a program chair assembles a program committee varies from, at one extreme, inviting friends and co-authors plus their friends and co-authors, through to the other extreme of a formalized election and representation mechanism.
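Minimal computational support for this balancing act is easy to imagine, even if no current CMS provides it. The greedy sketch below repeatedly picks the candidate who adds the most uncovered topics, subject to a cap per affiliation; the candidate data, the coverage objective, and the cap are all illustrative assumptions.

```python
from collections import Counter

# Hedged sketch: greedily assemble a committee that covers topics while
# capping how many members share an affiliation (a crude balance proxy).

def assemble(candidates, size, max_per_affiliation=2):
    committee, covered, per_aff = [], set(), Counter()
    while len(committee) < size:
        pool = [c for c in candidates
                if c not in committee
                and per_aff[candidates[c]["aff"]] < max_per_affiliation]
        if not pool:
            break
        pick = max(pool, key=lambda c: len(candidates[c]["topics"] - covered))
        committee.append(pick)
        covered |= candidates[pick]["topics"]
        per_aff[candidates[pick]["aff"]] += 1
    return committee

people = {
    "Ada":  {"topics": {"ml", "theory"}, "aff": "X"},
    "Bob":  {"topics": {"ml"},           "aff": "X"},
    "Cleo": {"topics": {"systems"},      "aff": "Y"},
}
print(assemble(people, size=2))  # -> ['Ada', 'Cleo']
```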
c For example, the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD) has a journal track where accepted papers are presented at the conference but published either in the Machine Learning journal or in Data Mining and Knowledge Discovery.