their cooperative counterparts. On the
other hand, having more players work
on the same input is wasteful in terms
of “computational efficiency,” an important criterion for evaluating the
utility of a given game.
GWAP Evaluation
How might a game’s performance be judged? If two different GWAPs solve the same problem, which is better? We describe a set of metrics for determining GWAP success, including throughput, lifetime play, and expected contribution.
Game efficiency and expected contribution. If we treat games as if they
were algorithms, efficiency would be a
natural metric of evaluation. There are
many possible algorithms for any given
problem, some more efficient than others. Similarly, many possible GWAPs
are available for any given problem. To choose the best solution to a problem, we need a way to compare the alternatives in terms of efficiency.
Efficiency of standard algorithms is
measured by counting atomic steps.
For instance, QuickSort is said to run
in O(n log n) time, meaning it sorts a list
of n elements in roughly n log n computational steps. In the case of GWAPs,
the notion of what constitutes a computational step is less clear. Therefore,
we must define efficiency through other means.
First, we define the throughput of a
GWAP as the average number of problem instances solved, or input-output
mappings performed, per human-hour. For example, the throughput of
the ESP Game is roughly 233 labels per human-hour.22 This is calculated by examining how many individual inputs,
or images, are matched with outputs,
or labels, over a certain period of time.
Learning curves and variations in player skill must be considered in calculating throughput. Most games involve a certain amount of learning: with repeated sessions, players become more skilled at the game. For the game templates we described earlier, such learning can result in faster game play over time. To account for variance in player skill and for changes in player speed that come with learning, we define throughput as the average number of problem instances solved per human-hour, where the average is taken over all game sessions during a reasonably lengthy period of time and over all players of the game.
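To make this concrete, the following sketch computes throughput from session logs. It is a minimal illustration under assumed names: the Session record and its fields are our own, not part of any actual GWAP implementation.

```python
from dataclasses import dataclass

@dataclass
class Session:
    """One play session from a hypothetical game log."""
    player_id: str          # who played
    hours_played: float     # session length in human-hours
    instances_solved: int   # input-output mappings produced

def throughput(sessions: list[Session]) -> float:
    """Average number of problem instances solved per human-hour,
    taken over all sessions and all players."""
    total_hours = sum(s.hours_played for s in sessions)
    total_solved = sum(s.instances_solved for s in sessions)
    return total_solved / total_hours if total_hours > 0 else 0.0
```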
Games with higher throughput
should be preferred over those with
lower throughput. But throughput
is not the end of the story. Because a
GWAP is a game, “fun” must also be taken into account. It does not matter how many
problem instances are addressed by
a given game if nobody wants to play.
The real measure of utility for a GWAP
is therefore a combination of throughput and enjoyability.
Enjoyability is difficult to quantify
and depends on the precise implementation and design of each game. Even
seemingly trivial modifications to a
game’s user interface or scoring system
can significantly affect how enjoyable
it is to play. Our approach to quantifying this elusive measure is to calculate, and use as a proxy, the “average lifetime play” (ALP) of a game. ALP is the overall amount of time the game is played by each player, averaged across all people who have played it. For instance, on
average, each player of the ESP Game
plays for a total of 91 minutes.
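Under the same assumed log format as the throughput sketch above, ALP can be computed by totaling each player’s play time and averaging across players:

```python
from collections import defaultdict

def average_lifetime_play(sessions: list[Session]) -> float:
    """ALP in minutes: each player's total play time,
    averaged across all players who appear in the logs."""
    minutes: dict[str, float] = defaultdict(float)
    for s in sessions:
        minutes[s.player_id] += s.hours_played * 60
    return sum(minutes.values()) / len(minutes) if minutes else 0.0
```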
“Expected contribution” is our summary measure of GWAP quality. Once a
game developer knows on average how
many problems are solved per human-hour spent in the game (throughput)
and how much time each player can
be expected to spend in a game (ALP),
these metrics can be combined to assess each player’s expected contribution. Expected contribution indicates
the average number of problem instances a single human player can be
expected to solve by playing a particular game. Developers can then use this
measure as a general way of evaluating
GWAPs. We define the three GWAP
metrics this way:
Throughput = average number of problem instances solved per human-hour;
ALP = overall amount of time the game will be played by an individual player, averaged across all people who play it; and
Expected contribution = throughput multiplied by ALP.
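Plugging in the ESP Game figures cited earlier, a throughput of roughly 233 labels per human-hour and an ALP of 91 minutes (about 1.52 human-hours) yield an expected contribution of roughly 233 × 1.52 ≈ 350 labels per player. In code, the summary measure simply combines the two sketches above (again, an illustration under our assumed log format):

```python
def expected_contribution(sessions: list[Session]) -> float:
    """Expected number of problem instances a single player
    contributes: throughput multiplied by ALP (converted to hours)."""
    return throughput(sessions) * (average_lifetime_play(sessions) / 60)
```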
Although this approach does not
capture certain aspects of games (such
as “popularity” and contagion, or word
of mouth), it is a fairly stable measure
of a game’s usefulness. Previous work
in the usability tradition on measuring fun and game enjoyment has suggested the usefulness of self-report
questionnaire measures.7,14 However, a behavioral measure (such as ALP) provides a more accurate and direct assessment of how much people play the game and, in turn, how useful the game is for computational purposes.
Finally, a GWAP’s developers must
verify that the game’s design is indeed
correct; that is, that the output of the
game maps properly to the particular
inputs that were fed into it. One way to
do this (as with the ESP Game, Peekaboom, Phetch, and Verbosity) is to analyze the output with the help of human
volunteers. We have employed two techniques for this kind of output verification: comparing the output produced in the game to outputs generated by paid participants (rather than game players)22 and having independent “raters” evaluate the quality of the game’s output.22 Output from a GWAP should be of comparable quality to output produced by paid subjects.
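As a simple sketch of the first technique, the game’s output for each input can be compared against output from paid participants. The agreement measure and data layout below are illustrative assumptions, not the published evaluation procedure:

```python
def label_agreement(game_labels: dict[str, set[str]],
                    paid_labels: dict[str, set[str]]) -> float:
    """Fraction of game-produced labels also produced by paid
    participants, averaged over all inputs (here, images)."""
    rates = []
    for image, labels in game_labels.items():
        reference = paid_labels.get(image, set())
        if labels:
            rates.append(len(labels & reference) / len(labels))
    return sum(rates) / len(rates) if rates else 0.0

# Hypothetical example: one image agrees fully, the other not at all.
game = {"img1": {"dog", "grass"}, "img2": {"car"}}
paid = {"img1": {"dog", "grass", "puppy"}, "img2": {"truck"}}
print(label_agreement(game, paid))  # prints 0.5
```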
Conclusion
The set of guidelines we have articulated for building GWAPs represents
the first general method for seamlessly
integrating computation and gameplay, though much work remains to be
done. Indeed, we hope researchers will
improve on the methods and metrics
we’ve described here.
Other GWAP templates likely exist
beyond the three we have presented,
and we hope future work will identify
them. We also hope to better understand problem-template fit, that is,
whether certain templates are better
suited for some types of computational
problems than others.
The game templates we have developed thus far have focused on similarity as a way to ensure output correctness; players are rewarded for thinking
like other players. This approach may
not be optimal for certain types of
problems; in particular, for tasks that
require creativity, diverse viewpoints
and perspectives are better suited to generating the broadest set of outputs.17
Developing new templates for such tasks
could be an interesting area to explore.
We would also like to understand
what kinds of problems, if any, fall outside the GWAP approach. The games