However, these mechanisms are
less effective or absent in online micro-task markets such as Mechanical Turk.
Social norms and sanctions are largely
absent as workers are interchangeable
and usually identified only by their
worker ID—a random string of numbers and letters.
Interestingly, there are social sanctions of employers by the workers in
Mechanical Turk, who use forums
such as turkernation.com to alert other
workers about bad employers (for more,
see “Ethics and Tactics of Professional
Crowdwork,” page 39). This is partially
in response to the asymmetrical power
balance in Mechanical Turk: employers can reject work for any reason with
no recourse for the worker who has
already put in the effort but does not
get paid. Unification sites like Turker Nation let workers sanction employers
who consistently do not approve legitimate work.
New identities are relatively easily
created, though they do require some
external validation, such as credit card
verification. Monitoring is essentially
non-existent since the worker may be
in any physical location in the world.
Furthermore, employers don’t know
whether a worker who has accepted
the task is actually engaged in it or is
multitasking or watching TV.
Workers can choose between employers and jobs easily with no switching costs. Although some rudimentary
reputation systems do exist (such as
tracking the proportion of work rejected), workers with even low reputations
can often find jobs to complete. There
are no explicit contracts. Even after a
worker has accepted a job, she can return it any time for any reason without
consequence.
In the absence of external mechanisms for enforcing quality responses
in subjective tasks, we turned to the
design of the task itself. Specifically,
we had two key criteria for task design.
First, we wanted it to take the same
amount of effort for a worker to enter
an invalid but believable response as a
valid one written in good faith. Second,
we wanted to signal to the workers that
their output would be monitored and
evaluated.
To meet these criteria, we altered
the rating task. Instead of subjective
ratings followed by subjective feedback about what could be improved,
we required turkers to complete three
simple questions that had verifiable,
quantitative answers, such as how
many references/images/sections the
article had. We also asked turkers to
provide between four and six key words
summarizing the article. Importantly,
we selected these questions to align
with what Wikipedia experts claimed
they used when rating articles (such
as examining the references or the
article structure), with the goal that
by answering these questions, they
would have a reasonable judgment of
the quality of the article. We placed
the verifiable questions before the
subjective questions so workers would
have the opportunity to develop this
judgment before even having to think
about subjective questions. Finally,
since these questions have concretely
verifiable answers, they signal that
workers’ responses can and will be
evaluated—preventing gaming in the
first place and potentially increasing
effort (criterion 2).
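The verifiable questions described above lend themselves to an automated check. The following is a minimal sketch, not the procedure used in the study: the `Response` record, the `is_plausible` function, the ground-truth dictionary, and the tolerance of one count are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Response:
    """One worker's answers to the verifiable questions (hypothetical schema)."""
    worker_id: str
    references: int
    images: int
    sections: int
    keywords: List[str]


def is_plausible(resp: Response, truth: Dict[str, int], tolerance: int = 1) -> bool:
    """Return True if each count is within `tolerance` of the known value
    and the worker supplied between four and six keywords.

    The tolerance is an assumed allowance for honest miscounts.
    """
    counts_ok = all(
        abs(getattr(resp, field) - truth[field]) <= tolerance
        for field in ("references", "images", "sections")
    )
    keywords_ok = 4 <= len(resp.keywords) <= 6
    return counts_ok and keywords_ok
```

A requester could run a check like this on each submission before reading the subjective ratings, so that responses entered without looking at the article are flagged at low cost.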
Re-running our experiment with
the new task design led to dramatically
better results. The percentage of invalid
comments dropped from 49 percent
to 3 percent, improving by more than
a factor of 10. Time spent on the task
also more than doubled, suggesting
increased effort. This was borne out by
a positive and statistically significant
correlation between turker ratings and
those of expert Wikipedians. Finally,
we found that we tapped a more diverse
group, with more users contributing
and a more even spread of contributions
across users. (Details of the
study can be found in “Crowdsourcing
User Studies With Mechanical Turk,”
in Proceedings of the ACM Conference on
Human Factors in Computing Systems,
2008.)
COLLABORATIVE CROWDSOURCING
One common assumption about Mechanical Turk is that turkers must
work independently of each other.
Most tasks involve turkers each making an independent judgment about
an object (such as providing a label for
an image) with their judgments aggregated afterward.
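Aggregating independent judgments can be as simple as a majority vote. The sketch below is one common way to do it, offered as an illustration only; the function name `aggregate_labels` and the input layout (item ID mapped to a list of labels) are assumptions, and the article does not say which aggregation method any particular requester uses.

```python
from collections import Counter
from typing import Dict, List, Tuple


def aggregate_labels(judgments: Dict[str, List[str]]) -> Dict[str, Tuple[str, float]]:
    """Pick the most frequent label for each item by majority vote.

    Returns, for each item, the winning label and the fraction of workers
    who chose it. Ties go to the label that was submitted first.
    """
    result = {}
    for item_id, labels in judgments.items():
        label, count = Counter(labels).most_common(1)[0]
        result[item_id] = (label, count / len(labels))
    return result
```

The agreement fraction gives a rough signal of how contested an item is, which a requester might use to decide whether to collect more judgments.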
However, even interdependent tasks
do not involve turkers interacting with
each other. For example, the company
CastingWords accomplished podcast
transcriptions in a serial fashion: one
turker may do the initial transcription;
the transcription is automatically split
into segments; other turkers verify or improve the segments. Throughout, turkers never have to interact with
each other despite using the results of
each other’s work. This is a reasonable
approach when a requester does not
know who will accept a task, when they
will complete it, what the quality of the
work will be, and when there are few dependencies such that work can be easily
split up and done in parallel.
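The serial workflow described above (transcribe, split, then verify each segment) can be sketched as a small pipeline. This is a hypothetical illustration and not CastingWords' actual system: the function names, the fixed segment length in words, and the `review` callback standing in for a second worker are all assumptions.

```python
from typing import Callable, List


def split_into_segments(transcript: str, words_per_segment: int = 50) -> List[str]:
    """Split a transcript into consecutive segments of a fixed number of words."""
    words = transcript.split()
    return [
        " ".join(words[i:i + words_per_segment])
        for i in range(0, len(words), words_per_segment)
    ]


def run_serial_pipeline(
    transcript: str,
    review: Callable[[str], str],
    words_per_segment: int = 50,
) -> str:
    """Pass each segment to an independent reviewer and rejoin the results.

    `review` stands in for a separate worker who verifies or improves one
    segment without ever contacting the original transcriber.
    """
    segments = split_into_segments(transcript, words_per_segment)
    reviewed = [review(segment) for segment in segments]
    return " ".join(reviewed)
```

Because each call to `review` depends only on its own segment, the review step could be handed to many workers in parallel, which matches the low-dependency setting the text describes.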