94 COMMUNICATIONS OF THE ACM | SEPTEMBER 2017 | VOL. 60 | NO. 9
workers on Mechanical Turk can be recruited within a few seconds,1,2,11 and engaged in continuous tasks.21,24,25,28 Recruiting
from a broader pool allows workers to be selectively chosen
for their expertise not in captioning but in the technical
areas covered in a lecture. While professional stenographers
are able to type faster and more accurately than most crowd
workers, they are not necessarily experts in the field they are
captioning, which can lead to mistakes that distort the meaning of transcripts of technical talks.30 Scribe allows student
workers to serve as non-expert captionists for $8–$12 per hour
(a typical work-study pay rate). Therefore, we could hire several students for much less than the cost of one professional captionist.
Scribe makes it possible for non-experts to collaboratively caption speech in real time by providing automated
assistance in two ways. First, it assists captionists by making the task easier for each individual. It directs each
worker to type only part of the audio stream, it slows down
the portion they are asked to type so they can more easily
keep up, and it adaptively determines the segment length
based on each individual’s typing speed. Second, it solves
the coordination problem for workers by automatically
merging the partial input of multiple workers into a single
transcript using a custom version of multiple-sequence alignment.
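The adaptive assistance described above can be sketched as follows. This is an illustrative sketch only: the function names, the default speech rate, and the clamping constants are assumptions, not Scribe's actual parameters.

```python
# Sketch of Scribe-style adaptive assistance for a single captionist.
# Slower typists get shorter audio segments and slowed-down playback.

def segment_length(typing_wpm, speech_wpm=150.0, base_seconds=10.0):
    """Scale the audio segment a worker is asked to caption so that a
    slower typist receives a shorter, more manageable segment."""
    # Ratio < 1 means the worker types slower than the speech rate.
    ratio = typing_wpm / speech_wpm
    # Clamp so segments stay within a usable range.
    return max(3.0, min(base_seconds * ratio, base_seconds))

def playback_rate(typing_wpm, speech_wpm=150.0):
    """Slow the audio down for workers who cannot keep up in real time."""
    return min(1.0, typing_wpm / speech_wpm)
```

A worker who types at half the speaking rate would thus receive half-length segments played at half speed, at the cost of falling further behind real time; combining many such workers is what recovers full coverage.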
Because captions are dynamic, readers spend far more
mental effort reading real-time captions compared to
static text. Also, regardless of method, captions require
users to absorb information that is otherwise consumed
via two senses (vision and hearing) via only one (vision).
In classroom settings, this problem is particularly acute,
with content appearing on the board and being referenced in speech. The effort required to track both the
captions and the material they pertain to simultaneously
is one possible reason why deaf students often lag behind
their hearing peers, even with the best accommodations.26
To address these issues, we also explore how captions
can best be presented to users,16 and show that controlling bookmarks in caption playback can even increase comprehension.
This paper outlines the following contributions:
• Scribe, an end-to-end system that has advantages over
current state-of-the-art solutions in terms of availability, cost, and accuracy.
• Evidence that non-experts can collectively cover speech
at rates similar to or above that of a professional.
• Methods for quickly merging multiple partial captions
to create a single, accurate stream of final results.
• Evidence that Scribe can produce transcripts that both
cover more of the input signal and are more accurate
than either ASR or any single constituent worker.
• The idea of automatically combining the real-time
efforts of dynamic groups of workers to outperform
individuals on human performance tasks.
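The merging contribution above can be illustrated with a toy pairwise merge of two workers' word streams. Scribe's actual method is a custom multiple-sequence alignment over many workers; this difflib-based stand-in only sketches the intuition of keeping agreed words once and interleaving words that only one worker captured.

```python
# Toy pairwise caption merge (a stand-in for Scribe's custom
# multiple-sequence alignment, which handles many workers at once).
import difflib

def merge_captions(a, b):
    """Merge two partial transcripts: keep words both workers agree on
    once, and keep words only one worker typed, preserving order."""
    a_words, b_words = a.split(), b.split()
    sm = difflib.SequenceMatcher(a=a_words, b=b_words)
    merged = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            merged.extend(a_words[i1:i2])   # both workers agree
        else:
            merged.extend(a_words[i1:i2])   # words only worker A typed
            merged.extend(b_words[j1:j2])   # words only worker B typed
    return " ".join(merged)
```

For example, merging "the quick fox" with "quick brown fox jumps" recovers "the quick brown fox jumps", more of the signal than either worker alone, which is the effect the contribution claims at scale.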
2. CURRENT APPROACHES
We first overview current approaches for real-time captioning, introduce our data set, and define the evaluation
metrics used in this paper. Methods for producing real-time
captioning services come in three main varieties:
Computer-Aided Real-time Transcription (CART): CART
is the most reliable real-time captioning service, but is
also the most expensive. Trained stenographers type in
shorthand on a “steno” keyboard that maps multiple key
presses to phonemes that are expanded to verbatim text.
Stenography requires 2–3 years of training to consistently
keep up with natural speaking rates that average 141 WPM
and can reach 231 WPM.13
Non-Verbatim Captioning: In response to the cost of
CART, computer-based macro expansion services like
C-Print were introduced.30 C-Print captionists need less training and generally charge around $60 an hour. However, they
normally cannot type as fast as the average speaker’s pace,
and cannot produce a verbatim transcript. Scribe employs
captionists with no training and compensates for slower
typing speeds and lower accuracy by combining the efforts
of multiple parallel captionists.
Automated Speech Recognition: ASR works well in ideal
situations with high-quality audio equipment, but degrades
quickly in real-world settings. ASR has difficulty recognizing domain-specific jargon and adapts poorly to changes,
such as when the speaker has a cold.6 ASR systems can
require substantial computing power and special audio
equipment to work well, which lowers availability. In our
experiments, we used Dragon NaturallySpeaking 11.5 for comparison.
Re-speaking: In settings where trained typists are not
common (such as in the U.K.), alternatives have arisen. In
re-speaking, a person listens to the speech and enunciates clearly into a high-quality microphone, often in a special environment, so that ASR can produce captions with
high accuracy. This approach is generally accurate, but
cannot produce punctuation, and has considerable delay.
Additionally, re-speaking still requires extensive training,
since simultaneous speaking and listening is challenging.
3. LEGION: SCRIBE
Scribe gives users on-demand access to real-time captioning from groups of non-experts via their laptops or
mobile devices (Figure 1). When a user starts Scribe, it
immediately begins recruiting workers to the task from
Mechanical Turk, or a pool of volunteer workers, using
LegionTools.11,20 When users want to begin captioning
audio, they press the start button, which forwards audio
to Flash Media Server (FMS) and signals the Scribe server
to begin captioning.
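The start-up sequence just described can be sketched as a simple control flow. All class and method names below are assumptions standing in for LegionTools, Flash Media Server, and the Scribe server; none of them are the systems' real APIs.

```python
# Illustrative sketch of the Scribe session start-up sequence
# (stub classes; not the actual LegionTools / FMS / Scribe interfaces).

class Recruiter:
    def recruit(self):
        # LegionTools would post tasks to Mechanical Turk (or notify a
        # volunteer pool) here; we return placeholder worker IDs.
        return ["worker-1", "worker-2"]

class MediaServer:
    def __init__(self):
        self.streaming = False
    def open_stream(self):
        # User audio is forwarded to the media server when "start" is pressed.
        self.streaming = True

class ScribeServer:
    def __init__(self):
        self.captioning = False
    def begin_captioning(self, workers):
        self.captioning = bool(workers)

def start_session(recruiter, media, server):
    """Recruit workers, open the audio stream, and signal captioning."""
    workers = recruiter.recruit()
    media.open_stream()
    server.begin_captioning(workers)
    return workers
```

The key design point the sketch preserves is that recruiting begins before the user presses start, so workers are already available when audio begins to flow.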
Workers are presented with a text input interface
designed to encourage real-time answers and increase
global coverage (Figure 2). A display shows workers their
rewards for contributing in the form of both money and
points. In our experiments, we paid workers $0.005 for
every word the system thought was correct. As workers type,
their input is forwarded to an input combiner on the Scribe
server. The input combiner is modular to accommodate different implementations without needing to modify Scribe.
The combiner and interface are discussed in more detail
later in this article.
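The modular combiner boundary and the per-word reward can be sketched as follows. The interface and class names are hypothetical; only the $0.005-per-correct-word rate comes from the text.

```python
# Sketch of the combiner interface the Scribe server might expect
# (hypothetical names) plus the per-correct-word reward from the text.

class InputCombiner:
    """Boundary the server codes against; implementations are swappable
    without modifying the rest of Scribe."""
    def add_partial(self, worker_id, words):
        raise NotImplementedError
    def transcript(self):
        raise NotImplementedError

class NaiveConcatCombiner(InputCombiner):
    """Trivial implementation: concatenates partial input in arrival
    order (a real combiner aligns and merges overlapping input)."""
    def __init__(self):
        self.words = []
    def add_partial(self, worker_id, words):
        self.words.extend(words)
    def transcript(self):
        return " ".join(self.words)

REWARD_PER_WORD = 0.005  # dollars per word the system judges correct

def reward(correct_word_count):
    """Payment owed to a worker for their correctly captioned words."""
    return round(correct_word_count * REWARD_PER_WORD, 3)
```

Making the combiner an interface rather than a fixed algorithm is what lets later sections of the article evaluate different merging strategies against the same worker input.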