6s. This seems to work well in practice, but it is likely that it
is not ideal for everyone (discussed below). Our experience
suggests that keeping the in period short is preferable even
when a particular worker could type for longer than the
period, because the latency of a worker's input tended to
rise as they typed more consecutive words.
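As a rough sketch of this volume schedule (the staggering scheme and the 0.4 attenuation factor are our assumptions for illustration, not values from the paper), each worker hears full volume during their 4s in period and reduced volume during the 6s out period, with workers' periods offset so the stream is covered:

```python
def volume_for(worker_index, t, in_len=4.0, out_len=6.0):
    """Sketch of the saliency volume schedule (parameter names are ours).

    Each worker hears full volume during their in period and reduced
    volume during their out period; workers are staggered by one in
    period so that the whole stream gets covered by someone.
    """
    cycle = in_len + out_len
    offset = worker_index * in_len       # stagger workers by one in period
    phase = (t - offset) % cycle
    return 1.0 if phase < in_len else 0.4   # 0.4 = assumed attenuation


# Worker 0 is "in" during [0, 4), worker 1 during [4, 8), and so on.
print(volume_for(0, 1.0), volume_for(0, 5.0), volume_for(1, 5.0))
```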
5. IMPROVING HUMAN PERFORMANCE
Even when workers are directed to small, specific portions of
the audio, the resulting partial captions are not perfect. This
is due to several factors: bursts of increased speaking rate are common, and workers may mishear some content due to a particular accent or an audio disruption. To make
the task easier for workers, we created Time Warp,23 which
allows each worker to type what they hear in clips with a lower
playback rate, while still keeping up with real time and maintaining context from content they are not responsible for.
5.1. Warping time
Time Warp manages this by balancing the play speed during in periods, where workers are expected to caption the
audio and the playback speed is reduced, and out periods,
where workers listen to the audio and the playback speed is
increased. A cycle is one in period followed by an out period.
At the beginning of each cycle, the worker’s position in the
audio is aligned with the real-time stream. To do this, we
first need to select the number of different sets of workers
N that will be used in order to partition the stream. We call
the length of the in period Pi, the length of the out period Po
and the play speed reduction factor r. Therefore, the playback rate during in periods is 1/r. The amount of the real-time stream that gets buffered while playing at the reduced speed is compensated for by an increased playback speed of (N − 1)/(N − r) during out periods. The result is that the cycle time of the modified stream equals the cycle time of the unmodified stream.
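These relationships can be sketched in Python (function and variable names are ours). The assertion checks that a cycle's wall-clock duration equals its real-time length, which is what realigns each worker with the live stream at every cycle boundary:

```python
def time_warp_speeds(n_workers, r, p_in):
    """Compute Time Warp playback speeds for one cycle (sketch).

    n_workers: number of worker sets N partitioning the stream
    r:         play speed reduction factor for in periods
    p_in:      in period length, in seconds of real-time audio
    """
    p_out = (n_workers - 1) * p_in                 # out period length
    in_speed = 1.0 / r                             # slowed while captioning
    out_speed = (n_workers - 1) / (n_workers - r)  # sped up while listening
    # Content of length p_in played at speed 1/r takes p_in * r wall-clock
    # seconds; the out period must clear the backlog so that the cycle's
    # wall-clock time equals its real-time length, p_in + p_out.
    wall_time = p_in * r + p_out / out_speed
    assert abs(wall_time - (p_in + p_out)) < 1e-9
    return in_speed, out_speed, p_out


# The paper's settings: N = 4, r = 2, Pi = 3.25s.
print(time_warp_speeds(4, 2, 3.25))  # (0.5, 1.5, 9.75)
```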
To set the length of Pi for our experiments, we conducted
preliminary studies with 17 workers drawn from Mechanical
Turk. We found that their mean typing speed was 42.8 WPM
on a similar real-time captioning task. We also found that
a worker could type at most 8 words in a row on average before
the per-word latency exceeded 8s (our upper bound on acceptable latency). Since the mean speaking rate is around 150 WPM,13
workers will hear 8 words in roughly 3.2s, with an entry time
of roughly 8s from the last word spoken. We used this to set
Pi = 3.25s, Po = 9.75s, and N = 4. We chose r = 2 in our tests, so that
the playback speed during in periods is 1/r = 1/2 = 0.5 times real time, and
the play speed during out periods is (N − 1)/(N − r) = 3/2 = 1.5 times real time.
To speed up and slow down the play speed of content
being provided to workers without changing the pitch
(which would make the content more difficult to understand for the worker), we use the Waveform Similarity
Based Overlap and Add (WSOLA) algorithm.4 WSOLA works
by dividing the signal into small segments, then either
skipping (to increase play speed) or adding (to decrease
play speed) content, and finally stitching these segments
back together. To reduce the number of sound artifacts,
WSOLA finds overlap points with similar waveforms, then
gradually transitions between segments during these overlaps.
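A minimal WSOLA-style time stretch can be sketched as follows. This is an illustrative simplification, not the production algorithm: the frame size, hop, search radius, and dot-product similarity measure are our choices.

```python
import numpy as np


def wsola_stretch(signal, rate, frame=1024, hop_out=512, search=256):
    """Minimal WSOLA-style time stretch (sketch).

    rate > 1 speeds playback up (skips content); rate < 1 slows it
    down (repeats content). Pitch is preserved because each output
    frame is an unmodified slice of the input.
    """
    hop_in = int(hop_out * rate)        # analysis hop in the input
    window = np.hanning(frame)
    n_frames = (len(signal) - frame - search) // hop_in
    out = np.zeros(n_frames * hop_out + frame)
    norm = np.zeros_like(out)
    prev_tail = None
    pos = 0
    for i in range(n_frames):
        target = i * hop_in
        if prev_tail is not None:
            # Search near the nominal position for the offset whose
            # waveform best matches the tail of the previous frame,
            # so consecutive frames overlap with similar waveforms.
            best, best_score = 0, -np.inf
            for off in range(max(-search, -target), search):
                cand = signal[target + off: target + off + len(prev_tail)]
                score = np.dot(cand, prev_tail)
                if score > best_score:
                    best, best_score = off, score
            target += best
        # Windowed overlap-add: the window cross-fades between segments.
        out[pos: pos + frame] += signal[target: target + frame] * window
        norm[pos: pos + frame] += window
        prev_tail = signal[target + hop_out: target + hop_out + frame // 4]
        pos += hop_out
    return out / np.maximum(norm, 1e-8)
```

Stretching a signal with rate = 0.5 roughly doubles its length (the slowed in-period playback), while rate = 1.5 shortens it (the sped-up out-period playback).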
In the following sections, we detail the co-evolution of the
worker interface and algorithm for merging partial captions
in order to form a final transcript.
4. COORDINATING CAPTIONISTS
Scribe’s non-expert captioning interface allows contributors
to hear an audio stream of the speaker(s), and provide captions with a simple user interface (UI) (Figure 2). Captionists
are instructed to type as much as they can, but are under no
pressure to type everything they hear. If they are able, workers are asked to separate contiguous sequences of words by
pressing enter. Knowing which word sequences are likely to
be contiguous can help later when recombining the partial
captions from multiple captionists.
To encourage real-time entry of captions, the interface
“locks in” words a short time after they are typed (500ms).
New words are identified when the captionist types a space
after the word, and are sent to the server. The delay is added to
allow workers to correct their input while adding as little additional latency as possible. When the captionist presses
enter (or following a 2s timeout during which they have not
typed anything), the line is confirmed and animates upward.
During the 10–15s trip to the top of the display (depending
on settings), words that Scribe determines were entered correctly (based on either spell-checking or overlap with another
worker) are colored green. When the line reaches the top, a
point score is calculated for each word based on its length
and whether it has been determined to be correct.
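The timing rules above (the 500ms word lock-in window and the 2s line timeout) can be sketched as follows; the class and constant names are hypothetical, not Scribe's actual implementation:

```python
import time

# Hypothetical sketch of the client-side lock-in timing (names are ours).
WORD_LOCK_DELAY = 0.5   # seconds a word stays editable after it is typed
LINE_TIMEOUT = 2.0      # idle seconds before a line is auto-confirmed


class CaptionLine:
    def __init__(self):
        self.words = []                      # (word, typed_at) pairs
        self.last_keystroke = time.monotonic()

    def add_word(self, word):
        """Called when the captionist types a space after a word."""
        now = time.monotonic()
        self.words.append((word, now))
        self.last_keystroke = now

    def locked_words(self, now=None):
        """Words past the 500 ms correction window; these go to the server."""
        now = time.monotonic() if now is None else now
        return [w for w, t in self.words if now - t >= WORD_LOCK_DELAY]

    def should_confirm(self, now=None):
        """Auto-confirm after 2 s of inactivity (enter confirms explicitly)."""
        now = time.monotonic() if now is None else now
        return bool(self.words) and now - self.last_keystroke >= LINE_TIMEOUT
```

The short lock-in delay trades a small amount of added latency for the ability to correct typos, while the idle timeout keeps partially typed lines from stalling the stream.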
To recover the true speech, non-expert captions must
cover all of the words spoken. A primary reason why the partial transcriptions may not fully cover the true signal relates
to saliency, which is defined in a linguistic context as “that
quality which determines how semantic material is distributed within a sentence or discourse, in terms of the relative
emphasis which is placed on its various parts".7 Numerous
factors influence what is salient, and so it is likely to be difficult to detect automatically. Instead, we inject artificial
saliency adjustments by systematically varying the volume
of the audio signal that captionists hear. Scribe’s captionist
interface is able to vary the volume over a given period with
an assigned offset. It also displays visual reminders of the
period to further reinforce this notion.
Initially, we tried dividing the audio signal into segments
that we gave to individual workers. We found several problems with this approach. First, workers tended to take longer to provide their transcriptions as it took them some time
to get into the flow of the audio. A continuous stream avoids
this problem. Second, the interface seemed to encourage
workers to favor quality over speed, whereas streaming content reminds workers of the real-time nature of the task. The
continuous interface was designed in an iterative process
involving tests with 57 remote and local users with a range
of backgrounds and typing abilities. These tests showed that
workers tended to provide chains of words rather than disjoint words, and needed to be informed of the motivations
behind aspects of the interface to use them properly.
A non-obvious question is what the period of the volume
changes should be. In our experiments, we chose to play the
audio at regular volume for 4s and then at a lower volume for