The user interface for Scribe presents streaming text
within a collaborative editing framework (see Figure 3).
Scribe’s interface masks the staggered and delayed format
of real-time captions with a more natural flow that mimics
writing. In doing this, the interface presents the merged
inputs from the crowd workers via a dynamically updating
Web page, and allows users to focus on reading, instead of
tracking changes. We have also developed methods that give users more control over their own caption playback, which can improve comprehension.22 When users are
done, pressing stop ends the audio stream but lets workers complete their current transcription task. Workers are
asked to continue working on other audio for a time to keep
them active so that response time is reduced if users need to
Though this article focuses on captioning speech from
a single person, Scribe can handle dialogues using automated speaker segmentation techniques. We use a standard convolution-based kernel method to first identify
distinct segments in a waveform. We then use a one-class
support vector machine (SVM) to classify each segment and
assign a speaker ID.15 Prior work has shown such segmentation techniques to be accurate even in the presence of severe
noise, such as when talking on a cellphone while driving.12
The segmentation allows us to decompose a dialogue in real time and then caption each part individually, without burdening workers with the need to determine and annotate which
person is currently speaking.
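The first step of this pipeline, finding segment boundaries with a convolution-style kernel, can be illustrated with a minimal sketch. It is not Scribe's implementation: the article does not say which features or kernel are used, so the per-frame feature (for example, short-time energy), the step-edge kernel, and the names `edge_response`, `find_boundaries`, `split_segments`, `half_width`, and `threshold` are all assumptions made for illustration. The one-class SVM step that assigns speaker IDs is omitted.

```python
from typing import List, Tuple


def edge_response(features: List[float], half_width: int) -> List[float]:
    """Correlate a step-edge kernel (-1s followed by +1s) with the features.

    A large absolute response at index i means the average feature value
    changes sharply between the frames before i and the frames from i on.
    """
    n = len(features)
    response = [0.0] * n
    for i in range(half_width, n - half_width + 1):
        left = sum(features[i - half_width:i])
        right = sum(features[i:i + half_width])
        response[i] = (right - left) / half_width
    return response


def find_boundaries(features: List[float], half_width: int = 3,
                    threshold: float = 0.5) -> List[int]:
    """Return frame indices where the kernel response has a strong local peak."""
    resp = [abs(r) for r in edge_response(features, half_width)]
    boundaries = []
    for i in range(1, len(resp) - 1):
        is_peak = resp[i] >= resp[i - 1] and resp[i] > resp[i + 1]
        if is_peak and resp[i] > threshold:
            boundaries.append(i)
    return boundaries


def split_segments(features: List[float],
                   boundaries: List[int]) -> List[Tuple[int, int]]:
    """Turn boundary indices into half-open (start, end) frame ranges."""
    edges = [0] + boundaries + [len(features)]
    return [(a, b) for a, b in zip(edges, edges[1:]) if b > a]


if __name__ == "__main__":
    # Hypothetical feature track: quiet, loud, quiet (6 frames each).
    feats = [0.2] * 6 + [1.0] * 6 + [0.2] * 6
    cuts = find_boundaries(feats, half_width=3, threshold=0.5)
    print(cuts)                         # [6, 12]
    print(split_segments(feats, cuts))  # [(0, 6), (6, 12), (12, 18)]
```

In a fuller pipeline, each resulting segment would then be passed to a classifier that assigns the speaker ID, which is the role the one-class SVM plays in the description above.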
Our solution to the transcription problem is two-fold.
First, we designed an interface that facilitates real-time captioning by non-experts and encourages covering the entire
audio signal. Second, we developed algorithms for merging
partial captions to form one final output stream. The interface and algorithm have been developed to address these
problems jointly. For instance, because determining where
each word in a partial caption fits into the final transcript
is difficult, we designed the interface to encourage workers to type continuous segments during specified periods.
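To see why continuous segments make merging easier, consider a deliberately simplified sketch. It is not the merging algorithm used by Scribe, which this section does not specify; it is a greedy stand-in that assumes partial captions arrive in time order, that each is a continuous run of words, and that neighboring captions overlap exactly. The function names `overlap_length` and `merge_partial_captions` are hypothetical, and the example phrases are the ones that appear in the article's figure text.

```python
from typing import List


def overlap_length(left: List[str], right: List[str]) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_partial_captions(partials: List[str]) -> str:
    """Stitch time-ordered partial captions together on their overlapping words."""
    merged: List[str] = []
    for caption in partials:
        words = caption.lower().split()
        k = overlap_length(merged, words)
        merged.extend(words[k:])  # append only the words not already covered
    return " ".join(merged)


if __name__ == "__main__":
    partials = [
        "we have a crystal",
        "have a crystal that has",
        "has a two-fold axis",
    ]
    print(merge_partial_captions(partials))
    # we have a crystal that has a two-fold axis
```

This sketch breaks down as soon as workers make typos, skip words, or submit a caption that falls entirely inside text already merged, which is part of why placing each word in the final transcript is hard in general and why the interface steers workers toward continuous segments.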
Figure 1. Scribe allows users to caption audio on their mobile device. The audio is sent to multiple amateur captionists who use Scribe’s Web-based
interface to caption as much of the audio as they can in real time. These partial captions are sent to our server to be merged into a final output stream,
which is then forwarded back to the user’s mobile device. Crowd workers are optionally recruited to edit the captions after they have been merged.
[Figure text: the partial captions "we have a crystal", "have a crystal that has", and "has a two-fold axis" are combined into the caption stream and merged captions "we have a crystal that has a two-fold axis".]
Figure 2. The original worker interface encourages captionists
to type quickly by locking in words soon after they are typed. To
encourage coverage of specific segments, visual and audio cues are
presented, the volume is reduced during off periods, and rewards
are increased during these periods.
Figure 3. The Web-based interface that shows users the live caption
stream returned by Scribe.