8. CONCLUSION AND FUTURE WORK
Scribe is the first system capable of making reliable, affordable captions available on-demand to deaf and hard of hearing
users. Scribe has allowed us to explore further issues related
to how real-time captions can be made more useful to end
users. For example, we have shown that students’ comprehension of instructional material improves significantly when they can control when the captions play and track their position, so that they are not overwhelmed by using a single sensory channel to absorb content designed to be split between vision and hearing. To help address this problem, we built a tool that lets students highlight or pause at the last position they read before looking away from the captions to view other visual content.
While we have discussed how automation can be used to
effectively mediate human caption generation, advances in
ASR technologies can aid Scribe as well. By including ASR
systems as workers, we can take advantage of the affordable,
highly-scalable nature of ASR in settings where it works,
while using human workers to ensure that DHH users
always have access to accurate captions. ASR can eventually use Scribe as an in situ training tool, yielding systems that provide reliable captions out of the box using human intelligence, and that scale to fully automated solutions more quickly than would otherwise be possible.
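One lightweight way to include ASR as just another worker, assuming both streams produce time-stamped word hypotheses, is to let human-typed words take precedence and fall back to ASR only in stretches no human worker covered. The function and the 0.5 s guard window below are illustrative assumptions, not Scribe's actual combiner:

```python
def merge_captions(human, asr, guard=0.5):
    """Merge two caption streams of (start_sec, word) hypotheses,
    preferring human words and using ASR only to fill uncovered gaps."""
    covered = [t for t, _ in human]
    merged = list(human)
    for t, word in asr:
        # Keep an ASR word only if no human word falls near its timestamp.
        if all(abs(t - h) > guard for h in covered):
            merged.append((t, word))
    return [w for _, w in sorted(merged)]

human = [(0.0, "the"), (0.4, "lecture"), (2.1, "begins")]
asr = [(0.0, "the"), (1.2, "recorded"), (2.0, "begins"), (3.0, "now")]
print(merge_captions(human, asr))
# → ['the', 'lecture', 'recorded', 'begins', 'now']
```

Here ASR cheaply fills the gap the humans missed ("recorded", "now"), while human captions override the ASR output wherever the two overlap.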
More generally, Scribe is an example of an interactive system that deeply integrates human and machine intelligence
in order to provide a service that is still beyond what computers can do alone. We believe it may serve as a model for
interactive systems that solve other problems of this type.
ACKNOWLEDGMENTS
This work was supported by the National Science Foundation
under awards #IIS-1149709 and #IIS-1218209, the University
of Michigan, Google, an Alfred P. Sloan Foundation Fellowship,
and a Microsoft Research Ph.D. Fellowship.
Mechanical Turk workers’ mean coverage (Figure 7) increased 11.39% (t(df) = 2.19, p < .05), precision increased 12.61% (t(df) = 3.90, p < .001), and latency was reduced by 16.77% (t(df) = 5.41, p < .001).
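Relative-improvement figures and paired t statistics like those above can be computed from per-worker measurements as follows. The data here are made-up illustrative numbers, not the study's:

```python
from math import sqrt
from statistics import mean, stdev

def relative_improvement(before, after):
    """Percent change of the mean from the baseline to the new condition."""
    return 100.0 * (mean(after) - mean(before)) / mean(before)

def paired_t(before, after):
    """Paired t statistic over per-worker differences (df = n - 1)."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Illustrative (made-up) per-worker coverage, no-warp vs. warp conditions.
no_warp = [0.52, 0.48, 0.55, 0.50, 0.47, 0.53]
warp = [0.58, 0.54, 0.60, 0.55, 0.53, 0.59]

print(round(relative_improvement(no_warp, warp), 2))
print(round(paired_t(no_warp, warp), 2))
```

Because each worker captions under both conditions, the paired test on per-worker differences is the appropriate comparison rather than an independent-samples test.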
Figure 5. Precision-coverage curves for the electrical engineering (EE)
and chemistry (Chem) lectures using different combiner parameters
with 10 workers. In general, increasing coverage reduces accuracy.
Figure 6. Evaluation of different systems using (1 − WER) as an accuracy measure (higher is better). Combiner conditions shown: (c=10s, threshold=2), (c=15s, threshold=2), and (c=15s, no threshold).
Figure 7. Relative improvement from no warp to warp conditions
in terms of mean and median values of coverage, precision, and
latency. We expected coverage and precision to improve. Shorter
latency was unexpected, but resulted from workers being able to
consistently type along with the audio instead of having to remember
and go back as the speech outpaced their typing.