98 COMMUNICATIONS OF THE ACM | SEPTEMBER 2017 | VOL. 60 | NO. 9
with. In total, 18 workers participated, collectively achieving
78.0% coverage. The average coverage over just three workers was 59.7% (SD = 10.9%), suggesting that even with conservative worker recruitment we could cover much of the input signal.
In our tests, individual workers achieved an average of 29.0% coverage, ASR achieved 32.3%, CART achieved 88.5%, and Scribe reached 74% out of a possible 93.2% coverage using 10 workers (Figure 4). Collectively, workers had an average latency of 2.89s, significantly improving on CART's latency of 4.38s. For this example, we tuned our combiner to balance
coverage and precision (Figure 5), obtaining an average of 66% and 80.3%, respectively. As expected, CART outperforms the
other approaches. However, our combiner presents a clear
improvement over both ASR and a single worker.
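Treating a caption stream as a bag of words measured against a reference transcript, coverage and precision can be sketched roughly as recall and precision of word overlap. This is a hypothetical simplification (the function name and data layout are ours, and the actual evaluation aligns words in time, which this sketch omits):

```python
from collections import Counter

def coverage_precision(reference, captioned):
    """Coverage: fraction of reference words the worker captured.
    Precision: fraction of the worker's typed words that appear in
    the reference. Computed as word-multiset overlap (a simplification
    that ignores word timing and ordering)."""
    ref, cap = Counter(reference), Counter(captioned)
    overlap = sum((ref & cap).values())  # multiset intersection
    cov = overlap / sum(ref.values()) if reference else 0.0
    prec = overlap / sum(cap.values()) if captioned else 0.0
    return cov, prec

ref = "the quick brown fox jumps over the lazy dog".split()
cap = "the quick fox jumped over lazy dog".split()
cov, prec = coverage_precision(ref, cap)
```

Tuning a combiner then trades these two quantities off: admitting more worker input raises coverage but can lower precision.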
7.2. Improved combiner results
We further improved alignment accuracy by applying a novel
weighted-A* MSA algorithm.27 To test this, we used the same
four 5-min-long audio clips as before. We tested three configurations of our algorithm: (1) no agreement needed with a 15s sliding window, (2) two-person agreement needed with a 10s window, and (3) two-person agreement needed with a
15s window. We compare the results from these three configurations to our original graph-based method, and to the
MUSCLE package (Figure 6).
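To make the agreement/window configurations concrete, here is a toy stand-in for the combiner (not the actual weighted-A* MSA algorithm): a word is kept only when at least `agreement` distinct workers typed it within a `window`-second span. Function names and the data layout are hypothetical:

```python
from collections import defaultdict

def combine(captions, window=15.0, agreement=2):
    """Toy agreement-based merge of partial captions.
    captions: {worker_id: [(timestamp_seconds, word), ...]}
    Keeps a word at its earliest timestamp once `agreement` distinct
    workers have typed it within `window` seconds of that timestamp."""
    occurrences = defaultdict(list)  # word -> [(t, worker_id), ...]
    for wid, stream in captions.items():
        for t, word in stream:
            occurrences[word].append((t, wid))
    merged = []
    for word, occ in occurrences.items():
        occ.sort()
        for t0, _ in occ:
            workers = {w for t, w in occ if t0 <= t <= t0 + window}
            if len(workers) >= agreement:
                merged.append((t0, word))
                break
    return sorted(merged)

streams = {"a": [(1.0, "hello"), (2.0, "world")],
           "b": [(1.5, "hello")],
           "c": [(30.0, "world")]}
merged = combine(streams, window=15.0, agreement=2)
```

With `agreement=1` every typed word passes through (higher coverage, lower precision); requiring two-person agreement filters noise at the cost of words only one worker caught.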
With two-person agreement and a 15s window (the best-performing setting), our algorithm achieves 57.4% average (1-WER)
accuracy, providing 29.6% improvement with respect to the
graph-based system (average accuracy 42.6%), and 35.4%
improvement with respect to the MUSCLE-based MSA system (average accuracy 41.9%). On the same set of audio clips,
we obtained 36.6% accuracy using ASR (Dragon Naturally
Speaking, version 11.5 for Windows), which is worse than
all the crowd-powered approaches. We intentionally did not
optimize the ASR for the speaker or acoustics, since DHH
students would also not be able to do this in realistic settings.
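The (1-WER) accuracy reported above can be computed for any transcript pair with the standard word error rate: word-level Levenshtein distance (substitutions, insertions, and deletions) divided by the reference length. A minimal sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

accuracy = 1 - wer("the quick brown fox", "the quick fox")  # one deletion
```

Accuracy is then simply 1 - WER, so a transcript missing one word in four scores 75%.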
Figure 4. Optimal coverage reaches nearly 80% when combining the input of four workers, and nearly 95% with all 10 workers, showing
captioning audio in real time with non-experts is feasible.
7.3. Time Warp results
To evaluate Time Warp, we ran two studies that asked participants to caption a 2.5 min (12 captioning cycles) lecture
clip. Again, we ran our experiments with both local participants and workers recruited from Mechanical Turk. Tests
were divided into two conditions: time warping on or off,
and were randomized across four possible time offsets: 0s,
3.25s, 6.5s, 9.75s.
Local participants were again generally proficient (but
non-expert) typists and had time to acquaint themselves
with the system, which may better approximate student
employees captioning a classroom lecture. We recruited
24 volunteers (mostly students) and had them practice with
our baseline interface before using the time warp interface.
Each worker was asked to complete two trials, one with
Time Warp and one without, in a random order.
We also recruited 139 Mechanical Turk workers, who
were allowed to complete at most two tasks and were
randomly routed to each condition (providing 257 total
responses). Since Mechanical Turk often contains low-quality (or even malicious) workers,18 we first removed inputs that achieved less than 10% coverage or precision, or that were outliers more than 2σ from the mean. A total of 206 tasks were
approved by this quick check. Task payment amounts were
the same as for our studies described above.
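The quality filter described above can be sketched as a two-stage pass: a hard threshold on coverage and precision, then a 2σ outlier cut. The text does not specify which statistic the outlier test uses, so applying it to coverage here is our assumption, as are the function name and data layout:

```python
from statistics import mean, stdev

def filter_tasks(tasks):
    """tasks: list of (coverage, precision) pairs in [0, 1].
    Stage 1: drop tasks with under 10% coverage or precision.
    Stage 2: drop tasks whose coverage lies more than 2 sample
    standard deviations from the mean (assumption: the 2-sigma
    test is applied to coverage)."""
    kept = [(c, p) for c, p in tasks if c >= 0.10 and p >= 0.10]
    if len(kept) < 2:
        return kept
    covs = [c for c, _ in kept]
    mu, sigma = mean(covs), stdev(covs)
    return [(c, p) for c, p in kept if abs(c - mu) <= 2 * sigma]
```

A filter like this is a common first defense on crowd platforms: it is cheap, requires no gold-standard answers, and removes both empty submissions and implausible extremes before aggregation.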
Our student captionists were able to caption a majority of the content well even without Time Warp. The mean
coverage from all 48 trials was 70.23% and the mean precision was 70.71%, compared to the 50.83% coverage and
62.23% precision for workers drawn from Mechanical
Turk. For student captionists, total coverage went up
2.02%, from 69.54% to 70.95%, and precision went up by
2.56%, from 69.84% to 71.63%, but neither of these differences was detectably significant. However, there was a
significant improvement in mean latency per word, which
improved 22.46% from 4.34s to 3.36s (t(df) = 2.78, p <