6.3. Weighted A* search algorithm
We next developed a weighted A* search-based MSA algorithm to efficiently align the partial captions.27 To do this, we formulate MSA as graph traversal over a specialized lattice. Our search algorithm treats each node as a state, allowing us to estimate the cost function g(n) and the heuristic function h(n) for each state.
At each step of the A* search algorithm, the node with the
smallest evaluation function is extracted from the priority
queue Q and expanded by one edge. This is repeated until a
full alignment is produced (the goal state). While weighted
A* significantly speeds the search for the best alignment, it
is still too slow for very long sequences. To counteract this,
we use fixed-size time windows to scope the exploration to
the most-likely paths.
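The search loop described above can be sketched as generic weighted A*, where the evaluation function f(n) = g(n) + w·h(n) inflates the heuristic by a weight w ≥ 1 to trade optimality for speed. This is an illustrative sketch, not the system's implementation: the `expand` and `h` callbacks stand in for the lattice-specific edge expansion and heuristic, and the time-window pruning is omitted.

```python
import heapq

def weighted_a_star(start, is_goal, expand, h, w=1.5):
    """Generic weighted A*: pop the state with the smallest f(n) = g(n) + w*h(n).

    `expand(state)` yields (next_state, edge_cost) pairs; `h(state)` estimates
    the remaining cost to a goal. With w > 1 the search is faster, but the
    returned cost is only guaranteed to be within a factor of w of optimal.
    """
    frontier = [(w * h(start), 0.0, start, [start])]  # priority queue Q
    best_g = {start: 0.0}
    while frontier:
        _, g, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path, g  # goal state reached: a full alignment
        for nxt, cost in expand(state):  # expand the node by one edge
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + w * h(nxt), g2, nxt, path + [nxt]))
    return None, float("inf")
```

With `w = 1` this reduces to standard A*; larger weights greedily favor states the heuristic believes are close to the goal, which is the speedup the section refers to.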
7. EXPERIMENTAL RESULTS
We have tested our system with non-expert captionists drawn
from both local and remote crowds. As a data set, we used
lectures freely available from MIT OpenCourseWare (http://ocw.mit.edu/courses/). These
lectures were chosen because one of the main goals of Scribe
is to provide captions for classroom activities, and because
the recording of the lectures roughly matches our target as
well—there is a microphone in the room that often captures
multiple speakers, for example, students asking questions.
We chose four 5 min segments that contained speech from
courses in electrical engineering and chemistry, and had
them professionally transcribed at a cost of $1.75 per minute.
Despite the high cost, we found a number of errors and omissions. We corrected these to obtain a completely accurate transcript.
7.1. Core system study results
Our study used 20 local participants. Each participant captioned 23 min of aural speech over a period of approximately
30 min. Participants first took a standard typing test and
averaged a typing rate of 77.0 WPM (SD = 15.8) with 2.05% average error (SD = 2.31%). We then introduced participants
to the real-time captioning interface, and had them caption
a 3 min clip using it. Participants were then asked to caption
the four 5 min clips, two of which were selected to contain
saliency adjustments. We measure coverage (recall within a
10s per-word time bound), precision, and WER.
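Of these metrics, WER is the most involved to compute; the standard definition uses a word-level edit distance against the reference transcript. The following is the textbook dynamic program, not code from the study:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference),
    computed as word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 100% when the hypothesis contains many insertions.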
We found that saliency adjustment made a significant difference in coverage. For the electrical engineering clip, coverage was 54.7% (SD = 9.4%) for words in the selected periods, compared to only 23.3% (SD = 6.8%) for words outside of those periods. For the chemistry clips, coverage was 50.4% (SD = 9.2%) for words inside the highlighted period, compared to 15.4% (SD = 4.3%) for words outside of it.
To see if workers on Mechanical Turk could complete this
task effectively—which would open up a large new set of workers who are available on-demand—we recruited a crowd to
caption the four clips (20 min of speech). Our tasks paid $0.05 and workers could make an additional $0.002 bonus per word. We provided workers with a 40s instructional video to begin
5.2. Integrating ASR into crowd captioning
Integrating ASR into human captioning workflows can also help improve captioning performance. By using the suggestions from an ASR system to provide an initial “baseline” answer that crowd workers can correct, we can reduce latency. However, above a word error rate of roughly 30%, the ASR input actually increases latency because of the cost of finding and repairing mistakes.9 The opposite integration
is also possible: by using sparse human input to provide
corrections to the word lattice of an ASR system, it is possible to reduce the error rate.8
6. AGGREGATING PARTIAL CAPTIONS
The problem of aligning and aggregating multiple partial transcripts can be mapped to the well-studied Multiple
Sequence Alignment (MSA) problem. The basic formulation of
the problem involves some number of ordered sequences that
include at least some similar elements (coming from the same
“dictionary” of possible terms plus a “gap” term). Finding the
alignment that minimizes total distance between all pairs of
sequences is a non-trivial problem because, in the worst case,
all possible alignments of the content of each sequence—
including all possible spaces containing a gap term—may
need to be explored. This optimization problem has been
shown to be NP-complete,31 and exact algorithms have time complexity that is exponential in the number of sequences. As a result, it is often necessary to apply heuristic approximations to perform MSA within a reasonable amount of time.
In practice, MSA is a well-studied problem in the bioinformatics literature that has long been used in aligning
genome sequences, but also has applications in approximate
text matching for information retrieval, and in many other
domains. Tools like MUSCLE5 provide extremely powerful solvers for MSA problems. Accordingly, our approach
is to formulate our text-matching problem as MSA.
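As background for how an MSA solver scores candidates, the core building block is pairwise global alignment with a gap term. Below is a minimal Needleman-Wunsch-style sketch over word sequences, with assumed unit gap and mismatch costs; MUSCLE and other production solvers use far more sophisticated scoring:

```python
def align_pair(a, b, gap_cost=1, mismatch_cost=1):
    """Globally align two token sequences, minimizing total edit cost.
    Returns the two aligned sequences (with "-" as the gap term) and the cost."""
    n, m = len(a), len(b)
    # dp[i][j] = minimum cost to align a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i * gap_cost
    for j in range(m + 1):
        dp[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch_cost)
            dp[i][j] = min(match, dp[i - 1][j] + gap_cost, dp[i][j - 1] + gap_cost)
    # Traceback to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch_cost)):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap_cost:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1], dp[n][m]
```

The sum of such pairwise costs over all sequence pairs is the sum-of-pairs objective that makes exact MSA intractable as the number of sequences grows.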
6.1. Progressive alignment algorithms
Most MSA algorithms for biological sequences follow a
progressive alignment strategy that first performs pairwise alignment among the sequences, and then merges
sequences progressively according to a decreasing order of
pairwise similarity. Due to the sequential merging strategy,
progressive alignment algorithms cannot recover from the
errors made in the earlier iterations, and typically do not
work well for the caption alignment task.
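The merge schedule this strategy uses can be illustrated with a small sketch; the LCS-based similarity here is a hypothetical stand-in for the pairwise alignment scores a real progressive aligner would use:

```python
from itertools import combinations

def lcs_len(a, b):
    """Length of the longest common subsequence of two token sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def merge_order(seqs):
    """Progressive-alignment schedule: merge the most-similar pairs first.
    Once a pair is merged, its alignment is frozen -- which is why early
    mistakes cannot be recovered in later iterations."""
    return sorted(combinations(range(len(seqs)), 2),
                  key=lambda ij: -lcs_len(seqs[ij[0]], seqs[ij[1]]))
```

Because partial captions are sparse and overlap unevenly, the early pairwise similarities are noisy, and freezing those merges first is what hurts this strategy on the caption alignment task.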
6.2. Graph-based alignment
We first explored a graph-based incremental algorithm to combine partial captions on the fly.19 The aggregation algorithm
incrementally builds a chain graph, where each node represents a set of equivalent words entered by the workers, and the
links between nodes are adjusted according to the order of the
input words. A greedy search is performed to identify the path
with the highest confidence, based on worker input and an
n-gram language model. The algorithm is designed to be used
online, and hence has high speed and low latency. However,
due to the incremental nature of the algorithm and the lack of
a principled objective function, it is not guaranteed to find the
globally optimal alignment for the captions.
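For illustration only, a toy version of the chain-graph idea might look as follows. This hypothetical simplification counts word-to-word transitions across workers and walks the most-supported edges greedily, omitting the n-gram language model, per-worker confidence, and equivalent-word clustering described above:

```python
from collections import defaultdict

def aggregate_greedy(partial_captions):
    """Toy chain-graph aggregation: nodes are words, edge weights count how
    many workers typed that word transition, and a greedy walk follows the
    highest-count outgoing edge with no backtracking."""
    edges = defaultdict(lambda: defaultdict(int))
    starts = defaultdict(int)
    for words in partial_captions:
        if not words:
            continue
        starts[words[0]] += 1
        for prev, nxt in zip(words, words[1:]):
            edges[prev][nxt] += 1
    if not starts:
        return []
    # Start at the most common first word, then repeatedly take the
    # most-supported edge; greediness is why global optimality is not
    # guaranteed.
    node = max(starts, key=starts.get)
    path, seen = [node], {node}
    while edges[node]:
        nxt = max(edges[node], key=edges[node].get)
        if nxt in seen:  # avoid cycles in this toy version
            break
        path.append(nxt)
        seen.add(nxt)
        node = nxt
    return path
```

Even in this toy form, the appeal for online use is visible: each new partial caption only increments a few counters, so updates are cheap and low-latency.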