s_t as the histogram of a random sample of size C. Thus our
data model becomes the multinomial distribution
It is worth noting that the model generalizes in a straightforward way to situations in which multiple pitches sound at once, simply by mixing several distributions of the form of Equation 3.1. In this way our approach accommodates anything from double stops on the violin to large ensembles.
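To make the data model concrete, here is a minimal sketch in Python (the function names, the use of scipy, and the quantization of s_t to integer bin counts summing to C are illustrative assumptions, not the system's actual code):

```python
import numpy as np
from scipy.special import gammaln

def multinomial_loglik(s_t, p_n):
    """Log-likelihood of the frame spectrum s_t (integer bin counts summing
    to C) under a multinomial model with template probabilities p_n."""
    s_t = np.asarray(s_t, dtype=float)
    p_n = np.asarray(p_n, dtype=float)
    return (gammaln(s_t.sum() + 1.0) - gammaln(s_t + 1.0).sum()
            + np.sum(s_t * np.log(p_n + 1e-12)))

def polyphonic_template(templates, weights):
    """Template for several pitches sounding at once: a convex combination
    of the individual pitch templates."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ np.asarray(templates, dtype=float)
```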
This modeling approach describes the part of the audio
spectrum due to the soloist reasonably well. However, the signal we actually receive contains not only this solo contribution, but also the audio generated by our accompaniment system itself. If the accompaniment audio contains frequency
content that is confused with the solo audio, the result
is the highly undesirable possibility of the accompaniment system following itself—in essence, chasing its own
shadow. To a certain degree, the likelihood of this outcome
can be diminished by “turning off” the score follower when
the soloist is not playing; of course we do this. However,
there is still significant potential for shadow-chasing
since the pitch content of the solo and accompaniment
parts is often similar.
Our solution is to directly model the accompaniment
contribution to the audio signal we receive. Since we know
what the orchestra is playing (our system generates this
audio), we add this contribution to the data model. More
explicitly, if q_t is the magnitude spectrum of the orchestra's contribution in frame t, we model the conditional distribution of s_t using Equation 1, but with p_{t,n} = λp_n + (1 − λ)q_t, for 0 < λ < 1, instead of p_n.
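As an illustration, a minimal sketch of this modified template (the function name and the particular value of λ are illustrative assumptions):

```python
import numpy as np

def mixed_template(p_n, q_t, lam=0.7):
    """Template for note n in frame t when the orchestra is also sounding:
    p_{t,n} = lam * p_n + (1 - lam) * q_t, with q_t normalized to a
    probability distribution over frequency bins."""
    q_t = np.asarray(q_t, dtype=float)
    q_t = q_t / max(q_t.sum(), 1e-12)
    return lam * np.asarray(p_n, dtype=float) + (1.0 - lam) * q_t
```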
This addition creates significantly better results in many
situations. The surprising difficulty in actually implementing the approach, however, is that there seems to be only
weak agreement between the known audio that our system
plays through the speakers and the accompaniment audio
that comes back through the microphone. Still, with various
averaging tricks in the estimation of q_t, we can nearly eliminate the undesirable shadow-chasing behavior.
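One plausible averaging scheme, shown purely as an illustrative assumption rather than the method actually used, is to exponentially smooth the spectra of the frames the system has just played into a running estimate of q_t:

```python
import numpy as np

class OrchestraContributionEstimator:
    """Running estimate of the orchestra's spectral contribution q_t,
    obtained by exponentially smoothing the magnitude spectra of the
    frames the system itself has recently played (illustrative scheme)."""

    def __init__(self, n_bins, alpha=0.8):
        self.alpha = alpha            # smoothing factor; value is a guess
        self.q_t = np.zeros(n_bins)

    def update(self, played_spectrum):
        """Fold in the magnitude spectrum of the most recently played frame."""
        s = np.asarray(played_spectrum, dtype=float)
        self.q_t = self.alpha * self.q_t + (1.0 - self.alpha) * s
        return self.q_t
```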
3.2. Online Interpretation of Audio
One of the worst things a score follower can do is report
events before they have occurred. In addition to the sheer
impossibility of producing accurate estimates in this case,
the musical result often involves the accompanist arriving at a point of coincidence before the soloist does. When
the accompanist “steps on” the soloist in this manner, the
soloist must struggle to regain control of the performance,
perhaps feeling desperate and irrelevant in the process.
Since the consequences of false positives are so great, the
score follower must be reasonably certain that a note event
has already occurred before reporting its location. The probabilistic formulation of online score following is the key to
avoiding such false positives, while navigating the accuracy-latency trade-off in a reasonable manner.
Every time we process a new frame of audio we recompute
the “forward” probabilities, p(x_t | y_1, …, y_t), for our current frame, t. Listen waits to detect note n until we are sufficiently confident that its onset is in the past. That is, until

P(x_t ≥ start_n | y_1, …, y_t) ≥ τ

for some constant threshold, τ. In this expression, start_n represents the initial state of the nth note model, as indicated in Figure 1; since each state lies either before or after every other state in the model, the event x_t ≥ start_n is well defined. Suppose that t* is the first frame where
the above inequality holds. When this occurs, our knowledge of
the note onset time can be summarized by the function of t:
P(x_t = start_n | y_1, …, y_{t*})
which we compute using the forward–backward algorithm.
Occasionally this distribution conveys uncertainty about the
onset time of the note, for instance if it has high variance or is bimodal. In such a case we simply do not report
the onset time of the particular note, believing it is better to
remain silent than provide bad information. Otherwise, we
estimate the onset as
t̂_n = arg max_{t ≤ t*} P(x_t = start_n | y_1, …, y_{t*})
and deliver this information to the Predict module.
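Schematically, this detection logic might be implemented as follows (the array-based interface, the threshold value, and the dispersion test used to decide when the posterior is too uncertain are illustrative assumptions):

```python
import numpy as np

def detect_onset(filtered, smoothed, start_idx, tau=0.95, max_std=5.0):
    """filtered[t, x] : forward probability p(x_t = x | y_1..y_t)
    smoothed[t, x] : forward-backward posterior p(x_t = x | y_1..y_{t*})
    start_idx      : index of state start_n in the linearly ordered state space
    Returns the estimated onset frame t_hat_n, or None if no confident report."""
    T = filtered.shape[0]

    # 1. Wait for the first frame t* with P(x_t >= start_n | y_1..y_t) >= tau.
    t_star = next((t for t in range(T)
                   if filtered[t, start_idx:].sum() >= tau), None)
    if t_star is None:
        return None

    # 2. Summarize onset knowledge by P(x_t = start_n | y_1..y_{t*}) for t <= t*.
    post = smoothed[: t_star + 1, start_idx]
    post = post / max(post.sum(), 1e-12)

    # 3. Stay silent if the distribution is too diffuse (illustrative test).
    frames = np.arange(t_star + 1)
    mean = float(frames @ post)
    std = float(np.sqrt(((frames - mean) ** 2) @ post))
    if std > max_std:
        return None

    # 4. Otherwise report the MAP estimate of the onset frame.
    return int(np.argmax(post))
```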
Several videos demonstrating the capabilities of our score follower can be seen at the aforementioned web site. One of
these simply plays the audio while highlighting the locations
of note onset detections at the times they are made, thus
demonstrating detection latency—what one sees lags slightly
behind what one hears. A second video shows a rather eccentric performer who ornaments wildly, makes extreme tempo
changes, plays wrong notes, and even repeats a measure,
thus demonstrating the robustness of the score follower.
4. Predict: Modeling Musical Timing
As discussed in Section 2, we believe a purely responsive
accompaniment system cannot achieve acceptable coordination of parts in the range of common practice “
classical” music we treat; thus we choose to schedule our
accompaniment through prediction rather than response.
Our approach is based on a probabilistic model for musical timing. In developing this model, we begin with three
important traits we believe such a model must have.
1. Since our accompaniment must be constructed in real time, our model must be computationally feasible for real-time use.
2. Our system must improve with rehearsal. Thus our
model must be able to automatically train its parameters to embody the timing nuances demonstrated by
the live player in past examples. This way our system
can better anticipate the future musical evolution of
the current performance.
3. If our rehearsals are to be successful in guiding the
system toward the desired musical end, the system
must “sightread” (perform without rehearsal) reasonably well.