synchrony degrades.
Our system is known interchangeably as the “Informatics
Philharmonic,” or “Music Plus One” (MPO), due to its
alleged improvement on the play-along accompaniment
records from the Music Minus One company that inspired
our work. For several years, we have collaborated with faculty and students in the JSoM on this traditional concerto
setting, in an ongoing effort to improve the performance of
our system while exploring variations on this scenario. The
web page http://www.music.informatics.indiana.edu/papers/
icml10 contains a video of violinist Yoo-jin Cho, accompanied by our system on the first movement of the Sibelius violin concerto, taken from a lecture/concert for our Art’s Week
festival of 2007. We will present a description of the overall
architecture of our system in terms of its three basic components: Listen, Predict, and Play, including several illuminating examples. We also identify open problems and limitations of the proposed approaches that are likely to interest the Machine Learning community and may well benefit from its contributions.
The basic technology required for common practice classical music extends naturally to the avant garde domain. In
fact, we believe one of the greatest potential contributions
of the accompaniment system is in new music composed
specifically for human–computer partnerships. The computer offers essentially unlimited virtuosity in terms of playing fast notes and coordinating complicated rhythms. On
the other hand, at present, the computer is comparatively
weak at providing aesthetically satisfying musical interpretations. Compositions that leverage the technical ability of the accompaniment system, while humanizing the
performance through the live soloist’s leadership, provide
an open-ended musical meeting place for twenty-first-century composition and technology. Several compositions
of this variety, written specifically for our accompaniment
system by Swiss composer and mathematician Jan Beran,
are presented at the web page referenced above.
2. OVERVIEW OF MUSIC PLUS ONE
Our system is composed of three sub-tasks called “Listen,”
“Predict,” and “Play.” The Listen module interprets the
audio input of the live soloist as it accumulates in real time.
In essence, Listen annotates the incoming audio with
a “running commentary,” identifying note onsets with variable detection latency, using the hidden Markov model
discussed in Section 3. A moment’s thought here reveals
that some detection latency is inevitable since a note must
be heard for an instant before it can be identified. For this
reason, we believe it is hopeless to build a purely “responsive” system—one that waits until a solo note is detected before playing a synchronous accompaniment event: Our detection latency is usually in the 30–90 ms range, enough
to prove fatal if the accompaniment is consistently behind
by this much. For this reason, we model the timing of our
accompaniment on the human musician, continually predicting future evolution, while modifying these predictions
as more information becomes available. The module of
our system that performs this task, Predict, is a Gaussian
graphical model quite close to a Kalman Filter, discussed in
Section 4. The Play module uses phase-vocoding5 to construct the orchestral audio output using audio from an
accompaniment-only recording. This well-known technique
warps the timing of the original audio without introducing
pitch distortions, thus retaining much of the original musical intent including balance, expression, and tone color. The
Play process is driven by the output of the Predict module, in
essence by following an evolving sequence of future targets
like a trail of breadcrumbs.
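The flavor of prediction performed by Predict can be illustrated with a simple constant-tempo Kalman filter. This is only a sketch under assumed dynamics and noise parameters—the actual Predict module is a richer Gaussian graphical model, described in Section 4—but it shows the core idea: maintain a joint estimate of onset time and local tempo, and revise the prediction of the next accompaniment event each time Listen reports a solo onset.

```python
# Sketch of Kalman-filter-style onset prediction (illustrative only; the
# system's actual Predict module is a richer Gaussian graphical model).
# State x = [onset time of current note (s), local tempo (s per beat)].
import numpy as np

def kalman_step(x, P, dt_beats, z, q=1e-3, r=1e-3):
    """Advance one score event of length dt_beats, then fuse the
    detected solo onset time z (as reported by Listen) into the state."""
    F = np.array([[1.0, dt_beats],      # next onset = time + tempo * beats
                  [0.0, 1.0]])          # tempo follows a random walk
    Q = q * np.array([[dt_beats, 0.0],
                      [0.0, dt_beats]]) # assumed process noise
    x = F @ x                           # predict forward in the score
    P = F @ P @ F.T + Q
    H = np.array([[1.0, 0.0]])          # we observe only the onset time
    S = H @ P @ H.T + r                 # innovation variance
    K = P @ H.T / S                     # Kalman gain
    x = x + (K * (z - H @ x)).ravel()   # correct with the detection
    P = (np.eye(2) - K @ H) @ P
    return x, P

# Usage: quarter notes near 0.5 s per beat, with the soloist slowing.
x = np.array([0.0, 0.5])                # start at t = 0, 0.5 s per beat
P = np.eye(2) * 0.01
for z in [0.50, 1.02, 1.56, 2.12]:      # detected onsets from Listen
    x, P = kalman_step(x, P, dt_beats=1.0, z=z)
predicted_next = x[0] + x[1]            # schedule the next orchestra note
```

Each new detection nudges both the time and the tempo estimates, so the predicted time of the next orchestra note drifts with the soloist rather than lagging a fixed latency behind.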
While the basic methodology of the system relies on old
standards from the ML community—HMMs and Gaussian
graphical models—the computational challenge of the
system should not be underestimated, requiring accurate
real-time two-way audio computation in musical scenarios
complex enough to be of interest in a sophisticated musical community. The system was implemented for off-the-shelf hardware in C and C++ over a period of more than 15
years by the author. Both Listen and Play run as separate threads, each of which calls the Predict module when a solo note is detected (Listen) or an orchestra note is played (Play).
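The two-thread arrangement can be sketched as follows. This is a hypothetical Python skeleton, not the actual C/C++ implementation; the class and method names are invented for illustration, and the scripted event lists stand in for real-time audio I/O.

```python
# Hypothetical skeleton of the two-thread architecture: Listen and Play
# each notify a shared, lock-protected Predict object.
import threading

class Predict:
    """Shared scheduler; both threads update it under a lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.events = []
    def solo_note_detected(self, onset_time):
        with self.lock:                              # serialize updates
            self.events.append(("solo", onset_time)) # re-predict here
    def orchestra_note_played(self, play_time):
        with self.lock:
            self.events.append(("orch", play_time))  # re-predict here

predict = Predict()

def listen_thread(detections):
    for t in detections:                 # stand-in for real-time audio in
        predict.solo_note_detected(t)

def play_thread(schedule):
    for t in schedule:                   # stand-in for audio output
        predict.orchestra_note_played(t)

l = threading.Thread(target=listen_thread, args=([0.5, 1.0],))
p = threading.Thread(target=play_thread, args=([0.4, 0.9],))
l.start(); p.start(); l.join(); p.join()
```

The design point is that Predict is the single shared state: neither thread blocks on the other, and each revised prediction immediately informs the next scheduled orchestra note.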
What follows is a more detailed look at Listen and Predict.
3. LISTEN: HMM-BASED SCORE FOLLOWING
Blind music audio recognition1, 7, 13 treats the automatic
transcription of music audio into symbolic music representations, using no prior knowledge of the music to be
recognized. This problem remains completely open, especially with polyphonic (several independent parts) music,
where the state of the art remains primitive. While there
are many ways one can build reasonable data models
quantifying how well a particular audio instant matches
a hypothesized collection of pitches, what seems to be
missing is the musical language model. If phonemes and
notes are regarded as the atoms of speech and music,
there does not seem to be a musical equivalent of the word.
Furthermore, while music follows simple logic and can be
quite predictable, this logic is often cast in terms of higher-level constructs such as meter, harmony, and motivic
transformation. Computationally tractable models such
as note n-grams seem to contribute very little here, while
a computationally useful music language model remains
uncharted territory.
Our Listen module deals with the much simpler situation in which the music score is known, giving the pitches
the soloist will play along with their approximate durations. Thus, the score following problem is one of
alignment rather than recognition. Score following, otherwise
known as online alignment, is more difficult than its off-line cousin, since an online algorithm cannot consider
future audio data in estimating the times of audio events.
A score follower must “hear” a little of a note before the note’s onset can be detected, thus always incurring some degree of latency—the lag between the estimated onset time and the time the estimate is made. One of the
principal challenges of online alignment is navigating the
trade-off between latency and accuracy. Schwarz14 gives a
nice annotated bibliography of the many contributions to
score following.
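The essence of online alignment can be sketched with a small left-to-right HMM over score positions, filtered one audio frame at a time. This is only a toy under assumed transition probabilities: the per-frame likelihoods below are a placeholder for a real acoustic data model, and the function name is invented, not the system's actual Listen module.

```python
# Toy online score follower: left-to-right HMM over score notes,
# updated frame by frame with the forward algorithm (causal filtering).
import numpy as np

def follow(frame_likelihoods, p_stay=0.7):
    """frame_likelihoods: (n_frames, n_notes) array, entry [t, i] =
    P(audio frame t | score note i). Returns the filtered (online)
    note estimate for each frame."""
    n_frames, n_notes = frame_likelihoods.shape
    alpha = np.zeros(n_notes)
    alpha[0] = 1.0                        # begin at the first score note
    estimates = []
    for t in range(n_frames):
        # Transition: stay on the current note or advance to the next.
        moved = np.zeros(n_notes)
        moved[1:] = (1 - p_stay) * alpha[:-1]
        alpha = p_stay * alpha + moved
        # Observation update and normalization.
        alpha = alpha * frame_likelihoods[t]
        alpha /= alpha.sum()
        estimates.append(int(np.argmax(alpha)))  # causal estimate only
    return estimates

# Usage: three frames that sound like note 0, then three like note 1.
likes = np.array([[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3)
estimates = follow(likes)
```

Because the estimate at frame t uses only frames up to t, the follower switches to note 1 only after evidence for it arrives—exactly the latency/accuracy trade-off described above.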