other players do. Such a situation occurs with the early-stage
accompaniment problem discussed in Section 1, where
one cannot learn the desired musicality from the live player.
Perhaps the accompaniment antithesis of the concerto setting is the opera orchestra, in which the “accompanying”
ensemble is often on equal footing with the soloists. We
observed the nadir of our system's performance in an opera
rehearsal where it served as rehearsal pianist. What
these two situations have in common is that they require
an accompanist with independent musical knowledge and
goals.
How can we more intelligently model this musicality? An
incremental approach would begin by observing that our
timing model of Equations 3 and 4 is over-parametrized,
with more degrees of freedom than there are notes. We make
this modeling choice because we do not know which degrees
of freedom are needed ahead of time, so we use the training data from the soloist to help sort this out. Unnecessary
learned parameters may contribute some noise to the resulting timing model, but the overall result is acceptable.
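To make this concrete, the following sketch simulates a per-note timing model of the same general flavor, in which every note contributes both a tempo-change variable and a tempo-independent length variable, so the model carries two free parameters per note. The variable names, the exact update, and the variance values are illustrative assumptions, not a restatement of Equations 3 and 4.

```python
# Illustrative sketch of an over-parametrized per-note timing model
# (assumed form; not the article's exact Equations 3 and 4).
import numpy as np

def simulate_onsets(beat_lengths, tempo0=0.5, sig_tempo=0.01, sig_len=0.02, seed=0):
    """Generate onset times (seconds) for notes of the given score
    lengths (beats), letting every note perturb both the local tempo
    and its own tempo-independent length."""
    rng = np.random.default_rng(seed)
    t, s = 0.0, tempo0            # onset time (sec), local tempo (sec/beat)
    onsets = []
    for ell in beat_lengths:
        onsets.append(t)
        s = s + rng.normal(0.0, sig_tempo)           # per-note tempo change
        t = t + ell * s + rng.normal(0.0, sig_len)   # per-note length variation
    return np.array(onsets)

print(simulate_onsets([1.0, 0.5, 0.5, 1.0, 2.0]))
```

In a learned version of such a model, the per-note noise parameters would be fit from the soloist's rehearsal data, which is exactly where the surplus of degrees of freedom enters.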
One possible line of improvement is simply decreasing
the model’s freedom—surely the player does not wish to
change the tempo and apply tempo-independent note length
variation on every note. For instance, one alternative model
adds a hidden discrete process that “chooses,” for each note,
among three possibilities: variation of either tempo or note
length, or no variation of either kind. Of these, the choice of
neither variation would be the most likely a priori, thus biasing the model toward simpler musical interpretations. The
resulting model is a Switching Kalman Filter.15 While exact
inference is no longer possible with such a model, we expect
that one can make approximations that will be good enough
to realize the full potential of the model.
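The sketch below shows the switching structure in generative form, with an assumed prior that favors the "no variation" choice; it illustrates the shape of the model rather than any particular inference scheme, and the regime names, prior, and variances are placeholders.

```python
# Generative sketch of the switching idea: a hidden discrete label per
# note selects tempo variation, note-length variation, or neither.
# The regime names, prior, and variances are assumed placeholder values.
import numpy as np

REGIMES = ["none", "tempo", "length"]
PRIOR = [0.8, 0.1, 0.1]            # "no variation" most likely a priori

def simulate_switching(beat_lengths, tempo0=0.5, sig_tempo=0.02, sig_len=0.03, seed=0):
    rng = np.random.default_rng(seed)
    t, s = 0.0, tempo0             # onset time (sec), local tempo (sec/beat)
    onsets, labels = [], []
    for ell in beat_lengths:
        onsets.append(t)
        z = rng.choice(REGIMES, p=PRIOR)     # hidden per-note choice
        labels.append(str(z))
        if z == "tempo":                     # vary the tempo only
            s = s + rng.normal(0.0, sig_tempo)
        tau = rng.normal(0.0, sig_len) if z == "length" else 0.0
        t = t + ell * s + tau                # advance to the next onset
    return np.array(onsets), labels

onsets, labels = simulate_switching([1.0, 0.5, 0.5, 1.0, 2.0])
print(list(zip(labels, onsets.round(3))))
```

Exact filtering in such a model must, in principle, track every sequence of discrete labels, a set that grows exponentially with the number of notes; the usual remedy is to merge or prune label histories after each note, which is the kind of approximation we have in mind.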
Perhaps a more ambitious approach analyzes the musical
score itself to choose the locations requiring degrees of freedom. One can think of this approach as adding “joints” to the
musical structure so that it deforms into musically reasonable shapes as a musician applies external force. Here there
is an interesting connection with the work on expressive
synthesis, such as that of Widmer and Goebl,16 in which one algorithmically constructs an expressive rendition of a previously
unseen piece of music, using ideas of machine learning. One
approach here associates various score situations, defined
in terms of local configurations of score features, with
interpretive actions. The associated interpretive actions are
learned by estimating timing and loudness parameters from
a performance corpus, over all “equivalent” score locations.
Such approaches are far more ambitious than our present
approach to musicality, as they try to understand expression
in general, rather than in a specific musical context.
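As a schematic illustration of the corpus-based recipe just described (and only that; the feature set and estimator below are invented for the sketch, not those of Widmer and Goebl), one can group "equivalent" score locations by a tuple of local features and average the observed timing and loudness deviations within each group:

```python
# Schematic sketch: learn an average interpretive action for each class
# of "equivalent" score locations, then look actions up by local context.
# The feature tuples and deviation encoding are illustrative assumptions.
from collections import defaultdict
from statistics import mean

def learn_actions(corpus):
    """corpus: iterable of (features, timing_dev, loudness_dev), where
    `features` is a hashable local score context such as
    ("phrase_end", "descending_contour")."""
    groups = defaultdict(list)
    for features, timing_dev, loudness_dev in corpus:
        groups[features].append((timing_dev, loudness_dev))
    return {f: (mean(t for t, _ in obs), mean(l for _, l in obs))
            for f, obs in groups.items()}

def render(score_features, actions, default=(0.0, 0.0)):
    """Assign each note the learned action for its local context,
    falling back to a neutral action for unseen contexts."""
    return [actions.get(f, default) for f in score_features]
```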
The understanding and synthesis of musical expression
is one of the most interesting music-science problems, and
while progress has been achieved in recent years, it would
still be fair to call the problem “open.” One of the principal
challenges here is that one cannot directly map observable
surface-level attributes of the music, such as pitch contour
or local rhythm context, into interpretive actions, such as
delay, or tempo or loudness change. Rather, there is a murky
intermediate level in which the musician comes to some
understanding of the musical meaning, on which the interpretive decisions are based.
This meaning comes from several different aspects of the music. For example, some comes
from musical structure, as in the way one might slow down at the end of a phrase, giving
a sense of musical closure. Some meaning comes from prosodic aspects, analogous to speech,
such as a local point of arrival, which may be emphasized or delayed. A third aspect of
meaning describes an overall character or affect of a section of music, such as excited or
calm. While there is no official taxonomy of musical interpretation, most discussions on
this subject revolve around intermediate identifications of this kind, and the interpretive
actions they require.10
Acknowledgments
This work was supported by NSF Grants IIS-0812244 and
IIS-0739563.
References
1. Cemgil, A.T., Kappen, H.J., Barber, D. A generative model for music transcription. IEEE Trans. Audio Speech Lang. Process. 14, 2 (Mar. 2006), 679–694.
2. Cont, A., Schwarz, D., Schnell, N. From Boulez to ballads: Training IRCAM's score follower. In Proceedings of the International Computer Music Conference (2005), 241–248.
3. Dannenberg, R., Mont-Reynaud, B. Following an improvisation in real time. In Proceedings of the 1987 International Computer Music Conference (1987), 241–248.
4. Dannenberg, R., Mukaino, H. New techniques for enhanced quality of computer accompaniment. In Proceedings of the 1988 International Computer Music Conference (1988), 243–249.
5. Flanagan, J.L., Golden, R.M. Phase vocoder. Bell Syst. Tech. J. 45 (Nov. 1966), 1493–1509.
6. Franklin, J. Improvisation and learning. In Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA, 2002.
7. Klapuri, A., Davy, M. (editors). Signal Processing Methods for Music Transcription. Springer-Verlag, New York, 2006.
8. Lippe, C. Real-time interaction among composers, performers, and computer systems. Inf. Process. Soc. Jpn. SIG Notes, 123 (2002), 1–6.
9. Pachet, F. Beyond the cybernetic jam fantasy: The Continuator. IEEE Comput. Graph. Appl. 24, 1 (2004), 31–35.
10. Palmer, C. Music performance. Annu. Rev. Psychol. 48 (1997), 115–138.
11. Raphael, C. A Bayesian network for real-time musical accompaniment. In Advances in Neural Information Processing Systems (NIPS) 14. MIT Press, 2002.
12. Rowe, R. Interactive Music Systems. MIT Press, 1993.
13. Sagayama, T.N.S., Kameoka, H. Specmurt anasylis: A piano-roll-visualization of polyphonic music signal by deconvolution of log-frequency spectrum. In Proceedings 2004 ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing (SAPA2004) (2004).
14. Schwarz, D. Score following commented bibliography, 2003.
15. Shumway, R.H., Stoffer, D.S. Dynamic linear models with switching. J. Am. Stat. Assoc. 86 (1991), 763–769.
16. Widmer, G., Goebl, W. Computational models for expressive music performance: The state of the art. J. New Music Res. 33, 3 (2004), 203–216.

Christopher Raphael (craphael@indiana.edu), School of Informatics and Computing, Indiana University, Bloomington, IN.

© 2011 ACM 0001-0782/11/0300 $10.00