3.1. The Listen Model
Our HMM approach views the audio data as a sequence of "frames," $y_1, y_2, \ldots, y_T$, with about 30 frames per second, while modeling these frames as the output of a hidden Markov chain, $x_1, x_2, \ldots, x_T$. The state graph for the Markov chain, depicted in Figure 1, models the music as a sequence of sub-graphs, one for each solo note, arranged so that the process enters the start of the (n + 1)th note as it leaves the nth note. From the figure, one can see that each note begins with a short sequence of states meant to capture the attack portion of the note. This is followed by another sequence of states with self-loops, meant to capture the main body of the note and to account for the variation in note duration we may observe, as follows.
If we chain together m states, each of which either moves forward, with probability p, or remains in the current state, with probability q = 1 − p, then the total number of state visits (audio frames), L, spent in the sequence of m states has a negative binomial distribution,

$$P(L = l) = \binom{l-1}{m-1}\, p^m q^{\,l-m},$$

for l = m, m + 1, . . . . While it is convenient to represent this distribution with a Markov chain, the asymmetric nature of the negative binomial is also musically reasonable: while it is common for an inter-onset interval (IOI) to be much longer than its nominal length, the reverse is much less common.
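To make the duration model concrete, the following sketch (illustrative code, not the paper's) simulates the dwell time of m self-looping states and checks it against the negative binomial probabilities above; the values of m and p are arbitrary.

```python
# Illustrative check: simulate the dwell time L of m chained self-looping
# states and compare with P(L = l) = C(l-1, m-1) p^m q^(l-m).
import numpy as np
from math import comb

rng = np.random.default_rng(0)
m, p = 4, 0.3
q = 1 - p

def simulate_dwell() -> int:
    """Frames spent traversing m states, each advancing w.p. p per frame."""
    frames = 0
    for _ in range(m):
        frames += 1                   # first frame in this state
        while rng.random() >= p:      # self-loop with probability q
            frames += 1
    return frames

samples = np.array([simulate_dwell() for _ in range(100_000)])
print(f"mean: {samples.mean():.2f} (theory {m / p:.2f})")
print(f"var : {samples.var():.2f}  (theory {m * q / p**2:.2f})")
for l in range(m, m + 4):
    pmf = comb(l - 1, m - 1) * p**m * q ** (l - m)
    print(l, round(pmf, 4), round(float(np.mean(samples == l)), 4))
```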
For each note, we choose the parameters m and p so that $E(L) = m/p$ and $\mathrm{Var}(L) = mq/p^2$ reflect our prior beliefs. Before any rehearsals, the mean is chosen to be consistent with the note value and the nominal tempo given in the score, while the variance is chosen to be a fixed increasing function of the mean. However, once we have rehearsed a piece a few times, we choose m and p according to the method of moments, so that the empirical mean and variance agree with the mean and variance from the model.
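The method-of-moments fit has a closed form. Below is a minimal sketch, assuming the rehearsal estimates of a note's mean and variance (in frames) are given; the function name and the rounding rule are our own choices.

```python
# Method-of-moments sketch: from E(L) = m/p and Var(L) = mq/p^2 we get
# q/p = var/mean, so p = mean / (mean + var) and m = p * mean.
def fit_note_model(mean_hat: float, var_hat: float) -> tuple[int, float]:
    p = mean_hat / (mean_hat + var_hat)
    m = max(1, round(p * mean_hat))   # m must be a positive integer
    p = m / mean_hat                  # re-solve p so the mean is matched exactly
    return m, p

m, p = fit_note_model(mean_hat=20.0, var_hat=60.0)  # ~0.67 s at 30 frames/s
print(m, p)   # 5, 0.25  ->  E(L) = 20, Var(L) = 60
```

Since m is rounded to an integer, the variance is matched only approximately; re-solving p afterward keeps the mean exact.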
[Figure 1. The state graph for the hidden sequence, $x_1, x_2, \ldots$, of our HMM. Each note's sub-graph chains attack ("atck") states into $m$ self-looping sustain ("sust") states, staying with probability $q$ and advancing with probability $p$; Notes 1, 2, 3, etc. are linked through their start states.]
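As a rough illustration of the state graph in Figure 1, here is a sketch (our own construction, with an arbitrary number of attack states) of one note's sub-graph as a row-stochastic transition matrix:

```python
# One note's sub-graph: attack states that always advance, then m sustain
# states that self-loop with probability q = 1 - p. The final column holds
# the probability of exiting into the next note's start state.
import numpy as np

def note_subgraph(n_attack: int, m: int, p: float) -> np.ndarray:
    n = n_attack + m
    A = np.zeros((n, n + 1))          # column n = "exit to next note"
    for i in range(n_attack):         # attack states advance deterministically
        A[i, i + 1] = 1.0
    for i in range(n_attack, n):      # sustain states: stay w.p. q, advance w.p. p
        A[i, i] = 1.0 - p
        A[i, i + 1] = p
    return A

A = note_subgraph(n_attack=2, m=5, p=0.4)
assert np.allclose(A.sum(axis=1), 1.0)  # each row is a proper distribution
```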
Our data model is composed of three features, $b_t(y_t)$, $e_t(y_t)$, $s_t(y_t)$, assumed to be conditionally independent given the state:

$$P(b_t, e_t, s_t \mid x_t) = P(b_t \mid x_t)\, P(e_t \mid x_t)\, P(s_t \mid x_t).$$
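One practical consequence of this factorization is that the frame log-likelihood is simply a sum of three per-feature terms. The sketch below uses hypothetical Gaussian stand-ins with invented parameters; the actual feature models are described next.

```python
# Minimal sketch of the factored data model: conditional independence
# turns the frame log-likelihood into a sum of three feature terms.
from scipy.stats import norm

def frame_loglik(b_t, e_t, s_t, params) -> float:
    """log P(b_t, e_t, s_t | x_t) = sum of three feature log-likelihoods."""
    return (norm.logpdf(b_t, *params["burst"])      # burstiness: attack vs. sustain
            + norm.logpdf(e_t, *params["energy"])   # energy: note vs. rest
            + params["spectrum_loglik"](s_t))       # pitch (see Figure 2 below)

params = {"burst": (0.0, 1.0), "energy": (5.0, 2.0),
          "spectrum_loglik": lambda s: 0.0}         # placeholder for the pitch model
print(frame_loglik(0.2, 4.8, None, params))
```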
The first feature, $b_t$, measures the local "burstiness" of the signal, which is particularly useful in distinguishing between note attacks and steady-state behavior; observe that we distinguished between the attack portion and the steady-state portion of a note in Figure 1. The second feature, $e_t$, measures the local energy, useful in distinguishing between rests and notes. By far the most important, however, is the vector-valued feature $s_t$, which is well suited to making pitch discriminations, as follows.
We let $f_n$ denote the frequency associated with the nominal pitch of the nth score note. As with any quasi-periodic signal with frequency $f_n$, we expect that the audio data from the nth note will have a magnitude spectrum composed of "peaks" at integral multiples of $f_n$. This is modeled by the Gaussian mixture model depicted in Figure 2,

$$q_n(j) = \sum_h w_h\, N(j;\, h f_n,\, \sigma_h^2),$$

where $\sum_h w_h = 1$ and $N(j; \mu, \sigma^2)$ is a discrete approximation of a Gaussian distribution. The model captures the note's "spectral envelope," describing the way energy is distributed over the frequency range. In addition, due to the logarithmic nature of pitch, frequency "errors" committed by the player are proportional to the desired frequency. This is captured in our model by the increasing variance of the mixture components. We define $s_t$ to be the magnitude spectrum of $y_t$, normalized to sum to a constant value, C. If we believe the nth note is sounding in the tth frame, we regard $s_t$ as the histogram of a random sample of size C from the distribution $q_n$.
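Under these modeling assumptions, a sketch of the spectral template and its likelihood might look as follows. The harmonic weights, the sigma value, and the frequency grid are invented for illustration, and the multinomial scoring of $s_t$ follows the sample-of-size-C reading above.

```python
# Sketch: build the note's idealized spectrum as a mixture of discretized
# Gaussians at harmonics h*f_n, with standard deviation growing with h
# (log-pitch errors scale with frequency), then score s_t multinomially.
import numpy as np

def note_template(f_n, freqs, weights, sigma0=0.01):
    """q_n(j): idealized magnitude spectrum over the frequency grid."""
    q = np.zeros_like(freqs)
    for h, w in enumerate(weights, start=1):        # h-th harmonic
        mu, sd = h * f_n, h * sigma0 * f_n          # variance grows with h
        q += w * np.exp(-0.5 * ((freqs - mu) / sd) ** 2) / sd
    return q / q.sum()                              # discrete distribution over bins

def spectrum_loglik(s_t, q_n):
    """Multinomial log-likelihood (up to a constant) of s_t, normalized to
    sum to C, under the template q_n."""
    return float(s_t @ np.log(q_n + 1e-12))

freqs = np.linspace(0.0, 1000.0, 512)
q = note_template(f_n=220.0, freqs=freqs, weights=[0.5, 0.25, 0.15, 0.1])
s_t = 100.0 * q                                     # a perfectly matching frame, C = 100
print(spectrum_loglik(s_t, q))
```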
[Figure 2. An idealized note spectrum modeled as a mixture of Gaussians; probability plotted against frequency (Hz), from 0 to 1000.]