The Time-Course of Pulse Sensation: Dynamics of Beat Induction
Petri Toiviainen, Department of Music, University of Jyväskylä
Joel Snyder, Department of Psychology, Cornell University
Introduction
The ability to infer beat and meter from music is one of the basic activities of musical
cognition. It is a rapid process: after having heard only a short fragment of music we are
able to develop a sense of beat and meter and tap our foot along with it. Even if the music
is rhythmically complex, containing a range of different time intervals and possibly
syncopation, we are capable of inferring its different periodicities and synchronizing
to them. A rhythmic sequence usually evokes a number of different pulse sensations,
each of which has a different perceptual salience. The listener can switch the focus of
attention from one to another at will (Jones, Boltz, & Kidd 1982). Furthermore, for a
given piece of music, the most salient pulse sensation can vary between listeners.
The salience of a given pulse sensation depends on a number of factors related to
the surface and structural properties of music. These factors include the frequency of tone
onsets that coincide with the pulse (Palmer & Krumhansl, 1990) and the phenomenal
accents of these notes (Lerdahl & Jackendoff, 1983). Phenomenal accents arise from
surface properties of music such as pitch, duration, and loudness. For instance, a long
note is usually perceived as more accented than a short one (Parncutt, 1994).
In addition to the temporal structure of music, pitch information may affect the
salience of pulse sensations. Evidence for this can be found from studies on the
interaction between pitch and rhythm. For instance, memory recall of melodies is
impaired if pitch and rhythmic patterns are out of phase (Boltz & Jones, 1986; Deutsch,
1980; Monahan, Kendall, & Carterette, 1987). In addition to melodic information,
harmonic information has been found to be important in deducing the meter (Dawe, Platt,
& Racine, 1994). Snyder and Krumhansl (1999) found that pitch information affected
the mode of tapping to ragtime excerpts: when pitch information was present, the subjects
tapped more frequently on the down beat than on the up beat. The effect of pitch
information on other performance measures, however, was not significant.
A further factor that affects the salience of pulse sensation is the pulse period
(Fraisse, 1982; Parncutt, 1994; van Noorden & Moelants, 1999; Clarke, 1999). According
to these studies, the most salient pulse sensations have a period of approximately 600
msec, the region of greatest salience being between 400 and 900 msec.
Models of beat induction presented to date have been based on various
computational formalisms. These include symbolic systems (Longuet-Higgins & Lee, 1982),
statistical approaches (Palmer & Krumhansl, 1990; Brown, 1993), optimization approaches
(Povel & Essens, 1985; Parncutt, 1994), control theory (Dannenberg & Mont-Reynaud, 1987),
connectionist models (Desain & Honing, 1989; Scarborough, Miller, & Jones, 1992; Gasser,
Eck, & Port, 1999), and oscillator models (Scheirer, 1998; Large & Kolen, 1994; McAuley &
Kidd, 1998). All these models, except for the one by Scheirer (1998), rely solely on the
temporal structure and thus ignore features related to pitch. Scheirer's model uses audio
input that is passed through a bank of band-pass filters.
The present study explores the time-course of pulse sensation and its dependence
on various musical features, such as onset time structure, pitch height, and harmonic
structure. A system of resonating oscillators is used to model the process. The stimuli
used and the performance measures obtained from the tapping experiment by Snyder and
Krumhansl (1999) were used to optimize and evaluate the model.
Model of beat induction
The beat induction model used in this study takes pitch and onset time information as
input. This information can be obtained either using a MIDI representation or by
preprocessing acoustical input. In the present study, a MIDI representation was used. The
model is based on a set of competing oscillators. Each oscillator represents a pulse
sensation evoked by the input. Oscillators are created dynamically as the music unfolds in
time: new oscillators are created at each tone onset, their initial periods being equal to the
interval between that onset and a previous onset. In principle, all possible combinations
of starting points and periods can be considered. In practice, only oscillators that
represent a pulse sensation not present up to that instant need to be taken into account. If
the music is performed and thus contains expressive timing, it is necessary to use
adaptive oscillators (Large & Kolen, 1994; Toiviainen, 1998).
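To make the oscillator-creation step concrete, here is a minimal Python sketch, not taken from the authors' implementation: it assumes MIDI-derived onset times in seconds, and the class and function names, the admissible period range, and the duplicate tolerance are illustrative choices.

```python
# Illustrative sketch only: oscillators are created dynamically from onset times,
# with initial periods equal to the intervals back to earlier onsets.
from dataclasses import dataclass

@dataclass
class Oscillator:
    phase_origin: float         # onset time at which the oscillator starts (phase 0)
    period: float               # initial period = inter-onset interval (seconds)
    resonance: float = 0.0      # r_i in Eq. (1)
    resonance_vel: float = 0.0  # first time derivative of r_i

def _same_pulse(osc, t, period, tol):
    """True if an existing oscillator already represents this period and phase."""
    offset = (t - osc.phase_origin) % osc.period
    return abs(osc.period - period) < tol and min(offset, osc.period - offset) < tol

def create_oscillators(onsets, min_period=0.2, max_period=2.0, tol=0.02):
    """Scan onsets in temporal order and spawn oscillators only for pulse
    sensations not already represented. Period range and tolerance are assumptions."""
    oscillators = []
    for i, t in enumerate(onsets):
        for t_prev in onsets[:i]:
            period = t - t_prev
            if not (min_period <= period <= max_period):
                continue
            if not any(_same_pulse(o, t, period, tol) for o in oscillators):
                oscillators.append(Oscillator(phase_origin=t, period=period))
    return oscillators

# Example: a simple syncopated onset pattern (seconds)
pulses = create_oscillators([0.0, 0.5, 0.75, 1.0, 1.5, 2.0])
```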
The perceptual salience of each pulse sensation is modeled with the resonance
value of the respective oscillator. The contribution of each tone to the resonance of each
oscillator depends on the degree of synchrony between the tone onset and the oscillator’s
pulse, the inter-onset interval following the onset, and the pitch of the tone. To study the
effect of these different factors, three different models were used.
Model 1. Model 1 relies solely on the temporal structure of the music. The resonance
dynamics are modeled with a damped system driven by an external force. More
specifically, the resonance value of oscillator i is determined by
    \ddot{r}_i = f - c\,(\dot{r}_i + r_i / \tau_r),                                (1)

where f is the driving force, c is the damping constant, and τ_r is the time constant. The
first-order time derivative ṙ_i is included in order to smooth the resonance function; for
the damping constant, the value c = 1 sec⁻¹ is used. The parameter τ_r models the length
of the temporal integration window. In the absence of any external force, the resonance
value decays approximately by a factor of 1/e ≈ 0.37 during an interval of τ_r.
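The decay rate can be checked against the reconstructed form of Eq. (1): setting f = 0 gives a homogeneous second-order system whose slower eigenvalue governs the decay. The short derivation below assumes that reconstruction; the approximation is exact only when c·τ_r is large.

```latex
% Homogeneous case of the reconstructed Eq. (1), with f = 0:
%   \ddot{r}_i + c\,\dot{r}_i + (c/\tau_r)\, r_i = 0 .
% Characteristic equation and roots:
\[
s^{2} + c\,s + \frac{c}{\tau_r} = 0
\qquad\Longrightarrow\qquad
s = \frac{-c \pm \sqrt{c^{2} - 4c/\tau_r}}{2}.
\]
% For c\,\tau_r \gg 4 the slower root is s \approx -1/\tau_r, so the unforced
% resonance decays roughly as e^{-t/\tau_r}, i.e. by a factor of about 1/e over
% an interval of length \tau_r. With c = 1 s^{-1} and \tau_r = 4 s the system is
% critically damped and the factor is \approx 0.41, still close to 1/e.
```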
The driving force has the form
    f(t) = a_i(t^*)\, e^{-(t - t^*)/\tau_f}, \qquad t^*(t) = \max_{t_i \le t} t_i,        (2)

where a_i(t*) is the output of oscillator i at the most recent tone onset t*, and τ_f is the
time constant of the force's decay. According to Equations 1 and 2, oscillators whose
output is at its peak at a tone onset increase their resonance until the next note onset.
Due to the exponential decay of the driving force, the increase in resonance is
proportional to the perceived durational accent of the respective tone (Parncutt, 1994).
The resonance value of each oscillator is weighted according to its oscillation period:
the closer the period is to the period of the most salient pulse sensations, the higher the
weighting.
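As an illustration of how Eqs. (1) and (2), as reconstructed above, might be integrated numerically, the sketch below uses a simple explicit Euler step; the force time constant τ_f and the log-Gaussian period weighting centred near 600 msec are assumptions, since their exact forms and values are not stated in the text.

```python
import math

def resonance_step(r, r_vel, t, dt, t_star, a_star, c=1.0, tau_r=4.0, tau_f=0.5):
    """One explicit-Euler step of Eqs. (1)-(2) for a single oscillator.

    r, r_vel : current resonance value and its first time derivative
    t_star   : time of the most recent tone onset (t* in Eq. 2)
    a_star   : the oscillator's output at that onset (a_i(t*) in Eq. 2)
    tau_f    : decay constant of the driving force (value assumed here)
    """
    force = a_star * math.exp(-(t - t_star) / tau_f)   # Eq. (2)
    r_acc = force - c * (r_vel + r / tau_r)            # Eq. (1)
    return r + dt * r_vel, r_vel + dt * r_acc

def period_weight(period, preferred=0.6, width=0.3):
    """Weight a resonance value by how close the oscillator's period is to the most
    salient pulse period (~600 msec); the log-Gaussian form is an assumption."""
    return math.exp(-(math.log(period / preferred) ** 2) / (2.0 * width ** 2))

# Example: advance one oscillator by 10 msec, 40 msec after an onset where its output was 1.0
r, r_vel = resonance_step(r=0.2, r_vel=0.0, t=1.04, dt=0.01, t_star=1.0, a_star=1.0)
```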
At each instant, the oscillator with the highest resonance represents the perceived
pulse. This oscillator is referred to as the winner. To model the stability in maintaining
the tapping mode observed in tapping studies, the winner is changed only when the
resonance of some other oscillator exceeds that of the current winner by the switching
threshold θ_sw. In other words, a switch in the tapping mode occurs when

    \max_i r_i > (1 + \theta_{sw})\, r_{winner}.                                (3)

The model produces a tap whenever the winner oscillator has zero phase and its
resonance exceeds the tapping threshold θ_tap.
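A minimal sketch of the winner-switching rule of Eq. (3), as reconstructed above, and of the tap criterion might look as follows; the phase representation (normalized to [0, 1)) and the phase tolerance are assumptions.

```python
def pick_winner(resonances, winner_idx, theta_sw=0.2):
    """Winner-switching rule of Eq. (3): the winner changes only when some other
    oscillator's resonance exceeds the current winner's by more than theta_sw."""
    best = max(range(len(resonances)), key=lambda i: resonances[i])
    if winner_idx is None:
        return best
    if resonances[best] > (1.0 + theta_sw) * resonances[winner_idx]:
        return best
    return winner_idx

def should_tap(phase, resonance, theta_tap=0.4, phase_tol=0.02):
    """Tap criterion: the winner is (approximately) at zero phase and its resonance
    exceeds the tapping threshold. Phase is assumed normalized to [0, 1)."""
    return resonance > theta_tap and min(phase, 1.0 - phase) < phase_tol

# Example: oscillator 2 takes over only because it beats the current winner by >20%
winner = pick_winner([0.30, 0.45, 0.58], winner_idx=1, theta_sw=0.2)
```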
Model 2. Model 2 is similar to model 1, with the addition that it takes pitch height into
account. It does so by passing the tone information through a bank of Gaussian filters that
are equidistantly spaced along the pitch dimension. This filter bank divides the input into
several pitch channels, to each of which the resonance dynamics scheme is applied
separately. The model thus segregates the input into a set of streams depending on pitch
height. For each pulse mode, the resonance value is then obtained by summing the
resonance values across all the channels. Each pitch channel has an individual weight
that depends on the center pitch of the channel according to

    w = e^{-\kappa (p - 64)},                                                    (4)

where p is the center pitch, with 64 corresponding to C4, and κ is the pitch weighting
parameter. When κ = 0, all channels receive an equal weighting; when κ > 0, low
pitches receive a higher weighting than high pitches.
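The pitch-channel decomposition could be sketched as below. The channel spacing and the Gaussian width are illustrative assumptions; the channel weight follows Eq. (4), and the weighted sum across channels follows the description above.

```python
import math

def channel_responses(pitch, centers, sigma=6.0):
    """Gaussian filter bank on the pitch dimension (MIDI note numbers).
    Channel spacing and sigma are illustrative choices."""
    return [math.exp(-((pitch - c) ** 2) / (2.0 * sigma ** 2)) for c in centers]

def channel_weight(center_pitch, kappa=0.03):
    """Eq. (4): w = exp(-kappa * (p - 64)); lower channels get larger weights for kappa > 0."""
    return math.exp(-kappa * (center_pitch - 64))

def combined_resonance(per_channel_resonance, centers, kappa=0.03):
    """Resonance of one pulse mode: weighted sum of its per-channel resonances."""
    return sum(channel_weight(c, kappa) * r
               for c, r in zip(centers, per_channel_resonance))

# Example: channel centres every 6 semitones; a middle-C tone mostly excites
# the channels nearest MIDI 60.
centers = list(range(28, 101, 6))
print(channel_responses(60, centers))
```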
Model 3. Model 3 is similar to model 2, with the addition that it weights the notes
according to their tonal significance. It assumes that tonally significant tones increase the
salience of the pulses with which they co-occur more than do less significant tones. The
model uses the key-finding algorithm by Krumhansl (1990), with the modification that it
uses an exponential time window for integrating the pitch information. For each tone, the
driving force of equation 2 is weighted by the value of the respective component of the
probe tone profile (Krumhansl & Kessler, 1982) of the current key. Whenever several
notes occur simultaneously, the average of their probe tone profile values is used.
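A possible sketch of the tonal weighting is given below. It assumes the standard correlational key-finding procedure described in Krumhansl (1990), restricted here to major keys for brevity, and uses the commonly reported Krumhansl & Kessler (1982) major probe-tone ratings; the window length tau_key and the note representation are assumptions.

```python
import math

# Major-key probe-tone ratings (Krumhansl & Kessler, 1982), C-major ordering;
# values as commonly reported. A full model would also include the minor profiles.
MAJOR_PROFILE = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]

def pc_distribution(notes, now, tau_key=4.0):
    """Exponentially weighted pitch-class distribution over the notes heard so far.
    notes: iterable of (onset_time, midi_pitch, duration); tau_key is an assumed window."""
    dist = [0.0] * 12
    for onset, pitch, dur in notes:
        if onset <= now:
            dist[pitch % 12] += dur * math.exp(-(now - onset) / tau_key)
    return dist

def _corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

def estimate_key(dist):
    """Tonic (0-11) of the best-matching major key, by correlating the distribution
    with rotated probe-tone profiles (major keys only in this sketch)."""
    def profile_for(tonic):
        return [MAJOR_PROFILE[(pc - tonic) % 12] for pc in range(12)]
    return max(range(12), key=lambda k: _corr(dist, profile_for(k)))

def tonal_weight(pitch, tonic):
    """Probe-tone value of a tone in the current key; this factor scales the driving
    force of Eq. (2). Simultaneous notes would have their values averaged."""
    return MAJOR_PROFILE[(pitch - tonic) % 12]
```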
Tapping experiment
Stimuli. The stimuli consisted of seven ragtime excerpts used in Snyder and Krumhansl
(1999). Each excerpt had a metronomic timing and an equalized MIDI velocity. Four
versions of each excerpt were used: 1) full pitched, 2) full monotonic, 3) RH pitched (a
pitched version of only the right-hand notes), and 4) RH monotonic (a monotonic version
of only the right-hand notes). A total of 28 stimuli were thus used. The length of each
stimulus was ~40 sec.
Subjects. Twelve musically experienced students participated in the tapping experiment.
Each subject was asked to tap the most comfortable pulse of each excerpt.
Performance measures. Six performance measures recorded in the tapping study were
used in the present study. These were 1) the beat to start tapping (BST); the proportion of
time spent in each of the following tapping modes: 2) on the down-beat (down), 3) on the
up-beat (up), 4) periodically but neither on the down-beat nor on the up-beat (neither),
and 5) aperiodically (aper); and 6) the number of switches between tapping modes.
Results. It was found that the subjects tapped significantly more on the down-beat and
less on the up-beat for the pitched than for the monotonic versions. For the other
performance measures used in this study there was, however, no significant difference
between pitched and monotonic versions. For the RH version, the subjects tapped
significantly less on the down-beat and more on neither or aperiodically than for the full
versions. Moreover, with the RH versions there were significantly more switches than
with the full versions.
Optimization of the models
Each of the three models described above was optimized with respect to the following
parameters: the time constant for temporal integration, τ_r; the pitch weighting, κ; the
tapping threshold, θ_tap; and the switching threshold, θ_sw. The optimization was carried out using the
technique of simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983) as follows. For a
given combination of parameter values, the six aforementioned performance measures
were calculated using each of the 28 stimuli as input. An error function was defined as
the sum of absolute errors between the model’s and the humans’ performance measures,
taken across the 28 stimuli and the six performance measures. This error function was
minimized with respect to the parameter values.
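The optimization loop could be sketched as follows. The proposal distribution, cooling schedule, and step count are assumptions, and error_fn stands for a routine that runs a model on all 28 stimuli and sums the absolute differences from the human performance measures.

```python
import math
import random

def anneal(error_fn, init, steps=2000, t0=1.0, cooling=0.995, scale=0.05):
    """Minimal simulated-annealing loop (Kirkpatrick et al., 1983) over the four
    model parameters. Proposal scale, cooling schedule, and step count are guesses."""
    current, e_cur = dict(init), error_fn(init)
    best, e_best = dict(init), e_cur
    temp = t0
    for _ in range(steps):
        # Gaussian perturbation of every parameter.
        candidate = {k: v + random.gauss(0.0, scale * abs(v) + 1e-3)
                     for k, v in current.items()}
        e_cand = error_fn(candidate)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if e_cand < e_cur or random.random() < math.exp(-(e_cand - e_cur) / temp):
            current, e_cur = candidate, e_cand
            if e_cur < e_best:
                best, e_best = dict(candidate), e_cand
        temp *= cooling
    return best, e_best

# Usage (illustrative), starting near the reported optimum:
# params, err = anneal(my_error_fn,
#                      {"tau_r": 4.0, "kappa": 0.03, "theta_tap": 0.4, "theta_sw": 0.2})
```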
Each of the three models had approximately equal optimal parameter values.
These were τ_r = 4 sec, κ = 0.03, θ_tap = 0.4, and θ_sw = 0.2. Thus the optimal values were
obtained using a temporal integration length of 4 seconds, weighting the pitches so that
the weighting increases by a factor of approximately 1.4 for each descending octave, and
accepting switches only when the maximum resonance exceeds that of the winner by at
least 20 percent. The meaning of the optimal tapping threshold value, θ_tap = 0.4, is more
difficult to interpret.
Comparison between human and model data
RMS errors. The total RMS error values for the optimized models were 1.78, 1.82, and
1.19 for models 1, 2, and 3, respectively. In terms of the total RMS error, model 3 thus
performed best, followed by model 1 and model 2, in that order. Table 1 shows the rootmean-square (RMS) errors for each performance measure and model separately. For the
performance measures BST, down, up, and switches, the lowest RMS error was obtained
with model 3. The lowest RMS errors for the performance measures neither and aper
were obtained with models 2 and 1, respectively.
TABLE 1. RMS errors between human and model data

          BST      down     up       neither  aper     switches
model 1   0.245    0.317    0.432    0.153    0.226    0.770
model 2   0.254    0.289    0.373    0.106    0.266    0.897
model 3   0.242    0.240    0.312    0.135    0.248    0.354
Correlations. Table 2 shows the correlations between the human and model data for
each performance measure and model. The average correlations, taken across the six
performance measures, are 0.417, 0.463, and 0.553 for models 1, 2, and 3,
respectively. As can be seen, the highest correlation for all performance measures except
switches was obtained with model 3. For model 3, all the correlations except that for BST
are significant at the p<0.05 level. Figure 1 shows the performance measures of the
subjects and model 3 for each of the stimuli as scatter plots. As can be seen, model 3 can
predict the performance measures down, up, and neither relatively well. Further, the
model produces considerably less aperiodic tapping and slightly more switches between
tapping modes than do the subjects.
TABLE 2. Correlations between human and model data

          BST      down       up        neither    aper      switches
model 1   0.017    0.550**    0.397*    0.889***   0.300     0.347
model 2   0.087    0.544**    0.423*    0.901***   0.260     0.563**
model 3   0.313    0.687***   0.510**   0.942***   0.375*    0.492**

*p<0.05, **p<0.01, ***p<0.0001
Discussion. In terms of both RMS errors and correlations, model 3 fits better with the
human data than do models 1 and 2. This suggests that, at least in ragtime music, tonal
cues may be used in determining the phase of tapping. Models 1 and 2 performed
almost equally well in terms of the RMS error, whereas model 2 correlated with the
human data slightly better than did model 1. The main contribution to the latter difference
comes from the correlation for the number of switches. In terms of predicting
the tapping mode, there was thus no significant difference between models 1 and 2. This
may imply that pitch height information was not used by the subjects.
Conclusion
We studied the dependence of pulse sensation evoked by ragtime music on temporal,
pitch, and tonal factors. For this we used three different models. Model 1 relied on
temporal aspects only. Model 2 segregated the input into a set of streams using pitch
height information and weighted each of the streams differently. Model 3 took into account the
tonal significance of each note. The output of each of the models was compared with
human data obtained using the same set of stimuli. It was found that model 2 did not
perform significantly better than model 1. Consequently, the subjects may not have used
pitch height information when tapping. Model 3, on the other hand, performed
significantly better than the other two models. This suggests that tonal significance of
tones may affect the perception of pulse. More specifically, tones that are high in the
tonal hierarchy may be perceived as more accentuated.
References
Boltz, M., & Jones, M.R. (1986). Does rule recursion make melodies easier to reproduce?
If not, what does? Cognitive Psychology, 18, 389-431.
Brown, J. C. (1993). Determination of meter of musical scores by autocorrelation.
Journal of the Acoustical Society of America, 94(4), 1953-1957.
Clarke E. F. (1999). Rhythm and timing in music. In D. Deutsch (Ed.), The psychology of
music (2nd ed., pp. 473-500). New York: Academic Press.
Dannenberg, R. B. & Mont-Reynaud, B. (1987). Following a jazz improvisation in real
time. In Proceedings of the 1987 International Computer Music Conference. San
Francisco: International Computer Music Association, 241-248.
Dawe, L. A., Platt, J. R., & Racine, R. J. (1994). Inference of metrical structure from
perception of iterative pulses within time spans defined by chord changes. Music
Perception, 12(1), 57-76.
Desain, P. & Honing, H. (1989). The quantization of musical time: a connectionist
approach. Computer Music Journal, 13(3), 56-66.
Deutsch, D. (1980). The processing of structured and unstructured tonal sequences.
Perception & Psychophysics, 28, 381-389.
Fraisse, P. (1982). Rhythm and tempo. In D. Deutsch (Ed.), The psychology of music (2nd
ed., pp. 149-180). New York: Academic Press.
Gasser, M., Eck, D., & Port, R. (1999). Meter as mechanism: a neural network model that
learns musical patterns. Connection Science, 11, 187-215.
Jones, M. R., Boltz, M., & Kidd, G. (1982). Controlled attending as a function of melodic
and temporal context. Perception & Psychophysics, 32, 211-218.
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated
annealing. Science, 220, 671-680.
Krumhansl, C. L. & Kessler, E. J. (1982). Tracing the dynamic changes in perceived
tonal organization in a spatial representation of musical keys. Psychological
Review, 89, 334-368.
Large, E. W. & Kolen, J. F. (1994). Resonance and the perception of musical meter.
Connection Science, 6(2-3), 177-208.
Lerdahl, F. & Jackendoff, R. (1983). A generative theory of tonal music. Cambridge,
MA: MIT Press.
Longuet-Higgins, H. C. & Lee, C. S. (1982). Perception of musical rhythms. Perception,
11, 115-128.
McAuley, J. D., & Kidd, G.R. (1998). Effect of deviations from temporal expectations on
tempo discrimination of isochronous tone sequences. Journal of Experimental
Psychology: Human Perception and Performance, 24, 1786-1800.
Monahan, C. B., Kendall, R. A., & Carterette, E. C. (1987). The effect of melodic and
temporal contour on recognition memory for pitch change. Perception &
Psychophysics, 41, 576-600.
Palmer, C. & Krumhansl, C. (1990). Mental representations of musical meter. Journal of
Experimental Psychology: Human Perception and Performance, 16, 728-741.
Parncutt, R. (1994). A perceptual model of pulse salience and metrical accent in musical
rhythms. Music Perception, 11(4), 409-464.
Povel, D. J. & Essens, P. (1985). Perception of temporal patterns. Music Perception, 2(4),
411-440.
Scarborough, D. L., Miller, B. O., & Jones, J. A. (1992). On the perception of meter. In
M. Balaban, K. Ebcioglu & O. Laske (Eds.), Understanding music with AI:
perspectives in music cognition. Cambridge, MA: MIT Press, 427-447.
Scheirer, E. D. (1998). Tempo and beat analysis of acoustic musical signals. Journal of
the Acoustical Society of America, 103(1), 588-601.
Snyder, J., & Krumhansl, C. L. (1999). Cues to pulse-finding in piano ragtime music.
Society for Music Perception and Cognition Abstracts. Evanston, IL.
Toiviainen, P. (1998). An interactive MIDI accompanist. Computer Music Journal, 22(4),
63-75.
van Noorden, L., & Moelants, D. (1999). Resonance in the perception of musical pulse.
Journal of New Music Research, 28, 43-66.
[Figure 1 appears here as an image in the original; its six panels plot BST, down, up, neither, aper, and switches, human data against model data.]
Figure 1. Scatter plots of the six performance measures taken from subjects and model 3.
Each point represents one stimulus; its abscissa and ordinate correspond to human and
model data, respectively.