The Time-Course of Pulse Sensation: Dynamics of Beat Induction

Petri Toiviainen, Department of Music, University of Jyväskylä
Joel Snyder, Department of Psychology, Cornell University

Introduction

The ability to infer beat and meter from music is one of the basic activities of musical cognition. It is a rapid process: after hearing only a short fragment of music we are able to develop a sense of beat and meter and tap a foot along with it. Even if the music is rhythmically complex, containing a range of different time intervals and possibly syncopation, we are capable of inferring its different periodicities and synchronizing with them. A rhythmic sequence usually evokes a number of different pulse sensations, each with a different perceptual salience. The listener can switch the focus of attention from one to another at will (Jones, Boltz, & Kidd, 1982). Furthermore, for a given piece of music, the most salient pulse sensation can vary between listeners.

The salience of a given pulse sensation depends on a number of factors related to the surface and structural properties of the music. These factors include the frequency of tone onsets that coincide with the pulse (Palmer & Krumhansl, 1990) and the phenomenal accents of these notes (Lerdahl & Jackendoff, 1983). Phenomenal accents arise from surface properties of music such as pitch, duration, and loudness. For instance, a long note is usually perceived as more accented than a short one (Parncutt, 1994). In addition to the temporal structure of music, pitch information may affect the salience of pulse sensations. Evidence for this can be found in studies on the interaction between pitch and rhythm. For instance, memory recall of melodies is impaired if pitch and rhythmic patterns are out of phase (Boltz & Jones, 1986; Deutsch, 1980; Monahan, Kendall, & Carterette, 1987).
In addition to melodic information, harmonic information has been found to be important in deducing the meter (Dawe, Platt, & Racine, 1994). Snyder and Krumhansl (1999) found that pitch information affected the mode of tapping to ragtime excerpts: when pitch information was present, the subjects tapped more frequently on the down-beat than on the up-beat. The effect of pitch information on the other performance measures, however, was not significant. A further factor that affects the salience of a pulse sensation is the pulse period (Fraisse, 1982; Parncutt, 1994; van Noorden & Moelants, 1999; Clarke, 1999). According to these studies, the most salient pulse sensations have a period of approximately 600 msec, the region of greatest salience lying between 400 and 900 msec.

Models of beat induction presented to date have been based on various computational formalisms. These include symbolic systems (Longuet-Higgins & Lee, 1982), statistical approaches (Palmer & Krumhansl, 1990; Brown, 1993), optimization approaches (Povel & Essens, 1985; Parncutt, 1994), control theory (Dannenberg & Mont-Reynaud, 1987), connectionist models (Desain & Honing, 1989; Scarborough, Miller, & Jones, 1992; Gasser, Eck, & Port, 1999), and oscillator models (Scheirer, 1998; Large & Kolen, 1994; McAuley & Kidd, 1998). All these models, except for the one by Scheirer (1998), rely solely on the temporal structure of the music, thus ignoring features related to pitch. Scheirer's model uses audio input that is passed through a bank of band-pass filters.

The present study explores the time-course of pulse sensation and its dependence on various musical features, such as onset time structure, pitch height, and harmonic structure. A system of resonating oscillators is used to model the process. The stimuli and performance measures from the tapping experiment by Snyder and Krumhansl (1999) were used to optimize and evaluate the model.
Model of beat induction

The beat induction model used in this study takes pitch and onset time information as input. This information can be obtained either from a MIDI representation or by preprocessing acoustical input. In the present study, a MIDI representation was used.

The model is based on a set of competing oscillators. Each oscillator represents a pulse sensation evoked by the input. Oscillators are created dynamically as the music unfolds in time: new oscillators are created at each tone onset, their initial periods being equal to the interval between that onset and a previous onset. In principle, all possible combinations of starting points and periods can be considered. In practice, only oscillators that represent a pulse sensation not present up to that instant need to be taken into account. If the music is performed, and thus contains expressive timing, it is necessary to use adaptive oscillators (Large & Kolen, 1994; Toiviainen, 1998).

The perceptual salience of each pulse sensation is modeled with the resonance value of the respective oscillator. The contribution of each tone to the resonance of each oscillator depends on the degree of synchrony between the tone onset and the oscillator's pulse, the inter-onset interval following the onset, and the pitch of the tone. To study the effect of these different factors, three different models were used.

Model 1. Model 1 relies solely on the temporal structure of the music. The resonance dynamics are modeled with a damped system driven by an external force. More specifically, the resonance value of oscillator i is determined by

    r̈_i = f − c(ṙ_i + r_i/τ_r),    (1)

where f is the driving force, c is the damping constant, and τ_r is the time constant. The first-order time derivative ṙ_i is included in order to smooth the resonance function; for the damping constant, the value c = 1 sec⁻¹ is used. The parameter τ_r models the length of the temporal integration window.
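To make the resonance dynamics concrete, Equation 1 (read here as r̈ = f − c(ṙ + r/τ_r)) can be integrated numerically. The sketch below is a minimal illustration, not the authors' implementation: it assumes a simple Euler scheme and a constant driving force, held at zero to show the free decay.

```python
def simulate_resonance(r0=1.0, f=0.0, c=1.0, tau_r=4.0, t_end=4.0, dt=1e-3):
    """Integrate Equation 1, r'' = f - c * (r' + r / tau_r), with a
    simple Euler scheme; f is held constant here for illustration."""
    r, dr = r0, 0.0
    for _ in range(int(t_end / dt)):
        ddr = f - c * (dr + r / tau_r)  # Equation 1
        dr += ddr * dt
        r += dr * dt
    return r

# With no driving force, the resonance decays over one time constant
# (tau_r = 4 sec) by roughly the factor 1/e stated in the text.
r_final = simulate_resonance()
```

With c = 1 sec⁻¹ and τ_r = 4 sec the system is close to critically damped, so the decay over one interval of τ_r is only approximately the 1/e factor.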
In the absence of any external force, the resonance value decays approximately by a factor of 1/e ≈ 0.37 during an interval of τ_r. The driving force has the form

    f(t) = a_i(t*) e^(−(t − t*)/τ_f),  t*(t) = max{t_i : t_i ≤ t},    (2)

where a_i(t*) is the output of oscillator i at the most recent tone onset t*, and τ_f is the decay constant of the force. According to Equations 1 and 2, the oscillators that are at the peak of their output start to increase their resonance up to the next note onset. Due to the exponential decay of the driving force, the increase in resonance is proportional to the perceived durational accent of the respective tone (Parncutt, 1994). The resonance value of each oscillator is weighted according to its oscillation period: the closer the period is to the period of the most salient pulse sensations, the higher the weighting.

At each instant, the oscillator with the highest resonance represents the perceived pulse. This oscillator is referred to as the winner. To model the stability in maintaining the tapping mode observed in tapping studies, the winner is changed only when the highest resonance value exceeds that of the winner by a switching threshold θ_sw. In other words, a switch in the tapping mode occurs when

    max_i r_i > (1 + θ_sw) r_winner.    (3)

The model produces a tap whenever the winner oscillator has zero phase and its resonance exceeds the tapping threshold θ_tap.

Model 2. Model 2 is similar to model 1, with the addition that it takes pitch height into account. It does so by passing the tone information through a bank of Gaussian filters that are equidistantly spaced on the pitch dimension. This filter bank divides the input into several pitch channels, to each of which the resonance dynamics scheme is applied separately. The model thus segregates the input into a set of streams depending on pitch height. For each pulse mode, the resonance value is then obtained by summing the resonance values across all the channels.
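The stream segregation in model 2 can be sketched as follows. The channel spacing and Gaussian width used here are assumptions for illustration only (the text specifies just that the filters are Gaussian and equidistant), and the helper name is hypothetical.

```python
import math

def gaussian_channel_outputs(pitch, centers, sigma=3.0):
    """Split a tone across a bank of Gaussian pitch filters.
    The filter width sigma (in pitch steps) and the channel centers
    are illustrative assumptions, not values from the paper."""
    return [math.exp(-((pitch - c) ** 2) / (2 * sigma ** 2)) for c in centers]

# Channels centered one octave apart around pitch 64 (assumed spacing):
centers = list(range(40, 89, 12))

# A tone near a channel's center contributes mostly to that channel,
# so each stream's resonance dynamics are driven largely by tones in
# its own pitch region.
outputs = gaussian_channel_outputs(64, centers)
```

Summing each oscillator's resonance across channels then recovers a single salience value per pulse mode, as described above.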
Each pitch channel has an individual weight that depends on the center pitch of the channel according to

    w = e^(−κ(p − 64)),    (4)

where p is the center pitch, with 64 corresponding to C4, and κ is the pitch weighting parameter. When κ = 0, all channels receive an equal weighting; when κ > 0, low pitches receive a higher weighting than high pitches.

Model 3. Model 3 is similar to model 2, with the addition that it weights the notes according to their tonal significance. It assumes that tonally significant tones increase the salience of the pulses with which they co-occur more than do less significant tones. The model uses the key-finding algorithm of Krumhansl (1990), with the modification that it uses an exponential time window for integrating the pitch information. For each tone, the driving force of Equation 2 is weighted by the value of the respective component of the probe tone profile (Krumhansl & Kessler, 1982) of the current key. Whenever several notes occur simultaneously, the average of their probe tone profile values is used.

Tapping experiment

Stimuli. The stimuli consisted of the seven ragtime excerpts used in Snyder and Krumhansl (1999). Each excerpt had metronomic timing and equalized MIDI velocity. Four versions of each excerpt were used: 1) full pitched, 2) full monotonic, 3) RH pitched (a pitched version of only the right-hand notes), and 4) RH monotonic (a monotonic version of only the right-hand notes). A total of 28 stimuli were thus used. The length of each stimulus was approximately 40 sec.

Subjects. Twelve musically experienced students participated in the tapping experiment. Each subject was asked to tap the most comfortable pulse of each excerpt.

Performance measures. Six performance measures recorded in the tapping study were used in the present study.
These were: 1) the beat to start tapping (BST); the proportion of time spent in each of the following tapping modes: 2) on the down-beat (down), 3) on the up-beat (up), 4) periodically, but neither on the down-beat nor on the up-beat (neither), and 5) aperiodically (aper); and 6) the number of switches between tapping modes.

Results. The subjects tapped significantly more on the down-beat and less on the up-beat for the pitched than for the monotonic versions. For the other performance measures used in this study there was, however, no significant difference between pitched and monotonic versions. For the RH versions, the subjects tapped significantly less on the down-beat and more in the neither and aperiodic modes than for the full versions. Moreover, with the RH versions there were significantly more switches than with the full versions.

Optimization of the models

Each of the three models described above was optimized with respect to the following parameters: the time constant for temporal integration, τ_r; the pitch weighting, κ; the tapping threshold, θ_tap; and the switching threshold, θ_sw. The optimization was carried out using the technique of simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983) as follows. For a given combination of parameter values, the six aforementioned performance measures were calculated using each of the 28 stimuli as input. An error function was defined as the sum of absolute errors between the model's and the humans' performance measures, taken across the 28 stimuli and the six performance measures. This error function was minimized with respect to the parameter values. Each of the three models had approximately equal optimal parameter values. These were τ_r = 4 sec, κ = 0.03, θ_tap = 0.4, and θ_sw = 0.2.
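The optimization procedure can be illustrated with a generic, self-contained simulated annealing loop. The toy one-dimensional error function and all parameter choices below are assumptions for illustration, not the actual tapping-error surface or the authors' settings.

```python
import math
import random

def simulated_annealing(error, x0, step=0.5, t0=1.0, cooling=0.995,
                        n_iter=4000, seed=1):
    """Generic simulated annealing: always accept improving moves,
    accept worsening moves with probability exp(-delta / T), and
    gradually lower the temperature T."""
    rng = random.Random(seed)
    x, e = x0, error(x0)
    temp = t0
    for _ in range(n_iter):
        cand = x + rng.uniform(-step, step)
        e_cand = error(cand)
        if e_cand < e or rng.random() < math.exp(-(e_cand - e) / temp):
            x, e = cand, e_cand
        temp *= cooling
    return x, e

# Toy error surface with its minimum at x = 2:
best_x, best_e = simulated_annealing(lambda x: (x - 2.0) ** 2, x0=-5.0)
```

In the actual optimization, the state would be the four model parameters and the error function the summed absolute error over the 28 stimuli and six performance measures.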
Thus, the optimal values were obtained using a temporal integration length of 4 seconds, weighting the pitches so that the weighting increases by a factor of approximately 1.4 for each descending octave, and accepting switches only when the maximum resonance exceeds that of the winner by at least 20 percent. The meaning of the optimal tapping threshold value, θ_tap = 0.4, is more difficult to interpret.

Comparison between human and model data

RMS errors. The total root-mean-square (RMS) error values for the optimized models were 1.78, 1.82, and 1.19 for models 1, 2, and 3, respectively. In terms of the total RMS error, model 3 thus performed best, followed by model 1 and model 2, in that order. Table 1 shows the RMS errors for each performance measure and model separately. For the performance measures BST, down, up, and switches, the lowest RMS error was obtained with model 3. The lowest RMS errors for the performance measures neither and aper were obtained with models 2 and 1, respectively.

TABLE 1. RMS errors between human and model data

          BST     down    up      neither  aper    switches
model 1   0.245   0.317   0.432   0.153    0.226   0.770
model 2   0.254   0.289   0.373   0.106    0.266   0.897
model 3   0.242   0.240   0.312   0.135    0.248   0.354

Correlations. Table 2 shows the correlations between the human and model data for each performance measure and model. The average correlations, taken across the six performance measures, are 0.417, 0.463, and 0.553 for models 1, 2, and 3, respectively. As can be seen, the highest correlation for all performance measures except switches was obtained with model 3. For model 3, all correlations except that for BST are significant at the p < 0.05 level. Figure 1 shows the performance measures of the subjects and model 3 for each of the stimuli as scatter plots. As can be seen, model 3 predicts the performance measures down, up, and neither relatively well.
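Both comparison statistics are standard; the minimal sketch below shows how they are computed (hypothetical helper names, with toy data in place of the 28-stimulus results).

```python
import math

def rms_error(human, model):
    """Root-mean-square error between paired human and model measures."""
    return math.sqrt(sum((h - m) ** 2 for h, m in zip(human, model)) / len(human))

def pearson_r(x, y):
    """Pearson correlation coefficient between two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

h = [0.1, 0.4, 0.6, 0.9]  # e.g. human 'down' proportions (toy data)
m = [0.2, 0.8, 1.2, 1.8]  # linearly related model output (toy data)
r = pearson_r(h, m)       # close to 1.0 for exactly linear data
```

In the paper, each statistic is taken over the 28 stimuli for one performance measure at a time, and the total RMS error additionally pools the six measures.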
However, the model produces considerably less aperiodic tapping and slightly more switches between tapping modes than do the subjects.

TABLE 2. Correlations between human and model data

          BST     down      up       neither   aper     switches
model 1   0.017   0.550**   0.397*   0.889***  0.300    0.347
model 2   0.087   0.544**   0.423*   0.901***  0.260    0.563**
model 3   0.313   0.687***  0.510**  0.942***  0.375*   0.492**

*p<0.05, **p<0.01, ***p<0.0001

Discussion. In terms of both RMS errors and correlations, model 3 fits the human data better than do models 1 and 2. This suggests that, at least in ragtime music, tonal cues may be used in determining the phase of tapping. Models 1 and 2 performed almost equally well in terms of the RMS error, whereas model 2 correlated with the human data slightly better than did model 1. The main contribution to the latter difference comes from the correlation for the number of switches. In terms of predicting the tapping mode, there was thus no significant difference between models 1 and 2. This may imply that pitch height information was not used by the subjects.

Conclusion

We studied the dependence of the pulse sensation evoked by ragtime music on temporal, pitch, and tonal factors. For this we used three different models. Model 1 relied on temporal aspects only. Model 2 segregated the input into a set of streams using pitch height information and weighted each of the streams differently. Model 3 additionally took into account the tonal significance of each note. The output of each of the models was compared with human data obtained using the same set of stimuli. It was found that model 2 did not perform significantly better than model 1. Consequently, the subjects may not have used pitch height information when tapping. Model 3, on the other hand, performed significantly better than the other two models. This suggests that the tonal significance of tones may affect the perception of pulse.
More specifically, tones that are high in the tonal hierarchy may be perceived as more accented.

References

Boltz, M., & Jones, M. R. (1986). Does rule recursion make melodies easier to reproduce? If not, what does? Cognitive Psychology, 18, 389-431.
Brown, J. C. (1993). Determination of meter of musical scores by autocorrelation. Journal of the Acoustical Society of America, 94(4), 1953-1957.
Clarke, E. F. (1999). Rhythm and timing in music. In D. Deutsch (Ed.), The psychology of music (2nd ed., pp. 473-500). New York: Academic Press.
Dannenberg, R. B., & Mont-Reynaud, B. (1987). Following a jazz improvisation in real time. In Proceedings of the 1987 International Computer Music Conference (pp. 241-248). San Francisco: International Computer Music Association.
Dawe, L. A., Platt, J. R., & Racine, R. J. (1994). Inference of metrical structure from perception of iterative pulses within time spans defined by chord changes. Music Perception, 12(1), 57-76.
Desain, P., & Honing, H. (1989). The quantization of musical time: A connectionist approach. Computer Music Journal, 13(3), 56-66.
Deutsch, D. (1980). The processing of structured and unstructured tonal sequences. Perception & Psychophysics, 28, 381-389.
Fraisse, P. (1982). Rhythm and tempo. In D. Deutsch (Ed.), The psychology of music (pp. 149-180). New York: Academic Press.
Gasser, M., Eck, D., & Port, R. (1999). Meter as mechanism: A neural network model that learns musical patterns. Connection Science, 11, 187-215.
Jones, M. R., Boltz, M., & Kidd, G. (1982). Controlled attending as a function of melodic and temporal context. Perception & Psychophysics, 32, 211-218.
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671-680.
Krumhansl, C. L., & Kessler, E. J. (1982). Tracing the dynamic changes in perceived tonal organization in a spatial representation of musical keys. Psychological Review, 89, 334-368.
Large, E. W., & Kolen, J. F. (1994). Resonance and the perception of musical meter. Connection Science, 6(2-3), 177-208.
Lerdahl, F., & Jackendoff, R. (1983). A generative theory of tonal music. Cambridge, MA: MIT Press.
Longuet-Higgins, H. C., & Lee, C. S. (1982). Perception of musical rhythms. Perception, 11, 115-128.
McAuley, J. D., & Kidd, G. R. (1998). Effect of deviations from temporal expectations on tempo discrimination of isochronous tone sequences. Journal of Experimental Psychology: Human Perception and Performance, 24, 1786-1800.
Monahan, C. B., Kendall, R. A., & Carterette, E. C. (1987). The effect of melodic and temporal contour on recognition memory for pitch change. Perception & Psychophysics, 41, 576-600.
Palmer, C., & Krumhansl, C. L. (1990). Mental representations of musical meter. Journal of Experimental Psychology: Human Perception and Performance, 16, 728-741.
Parncutt, R. (1994). A perceptual model of pulse salience and metrical accent in musical rhythms. Music Perception, 11(4), 409-464.
Povel, D. J., & Essens, P. (1985). Perception of temporal patterns. Music Perception, 2(4), 411-440.
Scarborough, D. L., Miller, B. O., & Jones, J. A. (1992). On the perception of meter. In M. Balaban, K. Ebcioglu, & O. Laske (Eds.), Understanding music with AI: Perspectives in music cognition (pp. 427-447). Cambridge, MA: MIT Press.
Scheirer, E. D. (1998). Tempo and beat analysis of acoustic musical signals. Journal of the Acoustical Society of America, 103(1), 588-601.
Snyder, J., & Krumhansl, C. L. (1999). Cues to pulse-finding in piano ragtime music. Society for Music Perception and Cognition Abstracts. Evanston, IL.
Toiviainen, P. (1998). An interactive MIDI accompanist. Computer Music Journal, 22(4), 63-75.
van Noorden, L., & Moelants, D. (1999). Resonance in the perception of musical pulse. Journal of New Music Research, 28, 43-66.
[Figure 1 here: six scatter plots, one per performance measure (BST, down, up, neither, aper, switches), with human data on the abscissa and model 3 data on the ordinate; image not reproduced.]

Figure 1. Scatter plots of the six performance measures taken from subjects and model 3. Each point represents one stimulus; its abscissa and ordinate correspond to human and model data, respectively.