PITCH RECOGNITION WITH WAVELETS

1.130 – Wavelets and Filter Banks
May 15, 2003
Project by: Stephen Geiger
[email protected]
927939048

Abstract

This report investigates the use of wavelets for pitch recognition. A method is developed that uses the Continuous Wavelet Transform at various scales to identify individual notes. Successful results were obtained for computer generated polyphonic piano music that included octave intervals. The current method requires training the system before recognition is possible, and may work on only some instruments. However, it seems possible that the method could be extended to recognize real polyphonic piano music.

Outline

Introduction
Problem Description
Existing Methods
Developed Method and Results
Conclusions
References
Appendix A – Matlab Code
Appendix B – Additional Results

Introduction

Pitch recognition, the ability to identify the notes contained in an audio signal, is a task some humans are quite proficient at. Given the sound of a dropped metal trash can lid (or perhaps preferably a violin), they can respond with the name of the corresponding musical note. This ability is typically referred to in the music world as "perfect pitch". Not all humans have this capability, and there has been only limited success in creating computerized systems capable of pitch recognition. Research in this area has been approached with different motivating factors from several fields. Perhaps the most obvious application is the automatic transcription of music [1][2][3]. There is also interest in pitch recognition for analyzing models of musical instruments [4], for speech analysis [5], and from the perspective of perceptual computing [6]. The aim of this work was to explore the use of wavelets [7] for computer based pitch recognition.

Problem Description

Pitch is one of the properties of sound. It is perhaps most simply described as how high or low a sound is (not loud and soft, but high and low). Pitch also refers to the musical note a sound can be described as. In more technical terms, pitch relates to the fundamental frequency of a sound. Each musical note has a unique fundamental frequency. However, a sound or note typically does not consist of one pure frequency. This is shown in the following graph:

[Figure: Relative Frequency Content of a Computer Generated Piano Sound; x-axis: Frequency, Hz]

The graph displays the frequencies present in a Middle C (C4) with fundamental frequency 262 Hz. There is a large frequency component at the fundamental frequency, and there are components at integer multiples of this frequency (harmonics). The fundamental frequency is not always the largest component, as shown here:

[Figure: Relative Frequency Content of a Computer Generated Oboe Sound; x-axis: Frequency, Hz]

In the case of the oboe sound, the fundamental frequency is again 262 Hz and is present along with its harmonics, but the most prominent frequency component is the 4th harmonic. What may not be obvious is that to the human ear this sound is heard as having the same pitch as a sinusoidal wave at the fundamental frequency of 262 Hz, despite the fact that the strength of the fundamental frequency component in the signal is relatively small. In fact, there are cases where the fundamental frequency of a sound is not present in the signal at all.
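To make the harmonic structure concrete, here is a small illustrative Matlab sketch (not part of the original report; the harmonic amplitudes are invented for demonstration) that synthesizes a 262 Hz tone whose 4th harmonic dominates, and plots its spectrum. The peaks land at integer multiples of 262 Hz, exactly the harmonic pattern described above.

    % Illustrative only: synthesize a 262 Hz tone dominated by its 4th harmonic.
    % The harmonic amplitudes are invented for demonstration.
    Fs = 8192;                        % sampling rate, Hz
    t  = (0:Fs-1)/Fs;                 % one second of samples
    f0 = 262;                         % fundamental frequency of C4, Hz
    amps = [0.2 0.5 0.6 1.0 0.3];     % relative strengths of harmonics 1..5
    x = zeros(size(t));
    for k = 1:length(amps)
        x = x + amps(k)*sin(2*pi*k*f0*t);   % add the k-th harmonic
    end
    X = abs(fft(x))/length(x);        % magnitude spectrum
    f = (0:length(x)-1)*Fs/length(x); % frequency axis, 1 Hz resolution here
    plot(f(1:2000), X(1:2000));       % peaks at 262, 524, 786, 1048, 1310 Hz
    xlabel('Frequency, Hz'); ylabel('Relative magnitude');

Despite the weak fundamental, a listener would still hear this tone as the pitch C4.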
It is worthwhile to note that the varying distribution of strengths of the frequency components in a note is what determines the musical property called timbre. This is the property that makes an oboe sound like an oboe, a piano sound like a piano, a trumpet sound like a trumpet, and so on.

Two more relevant terms are monophonic and polyphonic. A monophonic sound is one in which only one pitch is present at any given time; examples are one person singing, or a single trumpet. Polyphonic sounds contain multiple notes simultaneously, such as an orchestra or a barbershop quartet. Several existing methods for monophonic pitch recognition have had some success. Polyphonic pitch recognition has proven significantly more difficult. This is partially because the combined frequency spectrum of several notes is more difficult to analyze, especially when identifying two pitches related by an interval of one octave (for example, a middle C and the next highest C played together): all of the frequency components of the higher note in an octave are also present in the lower note [8].

Existing Methods

A brief overview of some of the methods that have been tried for pitch detection is presented here. Monophonic transcription techniques include time domain techniques based on zero crossings and auto-correlation, and frequency domain techniques based on the discrete Fourier transform and cepstrum methods; see the references in [8]. The estimation of local maxima to find the pitch period (which is easily converted to a frequency) with the incorporation of wavelets is described in [1][9]. Another technique that uses wavelets to estimate the pitch period, along with a comparison to auto-correlation methods, is presented in [4]. The use of models of human pitch perception is also described in [8], as is the concept of "blackboard systems", an approach that incorporates various sources of knowledge, which could include music theory or statistical and probabilistic knowledge [2][6]. Lastly, it is worth noting that one approach to the problem of distinguishing octaves is to incorporate instrument models.

Developed Method and Results

Taking a different approach, the method developed in this work makes use of the Continuous Wavelet Transform (CWT) with a 2nd order Gaussian wavelet. The Continuous Wavelet Transform is defined as follows:

C(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} f(t) \, \psi\left(\frac{t - b}{a}\right) dt

where:
f(t) is the analyzed function (the signal),
ψ(t) is the mother wavelet,
a is the scaling factor, and
b is the shift parameter.

The 2nd order Gaussian mother wavelet has the following appearance:

[Figure: the 2nd order Gaussian mother wavelet]

When the scaling parameter a in the wavelet transform is varied, it has the effect of stretching or compressing the mother wavelet. The implementation of the CWT found in the Matlab Wavelet Toolbox was used; further explanation of the CWT and the 2nd order Gaussian wavelet can be found in the Wavelet Toolbox User's Guide [10].

The idea for this method is based on an observation made by Jeremy Todd [11]. In his work he found that, by taking the CWT of a piano recording with a certain CWT scale parameter and a 2nd order Gaussian wavelet, the onset of a specific note (a G4) could be easily identified. This observation is shown in the following illustration:

[Figure: original signal (top) and its CWT at a specific scale (bottom)]

Furthermore, Todd observed that the same result occurred in situations with polyphony as well, which was particularly interesting. I started my work by running a number of continuous wavelet transforms of varying scale on some test signals (computer generated piano sounds) and observing the results.
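A minimal sketch of such an experiment, using the classic Wavelet Toolbox interface available at the time, might look as follows. The file name and the particular scale values are placeholders, not values from the report.

    % Sketch: sweep the CWT over a range of scales with the 2nd order
    % Gaussian ('gaus2') wavelet and inspect the response at each scale.
    % The file name and scale values are placeholders.
    y = wavread('piano_test.wav');          % a computer generated piano sound
    scales = 300:50:800;                    % a coarse sweep of CWT scales
    coefs = cwt(y, scales, 'gaus2');        % one row of coefficients per scale
    for k = 1:length(scales)
        subplot(length(scales), 1, k);
        plot(coefs(k, :));                  % large values mark a responding note
        ylabel(sprintf('a = %d', scales(k)));
    end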
After looking at a number of results, it was possible to identify CWT scale parameters that respond to each of the notes in the musical scale starting at C4. (Note: in the previous sentence the term "scale" is used with two different meanings, the first in its wavelet sense and the second in its musical sense.) These results are shown here:

[Figure: original signal (top) followed by its CWT at scales 594, 530, 472, 446, 394, 722, 642, and 606]

The CWT at each of the selected scaling factors had large values at the occurrence of a specific note, and comparatively small values during the rest of the signal. Next, we can observe the results of the CWTs in the presence of some polyphony:

[Figure: a polyphonic signal (top) followed by its CWT at scales 594, 530, 472, 446, 394, 722, 642, and 606]

and:

[Figure: a second polyphonic signal (top) followed by its CWT at the same scales]

In both cases the method worked, even in the presence of polyphony. Furthermore, in the second example we see that both the C and the G are not affected by the presence of other octaves. (Note: the three areas of large response on the first line [scale = 594] of the second example are correct; the second two occurrences of the C are found in the bass clef.)

One of the next steps was to test whether notes played on a different instrument (i.e., with a different timbre) would also be recognized. This test was run using a computer generated brass sound, and the results clearly show that it did not work:

[Figure: the brass signal (top) followed by its CWT at scales 594, 530, 472, 446, 394, 722, 642, and 606]

This result was somewhat expected, and it suggests that the CWT is acting as an "instrument model" of sorts. Adjusting the scale parameter of the CWT changes the frequency response of the transform, and at certain scale parameters this frequency response appears to be tailored such that it responds to one pitch more than others.

Based on the encouraging results so far, the investigation continued to see how effective this method could be on a larger scale. At this point a "training algorithm" was written to aid in the identification of appropriate scale factors corresponding to various pitches; the computer was programmed to "train" itself to find applicable scale factors. The algorithm was implemented in Matlab and works as follows (a simplified sketch is given after the list):

1. Different sound files were created for each note in a range of desired notes.
2. The CWT of each sound file was taken.
3. The maximum results of the CWTs from each sound file were compared.
4. If the maximum CWT coefficient from one file was at least twice the value of those in all other files, it was considered a result.
5. For each result, the following were recorded: the scale factor, the pitch of the sound file, and the factor by which its maximum value exceeded all others.
6. This process was repeated over a range of CWT scale factors, in the hope of finding results for every pitch in the desired range of notes.
7. At the end, the scale factor of the best result for each pitch was collected.

(The code for this algorithm, as well as some of the other work for this project, is included in Appendix A.)
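The following simplified sketch shows one way the training loop described above could be written; it is not the Appendix A code. The note names, file names, and scale range are placeholders, while the factor-of-two rule comes directly from the description.

    % Simplified sketch of the training loop (the project's actual code is
    % in Appendix A).  Note names, file names, and scales are placeholders.
    notes  = {'C4','D4','E4','F4','G4','A4','B4','C5'};   % desired notes
    scales = 300:2:800;                                    % scales to search
    maxima = zeros(length(scales), length(notes));
    for n = 1:length(notes)
        y = wavread([notes{n} '.wav']);                    % one file per note
        c = cwt(y, scales, 'gaus2');
        maxima(:, n) = max(abs(c), [], 2);                 % peak response per scale
    end
    best = zeros(1, length(notes));                        % best factor per note
    bestScale = zeros(1, length(notes));                   % winning scale per note
    for s = 1:length(scales)
        [v, n] = max(maxima(s, :));                        % strongest note at this scale
        factor = v / max(maxima(s, [1:n-1, n+1:end]));
        if factor >= 2 && factor > best(n)                 % the "at least twice" rule
            best(n) = factor;
            bestScale(n) = scales(s);
        end
    end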
This algorithm was applied to several different types of training signals: a computer generated piano sound over a range of three octaves, a "real" guitar sound (albeit an electric guitar, a '70s Ibanez Les Paul), a set of pure sinusoidal waves, and lastly a training set of all 88 keys of the computer generated piano.

The training on the three octave range found results for all pitches except the bottom two notes. This is likely because only a limited set of CWT scales was searched, and it is hypothesized that, given a larger range, these values would have been found as well. The results are shown here.

The training on the real guitar sound met with limited success. Only 5 out of 8 notes were identified in the training process (again with a somewhat limited set of scales), and the results were not completely successful in identifying the corresponding notes in a test file. It was not a complete failure, and could merit a more thorough attempt, but the guitar is expected to be a more difficult case than a basic computer generated sound, or even a real piano.

The results for the sinusoidal waveforms were obtained as a step toward a better understanding of the relationship between scale and frequency. Changing the CWT scale shifts the frequency response of the transform, and some interesting relationships exist between which scales yield results for which notes, as seen in the following two graphs:

[Figure: Successful Results from the Training Algorithm – For 8 Sinusoidal Pitches in a C Scale; x-axis: Note Number (0–8), y-axis: Scale (0–14000)]

[Figure: Successful Results from the Training Algorithm – For 3 Octaves of a Computer Generated Piano Sound; x-axis: Note Number (0–22), y-axis: Scale (0–2500)]

In the first graph, with pure sinusoidal sounds, the relationship between scale and frequency seems a little more straightforward than in the second case. There could be some patterns in the second graph as well, though they are less apparent.

The tests with all 88 notes were abandoned after considering the time they required to run and the amount of time left to complete this work. It is worth noting that running CWTs for a number of test files at a number of scales could take several hours. This could possibly be sped up noticeably with shorter test files or a lower sampling rate, but this was not investigated. The initial results from the training on 88 notes were interesting, picking out notes 70–88. These notes appeared in more clearly defined regions than in the three octave test case, and it seems possible that a training run of the three octave test case at higher CWT scales might yield similar results.

Lastly, a fragment of the right hand part of Chopin's Prelude in C, Op. 28 No. 1 was tested, and the results were output in a more music-like format for comparison:

[Figure: A Test Fragment by Chopin – detected notes displayed in a music-like format]

A comparison of the musical score and the graph reveals that the method successfully identified all the notes contained in this polyphonic fragment. This is noteworthy, as the method was successful even in situations with polyphony and octaves.
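For completeness, here is a hedged sketch of how the trained scales might then be applied to a test recording. The report does not spell out this step; the file name, the threshold, and the reuse of notes/bestScale from the training sketch above are all assumptions.

    % Illustrative recognition step using the trained scales.  The file
    % name and the 50% threshold are assumptions, not the report's values.
    y = wavread('chopin_fragment.wav');          % polyphonic test fragment
    for n = 1:length(notes)                      % notes/bestScale from training
        if bestScale(n) == 0, continue; end      % skip untrained pitches
        c = cwt(y, bestScale(n), 'gaus2');       % CWT at the trained scale
        hits = abs(c) > 0.5*max(abs(c));         % crude response threshold
        fprintf('%s: above threshold in %d samples\n', notes{n}, sum(hits));
    end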
Conclusions

An application of the Continuous Wavelet Transform to pitch recognition was explored, and some interesting results were found. The method demonstrated the ability to recognize a reasonably complex polyphonic fragment, including octaves, which means it compares favorably with some of the other results in the literature that I came across. Two significant drawbacks of the method are that it requires training on an instrument's sounds, and that it may be effective on only some instruments. The most obvious next step would be to apply the current technique to a real piano and observe how well it works. One issue that might need to be dealt with is variation in the volume of the notes played, as this might interfere with the simple maximum method used for identifying results; possibly some type of compression or normalization could be applied. Another issue is the identification of the beginnings and ends of notes. If this were handled successfully, the system would be well on its way to handling basic music transcription. Perhaps techniques similar to those used in wavelet edge detection could be applied to this problem.

References

[1] Kevin Chan, Supaporn Erjongmanee, Choon Hong Tay, "Real Time Automated Transcription of Live Music into Sheet Music using Common Music Notation", 18-551 Final Project (Carnegie Mellon), May 2000.
[2] K. D. Martin, "A Blackboard System for Automatic Transcription of Simple Polyphonic Music", M.I.T. Media Lab Perceptual Computing Technical Report #385, July 1996.
[3] Michelle Kruvczuk, Ernest Pusateri, Alison Covell, "Music Transcription for the Lazy Musician", 18-551 Final Project (Carnegie Mellon), May 2000.
[4] John Fitch, Wafaa Shabana, "A Wavelet-Based Pitch Detector for Musical Signals".
[5] Inge Gavat, Matei Zirra, Valentin Enescu, "Pitch Detection of Speech by Dyadic Wavelet Transform", http://www.icspat.com/papers/181mfi.pdf
[6] K. D. Martin and E. D. Scheirer, "Automatic Transcription of Simple Polyphonic Music: Integrating Musical Knowledge", presented at SMPC, August 1997.
[7] Robi Polikar, "The Wavelet Tutorial", http://engineering.rowan.edu/~polikar/WAVELETS/WTtutorial.html
[8] K. D. Martin, "Automatic Transcription of Simple Polyphonic Music: Robust Front End Processing", M.I.T. Media Lab Perceptual Computing Technical Report #399, November 1996; presented at the Third Joint Meeting of the Acoustical Societies of America and Japan, December 1996.
[9] Tristan Jehan, "Musical Signal Parameter Estimation", Thesis, CNMAT, 1997. http://cnmat.cnmat.berkeley.edu/~tristan/Report/Report.html
[10] Wavelet Toolbox User's Guide (MATLAB), The MathWorks, 1997.
[11] Jeremy Todd, "A Comparison of Fourier and Wavelet Approaches to Musical Transcription", 18.327 Final Project (MIT).

Appendix A – Matlab Code

Appendix B – Additional Results