A NEW FEATURE EXTRACTION MOTIVATED BY HUMAN EAR
Amin Fazel
Sharif University of Technology
Hossein Sameti, S. K. Ghiathi
February 2005

Outline
  Introduction
  Physiological basis in the human auditory system
  Modeling of the basilar membrane and hair cells
  Experimental results
  Summary and conclusions

Thursday, February 03, 2005
Department of Computer Engineering

Introduction
Speech is the #1 real-time communication medium among humans.
Advantages of a voice interface to machines:
  Hands-free operation
  Speed
  Ease of use

Humans are a high-performance existence proof for speech recognition in noisy environments.
  Wall Street Journal/Broadcast News readings, 5000 words
  Untrained human listeners vs. Cambridge HTK LVCSR system

Physiological Basis
Inner Ear: Semicircular Canals, Cochlea

The semicircular canals are the body's balance organs.
Hair cells in the canals detect movements of the fluid in the canals caused by angular acceleration.
The canals are connected to the auditory nerve.

The cochlea is a snail-shell-like structure of the inner ear, divided into three fluid-filled parts.
Two are canals (scala tympani and scala vestibuli) for the transmission of pressure. The third contains the sensitive organ of Corti, which detects pressure impulses and responds with electrical impulses that travel along the auditory nerve to the brain.

The organ of Corti can be thought of as the body's microphone.
Perception of pitch and perception of loudness are connected with this organ.
It is situated on the basilar membrane in the cochlear duct.
It contains inner hair cells and outer hair cells.
Some 16,000 to 20,000 hair cells are distributed along the basilar membrane.
Vibrations of the oval window cause the cochlear fluid to vibrate.
This causes the basilar membrane to vibrate, producing a traveling wave.
The traveling wave bends the hair cells, which produces generator potentials.
If large enough, these potentials stimulate the fibers of the auditory nerve to produce action potentials.
The outer hair cells amplify vibrations of the basilar membrane.

Modeling of BM and Hair Cells
Different parts of the basilar membrane and hair cells are sensitive to different frequencies of the input signal.

Since the cooperation of the basilar membrane and hair cells converts all frequencies of speech into mechanical energy, we can, to a good approximation, represent the basilar membrane and hair cells discretely as forced damped oscillators with different natural frequencies.

We stimulate these oscillators with the input sound.
In this simulation, each oscillator is a particle that is always pulled by a force towards the center of oscillation.
The displacement of the particle from the center of oscillation is denoted by x, and the inward force is equal to -kx.
k is a constant for each oscillator:
\[ k = m\omega_0^2 \]
where m is the mass of the particle and \omega_0 is the natural (angular) frequency of the oscillator.

Since there is an external force (imposed by the sound), we can no longer use the standard equations, which assume that the energy of the system is constant.
If we do not consider the effect of friction, the energy of the system will not decrease and the system becomes unstable.
So we must add a force in the direction opposite to the movement.
Since the direction of movement is determined by v (the velocity), the friction force is -bv.
Viewing each diapason (tuning fork) as a filter, its bandwidth is governed by the quality factor
\[ Q = \frac{m\omega_0}{b} \]

We model the state of each oscillator with the pair [x v], where x is the displacement and v is the velocity of the particle:
\[ \begin{bmatrix} x_{new} \\ v_{new} \end{bmatrix} = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_{old} \\ v_{old} \end{bmatrix} + \begin{bmatrix} 0 \\ \Delta t \end{bmatrix} a \]
where \Delta t is the inverse of the sampling frequency.

Three forces act on the particle:
  The diapason itself pulls the particle with force -kx.
  The sound imposes an external force, F_external. We use the value of the current sample itself as the external force.
  Friction opposes the movement with force -bv.

Now we can compute the acceleration a using the following formula:
\[ a = \frac{F - b\,v_{pr} - k\,x_{pr}}{m} \]
where F is the external force and the subscript pr denotes the previous values of x and v.

To use this model for feature extraction, we calculate the energy of each oscillator and use these energies as feature vectors in ASR systems:
\[ E = \frac{1}{2} m v^2 + \frac{1}{2} k x^2 \]

Experimental results
We transform a speech signal with our human-based model and compare the result with the spectrum of the same signal.
The two transformations differ only slightly.

This comparison suggests that the human-based model can be used effectively in ASR systems.
In addition, this method can serve as an effective and quick signal transformation in place of the FFT or wavelets in various tasks.
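The oscillator model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `oscillator_bank_features`, the parameters `q`, `mass` and `frame_len`, the choice of center frequencies, and the averaging of energy over fixed-length frames are all hypothetical, since the slides do not say how the energies are grouped into feature vectors. The state update is the explicit one from the slides; for Q > 0.5 it is only stable when \omega_0 \Delta t < 1/Q, so the example keeps the center frequencies low.

```python
import math


def oscillator_bank_features(samples, fs, center_freqs, q=4.0, mass=1.0, frame_len=400):
    """Per-frame mean energy of a bank of forced damped oscillators.

    Hypothetical helper: q, mass and frame_len are illustrative choices,
    not values taken from the slides.
    """
    dt = 1.0 / fs  # delta t is the inverse of the sampling frequency
    # Per-oscillator constants: k = m * w0^2 and b = m * w0 / Q
    params = []
    for f0 in center_freqs:
        w0 = 2.0 * math.pi * f0
        params.append((mass * w0 ** 2, mass * w0 / q))

    x = [0.0] * len(params)    # displacements
    v = [0.0] * len(params)    # velocities
    acc = [0.0] * len(params)  # energy accumulated over the current frame
    features = []

    for n, force in enumerate(samples, start=1):
        # The current sample itself is used as the external force.
        for i, (k, b) in enumerate(params):
            a = (force - b * v[i] - k * x[i]) / mass
            # x_new = x_old + dt * v_old ; v_new = v_old + dt * a
            x[i], v[i] = x[i] + dt * v[i], v[i] + dt * a
            # E = 1/2 m v^2 + 1/2 k x^2
            acc[i] += 0.5 * mass * v[i] ** 2 + 0.5 * k * x[i] ** 2
        if n % frame_len == 0:
            # Assumption: one feature vector per frame, mean energy per oscillator.
            features.append([e / frame_len for e in acc])
            acc = [0.0] * len(params)
    return features


# Usage: a 200 Hz tone fed to oscillators tuned to 100, 200 and 400 Hz.
fs = 16000
tone = [math.sin(2.0 * math.pi * 200.0 * n / fs) for n in range(3200)]
feats = oscillator_bank_features(tone, fs, center_freqs=[100.0, 200.0, 400.0])
```

With these settings each frame covers 25 ms, and once the transient has died out the oscillator tuned to 200 Hz carries the largest energy, which is the frequency selectivity the model relies on.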

ASR Experiments
The proposed feature extraction algorithm was tested on an English digit database.
For training we use 1386 digit sequences spoken by 18 speakers.
In the testing phase we use 200 digit sequences uttered by speakers outside the training database.
The testing database is split into four groups of 50 sequences, and four types of noise are added to these groups.

Recognition is performed using HTK:
  16 emitting states and three-mixture continuous HMM models
  3-state silence model
  Single-state inter-digit pause model
In the reference experiments, MFCC_0_D_A is used:
  It consists of 13 standard cepstral coefficients, including C0, augmented with their first and second derivatives.
  MFCC features were generated by applying a Hamming window of size 25 ms and overlap 10 ms to the same pre-emphasized 23-channel Mel-scale filterbank.
  The cepstral features were obtained from the DCT of the log energy over the 23 frequency channels.

[Figure: Comparison of MFCC and HEFE for car noise. Word error rate (%) vs. SNR (dB), from 20 dB down to -5 dB.]

[Figure: Comparison of MFCC and HEFE for exhibition noise. Word error rate (%) vs. SNR (dB), from 20 dB down to -5 dB.]

[Figure: Comparison of MFCC and HEFE for babble noise. Word error rate (%) vs. SNR (dB), from 20 dB down to -5 dB.]

[Figure: Comparison of MFCC and HEFE for subway noise. Word error rate (%) vs. SNR (dB), from 20 dB down to -5 dB.]

On noise-contaminated speech, HEFE outperforms MFCC for all noise types at most SNR levels.
For babble noise, HEFE performs significantly better than MFCC.
For subway noise, the improvement from HEFE is the least significant, but still noticeable.

Summary
In this paper we have introduced a simple, physiologically based model of the basilar membrane and hair cells.
We use this model for feature extraction in ASR systems.
These features significantly outperform MFCC features in babble noise.

Thank you!