HCS 7367 Speech Perception
Dr. Peter Assmann, Fall 2012
http://www.utdallas.edu/~assmann/aud6306/sproc.pdf

Speech communication in adverse listening conditions

Babies "cry in mother's tongue"
• Babies' cries imitate their mother tongue as early as three days old.
• German researchers say babies begin to pick up the nuances of their parents' accents while still in the womb.
http://news.bbc.co.uk/2/hi/health/8346058.stm

Long-term spectrum of speech
[Figures: long-term average spectra of connected speech and of vowels, plotted against the absolute threshold of hearing for males and females.]

Masking and interference
• Energetic masking – reduced audibility of signal components due to overlap in spectral energy within the same auditory channel.
• Informational masking – reduced audibility of signal components due to non-energetic factors such as target-masker similarity.
  – Forward vs. backward speech maskers
  – Familiar vs. foreign language

Resistance to distortion: effects of noise
• Articulation score: percent items correct on spoken lists of syllables, words, or sentences.
• Signal-to-noise ratio (SNR): when speech and noise have the same average rms level (SNR = 0 dB), articulation scores are above 50% for listeners with normal hearing.

Signal-to-noise ratio (SNR)
• SNR = 20 log10 [ rms(speech) / rms(noise) ], specified in decibels.
• When speech and noise have the same average rms level (SNR = 0 dB), articulation scores are above 50% for listeners with normal hearing.
• Why is speech intelligible when the masker is presented at the same level as the speech? (A sketch of this SNR computation appears at the end of this section.)

Articulation Index
• How much does audibility contribute to difficulty understanding speech in noise?
• The Articulation Index (AI) estimates the contribution of audibility (and other factors) to speech intelligibility. It:
  1. divides the speech and masker spectrum into a small number of frequency bands;
  2. estimates the audibility of speech in each band, weighted by its relative importance for intelligibility;
  3. derives overall intelligibility by summing the contributions of each band (see the band-summation sketch at the end of this section).

Speech recognition in noise
• Spectral properties of the noise: white, pink, speech-shaped, competing speech, speech babble.
• Temporal properties of the noise: steady vs. modulated or interrupted.
[Figures: waveform of white noise and spectra of white noise, pink noise, and speech-shaped noise, 0–4 kHz.]

Effects of noise on speech recognition
[Figure: word recognition as a function of SNR. Source: Miller, Heise and Lichten, J. Exp. Psychol. 1951]

Non-uniform noise
• Speech babble (mixture of 10 sentences from 1 talker).
[Figure: spectrum of speech babble, 0–4 kHz.]
• Multi-talker babble: effect of increasing the number of competing voices (1, 2, 4, 8, and 16 sentences).

Effects of noise on vowel spectra
• Broadband noise tends to fill up the valleys between the formant peaks.
• Spectral contrast (peak-to-valley ratio) is reduced by the addition of noise.
• Because of the sloping long-term spectrum of speech, the upper formants (F3, F4, F5) are more susceptible to masking and distortion by the noise.
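The SNR definition above translates directly into code. Below is a minimal sketch, assuming NumPy; the signals and the helper `mix_at_snr` are illustrative, not from the slides:

```python
import numpy as np

def rms(x):
    """Root-mean-square level of a signal."""
    return np.sqrt(np.mean(x ** 2))

def snr_db(speech, noise):
    """SNR in decibels, as defined on the slide: 20*log10(rms(speech)/rms(noise))."""
    return 20 * np.log10(rms(speech) / rms(noise))

def mix_at_snr(speech, noise, target_snr_db):
    """Scale the noise so the mixture has the requested SNR, then add it to the speech."""
    gain = rms(speech) / (rms(noise) * 10 ** (target_snr_db / 20))
    return speech + gain * noise

# Illustration with synthetic stand-in signals (1 s at 16 kHz):
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # stand-in for a speech waveform
noise = rng.standard_normal(16000)
mixture = mix_at_snr(speech, noise, 0.0)  # SNR = 0 dB: equal average rms levels
```

At SNR = 0 dB the two components have identical rms levels, which is the condition under which normal-hearing listeners still score above 50%.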
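The three AI steps can likewise be sketched. This is a deliberately simplified illustration, not the ANSI procedure: the band levels, the 30-dB audibility mapping, and the importance weights are all assumptions made for the example.

```python
import numpy as np

def articulation_index(speech_band_db, noise_band_db, band_importance):
    """Toy AI-style computation over a small number of frequency bands.

    speech_band_db, noise_band_db: per-band levels in dB (step 1: the spectra
    have already been divided into bands). band_importance: weights summing
    to 1 that express each band's relative importance for intelligibility.
    """
    # Step 2: audibility per band. One common simplification maps the band
    # SNR linearly onto a 30-dB range (fully masked at 0, fully audible at 30 dB).
    band_snr = np.clip(np.asarray(speech_band_db) - np.asarray(noise_band_db), 0.0, 30.0)
    audibility = band_snr / 30.0
    # Step 3: sum the importance-weighted band contributions.
    return float(np.sum(np.asarray(band_importance) * audibility))

# Hypothetical five-band example with equal importance weights:
ai = articulation_index([60, 58, 55, 50, 45], [50, 50, 50, 50, 50], [0.2] * 5)
```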
Vowel in quiet and in noise
[Figure: excitation patterns (dB) of a vowel in quiet and in pink noise at +6 dB SNR, 0.2–5 kHz.]

Effects of noise on formant peaks
[Figure]

Effects of filtering: high-pass and low-pass filtering
• Low-pass filtering to remove frequencies above 1800 Hz reduces intelligibility from near perfect to around 67%.
• High-pass filtering to remove components below 1800 Hz also produces about 67%.
[Figure: identification accuracy (%) for high-pass (HP) and low-pass (LP) filtered speech as a function of cutoff frequency, 100 Hz to 10 kHz.]

Speech communication has an extraordinary resilience to distortion.
1. Intelligibility remains high even when large portions of the spectrum are eliminated by filtering.

Bandpass filtering
• Bandpass filtering with one-third octave filters centered at 1500–2100 Hz produces better than 95% accuracy for high-predictability sentences (Warren et al., 1995; Stickney & Assmann, 2001). (A filtering sketch appears at the end of this section.)

Perception of filtered speech
• Everyday English sentences filtered using narrow bandpass filters remain highly intelligible (>90% words correct): one-third octave bandwidth, 1500 Hz center frequency, 100 dB/octave slopes. Warren et al. (Percept Psychophys 1995; JASA 2000)

Other frequency distortions
• Notch filtering to remove frequencies between 800 and 3000 Hz leads to consonant identification scores better than 90% (Lippmann, 1996).
• Conclusion: speech cues are widely distributed across the spectrum.

Speech communication has an extraordinary resilience to distortion.
2. Large segments of the waveform can be deleted or replaced by silence.
[Figure: 1-second waveform interrupted at a rate of 5 Hz. Stickney and Assmann (JASA 2001)]

Speech communication has an extraordinary resilience to distortion.
3. Noise can be added to the speech signal at equal intensity (signal-to-noise ratio = 0 dB).
[Figure: spectrograms (0–5 kHz, 0–600 ms) of the speech, the speech-shaped noise, and their mixture.]

Speech communication has an extraordinary resilience to distortion.
4. When the noise is from a competing voice, target and masker are similar and must be segregated.
[Figure: spectrograms (0–5 kHz, 0–600 ms) of the target and the competing voice.]

How do listeners achieve this?
• Statistical redundancy of speech/language
• Combined strategies of top-down + bottom-up processing
• Grouping and segregation of auditory objects
• Tracking speech properties over time
• Glimpsing speech fragments during noise-free intervals

Redundancy in speech and language
Coker and Umeda (1974) define redundancy as: “any characteristic of the language that forces spoken messages to have, on average, more basic elements per message, or more cues per basic element, than the barest minimum [necessary for conveying the linguistic message].”

Redundancy in error correction
“Redundancy can be used effectively; or it can be squandered on uneven repetition of certain data, leaving other crucial items very vulnerable to noise. . . . But more likely, if a redundancy is a property of a language and has to be learned, then it has a purpose.” Coker and Umeda (1974, p. 349)
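Returning to the filtering manipulations above: here is a minimal sketch of a one-third-octave bandpass filter, assuming SciPy is available. A Butterworth design is used for convenience; it only approximates the steep 100 dB/octave slopes described by Warren et al., and the function name and parameters are illustrative.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def third_octave_bandpass(x, fc, fs, order=8):
    """Band-pass a signal through a one-third-octave band centered at fc.

    One-third-octave band edges lie at fc * 2**(-1/6) and fc * 2**(1/6),
    i.e. roughly 1336-1684 Hz for the 1500-Hz band.
    """
    lo, hi = fc * 2 ** (-1 / 6), fc * 2 ** (1 / 6)
    sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)  # zero-phase filtering

# e.g., the 1500-Hz band from Warren et al. (1995):
# filtered = third_octave_bandpass(speech, fc=1500, fs=16000)
```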
Redundancy contributes to speech perception in several ways:
1. by limiting perceptual confusion due to errors in speech production;
2. by helping to bridge gaps in the signal created by interfering noise, reverberation, and distortions of the communication channel; and
3. by compensating for momentary lapses in attention and misperceptions on the part of the listener.

Effects of context
• Contextual cues lead to improved speech understanding in noise:
  – Acoustic-phonetic context
  – Prosodic context
  – Semantic context
  – Syntax
[Figure: percent items correct in noise. Miller, Heise & Lichten, 1951]

Interrupted speech
• Recognition of interrupted speech in quiet: the speech is turned on and off at regular intervals using an electronic switch (Miller and Licklider, JASA 1950). (A sketch of this gating operation appears at the end of this section.)
• In quiet, speech can be interrupted (turned on and off) periodically without substantial loss of intelligibility (Miller & Licklider, 1950).
• Miller and Licklider found the worst intelligibility for interruption rates below 2 Hz, where large speech fragments (words, phrases) are missing.
• They found improved performance for interruption rates between 10 and 100 Hz. Why?
• For very high interruption rates (>1 kHz) the signal sounded continuous, and performance was near perfect.
[Figure: word identification accuracy (%) as a function of interruption frequency, 0.1 to 10,000 per second. Audio examples: 16 Hz, 128 Hz, 512 Hz.]

Masking of speech by interrupted noise
• Miller and Licklider also measured speech intelligibility in conditions where the speech was continuous but the noise was interrupted.
• When the noise is intermittent rather than continuous there is a release from masking.
• The benefits of non-stationarity depend on the interruption rate and the duty cycle (on-off ratio) of the noise.

Interrupted noise
• At low interruption rates the effects are similar to speech interrupted by silence.
• As the interruption rate increases there is a gradual improvement in speech recognition.
• With 10 interruptions per second, listeners receive several “glimpses” of each word and can patch together those glimpses to recognize about 75% of the words correctly.
• When a noise masker is alternated with silence using a 50% duty cycle, there may be considerable masking release, compared to a continuous masker, especially with alternation rates between 1 and 200 per second (Miller and Licklider, 1950).

Summary: interrupted noise
1. At alternation rates between about 1 and 200 per second, listeners can “patch together” cues from the clean segments between the bursts of noise.
2. With slower interruption rates, entire words or phrases are masked; others are noise-free.
3. At rates above 200 per second the masking effect is the same as uninterrupted, continuous noise.

“Picket-fence” effect
• Interrupted speech can have a harsh, distorted quality.
• But when speech and noise are alternated periodically, filling the silent gaps with noise, the speech sounds smooth and continuous.
• Possibly, noise in the gaps enhances the listener’s ability to exploit contextual cues.

Howard-Jones and Rosen (1993)
• “Checkerboard” noise maskers
• Effects of interruption rate and frequency bandwidth of the checkerboard pattern
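A minimal sketch of the periodic gating used in the interruption experiments above, assuming NumPy and a waveform sampled at `fs` Hz; the helper name and its picket-fence variant are illustrative, not from the slides:

```python
import numpy as np

def interrupt(x, fs, rate_hz, duty=0.5):
    """Turn a signal on and off at regular intervals (Miller & Licklider style).

    rate_hz: number of interruptions per second.
    duty: fraction of each on-off cycle during which the signal is on.
    """
    t = np.arange(len(x)) / fs
    gate = ((t * rate_hz) % 1.0 < duty).astype(x.dtype)
    return x * gate, gate

# 10 interruptions per second with a 50% duty cycle:
# gated, gate = interrupt(speech, fs=16000, rate_hz=10)
# Picket-fence stimulus: fill the silent gaps with noise instead of silence:
# picket_fence = gated + noise * (1 - gate)
```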
• Can listeners exploit asynchronous time-frequency glimpses?
• Yes, but only over broad frequency ranges.
[Figure: checkerboard masker patterns in the time-frequency plane, 0.1–2.0 kHz, 0–700 ms.]

Speech source separation
• How do the ear and brain separate the target voice from the noise?
  – spatial cues
  – lip-reading
  – semantic context
  – auditory scene analysis (Bregman, 1990)
  – glimpsing and tracking

A glimpsing model of speech perception in noise
Martin Cooke, Journal of the Acoustical Society of America, Vol. 119, No. 3, pp. 1562–1573, March 2006

Auditory scene analysis
• Bregman (1990)
• The sound that reaches the eardrum of the listener is often a mixture of different sources.
• Acoustic signals originating from different sound sources combine additively.
• Unlike vision, the concept of occlusion is hard to define in audition: sounds overlap but also combine in complex ways.
• Human listeners are good at separating mixtures of sounds, as reflected in speech communication and listening to music in complex listening environments (“cocktail parties”).

Computational auditory scene analysis
• Reviewed by Cooke and Ellis (2001).
• Attempts to reproduce this separation process using computational models have had limited success (a hard problem!).

“Glimpsing” speech in noise
• “Speech is a highly modulated signal in time and frequency; regions of high energy are typically sparsely distributed.”
• “The information conveyed by the spectrotemporal energy spectrum of clean speech is redundant… Redundancy allows speech to be identified based on relatively sparse evidence.”
[Figure: spectrograms (0.1–2.0 kHz, 0–700 ms) of clean speech and of a speech + noise mixture.]

Glimpsing speech in noise
• Can listeners take advantage of “glimpses”, directing attention to spectrotemporal regions where the speech + noise mixture is dominated by the target speech?
• An ASR system was trained to recognize consonants in noise; the maskers differed in “glimpse size”.
• The ASR model was developed to exploit the non-uniform distribution of SNR in different time-frequency bands.
• Conclusion: both the model and listeners benefit from glimpsing.

Speech + noise mixtures
• Some regions are dominated by the target voice.
• Local SNR varies across time and frequency.
• Where the target voice dominates, the problem of source segregation is solved because the signal is effectively “clean” speech.
• Clean speech is highly redundant; it remains intelligible after 50% or more of its energy is removed by gating and/or filtering.

STEP model
• Auditory excitation pattern (Moore, 2003): a spectrogram-like representation.
• Reflects non-uniform frequency selectivity in different frequency bands.
• Incorporates a sliding time window reflecting temporal analysis by the auditory system.
• Models relative audibility at different frequencies (loudness model).

Sparseness and redundancy: missing-data ASR
• Glimpses = spectrotemporal regions where the signal exceeds the masker by ~3 dB. (A sketch of this glimpse criterion appears at the end of this section.)
• HMM-based speech recognizer with “missing-data” models:
  – Glimpses only: ignore missing information (in the masked regions).
  – Glimpses-plus-background: try to fill in missing information (based on the masked regions).

[Figure: syllable identification accuracy as a function of the number of competing voices. The level of the target speech (monosyllabic nonsense words) was held constant at 95 dB. (After Miller, 1947)]
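A minimal sketch of the glimpse criterion, assuming NumPy/SciPy and separate access to the speech and noise signals. Cooke (2006) computed local SNR on a spectro-temporal excitation pattern (STEP); an STFT magnitude is used here as a simple stand-in, and the function name is illustrative:

```python
import numpy as np
from scipy.signal import stft

def glimpse_mask(speech, noise, fs, threshold_db=3.0):
    """Mark spectrotemporal cells where the target dominates the masker.

    A cell counts as a potential glimpse when its local SNR exceeds
    threshold_db (~3 dB on the slide; Cooke's model also considered -5 dB).
    """
    _, _, S = stft(speech, fs=fs, nperseg=512)
    _, _, N = stft(noise, fs=fs, nperseg=512)
    eps = 1e-12  # avoid division by (or log of) zero
    local_snr_db = 20 * np.log10((np.abs(S) + eps) / (np.abs(N) + eps))
    return local_snr_db > threshold_db

# Proportion of the time-frequency plane dominated by the target:
# mask = glimpse_mask(speech, noise, fs=16000)
# print(mask.mean())
```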
Results
[Figures: spectrograms of the target, a single-talker masker, an eight-talker masker, and speech-shaped noise, with the glimpses (target-dominated regions) highlighted.]

Conclusions
• Best model: uses information in the glimpses and counterevidence in the masked regions; constrains glimpses to a minimum area; treats all regions with local SNR > −5 dB as potential glimpses.
[FIG. 4. The correlation between intelligibility and proportion of the target speech in which the local SNR exceeds 3 dB. Each point represents a noise condition, and proportions are means across all tokens in the test set. The best linear fit is also shown. The correlation between listeners and these putative glimpses is 0.955.]
• A higher “glimpse threshold” (e.g., local SNR > 0 dB) produces fewer glimpses, but these provide less distorted information than a lower threshold (e.g., −5 dB).
• Limitation: the local SNR must be known in advance. Is there a way to estimate the local SNR directly from the mixture?
• Tracking problem: how to integrate glimpses over time?

Brungart et al. (2001)
[Figures: percent correct responses in a two-talker task as a function of target-to-masker ratio (−12 to +12 dB) for four masker conditions: different talker, different sex; different talker, same sex; modulated noise; same talker.]
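Continuing the hypothetical `glimpse_mask` sketch above, the threshold trade-off in the conclusions can be made concrete by counting glimpses at several criteria (the signal names are again illustrative):

```python
# Fewer cells survive a stricter glimpse threshold, but those that do are
# less contaminated by the masker (assumes glimpse_mask from the sketch above).
for thr in (-5.0, 0.0, 3.0):
    mask = glimpse_mask(speech, noise, fs=16000, threshold_db=thr)
    print(f"threshold {thr:+.0f} dB -> glimpse proportion {mask.mean():.3f}")
```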