HCS 7367 Speech Perception
Dr. Peter Assmann
Fall 2012

Babies 'cry in mother's tongue'
• Babies' cries imitate their mother tongue as early as three days old.
• German researchers say babies begin to pick up the nuances of their parents' accents while still in the womb.
http://news.bbc.co.uk/2/hi/health/8346058.stm
Speech communication in adverse listening conditions
http://www.utdallas.edu/~assmann/aud6306/sproc.pdf

Long-term spectrum of speech
[Figure: long-term average spectrum of connected speech for males and females, shown against the absolute threshold of hearing]
Long-term spectrum of speech
[Figure: long-term spectra of vowels for males and females, shown against the absolute threshold of hearing]

Masking and interference
• Energetic masking – reduced audibility of signal components due to overlap in spectral energy within the same auditory channel.
• Informational masking – reduced audibility of signal components due to non-energetic factors such as target-masker similarity.
– Forward vs. backward speech maskers
– Familiar vs. foreign language
Resistance to distortion

Effects of noise
• Articulation score: % items correct on spoken lists of syllables, words or sentences
• Signal-to-noise ratio (SNR): when speech and noise have the same average rms level (SNR = 0 dB), articulation scores are above 50% for listeners with normal hearing
Signal-to-noise ratio (SNR)
• SNR = 20 log10 [ rms(speech) / rms(noise) ], specified in decibels (see the sketch below)
• When speech and noise have the same average rms level (SNR = 0 dB), articulation scores are above 50% for listeners with normal hearing.
• Why is speech intelligible when the masker is presented at the same level as the speech?
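As a concrete illustration of the formula above, here is a minimal Python sketch (assuming NumPy and two equal-length waveform arrays; the variable names are illustrative) that computes SNR and scales a noise signal so the mixture has a requested SNR:

```python
import numpy as np

def rms(x):
    """Root-mean-square level of a signal."""
    return np.sqrt(np.mean(x ** 2))

def snr_db(speech, noise):
    """SNR = 20 * log10(rms(speech) / rms(noise)), in decibels."""
    return 20.0 * np.log10(rms(speech) / rms(noise))

def mix_at_snr(speech, noise, target_snr_db):
    """Scale the noise so the mixture has the requested SNR, then add it."""
    gain = rms(speech) / (rms(noise) * 10 ** (target_snr_db / 20.0))
    return speech + gain * noise

# Example: at SNR = 0 dB, speech and noise have equal average rms levels.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for a speech waveform
noise = rng.standard_normal(16000)
mixture = mix_at_snr(speech, noise, target_snr_db=0.0)
print(snr_db(speech, mixture - speech))  # ~0 dB
```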
Articulation Index
• How much does audibility contribute to difficulty understanding speech in noise?
• The Articulation Index (AI) estimates the contribution of audibility (and other factors) to speech intelligibility (a numerical sketch follows this list):
1. Divides the speech and masker spectrum into a small number of frequency bands
2. Estimates the audibility of speech in each band, weighted by its relative importance for intelligibility
3. Derives overall intelligibility by summing the contributions of each band.
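The three steps above can be sketched numerically. The band levels, importance weights, and audibility rule below are simplified placeholders, not the standardized AI/SII procedure; the point is only to show the band-weighted summation idea:

```python
import numpy as np

# Hypothetical band levels (dB) for speech and masker in five bands.
speech_band_db = np.array([60.0, 62.0, 58.0, 50.0, 42.0])
noise_band_db  = np.array([55.0, 50.0, 52.0, 48.0, 45.0])

# Placeholder band-importance weights (sum to 1); the published AI/SII
# procedures tabulate these values per band.
weights = np.array([0.15, 0.25, 0.30, 0.20, 0.10])

# Step 2: audibility per band, mapping band SNR onto a 0-1 scale.
# A common simplification assumes a ~30 dB useful dynamic range of speech.
band_snr = speech_band_db - noise_band_db
audibility = np.clip((band_snr + 15.0) / 30.0, 0.0, 1.0)

# Step 3: sum the importance-weighted audibilities into a single index.
ai = np.sum(weights * audibility)
print(f"Articulation Index ~ {ai:.2f}")  # 0 = inaudible, 1 = fully audible
```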
Speech recognition in noise
• Spectral properties of the noise: white, pink, speech-shaped, competing speech, speech babble
• Temporal properties of the noise: steady vs. modulated or interrupted

White Noise
[Figure: waveform (amplitude vs. time, 0–1000 ms) and spectrum (amplitude, dB vs. frequency, 0–4 kHz) of white noise]
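To make the spectral distinction concrete, here is a small Python sketch (NumPy only; lengths and seeds are illustrative) that generates white noise and shapes it into pink (1/f) noise in the frequency domain:

```python
import numpy as np

def white_noise(n, rng):
    """Flat spectrum: equal power per Hz."""
    return rng.standard_normal(n)

def pink_noise(n, rng):
    """Shape white noise toward a 1/f power spectrum (-3 dB per octave)."""
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, d=1.0)
    freqs[0] = 1.0                 # avoid division by zero at DC
    spectrum /= np.sqrt(freqs)     # amplitude ~ 1/sqrt(f) => power ~ 1/f
    return np.fft.irfft(spectrum, n)

rng = np.random.default_rng(1)
w = white_noise(16000, rng)
p = pink_noise(16000, rng)
```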
Effects of noise on speech recognition
[Figure: source: Miller, Heise and Lichten, J. Exp. Psychol. 1951]

Non-uniform noise
[Figure: spectra (amplitude, dB vs. frequency, 0–4 kHz) of pink noise and of speech-shaped noise]
Non-uniform noise
Speech babble (mixture of 10 sentences from 1 talker)
[Figure: spectrum (amplitude, dB vs. frequency, 0–4 kHz) of the babble mixture]

Multi-talker babble
Effect of increasing the number of competing voices
[Figure: spectra for 1, 2, 4, 8, and 16 competing sentences]
Vowel // in quiet and in
noise
Amplitude (dB)
Effects of noise on vowel spectra
• Broadband noise tends to fill up the
valleys between the formant peaks.
• Spectral contrast (peak-to-valley ratio) is
reduced by the addition of noise.
• Because of the sloping long-term
spectrum of speech, the upper formants
(F3, F4, F5) are more susceptible to
masking and distortion by the noise.
0
0
-20
-20
-40
-40
0.2
0.5 1
2
5
0.2
Excitation (dB)
In quiet
0.5 1
2
5
Pink noise, +6 dB SNR
0
0
-20
-20
-40
-40
0.2 0.5 1 2
Frequency (kHz)
5
0.2 0.5 1 2
Frequency (kHz)
5
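Peak-to-valley ratio can be estimated directly from a smoothed spectral envelope (e.g., an excitation pattern or LPC envelope, not a raw harmonic spectrum). A minimal sketch, with the function name and smoothing assumption being mine:

```python
import numpy as np
from scipy.signal import find_peaks

def peak_to_valley_db(envelope_db):
    """Average difference (dB) between spectral peaks and the valleys
    between them, given a smoothed spectral envelope in dB."""
    peaks, _ = find_peaks(envelope_db)
    valleys, _ = find_peaks(-envelope_db)
    if len(peaks) == 0 or len(valleys) == 0:
        return 0.0
    return float(np.mean(envelope_db[peaks]) - np.mean(envelope_db[valleys]))
```

Adding broadband noise raises the valley levels while leaving the formant peaks relatively intact, so this measure drops as the SNR decreases.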
Effects of noise on formant peaks
Effects of filtering
High-pass and low-pass filtering

Effects of filtering on speech
• Low-pass filtering to remove frequencies above 1800 Hz reduces intelligibility from near perfect to around 67%.
• High-pass filtering to remove components below 1800 Hz also produces about 67%.

[Figure: identification accuracy (%) as a function of cutoff frequency (100–10000 Hz) for high-pass (HP) and low-pass (LP) filtered speech]
Speech communication has an extraordinary resilience to distortion.
1. Intelligibility remains high even when large portions of the spectrum are eliminated by filtering.

Bandpass filtering
• Bandpass filtering with one-third octave filters centered at 1500–2100 Hz produces better than 95% accuracy for high-predictability sentences (Warren et al., 1995; Stickney & Assmann, 2001).
Stickney and Assmann (JASA 2001)
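A one-third octave band filter of this kind can be approximated with a Butterworth bandpass in SciPy. The order and sampling rate below are illustrative, and this sketch will not reproduce the steep 100 dB/octave slopes of the published filters:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def third_octave_bandpass(signal, center_hz, fs, order=8):
    """Bandpass a signal in a one-third octave band around center_hz."""
    lo = center_hz * 2 ** (-1.0 / 6.0)   # one-third octave: +/- 1/6 octave
    hi = center_hz * 2 ** (1.0 / 6.0)
    sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

# Example: isolate the band centered at 1500 Hz from a 16 kHz signal.
fs = 16000
rng = np.random.default_rng(2)
x = rng.standard_normal(fs)              # stand-in for a speech waveform
band = third_octave_bandpass(x, center_hz=1500.0, fs=fs)
```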
Perception of filtered speech
• Everyday English sentences filtered using narrow bandpass filters remain highly intelligible (>90% words correct): one-third octave bandwidth, 1500 Hz center frequency, 100 dB/octave slopes (Warren et al., Percept Psychophys 1995; JASA 2000)

Other frequency distortions
• Notch filtering to remove frequencies between 800 and 3000 Hz leads to consonant identification scores better than 90% (Lippmann, 1996)
• Conclusion: speech cues are widely distributed across frequency
Perception of filtered speech
[Figure: Stickney and Assmann (JASA 2001)]

Speech communication has an extraordinary resilience to distortion.
2. Large segments of the waveform can be deleted or replaced by silence.
[Figure: waveform of a 1-second sentence interrupted at a rate of 5 Hz]
Speech communication has an extraordinary resilience to distortion.
3. Noise can be added to the speech signal at equal intensity (signal-to-noise ratio = 0 dB).
[Figure: spectrograms (frequency, 0–5 kHz vs. time, 0–600 ms) of the speech alone and of speech plus speech-shaped noise]

Speech communication has an extraordinary resilience to distortion.
4. When the noise is from a competing voice, target and masker are similar and must be segregated.
[Figure: spectrograms (frequency, 0–5 kHz vs. time, 0–600 ms) of the target voice and the competing voice]
How do listeners achieve this?
• Statistical redundancy of speech/language
• Combined strategies of top-down + bottom-up processing
• Grouping and segregation of auditory objects
• Tracking speech properties over time
• Glimpsing speech fragments during noise-free intervals
Redundancy in speech and language
• Coker and Umeda (1974) define redundancy as: “any characteristic of the language that forces spoken messages to have, on average, more basic elements per message, or more cues per basic element, than the barest minimum [necessary for conveying the linguistic message].”

Redundancy in error correction
• “Redundancy can be used effectively; or it can be squandered on uneven repetition of certain data, leaving other crucial items very vulnerable to noise. . . . But more likely, if a redundancy is a property of a language and has to be learned, then it has a purpose.”
Coker and Umeda (1974, p. 349)
Redundancy contributes to speech perception in several ways:
1. by limiting perceptual confusion due to errors in speech production;
2. by helping to bridge gaps in the signal created by interfering noise, reverberation, and distortions of the communication channel; and
3. by compensating for momentary lapses in attention and misperceptions on the part of the listener.
Effects of context
• Contextual cues lead to improved speech understanding in noise.
– Acoustic-phonetic context
– Prosodic context
– Semantic context
– Syntax

Miller, Heise & Lichten, 1951
[Figure: percent items correct (0–100); Miller, Heise & Lichten, 1951]
Recognition of interrupted speech in quiet

Interrupted speech
• In this condition the speech is turned on and off at regular intervals using an electronic switch.
Miller and Licklider (JASA 1950)
Interrupted speech
• In quiet, speech can be interrupted (turned on and off) periodically without substantial loss of intelligibility (Miller & Licklider, 1950).
• Miller and Licklider found the worst intelligibility for interruption rates < 2 Hz, where large speech fragments (words, phrases) are missing.

[Figure: word identification accuracy (%) as a function of the frequency of interruptions per second (0.1–10000); Miller and Licklider, JASA 1950]
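The on/off switching in these experiments amounts to multiplying the waveform by a square wave. A minimal Python sketch (NumPy; rates and durations illustrative) with an adjustable interruption rate and duty cycle, which also covers the interrupted-noise conditions discussed below:

```python
import numpy as np

def interrupt(signal, fs, rate_hz, duty=0.5):
    """Gate a signal on and off periodically.
    rate_hz: interruptions per second; duty: fraction of each cycle kept on."""
    t = np.arange(len(signal)) / fs
    phase = (t * rate_hz) % 1.0           # position within each on/off cycle
    gate = (phase < duty).astype(float)   # 1 during 'on', 0 during 'off'
    return signal * gate

# Example: 10 interruptions per second with a 50% duty cycle.
fs = 16000
rng = np.random.default_rng(3)
speech = rng.standard_normal(fs)          # stand-in for a 1-s speech waveform
gated = interrupt(speech, fs, rate_hz=10.0, duty=0.5)
```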
Interrupted speech
• Miller and Licklider found improved performance for interruption rates between 10 and 100 Hz. Why?
• For very high interruption rates (>1 kHz) the signal sounded continuous, and performance was near perfect.

Masking of speech by interrupted noise
• Miller and Licklider also measured speech intelligibility in conditions where the speech was continuous but the noise was interrupted.
[Figure: waveforms of noise interrupted at 16 Hz, 128 Hz, and 512 Hz]
Masking of speech by interrupted noise
• When the noise is intermittent rather than continuous, there is a release from masking.
• The benefits of non-stationarity depend on the interruption rate and the duty cycle (on-off ratio) of the noise.

Interrupted noise
• At low interruption rates the effects are similar to speech interrupted by silence.
• As the interruption rate increases, there is a gradual improvement in speech recognition.
• With 10 interruptions per second, listeners receive several “glimpses” of each word and can patch together those glimpses to recognize about 75% of the words correctly.
Interrupted noise
• When a noise masker is alternated with
silence using a 50% duty cycle, there may
be considerable masking release, compared
to a continuous masker, especially with
alternation rates between 1 and 200 per
second (Miller and Licklider, 1950).
Summary: Interrupted noise
1. At alternation rates between about 1 and 200
per second, listeners can “patch together”
cues from the clean segments between the
bursts of noise.
2. With slower interruption rates, entire words
or phrases are masked; others are noise-free.
3. At rates > 200/sec the masking effect is the
same as uninterrupted, continuous noise.
“Picket-fence” effect
• Interrupted speech can have a harsh, distorted quality.
• But when speech and noise are alternated periodically, filling the silent gaps with noise, the speech sounds smooth and continuous.
• Possibly, the noise in the gaps enhances the listener’s ability to exploit contextual cues.
Howard-Jones and Rosen (1993)
• “Checkerboard” noise maskers
• Effects of interruption rate and frequency bandwidth of the checkerboard pattern
• Can listeners exploit asynchronous time-frequency glimpses?
• Yes, but only over broad frequency ranges
[Figure: time-frequency plots (frequency 0.1–2.0 kHz vs. time, 0–700 ms) of checkerboard noise patterns]
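The “checkerboard” pattern can be sketched as a binary time-frequency mask that alternates in both dimensions. Array shapes and alternation periods below are illustrative, not the stimulus parameters from the study:

```python
import numpy as np

def checkerboard_mask(n_bands, n_frames, band_period, frame_period):
    """Binary time-frequency mask alternating on/off in both dimensions."""
    band_idx = (np.arange(n_bands) // band_period) % 2
    frame_idx = (np.arange(n_frames) // frame_period) % 2
    # XOR of the two alternations yields the checkerboard pattern.
    return (band_idx[:, None] ^ frame_idx[None, :]).astype(float)

# Example: 8 frequency bands x 100 time frames, alternating every
# 2 bands and every 10 frames.
mask = checkerboard_mask(8, 100, band_period=2, frame_period=10)
```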
Speech source separation
• How do the ear and brain separate the target voice from the noise?
– spatial cues
– lip-reading
– semantic context
– auditory scene analysis (Bregman, 1990)
– glimpsing and tracking

A glimpsing model of speech perception in noise
Martin Cooke
Journal of the Acoustical Society of America, Vol. 119, No. 3, pp. 1562–1573, March 2006
Auditory scene analysis
• Bregman (1990)
– The sound that reaches the eardrum of the listener is often a mixture of different sources.
– Acoustic signals originating from different sound sources combine additively.
– Unlike vision, the concept of occlusion is hard to define in audition: sounds overlap but also combine in complex ways.
– Human listeners are good at separating mixtures of sounds, as reflected in speech communication and listening to music in complex listening environments (cocktail parties).

Computational auditory scene analysis
• Reviewed by Cooke and Ellis (2001)
• Attempts to reproduce this separation process using computational models have had limited success (a hard problem!)
“Glimpsing” speech in noise
• “speech is a highly modulated signal in time and frequency, regions of high energy are typically sparsely distributed.”
• “The information conveyed by the spectrotemporal energy spectrum of clean speech is redundant… Redundancy allows speech to be identified based on relatively sparse evidence.”
[Figure: time-frequency representations of speech showing sparsely distributed high-energy regions]
Glimpsing speech in noise
• Can listeners take advantage of “glimpses”?
– Direct attention to spectrotemporal regions where the S+N mixture is dominated by the target speech
– ASR system trained to recognize consonants in noise
– Maskers differed in “glimpse size”
– ASR model developed to exploit the non-uniform distribution of SNR in different time-frequency bands
– Conclusion: model + listeners benefit from glimpsing.
Speech + noise mixtures
• Some regions are dominated by the target voice.
• Local SNR varies across time and frequency.
• Where the target voice dominates, the problem of source segregation is solved because the signal is effectively “clean” speech.
• Clean speech is highly redundant; it remains intelligible after 50% or more of its energy is removed by gating and/or filtering.
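When the clean speech and noise are available separately, as in these modeling studies, the local SNR can be computed per time-frequency cell from their short-time spectra. A sketch using SciPy’s STFT (window length and the small floor constant are my choices):

```python
import numpy as np
from scipy.signal import stft

def local_snr_db(speech, noise, fs, nperseg=512):
    """Per-cell SNR (dB) from short-time power spectra of speech and noise."""
    _, _, S = stft(speech, fs=fs, nperseg=nperseg)
    _, _, N = stft(noise, fs=fs, nperseg=nperseg)
    eps = 1e-12   # guard against log(0) in silent cells
    return 10.0 * np.log10((np.abs(S) ** 2 + eps) / (np.abs(N) ** 2 + eps))
```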
STEP model
• Auditory excitation pattern (Moore, 2003)
– Spectrogram-like representation
– Reflects non-uniform frequency selectivity in different frequency bands
– Incorporates a sliding time window reflecting temporal analysis by the auditory system
– Relative audibility at different frequencies
– Loudness model
• Sparseness and redundancy

Missing data ASR
• Glimpses = spectrotemporal regions where the signal exceeds the masker by ~3 dB.
• HMM-based speech recognizer
• “Missing-data” models (sketched below):
– Glimpses only: ignore missing information (in masked regions)
– Glimpses-plus-background: try to fill in missing information (based on masked regions)
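Using the local SNR map above, the 3 dB glimpse criterion and the two missing-data strategies can be sketched as mask operations. The threshold comes from the slide; the fill value in the second strategy is only a placeholder for the model-based imputation used in real missing-data recognizers:

```python
import numpy as np

def glimpse_mask(local_snr_db, threshold_db=3.0):
    """Glimpses: time-frequency cells where the target exceeds the masker."""
    return local_snr_db > threshold_db

def glimpses_only(features, mask):
    """Keep reliable cells; mark masked cells as missing (NaN) so the
    recognizer can marginalize over them."""
    out = features.copy()
    out[~mask] = np.nan
    return out

def glimpses_plus_background(features, mask, fill):
    """Replace masked cells with an estimate (here a constant standing in
    for model-based reconstruction of the missing regions)."""
    out = features.copy()
    out[~mask] = fill
    return out
```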
[Figure: syllable identification accuracy as a function of the number of competing voices; the level of the target speech (monosyllabic nonsense words) was held constant at 95 dB (after Miller, 1947)]

[Figure: target, single-talker masker, eight-talker masker, speech-shaped noise, and the resulting glimpses]
Results
[FIG. 4: the correlation between intelligibility and the proportion of the target speech in which the local SNR exceeds 3 dB. Each point represents a noise condition, and proportions are means across all tokens in the test set. The best linear fit is also shown. The correlation between listeners and these putative glimpses is 0.955.]

Conclusions
• Best model:
– Uses information in glimpses and counterevidence in the masked regions
– Glimpses constrained to a minimum area
– Treats all regions with local SNR > −5 dB as potential glimpses
• A higher “glimpse threshold” (e.g., local SNR > 0 dB) produces fewer glimpses, but provides less distorted information than a lower threshold (e.g., −5 dB).
• Limitation: the local SNR must be known in advance. Is there a way to estimate the local SNR directly from the mixture?
• Tracking problem: how to integrate glimpses over time?
Brungart et al. (2001)
[Figure: percentage of correct responses in the 2-talker task as a function of target-to-masker ratio (−12 to +12 dB), for four masker types: different talker, different sex; different talker, same sex; modulated noise; same talker]