Download pcp_notes_speech_pro.. - ee.iitb

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Auditory system wikipedia , lookup

Auditory processing disorder wikipedia , lookup

Speech perception wikipedia , lookup

Dysprosody wikipedia , lookup

Transcript
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
Introduction to
SPEECH PROCESSING
P. C. Pandey
EE Dept., IIT Bombay
References
1.
2.
3.
L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Pearson Education, 2004
B. Gold and N. Morgan, Speech and Audio Signal Processing, John Wiley, 2002
D. O'Shaughnessy, Speech Communications: Human and Machine, Universities Press, 2001
1
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
Speech communication
• Speech production
• Propagation of the acoustic wave
• Speech perception by the listener
Applications of speech processing
• Generation
• Transmission & storage (coding, speech enhancement)
• Reception (speaker & speech recognition, quality assessment)
• Aids for the disabled
2
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
Applications of speech processing
• Generation (speech synthesis)
Machine voice output
Reading aid for the blind
Testing of speech communication systems
Study of speech production & perception system
Speech production aids
• Transmission / storage
Coding ○reduction in channel capacity requirements
○encryption
Reduction in susceptibility to noise
Speech enhancement
Voice transformation: change of accent, speaker identity
3
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
• Reception
Analysis for studying speech signal
Diagnostic aid for speech disorders
Speaker identification & quality assessment
Speech recognition
Aids for the hearing impaired
Variable rate playback
Visual / tactile displays
Effective use of residual hearing
Electrical stimulation of auditory nerve
4
5
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
Speech units
• Words
• Syllables
• Phonemes
• Sub-phonemic events
Speech transcription (scripts and alphabet systems)
• Pictographic
• Phonemic
• Alphabetic
• Syllabic
International Phonetic Alphabet (IPA)
अ /^/,
आ /a/,
क़् /k/,
ख़ ् /kh/,
श ् / /, etc.
6
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
Efficiency of phonemic transcription
• Measurement of information
Information: mean logarithmic probability (MLP).
For a set of messages x1 , x 2 .... x n with probabilities p1 , p 2... p n ,
information
n
I  MLP xi   pi log 2 1/ pi bits
i 1
If the messages occur at the rate of M messages/s, then the rate of
information is
 
C  M I bits/s


P. C. Pandey / Introduction to Speech Processing / Jan. 2008
• Phonemic transcription of speech
Consider speech generated from phonemic transcription (written text).
No of phonemes in English = 42.
Considering them equiprobable, p = 1/42
42
I   pi log 2
i 1
1/ pi   5.39 bits
For conversational speech, with10 phonemes/s. C  M I = 53.9 bits/s.
Considering actual phoneme probabilities, I = 4.9 bits and C = 49 bits/s
Considering relatedness of sequence of phonemes, C ≈ 45 bits/s
7
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
• Channel capacity requirements
Channel capacity of an analog channel, with signal power S , noise power
N , and bandwidth B Hz ,
S 

C  B log 2  1  
 N
bits/s
For conventional analog telephony,
B = 3 kHz, SNR = 30 dB → S/N = 1000
3

C  3  10 log 2 1  1000
  30 k bits/s
8
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
9
Now consider good quality speech
Digitization (without any data compression techniques)
▪ 12 bit quantization (212 = 4096 levels)
▪ Sampling rate = 10 k samples/s
Channel capacity required for digital transmission
C = (10×103)log2 212 = 120 k bits/s
Thus we have three different estimates for channel capacity requirements
for speech
▪ Analog telephony (3 kHz bandwidth, 30 dB SNR): 30 k bits/s
▪ Digital transmission (12-bit, 10 k samples/s, no encoding): 120 k bits/s
▪ Phonemic transcription:  45 bits/s
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
10
Auditory system
▪ Peripheral auditory system
External ear : pinna and auditory canal (sound collection)
→ Middle ear : ear drum and middle ear bones (impedance matching)
→ Inner ear : cochlea (analysis and transduction)
→ Auditory nerve (transmission of neural impulses)
▪ Central auditory system (information interpretation)
11
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
Peripheral auditory system (outer, middle, and inner ear).
(Adapted from Flanagan (1972a), Fig. 4.1.)
Outer ear
Middle
ear
Inner ear
Vestibular apparatus
with semicircular
canals
Vestibular
nerve
Malleus
Incus
Stapes
Pinna
Cochlear
nerve
External
canal
Cochlea
Eardrum
(Tympanic
membrane)
Oval window
Round window
Eustachian tube
Nasal cavity
FIG. 2.1. Peripheral part of the auditory system. Source: Flanagan (1972), Fig. 4.1.
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
A longitudinal section through the uncoiled cochlea
(Source: Pickles (1982), Fig. 3.1.(C))
12
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
The mechanism of
speech production
A schematic diagram
of the human vocal
mechanism
13
14
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
A schematic representation
of the vocal apparatus. (Adapted
from Rabiner & Schafer (1978), Fig. 3.2)
Pitch : F0 (rate of vibration of
vocal cords)
Formant frequencies: F1, F2, ..
(resonance frequencies of
vocal tract filter)
u (t )
vocal tract
filter
Radiation into
air
s (t )
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
Phonemic features
▪ Modes of excitation
• Glottal: ○voiced ○unvoiced (aspiration)
• Frication (constriction in vocal tract) : ○voiced ○unvoiced
▪ Movement of articulators
• Continuant (steady vocal tract): vowels, nasal stops, fricatives
• Non-continuant (changing vocal tract): diphthongs, semivowels,
plosives (oral stops)
▪ Place of articulation
• bilabial, labio-dental, linguo-dental, alveolar, palatal, velar, gluttoral
▪ Changes in Fo
15
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
Classification
of English
phonemes
16
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
Classification
of Hindi
phonemes
17
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
Suprasegmental features
• Intonation
• Rhythm (syllable stressing)
Carriers
○Changes in intensity
○Syllable duration
○Changes in voice pitch
• Visual features
Partial information on place of articulation
18
19
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
Short-time speech analysis
s ( n)
 s ( n ) w( n  m)
windowing

Storage/display/transmission
 Qk ( n )
decimation
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
Speech analysis techniques
• Time domain analysis
○ Intensity
○ Autocorrelation
○ Zero crossing rate
○ AMDF
• Frequency domain analysis
○ Filter bank analysis
○ Spectrographic displays
○ F0 estimation and tracking
○ Short-time Fourier analysis
○Formant estimation & tracking
• Separation of excitation and vocal tract filter effects
○ Cepstral analysis
○ Linear predictive coding (LPC)
20
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
21
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
22
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
23
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
24
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
25
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
26
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
27
28
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
Multi-resolution spectrographic analysis
Analog analysis
Spectral analysis using bandpass filter & demodulator
s(t ) 
 St ( f ) dB
f = 300 Hz (wideband), 45 Hz (narrow band)
Digital analysis
N 1
Short-time Fourier transform: X ( n, k )   w(m) x( n  m)e
 j 2
mk
N
m 0
L

nt

 L
L
Hamming window : w(n)  0.54  0.46 cos  2 2  ,   n 
L 1  2
2



SdB (n, k )  20 log  X (n, k ) / X ref
For f s  10 k Sa/s, L  43  f

300 Hz
L  289  f  45 Hz
• Change in resolution, implemented by padding L-sample segment with N-L zero valued
samples, and using N-point DFT. Pre-emphasis for boosting high-frequency components.
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
Vowel
triangle
29
30
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
Spectrograms for English phonemes
(vowels or /aCa/ syllable)
Manner
Oral stop
Nasal stop
Oral fricative
Nasal
fricative
Voicing
unvoiced
voiced
unvoiced
voiced
unvoiced
voiced
unvoiced
voiced
Back
k
g
×
ng
(sh)
(zh)
×
×
Place of articulation
Mid
Front
t
p
d
b
×
×
n
m
s
f
z
v
×
×
×
×
31
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
Cepstral analysis
e( n )
v ( n)
s ( n )  e( n )  v ( n )
S ( z )  E ( z )V ( z )
Excitation
Vocal tract filter
response
We can use "log" transformation for deconvolution and separation of excitation & vocal
tract filter parameters.
32
P. C. Pandey / Introduction to Speech Processing / Jan. 2008


x ( n)
X ( z)
X ( z)
log
Z
Z
1

x(n)  Z 1 log  X ( z )
log  S ( z)  log E( z)  logV ( z)
 Z 1



s ( n )  e( n )  v ( n )
x ( n)
33
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
Linear predictive coding (LPC)
 Excitation e(n) : modeled as
an impulse or random noise.
Excitation
e( n )
Vocal tract
s ( n)
V ( z)
 Vocal tract filter transfer
function V(z): modeled as an
all-pole filter (AR model).
 Inverse filter A(z): all-zero filter
(FIR or MA filter), with parameters
estimated for minimizing  (n)
in LMS sense.
Inverse filter
+

 ( n)
Filter param.
Estimation
Result
 Error  (n) acquires the characteristics of
the excitation function
 Inverse filter coefficients, obtained by solving a set of linear equations, approximate the vocal
tract filter coefficients.
34
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
Speech production
e( n )
s ( n)
V ( z)
V ( z) 
1

1  A( z )
1
p
1   ak z  k
k 1
p
e ( n )  s ( n )   ak s ( n  k )
k 1
35
P. C. Pandey / Introduction to Speech Processing / Jan. 2008
Linear prediction model

s ( n)
s ( n)

A( z )

p
+

s ( n)   a k s ( n  k )
k 1

 ( n)  s ( n)  s ( n )
p 
 s ( n)   a k s ( n  k )
k 1
 ( n)