Speech Production Mechanisms

Speech Production Organs
[Figure: cross-section of the vocal tract: brain, nasal cavity, hard palate, velum, uvula, teeth, lips, mouth cavity, pharynx, tongue, esophagus, larynx, trachea, lungs]

Speech Production Organs - cont.
Air from the lungs is exhaled into the trachea (windpipe)
The vocal cords (folds) in the larynx can produce periodic pulses of air by opening and closing (the glottis)
The throat (pharynx), mouth, tongue and nasal cavity modify the air flow
The teeth and lips can introduce turbulence
The epiglottis separates the esophagus (food pipe) from the trachea

Voiced vs. Unvoiced Speech
When the vocal cords are held open, air flows through unimpeded
When the laryngeal muscles stretch them, the glottal flow comes in bursts
When the glottal flow is periodic the speech is called voiced
The basic interval/frequency is called the pitch
  • the pitch period is usually between 2.5 and 20 milliseconds
  • the pitch frequency is thus between 50 and 400 Hz
You can feel the vibration of the larynx
Vowels are always voiced (unless whispered)
Consonants come in voiced/unvoiced pairs, for example:
  B/P G/K D/T V/F J/CH TH/th W/WH Z/S ZH/SH

Excitation spectra
Voiced speech: the pulse train is not sinusoidal, so its spectrum is harmonic rich
Unvoiced speech: the common assumption is white noise (a flat spectrum)

Effect of vocal tract
The mouth and nasal cavities have resonances
The resonant frequencies depend on their geometry

Effect of vocal tract - cont.
Sound energy at these resonant frequencies is amplified
The frequencies of peak amplification are called formants
[Figure: frequency response with formant peaks F1, F2, F3, F4; voiced speech shows pitch harmonics (spacing F0) under this envelope, unvoiced speech shows the envelope alone]

Formant frequencies
Peterson-Barney data (note the "vowel triangle")
[Figure: Peterson-Barney vowel formant data]

Sonograms
[Figure: sonograms (time-frequency plots) of speech]

Cylinder model(s)
A rough model of the throat and mouth cavity:
voice excitation driving a concatenation of cylinders, open at the lips
[Figure: cylinder model, and a variant adding the nasal cavity as a second branch (open/closed)]

Phonemes
The smallest acoustic unit that can change meaning
Different languages have different phoneme sets
Types (notations: phonetic, CVC, ARPABET):
– Vowels
  • front (heed, hid, head, hat)
  • mid (hot, heard, hut, thought)
  • back (boot, book, boat)
  • diphthongs (buy, boy, down, date)
– Semivowels
  • liquids (l, r)
  • glides (w, y)

Phonemes - cont.
– Consonants
  • nasals (murmurs) (n, m, ng)
  • stops (plosives)
    – voiced (b, d, g)
    – unvoiced (p, t, k)
  • fricatives
    – voiced (v, that, z, zh)
    – unvoiced (f, think, s, sh)
  • affricates (j, ch)
  • whispers (h, what)
  • gutturals (ע, ח)
  • clicks, etc.

Basic LPC Model
[Block diagram: a pulse generator and a white-noise generator feed a U/V switch; the selected excitation drives the LPC synthesis filter]

Basic LPC Model - cont.
The pulse generator produces a harmonic-rich periodic impulse train (with given pitch period and gain)
The white-noise generator produces a random signal (with given gain)
The U/V switch chooses between voiced and unvoiced speech
The LPC filter amplifies the formant frequencies (an all-pole, i.e. AR IIR, filter)
The output will resemble true speech to within a residual error
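As a concrete illustration, here is a minimal Python sketch of this two-source/all-pole synthesis model. The sample rate, filter coefficients, pitch period and gain below are hypothetical values chosen for illustration, not parameters from the lecture.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000    # sample rate in Hz (assumed)
n = 2000     # frame length in samples
gain = 0.1   # excitation gain (assumed)
# All-pole synthesis filter 1/A(z); in the convention s(n) = G e(n) + sum_m a_m s(n-m)
# this denominator corresponds to a1 = 1.3, a2 = -0.8 (hypothetical, stable poles)
denom = [1.0, -1.3, 0.8]

def synthesize(voiced: bool, pitch_period: int = 80) -> np.ndarray:
    if voiced:
        excitation = np.zeros(n)          # pulse generator:
        excitation[::pitch_period] = 1.0  # harmonic-rich periodic impulse train
    else:
        excitation = np.random.randn(n)   # white-noise generator
    # LPC synthesis filter amplifies the formant frequencies
    return lfilter([gain], denom, excitation)

voiced_frame = synthesize(True)     # U/V switch set to "voiced"
unvoiced_frame = synthesize(False)  # U/V switch set to "unvoiced"
```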
Cepstrum
Another way of thinking about the LPC model
The speech spectrum is obtained by multiplication:
  the spectrum of the (pitch) pulse train
  times the vocal tract (formant) frequency response
So the log of this spectrum is obtained by addition:
  the log spectrum of the pitch pulse train
  plus the log of the vocal tract frequency response
Consider this log spectrum to be the spectrum of some new signal, called the cepstrum
The cepstrum is thus the sum of two components: excitation plus vocal tract

Cepstrum - cont.
Cepstral processing has its own language:
  cepstrum (note that this is really a signal in the time domain)
  quefrency (its units are seconds)
  liftering (filtering)
  alanysis (analysis)
  saphe (phase)
Several variants: complex cepstrum, power cepstrum, LPC cepstrum
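A minimal sketch of computing one of these variants (the real cepstrum of a single frame): the log turns the spectral product into a sum, and transforming back gives a quefrency-domain signal. The Hamming window is an illustrative choice.

```python
import numpy as np

def real_cepstrum(frame: np.ndarray) -> np.ndarray:
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # log: multiplication -> addition
    return np.fft.irfft(log_mag)                # a "signal" indexed by quefrency

# Low quefrencies hold the smooth vocal-tract log spectrum; a voiced frame
# also shows a peak at the pitch period.  "Liftering" means keeping or
# zeroing a quefrency range before transforming back.
```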
Do we know enough?
The standard speech model (LPC), used by most speech processing/compression/recognition systems, is a model of speech production
Unfortunately, the speech production and speech perception systems are not matched
So next we'll look at the biology of the hearing (auditory) system and some psychophysics (perception)

Speech Hearing & Perception Mechanisms

Hearing Organs
[Figure: cross-section of the ear]

Hearing Organs - cont.
Sound waves impinge on the outer ear and enter the auditory canal
The amplified waves cause the eardrum to vibrate
The eardrum separates the outer ear from the middle ear
The Eustachian tube equalizes the air pressure of the middle ear
The ossicles (hammer, anvil, stirrup) amplify the vibrations
The oval window separates the middle ear from the inner ear
The stirrup excites the oval window, which excites the liquid in the cochlea
The cochlea is curled up like a snail
The basilar membrane runs along the middle of the cochlea
The organ of Corti transduces vibrations into electric pulses
The pulses are carried by the auditory nerve to the brain

Function of Cochlea
The cochlea has 2 1/2 to 3 turns; straightened out it would be 3 cm in length
The basilar membrane runs down the center of the cochlea, as does the organ of Corti
15,000 cilia (hairs) contact the vibrating basilar membrane and release neurotransmitter, stimulating 30,000 auditory neurons
The cochlea is wide (1/2 cm) near the oval window and tapers towards the apex;
it is stiff near the oval window and flexible near the apex
Hence high frequencies cause the section near the oval window to vibrate,
while low frequencies cause the section near the apex to vibrate
The result is an overlapping bank of filters - a frequency decomposition

Psychophysics - Weber's law
Ernst Weber, professor of physiology at Leipzig in the early 1800s
Just Noticeable Difference: the minimal stimulus change that can be detected by the senses
Discovery: ΔI = K I
Example from the tactile sense: place coins in each hand;
a subject could discriminate between 10 coins and 11, but not between 20 and 21 - yet could between 20 and 22!
Similarly for vision (lengths of lines), taste (saltiness), and sound (frequency)

Weber's law - cont.
This makes a lot of sense: one more dollar is quite noticeable to a pauper, but not to Bill Gates

Psychophysics - Fechner's law
Weber's law is not a true psychophysical law:
it relates a stimulus threshold to a stimulus (both physical entities),
not an internal representation (a feeling) to a physical entity
Gustav Theodor Fechner, student of Weber (medicine, physics, philosophy)
Simplest assumption: the JND is a single internal unit
Using Weber's law we find: Y = A log I + B
Fechner Day (October 22, 1850)

Fechner's law - cont.
The log is very compressive;
Fechner's law explains the fantastic ranges of our senses
  Sight: from a single photon to direct sunlight, a range of 10^15
  Hearing: from eardrum motion the size of one hydrogen atom to a jet plane, 10^12
The Bel is defined to be log10 of a power ratio; the decibel (dB) is one tenth of a Bel
  d(dB) = 10 log10 (P1 / P2)

Fechner's law - sound amplitudes
Companding: an adaptation of the logarithm to positive/negative signals
μ-law and A-law are piecewise linear approximations
Equivalent to linear sampling at 12-14 bits
(8 bit linear sampling is significantly more noisy)

Fechner's law - sound frequencies
Octaves: the well-tempered scale (each semitone a ratio of 2^(1/12))
Critical bands and frequency warping:
  mel (from "melody"): 1 kHz = 1000 mel, JND steps above that
    M ~ 1000 log2 (1 + f_kHz)
  Bark (after Barkhausen): critical bandwidth, governing what can be simultaneously heard
    B ~ 25 + 75 (1 + 1.4 f_kHz²)^0.69
Different critical bands excite different basilar membrane regions
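The two warping formulas above translate directly into code; this is just a transcription of the slide's expressions (with f in kHz), not a library implementation.

```python
import numpy as np

def mel(f_khz: float) -> float:
    """Mel frequency warp: M ~ 1000*log2(1 + f_kHz); 1 kHz maps to 1000 mel."""
    return 1000.0 * np.log2(1.0 + f_khz)

def critical_bandwidth_hz(f_khz: float) -> float:
    """Bark critical bandwidth: B ~ 25 + 75*(1 + 1.4*f_kHz^2)^0.69."""
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

print(mel(1.0))                    # 1000.0 by construction
print(critical_bandwidth_hz(1.0))  # roughly 160 Hz around 1 kHz
```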
Psychophysics - changes
Our senses respond to changes
[Figure: "inverse E" filter]

Psychophysics - masking
Masking: a strong tone blocks weaker ones at nearby frequencies;
narrowband noise blocks tones (up to a critical band away)
[Figure: masking threshold around a masker as a function of frequency]

Speech DSP

Some Speech DSP
Simplest processing:
– gain
– AGC
– VAD
More complex processing:
– pitch tracking
– U/V decision
– computing LPC
– other features

Simple Speech DSP

Gain (volume) Control
In analog processing (electronics), gain requires an amplifier
Great care must be taken to ensure linearity!
In digital processing (DSP), gain requires only a multiplication: y = G x
We just need enough bits!

Automatic Gain Control (AGC)
Can we set the gain automatically? Yes, based on the signal's energy
  E = ∫ x²(t) dt = Σₙ xₙ²
All we have to do is apply gain until we attain the desired energy
Assume we want the energy to be Y
Then y = √(Y/E) x = G x has exactly this energy

AGC - cont.
What if the input isn't stationary (gets stronger and weaker over time)?
The energy is defined over all times (-∞ < t < ∞), so it can't help!
So we define an "energy in a window" E(t) and continuously vary the gain G(t)
This is Adaptive Gain Control
We don't want the gain to jump from window to window,
so we smooth the instantaneous gain with an IIR filter:
  G(t) ← a G(t) + (1-a) √(Y/E(t))

AGC - cont.
The a coefficient determines how fast G(t) can change
In more complex implementations we may separately control the integration time, attack time and release time
What is involved in the computation of G(t)?
– squaring of the input value
– accumulation
– square root (or Pythagorean sum)
– inversion (division)
The square root and the inversion are hard for a DSP processor,
but algorithmic improvements are possible (and often needed)
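A minimal sketch of this windowed AGC with gain smoothing, following the recursion above. The window length, target energy Y and smoothing coefficient a are illustrative assumptions.

```python
import numpy as np

def agc(x: np.ndarray, target_energy: float = 1.0,
        window: int = 160, a: float = 0.9) -> np.ndarray:
    """Adaptive gain control; x is assumed to be float samples."""
    y = np.empty_like(x)
    gain = 1.0
    for start in range(0, len(x), window):
        frame = x[start:start + window]
        energy = np.sum(frame ** 2) + 1e-12            # energy in window, E(t)
        instantaneous = np.sqrt(target_energy / energy)
        gain = a * gain + (1.0 - a) * instantaneous    # one-pole IIR smoothing
        y[start:start + window] = gain * frame
    return y
```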
Simple VAD
Sometimes it is useful to know whether someone is talking (or not):
– to save bandwidth
– to suppress echo
– to segment utterances
We might be able to get away with an "energy VOX"
(normally this needs a Noise Riding Threshold / Signal Riding Threshold)
However, there are problems with an energy VOX, since it doesn't differentiate between speech and noise
What we really want is a speech-specific activity detector: a Voice Activity Detector

Simple VAD - cont.
VADs operate by recognizing that speech is different from noise:
– speech is low-pass while noise is white
– speech is mostly voiced and so has pitch in a given range
– the average noise amplitude is relatively constant
A simple VAD may use:
– zero crossings
– the zero crossing "derivative"
– a spectral tilt filter
– energy contours
– combinations of the above

Other "simple" processes
Simple = not significantly dependent on the details of the speech signal
  speed change of a recorded signal
  speed change with pitch compensation
  pitch change with speed compensation
  sample rate conversion
  tone generation
  tone detection
  dual tone generation
  dual tone detection (needs high reliability)

Complex Speech DSP

Correlation
One major difference between simple and complex processing is the computation of correlations (related to the LPC model)
Correlation is a measure of similarity
Shouldn't we use the squared difference to measure similarity?
  D² = ⟨ (x(t) - y(t))² ⟩
No, since the squared difference is sensitive to
– gain
– time shifts

Correlation - cont.
D² = ⟨ (x(t) - y(t))² ⟩ = ⟨x²⟩ + ⟨y²⟩ - 2⟨x(t) y(t)⟩
So when D² is minimal, C(0) = ⟨x(t) y(t)⟩ is maximal,
and arbitrary gains don't change this
To take time shifts into account, define C(τ) = ⟨x(t) y(t+τ)⟩ and look for the maximal τ!
We can even find out how much a signal resembles itself

Autocorrelation
Crosscorrelation: C_xy(τ) = ⟨x(t) y(t+τ)⟩
Autocorrelation: C_x(τ) = ⟨x(t) x(t+τ)⟩
  C_x(0) is the energy!
Autocorrelation helps find hidden periodicities -
much stronger than looking in the time representation
Wiener-Khintchine: the autocorrelation C(τ) and the power spectrum S(f) are an FT pair
So the autocorrelation contains the same information as the power spectrum...
and can itself be computed by FFT

Pitch tracking
How can we measure (and track) the pitch?
We could look for it in the spectrum,
– but it may be very weak
– it may not even be there (filtered out)
– we would need high resolution spectral estimation
Correlation based methods: the pitch periodicity should be seen in the autocorrelation!
Sometimes computationally simpler is the Absolute Magnitude Difference Function
  AMDF(τ) = ⟨ | x(t) - x(t+τ) | ⟩
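A sketch of the correlation idea: search the autocorrelation over lags corresponding to valid pitch periods (2.5-20 ms, i.e. 50-400 Hz). The sample rate is an assumption, the frame must be longer than the largest lag, and a serious tracker would add the voicing check, filtering, center clipping and post-processing described next.

```python
import numpy as np

def pitch_autocorr(frame: np.ndarray, fs: int = 8000) -> float:
    lo, hi = fs // 400, fs // 50         # lags for 400 Hz down to 50 Hz
    c = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lag = lo + int(np.argmax(c[lo:hi]))  # lag with maximum correlation
    return fs / lag                      # pitch frequency estimate in Hz

def amdf(frame: np.ndarray, lag: int) -> float:
    # Absolute Magnitude Difference Function: minimal (not maximal) at the pitch period
    return float(np.mean(np.abs(frame[:-lag] - frame[lag:])))
```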
Pitch tracking - cont.
Sondhi's algorithm for autocorrelation-based pitch tracking:
– obtain a window of speech
– determine whether the segment is voiced (see the U/V decision below)
– low-pass filter and center-clip to reduce formant-induced correlations
– compute the autocorrelation at lags corresponding to valid pitch intervals
  • find the lag with maximum correlation, OR
  • find the lag with maximal accumulated correlation over all its multiples
Post processing: pitch trackers rarely make small errors (the usual error is a doubled pitch),
so correct outliers based on neighboring values

Other Pitch Trackers
Miller's data-reduction and Gold & Rabiner's parallel-processing methods
  zero-crossings, energy, extrema of the waveform
Noll's cepstrum based pitch tracker
  the pitch and formant contributions are separated in the cepstral domain
  most accurate for clean speech, but not robust in noise
Methods based on the LPC error signal
  the LPC technique breaks down at pitch pulse onsets
  find the periodicity of the error by autocorrelation
Inverse filtering method
  remove the formant filtering by a low-order LPC analysis
  find the periodicity of the excitation by autocorrelation
Sondhi-like methods are the best for noisy speech

U/V decision
In complexity, between VAD and pitch tracking
The simplest U/V decision is based on energy and zero crossings
More complex methods are combined with pitch tracking
There are also methods based on pattern recognition
Is voicing even well defined?
  degree of voicing (buzz)
  voicing per frequency band (interference)
  degree of voicing per frequency band

LPC Coefficients
How do we find the vocal tract filter coefficients?
This is a system identification problem:
a known input drives an unknown all-pole (AR) filter, producing a known output
Connection to prediction: sₙ = G eₙ + Σₘ aₘ sₙ₋ₘ
We can find G from the energy (so let's ignore it)

LPC Coefficients - cont.
For simplicity, let's assume three a coefficients:
  sₙ = eₙ + a₁ sₙ₋₁ + a₂ sₙ₋₂ + a₃ sₙ₋₃
We need three equations!
  sₙ   = eₙ   + a₁ sₙ₋₁ + a₂ sₙ₋₂ + a₃ sₙ₋₃
  sₙ₊₁ = eₙ₊₁ + a₁ sₙ   + a₂ sₙ₋₁ + a₃ sₙ₋₂
  sₙ₊₂ = eₙ₊₂ + a₁ sₙ₊₁ + a₂ sₙ   + a₃ sₙ₋₁
In matrix form:
  [ sₙ   ]   [ eₙ   ]   [ sₙ₋₁  sₙ₋₂  sₙ₋₃ ] [ a₁ ]
  [ sₙ₊₁ ] = [ eₙ₊₁ ] + [ sₙ    sₙ₋₁  sₙ₋₂ ] [ a₂ ]
  [ sₙ₊₂ ]   [ eₙ₊₂ ]   [ sₙ₊₁  sₙ    sₙ₋₁ ] [ a₃ ]
that is, s = e + S a

LPC Coefficients - cont.
s = e + S a, so by simple algebra a = S⁻¹ (s - e),
and we have reduced the problem to matrix inversion
S is a Toeplitz matrix, so the inversion is easy (Levinson-Durbin algorithm)
Unfortunately, noise makes this attempt break down!
Move to the next time instant and the answer will be different;
we need to somehow average the answers
The proper averaging is done before the equation solving
(correlation vs. autocovariance)

LPC Coefficients - cont.
We can't just average the equations over time - they would all be the same!
Let's take the input to be zero: sₙ = Σₘ aₘ sₙ₋ₘ
Multiply by sₙ₋q and sum over n:
  Σₙ sₙ sₙ₋q = Σₘ aₘ Σₙ sₙ₋ₘ sₙ₋q
We recognize the autocorrelations:
  C_s(q) = Σₘ aₘ C_s(|m-q|)
These are the Yule-Walker equations
  autocorrelation method: the sₙ outside the window are taken to be zero (Toeplitz)
  autocovariance method: use all the needed sₙ (no window)
Also - pre-emphasis!

Alternative features
The a coefficients aren't the only possible feature set:
  reflection coefficients (cylinder model)
  log-area coefficients (cylinder model)
  pole locations
  LPC cepstrum coefficients
  Line Spectral Pair (LSP) frequencies
All theoretically contain the same information (they are related by algebraic transformations)
Euclidean distance in LPC cepstrum space ~ the Itakura-Saito measure,
so LPC cepstrum coefficients are popular in speech recognition
LPC (a) coefficients don't quantize or interpolate well,
so they aren't good for speech compression
LSP frequencies are best for compression

LSP coefficients
The a coefficients are not statistically equally weighted
Pole positions are better (geometric), but the radius is sensitive near the unit circle
Is there an all-angle representation?
Theorem 1: every real polynomial with all its roots on the unit circle is
palindromic (e.g. 1 + 2t + t²) or antipalindromic (e.g. 1 + t - t² - t³)
Theorem 2: every polynomial can be written as the sum of a palindromic and an antipalindromic polynomial
Consequence: every polynomial can be represented by roots on the unit circle, that is, by angles

LPC-based Compression
We learned that from
– gain
– pitch
– a small number of LPC coefficients
we can synthesize speech
It is easy to find the energy of a speech signal
We have seen methods to find the pitch
We have seen how to extract the LPC coefficients from speech
So do we know how to compress speech?
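Tying the analysis slides together, here is a minimal sketch of the autocorrelation method: window a frame, compute the autocorrelation lags, and solve the Yule-Walker equations with the Levinson-Durbin recursion. The model order and the Hamming window are illustrative assumptions.

```python
import numpy as np

def lpc(frame: np.ndarray, order: int = 10) -> np.ndarray:
    w = frame * np.hamming(len(frame))  # samples outside the window count as zero
    r = np.correlate(w, w, mode='full')[len(w) - 1:len(w) + order]  # C_s(0..order)
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12
    for i in range(1, order + 1):
        k = -np.dot(a[:i], r[i:0:-1]) / err  # reflection coefficient (cf. cylinder model)
        a[:i + 1] += k * a[i::-1]            # Levinson-Durbin order update
        err *= 1.0 - k * k
    return a  # A(z) = 1 + a[1] z^-1 + ...; the slides' predictor coefficients are aₘ = -a[m]
```

Quantizing these per frame (gain, pitch, U/V decision, coefficients) is exactly the compression question the last slide poses.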