The Evolving Quality
of Telephonic Speech
Why VoIP's speech quality
is disappointing, and how
it wouldn't have to be.
Richard A. Thompson
Emeritus Professor
Telecom Program
University of Pittsburgh
[email protected]
Outline
1. Introduction
2. Human capacity for aural quality
3. History of evolving & devolving quality
4. Network integration vs app quality
5. High-fidelity Voice-over-IP
1. Introduction
• Telecom technology has benefited the human species.
– Morse, Bell, Tesla, Zworykin → we communicate over distance,
– But their inventions had greatly reduced aural & visual quality.
• During the last century, successive technology …
– Raised many aspects of the original audio & video quality,
– But, also lowered other aspects of app quality
• Two examples of lowered quality:
1. Successive technologies → reduced audio bandwidth,
2. Pixel-block “dance” after noisy or lost internet packets.
• This talk discusses the devolution of audio quality
– And concludes that we don’t have to live with it.
Gucci Family Slogan
“Quality is remembered …
long after
the price is forgotten”
2. Human Capacity for
Aural Quality
• Anatomy, physics, physiology, & brainware
– of human speech and hearing,
– how we discriminate phonemes & recognize speakers.
• Section Outline
1. Review of Human Speech
2. Review of Human Hearing
3. Review of Aural Processing
2. Human Capacity for Aural Quality
Review of Human Speech
• Speech = complex acoustic signal humans emit & receive
– Sequence of air compressions & rarefactions;
– Travels about 770 mph
• Speaking requires a complex structure:
– By modulating an exhaled air stream, we emit
sequences of elementary sounds, called phonemes.
• If we partly close our larynx as we exhale,
– our “vocal cords” vibrate at a fundamental pitch, f1 = 80 to 350 Hz,
– depending on the speaker’s size, shape, gender, & age.
• Altering tension changes f1 to any value
between half and double its regular pitch;
– for singing and linguistic cues.
Variable Acoustic Filter
• Acoustic waveform at the larynx resembles a saw-tooth → rich in harmonics.
• Mouth is a variable resonant cavity;
– It acts as a tunable acoustic filter.
• By changing our mouth’s internal shape,
– we attenuate different harmonics as they pass through.
• Our two main techniques are:
– Change our tongue position,
– Switch our nasal cavity in/out using our uvula.
• Each phoneme has a different “recipe”
– of the weights of the harmonics.
(Figure: example spectra for ee, nn, and aa.)
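As a concrete illustration of this source-filter idea, here is a minimal sketch using assumed pitch, bandwidth, and formant values (illustrative only, not taken from the talk): a harmonic-rich sawtooth “larynx” source is shaped by two resonant filters standing in for the mouth’s tunable cavity.

```python
# Minimal source-filter sketch (illustrative values only): a sawtooth "larynx"
# source, rich in harmonics, is shaped by two resonant filters standing in for
# the mouth's tunable acoustic cavity.
import numpy as np
from scipy.signal import lfilter

fs = 24000                        # sample rate, Hz (assumed)
f1 = 120                          # fundamental pitch, Hz (assumed)
t = np.arange(0, 0.5, 1 / fs)
source = 2 * (t * f1 % 1.0) - 1   # sawtooth waveform at the larynx

def resonator(x, freq, bw, fs):
    """Second-order resonant filter approximating one formant."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    a = [1, -2 * r * np.cos(theta), r * r]   # poles near the formant frequency
    return lfilter([1 - r], a, x)

# Rough formant targets for an "aa"-like vowel (illustrative only).
vowel = resonator(resonator(source, 730, 90, fs), 1090, 110, fs)
print("peak amplitude of synthesized vowel:", round(float(np.max(np.abs(vowel))), 3))
```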
Taxonomy of
English phonemes
Type                                Unvoiced              Voiced
Vowel-like (mouth)                  -                     vowels, ll, rr
Vowel-like (nose)                   -                     mm, nn, ng
Vowel-like (diphthongs)             -                     ow, long-i, …
Fricatives (sustained turbulence)   hh, ss, sh, ff, wh    zz, zh, vv
Plosives (burst turbulence)         ch, k, p, t           j, g, b, d

• Sustained phonemes:
– vowels, ll, rr,
– nasals,
– fricatives.
• Dynamic phonemes:
– Slowly: diphthongs,
– Quickly: plosives.
• The fricative & plosive rows pair up:
– 8 different mouth positions,
– 2 phonemes per position,
– distinguished by vibrating the larynx or not.
Mouth-to-Ear Spectrum
• Runs from f1 to our hearing limit of 14 - 20 kHz,
– depending on the listener’s age, etc.
• Acoustic energy in different phonemes
– is distributed differently over the aural spectrum.
• For example, fricatives like ss,
– have significant energy at the high end of the spectrum.
• Hearing accuracy is
– a non-linear function of how much
of this spectrum is actually heard.
2. Human Capacity for Aural Quality
Review of Human Hearing
• Ear drum, in each ear,
– is AC-coupled (the Eustachian tube maintains DC)
– to the cochlea by tiny linked bones.
• Cochlea is a horn, wrapped into a snail-shell,
– filled with fluid, lined with small hairs.
• The acoustic signal
– causes standing waves inside the cochlea
to excite nerves at the base of each hair.
– These nerves transmit a parallel signal to the brain,
giving the weights of the signal’s harmonics.
• Cochlea & its driver (in brainware) compute* the
– Fourier Series coefficients of the received acoustic signal.
(Figure: block diagram of the ear hardware and hearing brain-ware. *Color code for what we think happens.)
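A minimal sketch of this idea, using a synthetic voiced frame and assumed values (pitch, frame length, sample rate): the FFT below plays the role of the Fourier-series analysis attributed to the cochlea and its driver, returning one weight per harmonic of f1.

```python
# Illustrative sketch only: estimate the harmonic "weights" of a short voiced
# frame with an FFT, standing in for the Fourier-series analysis that the
# cochlea and its driver are described as performing. All values are assumed.
import numpy as np

fs = 24000                                  # sample rate, Hz (assumed)
f1 = 150                                    # fundamental pitch, Hz (assumed)
t = np.arange(0, 0.04, 1 / fs)              # one 40-ms frame

# Crude harmonic-rich source: harmonics of f1 with decaying amplitude.
frame = sum((1.0 / k) * np.sin(2 * np.pi * k * f1 * t) for k in range(1, 20))

spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
freqs = np.fft.rfftfreq(len(frame), 1 / fs)

# Weight of harmonic k = spectral magnitude at the FFT bin nearest k * f1.
weights = {k: spectrum[np.argmin(np.abs(freqs - k * f1))] for k in range(1, 20)}
print({k: round(float(w), 1) for k, w in list(weights.items())[:5]})
```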
• Behind this driver, mid-level brainware does more processing:
1. Calculates acoustic directionality,
2. Selects the desired signal out of background noise,
3. Performs phoneme discrimination (independent of the speaker),
4. Identifies who the speaker is (independent of the phoneme).
• Last 3 tasks are supported by
– high-level syntactic & semantic processing which,
– at even higher levels of brainware,
– depend on content, context, background, and emotional state.
• This paper deals only with low- and mid-level brainware,
– Which performs the last two tasks on the list above.
2. Human Capacity for Aural Quality
Review of Aural Processing
• Mid-level brainware identifies speakers
– by comparing the set of weights, received from the driver,
– against a speaker database.
• Our accuracy at finding a best match is a
– nonlinear function of how many weights the
speaker-identifier process receives from the driver.
– This number of coefficients depends on how much
acoustic spectrum is heard by the cochlea & its driver.
• We discriminate phonemes more indirectly.
– The spectral envelope of most phonemes has
four relative maxima, called “formant frequencies,” F1 to F4.
Formant Frequencies
• F1 and F2 peaks for ee and aa can be seen
– in the frequency domain.
• Generalized time-domain diagrams:
– of F1 and F2 for 21 phoneme-pairs,
– each a dynamic consonant that elides into a vowel.
(Figures: frequency-domain spectra of ee and aa marking f1, F1, & F2; time-domain F1/F2 trajectories for the 21 phoneme-pairs.)
Formants for Vowels
• Spectral position of these formants, especially F1 and F2,
– is the most important cue in phoneme discrimination.
– But, it’s complex because formant positions are speaker dependent.
• Each point is an [F1, F2] value for
– 76 speakers of 10 sustained phonemes.
– Clusters show the intended phoneme.
– Proximity of clusters → potential errors without additional spectrum.
• E.g., the upper-left cluster → ee,
– with a low F1 & a high F2.
– Without enough spectrum, there is a high probability that ee is interpreted as short-i.
Phoneme Discrimination
• We discriminate phonemes in mid-level brainware by:
1. Computing formants from the weights received from the driver,
2. Comparing the Fs against a database (which works like the speaker database).
• Our accuracy at finding the best match is a
– nonlinear function of how many formants the phoneme-discriminator has available.
– This number of formants depends on how much of the acoustic spectrum is heard by the cochlea & driver.
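A sketch of this matching step under stated assumptions: the reference formant values and the nearest-neighbor lookup below are illustrative stand-ins for the brain’s database, and the point is only that the match uses however many formants the channel delivers.

```python
# Illustrative matcher: compare measured formants against a small reference
# table (assumed values) and pick the nearest entry, using as many formants as
# the channel delivers.
import math

VOWEL_TABLE = {                 # assumed [F1, F2, F3] values, Hz
    "ee": (270, 2290, 3010),
    "short-i": (390, 1990, 2550),
    "aa": (730, 1090, 2440),
}

def discriminate(measured, n_formants=2):
    """Nearest-neighbor match on the first n_formants formant frequencies."""
    def dist(ref):
        return math.dist(measured[:n_formants], ref[:n_formants])
    return min(VOWEL_TABLE, key=lambda v: dist(VOWEL_TABLE[v]))

print(discriminate((300, 2200, 2900), n_formants=2))  # F1/F2 only (4-kHz channel) -> "ee"
print(discriminate((300, 2200, 2900), n_formants=3))  # a 3rd formant makes the match more robust
```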
• We have a mirrored set of multilevel processes
– in the speaker’s brainware also.
– These communicating processes translate thoughts into language,
– then into the sequence of neural signals that control our mouth parts.
3. Technology’s Impact
on Quality
• After listing components of aural quality,
– we review successive technologies and how they
– raised some aspects of audio quality and lowered others.
• After discussing their effect on
– speaker identification and phoneme discrimination,
• We review the history of the complaint that technology
– should never lower any aspect of application quality.
• Section Outline
1. Aural Quality and its Impairments
2. Identifying Phonemes and Speakers
3. The History of the Complaint
3. Technology’s Impact on Quality
Aural Quality & its
Impairments
• Quality of a natural acoustic signal is measured by its:
– Intensity (loudness),
– Purity (nothing else added),
– Immediacy (un-delayed),
– Clarity (undistorted), &
– Fidelity (defined below).
• By definition, Fidelity measures an audio signal’s
– faithfulness to its acoustic analog.
– We’ll defer to the lay definition, which implies high bandwidth.
Natural Impairments
• Natural acoustic signals suffer 5 impairments:
– loss,
– noise,
– crosstalk,
– delay, &
– echo.
(Figure: a transmitted acoustic signal, with given intensity, purity, clarity, & fidelity, crosses the natural medium, where it is impaired by natural loss, noise, crosstalk, delay, & echo; the received acoustic signal has poorer intensity, purity, immediacy, clarity, & fidelity. This figure will grow downward on the following slides.)
Pros & Cons of
Analog Networks
• The role of any network is to eliminate natural loss.
– It usually replaces a large acoustic delay by a small signal delay,
– and may also reduce crosstalk & echo.
• Analog networks add crosstalk from the loop pair
– and echo from impedance mismatch and leaky hybrids.
• And, they add new impairments, not seen in natural signals:
– Amplitude distortion from amplifiers that clip,
– Band-restriction & frequency distortion from wire reactance,
– Delay distortion because frequency components have different velocities.
(Figure: the signal chain grows: the transmitted acoustic signal crosses the natural medium, then a transducer, the analog network, and a transducer to the listener. The analog network deletes natural loss & delay and reduces natural noise, crosstalk, & echo, but the signal is further impaired by analog loss, noise, crosstalk, delay, echo, band-loss, & 3 distortions.)
Fidelity in
Analog Networks
• 500-sets
– Cut f1 off at the low end,
– Had 12 kHz of bandpass.
– (Modern phones have no reason to provide that much BW.)
(Figure: telephone, local loop, switching offices, trunks, & long-distance network in tandem; BW = 12 kHz at the 500-set, 8-10 kHz over the local loops, 4-6 kHz end-to-end on long distance.)
• If phones are connected in a local call,
– the loops limit end-to-end bandpass to 8-10 kHz, depending on loop length.
• In long-distance calls,
– the network further limits bandpass to 4-6 kHz, depending on distance.
• 4-kHz analog LD channel had poorest fidelity, but…
– Bell System “spun” the term “toll grade” to imply high quality.
• Note: upper limit of all BWs is given as “3-dB frequency;”
– There is significant audio power outside these formal limits.
Subsequent Analog
improvements
• Analog technology advancements in:
– Channels (fiber),
– Amplifiers,
– Echo cancellers,
– Shielding, &
– Noise filters;
• Improved:
– loss,
– noise,
– crosstalk,
– delay,
– echo, &
– amplitude distortion.
• But, not band-restriction,
– nor the other two forms of distortion.
• Biggest improvement comes from going digital.
Pros & Cons of Digital Networks
• Digitizing an audio signal greatly improves intensity.
• And, a digital PSTN is virtually noise-free.
– Even loop noise (assume the ADC is in the CO) is partially blocked
on the speaker side by the ADC anti-alias filter.
• But, new noise is added by:
– quantizing, companding, mu-to-A conversion, & bit errors.
• And, echo is worse because digital transport is 4-wire,
– which requires many more hybrids (which can leak) in the network.
(Figure: the chain grows again; a digital network, embedded inside the analog network, reduces analog loss, echo, noise, crosstalk, & the 3 distortions, but the signal is further impaired by quantization noise, bit-error noise, & added band-loss from the anti-aliasing filter.)
Fidelity in Digital Networks
• By far, the worst impairment is that
– anti-aliasing filters in the A-to-D converters impair fidelity,
– So, all calls are nominally as band-limited as LD analog calls
• Fidelity is even perceptibly lower than “nominal”
– because blocking all audio above 4 kHz
– requires a half-power point at 3.7 kHz &
– high-end drop-off that is much steeper than in analog networks.
• So, digital calls have better SNR than analog calls;
– But a local digital call has perceptibly lower fidelity
than even a long-distance analog call.
• For example:
(Figure: frequency response from 0 to 4 kHz; the digital channel’s high-end roll-off is much steeper than the analog channel’s.)
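A small numeric sketch of why that steep filter is required at all (the 5-kHz test tone is an assumption for illustration): with 8-kHz sampling, any energy above the 4-kHz Nyquist limit folds back into the passband.

```python
# Why the anti-aliasing filter is needed: with 8-kHz sampling, any energy
# above 4 kHz folds back into the passband. A 5-kHz component, left
# unfiltered, reappears as a spurious 3-kHz tone.
import numpy as np

fs = 8000                       # DS0 sample rate, Hz
f_in = 5000                     # component above the Nyquist limit (fs / 2)
n = np.arange(1024)
samples = np.sin(2 * np.pi * f_in * n / fs)

spectrum = np.abs(np.fft.rfft(samples))
freqs = np.fft.rfftfreq(len(samples), 1 / fs)
print("apparent frequency:", freqs[np.argmax(spectrum)], "Hz")   # ~3000 Hz
```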
VoIP’s Cons
• VoIP further impairs digital audio quality.
• Audio purity is further impaired because:
– speech compression → exaggerated bit errors,
– noticeable clunks from lost packets (packet loss: 0, 1%, 5%),
– silence-detecting codecs’ slow-start may clip a leading plosive.
(Figure: the chain grows once more; a packet network, embedded inside the digital network, retains all digital-network impairments, exacerbates bit-error noise, and adds more delay, which can increase echo. The received acoustic signal has poorer intensity, purity, immediacy, clarity, & fidelity than the transmitted signal.)
VoIP & Delay
• Immediacy is greatly impaired by delays caused by:
– packetization, jitter buffers, router processing, & multi-hop packet re-transmission.
• VoIP calls often exceed
– user acceptance of conversation interaction delay.
• User opinions below are my “compromise”
– between Bell System standards & IETF standards
Round-Trip Delay   Opinion
< 150 ms           good
150-300 ms         noticeable
300-450 ms         annoying
> 450 ms           unacceptable
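A minimal helper reflecting the compromise thresholds in the table above (the function name is just illustrative):

```python
# Map a measured round-trip delay to the user opinion in the table above.
def delay_opinion(rtt_ms: float) -> str:
    if rtt_ms < 150:
        return "good"
    if rtt_ms < 300:
        return "noticeable"
    if rtt_ms < 450:
        return "annoying"
    return "unacceptable"

print(delay_opinion(220))   # -> "noticeable"
```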
VoIP & Echo
• Acoustic echo
– Is eliminated by wearing a head-set.
• Electrical echo is much worse because:
1. VoIP-PSTN gateways more problematic than D-to-A gateways
because echo canceller is far from the echo source (hybrid)
2. User sensitivity to echo depends on individual,
echo-to-signal ratio (TELR), & one-way delay.
• Since a digital conversation’s TELR ≈ 55 dB,
– one-way delay must be < 200-500 ms; but it’s often > 200 ms.
– Large delay reduces the effectiveness of electronic echo cancellers.
• Summarizing, VoIP-to-POTS &, esp, VoIP-to-cell calls
– are often characterized by annoying echo.
Summarizing…
• Digitizing speech →
– Improves intensity & purity;
– But, noticeably degrades fidelity.
– Overall, digital is perceived as “better than” analog;
– But, it could be much better.
• VoIP makes no positive contribution;
– VoIP only lowers the quality.
– The last section proposes how we might change this.
3. Technology’s Impact on Quality
Identifying Phonemes and Speakers
• “Telephone voice” impairs our ability to
– hear what a speaker says & identify who the speaker is.
• 4-kHz DS0 channel has enough BW for F1 & F2 →
– little difficulty identifying vowels, ll, and rr.
• Hearing the 3rd and 4th formants would:
– Slightly improve discrimination of these sounds,
– Greatly improve discrimination of fricatives & plosives.
• A low F3 passes over a DS0 channel;
– But a high F3 will not,
and F4 will not.
++Bandwidth → ++Phoneme Discrimination
• We need a 7-kHz channel to receive all four formants,
• & >7 kHz for sounds we typically struggle with:
– nasals (distinguishing mm and nn),
– plosives (distinguishing k and t),
– fricatives (distinguishing ss and ff).
• Experiment: ff was spoken to many listeners over 3 channels:

                 Identified as:
Chan BW          ff     th    p     other
200-5000 Hz      194    35    6     9
200-2500 Hz      186    31    6     13
1000-5000 Hz     162    28    12    50
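Working the counts above into rates makes the point more directly; the numbers are copied straight from the table:

```python
# Fraction of ff presentations identified correctly on each channel
# (counts copied from the table above).
results = {                              # channel: (ff, th, p, other)
    "200-5000 Hz": (194, 35, 6, 9),
    "200-2500 Hz": (186, 31, 6, 13),
    "1000-5000 Hz": (162, 28, 12, 50),
}
for chan, counts in results.items():
    print(f"{chan}: {counts[0] / sum(counts):.0%} heard as ff")
```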
++Bandwidth → ++Speaker Identification
• We identify speakers directly by their Fourier weights,
– Not their formant frequencies.
– Success is based on the amount of data: # weights received.
• Consider three population groups:
• Consistent with most people’s experience on the phone:
– Men are easily recognized, women less easily,
– & we see why “all children sound the same on the phone.”
• A child could be recognized over a 12-kHz channel
– as well as an average male is over a 4-kHz channel.
– At 12kHz, any woman would be as identifiable as any man at 4kHz,
– and men could be almost perfectly identified.
Type       f1-range      #Harmonics < 3.7 kHz   Rank
Men        75-150 Hz     25-50                  most
Women      140-300 Hz    12-26                  middle
Children   275-350 Hz    10-13                  least

Section 4 discusses how audio quality is impacted by “integrated networks.”
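A quick arithmetic check of the “#Harmonics < 3.7 kHz” column in the table above, counting how many multiples of f1 fit under a DS0 channel’s 3.7-kHz half-power point (the women’s and children’s ranges match the table exactly; the men’s differs by one from the table’s rounding):

```python
# Count how many harmonics of the fundamental f1 fit under the 3.7-kHz
# half-power point of a DS0 channel, for the three f1 ranges above.
def harmonics_below(f1_hz: float, cutoff_hz: float = 3700.0) -> int:
    return int(cutoff_hz // f1_hz)

F1_RANGES = {"Men": (75, 150), "Women": (140, 300), "Children": (275, 350)}
for group, (lo, hi) in F1_RANGES.items():
    print(f"{group}: {harmonics_below(hi)} to {harmonics_below(lo)} harmonics below 3.7 kHz")
```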
The History of the Complaint
• When T1 was proposed in the 1960s,
– Amos Joel objected to its 8-kHz sample rate.
• T1’s advocates stifled him by saying he was
– a dinosaur who objected to digital voice (he did not).
• Now, some VoIP advocates
– use this tactic to stifle their critics.
• 8-kHz sampling was standardized
– when bandwidth was expensive;
• Now that it isn’t,
– we’re still stuck with the DS0 channel …
– or are we?
4. Network Integration
and App-Quality
• Review historical attempts at integrating networks,
– Generalize how integration naturally lowers app quality
– Ask why we have refused to learn this lesson.
• Section Outline
1. History of Integrated Networks
2. Why Integration Lowers App Quality
3. Why are we Blind to this Lesson?
4. Network Integration and App-Quality
History of
Integrated Networks
• More than 35 years ago, ISDN …
– was proposed as a global end-to-end network for all data types.
– Today, it’s relegated to the network edge, as an access standard.
• ISDN’s post mortem shows two reasons it failed:
1. ISDN needed a global digital network,
• an inexpensive users’ appliance/terminal,
• and a collection of integrated services – simultaneously.
• AT&T could have done it, but focused on surviving (it didn’t).
2. We learned that the application matters.
• Ethernet’s stat-muxing was more efficient for bursty data,
• especially key-strokes on a LAN, than ISDN circuit switching.
• And, efficiency trumped integration.
The 2nd attempt
• More than 20 years ago, ATM …
– was proposed as a global end-to-end network for all data types.
– Cell relay & virtual circuits → avoid congestion from large packets,
– Limited success in “core,” where congestion is significant,
• Failed to achieve its main goal, again for two reasons:
1. ATM’s success required that it also be cost-effective as a LAN.
• But, Ethernet prevailed because of the embedded base of interface cards,
• LAN-manager familiarity, & evolution to higher rates.
2. We saw again that application matters.
• ATM was compared to a duck:
“Ducks can swim, fly, and walk, but none well.
ATM carries voice, data, and video, but none well.”
The 3rd attempt
• Now, the Internet is proposed
– as a global end-to-end network
– to carry all data types.
• ISDN and ATM each failed
– in part because application matters.
• What is different now?
4. Network Integration and App-Quality
Why Integration
Lowers App Quality
• Let’s examine an economic explanation.
– Box represents the cost of a basic un-optimized network ($ basic network).
• Consider four cases defined by:

Networks:          Separated   Integrated
Low app-quality    1           2
High app-quality   3           4
Implementations with
Low Quality
1. Separated & low - 2 apps, voice & data, with equal load
– Boxes ($ basic voice network, $ basic data network) represent the cost of two separate networks,
• each dedicated to one app.
– App quality is barely acceptable because
• neither network has been optimized for its app’s quality.
2. Integrated & low - 2 apps over an un-optimized integrated network.
– This box ($ basic integrated network) has area >> the reference square
• because the integrated network supports twice as much load.
• But, its area is less than the sum of the areas of the 2 squares
• because of economy-of-scale & a reduced staff of network managers.
– Since apps may interact in the integrated network,
• each app’s quality is worse than in 2 separate networks.
• This is the classic “duck”.
Implementations with
High Quality
3. Separated & high – Increase the quality of both apps
– by optimizing each network in Case 1 → raises the cost of each.
– Squares → rectangles stretched on different dimensions,
• because we optimize each network differently for its respective app
($ good voice network, $ good data network).
4. Integrated & high - Improve the apps’ quality in the integrated network
– Perform the same optimizations as on the separate networks.
– So, the “duck” is elongated, but in both dimensions:
• a significantly larger square than in Case 2
• ($ integrated network that is good for voice & data),
• a “SWAN” (Superior-service-With-All-apps Network).
Integration vs App-Quality
• If we don’t care about app-quality,
– Case 2 beats Case 1:
– the integrated network is slightly more economical.
• If we do care about quality, Case 3 vs Case 4?
– It is unclear how the area of the SWAN compares against
– the sum of the areas of the 2 separate rectangles.
• Does the cost of optimizing an integrated network,
– so its apps have good quality,
– cancel the small savings provided by the integration?
• If not, wouldn’t IP-based voice carriers
– like Qwest long-distance, Skype, and Vonage
– have dominated the telephone industry by now?
(Figure: the cost boxes from the four cases, side by side: “$ basic voice network” + “$ basic data network”; “$ basic integrated network”; “$ good voice network” + “$ good data network”; and the “$ integrated network that is good for voice & data”.)
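A toy arithmetic version of this cost question; every number below is an assumption invented for illustration, and only the shape of the comparison comes from the slides.

```python
# Toy arithmetic for the cost argument above. All numbers are made-up
# assumptions; only the shape of the comparison comes from the talk.
basic = 1.0                    # cost of one basic single-app network (reference square)
scale_saving = 0.15            # assumed economy-of-scale saving for an integrated network
optimize_one = 0.40            # assumed extra cost to optimize one network for one app
optimize_both = 1.10           # assumed extra cost to optimize an integrated network for both apps

case1 = 2 * basic                                         # separated, low quality
case2 = 2 * basic * (1 - scale_saving)                    # integrated, low quality
case3 = 2 * (basic + optimize_one)                        # separated, high quality
case4 = 2 * basic * (1 - scale_saving) + optimize_both    # integrated, high quality ("SWAN")

print(f"case 1 = {case1:.2f}, case 2 = {case2:.2f}  (integration wins if quality is ignored)")
print(f"case 3 = {case3:.2f}, case 4 = {case4:.2f}  (essentially a wash under these assumptions)")
```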
Why are we Blind
to this Lesson?
• Prior analysis is admittedly weak,
– But it’s not fundamentally flawed.
• Seems clear from analysis & history lesson
– that network integration is a bad idea;
– assuming we don’t want to further degrade app-quality.
• Alchemists, half a millennium ago,
– had a goal that is at least easy to appreciate.
• Our determination to continue trying
– to integrate networks is admirable,
but puzzling.
5. What Can We Do?
• Ranting about how bad things are
– has become an all-too-familiar form of discourse.
– We want to do more than rant, & make a positive contribution.
• This section makes the transition from
– how-bad-it-is to how-good-it-could-be by discussing
– the market potential and proposing a solution.
• Section Outline
1. Market Potential for High-Quality Apps
2. High-fidelity Voice-over-IP
5. What Can We Do?
Market Potential for High-Quality Apps
• Is there a significant market niche that cares about voice quality?
– If there is a market, it’s among people who
• appreciate music that sounds better over a high-fi channel &
• are annoyed by, or have difficulty with, cell-phone audio quality
– This group is older, and growing rapidly as
• the surge of baby boomers become older … and deafer.
• Decreasing ear-bandwidth reinforces adequacy of 12-kHz channel.
• This is not an accurate marketing study. But, it seems likely that,
– if the market isn’t yet large enough to justify products,
– it could become large enough in just a few more years.
5. What Can We Do?
High-fidelity
Voice-over-IP
(Figure: proposed speaker-end chain: analog signal → 12-kHz AAF → A-to-D converter sampling at 24 kHz → packetizer.)
• VoIP presents the opportunity to raise voice quality,
– not just to toll-grade, but even beyond.
1. A 12-kHz channel would virtually eliminate “telephone voice” &
– improve phoneme discrimination & speaker identification.
– Channel bandwidth = 3x the DS0’s equivalent bandwidth →
• 3x the G.711 codec’s anti-aliasing-filter BW & the ADC’s sample rate,
• so it’s easily separated for a G.711 at the listener-end.
(Note: The paper is incorrect.)
– Must packetize the digital stream at the speaker-end.
– Should be easily downward compatible with G.711,
– and made to work with speech-compressing codecs.
• While this proposal needs to be built & tested,
– two others have been implemented and tested at Pitt →
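A minimal sketch of the claimed downward compatibility, under assumptions (test signal, filter choice): because the proposed 24-kHz sample rate is exactly 3x the DS0’s 8 kHz, a legacy G.711 endpoint could recover a narrowband stream by re-filtering to about 4 kHz and keeping every third sample.

```python
# Sketch of the claimed downward compatibility (assumptions for illustration,
# not the talk's design): the wideband stream is sampled at 24 kHz, exactly 3x
# the DS0 rate, so an 8-kHz stream for a legacy G.711 endpoint can be recovered
# by low-pass filtering to ~4 kHz and keeping every third sample.
import numpy as np
from scipy.signal import decimate

fs_wide = 24000
t = np.arange(0, 0.1, 1 / fs_wide)
wideband = np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 6000 * t)

narrowband = decimate(wideband, 3)      # anti-alias filter + keep every 3rd sample
print(len(wideband), "->", len(narrowband), "samples (24 kHz -> 8 kHz)")
```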
Minimizing Delay
2. VoIP delay,
– & echo’s dependence on delay,
– can be reduced by optimal packetization.
• When a network is lightly loaded,
– packetization delay is reduced by generating small packets
– Often – perhaps every 10 ms.
• When a network is heavily loaded,
– network queue delays are reduced by generating large packets
– less often – perhaps every 30 ms.
• We have demonstrated this
– & necessary signaling has been implemented in RTCP.
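A minimal sketch of that packetization rule; the load metric and the thresholds are assumptions for illustration (the slides only give the 10-ms and 30-ms endpoints).

```python
# Sketch of the packetization rule described above (thresholds and the
# congestion metric are assumptions): frame size shrinks to ~10 ms on a
# lightly loaded network and grows to ~30 ms under heavy load.
def packet_interval_ms(estimated_load: float) -> int:
    """estimated_load: 0.0 (idle) .. 1.0 (saturated), e.g. from RTCP reports."""
    if estimated_load < 0.3:
        return 10      # small packets, sent often -> low packetization delay
    if estimated_load < 0.7:
        return 20      # middle ground
    return 30          # large packets, sent less often -> fewer queueing delays

for load in (0.1, 0.5, 0.9):
    print(f"load {load:.0%}: send a packet every {packet_interval_ms(load)} ms")
```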
Maximizing
Quality
3. Overall audio quality,
– as defined by the ITU, is a complicated function of
• codec type, end-to-end delay, fidelity, etc.
• If an IP-phone has multiple codec-types,
– We can optimize overall audio quality
• by changing codec-type mid-stream,
• depending on network congestion.
– Control signaling can also use VoIP’s RTCP.
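A hedged sketch of such mid-call codec selection; the codec table, scores, and penalty shapes below are illustrative stand-ins, not ITU figures or the talk’s design.

```python
# Hedged sketch of mid-call codec selection (codec table and scoring are
# illustrative): pick the codec whose rough quality score is best given the
# delay and loss currently reported over RTCP.
CODECS = {                 # name: (assumed base quality score, bitrate kb/s)
    "G.711": (4.3, 64),
    "G.729": (3.9, 8),
    "iLBC":  (3.8, 15),
}

def score(codec, one_way_delay_ms, loss_fraction):
    base, _ = CODECS[codec]
    delay_penalty = max(0.0, (one_way_delay_ms - 150) / 300)   # crude, assumed shape
    loss_penalty = 10 * loss_fraction
    return base - delay_penalty - loss_penalty

def best_codec(delay_ms, loss, available_kbps):
    usable = [c for c, (_, kbps) in CODECS.items() if kbps <= available_kbps]
    return max(usable, key=lambda c: score(c, delay_ms, loss))

print(best_codec(delay_ms=120, loss=0.00, available_kbps=100))  # roomy link -> G.711
print(best_codec(delay_ms=250, loss=0.02, available_kbps=20))   # congested -> low-rate codec
```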
• At Pitt, we are building …
– a prototype system we call Ernestine,
– in which such techniques will be built & tested.
6. Conclusion
• Technology has improved net audio quality
– over the last 100 years.
– But, some aspects of audio quality,
especially fidelity, have devolved;
– and this devolution has an ironic solution.
• VoIP’s poor audio quality is not inherent to VoIP;
– But, is a function of design choices,
– some of which date back to the 1960s.
• Surprisingly, VoIP gives us the opportunity
– to provide excellent audio quality,
– if the design changes proposed here are implemented.