Introduction to Speech Signal Processing
Dr. Zhang Sen, [email protected]
Chinese Academy of Sciences, Beijing, China
2017/5/25

Outline
• Introduction
  – Sampling and quantization
  – Speech coding
• Features and Analysis
  – Main features
  – Some transformations
• Speech-to-Text
  – State of the art
  – Main approaches
• Text-to-Speech
  – State of the art
  – Main approaches
• Applications
  – Human-machine dialogue systems

Some useful websites for ASR tools
• http://htk.eng.cam.ac.uk
  – Free, available since 2000, relation with Microsoft
  – Over 12,000 users; versions 2.1, 3.0, 3.1, 3.2
  – Includes source code and the HTK books
  – A set of tools for training, decoding, and evaluation
  – Steve Young, Cambridge University
• http://www.cs.cmu.edu
  – Free for research and education
  – Sphinx 2 and 3
  – Tools, source code, speech databases
  – Raj Reddy, CMU

Research on speech recognition in the world

• Carnegie Mellon University
  – CMU SCS Speech Group
  – Interact Lab
• Oregon Graduate Institute
  – Center for Spoken Language Understanding
• MIT
  – Lab for Computer Science, Spoken Language Systems
  – Acoustics & Vibration Lab
  – AI Lab
  – Lincoln Lab, Speech Systems Technology Group
• Stanford University
  – Center for Computer Research in Music and Acoustics
  – Center for the Study of Language and Information
• University of California
  – Berkeley, Santa Cruz, Los Angeles
• Boston University
  – Signal Processing and Interpretation Lab
• Georgia Institute of Technology
  – Digital Signal Processing Lab
• Johns Hopkins University
  – Center for Language and Speech Processing
• Brown University
  – Lab for Engineering Man-Machine Systems
• Mississippi State University
• Colorado University
• Cornell University

• Cambridge University
  – Speech, Vision and Robotics Group
• Edinburgh University
  – Human Communication Research Centre
  – Centre for Speech Technology Research
• University College London
  – Phonetics and Linguistics
• University of Essex
  – Dept. of Language and Linguistics

• LIMSI, France
• INRIA
  – Institut National de Recherche en Informatique et en Automatique
• University of Karlsruhe, Germany
  – Interactive Systems Lab
• DFKI
  – German Research Center for Artificial Intelligence
• KTH, Speech Communication & Music Acoustics
• CSELT, Italy
  – Centro Studi e Laboratori Telecomunicazioni, Torino
• IRST
  – Istituto per la Ricerca Scientifica e Tecnologica, Trento
• ATR, Japan

• AT&T, Advanced Speech Products Group
• Lucent Technologies, Bell Laboratories
• IBM, IBM VoiceType
• Texas Instruments Incorporated
• National Institute of Standards and Technology
• Apple Computer Co.
• Digital Equipment Corporation (DEC)
• SRI International
• Dragon Systems Co.
• Sun Microsystems Labs, speech applications
• Microsoft Corporation, speech technology (SAPI)
• Entropic Research Laboratory, Inc.

Important conferences and journals
• IEEE Trans. on ASSP
• ICASSP (every year)
• EUROSPEECH (every odd year)
• ICSLP (every even year)
• STAR: Speech Technology and Research at SRI

Brief history and state-of-the-art of the research on speech recognition

ASR Progress Overview
• 50's: isolated digit recognition (Bell Labs)
• 60's:
  – Hardware speech segmenter (Japan)
  – Dynamic programming (U.S.S.R.)
• 70's:
  – Clustering algorithms (speaker independence)
  – DTW
• 80's: HMM, DARPA, SPHINX
• 90's: adaptation, robustness

1952 Bell Labs Digits
• First word (digit) recognizer
• Approximates energy in formants (vocal tract resonances) over the word
• Already had some robust ideas (insensitive to amplitude and timing variation)
• Worked very well
• Main weakness was technological (resistors and capacitors)

The 60's
• Better digit recognition
• Breakthroughs: spectrum estimation (FFT, cepstra, LPC), Dynamic Time Warping (DTW), and Hidden Markov Model (HMM) theory
• Hardware speech segmenter (Japan)

1971-76 ARPA Project
• Focus on speech understanding
• Main work at 3 sites: System Development Corporation, CMU and BBN
• Other work at Lincoln, SRI, Berkeley
• Goal was 1000-word ASR, a few speakers, connected speech, constrained grammar, less than 10% semantic error

Results
• Only CMU's Harpy fulfilled the goals: used LPC, segments, lots of high-level knowledge, learned from Dragon* (Baker)
  * The CMU system done in the early '70s, as opposed to the company formed in the '80s

Achieved by 1976
• Spectral and cepstral features, LPC
• Some work with phonetic features
• Incorporating syntax and semantics
• Initial neural network approaches
• DTW-based systems (many)
• HMM-based systems (Dragon, IBM)

Dynamic Time Warping
• Optimal time normalization with dynamic programming
• Proposed by Sakoe and Chiba, circa 1970
• Similar proposal around the same time by Itakura
• Probably Vintsyuk was first (1968)

HMMs for Speech
• Math from Baum and others, 1966-1972
• Applied to speech by Baker in the original CMU Dragon system (1974)
• Developed by IBM (Baker, Jelinek, Bahl, Mercer, ...) (1970-1993)
• Extended by others in the mid-1980's

The 1980's
• Collection of large standard corpora
• Front ends: auditory models, dynamics
• Engineering: scaling to large-vocabulary continuous speech
• Second major (D)ARPA ASR project
• HMMs become ready for prime time

Standard Corpora Collection
• Before 1984, chaos
• TIMIT
• RM (later WSJ)
• ATIS
• NIST, ARPA, LDC

Front Ends in the 1980's
• Mel cepstrum (Bridle, Mermelstein)
• PLP (Hermansky)
• Delta cepstrum (Furui)
• Auditory models (Seneff, Ghitza, others)

Dynamic Speech Features
• Temporal dynamics are useful for ASR
• Local time derivatives of cepstra
• "Delta" features estimated over multiple frames (typically 5)
• Usually augment the static features
• Can be viewed as a temporal filter (a small numerical sketch follows)
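To make the delta computation concrete, here is a minimal NumPy sketch of the standard regression formula over a 5-frame window; the function and variable names are illustrative and not taken from any particular toolkit:

    import numpy as np

    def delta_features(static, N=2):
        # Delta (first-derivative) features over a window of 2N+1 frames
        # (N=2 gives the 5-frame window mentioned above).
        # static: (T, D) array of cepstral vectors; returns a (T, D) array.
        T, D = static.shape
        padded = np.pad(static, ((N, N), (0, 0)), mode="edge")  # repeat edge frames
        denom = 2.0 * sum(n * n for n in range(1, N + 1))
        deltas = np.zeros_like(static, dtype=float)
        for t in range(T):
            for n in range(1, N + 1):
                deltas[t] += n * (padded[t + N + n] - padded[t + N - n])
            deltas[t] /= denom
        return deltas

    # Typical use: append the deltas to the static features.
    # feats = np.hstack([cepstra, delta_features(cepstra)])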
HMMs for Continuous Speech
• Using dynamic programming for continuous speech (Vintsyuk, Bridle, Sakoe, Ney, ...)
• Application of Baker-Jelinek ideas to continuous speech (IBM, BBN, Philips, ...)
• Multiple groups developing major HMM systems (CMU, SRI, Lincoln, BBN, AT&T)
• Engineering development: coping with data, fast computers

2nd (D)ARPA Project
• Common task
• Frequent evaluations
• Convergence to good, but similar, systems
• Lots of engineering development: now up to 60,000-word recognition, in real time, on a workstation, with less than 10% word error
• Competition inspired others not in the project: Cambridge developed HTK, now widely distributed

Some 1990's Issues
• Independence from the long-term spectrum
• Adaptation
• Effects of spontaneous speech
• Information retrieval/extraction with broadcast material
• Query-style systems (e.g., ATIS)
• Applying ASR technology to related areas (language ID, speaker verification)

Real Uses
• Telephone: phone company services (collect versus credit card)
• Telephone: call centers for query information (e.g., stock quotes, parcel tracking)
• Dictation products: continuous recognition, speaker dependent/adaptive

State-of-the-art of ASR
• Tremendous technical advances in the last few years
• From small to large vocabularies
  – 5,000-10,000 word vocabulary
  – 10,000-60,000 word vocabulary
• From isolated words to spontaneous speech
  – Continuous speech recognition
  – Conversational and spontaneous speech recognition
• From speaker-dependent to speaker-independent
  – Modern ASR is fully speaker independent

SOTA ASR Systems
• IBM, ViaVoice
  – Speaker-independent, continuous command recognition
  – Large vocabulary recognition
  – Text-to-speech confirmation
  – Barge-in (the ability to interrupt an audio prompt as it is playing)
• Microsoft, Whisper, Dr. Who

SOTA ASR Systems
• DARPA, 1982, goals:
  – High accuracy
  – Real-time performance
  – Understanding capability
  – Continuous speech recognition
• DARPA databases:
  – 997 words (RM)
  – Above 100 speakers
  – TIMIT

SOTA ASR Systems
• SPHINX II
  – CMU
  – HMM-based speech recognition
  – Bigram, word pair
  – Generalized triphones
  – DARPA database
  – 97% recognition (perplexity 20)
• SPHINX III
  – CHMM based
  – WER about 15% on WSJ

ASR Advances
[Chart: how far ASR has come along four dimensions, roughly 1985 to 2005]
• Noise environment: quiet room with a fixed high-quality microphone → normal office, various microphones, telephone → vehicle noise, radio, cell phones → wherever speech occurs
• User population: speaker-dependent → speaker independent and adaptive → native speakers → competent foreign speakers → all speakers of the language, including foreign and regional accents
• Speech style: careful reading → planned speech → natural human-machine dialog (user can adapt) → all styles, including human-human (unaware)
• Complexity: application-specific speech and expert years to create an application-specific language model → some application-specific data and one engineer year → application independent or adaptive

But
• Still <97% accurate on "yes" for telephone
• Unexpected rate of speech causes doubling or tripling of the error rate
• Unexpected accent hurts badly
• Accuracy on unrestricted speech at 60%
• Don't know when we know
• Few advances in basic understanding

How to Measure the Performance?
• What benchmarks?
  – DARPA
  – NIST (hub-4, hub-5, ...)
• What was the training set? What was the test set? Were they independent?
• The vocabulary and the sample size?
• Was the noise added or coincident with speech? What kind of noise?
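Performance in these benchmark evaluations is reported as word error rate (WER), the topic of the next slide. Here is a minimal sketch of WER computed as a word-level edit distance with dynamic programming; the example transcripts are invented for illustration:

    def word_error_rate(reference, hypothesis):
        # WER = (substitutions + deletions + insertions) / reference length.
        # reference, hypothesis: lists of words.
        R, H = len(reference), len(hypothesis)
        # d[i][j] = edit distance between reference[:i] and hypothesis[:j]
        d = [[0] * (H + 1) for _ in range(R + 1)]
        for i in range(R + 1):
            d[i][0] = i
        for j in range(H + 1):
            d[0][j] = j
        for i in range(1, R + 1):
            for j in range(1, H + 1):
                cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution or match
        return d[R][H] / R

    # One substitution over four reference words -> 25% WER
    print(word_error_rate("how are you today".split(), "how were you today".split()))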
ASR Performance
[Chart: word error rate versus level of difficulty, rising from near 0% for letters and numbers, command and control, and digits, through continuous digits and read speech, to roughly 20% for broadcast news and 40% for conversational speech]
• Spontaneous telephone speech is still a "grand challenge".
• Telephone-quality speech is still central to the problem.
• Broadcast news is a very dynamic domain.

Machine vs Human Performance
[Chart: word error rate on Wall Street Journal with additive noise at 10 dB, 16 dB, 22 dB SNR and in quiet; machines roughly 10-20%, human listeners (committee) near 0-5%]
• Human performance exceeds machine performance by a factor ranging from 4x to 10x, depending on the task.
• On some tasks, such as credit card number recognition, machine performance exceeds humans because of limits on human memory retrieval.
• The nature of the noise is as important as the SNR (e.g., cellular phones).
• A primary failure mode for humans is inattention.
• A second major failure mode is lack of familiarity with the domain (i.e., business terms and corporation names).

Core technology for ASR

Why is ASR Hard?
• Natural speech is continuous
• Natural speech has disfluencies
• Natural speech is variable over: global rate, local rate, pronunciation within a speaker, pronunciation across speakers, phonemes in different contexts

Why is ASR Hard? (continued)
• Large vocabularies are confusable
• Out-of-vocabulary words are inevitable
• Recorded speech is variable over: room acoustics, channel characteristics, background noise
• Large training times are not practical
• User expectations are for performance equal to or greater than "human performance"

Main Causes of Speech Variability
• Environment
  – Speech-correlated noise: reverberation, reflection
  – Uncorrelated noise: additive noise (stationary, non-stationary)
• Speaker
  – Attributes of speakers: dialect, gender, age
  – Manner of speaking: breath & lip noise, stress, Lombard effect, rate, level, pitch, cooperativeness
• Input equipment
  – Microphone (transmitter), distance from microphone, filter
  – Transmission system: distortion, noise, echo
  – Recording equipment

ASR Dimensions
• Speaker dependent, independent
• Isolated, continuous, keywords
• Lexicon size and difficulty
• Task constraints, perplexity
• Adverse or easy conditions
• Natural or read speech

Telephone Speech
• Limited bandwidth (F vs S)
• Large speaker variability
• Large noise variability
• Channel distortion
• Different handset microphones
• Mobile and hands-free acoustics

What is Speech Recognition?
Speech signal → speech recognition → words ("How are you?")
• Related areas:
  – Who is the talker? (speaker recognition, identification)
  – What language was spoken? (language recognition)
  – What does the utterance mean? (speech understanding)

What is the problem?
Find the most likely word sequence Ŵ among all possible sequences given the acoustic evidence A:

    Ŵ = argmax_W P(W | A)

A tractable reformulation of the problem (via Bayes rule) is:

    Ŵ = argmax_W P(A | W) P(W)

where P(A | W) is the acoustic model and P(W) is the language model. The maximization itself is a daunting search task.

View ASR as Pattern Recognition
[Diagram: analog speech → front end → observation sequence O1 O2 ... OT → decoder (using the acoustic model, dictionary and language model) → best word sequence W1 W2 ... WT]

View ASR in Hierarchy
[Diagram: speech waveform → feature extraction (signal processing) → spectral feature vectors → phone likelihood estimation (Gaussians or neural networks) → phone likelihoods P(o|q) → decoding (Viterbi or stack decoder, using an HMM lexicon and N-gram grammar) → words]

Front-End Processing
[Diagram of the front end, including dynamic features; after K. F. Lee]

Feature Extraction
• Goals:
  – Less computation and memory
  – Simple representation of the signal
• Methods:
  – Fourier-spectrum based
    • MFCC (mel-frequency cepstrum coefficients)
    • LFCC (linear-frequency cepstrum coefficients)
    • Filter-bank energies
  – Linear-prediction-spectrum based
    • LPC (linear predictive coding)
    • LPCC (linear predictive cepstrum coefficients)
  – Others
    • Zero crossings, pitch, formants, amplitude

Cepstrum Computation
• The cepstrum is the inverse Fourier transform of the log spectrum:

    c(n) = (1/2π) ∫_{-π}^{π} log S(e^{jω}) e^{jωn} dω,   n = 0, 1, ..., L-1

• In computation the inverse DFT takes the form of a weighted DCT (see HTK)

Mel Cepstral Coefficients
• FFT, log magnitude, then DCT transform
• Construct the mel-frequency domain using a triangularly-shaped weighting function applied to mel-transformed log-magnitude spectral samples
• Filter bank: linear spacing below 1 kHz, logarithmic above 1 kHz
• Motivated by human auditory response characteristics

Cepstrum as Vector Space Features
[Diagram: overlapping analysis frames]

Other Features
• LPC: linear predictive coefficients
• PLP: perceptual linear prediction
• Though MFCC has been used successfully, what is the truly robust speech feature?

Acoustic Models
• Template-based AM, used in DTW, now obsolete
• Hidden Markov Model based AM, popular now
• Other AMs
  – Articulatory AM
  – Knowledge-based approach: spectrogram reading (expert system)
  – Connectionist approach: TDNN

Template-based Approach
• Dynamic programming algorithm
• Distance measure
• Isolated words
• Scaling invariance
• Time warping
• Clustering methods

Definition of the Hidden Markov Model
• An output observation alphabet O = {o1, o2, ..., oM}
• A set of states {1, 2, ..., N}
• A transition probability matrix A = {a_ij}, a_ij = P(s_t = j | s_{t-1} = i)
• An output probability matrix B = {b_i(k)}, b_i(k) = P(X_t = o_k | s_t = i)
• An initial state distribution π, with π_i = P(s_0 = i)
• Assumptions:
  – Markov assumption
  – Output independence assumption

Three Problems of HMM
Given a model Φ and a sequence of observations:
• The evaluation problem: how to compute the probability of the observation sequence? → Forward algorithm (a small sketch follows below)
• The decoding problem: how to find the optimal state sequence associated with a given observation sequence? → Viterbi algorithm
• The training/learning problem: how can we adjust the model parameters to maximize the joint probability? → Baum-Welch algorithm (forward-backward algorithm)
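As a concrete illustration of the evaluation problem, here is a minimal NumPy sketch of the forward algorithm for a discrete-observation HMM with the π, A, B parameters defined above; the toy model at the end is invented purely for illustration:

    import numpy as np

    def forward(pi, A, B, obs):
        # Evaluation problem: P(observations | model) via the forward algorithm.
        # pi : (N,)   initial state distribution
        # A  : (N, N) transition matrix, A[i, j] = P(s_t = j | s_{t-1} = i)
        # B  : (N, M) output matrix,     B[i, k] = P(o_t = k | s_t = i)
        # obs: list of observation symbol indices
        alpha = pi * B[:, obs[0]]          # initialisation
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]  # induction: sum over predecessor states
        return alpha.sum()                 # termination: sum over final states

    # Toy 2-state, 2-symbol model, purely illustrative.
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.9, 0.1], [0.2, 0.8]])
    print(forward(pi, A, B, [0, 1, 0]))    # probability of observing symbols 0, 1, 0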
Advantages of HMM
• Isolated and continuous speech recognition
• No attempt to find word boundaries
• Recovery from erroneous assumptions
• Scaling invariance, time warping, learning capability

Limitations of HMM
• HMMs assume that state durations follow an exponential distribution
• The transition probability depends only on the origin and destination states
• All observation frames depend only on the state that generated them, not on the neighboring observation frames (the output independence assumption)

HMM-based AM
• Hidden Markov Models (HMMs)
  – Probabilistic state machines: the state sequence is unknown, only the feature-vector outputs are observed
  – Each state has an output symbol distribution
  – Each state has a transition probability distribution
  – Issues:
    • What topology is proper?
    • How many states in a model?
    • How many mixtures in a state?

Hidden Markov Models
• Acoustic models encode the temporal evolution of the features (spectrum).
• Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.
• Phonetic model topologies are simple left-to-right structures.
• Skip states (time warping) and multiple paths (alternate pronunciations) are also common features of the models.
• Sharing (tying) model parameters is a common strategy to reduce complexity.

AM Parameter Estimation
• Closed-loop, data-driven modeling, supervised only by a word-level transcription.
• The expectation-maximization (EM) algorithm is used to improve the parameter estimates.
• Computationally efficient training algorithms (forward-backward) have been crucial.
• Batch-mode parameter updates are typically preferred.
• Decision trees are used to optimize parameter sharing, system complexity, and the use of additional linguistic knowledge.
• Typical flow: initialization → single-Gaussian estimation → 2-way split → re-estimation → 4-way split → re-estimation → ... (mixture distribution re-estimation)

Basic Speech Units
• Recognition units:
  – Phoneme
  – Word
  – Syllable
  – Demisyllable
  – Triphone
  – Diphone

Basic Units Selection
• Create a set of HMMs representing the basic sounds (phones) of a language
  – English has about 40 distinct phonemes
  – Chinese has about 22 initials + 37 finals
  – Need a "lexicon" for pronunciations
  – Letter-to-sound rules for unusual words
  – Co-articulation effects must be modeled
    • Triphones: each phone modified by its onset and trailing context phones (1k-2k used in English), e.g. pl-c+pr (left context, centre phone, right context); a small sketch of this expansion follows
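A toy sketch of the triphone expansion mentioned in the last bullet, assuming HTK-style "left-centre+right" labels; the function and the phone set are illustrative only:

    def to_triphones(phones):
        # Expand a monophone sequence into left-centre+right triphone labels.
        # Word-boundary contexts are left as biphones/monophones here; real
        # systems add cross-word context and tie rare triphones with decision trees.
        labels = []
        for i, p in enumerate(phones):
            left = phones[i - 1] if i > 0 else None
            right = phones[i + 1] if i < len(phones) - 1 else None
            if left and right:
                labels.append(f"{left}-{p}+{right}")
            elif right:
                labels.append(f"{p}+{right}")
            elif left:
                labels.append(f"{left}-{p}")
            else:
                labels.append(p)
        return labels

    # e.g. "speech" /s p iy ch/ -> ['s+p', 's-p+iy', 'p-iy+ch', 'iy-ch']
    print(to_triphones(["s", "p", "iy", "ch"]))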
Language Models
• What is a language model?
  – A quantitative ordering of the likelihood of word sequences (statistical viewpoint)
  – A set of rules specifying how to create word sequences or sentences (grammar viewpoint)
• Why use language models?
  – Not all word sequences are equally likely
  – Search space optimization (*)
  – Improve accuracy (multiple passes)
  – Word lattice to n-best

Finite-State Language Model
[Grammar diagram, e.g.: (show | display) me the (next | last | any) (page | picture | text file)]
• Write a grammar of the possible sentence patterns
• Advantages:
  – Long history/context
  – No need for a large text database (rapid prototyping)
  – Integrated syntactic parsing
• Problems:
  – Work to write grammars
  – Word sequences not covered by the grammar cannot be recognized
  – Used in small-vocabulary ASR, not for LVCSR

Statistical Language Models
• Predict the next word based on the current word and the history
• The probability of the next word is given by
  – Trigram: P(wi | wi-1, wi-2)
  – Bigram: P(wi | wi-1)
  – Unigram: P(wi)
• Advantages:
  – Trainable on large text databases
  – "Soft" prediction (probabilities)
  – Can be combined directly with the AM in decoding
• Problems:
  – Need a large text database for each domain
  – Sparseness problems, smoothing approaches
    • backoff approach
    • word-class approach
• Used in LVCSR (a small estimation sketch follows)

Statistical LM Performance
[Chart: language model performance]
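To illustrate the n-gram estimation described above, here is a minimal bigram model with add-one smoothing. Real LVCSR systems use stronger smoothing (e.g. backoff or Kneser-Ney, as noted in the slides) and much larger corpora; the toy sentences are invented:

    from collections import Counter

    class BigramLM:
        # Maximum-likelihood bigram model with add-one (Laplace) smoothing.
        def __init__(self, sentences):
            self.unigrams = Counter()
            self.bigrams = Counter()
            for words in sentences:
                tokens = ["<s>"] + words + ["</s>"]
                self.unigrams.update(tokens)
                self.bigrams.update(zip(tokens, tokens[1:]))
            self.vocab_size = len(self.unigrams)

        def prob(self, word, prev):
            # P(word | prev) with add-one smoothing.
            return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + self.vocab_size)

    # Toy corpus, purely illustrative.
    lm = BigramLM([["show", "me", "the", "next", "page"],
                   ["show", "me", "the", "last", "picture"]])
    print(lm.prob("the", "me"))    # relatively high
    print(lm.prob("page", "me"))   # low, but nonzero thanks to smoothing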
ASR Decoding Levels
[Diagram of the decoding hierarchy: states (acoustic models) → phonemes, e.g. /w/ /ah/ /ts/ and /th/ /ax/ → words via the dictionary, e.g. /w/ → /ah/ → /ts/ = "what's", /th/ → /ax/ = "the" → sentences via the language model, e.g. "what's the willamette's location", "display kirk's longitude", "sterett's latitude"]

Decoding Algorithms
• Given the observations, how do we determine the most probable utterance/word sequence? (DTW in template-based matching)
• The dynamic programming (DP) algorithm was proposed by Bellman in the 1950s for multistep decision processes; its "principle of optimality" is a divide-and-conquer idea.
• DP-based search algorithms are used in speech recognition decoders to return the n-best paths or a word lattice through the acoustic model and the language model.
• A complete search is usually impossible since the search space is too large, so beam search is required to prune less probable paths and save computation.
• Issues: computational underflow, balancing the LM and the AM.

Viterbi Search
• Uses Viterbi decoding
  – Takes MAX, not SUM (Viterbi vs. forward)
  – Finds the optimal state sequence, not the optimal word sequence
  – Computational load: O(T * N^2)
• Time synchronous
  – Extends all paths at each time step
  – All paths have the same length (no need to normalize to compare scores, unlike A* decoding)

Viterbi Search Algorithm
    function Viterbi(observations of length T, state-graph) returns best-path
        num-states <- number-of-states(state-graph)
        create a path-probability matrix viterbi[num-states+2, T+2]
        viterbi[0, 0] <- 1.0
        for each time step t from 0 to T do
            for each state s from 0 to num-states do
                for each transition s' from s in state-graph do
                    # score of extending the best path into s by the transition s -> s'
                    new-score <- viterbi[s, t] * a[s, s'] * b_s'(o_t)
                    if viterbi[s', t+1] = 0 or viterbi[s', t+1] < new-score then
                        viterbi[s', t+1] <- new-score
                        back-pointer[s', t+1] <- s
        backtrace from the highest-probability state in the final column of viterbi[] and return the path

Viterbi Search Trellis
[Diagram: trellis of states for words W1 and W2 over time steps 0, 1, 2, 3, ..., t]

Viterbi Search Insight
[Diagram: within-word transition score = OldProb(S1) x OutProb x TransProb; cross-word transition score = OldProb(S3) x P(W2 | W1); each trellis cell stores a score, a back-pointer and a parameter pointer]

Backtracking
• Find the best association between words and the signal
• Compose words from phones using the dictionary
• Backtracking recovers the best state sequence

N-Best Speech Results
[Diagram: speech waveform → ASR → n-best list constrained by a grammar, e.g. N=1 "Get me two movie tickets...", N=2 "I want to movie trips...", N=3 "My car's too groovy"]
• Use the grammar to guide recognition
• Post-processing based on the grammar/LM
• Word lattice to n-best conversion

Complexity of Search
• Lexicon: contains all the words in the system's vocabulary along with their pronunciations (often there are multiple pronunciations per word; # of items in the lexicon)
• Acoustic models: HMMs that represent the basic sound units the system is capable of recognizing (# of models, # of states per model, # of mixtures per state)
• Language model: determines the possible word sequences allowed by the system (fan-out, perplexity, entropy)

ASR vs Modern AI
• ASR is based on AI techniques
  – Knowledge representation and manipulation
    • AM and LM, lexicon, observation vectors
  – Machine learning
    • Baum-Welch for HMMs
    • Nearest neighbor and k-means clustering for signal identification
  – "Soft" probabilistic reasoning / Bayes rule
    • Managing uncertainty in the mapping from signal to phones to words
  – ASR is an expert system

ASR Summary
• The performance criterion is WER (word error rate)
• Three main knowledge sources
  – Acoustic model (Gaussian mixture models)
  – Language model (N-grams, finite-state grammars)
  – Dictionary (context-dependent sub-phonetic units)
• Decoding
  – Viterbi decoder
  – Time-synchronous
  – A* decoding (stack decoding, IBM, X. D. Huang)

We still need
• We still need science
• Need language, intelligence
• Acoustic robustness is still poor
• Perceptual research, models
• Fundamentals of statistical pattern recognition for sequences
• Robustness to accent, stress, rate of speech, ...

Future Directions
[Timeline: analog filter banks (1960) → dynamic time warping (1970) → hidden Markov models (1980) → ... (2004)]
Conclusions:
• Supervised training is a good machine learning technique
• Large databases are essential for the development of robust statistics
Challenges:
• Discrimination vs. representation
• Generalization vs. memorization
• Pronunciation modeling
• Human-centered language modeling
The algorithmic issues for the next decade:
• Better features by extracting articulatory information?
• Bayesian statistics? Bayesian networks?
• Decision trees? Information-theoretic measures?
• Nonlinear dynamics? Chaos?

References
• D. Jurafsky and J. H. Martin, Speech and Language Processing, Prentice Hall, 2000
• X. D. Huang et al., Spoken Language Processing, Prentice Hall, 2000
• F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1999
• C. Manning and H. Schutze, Foundations of Statistical Natural Language Processing, MIT Press, 1999
• L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993
• Dr. J. Picone, speech website: www.isip.msstate.edu

Test
• Mode
  – A final 4-page report, or
  – A 30-minute presentation
• Content
  – Review of speech processing
  – Speech features and processing approaches
  – Review of TTS or ASR
  – Audio in computer engineering

THANKS