Snack for Ruby
S Legrand

Talk Objectives
* Tour of the API
* Learn the walk and talk
* Have fun

Snack
* The Snack library is a tool to aid in learning about sound, voice, and ASR, and is hopefully a fun way to experiment
* Snack is a Tcl-based API
* Snack has been adapted to Python and is included in the standard Python distribution

Snack
* Snack is Swedish for “talk” or “chat”
* Kåre Sjölander is the principal investigator for Tcl-based Snack
* Tcl Snack is available at http://www.speech.kth.se/snack/

Snack for Ruby
* rbSnack is a Ruby wrapper around Tcl Snack
* rbSnack has additional Ruby-based utilities
* rbSnack has HTML-based help (rdoc + rbTeX)
* rbSnack can be found at http://rbsnack.sourceforge.net/

Snack Toolkit Includes
* Recording, playback
* Waveform display
* Spectrogram: Fourier, LPC
* Formant analysis
* Power analysis
* Filters (will demo)

The Speech Signal
* Continuous speech is discretely sampled
* The signal consists of rapidly changing data points
* The display of the sampled signal is called the waveform
* Snack can display the waveform in real time

Analysis Uses Frames
* The signal is broken into frames
* Frames may overlap
* Characteristics of the signal are analyzed using Fourier and LPC analysis on a per-frame basis

Going in Circles
* Complex numbers are just a funny way of multiplying: add the angles
* Euler's formula: e^(iθ) = cos θ + i sin θ

Fourier Analysis
* The Fourier matrix is a unitary matrix
* Multiplying the signal by the Fourier matrix returns the frequency components of the signal, called the Fourier coefficients
* The inverse is easy to compute (for a unitary matrix it is the conjugate transpose); this is the inverse Fourier transform

The Fourier Matrix
* Looks like spinning disks
* Multiplication by the signal produces the Fourier coefficients (frequency components)

Examining Fourier Components
* A spectrogram gives a picture of the Fourier components (coefficients) as they evolve over time
* Snack can display it in real time
* Looks like an X-ray
* Bands of high activity correspond to formants

Linear Filters
* Useful for understanding the nature of speech signals
* Generators: generate square waves, sine waves, sawtooth waves, etc.
* Composers: compose several filters
* FIR: finite impulse response
* IIR: infinite impulse response

FIR Filter
* Determined completely by its response to a unit impulse
* Response is finite in duration
* y(t) = b0 x(t) + b1 x(t-1) + b2 x(t-2) + … + bn x(t-n)
* (We will demo FIR using rbSnack)

IIR Filter
* Also called a recursive filter
* Response is infinite in duration
* y(t) = b0 x(t) + b1 x(t-1) + b2 x(t-2) + … + bn x(t-n) + a1 y(t-1) + a2 y(t-2) + … + an y(t-n)
* (We will demo IIR using rbSnack)

Linear Predictive Analysis
* Analogous to Fourier analysis
* Assumption: for each frame, the signal is predicted by y(t) = a1 y(t-1) + a2 y(t-2) + … + ap y(t-p)
* The LPC coefficients are the best least-squares approximation
* Can also be used to predict formants

What is Sound? What is Speech?
* Sound is the signal created by longitudinal waves in some medium, such as air
* Sound waves are continuous
* They can be decomposed into a linear combination of sine waves
* Speech is a special noise made by humans

It’s Just Tubing…
* The simplest model of speech is to consider the lungs and trachea as one long tube
* The resonance frequencies are called formants
* [Figure: formants F1 and F2]

Some Speech Recognition Features
* Formants
* Pitch
* Voiced/unvoiced
* Nasality
* Frication
* Energy
* Our current work uses only formants and energy

Basic Utterances
* A basic unit of speech is called a phone
* Vowels are utterances with constant formants
* A diphthong is the transition from one vowel to another
* Vowels and diphthongs are essentially characterized by the first and second formants

Other Phones: The Consonants
* Plosive: closure in the oral cavity /p/
* Nasal: oral cavity closed, air flows through the nasal cavity /m/
* Fricative: turbulent airstream noise /s/
* Retroflex liquid: vowel-like, tongue high and curled back /r/
* Lateral liquid: vowel-like, tongue central, side airstream /l/
* Glide: vowel-like /y/

Some Problems with Speech Signals
* Segmentation: when does a word begin and end? (Noise?)
* Wetware: the speaker’s internal configuration, plus lip smacks, breathing, etc.
* SegmentationWorkshop demos one approach

Code Books
* A code book consists of code words
* The idea is to search through the code book to find the code word that best matches the feature sequence
* rbSnack uses the code book approach in word recognition

Code Book Approach
* ++ Easy to implement
* + Good for isolated words
* +- Works best on small vocabularies
* -- Insensitive to context, prone to errors

Code Book Approach
* WhichWay is a simple demo of this approach

More Problems with Speech Signals
* Accent: Southern vs. New England vs. California Valley vs. other
* Variation in the rate of speech makes it hard to compare words

Dynamic Time Warping
* A pattern comparison technique
* A way of stretching or compressing one sequence to match another
* Evaluated using dynamic programming

Dynamic Programming
* Form a grid, with the start at the lower left and the end at the upper right
* Label each node with the difference (error) between pattern 1 at time i and pattern 2 at time j
* Find the minimal distance from start to end using dynamic programming
* Basic assumption: if the best path P(S,E) passes through node N, then P(S,E) is the concatenation of P(S,N) (best from S to N) and P(N,E) (best from N to E)
* [Figure: a possible path through the grid]

Dynamic Programming
* [Figure: Type I and Type III paths]
* rbSnack includes examples for various time alignment approaches

Dynamic Programming
* [Figure: Type IV and Itakura paths]

Hidden Markov Models
* Sometimes the second (or third) best match is the right word
* Use HMMs to ascertain the correct word in the context of the sentence (ditto for phones within a word)
* HMMs are similar to non-deterministic finite state machines, except that they also have non-deterministic output

Hidden Markov Models
* Dynamic programming is used to compute weights
* HMMs look like: [Figure: states 1 to 4 with transition probabilities such as .4 and .2, and output probabilities P(/i/)=.5, P(/a/)=.2, P(/o/)=.3]

Possible Future Directions
* Examine other features (pitch?)
* Incorporate other libraries (do the computationally hard work in C)
* Add more signal processing routines
* Add more examples
* Use hidden Markov models

Lessons Learned / To Be Learned
* Document everything
* Nothing’s perfect
* Automate everything
* The project is never done

What’s Next?
* Try it out
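
One way to try out the Fourier matrix slides is a minimal sketch in plain Ruby. It does not use the rbSnack API; the helper names (fourier_matrix, mat_vec, conjugate_transpose) are made up for illustration, and the 1/sqrt(N) scaling is the assumption that makes the matrix unitary. It multiplies a sampled cosine by the matrix to get the Fourier coefficients, then recovers the signal with the conjugate transpose.

```ruby
# Build the N x N Fourier matrix with entries w**(j*k) / sqrt(N),
# where w = e^(-2*pi*i/N). Each entry is a point on a circle
# (a "spinning disk"); the 1/sqrt(N) scaling makes the matrix unitary.
def fourier_matrix(n)
  Array.new(n) do |j|
    Array.new(n) do |k|
      Complex.polar(1.0 / Math.sqrt(n), -2.0 * Math::PI * j * k / n)
    end
  end
end

# Multiply a matrix (array of rows) by a vector.
def mat_vec(matrix, vector)
  matrix.map { |row| row.zip(vector).sum { |a, b| a * b } }
end

# Conjugate transpose; for a unitary matrix this is the inverse.
def conjugate_transpose(matrix)
  matrix.transpose.map { |row| row.map(&:conjugate) }
end

n = 8
# One frame holding exactly one cycle of a cosine.
signal = Array.new(n) { |t| Math.cos(2.0 * Math::PI * t / n) }

f = fourier_matrix(n)
coefficients = mat_vec(f, signal)
recovered = mat_vec(conjugate_transpose(f), coefficients)

# Print the magnitude of each frequency component.
coefficients.each_with_index do |c, k|
  puts format("k=%d  |c|=%.3f", k, c.abs)
end
```

Only the components at k=1 and k=N-1 should be noticeably non-zero for this input, and recovered should match signal up to rounding error.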
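
The FIR and IIR difference equations from the filter slides can also be tried directly. This is a sketch in plain Ruby, not the rbSnack filter API: fir_filter and iir_filter are hypothetical names, the feedback coefficients are passed as [a1, a2, …], and the sign convention (feedback terms added) follows the slide's equation. Samples before t = 0 are assumed to be zero.

```ruby
# FIR: y(t) = b0*x(t) + b1*x(t-1) + ... + bn*x(t-n)
# Samples before the start of the signal are treated as zero.
def fir_filter(b, x)
  x.each_index.map do |t|
    b.each_with_index.sum do |coef, k|
      t - k >= 0 ? coef * x[t - k] : 0.0
    end
  end
end

# IIR: the FIR sum plus feedback a1*y(t-1) + a2*y(t-2) + ...
# a[0] holds a1, a[1] holds a2, and so on.
def iir_filter(b, a, x)
  y = []
  x.each_index do |t|
    feedforward = b.each_with_index.sum do |coef, k|
      t - k >= 0 ? coef * x[t - k] : 0.0
    end
    feedback = a.each_with_index.sum do |coef, k|
      t - k - 1 >= 0 ? coef * y[t - k - 1] : 0.0
    end
    y << feedforward + feedback
  end
  y
end

impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Two-tap average: the response dies out after two samples.
p fir_filter([0.5, 0.5], impulse)

# One feedback tap: the response halves each step and never reaches zero.
p iir_filter([1.0], [0.5], impulse)
```

Feeding both filters a unit impulse shows the difference named in the slides: the FIR response ends after as many samples as there are coefficients, while the IIR response keeps decaying.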
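
The grid search from the Dynamic Time Warping and Dynamic Programming slides can be sketched in a few lines of plain Ruby. This is an illustration, not rbSnack's time alignment code: dtw_distance is a hypothetical name, the patterns are plain arrays of numbers, the node error is an absolute difference, and the allowed steps (one to the right, one up, or one diagonal) are an assumed simple path type rather than any of the specific types shown in the figures.

```ruby
# Dynamic time warping distance between two numeric sequences.
# cost[i][j] holds the minimal accumulated error of any path from the
# start of the grid to the node (pattern a at time i, pattern b at time j).
def dtw_distance(a, b)
  inf = Float::INFINITY
  cost = Array.new(a.length + 1) { Array.new(b.length + 1, inf) }
  cost[0][0] = 0.0

  (1..a.length).each do |i|
    (1..b.length).each do |j|
      # Error at this node: difference between the two patterns.
      local = (a[i - 1] - b[j - 1]).abs
      # The best path into this node extends the best path into a neighbor.
      cost[i][j] = local + [cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]].min
    end
  end

  cost[a.length][b.length]
end

fast = [1, 2, 3]
slow = [1, 2, 2, 3]      # same shape, spoken more slowly
other = [3, 2, 1]

puts dtw_distance(fast, slow)   # stretching absorbs the rate difference
puts dtw_distance(fast, other)
```

The inner line is the basic assumption from the slide in code form: the best path to a node is the best path to one of its predecessors plus the error at the node itself.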
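
The code book search can be sketched the same way. Everything here is hypothetical: the labels, the two-element feature vectors, and their values are made-up toy data, and rank_code_words is not an rbSnack method. Plain Euclidean distance stands in for the comparison; for feature sequences of different lengths a time-warped distance could be swapped in instead.

```ruby
# Hypothetical code book: each code word pairs a label with a feature vector.
# The numbers are made-up toy values, not measured speech features.
CODE_BOOK = {
  "left"  => [0.2, 0.9],
  "right" => [0.8, 0.4],
  "stop"  => [0.5, 0.1]
}.freeze

# Straight-line distance between two feature vectors of equal length.
def euclidean_distance(u, v)
  Math.sqrt(u.zip(v).sum { |p, q| (p - q)**2 })
end

# Compare the features against every code word and return
# [label, distance] pairs sorted from best match to worst.
def rank_code_words(features, code_book = CODE_BOOK)
  code_book.map { |label, word| [label, euclidean_distance(features, word)] }
           .sort_by { |_, distance| distance }
end

ranking = rank_code_words([0.75, 0.5])
ranking.each { |label, distance| puts format("%-6s %.3f", label, distance) }
```

Returning the whole ranking rather than only the winner keeps the second and third best matches available for a later, context-aware stage.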