Speech Recognition
Module 2: Talking with computers
Introduction
Speech recognition is one of the keys to future developments in human–technology interaction:
• it is the most natural form of interaction;
• it can increase access to technology;
• some speech recognition systems are already in use, for example at AT&T.
Why is it so difficult for computers to recognize speech?
Speech recognizers
• Capturing the sound is easy, but recognizing the words and their meaning is difficult.
Types of speech recognition systems
Automatic speech recognition (ASR) systems come in two types:
• Isolated-word recognizers recognize individual words or short phrases, often called 'utterances'. Pauses at the beginning and end of an utterance make the recognition process much easier because there are no transition stages between utterances (a sketch of pause-based trimming follows this list).
• Continuous-speech recognizers accept words spoken at a normal speech rate. Recognition is more difficult because the end of one word runs into the beginning of the next.
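The pauses around an isolated utterance can be exploited directly to locate its start and end. Below is a minimal sketch of energy-based trimming, assuming a mono 16 kHz signal held in a NumPy array; the frame size and silence threshold are illustrative values, not figures from the module.

```python
import numpy as np

def find_utterance(signal, rate=16000, frame_ms=20, threshold=0.02):
    """Locate the start and end of an isolated utterance by
    finding frames whose energy rises above a silence threshold."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sqrt(np.mean(frames ** 2, axis=1))  # RMS energy per frame
    active = np.where(energy > threshold)[0]        # frames above the threshold
    if active.size == 0:
        return None                                 # nothing but silence
    start, end = active[0] * frame_len, (active[-1] + 1) * frame_len
    return signal[start:end]                        # trimmed utterance
```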
Speech recognizers
ASR systems are also classified by the number of speakers that can be recognized, since systems require training:
• Speaker-independent systems for use with the general public must be extensively trained using thousands of speech samples, so they are typically designed to recognize a restricted vocabulary of, say, 2000 words. This is perfectly adequate for telephone banking systems and travel information systems.
• Speaker-enrolment systems for personal dictation must handle a more extensive vocabulary, perhaps 50,000 words, and so they are trained to a single individual.
Contribution of Linguistics
• Phonology: the study of vocal sounds;
• Lexicon: defines the vocabulary, or words, used in a language;
• Syntax: defines the grammatical rules for combining words into phrases and sentences;
• Semantics: defines the conventions for deriving the meaning of words and sentences.
Contribution of Linguistics
Lexicon:
• If the lexicon is small (say 1000 words), the acoustic features of each word are sufficient to provide direct recognition.
• The ASR system is first trained by listening to the words in the lexicon spoken many times by many different speakers, and saving the statistics of the measurements.
• During the recognition phase the same acoustic features are measured for a single utterance and the results are compared with the stored values.
• The recognized word is the one that produces the best match with the stored data (a sketch of this matching step follows this list). Such techniques have been successfully applied to medical and legal speech recognition systems.
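A minimal sketch of the direct-matching idea, assuming each trained word is summarized by a fixed-length vector of averaged acoustic features; the words and feature values here are purely illustrative.

```python
import numpy as np

# Stored mean feature vectors from the training phase,
# one entry per word in the small lexicon (values are illustrative).
templates = {
    "yes":  np.array([0.8, 0.1, 0.3]),
    "no":   np.array([0.2, 0.9, 0.4]),
    "stop": np.array([0.5, 0.5, 0.9]),
}

def recognize(features):
    """Return the lexicon word whose stored template is the
    best match (smallest Euclidean distance) to the utterance."""
    return min(templates, key=lambda w: np.linalg.norm(features - templates[w]))

print(recognize(np.array([0.75, 0.15, 0.35])))  # -> "yes"
```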
Contribution of Linguistics
Lexicon:
• Extended lexicons of around 50,000 words are used for general dictation ASR systems; these contain far too many words to recognize by acoustic features alone.
• English also has words with similar sounds but different meanings, for example 'there' and 'their', 'bear' and 'bare', 'hair' and 'hare'.
• The solution adopted for these large lexicons is to look for combinations of words, typically those occurring in groups of two or three words.
• The probabilities of these combinations are determined by analysing written texts or recorded speeches. The ASR system uses this knowledge of word combinations to increase its accuracy of recognition (a two-word sketch follows this list).
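A minimal sketch of the two-word-combination idea (bigram probabilities); the tiny corpus is illustrative only.

```python
from collections import Counter

corpus = "their house is there the bare bear lost its hair".split()

# Count adjacent two-word combinations found in the analysed text.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(prev, word):
    """Probability that 'word' follows 'prev', estimated from the corpus."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

# Choosing between homophones: which spelling better follows 'the'?
print(bigram_prob("the", "bare"))   # 1.0 in this tiny corpus
print(bigram_prob("the", "bear"))   # 0.0, so 'bare' is preferred here
```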
Contribution of Linguistics
Syntax:
• An ASR system would look for a subject, verb and object. Knowing about past, present and future tense would help it to recognize the correct form of verbs.
• ASR systems, however, cannot recognize the meaning.
Semantics:
• defines the conventions for deriving the meaning of words and sentences.
Preparation: analogue and digital systems
• Periodic: the term applied to signals that repeat themselves at regular intervals. Periodic signals tend to exhibit strong peaks in their spectra.
• Period: the time it takes for a periodic signal to repeat itself; equivalently, the duration of one cycle. The period is the reciprocal of the frequency.
• Bandwidth: the difference between the highest and lowest frequencies present in a signal, or the maximum range of frequencies that can be transmitted by a system.
• Spectrum: a graph showing the frequencies present in a signal.
Worked examples (sketched in code below):
• The bandwidth of a signal extending from 100 Hz to 3400 Hz is 3300 Hz.
• For a periodic signal the frequency is the reciprocal of the period: if the period is 50 ms then the frequency is 20 Hz.
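The same arithmetic as a short sketch:

```python
# Bandwidth: highest minus lowest frequency present in the signal.
f_low, f_high = 100.0, 3400.0        # Hz
bandwidth = f_high - f_low           # 3300.0 Hz

# Frequency is the reciprocal of the period.
period = 50e-3                       # 50 ms
frequency = 1.0 / period             # 20.0 Hz

print(bandwidth, frequency)
```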



• The sampling rule states that the minimum sampling rate must equal twice the bandwidth of the signal. If the bandwidth of the signal is 6 kHz, then the sampling rate must not be less than 12 kHz.
• The quantization interval of an analogue-to-digital converter is equal to the input voltage range divided by the number of binary codewords. For a 12-bit converter there are 2^12, or 4096, codewords. Hence, for a 5-volt input range, the quantization interval of this converter is 5/4096 volts, or approximately 1 millivolt.
• The peak quantization noise is generally taken to be equal to half the quantization interval, so in this case the peak noise will be 0.5 millivolts. (These figures are recomputed in the sketch below.)
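A short sketch reproducing the sampling and quantization figures above:

```python
# Sampling rule: minimum sampling rate is twice the signal bandwidth.
bandwidth_hz = 6_000
min_sampling_rate = 2 * bandwidth_hz       # 12 000 samples per second

# Quantization interval: input range divided by number of codewords.
bits = 12
input_range_v = 5.0                        # 5-volt input range
codewords = 2 ** bits                      # 4096
q_interval_v = input_range_v / codewords   # ~0.00122 V, about 1 mV

# Peak quantization noise: half the quantization interval.
peak_noise_v = q_interval_v / 2            # ~0.5 mV

print(min_sampling_rate, q_interval_v, peak_noise_v)
```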

Fourier analysis is the process determining
the frequency components of a periodic
signal (or mathematical function), generally
expressed in the form of an infinite
trigonometric series of sine and cosine
terms. The resulting spectrum is termed a
line spectrum.

the result of combining two sinewaves,
hence the spectrum displays two peaks at
the frequencies corresponding to these
sinewaves

a rectangular ‘pulse’ – a short burst of
energy. The corresponding spectrum has a
peak at 0 Hz and a decaying series of
peaks. Unlike the other spectra, this one is
not a line spectrum, but is a ‘continuous’
spectrum
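A minimal sketch of the two-sinewave case using NumPy's FFT; the component frequencies and sample rate are illustrative.

```python
import numpy as np

rate = 8000                                   # samples per second
t = np.arange(rate) / rate                    # one second of time points

# Sum of two sinewaves at 300 Hz and 1200 Hz.
signal = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

spectrum = np.abs(np.fft.rfft(signal))        # magnitude spectrum
freqs = np.fft.rfftfreq(len(signal), d=1/rate)

# The two largest peaks sit at the two component frequencies.
peaks = freqs[np.argsort(spectrum)[-2:]]
print(sorted(peaks))                          # [300.0, 1200.0]
```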
Speech recognition
• Stage 1 consists of capturing a digital representation of the spoken word. This representation is referred to as the waveform.
• Stage 2 converts the waveform into a series of elemental sound units, referred to as phonemes, so as to classify the word(s) prior to recognition.
• Stage 3 uses various forms of mathematical analysis to estimate the most likely word consistent with the series of recognized phonemes (the three stages are sketched below).
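The three stages as a skeletal pipeline; every function body here is a placeholder standing in for the real processing, and the phoneme symbols and lexicon entries are illustrative.

```python
def capture_waveform():
    """Stage 1: digital representation of the spoken word (placeholder samples)."""
    return [0.0, 0.1, 0.3, 0.2, -0.1]

def waveform_to_phonemes(waveform):
    """Stage 2: convert the waveform into elemental sound units (placeholder)."""
    return ["DH", "EH", "R"]

def phonemes_to_word(phonemes, lexicon):
    """Stage 3: estimate the most likely word given the phoneme series
    (here scored by simple position-wise overlap)."""
    def overlap(entry):
        word, entry_phonemes = entry
        return sum(a == b for a, b in zip(entry_phonemes, phonemes))
    return max(lexicon, key=overlap)[0]

lexicon = [("there", ["DH", "EH", "R"]), ("hair", ["HH", "EH", "R"])]
print(phonemes_to_word(waveform_to_phonemes(capture_waveform()), lexicon))  # there
```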
Speech recognition: capturing
• Stage 1 consists of capturing a digital representation of the spoken word. This representation is referred to as the waveform.
• The capture stage must also cope with extraneous noise.
Phonemes: the elemental parts of speech
• English has 42 phonemes.
• The recognizer detects the phonemes in order to match them to the letters of the written word.
Spectrograms: time and frequency combined
• The spectrogram, or voice-print, was first developed in the 1930s. A sample three-dimensional (3-D) spectrogram generated by SpeechView is shown in Figure 6.
• The top part shows the sampled waveform.
• The bottom part of the figure is a combination of amplitude and frequency information. The vertical scale corresponds to frequency, whilst the darkness of grey tone is related to amplitude. (A sketch of computing a spectrogram follows.)
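A minimal sketch of computing a spectrogram, assuming SciPy is available; the test signal (a tone whose frequency rises over time) is illustrative.

```python
import numpy as np
from scipy.signal import spectrogram

rate = 16000
t = np.arange(rate) / rate

# A test tone whose frequency rises over time, so the
# spectrogram shows a band sweeping upwards.
signal = np.sin(2 * np.pi * (200 + 400 * t) * t)

# Frequencies (Hz), times (s), and amplitude per time-frequency cell.
freqs, times, amplitude = spectrogram(signal, fs=rate, nperseg=320)
print(freqs.shape, times.shape, amplitude.shape)
```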
Phoneme character recognition
The process of analysis (framing is sketched after this list):
• The speech is digitized.
• It is analysed in frames of 5–20 ms duration, with successive frames spaced 10 ms apart.
• For each frame the spectrum is calculated, and a number of spectral features (such as the formant frequencies) are extracted and stored.
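A minimal sketch of this framing step, assuming a 16 kHz signal; 20 ms frames with a 10 ms hop match the figures above.

```python
import numpy as np

def frame_signal(signal, rate=16000, frame_ms=20, hop_ms=10):
    """Split speech into overlapping analysis frames:
    20 ms long, successive frames spaced 10 ms apart."""
    frame_len = int(rate * frame_ms / 1000)
    hop_len = int(rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

frames = frame_signal(np.random.randn(16000))   # one second of audio
spectra = np.abs(np.fft.rfft(frames, axis=1))   # spectrum per frame
print(frames.shape, spectra.shape)              # (99, 320) (99, 161)
```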

Word recognition
• The prior is the set of words already known to the recognizer.
• The recognized phonemes are searched for within this prior set (a lookup sketch follows).
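A minimal sketch of searching the prior; the phoneme symbols and entries are illustrative only.

```python
# The 'prior': a fixed set of known words and their phoneme
# sequences (entries are illustrative).
prior = {
    ("HH", "EH", "R"): "hair",
    ("DH", "EH", "R"): "there",
    ("B", "EH", "R"):  "bear",
}

def lookup(phonemes):
    """Search the recognized phoneme sequence within the prior set."""
    return prior.get(tuple(phonemes), "<unknown>")

print(lookup(["B", "EH", "R"]))   # -> "bear"
```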

Summary
• Automatic speech recognition systems fall into two primary categories: isolated-word recognizers and continuous-speech recognizers. They can be further categorized as speaker-independent small-vocabulary recognizers or large-vocabulary speaker-enrolled recognizers.
• The elemental sounds of speech are called phonemes, and in the case of English correspond to the vowels and consonants of the language. There are about 42 phonemes in the English language and they constitute the 'alphabet' of speech recognition.
• There are two views of a sound signal: a time-domain view that describes how the signal amplitude varies over time, and a frequency-domain view that defines the amplitude of the frequencies present in the signal over a specified interval of time. The time- and frequency-domain representations can be combined into a spectrogram, a graph that displays the changes in frequency and amplitude over time.
• Speech analysis is based on characterizing the phonetic content over short-duration frames, typically 5 to 20 milliseconds long. Several frames are combined into a context window that captures the co-articulation effects occurring at the transitions between phonemes. The data from the context window is processed to calculate the probability that the frame's content falls into one of a predefined set of phoneme categories. The time-sequence of calculated phoneme categories is compared with the known sequence corresponding to the word(s) to be recognized. The final result is an estimated probability of word recognition (the context-window step is sketched below).
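A minimal sketch of building context windows from analysis frames; the window width and the number of features per frame are illustrative choices, not values from the module.

```python
import numpy as np

def context_windows(frames, context=2):
    """Stack each frame with its neighbours (context frames on each side)
    so the window captures co-articulation at phoneme transitions."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[i : i + 2 * context + 1].ravel()
                     for i in range(len(frames))])

frames = np.random.randn(99, 13)   # e.g. 13 spectral features per frame
windows = context_windows(frames)
print(windows.shape)               # (99, 65): 5 frames x 13 features
```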