Speech Recognition
Module 2: Talking with computers
Introduction
Speech recognition is one of the keys to future developments in human–technology interaction:
• it is the most natural form of interaction;
• it can increase access to technology;
• some speech recognition systems are already in use, for example at AT&T.
Why is it so difficult for computers to recognize speech?
Speech recognizers
• Capturing the sound is easy, but recognizing the words and their meaning is difficult.
Types of speech recognition systems
Automatic speech recognition (ASR) systems come in two types:
• Isolated-word recognizers recognize individual words or short phrases, often called 'utterances'. Pauses at the beginning and end of an utterance make the recognition process much easier because there are no transition stages between utterances (a sketch of pause-based trimming follows this list).
• Continuous-speech recognizers accept words spoken at a normal speech rate. Recognition is more difficult because the end of one word runs into the beginning of the next.
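The pauses around an isolated utterance can be exploited directly to locate its start and end. Below is a minimal sketch of energy-based trimming, assuming a mono 16 kHz signal held in a NumPy array; the frame size and silence threshold are illustrative values, not figures from the module.

```python
import numpy as np

def find_utterance(signal, rate=16000, frame_ms=20, threshold=0.02):
    """Locate the start and end of an isolated utterance by
    finding frames whose energy rises above a silence threshold."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sqrt(np.mean(frames ** 2, axis=1))  # RMS energy per frame
    active = np.where(energy > threshold)[0]        # frames above the threshold
    if active.size == 0:
        return None                                 # nothing but silence
    start, end = active[0] * frame_len, (active[-1] + 1) * frame_len
    return signal[start:end]                        # trimmed utterance
```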
Speech recognizers
ASR systems are also classified by the number of speakers that can be recognized, since systems require training:
• Speaker-independent systems for use with the general public must be extensively trained using thousands of speech samples, so they are typically designed to recognize a restricted vocabulary of, say, 2000 words. This is perfectly adequate for telephone banking systems and travel information systems.
• Speaker-enrolment systems for personal dictation must handle a more extensive vocabulary, perhaps 50,000 words, and so they are trained to a single individual.
Contribution of Linguistics
• Phonology: the study of vocal sounds;
• Lexicon: defines the vocabulary, or words, used in a language;
• Syntax: defines the grammatical rules for combining words into phrases and sentences;
• Semantics: defines the conventions for deriving the meaning of words and sentences.
Contribution of Linguistics
Lexicon:
• If the lexicon is small (say 1000 words), the acoustic features of each word are sufficient to provide direct recognition.
• The ASR system is first trained by listening to the words in the lexicon spoken many times by many different speakers, and saving the statistics of the measurements.
• During the recognition phase the same acoustic features are measured for a single utterance and the results are compared with the stored values.
• The recognized word is the one that produces the best match with the stored data (a sketch of this matching step follows this list). Such techniques have been successfully applied to medical and legal speech recognition systems.
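A minimal sketch of the direct-matching idea, assuming each trained word is summarized by a fixed-length vector of averaged acoustic features; the words and feature values here are purely illustrative.

```python
import numpy as np

# Stored mean feature vectors from the training phase,
# one entry per word in the small lexicon (values are illustrative).
templates = {
    "yes":  np.array([0.8, 0.1, 0.3]),
    "no":   np.array([0.2, 0.9, 0.4]),
    "stop": np.array([0.5, 0.5, 0.9]),
}

def recognize(features):
    """Return the lexicon word whose stored template is the
    best match (smallest Euclidean distance) to the utterance."""
    return min(templates, key=lambda w: np.linalg.norm(features - templates[w]))

print(recognize(np.array([0.75, 0.15, 0.35])))  # -> "yes"
```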
Contribution of Linguistics
Lexicon:
• Extended lexicons of around 50,000 words are used for general dictation ASR systems; these contain far too many words to recognize by acoustic features alone.
• English also has words with similar sounds but different meanings, for example 'there' and 'their', 'bear' and 'bare', 'hair' and 'hare'.
• The solution adopted for these large lexicons is to look for combinations of words, typically those occurring in groups of two or three words.
• The probabilities of these combinations are determined by analysing written texts or recorded speeches. The ASR system uses this knowledge of word combinations to increase its accuracy of recognition (a two-word sketch follows this list).
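A minimal sketch of the two-word-combination idea (bigram probabilities); the tiny corpus is illustrative only.

```python
from collections import Counter

corpus = "their house is there the bare bear lost its hair".split()

# Count adjacent two-word combinations found in the analysed text.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(prev, word):
    """Probability that 'word' follows 'prev', estimated from the corpus."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

# Choosing between homophones: which spelling better follows 'the'?
print(bigram_prob("the", "bare"))   # 1.0 in this tiny corpus
print(bigram_prob("the", "bear"))   # 0.0, so 'bare' is preferred here
```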
Contribution of Linguistics
Syntax:
• An ASR system would look for a subject, verb and object. Knowing about past, present and future tense would help it to recognize the correct form of verbs.
• ASR systems, however, cannot recognize the meaning.
Semantics:
• defines the conventions for deriving the meaning of words and sentences.
Preparation: analogue and digital systems
• Periodic: the term applied to signals that repeat themselves at regular intervals. Periodic signals tend to exhibit strong peaks in their spectra.
• Period: the time it takes for a periodic signal to repeat itself; equivalently, the duration of one cycle. The period is the reciprocal of the frequency.
• Bandwidth: the difference between the highest and lowest frequencies present in a signal, or the maximum range of frequencies that can be transmitted by a system.
• Spectrum: a graph showing the frequencies present in a signal.
Worked examples (sketched in code below):
• The bandwidth of a signal extending from 100 Hz to 3400 Hz is 3300 Hz.
• For a periodic signal the frequency is the reciprocal of the period: if the period is 50 ms then the frequency is 20 Hz.
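The same arithmetic as a short sketch:

```python
# Bandwidth: highest minus lowest frequency present in the signal.
f_low, f_high = 100.0, 3400.0        # Hz
bandwidth = f_high - f_low           # 3300.0 Hz

# Frequency is the reciprocal of the period.
period = 50e-3                       # 50 ms
frequency = 1.0 / period             # 20.0 Hz

print(bandwidth, frequency)
```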



• The sampling rule states that the minimum sampling rate must equal twice the bandwidth of the signal. If the bandwidth of the signal is 6 kHz, then the sampling rate must not be less than 12 kHz.
• The quantization interval of an analogue-to-digital converter is equal to the input voltage range divided by the number of binary codewords. For a 12-bit converter there are 2^12, or 4096, codewords. Hence, for a 5-volt input range, the quantization interval of this converter is 5/4096 volts, or approximately 1 millivolt.
• The peak quantization noise is generally taken to be equal to half the quantization interval, so in this case the peak noise will be 0.5 millivolts. (These figures are recomputed in the sketch below.)
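A short sketch reproducing the sampling and quantization figures above:

```python
# Sampling rule: minimum sampling rate is twice the signal bandwidth.
bandwidth_hz = 6_000
min_sampling_rate = 2 * bandwidth_hz       # 12 000 samples per second

# Quantization interval: input range divided by number of codewords.
bits = 12
input_range_v = 5.0                        # 5-volt input range
codewords = 2 ** bits                      # 4096
q_interval_v = input_range_v / codewords   # ~0.00122 V, about 1 mV

# Peak quantization noise: half the quantization interval.
peak_noise_v = q_interval_v / 2            # ~0.5 mV

print(min_sampling_rate, q_interval_v, peak_noise_v)
```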

Fourier analysis is the process determining
the frequency components of a periodic
signal (or mathematical function), generally
expressed in the form of an infinite
trigonometric series of sine and cosine
terms. The resulting spectrum is termed a
line spectrum.

the result of combining two sinewaves,
hence the spectrum displays two peaks at
the frequencies corresponding to these
sinewaves

a rectangular ‘pulse’ – a short burst of
energy. The corresponding spectrum has a
peak at 0 Hz and a decaying series of
peaks. Unlike the other spectra, this one is
not a line spectrum, but is a ‘continuous’
spectrum
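A minimal sketch of the two-sinewave case using NumPy's FFT; the component frequencies and sample rate are illustrative.

```python
import numpy as np

rate = 8000                                   # samples per second
t = np.arange(rate) / rate                    # one second of time points

# Sum of two sinewaves at 300 Hz and 1200 Hz.
signal = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

spectrum = np.abs(np.fft.rfft(signal))        # magnitude spectrum
freqs = np.fft.rfftfreq(len(signal), d=1/rate)

# The two largest peaks sit at the two component frequencies.
peaks = freqs[np.argsort(spectrum)[-2:]]
print(sorted(peaks))                          # [300.0, 1200.0]
```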
Speech recognition
• Stage 1 consists of capturing a digital representation of the spoken word. This representation is referred to as the waveform.
• Stage 2 converts the waveform into a series of elemental sound units, referred to as phonemes, so as to classify the word(s) prior to recognition.
• Stage 3 uses various forms of mathematical analysis to estimate the most likely word consistent with the series of recognized phonemes (the three stages are sketched below).
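The three stages as a skeletal pipeline; every function body here is a placeholder standing in for the real processing, and the phoneme symbols and lexicon entries are illustrative.

```python
def capture_waveform():
    """Stage 1: digital representation of the spoken word (placeholder samples)."""
    return [0.0, 0.1, 0.3, 0.2, -0.1]

def waveform_to_phonemes(waveform):
    """Stage 2: convert the waveform into elemental sound units (placeholder)."""
    return ["DH", "EH", "R"]

def phonemes_to_word(phonemes, lexicon):
    """Stage 3: estimate the most likely word given the phoneme series
    (here scored by simple position-wise overlap)."""
    def overlap(entry):
        word, entry_phonemes = entry
        return sum(a == b for a, b in zip(entry_phonemes, phonemes))
    return max(lexicon, key=overlap)[0]

lexicon = [("there", ["DH", "EH", "R"]), ("hair", ["HH", "EH", "R"])]
print(phonemes_to_word(waveform_to_phonemes(capture_waveform()), lexicon))  # there
```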
Speech recognition: capturing
• Stage 1 consists of capturing a digital representation of the spoken word. This representation is referred to as the waveform.
• The capture stage must also cope with extraneous noise.
Phonemes: the elemental parts of speech
• English has 42 phonemes.
• The recognizer detects the phonemes in order to match them to the letters of the written word.
Spectrograms: time and frequency combined
• The spectrogram, or voice-print, was first developed in the 1930s. A sample three-dimensional (3-D) spectrogram generated by SpeechView is shown in Figure 6.
• The top part shows the sampled waveform.
• The bottom part of the figure is a combination of amplitude and frequency information. The vertical scale corresponds to frequency, whilst the darkness of grey tone is related to amplitude. (A sketch of computing a spectrogram follows.)
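A minimal sketch of computing a spectrogram, assuming SciPy is available; the test signal (a tone whose frequency rises over time) is illustrative.

```python
import numpy as np
from scipy.signal import spectrogram

rate = 16000
t = np.arange(rate) / rate

# A test tone whose frequency rises over time, so the
# spectrogram shows a band sweeping upwards.
signal = np.sin(2 * np.pi * (200 + 400 * t) * t)

# Frequencies (Hz), times (s), and amplitude per time-frequency cell.
freqs, times, amplitude = spectrogram(signal, fs=rate, nperseg=320)
print(freqs.shape, times.shape, amplitude.shape)
```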
Phoneme character recognition
The process of analysis (framing is sketched after this list):
• The speech is digitized.
• It is analysed in frames of 5–20 ms duration, with successive frames spaced 10 ms apart.
• For each frame the spectrum is calculated, and a number of spectral features (such as the formant frequencies) are extracted and stored.
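A minimal sketch of this framing step, assuming a 16 kHz signal; 20 ms frames with a 10 ms hop match the figures above.

```python
import numpy as np

def frame_signal(signal, rate=16000, frame_ms=20, hop_ms=10):
    """Split speech into overlapping analysis frames:
    20 ms long, successive frames spaced 10 ms apart."""
    frame_len = int(rate * frame_ms / 1000)
    hop_len = int(rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

frames = frame_signal(np.random.randn(16000))   # one second of audio
spectra = np.abs(np.fft.rfft(frames, axis=1))   # spectrum per frame
print(frames.shape, spectra.shape)              # (99, 320) (99, 161)
```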

Word recognition
• The prior is the set of words already known to the recognizer.
• The recognized phonemes are searched for within this prior set (a lookup sketch follows).
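A minimal sketch of searching the prior; the phoneme symbols and entries are illustrative only.

```python
# The 'prior': a fixed set of known words and their phoneme
# sequences (entries are illustrative).
prior = {
    ("HH", "EH", "R"): "hair",
    ("DH", "EH", "R"): "there",
    ("B", "EH", "R"):  "bear",
}

def lookup(phonemes):
    """Search the recognized phoneme sequence within the prior set."""
    return prior.get(tuple(phonemes), "<unknown>")

print(lookup(["B", "EH", "R"]))   # -> "bear"
```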

Summary
• Automatic speech recognition systems fall into two primary categories: isolated-word recognizers and continuous-speech recognizers. They can be further categorized as speaker-independent small-vocabulary recognizers or large-vocabulary speaker-enrolled recognizers.
• The elemental sounds of speech are called phonemes, and in the case of English correspond to the vowels and consonants of the language. There are about 42 phonemes in the English language and they constitute the 'alphabet' of speech recognition.
• There are two views of a sound signal: a time-domain view that describes how the signal amplitude varies over time, and a frequency-domain view that defines the amplitude of the frequencies present in the signal over a specified interval of time. The time- and frequency-domain representations can be combined into a spectrogram, a graph that displays the changes in frequency and amplitude over time.
• Speech analysis is based on characterizing the phonetic content over short-duration frames, typically 5 to 20 milliseconds long. Several frames are combined into a context window that captures the co-articulation effects occurring at the transitions between phonemes. The data from the context window is processed to calculate the probability that the frame's content falls into one of a predefined set of phoneme categories. The time-sequence of calculated phoneme categories is compared with the known sequence corresponding to the word(s) to be recognized. The final result is an estimated probability of word recognition (the context-window step is sketched below).
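A minimal sketch of building context windows from analysis frames; the window width and the number of features per frame are illustrative choices, not values from the module.

```python
import numpy as np

def context_windows(frames, context=2):
    """Stack each frame with its neighbours (context frames on each side)
    so the window captures co-articulation at phoneme transitions."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[i : i + 2 * context + 1].ravel()
                     for i in range(len(frames))])

frames = np.random.randn(99, 13)   # e.g. 13 spectral features per frame
windows = context_windows(frames)
print(windows.shape)               # (99, 65): 5 frames x 13 features
```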