Announcements
• Optional midterm
  • On Tuesday 11/21 in class
  • Review session 11/19, 7-9pm, in 306 Soda
• Projects
  • 3.2 due 11/9
  • 3.3 due 11/15
  • 3.4 due 11/27
• Contest
  • Pacman contest details on web site this week
  • Entries due 12/3

CS 188: Artificial Intelligence
Fall 2006
Lecture 21: Speech / Viterbi
11/09/2006
Dan Klein – UC Berkeley
Hidden Markov Models
• Hidden Markov models (HMMs)
  • Underlying Markov chain over states X
  • You observe outputs (effects) E at each time step
  • As a Bayes' net:
[Figure: chain X1 → X2 → X3 → X4 → X5, with one observation Ei hanging off each Xi]
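
Written out, the Bayes' net above encodes the standard HMM factorization of the joint distribution:

    P(X_{1:T}, E_{1:T}) = P(X_1) \prod_{t=2}^{T} P(X_t \mid X_{t-1}) \prod_{t=1}^{T} P(E_t \mid X_t)
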
Speech Recognition
• [demos]
• Several questions you can answer with HMMs:
  • Last time: filtering to track a belief about the current X given the evidence so far (one update step is sketched below)
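
As a refresher on that filtering update, here is a minimal sketch assuming small tabular distributions; the array names are illustrative, not from the course code (trans[x, y] = P(X_{t+1}=y | X_t=x), emit[x, e] = P(E=e | X=x)):

    import numpy as np

    def filter_step(belief, trans, emit, e):
        """One step of HMM filtering: B(X_t) -> B(X_{t+1}) given new evidence e."""
        predicted = trans.T @ belief        # time update: sum_x P(y | x) B(x)
        updated = emit[:, e] * predicted    # evidence update: weight by P(e | y)
        return updated / updated.sum()      # renormalize to a distribution
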
Digitizing Speech
• Frequency gives pitch; amplitude gives volume
• Sampling at ~8 kHz for phone, ~16 kHz for mic (kHz = 1000 cycles/sec)

Speech in an Hour
• Speech input is an acoustic wave form
[Figure: waveform of "speech lab" segmented into s, p, ee, ch, l, a, b, with a zoomed view of the "l" to "a" transition]
Graphs from Simon Arnfield's web tutorial on speech, Sheffield:
http://www.psyc.leeds.ac.uk/research/cogn/speech/tutorial/
She just had a baby
• What can we learn from a wavefile?
• Vowels are voiced, long, loud
• Length in time = length in space in the waveform picture
• Voicing: regular peaks in amplitude
• When stops are closed: no peaks, silence
• Peaks = voicing: .46 to .58 (vowel [i]), from .65 to .74 seconds (vowel [ə]), and so on
• Silence of stop closure (1.06 to 1.08 for the first [b], 1.26 to 1.28 for the second [b])
• Fricatives like [ʃ]: intense irregular pattern; see .33 to .46
[Figure: waveform of "She just had a baby", amplitude vs. time]

Spectral Analysis
• Fourier transform of the wave, displayed as a spectrogram
  • Darkness indicates energy at each frequency
[Figure: spectrogram of "speech lab" (s, p, ee, ch, l, a, b), frequency vs. time]
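
A minimal sketch of how such a spectrogram is computed (a short-time Fourier transform), assuming NumPy; the window and hop sizes are typical choices, not values from the slides:

    import numpy as np

    def spectrogram(wave, fs, frame_len=0.025, hop=0.010):
        """Magnitude of the Fourier transform of each short time slice."""
        n, step = int(frame_len * fs), int(hop * fs)
        frames = [wave[i:i + n] * np.hanning(n)          # windowed slice
                  for i in range(0, len(wave) - n + 1, step)]
        return np.array([np.abs(np.fft.rfft(f)) for f in frames])  # (time, freq)
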
Adding 100 Hz + 1000 Hz Waves
[Figure: sum of a 100 Hz and a 1000 Hz sine wave over 0.05 s of time; amplitude ranges from about –0.9654 to 0.99]

Spectrum
• Frequency components (100 and 1000 Hz) on the x-axis
[Figure: amplitude vs. frequency in Hz, with peaks at 100 and 1000]
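
The figure is easy to reproduce; a sketch assuming NumPy, an arbitrary 16 kHz sampling rate, and guessed relative amplitudes for the two components:

    import numpy as np

    fs = 16000                                    # assumed sampling rate (Hz)
    t = np.arange(0, 0.05, 1.0 / fs)              # 0.05 s of time, as in the figure
    wave = 0.5 * np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

    spectrum = np.abs(np.fft.rfft(wave))          # magnitude spectrum
    freqs = np.fft.rfftfreq(len(wave), 1.0 / fs)
    print(np.sort(freqs[np.argsort(spectrum)[-2:]]))   # -> [ 100. 1000.]
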
Back to Spectra
• The spectrum represents these frequency components
• Computed by the Fourier transform, an algorithm which separates out each frequency component of a wave
• x-axis shows frequency, y-axis shows magnitude (in decibels, a log measure of amplitude)
• Peaks at 930 Hz, 1860 Hz, and 3020 Hz
Resonances of the vocal tract
• The human vocal tract acts as an open tube: closed at the glottal end, open at the lip end, length about 17.5 cm
• Air in a tube of a given length will tend to vibrate at the resonance frequency of that tube
• Constraint: the pressure differential should be maximal at the (closed) glottal end and minimal at the (open) lip end
[Figure from W. Barry's Speech Science slides: tube model with closed and open ends, length 17.5 cm]
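
A quick check of the numbers: a tube closed at one end supports quarter-wavelength resonances, f_n = (2n-1)·c / 4L. Assuming c ≈ 350 m/s (a common figure for the warm, moist air of the vocal tract) and L = 0.175 m, this predicts the classic neutral-vowel formants:

    c, L = 350.0, 0.175                     # speed of sound (m/s), tract length (m)
    for n in (1, 2, 3):
        print((2 * n - 1) * c / (4 * L))    # -> 500.0, 1500.0, 2500.0 Hz
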
Vowel Formants
[Figure from Mark Liberman's website: vowel [i] sung at successively higher pitch]
Why these Peaks?
• Articulatory facts:
  • Vocal cord vibrations create harmonics
  • The mouth is a selective amplifier
  • Depending on the shape of the mouth, some harmonics are amplified more than others
[Figures from Ratree Wayland's slides]
How to read spectrograms
• bab: closure of the lips lowers all formants, so there is a rapid increase in all formants at the beginning of "bab"
• dad: the first formant increases, but F2 and F3 fall slightly
• gag: F2 and F3 come together; this is a characteristic of velars. Formant transitions take longer in velars than in alveolars or labials
[Spectrogram figures from Ladefoged, "A Course in Phonetics"]

Acoustic Feature Sequence
• Time slices are translated into acoustic feature vectors (~39 real numbers per slice): …, e12, e13, e14, e15, e16, …
• These are the observations; now we need the hidden states X
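
A hypothetical sketch of that slicing step, assuming NumPy. Real recognizers use MFCC-style features (commonly 13 cepstral coefficients plus deltas and delta-deltas, which is where the ~39 numbers per slice come from); here truncated log-magnitude spectra stand in for them:

    import numpy as np

    def feature_sequence(wave, fs, frame_len=0.025, hop=0.010, dims=39):
        """One ~39-dimensional real vector e_t per 10 ms time slice."""
        n, step = int(frame_len * fs), int(hop * fs)
        feats = []
        for i in range(0, len(wave) - n + 1, step):
            frame = wave[i:i + n] * np.hanning(n)
            logspec = np.log(np.abs(np.fft.rfft(frame)) + 1e-8)
            feats.append(logspec[:dims])   # stand-in for a real feature vector
        return np.array(feats)             # shape (T, dims): the observations e_{1:T}
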
HMMs for Speech
• P(E|X) encodes which acoustic vectors are appropriate for each phoneme (each kind of sound)
• P(X|X') encodes how sounds can be strung together

State Space
• We will have one state for each sound in each word
• From some state x, we can only:
  • Stay in the same state (e.g. speaking slowly)
  • Move to the next position in the word
  • At the end of the word, move to the start of the next word
• We build a little state graph for each word and chain them together to form our state space X (sketched below)
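
A minimal sketch of that construction, assuming a toy lexicon mapping words to sound sequences (names hypothetical):

    def build_state_graph(lexicon):
        """lexicon: dict word -> list of sounds, e.g. {'cat': ['k', 'ae', 't']}.
        Returns state -> successor states, where a state is (word, position)."""
        successors = {}
        for word, sounds in lexicon.items():
            for i in range(len(sounds)):
                state = (word, i)
                succ = [state]                         # stay (speaking slowly)
                if i + 1 < len(sounds):
                    succ.append((word, i + 1))         # next position in the word
                else:
                    succ += [(w, 0) for w in lexicon]  # word end: start any next word
                successors[state] = succ
        return successors
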
ASR Lexicon: Markov Models

Markov Process with Bigrams
[Figure from Huang et al., page 618]
Decoding
• While there are some practical issues, finding the words given the acoustics is an HMM inference problem
• We want to know which state sequence x_{1:T} is most likely given the evidence e_{1:T}:

    x^*_{1:T} = \arg\max_{x_{1:T}} P(x_{1:T} \mid e_{1:T}) = \arg\max_{x_{1:T}} P(x_{1:T}, e_{1:T})
Viterbi Algorithm
• Question: what is the most likely state sequence given the observations?
• Slow answer: enumerate all possibilities
• Better answer: cached incremental version, as in the sketch below
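
A minimal tabular sketch of that cached incremental version, assuming NumPy arrays for the model (real recognizers run this in log space over far larger state spaces):

    import numpy as np

    def viterbi(init, trans, emit, obs):
        """Most likely state sequence given observations.
        init[x] = P(X_1 = x); trans[x, y] = P(X_{t+1} = y | X_t = x);
        emit[x, e] = P(E_t = e | X_t = x); obs = observation indices e_1..e_T."""
        T, S = len(obs), len(init)
        m = np.zeros((T, S))                 # m[t, x] = best score of a path ending in x
        back = np.zeros((T, S), dtype=int)   # back-pointers for recovering the path
        m[0] = init * emit[:, obs[0]]
        for t in range(1, T):
            scores = m[t - 1][:, None] * trans * emit[:, obs[t]][None, :]
            back[t] = scores.argmax(axis=0)  # best predecessor of each state
            m[t] = scores.max(axis=0)
        path = [int(m[-1].argmax())]         # best final state, then walk backwards
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]
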
Viterbi with 2 Words + Unif. LM
[Figure from Huang et al., page 612]

Next Class
• Final part of the course: machine learning
• We'll start talking about how to learn model parameters (like probabilities) from data
• One of the most heavily used technologies in all of AI