Hidden Markov Models in NLP
Leah Spontaneo
Overview
Introduction
Deterministic Model
Statistical Model
Speech Recognition
Discrete Markov Processes
Hidden Markov Model
Elements of HMMs
Output Sequence
Basic HMM Problems
HMM Operations
Forward-Backward
Viterbi Algorithm
Baum-Welch
Different Types of HMMs
Speech Recognition using HMMs
Isolated Word Recognition
Limitations of HMMs
Introduction
Real-world processes generally produce observable outputs that can be characterized as signals
Signals can be discrete or continuous
The signal source can be stationary or nonstationary
Signals can be pure or corrupted
Introduction
Signals are characterized by signal models
Signal models let us process the signal to provide a desired output
They let us learn about the signal source without having the source available
They work well in practice
Signal models can be deterministic or statistical
Deterministic Model
Exploits some known, specific properties of the signal
Examples: a sine wave, a sum of exponentials, chaos theory
Specification of the signal model is generally straightforward
The task is to determine/estimate the values of the parameters of the signal model
Statistical Model
Statistical models try to characterize the statistical properties of the signal
Examples: Gaussian processes, Markov processes, hidden Markov processes
The signal is characterized as a parametric random process
The parameters of the stochastic process can be determined/estimated in a precise, well-defined manner
Speech Recognition
The basic theory of hidden Markov models in speech recognition was originally published in the 1960s by Baum and colleagues
It was implemented in speech-processing applications in the 1970s by Baker at CMU and by Jelinek at IBM
Discrete Markov Processes
A Markov chain contains a set of N distinct states: S_1, S_2, ..., S_N
At discrete time intervals the state changes, and the state at time t is q_t
For a first-order Markov chain, the probabilistic description is
P[q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, ...] = P[q_t = S_j | q_{t-1} = S_i]
Discrete Markov Processes
The processes considered here are independent of time, leading to the transition probabilities
a_ij = P[q_t = S_j | q_{t-1} = S_i], 1 ≤ i, j ≤ N
a_ij ≥ 0
Σ_j a_ij = 1 (for each i)
The previous stochastic process is considered an observable Markov model
The output of the process is the set of states at each time interval, where each state corresponds to an observable event
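To make this concrete, here is a minimal Python/NumPy sketch (not part of the original slides) of sampling a state sequence from a hypothetical 3-state observable Markov model; the matrix A and vector pi are made-up example values:

```python
import numpy as np

# Hypothetical 3-state Markov chain (states S1, S2, S3).
# Each row of A sums to 1: A[i, j] = a_ij = P[q_t = S_j | q_{t-1} = S_i].
A = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.3, 0.5],
])
pi = np.array([0.6, 0.3, 0.1])  # initial state distribution

rng = np.random.default_rng(0)

def sample_chain(A, pi, T):
    """Sample a state sequence q_1..q_T from an observable Markov model."""
    states = [rng.choice(len(pi), p=pi)]           # draw q_1 from pi
    for _ in range(T - 1):
        states.append(rng.choice(A.shape[1], p=A[states[-1]]))  # q_{t+1} from row of A
    return states

print(sample_chain(A, pi, 10))
```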
Hidden Markov Model
A Markov process determines future probabilities from recent values
A hidden Markov model is a Markov process whose state is unobservable
An HMM is specified by three sets of probabilities:
Initial probabilities
Transition probabilities
Emission probabilities
Hidden Markov Model
The HMM includes the case where the observation is a probabilistic function of the state
It is a doubly embedded stochastic process with an underlying unobservable stochastic process
The unobservable process can only be observed through another set of stochastic processes that produce the observations
Elements of HMMs
N, the number of states in the model
Although the states are hidden, there is often physical significance attached to them
Individual states are denoted S = {S_1, S_2, ..., S_N}
The state at time t is denoted q_t
M, the number of distinct observation symbols per state
Elements of HMMs
The observation symbols correspond to the output of the system being modeled
Individual symbols are denoted V = {v_1, v_2, ..., v_M}
The state transition probability distribution A = {a_ij}, where
a_ij = P[q_t = S_j | q_{t-1} = S_i], 1 ≤ i, j ≤ N
In the special case where any state can reach any other state, a_ij > 0 for all i, j
Elements of HMMs
The observation probability distribution in state j, B = {b_j(k)}, where
b_j(k) = P[v_k at t | q_t = S_j], 1 ≤ j ≤ N, 1 ≤ k ≤ M
The initial state distribution π = {π_i}, where
π_i = P[q_1 = S_i], 1 ≤ i ≤ N
With appropriate values for N, M, A, B, and π, the HMM can generate an output sequence
O = O_1 O_2 ... O_T, where each O_t is an observation symbol from V and T is the total number of observations in the sequence
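As a concrete illustration (not from the slides), a minimal sketch of the parameter set λ = (A, B, π) for a hypothetical HMM with N = 2 states and M = 3 observation symbols:

```python
import numpy as np

# Hypothetical HMM with N = 2 hidden states and M = 3 observation symbols.
N, M = 2, 3

# A[i, j] = a_ij = P[q_t = S_j | q_{t-1} = S_i]; each row sums to 1.
A = np.array([[0.8, 0.2],
              [0.4, 0.6]])

# B[j, k] = b_j(k) = P[v_k at t | q_t = S_j]; each row sums to 1.
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])

# pi[i] = P[q_1 = S_i]; sums to 1.
pi = np.array([0.7, 0.3])

# The complete model is often written lambda = (A, B, pi).
model = (A, B, pi)
```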
Output Sequence
1) Choose an initial state q_1 = S_i according to π
2) Set t = 1
3) Choose O_t = v_k according to the emission probabilities for S_i, b_i(k)
4) Transition to a new state q_{t+1} = S_j according to the transition probabilities for S_i, a_ij
5) Set t = t + 1 and return to step 3 if t < T; otherwise end the sequence (a code sketch of this procedure follows this slide)
HMMs have been used successfully for acoustic modeling in speech recognition
They have also been applied to language modeling and POS tagging
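A sketch of the generation procedure above in Python/NumPy, reusing a hypothetical 2-state, 3-symbol model; the parameter values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-state, 3-symbol HMM.
A = np.array([[0.8, 0.2], [0.4, 0.6]])            # transition probabilities a_ij
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])  # emission probabilities b_j(k)
pi = np.array([0.7, 0.3])                         # initial state distribution

def generate_sequence(A, B, pi, T):
    """Generate an observation sequence O_1..O_T following steps 1-5 above."""
    observations = []
    state = rng.choice(len(pi), p=pi)              # step 1: draw q_1 from pi
    for _ in range(T):                             # steps 2 and 5: loop over t
        obs = rng.choice(B.shape[1], p=B[state])   # step 3: emit O_t from b_state
        observations.append(obs)
        state = rng.choice(A.shape[1], p=A[state]) # step 4: move to q_{t+1} via a_state
    return observations

print(generate_sequence(A, B, pi, T=10))
```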
Output Sequence
The procedure can be used to generate a sequence of observations and to model how an observation sequence was produced by an HMM
The cost of determining the probability that the system is in state S_i at time t is O(tN²)
Basic HMM Problems
HMMs can find the state sequence that most likely produced a given output
The sequence of states is most efficiently computed using the Viterbi algorithm
Maximum likelihood estimates of the probability sets are determined using the Baum-Welch algorithm
HMM Operations
Calculating P(q_t = S_i | O_1 O_2 ... O_t) uses the forward-backward algorithm
Computing Q* = argmax_Q P(Q | O) requires the Viterbi algorithm
Learning λ* = argmax_λ P(O | λ) uses the Baum-Welch algorithm
The complexity of all three algorithms is O(TN²), where T is the length of the observation sequence and N is the number of states
Evaluation
Scores how well a given model matches an observation sequence
Extremely useful for deciding which model, among many, best represents a set of observations
Forward-Backward
Given observations O_1 O_2 ... O_T
α_t(i) = P(O_1 O_2 ... O_t, q_t = S_i | λ) is the probability of observing the first t observations and being in state S_i at time t
Initialization: α_1(i) = π_i b_i(O_1)
Induction: α_{t+1}(j) = [Σ_i α_t(i) a_ij] b_j(O_{t+1})
We can now cheaply compute α_t(i) = P(O_1 O_2 ... O_t, q_t = S_i | λ)
P(O_1 O_2 ... O_t) = Σ_i α_t(i)
P(q_t = S_i | O_1 O_2 ... O_t) = α_t(i) / Σ_j α_t(j)
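A minimal sketch of the forward (α) recursion; the backward pass is analogous. The model values and the integer-coded observation sequence are hypothetical:

```python
import numpy as np

def forward(A, B, pi, O):
    """Forward pass: alpha[t, i] = P(O_1..O_{t+1}, q_{t+1} = S_i | model) (0-based t)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                      # alpha_1(i) = pi_i * b_i(O_1)
    for t in range(1, T):
        # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(O_{t+1})
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    return alpha

# Example usage with a hypothetical 2-state, 3-symbol model:
A = np.array([[0.8, 0.2], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.7, 0.3])
O = [0, 2, 1, 0]                                    # observation symbols coded 0..M-1

alpha = forward(A, B, pi, O)
print("P(O) =", alpha[-1].sum())                    # P(O_1..O_T) = sum_i alpha_T(i)
print("P(q_T = S_i | O) =", alpha[-1] / alpha[-1].sum())
```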
Forward-Backward
The key is that since there are only N states, all possible state sequences merge into these N nodes
At t = 1, only the values of α_1(i), 1 ≤ i ≤ N, require calculation
For t = 2, 3, ..., T we calculate α_t(j), 1 ≤ j ≤ N, where each calculation involves only the N previous values α_{t-1}(i)
State Sequence
There is no ‘correct’ state sequence except for degenerate models
An optimality criterion is used instead to determine the best possible state sequence
There are several reasonable optimality criteria, so the chosen criterion depends on the intended use of the uncovered sequence
Used for continuous speech recognition
Viterbi Algorithm
Finds the most likely sequence of hidden states given the observations, using dynamic programming
Makes three assumptions about the model:
The model must be a state machine
Transitions between states are marked by a metric
Events must be cumulative over the path
The path history must be kept in memory to recover the most probable path at the end
Viterbi Algorithm
Used for speech recognition, where the hidden state is part of word formation
Given a specific signal, it deduces the most probable word based on the model
To find the best state sequence Q = {q_1, q_2, ..., q_T} for the observation sequence O = {O_1, O_2, ..., O_T} we define
Viterbi Algorithm
δ_t(i) is the best score along a single path at time t, accounting for the first t observations and ending in state S_i
The inductive step is
δ_{t+1}(j) = [max_i δ_t(i) a_ij] b_j(O_{t+1})
Viterbi Algorithm
Similar to the forward calculation of the forward-backward algorithm
The major difference is the maximization over the previous states instead of summing the probabilities
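A compact sketch of the Viterbi recursion with backpointers (probabilities kept in linear space for brevity; log-space is usual in practice). The model values are the same hypothetical ones used above:

```python
import numpy as np

def viterbi(A, B, pi, O):
    """Return the most likely state sequence Q* = argmax_Q P(Q | O) for an HMM."""
    T, N = len(O), len(pi)
    delta = np.zeros((T, N))           # delta[t, i]: best score ending in S_i at time t
    psi = np.zeros((T, N), dtype=int)  # psi[t, j]: backpointer (best previous state)
    delta[0] = pi * B[:, O[0]]
    for t in range(1, T):
        # delta_{t+1}(j) = [max_i delta_t(i) a_ij] * b_j(O_{t+1})
        scores = delta[t - 1][:, None] * A        # scores[i, j]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, O[t]]
    # Backtrack through the stored path history.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

A = np.array([[0.8, 0.2], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.7, 0.3])
print(viterbi(A, B, pi, [0, 2, 1, 0]))
```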
Optimizing Parameters
A training sequence is used to train the HMM and adjust the model parameters
The training problem is crucial for most HMM applications
It allows us to optimally adapt the model parameters to observed training data, creating good models for real data
Baum-Welch
A special case of expectation maximization (EM)
EM has two main steps:
Compute the expectation of the log-likelihood using current estimates of the latent variables
Compute parameters that maximize the expected log-likelihood from the first step
Baum-Welch is a form of generalized EM, which allows the algorithm to converge to a local optimum
Baum-Welch
Also known as the forward-backward algorithm
Baum-Welch uses two main steps:
The forward and backward probabilities are calculated for each state in the HMM
The transition and emission probabilities are determined and divided by the probability of the whole model, based on the results of the first step
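A rough sketch of these two steps for the transition probabilities only, assuming the forward function from the earlier sketch; a hypothetical backward pass is included, and the emission and initial probabilities would be re-estimated analogously:

```python
import numpy as np

def backward(A, B, O):
    """Backward pass: beta[t, i] = P(O_{t+2}..O_T | q_{t+1} = S_i) (0-based t)."""
    T, N = len(O), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return beta

def reestimate_A(A, B, pi, O, forward):
    """One Baum-Welch re-estimation of the transition matrix A."""
    alpha, beta = forward(A, B, pi, O), backward(A, B, O)
    # xi[t, i, j]: expected count of the transition S_i -> S_j at time t, given O.
    xi = alpha[:-1, :, None] * A[None] * (B[:, O[1:]].T * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)      # normalize by P(O) per time step
    gamma = xi.sum(axis=2)                        # gamma[t, i] = P(q_t = S_i | O)
    return xi.sum(axis=0) / gamma.sum(axis=0)[:, None]  # new a_ij
```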
Baum-Welch
Cons: there are many local optima
Pros: the local optima are often adequate models of the data
EM requires the number of states to be given in advance
Sometimes HMMs require some transitions to be forbidden; for this, set a_ij = 0 in the initial estimate λ^(0)
Different Types of HMM
Left-right model
As time increases, the state index increases or stays the same
No transitions are allowed to states whose indices are lower than that of the current state (an example matrix follows this slide)
Cross-coupled connection of two parallel left-right models
Obeys the left-right constraints on transition probabilities but provides more flexibility
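For illustration (not from the slides), a hypothetical 4-state left-right transition matrix: backward transitions are forbidden and jumps are limited to two states ahead:

```python
import numpy as np

# Hypothetical 4-state left-right HMM transition matrix.
# a_ij = 0 for j < i (no backward transitions), and here also for j > i + 2.
A_left_right = np.array([
    [0.5, 0.3, 0.2, 0.0],
    [0.0, 0.6, 0.3, 0.1],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 1.0],   # final (absorbing) state
])
assert np.allclose(A_left_right.sum(axis=1), 1.0)  # each row is a distribution
```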
Speech Recognition using HMMs
Feature analysis: a spectral/temporal analysis of the speech signal is performed to provide the observation vectors used to train the HMMs
Unit matching: each unit is characterized by an HMM with parameters estimated from speech data
This provides the likelihoods of matches of all sequences of speech-recognition units to the unknown input
Isolated Word Recognition
There is a vocabulary of V words to be recognized, each modeled by a distinct HMM
For each word there is a training set of K occurrences of the spoken word, where each occurrence is an observation sequence
For each word v, build an HMM by estimating the model parameters that optimize the likelihood of that word's training observations
Isolated Word Recognition
For each unknown word to be recognized, measure the observation sequence O = {O_1, O_2, ..., O_T} via feature analysis of the speech corresponding to the word
Calculate the model likelihoods for all possible models, then select the word whose model has the highest likelihood
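A sketch of this recognition step, assuming a forward function like the one above and a hypothetical word_models dictionary mapping each vocabulary word to its trained (A, B, π):

```python
import numpy as np

def log_likelihood(model, O, forward):
    """Score log P(O | lambda_v) for one word model using the forward pass."""
    A, B, pi = model
    return np.log(forward(A, B, pi, O)[-1].sum())

def recognize(word_models, O, forward):
    """Pick the vocabulary word whose HMM gives the highest likelihood for O."""
    return max(word_models, key=lambda w: log_likelihood(word_models[w], O, forward))

# Hypothetical usage: word_models maps each word to its trained (A, B, pi).
# word_models = {"yes": (A_yes, B_yes, pi_yes), "no": (A_no, B_no, pi_no)}
# print(recognize(word_models, O, forward))
```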
Limitations of HMMs
The assumption that successive observations are independent, so that the probability of an observation sequence can be written as the product of the probabilities of individual observations
The assumption that the distributions of individual observation parameters are well represented as mixtures of autoregressive or Gaussian densities
The assumption of being in a given state at each time interval, which is inappropriate for speech sounds that can extend through several states
Questions?
References
Vogel, S., et al. "HMM-based word alignment in statistical translation." In Proceedings of the 16th Conference on Computational Linguistics (1996), pp. 836-841.
Rabiner, L. R. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE (1989), pp. 257-286.
Moore, A. W. "Hidden Markov Models." Carnegie Mellon University. https://wiki.cse.yorku.ca/course_archive/201011/F/6390/_media/hmm14.pdf
http://en.wikipedia.org/wiki/Hidden_Markov_model