CSE5230/DMS/2003/10
Data Mining - CSE5230
Hidden Markov Models (HMMs)
Lecture 10.1
Lecture Outline
- Time- and space-varying processes
- First-order Markov models
- Hidden Markov models
- Examples: coin toss experiments
- Formal definition
- Use of HMMs for classification
- The three HMM problems
  - The evaluation problem
  - The Forward algorithm
  - The Viterbi and Forward-Backward algorithms
- HMMs for web-mining
- References
Lecture 10.2
Time- and Space-varying Processes (1)
- The data mining techniques we have discussed so far have focused on the classification, prediction or characterization of single data points, e.g.:
  - Assigning a record to one of a set of classes
    » Decision trees, back-propagation neural networks, Bayesian classifiers, etc.
  - Predicting the value of a field in a record given the values of the other fields
    » Regression, back-propagation neural networks, etc.
  - Finding regions of feature space where data points are densely grouped
    » Clustering, self-organizing maps
Lecture 10.3
Time- and Space-varying Processes (2)
- In the methods we have considered so far, we have assumed that each observed data point is statistically independent of the observation that preceded it, e.g.:
  - Classification: the class of data point x_t is not influenced by the class of x_t-1 (or indeed any other data point)
  - Prediction: the value of a field for a record depends only on the values of the other fields of that record, not on values in any other records
- Several important real-world data mining problems cannot be modeled in this way.
Lecture 10.4
Time- and Space-varying Processes (3)
- We often encounter sequences of observations, where each observation may depend on the observations which preceded it
- Examples:
  - Sequences of phonemes (fundamental sounds) in speech (speech recognition)
  - Sequences of letters or words in text (text categorization, information retrieval, text mining)
  - Sequences of web page accesses (web usage mining)
  - Sequences of bases (C, G, A, T) in DNA (genome projects [human, fruit fly, etc.])
  - Sequences of pen-strokes (hand-writing recognition)
- In all these cases, the probability of observing a particular value in the sequence can depend on the values which came before it
Lecture 10.5
Example: web log
- Consider the following extract from a web log:
    xxx - - [16/Sep/2002:14:50:34 +1000] "GET /courseware/cse5230/ HTTP/1.1" 200 13539
    xxx - - [16/Sep/2002:14:50:42 +1000] "GET /courseware/cse5230/html/research_paper.html HTTP/1.1" 200 11118
    xxx - - [16/Sep/2002:14:51:28 +1000] "GET /courseware/cse5230/html/tutorials.html HTTP/1.1" 200 7750
    xxx - - [16/Sep/2002:14:51:30 +1000] "GET /courseware/cse5230/assets/images/citation.pdf HTTP/1.1" 200 32768
    xxx - - [16/Sep/2002:14:51:31 +1000] "GET /courseware/cse5230/assets/images/citation.pdf HTTP/1.1" 206 146390
    xxx - - [16/Sep/2002:14:51:40 +1000] "GET /courseware/cse5230/assets/images/clustering.pdf HTTP/1.1" 200 17100
    xxx - - [16/Sep/2002:14:51:40 +1000] "GET /courseware/cse5230/assets/images/clustering.pdf HTTP/1.1" 206 14520
    xxx - - [16/Sep/2002:14:51:56 +1000] "GET /courseware/cse5230/assets/images/NeuralNetworksTute.pdf HTTP/1.1" 200 17137
    xxx - - [16/Sep/2002:14:51:56 +1000] "GET /courseware/cse5230/assets/images/NeuralNetworksTute.pdf HTTP/1.1" 206 16017
    xxx - - [16/Sep/2002:14:52:03 +1000] "GET /courseware/cse5230/html/lectures.html HTTP/1.1" 200 9608
    xxx - - [16/Sep/2002:14:52:05 +1000] "GET /courseware/cse5230/assets/images/week03.ppt HTTP/1.1" 200 121856
    xxx - - [16/Sep/2002:14:52:24 +1000] "GET /courseware/cse5230/assets/images/week06.ppt HTTP/1.1" 200 527872
- Clearly the URL which is requested depends on the URL which was requested before
  - If the user uses the "Back" button in his/her browser, the requested URL may depend on earlier URLs in the sequence too
- Given a particular observed URL, we can calculate the probabilities of observing all the other possible URLs next
  - Note that we may even observe the same URL next
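As a small illustration (not from the original slides), these next-URL probabilities can be estimated for a first-order model simply by counting observed transitions. The sketch below assumes a plain Python list of requested URLs; the shortened page names in the example are hypothetical.

```python
from collections import Counter, defaultdict

def estimate_transitions(sequence):
    """Estimate first-order transition probabilities P(next URL | current URL)
    from a single observed sequence of page requests."""
    counts = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    return {
        url: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for url, nxts in counts.items()
    }

# Hypothetical clickstream (page names shortened for readability)
clicks = ["/cse5230/", "/html/tutorials.html", "/images/clustering.pdf",
          "/html/tutorials.html", "/html/lectures.html"]
print(estimate_transitions(clicks)["/html/tutorials.html"])
# {'/images/clustering.pdf': 0.5, '/html/lectures.html': 0.5}
```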
Lecture 10.6
First-Order Markov Models (1)
- In order to model processes such as these, we make use of the idea of states. At any time t, we consider the system to be in state q(t).
- We can consider a sequence of successive states of length T:
    q^T = (q(1), q(2), ..., q(T))
- We will model the production of such a sequence using transition probabilities:
    P(q_j(t + 1) | q_i(t)) = a_ij
  which is the probability that the system will be in state q_j at time t + 1 given that it was in state q_i at time t
Lecture 10.7
First-Order Markov Models (2)
- A model of states and transition probabilities, such as the one we have just described, is called a Markov model.
- Since we have assumed that the transition probabilities depend only on the previous state, this is a first-order Markov model
  - Higher-order Markov models are possible, but we will not consider them here.
- For example, Markov models for human speech could have states corresponding to phonemes
  - A Markov model for the word "cat" would have states for /k/, /a/, /t/ and a final silent state
Lecture 10.8
Example: Markov model for “cat”
[Diagram: a left-to-right Markov model with states /k/, /a/, /t/ and a final /silent/ state]
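To make the states-and-transitions idea concrete, here is a minimal sketch (not from the slides) of the "cat" model as a left-to-right Markov chain; the transition probabilities are invented purely for illustration.

```python
import random

# Hypothetical transition probabilities for the /k/ -> /a/ -> /t/ -> /silent/ model.
# Each state either stays put or moves on to the next phoneme.
transitions = {
    "/k/":      {"/k/": 0.3, "/a/": 0.7},
    "/a/":      {"/a/": 0.4, "/t/": 0.6},
    "/t/":      {"/t/": 0.3, "/silent/": 0.7},
    "/silent/": {"/silent/": 1.0},
}

def sample_path(start="/k/", steps=10):
    """Generate a state sequence by repeatedly sampling the next state."""
    state, path = start, [start]
    for _ in range(steps):
        nxt = random.choices(list(transitions[state]),
                             weights=transitions[state].values())[0]
        path.append(nxt)
        state = nxt
    return path

print(sample_path())  # e.g. ['/k/', '/k/', '/a/', '/t/', '/silent/', ...]
```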
Lecture 10.9
Hidden Markov Models
- In the preceding example, we have said that the states correspond to phonemes
- In a speech recognition system, however, we don't have access to phonemes – we can only measure properties of the sound produced by a speaker
- In general, our observed data does not correspond directly to a state of the model: the data corresponds to the visible states of the system
  - The visible states are directly accessible for measurement
- The system can also have internal "hidden" states, which cannot be observed directly
  - For each hidden state, there is a probability of observing each visible state
- This sort of model is called a Hidden Markov Model (HMM)
Lecture 10.10
Example: coin toss experiments
- Let us imagine a scenario where we are in a room which is divided in two by a curtain.
- We are on one side of the curtain, and on the other is a person who will carry out a procedure using coins resulting in a head (H) or a tail (T).
- When the person has carried out the procedure, they call out the result, H or T, which we record.
- This system will allow us to generate a sequence of Hs and Ts, e.g.
    HHTHTHTTHTTTTTHHTHHHHTHHHTTHHHHHHTTT
    TTTTTHTHHTHTTTTTHHTHTHHHTHTHHTTTTHHT
    TTHHTHHTTTHTHTHTHTHHHTHHTTHT...
Lecture 10.11
Example: single fair coin
- Imagine that the person behind the curtain has a single fair coin (i.e. it has equal probabilities of coming up heads or tails). This generates sequences such as
    THTHHHTTTTHHHHHTHHTTHHTTHHTHHHHHHHTTHTTHHHH
    THTTTHHTHTTHHHHTHTHHTTHTHTTHHTHTHHHTHHTHT...
- We could model the process producing the sequence of Hs and Ts as a Markov model with two states, and equal transition probabilities:
    [Diagram: two states, H and T; every transition (H→H, H→T, T→H, T→T) has probability 0.5]
- Note that here the visible states correspond exactly to the internal states – the model is not hidden
- Note also that states can transition to themselves
Lecture 10.12
Example: a fair and a biased coin
- Now let us imagine a more complicated scenario. The person behind the curtain has three coins, two fair and one biased (for example, P(T) = 0.9)
  - One fair coin and the biased coin are used to produce output – these are the "output coins". The other fair coin is used to decide whether to switch output coins.
    1. The person starts by picking an output coin at random
    2. The person tosses the coin, and calls out the result (H or T)
    3. The person tosses the other fair coin. If the result was H, the person switches output coins
    4. Go back to step 2, and repeat
- This process generates sequences like:
    HHHTTTTTTTHTHTTHTHTTTTTTHTTTTTTTTTTTTHHTHT
    TTHHTTHTHTHTTTHTTTTTTTTHTHTTTTHTTTTHTTTHTH
    HTTHTTTHTTHTTTTTTTHTTTTTHT...
- Note this looks quite different from the sequence for the fair coin example (a simulation of the procedure is sketched below).
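As a minimal sketch of the procedure described above (assuming the probabilities given in this example, with the biased coin having P(T) = 0.9), the generating process could be simulated as follows. Note which information is hidden from the observer and which is visible.

```python
import random

def toss(p_tails):
    """Toss a coin with the given probability of tails."""
    return "T" if random.random() < p_tails else "H"

def generate(length, p_tails_biased=0.9):
    """Simulate the curtain scenario: two output coins (one fair, one biased),
    with a separate fair coin deciding whether to switch output coins."""
    output_bias = random.choice([0.5, p_tails_biased])  # 1. pick an output coin at random
    hidden, visible = [], []
    for _ in range(length):
        hidden.append("Fair" if output_bias == 0.5 else "Biased")
        visible.append(toss(output_bias))                # 2. toss the output coin, call result
        if toss(0.5) == "H":                             # 3. toss the switching coin
            output_bias = p_tails_biased if output_bias == 0.5 else 0.5
    return hidden, visible

hidden, visible = generate(40)
print("".join(visible))   # the sequence we observe (visible states)
print(hidden[:5])         # the coin actually used each time (hidden states)
```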
Lecture 10.13
Example: a fair and a biased coin
- In this scenario, the visible state no longer corresponds exactly to the hidden state of the system:
  - Visible state: output of H or T
  - Hidden state: which coin was tossed
- We can model this process using an HMM:
    [Diagram: two hidden states, "Fair" and "Biased", each with self-transition and switching probabilities of 0.5; the Fair state emits H and T with probability 0.5 each, while the Biased state emits T with probability 0.9 and H with probability 0.1]
Lecture 10.14
Example: a fair and a biased coin
- We see from the diagram on the preceding slide that we have extended our model
  - The visible states are shown in blue, and the emission probabilities are shown too
- As well as internal states q_j(t) and state transition probabilities a_ij, we have visible states v_k(t) and emission probabilities b_jk:
    P(v_k(t) | q_j(t)) = b_jk
- We now have a full model: a model such as this is called a Hidden Markov Model
Lecture 10.15
HMM: formal definition (1)
- We can now give a more formal definition of a first-order Hidden Markov Model (adapted from [RaJ1986]):
  - There is a finite number of (internal) states, N
  - At each time t, a new state is entered, based upon a transition probability distribution which depends on the state at time t – 1. Self-transitions are allowed
  - After each transition is made, a symbol is output, according to a probability distribution which depends only on the current state. There are thus N such probability distributions
- If we want to build an HMM to model a real sequence, we have to solve several problems. We must estimate:
  - the number of states N
  - the transition probabilities a_ij
  - the emission probabilities b_jk
Lecture 10.16
HMM: formal definition (2)
- When an HMM is "run", it produces a sequence of symbols. This is called an observation sequence O of length T:
    O = (O(1), O(2), ..., O(T))
- In order to talk about using, building and training an HMM, we need some definitions:
    N — the number of states in the model
    M — the number of different symbols that can be observed
    Q = {q_1, q_2, ..., q_N} — the set of internal states
    V = {v_1, v_2, ..., v_M} — the set of observable symbols
    A = {a_ij} — the set of state transition probabilities
    B = {b_jk} — the set of symbol emission probabilities
    π = {π_i = P(q_i(1))} — the initial state probability distribution
    λ = (A, B, π) — a particular HMM model
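For the illustrative code sketches in this lecture it is convenient to collect these parameters in a small container. This container is an assumption of these notes, not part of the formal definition; the numbers mirror the fair/biased coin example from the earlier slides.

```python
from dataclasses import dataclass

@dataclass
class HMM:
    """A discrete HMM, lambda = (A, B, pi)."""
    states: list      # Q, the internal states
    symbols: list     # V, the observable symbols
    A: list           # A[i][j] = P(state j at t+1 | state i at t)
    B: list           # B[j][k] = P(symbol k | state j)
    pi: list          # pi[i]   = P(state i at t = 1)

# The fair/biased coin model from the earlier slides
coin_hmm = HMM(
    states=["Fair", "Biased"],
    symbols=["H", "T"],
    A=[[0.5, 0.5],
       [0.5, 0.5]],
    B=[[0.5, 0.5],     # Fair coin:   P(H), P(T)
       [0.1, 0.9]],    # Biased coin: P(H), P(T)
    pi=[0.5, 0.5],
)
```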
Lecture 10.17
Generating a Sequence
- To generate an observation sequence using an HMM, we use the following algorithm (a code sketch is given below):
    1. Set t = 1
    2. Choose an initial state q(1) according to π_i
    3. Output a symbol O(t) according to b_{q(t)k}
    4. Choose the next state q(t + 1) according to a_{q(t)q(t+1)}
    5. Set t = t + 1; if t < T, go to 3
- In applications, however, we don't actually do this. We assume that the process that generates our data does this. The problem is to work out which HMM is responsible for a data sequence.
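A minimal sketch of this generation procedure, reusing the illustrative HMM container defined on the formal-definition slide (the sampling calls are one possible implementation, not prescribed by the slides):

```python
import random

def generate_sequence(hmm, T):
    """Run an HMM forwards to produce an observation sequence of length T."""
    observations = []
    state = random.choices(range(len(hmm.states)), weights=hmm.pi)[0]   # step 2
    for t in range(T):
        symbol = random.choices(range(len(hmm.symbols)),
                                weights=hmm.B[state])[0]                 # step 3
        observations.append(hmm.symbols[symbol])
        state = random.choices(range(len(hmm.states)),
                               weights=hmm.A[state])[0]                  # step 4
    return observations

print("".join(generate_sequence(coin_hmm, 40)))  # e.g. 'TTTHTTTTHTHT...'
```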
Lecture 10.18
Use of HMMs
- We have now seen what sorts of processes can be modeled using HMMs, and how an HMM is specified mathematically
- We now consider how HMMs are actually used
- Consider the two H and T sequences we saw in the previous examples:
  - How could we decide which coin-toss system was most likely to have produced each sequence?
- To which system would you assign these sequences?
    1: TTHHHTHHHTTTTTHTTTTTTHTHTHTTHHHHTHTH
    2: TTTTTTHTHHTHTTHTTTTHHHTHHHHTTHTHTTTT
    3: THTTHTTTTHTTHHHTHTTTHTHHHHTTHTHHHTHT
    4: TTTHHTTTHHHTTTTTTTHTTTTTHHTHTTHTTTTH
- We can answer this question using a Bayesian formulation (see last week's lecture)
Lecture 10.19
Use of HMMs for classification
- HMMs are often used to classify sequences
- To do this, a separate HMM is built and trained (i.e. the parameters are estimated) for each class of sequence in which we are interested
  - e.g. we might have an HMM for each word in a speech recognition system. The hidden states would correspond to phonemes, and the visible states to measured sound features
- This gives us a set of HMMs, {λ_l}
- For a given observed sequence O, we estimate the probability that each HMM λ_l generated it:
    P(λ_l | O) = P(O | λ_l) P(λ_l) / P(O)
- We assign the sequence to the model with the highest posterior probability
  - i.e. the probability given the evidence, where the evidence is the sequence to be classified
Lecture 10.20
The three HMM problems
- If we want to apply HMMs to real problems with real data, we must solve three problems:
  - The evaluation problem: given an observation sequence O and an HMM model λ, compute P(O | λ), the probability that the sequence was produced by the model
  - The decoding problem: given an observation sequence O and a model λ, find the most likely state sequence q(1), q(2), ..., q(T) to have produced O
  - The learning problem: given a training sequence O, find a model λ, specified by parameters A, B, π, to maximize P(O | λ) (we assume for now that Q and V are known)
- The evaluation problem has a direct solution. The others are harder, and involve optimization
Lecture 10.21
The evaluation problem
- The simplest way to solve the evaluation problem is to go over all possible state sequences I_r of length T and calculate the probability that each of them produced O:
    P(O | λ) = Σ_{r=1..r_max} P(O | I_r, λ) P(I_r | λ)
  where
    P(O | I, λ) = b_{i(1)O(1)} b_{i(2)O(2)} ... b_{i(T)O(T)}
    P(I | λ) = π_{i(1)} a_{i(1)i(2)} a_{i(2)i(3)} ... a_{i(T-1)i(T)}
- While this could in principle be done, there is a problem: computational complexity. Our model λ has N states. There are thus r_max = N^T possible sequences of length T. The computational complexity is O(N^T T).
  - Even if we have small N and T, this is not feasible: for N = 5 and T = 100, there are ~10^72 computations needed!
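For a very small model this brute-force sum can be written directly, which makes the N^T blow-up easy to see. A sketch, again reusing the illustrative coin HMM from earlier (fine for T = 4, hopeless for T = 100):

```python
from itertools import product

def evaluate_brute_force(hmm, O):
    """P(O | lambda) by summing over all N**T state sequences (exponential!)."""
    obs = [hmm.symbols.index(o) for o in O]
    total = 0.0
    for path in product(range(len(hmm.states)), repeat=len(O)):   # N**T paths
        p = hmm.pi[path[0]] * hmm.B[path[0]][obs[0]]
        for t in range(1, len(O)):
            p *= hmm.A[path[t - 1]][path[t]] * hmm.B[path[t]][obs[t]]
        total += p
    return total

print(evaluate_brute_force(coin_hmm, "HTTT"))
```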
Lecture 10.22
The Forward Algorithm (1)
- Luckily, there is a solution to this problem: we do not need to do the full calculation
- We can do a recursive evaluation, using an auxiliary variable α_t(i), called the forward variable:
    α_t(i) = P(O(1), O(2), ..., O(t), q(t) = q_i | λ)
  This is the probability of the partial observation sequence (up until time t) and internal state q_i at time t, given the model λ
- Why does this help? Because in a first-order HMM, the transition and emission probabilities only depend on the current state. This makes a recursive calculation possible.
Lecture 10.23
The Forward Algorithm (2)
- We can calculate α_{t+1}(j) – the next step – using the previous one:
    α_{t+1}(j) = b_{jO(t+1)} Σ_{i=1..N} a_ij α_t(i)
- This just says that the probability of the observation sequence up to time t + 1 and being in state q_j at time t + 1 is:
  - the probability of observing symbol O(t+1) when in state q_j, b_{jO(t+1)}, times the sum of
  - the probabilities of getting to state q_j from state q_i, times the probability of the observation sequence up to time t and being in state q_i
- Note that we have to keep track of α_t(i) for all N possible internal states
Lecture 10.24
The Forward Algorithm (3)
- If we know α_T(i) for all the possible states, we can calculate the overall probability of the sequence given the model (as we wanted on slide 10.22):
    P(O | λ) = Σ_{i=1..N} α_T(i)
- We can now specify the forward algorithm, which will let us calculate the α_T(i):
    for i = 1 to N { α_1(i) = π_i b_{iO(1)} }   /* initialize */
    for t = 1 to T – 1 {
        for j = 1 to N {
            α_{t+1}(j) = b_{jO(t+1)} Σ_{i=1..N} a_ij α_t(i)
        }
    }
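A runnable sketch of the same algorithm, using the illustrative HMM container from the formal-definition slide:

```python
def forward(hmm, O):
    """Compute P(O | lambda) with the forward algorithm, O(N^2 T) time."""
    obs = [hmm.symbols.index(o) for o in O]
    N = len(hmm.states)
    # Initialization: alpha_1(i) = pi_i * b_i(O(1))
    alpha = [hmm.pi[i] * hmm.B[i][obs[0]] for i in range(N)]
    # Recursion: alpha_{t+1}(j) = b_j(O(t+1)) * sum_i a_ij * alpha_t(i)
    for t in range(1, len(O)):
        alpha = [hmm.B[j][obs[t]] * sum(hmm.A[i][j] * alpha[i] for i in range(N))
                 for j in range(N)]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)

print(forward(coin_hmm, "HTTT"))  # matches the brute-force result above
```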
Lecture 10.25
The Forward Algorithm (4)
- The forward algorithm allows us to calculate P(O | λ), and has computational complexity O(N^2 T), as can be seen from the algorithm
  - This is linear in T, rather than exponential, as the direct calculation was. This means that it is feasible
- We can use P(O | λ) in the Bayesian equation on slide 10.20 to use a set of HMMs as a classifier
  - The other terms in the equation can be estimated from the training data, as with the Naïve Bayesian Classifier
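As an illustrative sketch of this classification scheme (not from the slides): given one trained HMM per class and class priors estimated from the training data, we assign a sequence to the class with the highest posterior. The models and priors below are assumed; the fair-coin model is simply the two-state Markov chain from slide 10.12 written as an HMM with trivial emissions.

```python
def classify(O, models, priors):
    """Assign sequence O to the HMM with the highest posterior probability.
    `models` maps class name -> HMM, `priors` maps class name -> P(class).
    P(O) is the same for every class, so it can be ignored in the argmax."""
    posteriors = {name: forward(hmm, O) * priors[name]
                  for name, hmm in models.items()}
    return max(posteriors, key=posteriors.get)

# Hypothetical usage: distinguish the single-fair-coin and fair/biased-coin systems
fair_hmm = HMM(states=["H", "T"], symbols=["H", "T"],
               A=[[0.5, 0.5], [0.5, 0.5]],
               B=[[1.0, 0.0], [0.0, 1.0]],   # each state emits its own symbol
               pi=[0.5, 0.5])
models = {"fair": fair_hmm, "biased": coin_hmm}
print(classify("TTTTTTHTHHTHTTHTTTTHHHTHHHHTTHTHTTTT", models,
               priors={"fair": 0.5, "biased": 0.5}))
```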
Lecture 10.26
The Viterbi and Forward-Backward Algorithms
- The most common solution to the decoding problem uses the Viterbi algorithm [RaJ1986], which also uses partial sequences and recursion
- There is no known method for finding an optimal solution to the learning problem. The most commonly used optimization technique is known as the forward-backward algorithm, or the Baum-Welch algorithm [RaJ1986, DHS2000]. It is a generalized expectation-maximization algorithm. See the references for details.
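The slides do not spell out the Viterbi recursion, but as a hedged sketch of the standard formulation: it mirrors the forward algorithm with the sum replaced by a max, plus back-pointers to recover the most likely state sequence. Again the illustrative coin HMM from earlier is assumed.

```python
def viterbi(hmm, O):
    """Most likely hidden state sequence for O (standard Viterbi recursion)."""
    obs = [hmm.symbols.index(o) for o in O]
    N = len(hmm.states)
    delta = [hmm.pi[i] * hmm.B[i][obs[0]] for i in range(N)]
    backptr = []
    for t in range(1, len(O)):
        step, new_delta = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * hmm.A[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * hmm.A[best_i][j] * hmm.B[j][obs[t]])
        delta = new_delta
        backptr.append(step)
    # Trace back from the most probable final state
    state = max(range(N), key=lambda i: delta[i])
    path = [state]
    for step in reversed(backptr):
        state = step[state]
        path.append(state)
    return [hmm.states[i] for i in reversed(path)]

print(viterbi(coin_hmm, "HTTTTTTTTH"))  # e.g. ['Fair', 'Biased', ...]
```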
Lecture 10.27
HMMs for web-mining (1)
- HMMs can be used to analyse the clickstreams that web users leave in the log files of web servers (see slide 10.6)
- Ypma and Heskes (2002) report the application of HMMs to:
  - Web page categorization
  - User clustering
- They applied their system to real-world web logs from a large Dutch commercial web site [YpH2002]
Lecture 10.28
HMMs for web-mining (2)
- A mixture of HMMs is used to learn page categories (the hidden state variables) and inter-category transition probabilities
  - The page URLs are treated as the observations
- Web surfer types are modeled by HMMs
- A clickstream is modeled as a mixture of HMMs, to account for several types of user being present at once
Lecture 10.29
HMMs for web-mining (3)
- When applied to data from a commercial web site, page categories were learned as hidden states
  - Inspection of the emission probabilities B showed that several page categories were discovered:
    » Shop info
    » Start, customer/corporate/promotion
    » Tools
    » Search, download/products
- Four user types were discovered too (HMMs with different state prior and transition probabilities)
  - Two types dominated:
    » General interest users
    » Shop users
  - The starting state was the most important difference between the types
Lecture 10.30
References
- [DHS2000] Richard O. Duda, Peter E. Hart and David G. Stork, Pattern Classification (2nd Edn), Wiley, New York, NY, 2000, pp. 128–138.
- [RaJ1986] L. R. Rabiner and B. H. Juang, "An introduction to hidden Markov models", IEEE Magazine on Acoustics, Speech and Signal Processing, 3(1), pp. 4–16, January 1986.
- [YpH2002] Alexander Ypma and Tom Heskes, "Categorization of web pages and user clustering with mixtures of hidden Markov models", in Proceedings of the International Workshop on Web Knowledge Discovery and Data Mining (WEBKDD'02), Edmonton, Canada, July 17, 2002.
Lecture 10.31