Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
J. Lafferty, A. McCallum, F. Pereira
Presentation: Inna Weiner
Learning Seminar, 2004
Outline
• Labeling sequence data problem
• Classification with probabilistic models: generative and discriminative
  – Why HMMs and MEMMs are not good enough
• Conditional Random Field model
• Experimental Results
Labeling Sequence Data Problem
• X is a random variable over data sequences
• Y is a random variable over label sequences
• Y_i is assumed to range over a finite label alphabet A
• The problem:
  – Learn how to give labels from a closed set Y to a data sequence X
Example: X = (x_1, x_2, x_3) = (Thinking, is, being), Y = (y_1, y_2, y_3) = (noun, verb, noun)
Labeling Sequence Data Problem
• The lab setup: let a monkey do some behavioral task while recording movement and neural activity
• Motor task: reach to target
• Goal: map neural activity to behavior
• In our notation:
  – X: neural data
  – Y: hand movements
Generative Probabilistic Models
• Learning problem: choose Θ to maximize the joint likelihood:
  L(Θ) = Σ_i log p_Θ(y_i, x_i)
• The goal: maximization of the joint likelihood of the training examples
  y = argmax_y p*(y | x) = argmax_y p*(y, x) / p(x)
• Needs to enumerate all possible observation sequences
Markov Model
• A Markov process or model assumes that we can predict the future based just on the present (or on a limited horizon into the past)
• Let {X_1,…,X_T} be a sequence of random variables taking values in {1,…,N}; then the Markov properties are:
• Limited horizon: P(X_{t+1} | X_1,…,X_t) = P(X_{t+1} | X_t)
• Time invariant (stationary): P(X_{t+1} | X_t) = P(X_2 | X_1)
Describing a Markov Chain
• A Markov chain can be described by the transition matrix A and the initial probabilities q (a small example follows):
  A_{ij} = P(X_{t+1} = j | X_t = i)
  q_i = P(X_1 = i)
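To make the definitions concrete, here is a minimal numpy sketch (not from the original slides; the numbers are made up) that stores A and q and evaluates the probability of a fully observed state sequence:

import numpy as np

# Toy two-state chain; A[i, j] = P(X_{t+1} = j | X_t = i), q[i] = P(X_1 = i).
A = np.array([[0.9, 0.1],
              [0.4, 0.6]])
q = np.array([0.5, 0.5])

def chain_probability(states, A, q):
    """Probability of a fully observed state sequence under (A, q)."""
    p = q[states[0]]
    for prev, nxt in zip(states[:-1], states[1:]):
        p *= A[prev, nxt]
    return p

print(chain_probability([0, 0, 1, 1], A, q))  # 0.5 * 0.9 * 0.1 * 0.6 = 0.027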
Hidden Markov Model
• In a Hidden Markov Model (HMM) we do not observe the sequence of states the model passed through (X), but only some probabilistic function of it (Y). Thus, it is a Markov model with the addition of emission probabilities:
  B_{ik} = P(Y_t = k | X_t = i)
The Three Problems of HMM
• Likelihood: given a series of observations y and a model λ = {A, B, q}, compute the likelihood p(y | λ)
• Inference: given a series of observations y and a model λ, compute the most likely series of hidden states x
• Learning: given a series of observations, learn the best model λ
Likelihood in HMMs
• Given a model λ = {A, B, q}, we can compute the likelihood by
  P(y) = p(y | λ) = Σ_x p(x) p(y | x) = Σ_x q(x_1) Π_t A(x_{t+1} | x_t) Π_t B(y_t | x_t)
• But … the complexity of this computation is O(N^T), where each x_t ranges over N values → impossible in practice
Forward-Backward algorithm
• To compute the likelihood we need to enumerate all paths in the lattice (all possible instantiations of X_1…X_T). But some starting sub-path is common to many continuing paths.
• The idea: using dynamic programming, calculate a path in terms of shorter sub-paths.
Forward-Backward algorithm (cont’d)
• We build a matrix of the probability of being at time t in state i: α_t(i) = P(x_t = i, y_1 y_2 … y_t). Each column is a function of the previous column (forward procedure):
  α_1(i) = q_i B(y_1 | i),   α_{t+1}(j) = B(y_{t+1} | j) Σ_i α_t(i) A(j | i)
Forward-Backward algorithm (cont’d)
• We can similarly define a backward procedure for filling the matrix β_t(i) = P(y_{t+1} … y_T | x_t = i), using β_T(i) = 1 and β_t(i) = Σ_j A(j | i) B(y_{t+1} | j) β_{t+1}(j)
Forward-Backward algorithm (cont’d)
• And we can easily combine:
  P(y, x_t = i) = P(x_t = i, y_1 y_2 … y_t) · P(y_{t+1} … y_T | x_t = i) = α_t(i) β_t(i)
• And then we get, for any t:
  P(y) = Σ_i P(y, x_t = i) = Σ_i α_t(i) β_t(i)
• Summary: we presented a polynomial-time algorithm for computing the likelihood in HMMs (see the sketch below).
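Here is a minimal numpy sketch of the forward pass (not from the original slides; the parameters are made up), which computes P(y) in O(N²·T) time; the backward pass is analogous:

import numpy as np

def hmm_likelihood(A, B, q, y):
    """Forward algorithm: A[i, j] = A(j|i), B[i, k] = B(k|i), q[i] = P(X_1 = i)."""
    alpha = q * B[:, y[0]]                # alpha_1(i) = q_i * B(y_1 | i)
    for t in range(1, len(y)):
        alpha = (alpha @ A) * B[:, y[t]]  # alpha_{t+1}(j) = B(y_{t+1}|j) * sum_i alpha_t(i) A(j|i)
    return alpha.sum()                    # P(y) = sum_i alpha_T(i)

# Tiny example with made-up parameters.
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
q = np.array([0.6, 0.4])
print(hmm_likelihood(A, B, q, [0, 1, 1]))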
HMM – why not?
• Advantages:
  – Estimation is very easy
  – Closed-form solution
  – The parameters can be estimated with relatively high confidence from small samples
• But:
  – The model represents all possible (x, y) sequences and defines a joint probability over all possible observation and label sequences → needless effort
Discriminative Probabilistic Models
[Figure: generative vs. discriminative model structures]
“Solve the problem you need to solve”: the traditional approach inappropriately uses a generative joint model in order to solve a conditional problem in which the observations are given. To classify we need p(y|x); there is no need to implicitly approximate p(x).
Discriminative Models - Estimation
• Choose Θ_y to maximize the conditional likelihood:
  L(Θ_y) = Σ_i log p_{Θ_y}(y_i | x_i)
• Estimation usually doesn't have a closed form
• Example – the MinMI discriminative approach (2nd week's lecture)
Maximum Entropy Markov Model
• MEMM:
  – a conditional model that represents the probability of reaching a state given an observation and the previous state
  – these conditional probabilities are specified by exponential models based on arbitrary observation features
The Label Bias Problem
• The probability mass that arrives at a state must be distributed among its possible successor states, regardless of the observation
• Potential victims: locally normalized (per-state) discriminative models such as MEMMs
The Label Bias Problem: Solutions
• Determinization of the finite-state machine
  – Not always possible
  – May lead to a combinatorial explosion
• Start with a fully connected model and let the training procedure find a good structure
  – But prior structural knowledge has proven to be valuable in information extraction tasks
Random Field Model: Definition
• Let G = (V, E) be a finite graph, and let A be a finite alphabet.
• The configuration space Ω is the set of all labelings of the vertices in V by letters in A. If C is a subset of V and ω ∈ Ω is a configuration, then ω_C denotes the configuration restricted to C.
• A random field on G is a probability distribution on Ω.
Random Field Model: The Problem
• Assume that a finite number of features can define a class
• The features f_i(ω) are given and fixed
• The goal: estimate λ to maximize the likelihood of the training examples
Conditional Random Field: Definition
• X – random variable over data sequences
• Y – random variable over label sequences
• Y_i is assumed to range over a finite label alphabet A
• Discriminative approach: we construct a conditional model p(y|x) and do not explicitly model the marginal p(x)
CRF - Definition
• Let G = (V, E) be a finite graph, and let A be a finite alphabet
• Y is indexed by the vertices of G
• Then (X, Y) is a conditional random field if the random variables Y_v, conditioned on X, obey the Markov property with respect to the graph:
  p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ~ v),
  where w ~ v means that w and v are neighbors in G
CRF on Simple Chain Graph
• We will handle the case where G is a simple chain: G = (V = {1,…,m}, E = {(i, i+1)})
[Figure: graphical structures of the HMM (generative), the MEMM (discriminative), and the CRF]
Fundamental Theorem of Random Fields (Hammersley & Clifford)
• Assumption: the structure of G is a tree, of which the simple chain is a special case; the resulting form of the distribution is given below
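For reference, the resulting form in the Lafferty et al. paper (with edge features f_k, vertex features g_k and parameters λ_k, µ_k, as introduced on the next slide) is:

  p_Θ(y | x) ∝ exp( Σ_{e∈E, k} λ_k f_k(e, y|_e, x) + Σ_{v∈V, k} µ_k g_k(v, y|_v, x) )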
CRF – the Learning Problem
• Assumption: the features f_k and g_k are given and fixed.
  – For example, a boolean feature g_k is TRUE if the word X_i is upper case and the label Y_i is “noun”.
• The learning problem:
  – We need to determine the parameters Θ = (λ_1, λ_2, …; µ_1, µ_2, …) from training data D = {(x(i), y(i))} with empirical distribution p~(x, y).
CRF – Estimation
• And we return to the log-likelihood maximization problem; this time we need to find the Θ that maximizes the conditional log-likelihood:
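In the paper's notation, using the empirical distribution p~(x, y) defined above, the objective is

  O(Θ) = Σ_i log p_Θ(y(i) | x(i)) ∝ Σ_{x,y} p~(x, y) log p_Θ(y | x)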
CRF – Estimation
• From now on we assume that the dependencies of Y, conditioned on X, form a chain.
• To simplify some expressions, we add special start and stop states Y_0 = start and Y_{n+1} = stop.
CRF – Estimation
• Suppose that p(Y|X) is a CRF. For each position i in the observation sequence X, we define the |Y|×|Y| matrix random variable M_i(x) = [M_i(y′, y | x)], where e_i is the edge with labels (Y_{i-1}, Y_i) and v_i is the vertex with label Y_i:
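In the paper, each entry is built from the edge and vertex features introduced above:

  M_i(y′, y | x) = exp( Σ_k λ_k f_k(e_i, Y|_{e_i} = (y′, y), x) + Σ_k µ_k g_k(v_i, Y|_{v_i} = y, x) )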
CRF – Estimation
• The normalization function Z(x) is:
• The conditional probability of a label sequence y is written as:
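In the paper these are, respectively, products of the M_i matrices (with y_0 = start and y_{n+1} = stop):

  Z(x) = ( M_1(x) M_2(x) ⋯ M_{n+1}(x) )_{start, stop}

  p(y | x) = Π_{i=1..n+1} M_i(y_{i-1}, y_i | x) / Z(x)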
Parameter Estimation for CRFs
• The parameter vector Θ that maximizes the log-likelihood is found using an iterative scaling algorithm.
• We define standard HMM-like forward and backward vectors α and β, which allow polynomial-time calculations.
• For example, they give Z(x) and the marginals needed for the updates (see the sketch below).
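Below is a minimal numpy sketch (not from the original slides) of how the α and β vectors yield Z(x) and the edge marginals used in the updates, assuming the per-position matrices M_i(x) have already been built from the features; the function and variable names are illustrative only:

import numpy as np

def crf_forward_backward(M, start, stop):
    """M[i][y_prev, y] = M_{i+1}(y_prev, y | x), i = 0..n, with 'start' and
    'stop' included among the label indices as on the slides."""
    num_labels = M[0].shape[0]

    # Forward vectors: alpha_0 = indicator of 'start', alpha_i = alpha_{i-1} M_i.
    alphas = [np.eye(num_labels)[start]]
    for Mi in M:
        alphas.append(alphas[-1] @ Mi)

    # Backward vectors: beta_{n+1} = indicator of 'stop', beta_i = M_{i+1} beta_{i+1}.
    betas = [np.eye(num_labels)[stop]]
    for Mi in reversed(M):
        betas.insert(0, Mi @ betas[0])

    # Z(x) = (M_1 M_2 ... M_{n+1})_{start, stop}
    Z = alphas[-1][stop]

    # edge_marginals[i][y', y] = P(Y_i = y', Y_{i+1} = y | x), needed for the updates.
    edge_marginals = [np.outer(alphas[i], betas[i + 1]) * M[i] / Z
                      for i in range(len(M))]
    return Z, edge_marginals

In practice these recursions are usually done in log space to avoid numerical underflow on long sequences.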
Experimental Results – Set 1
• Set 1: modeling label bias
• Data was generated from a simple HMM which encodes a noisy version of the finite-state network (“rib” / “rob”)
• Each state emits its designated symbol with probability 29/32 and any of the other symbols with probability 1/32
• We train both an MEMM and a CRF
• The observation features are simply the identity of the observation symbols
• 2,000 training and 500 test samples were used
• Results:
  – CRF error: 4.6%
  – MEMM error: 42%
• Conclusion:
  – The MEMM fails to discriminate between the two branches, and we get the label bias problem
Experimental Results – Set 2
• Set 2: modeling mixed-order sources
• Data was generated from a mixed-order HMM with state transition probabilities given by (a sampling sketch follows this list):
  p(y_i | y_{i-1}, y_{i-2}) = α p_2(y_i | y_{i-1}, y_{i-2}) + (1 − α) p_1(y_i | y_{i-1})
• Similarly, emission probabilities are given by:
  p(x_i | y_i, x_{i-1}) = α p_2(x_i | y_i, x_{i-1}) + (1 − α) p_1(x_i | y_i)
• Thus, for α = 0 we have a standard first-order HMM.
• For each randomly generated model, a sample of 1,000 sequences of length 25 is generated for training and testing.
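A minimal sketch (not from the original slides) of how a single mixed-order transition could be sampled; p1 and p2 are assumed to be given as row-stochastic conditional tables:

import numpy as np

def sample_next_label(y_prev1, y_prev2, p1, p2, alpha, rng):
    """Sample y_i from alpha * p2(y_i | y_{i-1}, y_{i-2}) + (1 - alpha) * p1(y_i | y_{i-1}).

    p2[k, j] is the distribution of y_i given y_{i-2} = k and y_{i-1} = j;
    p1[j] is the distribution of y_i given y_{i-1} = j; alpha = 0 gives a
    first-order HMM, as noted above."""
    probs = alpha * p2[y_prev2, y_prev1] + (1 - alpha) * p1[y_prev1]
    return rng.choice(len(probs), p=probs)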
[Figure: Set 2 results]
Experimental Results – Set 3
• Set 3: Part-Of-Speech tagging experiments
Conclusions
• Conditional random fields offer a unique combination of properties:
  – discriminatively trained models for sequence segmentation and labeling
  – combination of arbitrary and overlapping observation features from both the past and the future
  – efficient training and decoding based on dynamic programming for a simple chain graph
  – parameter estimation guaranteed to find the global optimum
• CRFs' main current limitation is the slow convergence of the training algorithm relative to MEMMs, let alone to HMMs, for which training on fully observed data is very efficient.
Thank you