Information Extraction
from the World Wide Web
Part II: FSM Models
CSE 454
Based on Slides by
William W. Cohen
Carnegie Mellon University
Andrew McCallum
University of Massachusetts Amherst
From KDD 2003
Administrivia
• Proposals
– Due Today
– Viewable by all ???
© Daniel S. Weld
Landscape of IE Techniques
• Pattern Feature Domain
– NL Text vs. Text with Formatting vs. Tables
Example (NL text):
Astro Teller is the CEO and co-founder of
BodyMedia. Astro holds a Ph.D. in Artificial
Intelligence from Carnegie Mellon University,
where he was inducted as a national Hertz fellow.
His M.S. in symbolic and heuristic computation and
B.S. in computer science are from Stanford
University. His work in science, literature and
business has appeared in international media from
the New York Times to CNN to NPR.
• Pattern Scope
– Site Specific, Genre Specific, Broad Source
• Pattern Complexity
– Closed Set, Regular Expr., Complex Expr., Ambiguous
• Pattern Combinations
– Named-Entity Recognition, …, N-ary Relations
Landscape of IE Techniques (1/1):
Models
[Figure: six families of IE models, each illustrated on the sentence
 “Abraham Lincoln was born in Kentucky.”]
• Lexicons: is a token a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
• Classify Pre-segmented Candidates: a classifier asks “which class?” for each candidate
• Sliding Window: a classifier asks “which class?” while trying alternate window sizes
• Boundary Models: classifiers mark BEGIN and END boundaries
• Finite State Machines: what is the most likely state sequence?
• Context Free Grammars: what is the most likely parse (NNP, V, P, NP, VP, PP, S)?
Each model can capture words, formatting, or both
Slides from Cohen & McCallum
Hidden Markov Models (HMMs)
standard sequence model in genomics, speech, NLP, …
Graphical model / finite state model
[Figure: chain of hidden states S_{t−1}, S_t, S_{t+1}, … with transitions between
 states; each state emits an observation O_{t−1}, O_t, O_{t+1}, …. The model
 generates a state sequence and an observation sequence o1 o2 … o|o|.]

P(s, o) = ∏t=1…|o| P(st | st−1) P(ot | st)

Parameters: for all states S = {s1, s2, …}
  Start state probabilities: P(s1)
  Transition probabilities: P(st | st−1)
  Observation (emission) probabilities: P(ot | st)
    – usually a multinomial over an atomic, fixed alphabet
Training:
  Maximize probability of training observations (w/ prior)
Slides from Cohen & McCallum
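To make the generative picture concrete, here is a minimal sketch, assuming a made-up two-state HMM, of the joint probability P(s, o) = ∏t P(st | st−1) P(ot | st); the state names, symbols, and numbers are all illustrative, not from the slides.

# Minimal sketch of P(s, o) = prod_t P(s_t | s_{t-1}) P(o_t | s_t)
# for a made-up two-state HMM; everything here is illustrative.

start = {"A": 0.6, "B": 0.4}                 # P(s_1)
trans = {"A": {"A": 0.7, "B": 0.3},          # P(s_t | s_{t-1})
         "B": {"A": 0.2, "B": 0.8}}
emit  = {"A": {"x": 0.9, "y": 0.1},          # P(o_t | s_t)
         "B": {"x": 0.3, "y": 0.7}}

def joint_probability(s, o):
    p = start[s[0]] * emit[s[0]][o[0]]
    for prev, cur, obs in zip(s, s[1:], o[1:]):
        p *= trans[prev][cur] * emit[cur][obs]
    return p

print(joint_probability(["A", "A", "B"], ["x", "x", "y"]))   # 0.6*0.9 * 0.7*0.9 * 0.3*0.7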
Example: The Dishonest Casino
A casino has two dice:
• Fair die
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
• Loaded die
P(1) = P(2) = P(3) = P(4) = P(5) = 1/10
P(6) = 1/2
Casino player switches back-&-forth between
fair and loaded die once every 20 turns
Game:
1. You bet $1
2. You roll (always with a fair die)
3. Casino player rolls (maybe with fair die,
maybe with loaded die)
4. Highest number wins $2
Slides from Serafim Batzoglou
Question # 1 – Evaluation
GIVEN
A sequence of rolls by the casino player
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION
How likely is this sequence, given our model of how the casino
works?
This is the EVALUATION problem in HMMs
Slides from Serafim Batzoglou
Question # 2 – Decoding
GIVEN
A sequence of rolls by the casino player
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION
What portion of the sequence was generated with the fair die, and
what portion with the loaded die?
This is the DECODING question in HMMs
Slides from Serafim Batzoglou
Question # 3 – Learning
GIVEN
A sequence of rolls by the casino player
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION
How “loaded” is the loaded die? How “fair” is the fair die? How often
does the casino player change from fair to loaded, and back?
This is the LEARNING question in HMMs
Slides from Serafim Batzoglou
The dishonest casino model
[Figure: two states, FAIR and LOADED; each stays in place with probability 0.95
 and switches to the other state with probability 0.05]
FAIR:    P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
LOADED:  P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10,  P(6|L) = 1/2
Slides from Serafim Batzoglou
What’s this have to do with Info Extraction?
[Figure: the same FAIR/LOADED casino HMM repeated, as a bridge to the next slide]
What’s this have to do with Info Extraction?
[Figure: the same two-state HMM, relabeled for extraction: states TEXT and NAME,
 each with self-transition probability 0.95 and switching probability 0.05]
TEXT:  P(the | T) = 0.003,  P(from | T) = 0.002, …
NAME:  P(Dan | N) = 0.005,  P(Sue | N) = 0.003, …
Definition of a hidden Markov model
Definition: A hidden Markov model (HMM)
• Alphabet Σ = { b1, b2, …, bM }
• Set of states Q = { 1, ..., K }
• Transition probabilities between any two states
  aij = transition prob from state i to state j
  ai1 + … + aiK = 1, for all states i = 1…K
• Start probabilities a0i
  a01 + … + a0K = 1
• Emission probabilities within each state
  ek(b) = P( xi = b | πi = k )
  ek(b1) + … + ek(bM) = 1, for all states k = 1…K
Slides from Serafim Batzoglou
An HMM is memory-less
At each time step t, the only thing that affects future states
is the current state πt:
P(πt+1 = k | “whatever happened so far”)
  = P(πt+1 = k | π1, π2, …, πt, x1, x2, …, xt)
  = P(πt+1 = k | πt)
Slides from Serafim Batzoglou
A parse of a sequence
Given a sequence x = x1……xN,
a parse of x is a sequence of states π = π1, ……, πN
[Figure: trellis with states 1…K stacked above each position x1, x2, x3, …, xN;
 a parse is one path through the trellis]
Slides from Serafim Batzoglou
Likelihood of a parse
Given a sequence x = x1……xN
and a parse π = π1, ……, πN,
to find how likely the parse is (given our HMM):
[Figure: the same trellis, with the chosen path highlighted]
P(x, π) = P(x1, …, xN, π1, ……, πN)
  = P(xN, πN | x1…xN−1, π1, ……, πN−1) P(x1…xN−1, π1, ……, πN−1)
  = P(xN, πN | πN−1) P(x1…xN−1, π1, ……, πN−1)
  = …
  = P(xN, πN | πN−1) P(xN−1, πN−1 | πN−2) …… P(x2, π2 | π1) P(x1, π1)
  = P(xN | πN) P(πN | πN−1) …… P(x2 | π2) P(π2 | π1) P(x1 | π1) P(π1)
  = a0π1 aπ1π2 …… aπN−1πN eπ1(x1) …… eπN(xN)
Slides from Serafim Batzoglou
Example: the dishonest casino
Let the sequence of rolls be:
x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4
Then, what is the likelihood of
π = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair?
(say initial probs a0Fair = ½, a0Loaded = ½)
½ × P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair) =
½ × (1/6)¹⁰ × (0.95)⁹ = 0.00000000521158647211 ≈ 5.2 × 10⁻⁹
Slides from Serafim Batzoglou
Example: the dishonest casino
So, the likelihood the die is fair in all this run
is just about 5.2 × 10⁻⁹
OK, but what is the likelihood of
π = Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded,
Loaded, Loaded, Loaded?
½ × P(1 | Loaded) P(Loaded | Loaded) … P(4 | Loaded) =
½ × (1/10)⁸ × (1/2)² × (0.95)⁹ = 0.00000000078781176215 ≈ 0.79 × 10⁻⁹
Therefore, the probability that the die is fair all the way is of the same
order of magnitude as (about 6–7 times larger than) the probability that it
is loaded all the way
Slides from Serafim Batzoglou
Example: the dishonest casino
Let the sequence of rolls be:
x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6
Now, what is the likelihood π = F, F, …, F?
½ × (1/6)¹⁰ × (0.95)⁹ ≈ 5.2 × 10⁻⁹, same as before
What is the likelihood
π = L, L, …, L?
½ × (1/10)⁴ × (1/2)⁶ × (0.95)⁹ = 0.00000049238235134735 ≈ 4.9 × 10⁻⁷
So, it is about 100 times more likely that the die is loaded
Slides from Serafim Batzoglou
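As a quick arithmetic check on the two examples above, the following sketch encodes the casino parameters from the slides and recomputes the parse likelihoods; the function and variable names are illustrative.

# Sanity check of the dishonest-casino parse likelihoods computed above.
# Transition/emission/start probabilities are taken directly from the slides.

start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05},
         "L": {"F": 0.05, "L": 0.95}}
emit  = {"F": {r: 1/6 for r in "123456"},
         "L": {**{r: 1/10 for r in "12345"}, "6": 1/2}}

def parse_likelihood(rolls, states):
    # P(x, pi) = a_{0,pi1} e_{pi1}(x1) * prod_i a_{pi(i-1),pi(i)} e_{pi(i)}(xi)
    p = start[states[0]] * emit[states[0]][rolls[0]]
    for prev, cur, roll in zip(states, states[1:], rolls[1:]):
        p *= trans[prev][cur] * emit[cur][roll]
    return p

x1 = "1215621624"   # the rolls 1,2,1,5,6,2,1,6,2,4
x2 = "1665626636"   # the rolls 1,6,6,5,6,2,6,6,3,6
print(parse_likelihood(x1, "F" * 10))   # ~5.2e-09  (all Fair)
print(parse_likelihood(x1, "L" * 10))   # ~7.9e-10  (all Loaded)
print(parse_likelihood(x2, "F" * 10))   # ~5.2e-09  (all Fair, same as before)
print(parse_likelihood(x2, "L" * 10))   # ~4.9e-07  (all Loaded)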
The three main questions on HMMs
1. Evaluation
   GIVEN an HMM M, and a sequence x,
   FIND  Prob[ x | M ]
2. Decoding
   GIVEN an HMM M, and a sequence x,
   FIND  the sequence π of states that maximizes P[ x, π | M ]
3. Learning
   GIVEN an HMM M, with unspecified transition/emission probs., and a sequence x,
   FIND  parameters θ = (ei(.), aij) that maximize P[ x | θ ]
Slides from Serafim Batzoglou
Let’s not be confused by notation
P[ x | M ]:
The probability that sequence x was generated by the model.
The model is: architecture (#states, etc.)
              + parameters θ = (aij, ei(.))
So, P[ x | M ] is the same as P[ x | θ ], and as P[ x ], when the
architecture and the parameters, respectively, are implied.
Similarly, P[ x, π | M ], P[ x, π | θ ] and P[ x, π ] are the same when
the architecture and the parameters are implied.
In the LEARNING problem we always write P[ x | θ ] to emphasize
that we are seeking the θ* that maximizes P[ x | θ ]
Slides from Serafim Batzoglou
Problem 1: Decoding
Find the best parse of a
sequence
Slides from Serafim Batzoglou
Decoding
GIVEN x = x1x2……xN,
we want to find π = π1, ……, πN,
such that P[ x, π ] is maximized:
π* = argmaxπ P[ x, π ]
We can use dynamic programming!
[Figure: trellis of states 1…K over positions x1, x2, x3, …, xN]
Let Vk(i) = max{π1,…,πi−1} P[ x1…xi−1, π1, …, πi−1, xi, πi = k ]
          = probability of the most likely sequence of states ending at
            state πi = k
Slides from Serafim Batzoglou
Decoding – main idea
Given that, for all states k,
and for a fixed position i,
  Vk(i) = max{π1,…,πi−1} P[ x1…xi−1, π1, …, πi−1, xi, πi = k ],
what is Vl(i+1)?
From the definition,
Vl(i+1) = max{π1,…,πi} P[ x1…xi, π1, …, πi, xi+1, πi+1 = l ]
  = max{π1,…,πi} P(xi+1, πi+1 = l | x1…xi, π1,…, πi) P[ x1…xi, π1,…, πi ]
  = max{π1,…,πi} P(xi+1, πi+1 = l | πi) P[ x1…xi−1, π1, …, πi−1, xi, πi ]
  = maxk [ P(xi+1, πi+1 = l | πi = k) max{π1,…,πi−1} P[ x1…xi−1, π1,…, πi−1, xi, πi = k ] ]
  = el(xi+1) maxk akl Vk(i)
Remember:
Vk(i) = probability of the most likely state sequence ending with πi = k
Slides from Serafim Batzoglou
The Viterbi Algorithm
Dynamic Programming
[Figure: K × N dynamic-programming table, states 1, 2, …, j, …, K down the side
 and positions x1 x2 …… xi xi+1 ……………………….. xN across the top; cell Vj(i+1) is
 computed from column i]
Time:  O(K²N)
Space: O(KN)
Remember:
Vk(i) = probability of the most likely state sequence ending with πi = k
Slides from Serafim Batzoglou
The Viterbi Algorithm
Input: observation sequence x = x1……xN
Initialization:
  V0(0) = 1
  Vk(0) = 0, for all k > 0
  (position 0 is an imaginary first position)
Iteration (i from 1 to N, j from 1 to K):
  Vj(i) = ej(xi) × maxk akj Vk(i−1)
  Ptrj(i) = argmaxk akj Vk(i−1)
Termination:
  P(x, π*) = maxk Vk(N)
Traceback:
  πN* = argmaxk Vk(N)
  πi−1* = Ptrπi*(i), for i = N, …, 2
[Figure: the filled K × N table over x1 x2 …………………….. xN; the answer is read
 off the best cell V*(N) in the last column and traced back]
Remember:
Vk(i) = probability of the most likely state sequence ending with πi = k
Slides from Serafim Batzoglou
Viterbi Algorithm – a practical detail
Underflows are a significant problem:
  P[ x1, …, xi, π1, …, πi ] = a0π1 aπ1π2 …… aπi−1πi eπ1(x1) …… eπi(xi)
These numbers become extremely small – underflow.
Solution: take the logs of all values:
  Vl(i) = log el(xi) + maxk [ Vk(i−1) + log akl ]
Slides from Serafim Batzoglou
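Putting the recurrence and the log-space fix together, here is a minimal Viterbi sketch on the casino model. The probabilities are the slides'; treating the start distribution as the transition out of the imaginary position 0, and all names, are illustrative choices.

import math

# Viterbi decoding in log space, following
#   V_l(i) = log e_l(x_i) + max_k [ V_k(i-1) + log a_kl ]

start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit  = {"F": {r: 1/6 for r in "123456"},
         "L": {**{r: 1/10 for r in "12345"}, "6": 1/2}}
states = ["F", "L"]

def viterbi(x):
    # V[i][k] = log-prob of the most likely state sequence ending in k at position i
    V = [{k: math.log(start[k]) + math.log(emit[k][x[0]]) for k in states}]
    ptr = [{}]
    for i in range(1, len(x)):
        V.append({})
        ptr.append({})
        for l in states:
            best_k = max(states, key=lambda k: V[i - 1][k] + math.log(trans[k][l]))
            ptr[i][l] = best_k
            V[i][l] = math.log(emit[l][x[i]]) + V[i - 1][best_k] + math.log(trans[best_k][l])
    # traceback of the best parse
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return "".join(reversed(path)), V[-1][last]

parse, logp = viterbi("123456123456626366626466")
print(parse, logp)   # best parse and its log-probability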
Example
Let x be a sequence with a portion of ~ 1/6 6’s,
followed by a portion of ~ ½ 6’s…
x = 123456123456…12345 6626364656…1626364656
Then, it is not hard to show that optimal parse is (exercise):
FFF…………………...F LLL………………………...L
6 characters “123456”, parsed as F, contribute 0.95⁶ × (1/6)⁶ ≈ 1.6 × 10⁻⁵
                       parsed as L, contribute 0.95⁶ × (1/2)¹ × (1/10)⁵ ≈ 0.4 × 10⁻⁵
6 characters “162636”, parsed as F, contribute 0.95⁶ × (1/6)⁶ ≈ 1.6 × 10⁻⁵
                       parsed as L, contribute 0.95⁶ × (1/2)³ × (1/10)³ ≈ 9.0 × 10⁻⁵
Slides from Serafim Batzoglou
Decoding = Extraction
• Running Viterbi over a document with target and non-target states labels every
  token with its most likely state; the tokens assigned to target states are the
  extracted fields
Problem 2: Evaluation
Find the likelihood a sequence
is generated by the model
Slides from Serafim Batzoglou
Why important
Generating a sequence by the model
Given an HMM, we can generate a sequence of length n as follows:
1. Start at state π1 according to prob a0π1
2. Emit letter x1 according to prob eπ1(x1)
3. Go to state π2 according to prob aπ1π2
4. … until emitting xn
[Figure: trellis with a start state 0; arcs labeled a02 and e2(x1) illustrate one
 random path through states 1…K emitting x1, x2, x3, …, xn]
Slides from Serafim Batzoglou
A couple of questions
Given an observation sequence x:
• What is the probability that x was generated by the model?
• Given a position i, what is the most likely state that emitted xi?
Example: the dishonest casino
Say x = 12341623162616364616234161221341
Most likely path: π = FF……F
However: the marked letters (a boxed run rich in 6’s) are more likely to be L
than the unmarked letters:
  P(box: FFFFFFFFFFF) = (1/6)¹¹ × 0.95¹² = 2.76 × 10⁻⁹ × 0.54 = 1.49 × 10⁻⁹
  P(box: LLLLLLLLLLL) = [ (1/2)⁶ × (1/10)⁵ ] × 0.95¹⁰ × 0.05²
                      = 1.56 × 10⁻⁷ × 1.5 × 10⁻³ = 0.23 × 10⁻⁹
Slides from Serafim Batzoglou
Evaluation
We will develop algorithms that allow us to compute:
P(x)           Probability of x given the model
P(xi…xj)       Probability of a substring of x given the model
P(πi = k | x)  Probability that the ith state is k, given x
               (a more refined measure of which states x may be in)
Slides from Serafim Batzoglou
The Forward Algorithm
We want to calculate
P(x) = probability of x, given the HMM
Sum over all possible ways of generating x:
  P(x) = Σπ P(x, π) = Σπ P(x | π) P(π)
To avoid summing over an exponential number of paths π, define
  fk(i) = P(x1…xi, πi = k)
  (the forward probability)
Slides from Serafim Batzoglou
The Forward Algorithm – derivation
Define the forward probability:
fk(i) = P(x1…xi, πi = k)
      = Σπ1…πi−1 P(x1…xi−1, π1, …, πi−1, πi = k) ek(xi)
      = Σl Σπ1…πi−2 P(x1…xi−1, π1, …, πi−2, πi−1 = l) alk ek(xi)
      = ek(xi) Σl fl(i−1) alk
Slides from Serafim Batzoglou
The Forward Algorithm
We can compute fk(i) for all k, i, using dynamic programming!
Initialization:
  f0(0) = 1
  fk(0) = 0, for all k > 0
Iteration:
  fk(i) = ek(xi) Σl fl(i−1) alk
Termination:
  P(x) = Σk fk(N) ak0
where ak0 is the probability that the terminating state is k (usually = a0k)
Slides from Serafim Batzoglou
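A minimal forward-algorithm sketch on the casino model. The model numbers are the slides'; taking the termination factors ak0 to be 1, and the names, are simplifying assumptions.

# Forward algorithm: f_k(i) = e_k(x_i) * sum_l f_l(i-1) * a_lk, P(x) = sum_k f_k(N).

start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit  = {"F": {r: 1/6 for r in "123456"},
         "L": {**{r: 1/10 for r in "12345"}, "6": 1/2}}
states = ["F", "L"]

def forward(x):
    f = [{k: start[k] * emit[k][x[0]] for k in states}]          # f_k(1)
    for i in range(1, len(x)):
        f.append({k: emit[k][x[i]] * sum(f[-1][l] * trans[l][k] for l in states)
                  for k in states})
    return f, sum(f[-1][k] for k in states)                      # table and P(x)

table, px = forward("1215621624")
print(px)   # probability of the rolls, summed over all possible parses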
Relation between Forward and Viterbi
VITERBI
  Initialization: V0(0) = 1;  Vk(0) = 0, for all k > 0
  Iteration:      Vj(i) = ej(xi) maxk Vk(i−1) akj
  Termination:    P(x, π*) = maxk Vk(N)
FORWARD
  Initialization: f0(0) = 1;  fk(0) = 0, for all k > 0
  Iteration:      fl(i) = el(xi) Σk fk(i−1) akl
  Termination:    P(x) = Σk fk(N) ak0
Slides from Serafim Batzoglou
Motivation for the Backward Algorithm
We want to compute
  P(πi = k | x),
the probability distribution on the ith position, given x.
We start by computing
  P(πi = k, x) = P(x1…xi, πi = k, xi+1…xN)
               = P(x1…xi, πi = k) P(xi+1…xN | x1…xi, πi = k)
               = P(x1…xi, πi = k) P(xi+1…xN | πi = k)
               =      fk(i)       ×       bk(i)
                    (Forward)          (Backward)
Then, P(πi = k | x) = P(πi = k, x) / P(x)
Slides from Serafim Batzoglou
The Backward Algorithm – derivation
Define the backward probability:
bk(i) = P(xi+1…xN | πi = k)
      = Σπi+1…πN P(xi+1, xi+2, …, xN, πi+1, …, πN | πi = k)
      = Σl Σπi+1…πN P(xi+1, xi+2, …, xN, πi+1 = l, πi+2, …, πN | πi = k)
      = Σl el(xi+1) akl Σπi+2…πN P(xi+2, …, xN, πi+2, …, πN | πi+1 = l)
      = Σl el(xi+1) akl bl(i+1)
Slides from Serafim Batzoglou
The Backward Algorithm
We can compute bk(i) for all k, i, using dynamic programming
Initialization:
  bk(N) = ak0, for all k
Iteration:
  bk(i) = Σl el(xi+1) akl bl(i+1)
Termination:
  P(x) = Σl a0l el(x1) bl(1)
Slides from Serafim Batzoglou
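A matching backward-algorithm sketch; as in the forward sketch, taking ak0 = 1 for all k is a simplifying assumption, and the termination line recovers the same P(x) as the forward pass.

# Backward algorithm: b_k(i) = sum_l e_l(x_{i+1}) * a_kl * b_l(i+1), b_k(N) = 1.
# Termination: P(x) = sum_l a_0l * e_l(x_1) * b_l(1).

start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit  = {"F": {r: 1/6 for r in "123456"},
         "L": {**{r: 1/10 for r in "12345"}, "6": 1/2}}
states = ["F", "L"]

def backward(x):
    b = [{k: 1.0 for k in states}]                 # b_k(N)
    for i in range(len(x) - 2, -1, -1):            # fill positions N-1 down to 1
        b.insert(0, {k: sum(emit[l][x[i + 1]] * trans[k][l] * b[0][l] for l in states)
                     for k in states})
    return b

x = "1215621624"
b = backward(x)
print(sum(start[l] * emit[l][x[0]] * b[0][l] for l in states))   # P(x), matches forward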
Computational Complexity
What is the running time, and space required, for Forward, and Backward?
Time:  O(K²N)
Space: O(KN)
Useful implementation techniques to avoid underflow:
  Viterbi:           sum of logs
  Forward/Backward:  rescaling at each position by multiplying by a constant
Slides from Serafim Batzoglou
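The rescaling trick mentioned above can be sketched as follows: normalize each forward column to sum to 1 and accumulate the logs of the normalizers, so that log P(x) is obtained without underflow even for very long sequences. The casino model and names are illustrative, as before.

import math

# Forward with per-position rescaling; log P(x) = sum of the log scaling constants.

start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit  = {"F": {r: 1/6 for r in "123456"},
         "L": {**{r: 1/10 for r in "12345"}, "6": 1/2}}
states = ["F", "L"]

def log_prob(x):
    prev = None
    log_px = 0.0
    for i in range(len(x)):
        if i == 0:
            col = {k: start[k] * emit[k][x[0]] for k in states}
        else:
            col = {k: emit[k][x[i]] * sum(prev[l] * trans[l][k] for l in states)
                   for k in states}
        scale = sum(col.values())              # rescaling constant for position i
        log_px += math.log(scale)
        prev = {k: v / scale for k, v in col.items()}
    return log_px

print(log_prob("123456" * 500))                # 3000 rolls: plain forward would underflow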
Information Extraction
• We want specific info from text documents
• For example, from colloquium emails, we want
– Speaker name
– Location
– Start time
© Daniel S. Weld
45
Simple HMM for Job Titles
© Daniel S. Weld
46
HMMs for Info Extraction
For sparse extraction tasks:
• Separate HMM for each type of target
• Each HMM should
– Model entire document
– Consist of target and non-target states
– Not necessarily fully connected
Given the HMM, how do we extract the info?
© Daniel S. Weld
Slide by Okan Basegmez
47
How to Learn the HMM?
• Two questions: structure & parameters
© Daniel S. Weld
48
Simplest Case
• Fix structure
• Learn transition & emission probabilities
• Training data…?
– Label each word as target or non-target
• Challenges
– Sparse training data
– Unseen words have… zero estimated probability
– Smoothing!
© Daniel S. Weld
49
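A sketch of this simplest case, assuming a fixed two-state (target / non-target) structure: transition and emission probabilities become smoothed relative-frequency counts over hand-labeled words. The tiny labeled example, the <UNK> convention, and add-one smoothing are all illustrative choices, not the slide's.

from collections import defaultdict

# Supervised estimation of a fixed-structure HMM from labeled word sequences,
# with add-one (Laplace) smoothing so unseen words do not get probability zero.

labeled = [  # (word, label) sequences; "T" = target (e.g. speaker name), "O" = other
    [("the", "O"), ("talk", "O"), ("by", "O"), ("Dan", "T"), ("Weld", "T")],
    [("speaker", "O"), (":", "O"), ("Sue", "T"), ("Smith", "T"), ("at", "O"), ("noon", "O")],
]
states = ["O", "T"]
vocab = {w for seq in labeled for w, _ in seq} | {"<UNK>"}   # unseen words map to <UNK>

trans_counts = defaultdict(lambda: defaultdict(int))
emit_counts = defaultdict(lambda: defaultdict(int))
for seq in labeled:
    for (_, s), (_, s_next) in zip(seq, seq[1:]):
        trans_counts[s][s_next] += 1
    for w, s in seq:
        emit_counts[s][w] += 1

def transition(s, s_next, alpha=1.0):
    total = sum(trans_counts[s].values())
    return (trans_counts[s][s_next] + alpha) / (total + alpha * len(states))

def emission(s, w, alpha=1.0):
    w = w if w in vocab else "<UNK>"
    total = sum(emit_counts[s].values())
    return (emit_counts[s][w] + alpha) / (total + alpha * len(vocab))

print(transition("O", "T"), emission("T", "Dan"), emission("T", "never-seen-word"))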
Problem 3: Learning
• So far: we have assumed we know the underlying model
  – λ = (A, B, π)
• Often these parameters are estimated on annotated
  training data, which has 2 drawbacks:
  – Annotation is difficult and/or expensive
  – Training data is different from the current data
• We want to choose parameters that maximize the probability of the
  current data, i.e., we’re looking for a model λ′ such that
  λ′ = argmaxλ P(O | λ)
© Daniel S. Weld
Slide by Bonnie Dorr
50
Problem 3: Learning
• Unfortunately, there is no known way to
  analytically find a global maximum, i.e., a
  model λ′ such that λ′ = argmaxλ P(O | λ)
• But it is possible to find a local maximum!
• Given an initial model λ, we can always find
  a model λ′ such that
  P(O | λ′) ≥ P(O | λ)
© Daniel S. Weld
Slide by Bonnie Dorr
51
Parameter Re-estimation
• Use a hill-climbing algorithm
  – Called the forward-backward (or Baum-Welch) algorithm
• Idea:
  – 0. Use an initial parameter instantiation
  – 1. Loop:
     • Iteratively re-estimate the parameters
     • Improve the probability that the given observations are
       generated by the new parameters
  – 2. Until the estimates don’t change much
© Daniel S. Weld
Slide by Bonnie Dorr
52
Parameter Re-estimation
• Three parameters need to be re-estimated:
  – Initial state distribution: πi
  – Transition probabilities: ai,j
  – Emission probabilities: ei(ot)
© Daniel S. Weld
Slide by Bonnie Dorr
53
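To make the re-estimation concrete, here is a minimal sketch of a single forward-backward (Baum-Welch) iteration on the casino model: the E-step computes state and transition posteriors from the forward and backward tables, and the M-step turns them into re-estimated parameters. The observation sequence is made up, the unscaled recursions are only safe because it is short, and a real implementation would loop until convergence.

# One Baum-Welch re-estimation step on the casino model (illustrative sketch).

start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit  = {"F": {r: 1/6 for r in "123456"},
         "L": {**{r: 1/10 for r in "12345"}, "6": 1/2}}
states = ["F", "L"]
x = "316664162466666263666566"    # made-up observation sequence

def forward(x):
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({k: emit[k][x[i]] * sum(f[-1][l] * trans[l][k] for l in states)
                  for k in states})
    return f

def backward(x):
    b = [{k: 1.0 for k in states}]
    for i in range(len(x) - 2, -1, -1):
        b.insert(0, {k: sum(emit[l][x[i + 1]] * trans[k][l] * b[0][l] for l in states)
                     for k in states})
    return b

f, b = forward(x), backward(x)
px = sum(f[-1][k] for k in states)

# E-step: state posteriors gamma and transition posteriors xi
gamma = [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]
xi = [{k: {l: f[i][k] * trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l] / px
           for l in states} for k in states} for i in range(len(x) - 1)]

# M-step: re-estimate initial, transition, and emission probabilities
new_start = {k: gamma[0][k] for k in states}
new_trans = {k: {l: sum(xi[i][k][l] for i in range(len(x) - 1)) /
                    sum(gamma[i][k] for i in range(len(x) - 1))
                 for l in states} for k in states}
new_emit = {k: {sym: sum(gamma[i][k] for i in range(len(x)) if x[i] == sym) /
                     sum(gamma[i][k] for i in range(len(x)))
                for sym in "123456"} for k in states}

print(new_trans, new_emit["L"]["6"])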
More details
• Explain remainder
• Also connect evaluation to learning
We want More than an Atomic View of Words
Would like richer representation of text:
many arbitrary, overlapping features of the words.
[Figure: the HMM chain S_{t−1}, S_t, S_{t+1}, … over observations O_{t−1}, O_t, O_{t+1}, …,
 with many arbitrary, overlapping features attached to each observation, e.g.:]
  identity of word                 is in bold font
  ends in “-ski”                   is indented
  is capitalized                   is in hyperlink anchor
  is part of a noun phrase         last person name was female
  is “Wisniewski”                  next two words are “and Associates”
  is in a list of city names
  is under node X in WordNet
Slides from Cohen & McCallum
Problems with Richer Representation
and a Joint Model
These arbitrary features are not independent.
– Multiple levels of granularity (chars, words, phrases)
– Multiple dependent modalities (words, formatting, layout)
– Past & future
Two choices:
• Model the dependencies.
  Each state would have its own Bayes Net. But we are already
  starved for training data!
• Ignore the dependencies.
  This causes “over-counting” of evidence (à la naïve Bayes).
  Big problem when combining evidence, as in Viterbi!
[Figure: two graphical models over S_{t−1}, S_t, S_{t+1} and O_{t−1}, O_t, O_{t+1},
 one modeling the dependencies among observations, one ignoring them]
Slides from Cohen & McCallum
Conditional Sequence Models
• We prefer a model that is trained to maximize a
conditional probability rather than joint
probability:
P(s|o) instead of P(s,o):
– Can examine features, but not responsible for
generating them.
– Don’t have to explicitly model their dependencies.
– Don’t “waste modeling effort” trying to generate what
we are given at test time anyway.
Slides from Cohen & McCallum
Conditional Finite State Sequence Models
[McCallum, Freitag & Pereira, 2000]
[Lafferty, McCallum, Pereira 2001]
From HMMs to CRFs
  s = s1, s2, … sn
  o = o1, o2, … on
[Figure: linear chain of states St−1, St, St+1, … over observations Ot−1, Ot, Ot+1, …]

Joint:        P(s, o) = ∏t=1…|o| P(st | st−1) P(ot | st)

Conditional:  P(s | o) = (1 / P(o)) ∏t=1…|o| P(st | st−1) P(ot | st)
                       = (1 / Z(o)) ∏t=1…|o| Φs(st, st−1) Φo(ot, st)

              where Φo(t) = exp( Σk λk fk(st, ot) )

(A super-special case of Conditional Random Fields.)
Slides from Cohen & McCallum
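A brute-force illustration of the conditional form above, assuming nothing beyond the formula itself: every label sequence for a short observation sequence is scored with exp of a weighted feature sum and normalized by Z(o). The label set, the two features, and the weights are made up; real CRFs use dynamic programming rather than enumeration.

import itertools, math

# Score every labeling s of o by exp(sum_k lambda_k f_k), normalize by Z(o).

labels = ["OTHER", "NAME"]
o = ["Yesterday", "Pedro", "Domingos", "spoke"]
weights = {"capitalized->NAME": 1.5, "NAME->NAME": 1.0}     # illustrative lambda_k

def features(s_prev, s_t, o_t):
    # binary features of (previous label, current label, current word)
    return {
        "capitalized->NAME": float(o_t[0].isupper() and s_t == "NAME"),
        "NAME->NAME":        float(s_prev == "NAME" and s_t == "NAME"),
    }

def score(s):
    total = 0.0
    for t, o_t in enumerate(o):
        s_prev = s[t - 1] if t > 0 else "START"
        for name, value in features(s_prev, s[t], o_t).items():
            total += weights[name] * value
    return math.exp(total)

Z = sum(score(s) for s in itertools.product(labels, repeat=len(o)))
best = max(itertools.product(labels, repeat=len(o)), key=score)
print(best, score(best) / Z)   # most probable labeling and its conditional probability

With only these two toy features, “Yesterday” is also labeled NAME; that is exactly the kind of error a richer, trained feature set is meant to fix.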
Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]
1. FSM special-case:
linear chain among unknowns, parameters tied across time steps.
[Figure: linear chain St, St+1, St+2, St+3, St+4 over the observation sequence
 O = Ot, Ot+1, Ot+2, Ot+3, Ot+4]

P(s | o) = (1 / Z(o)) ∏t=1…|o| exp( Σk λk fk(st, st−1, o, t) )
2. In general:
CRFs = "Conditionally-trained Markov Network"
arbitrary structure among unknowns
3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]:
Parameters tied across hits from SQL-like queries ("clique templates")
4. Markov Logic Networks [Richardson & Domingos]:
Weights on arbitrary logical sentences
Slides from Cohen & McCallum
Feature Functions
Example fk(st, st−1, o, t):

  f⟨Capitalized, si, sj⟩(st, st−1, o, t) =  1 if Capitalized(ot) ∧ si = st−1 ∧ sj = st
                                            0 otherwise

o = Yesterday Pedro Domingos spoke this example sentence.
     o1        o2    o3       o4    o5   o6      o7
    (states s1 s2 s3 s4 … drawn above the words)

  f⟨Capitalized, s1, s3⟩(s1, s2, o, 2) = 1
Slides from Cohen & McCallum
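A sketch of this kind of indicator feature in code, assuming 0-based positions and placeholder state names; it fires exactly when the current word is capitalized and the (previous, current) state pair matches (si, sj).

# Indicator feature: current word capitalized and (previous, current) states match (si, sj).

def make_capitalized_feature(si, sj):
    def f(s_t, s_prev, o, t):
        return 1 if o[t][0].isupper() and s_prev == si and s_t == sj else 0
    return f

o = "Yesterday Pedro Domingos spoke this example sentence .".split()
f = make_capitalized_feature(si="OTHER", sj="NAME")
print(f("NAME", "OTHER", o, 1))   # 1: "Pedro" is capitalized and OTHER->NAME matches
print(f("NAME", "NAME", o, 1))    # 0: the previous state does not match si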
Learning Parameters of CRFs
Maximize the log-likelihood of parameters Λ = {λk} given training data D:

  L(Λ) = Σ(s,o)∈D log [ (1/Z(o)) ∏t=1…|o| exp( Σk λk fk(st, st−1, o, t) ) ]  −  Σk λk² / (2σ²)

Log-likelihood gradient:

  ∂L/∂λk = Σ(s,o)∈D [ #k(s, o)  −  Σs′ PΛ(s′ | o) #k(s′, o) ]  −  λk / σ²

  where #k(s, o) = Σt fk(st−1, st, o, t)

Methods:
• iterative scaling (quite slow)            [Sha & Pereira 2002] & [Malouf 2002]
• conjugate gradient (much faster)
• limited-memory quasi-Newton methods, BFGS (super fast)
Slides from Cohen & McCallum
General CRFs vs. HMMs
• More general and expressive modeling technique
• Comparable computational efficiency
• Features may be arbitrary functions of any or all
observations
• Parameters need not fully specify generation of
observations; require less training data
• Easy to incorporate domain knowledge
• State means only “state of process”, vs
“state of process” and “observational history I’m keeping”
Slides from Cohen & McCallum
Person name Extraction
[McCallum 2001,
unpublished]
Slides from Cohen & McCallum
Person name Extraction
Slides from Cohen & McCallum
Features in Experiment
Capitalized                  Xxxxx
Mixed Caps                   XxXxxx
All Caps                     XXXXX
Initial Cap                  X….
Contains Digit               xxx5
All lowercase                xxxx
Initial                      X
Punctuation                  .,:;!(), etc
Period                       .
Comma                        ,
Apostrophe                   ‘
Dash                         -
Preceded by HTML tag
Character n-gram classifier says string is a person name (80% accurate)
Hand-built FSM person-name extractor says yes (prec/recall ~ 30/95)
In stopword list (the, of, their, etc)
In honorific list (Mr, Mrs, Dr, Sen, etc)
In person suffix list (Jr, Sr, PhD, etc)
In name particle list (de, la, van, der, etc)
In Census lastname list; segmented by P(name)
In Census firstname list; segmented by P(name)
In locations lists (states, cities, countries)
In company name list (“J. C. Penny”)
In list of company suffixes (Inc, & Associates, Foundation)
Conjunctions of all previous feature pairs, evaluated at the current time step.
Conjunctions of all previous feature pairs, evaluated at the current step and one step ahead.
All previous features, evaluated two steps ahead.
All previous features, evaluated one step behind.
Total number of features = ~500k
Slides from Cohen & McCallum
Training and Testing
• Trained on 65k words from 85 pages, 30
different companies’ web sites.
• Training takes 4 hours on a 1 GHz Pentium.
• Training precision/recall is 96% / 96%.
• Tested on a different set of web pages with
similar size characteristics.
• Testing precision is 92 – 95%,
recall is 89 – 91%.
Slides from Cohen & McCallum
Limitations of Finite State Models
• Finite state models have a linear structure
• Web documents have a hierarchical
structure
– Are we suffering by not modeling this structure
more explicitly?
• How can one learn a hierarchical extraction
model?
Slides from Cohen & McCallum
IE Resources
• Data
– RISE, http://www.isi.edu/~muslea/RISE/index.html
– Linguistic Data Consortium (LDC)
• Penn Treebank, Named Entities, Relations, etc.
– http://www.biostat.wisc.edu/~craven/ie
– http://www.cs.umass.edu/~mccallum/data
• Code
– TextPro, http://www.ai.sri.com/~appelt/TextPro
– MALLET, http://www.cs.umass.edu/~mccallum/mallet
– SecondString, http://secondstring.sourceforge.net/
• Both
– http://www.cis.upenn.edu/~adwait/penntools.html
Slides from Cohen & McCallum
References
[Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; and Weischedel, R. Nymble: a high-performance learning name-finder. In
Proceedings of ANLP’97, p194-201.
[Califf & Mooney 1999], Califf, M.E.; Mooney, R.: Relational Learning of Pattern-Match Rules for Information Extraction, in
Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
[Cohen, Hurst, Jensen, 2002] Cohen, W.; Hurst, M.; Jensen, L.: A flexible learning system for wrapping tables and lists in HTML
documents. Proceedings of The Eleventh International World Wide Web Conference (WWW-2002)
[Cohen, Kautz, McAllester 2000] Cohen, W; Kautz, H.; McAllester, D.: Hardening soft information sources. Proceedings of the
Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000).
[Cohen, 1998] Cohen, W.: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual
Similarity, in Proceedings of ACM SIGMOD-98.
[Cohen, 2000a] Cohen, W.: Data Integration using Similarity Joins and a Word-based Information Representation Language,
ACM Transactions on Information Systems, 18(3).
[Cohen, 2000b] Cohen, W. Automatically Extracting Features for Concept Learning from the Web, Machine Learning:
Proceedings of the Seventeenth International Conference (ML-2000).
[Collins & Singer 1999] Collins, M.; and Singer, Y. Unsupervised models for named entity classification. In Proceedings of the
Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
[De Jong 1982] De Jong, G. An Overview of the FRUMP System. In: Lehnert, W. & Ringle, M. H. (eds), Strategies for Natural
Language Processing. Lawrence Erlbaum, 1982, 149-176.
[Freitag 98] Freitag, D: Information extraction from HTML: application of a general machine learning approach, Proceedings of the
Fifteenth National Conference on Artificial Intelligence (AAAI-98).
[Freitag, 1999], Freitag, D. Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon
University.
[Freitag 2000], Freitag, D: Machine Learning for Information Extraction in Informal Domains, Machine Learning 39(2/3): 99-101
(2000).
[Freitag & Kushmerick, 1999] Freitag, D.; Kushmerick, N.: Boosted Wrapper Induction. Proceedings of the Sixteenth National
Conference on Artificial Intelligence (AAAI-99)
[Freitag & McCallum 1999] Freitag, D. and McCallum, A. Information extraction using HMMs and shrinkage. In Proceedings
AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.
[Kushmerick, 2000] Kushmerick, N: Wrapper Induction: efficiency and expressiveness, Artificial Intelligence, 118(pp 15-68).
[Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F., Conditional Random Fields: Probabilistic Models
for Segmenting and Labeling Sequence Data, In Proceedings of ICML-2001.
[Leek 1997] Leek, T. R. Information extraction using hidden Markov models. Master’s thesis. UC San Diego.
[McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; and Pereira. F., Maximum entropy Markov models for information
extraction and segmentation, In Proceedings of ICML-2000
[Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R. A Novel Use of Statistical Parsing to Extract Information from
Text. Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), p. 226 - 233.
Slides from Cohen & McCallum