Information Extraction from the World Wide Web
Part II: FSM Models
CSE 454
Based on slides by William W. Cohen (Carnegie Mellon University) and Andrew McCallum (University of Massachusetts Amherst), from KDD 2003

Administrivia
• Proposals
  – Due today
  – Viewable by all ???
© Daniel S. Weld

Landscape of IE Techniques
• Pattern Feature Domain
  – NL text vs. text with formatting vs. tables
  – Example of free NL text: "Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR."
• Pattern Scope
  – Site-specific, genre-specific, broad source
• Pattern Complexity
  – Closed set, regular expression, complex expression, ambiguous
• Pattern Combinations
  – Named-entity recognition, …, N-ary relations

Landscape of IE Techniques (1/1): Models
[Figure: five model families, each illustrated on the sentence "Abraham Lincoln was born in Kentucky."]
• Lexicons: classify pre-segmented candidates by list membership ("member? Alabama, Alaska, …, Wisconsin, Wyoming")
• Boundary models: classify BEGIN/END boundaries of a field
• Sliding window: a classifier asks "which class?" for each window, trying alternate window sizes
• Finite state machines: find the most likely state sequence
• Context-free grammars: parse the sentence (NNP NNP V V P; NP, PP, VP, S)
Each model can capture words, formatting, or both.
Slides from Cohen & McCallum

Hidden Markov Models (HMMs)
The standard sequence model in genomics, speech, NLP, …
[Figure: graphical model with states s_{t−1}, s_t, s_{t+1} emitting observations o_{t−1}, o_t, o_{t+1}; equivalent finite state model with transitions, emitting o_1 o_2 … o_8.]
Generates: a state sequence and an observation sequence, with joint probability
  P(s, o) = ∏_{t=1}^{|o|} P(s_t | s_{t−1}) P(o_t | s_t)
Parameters, for all states S = {s_1, s_2, …}:
• Start state probabilities: P(s_1)
• Transition probabilities: P(s_t | s_{t−1})
• Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: maximize the probability of the training observations (with a prior).
Slides from Cohen & McCallum

Example: The Dishonest Casino
A casino has two dice:
• Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
• Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2
The casino player switches back and forth between the fair and loaded die about once every 20 turns.
Game:
1. You bet $1
2. You roll (always with a fair die)
3. The casino player rolls (maybe with the fair die, maybe with the loaded die)
4. Highest number wins $2
Slides from Serafim Batzoglou

Question # 1 – Evaluation
GIVEN: a sequence of rolls by the casino player
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION: How likely is this sequence, given our model of how the casino works?
This is the EVALUATION problem in HMMs.
Slides from Serafim Batzoglou

Question # 2 – Decoding
GIVEN: the same sequence of rolls by the casino player
QUESTION: What portion of the sequence was generated with the fair die, and what portion with the loaded die?
This is the DECODING question in HMMs.
Slides from Serafim Batzoglou

Question # 3 – Learning
GIVEN: the same sequence of rolls by the casino player
QUESTION: How "loaded" is the loaded die? How "fair" is the fair die? How often does the casino player change from fair to loaded, and back?
This is the LEARNING question in HMMs.
Slides from Serafim Batzoglou
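To make the three questions concrete, here is the casino model written out as plain Python data, a minimal sketch; the names STATES, START, TRANS and EMIT are mine, not the slides'. The algorithm sketches later in these notes reuse these tables.

```python
# The dishonest-casino HMM as plain Python data (illustrative names).
STATES = ("F", "L")                       # F = fair die, L = loaded die

START = {"F": 0.5, "L": 0.5}              # start probabilities a_0i

TRANS = {                                 # transition probabilities a_ij
    "F": {"F": 0.95, "L": 0.05},
    "L": {"F": 0.05, "L": 0.95},
}

EMIT = {                                  # emission probabilities e_i(b)
    "F": {r: 1 / 6 for r in "123456"},
    "L": {**{r: 1 / 10 for r in "12345"}, "6": 1 / 2},
}

# Each distribution should sum to 1.
assert abs(sum(START.values()) - 1) < 1e-12
for s in STATES:
    assert abs(sum(TRANS[s].values()) - 1) < 1e-12
    assert abs(sum(EMIT[s].values()) - 1) < 1e-12
```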
The dishonest casino model
[Figure: two-state HMM. FAIR and LOADED each have a self-loop of probability 0.95 and a 0.05 transition to the other state.]
• FAIR: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
• LOADED: P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2
Slides from Serafim Batzoglou

What's this have to do with Info Extraction?
The same two-state structure, with words instead of die rolls:
[Figure: states TEXT and NAME, again with 0.95 self-loops and 0.05 cross transitions.]
• TEXT: P(the | T) = 0.003, P(from | T) = 0.002, …
• NAME: P(Dan | N) = 0.005, P(Sue | N) = 0.003, …

Definition of a hidden Markov model
Definition: a hidden Markov model (HMM) has
• An alphabet Σ = { b_1, b_2, …, b_M }
• A set of states Q = { 1, …, K }
• Transition probabilities between any two states: a_ij = transition probability from state i to state j, with a_i1 + … + a_iK = 1 for all states i = 1…K
• Start probabilities a_0i, with a_01 + … + a_0K = 1
• Emission probabilities within each state: e_i(b) = P(x_t = b | π_t = i), with e_i(b_1) + … + e_i(b_M) = 1 for all states i = 1…K
Slides from Serafim Batzoglou

An HMM is memory-less
At each time step t, the only thing that affects future states is the current state π_t:
  P(π_{t+1} = k | "whatever happened so far")
  = P(π_{t+1} = k | π_1, π_2, …, π_t, x_1, x_2, …, x_t)
  = P(π_{t+1} = k | π_t)
Slides from Serafim Batzoglou

A parse of a sequence
Given a sequence x = x_1……x_N, a parse of x is a sequence of states π = π_1, ……, π_N.
[Figure: trellis with states 1…K at each of the N positions, emitting x_1 x_2 x_3 … x_N.]
Slides from Serafim Batzoglou

Likelihood of a parse
Given a sequence x = x_1……x_N and a parse π = π_1, ……, π_N, how likely is the parse (given our HMM)?
  P(x, π) = P(x_1, …, x_N, π_1, ……, π_N)
  = P(x_N, π_N | x_1…x_{N−1}, π_1, ……, π_{N−1}) · P(x_1…x_{N−1}, π_1, ……, π_{N−1})
  = P(x_N, π_N | π_{N−1}) · P(x_1…x_{N−1}, π_1, ……, π_{N−1})
  = … = P(x_N, π_N | π_{N−1}) P(x_{N−1}, π_{N−1} | π_{N−2}) …… P(x_2, π_2 | π_1) P(x_1, π_1)
  = P(x_N | π_N) P(π_N | π_{N−1}) …… P(x_2 | π_2) P(π_2 | π_1) P(x_1 | π_1) P(π_1)
  = a_{0π_1} a_{π_1π_2} …… a_{π_{N−1}π_N} e_{π_1}(x_1) …… e_{π_N}(x_N)
Slides from Serafim Batzoglou

Example: the dishonest casino
Let the sequence of rolls be x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4.
What is the likelihood of π = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair? (Say the initial probabilities are a_0Fair = ½, a_0Loaded = ½.)
  ½ × P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair)
  = ½ × (1/6)^10 × (0.95)^9 = 0.00000000521158647211 ≈ 5.2 × 10^−9
Slides from Serafim Batzoglou

Example: the dishonest casino
So the likelihood that the die is fair during this whole run is about 5.2 × 10^−9.
OK, but what is the likelihood of π = Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded?
  ½ × P(1 | Loaded) P(Loaded | Loaded) … P(4 | Loaded)
  = ½ × (1/10)^8 × (1/2)^2 × (0.95)^9 = 0.00000000078781176215 ≈ 7.9 × 10^−10
Therefore, the probability that the die is fair all the way is of the same order of magnitude as the probability that it is loaded all the way.
Slides from Serafim Batzoglou

Example: the dishonest casino
Now let the sequence of rolls be x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6.
What is the likelihood of π = F, F, …, F?
  ½ × (1/6)^10 × (0.95)^9 ≈ 5.2 × 10^−9, same as before
What is the likelihood of π = L, L, …, L?
  ½ × (1/10)^4 × (1/2)^6 × (0.95)^9 = 0.00000049238235134735 ≈ 4.9 × 10^−7
So it is about 100 times more likely that the die is loaded.
Slides from Serafim Batzoglou
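The closed form a_{0π_1} a_{π_1π_2} …… e_{π_N}(x_N) is easy to check numerically. A small sketch reusing the STATES/TRANS/EMIT tables defined earlier (the function name is mine):

```python
def parse_likelihood(x, parse, start, trans, emit):
    """P(x, pi) = a_{0 pi_1} e_{pi_1}(x_1) * prod_{i>1} a_{pi_{i-1} pi_i} e_{pi_i}(x_i)."""
    p = start[parse[0]] * emit[parse[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[parse[i - 1]][parse[i]] * emit[parse[i]][x[i]]
    return p

x = "1215621624"                                            # the rolls 1,2,1,5,6,2,1,6,2,4
print(parse_likelihood(x, "F" * 10, START, TRANS, EMIT))    # ~5.2e-09
print(parse_likelihood(x, "L" * 10, START, TRANS, EMIT))    # ~7.9e-10
```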
The three main questions on HMMs
1. Evaluation
   GIVEN an HMM M and a sequence x,
   FIND Prob[ x | M ]
2. Decoding
   GIVEN an HMM M and a sequence x,
   FIND the sequence π of states that maximizes P[ x, π | M ]
3. Learning
   GIVEN an HMM M with unspecified transition/emission probabilities, and a sequence x,
   FIND parameters θ = (e_i(.), a_ij) that maximize P[ x | θ ]
Slides from Serafim Batzoglou

Let's not be confused by notation
P[ x | M ]: the probability that sequence x was generated by the model.
The model is: architecture (number of states, etc.) + parameters θ = (a_ij, e_i(.)).
So P[ x | M ] is the same as P[ x | θ ], and as P[ x ], when the architecture and the parameters, respectively, are implied.
Similarly, P[ x, π | M ], P[ x, π | θ ] and P[ x, π ] are the same when the architecture and the parameters are implied.
In the LEARNING problem we always write P[ x | θ ] to emphasize that we are seeking the θ* that maximizes P[ x | θ ].
Slides from Serafim Batzoglou

Problem 1: Decoding
Find the best parse of a sequence.
Slides from Serafim Batzoglou

Decoding
GIVEN x = x_1x_2……x_N, we want to find π = π_1, ……, π_N such that P[ x, π ] is maximized:
  π* = argmax_π P[ x, π ]
We can use dynamic programming! Let
  V_k(i) = max_{π_1,…,π_{i−1}} P[ x_1…x_{i−1}, π_1, …, π_{i−1}, x_i, π_i = k ]
         = probability of the most likely sequence of states ending at π_i = k
Slides from Serafim Batzoglou

Decoding – main idea
Given that, for all states k and a fixed position i,
  V_k(i) = max_{π_1,…,π_{i−1}} P[ x_1…x_{i−1}, π_1, …, π_{i−1}, x_i, π_i = k ],
what is V_l(i+1)? From the definition,
  V_l(i+1) = max_{π_1,…,π_i} P[ x_1…x_i, π_1, …, π_i, x_{i+1}, π_{i+1} = l ]
  = max_{π_1,…,π_i} P(x_{i+1}, π_{i+1} = l | x_1…x_i, π_1, …, π_i) · P[ x_1…x_i, π_1, …, π_i ]
  = max_{π_1,…,π_i} P(x_{i+1}, π_{i+1} = l | π_i) · P[ x_1…x_{i−1}, π_1, …, π_{i−1}, x_i, π_i ]
  = max_k [ P(x_{i+1}, π_{i+1} = l | π_i = k) · max_{π_1,…,π_{i−1}} P[ x_1…x_{i−1}, π_1, …, π_{i−1}, x_i, π_i = k ] ]
  = e_l(x_{i+1}) · max_k a_kl V_k(i)
Remember: V_k(i) = probability of the most likely state sequence ending with π_i = k.
Slides from Serafim Batzoglou

The Viterbi Algorithm
Dynamic programming:
[Figure: K × N trellis over positions x_1 x_2 …… x_i x_{i+1} …… x_N; cell V_j(i+1) is filled from column i.]
Time: O(K²N)   Space: O(KN)
Slides from Serafim Batzoglou

The Viterbi Algorithm
Input: observation sequence x = x_1……x_N
Initialization:
  V_0(0) = 1; V_k(0) = 0 for all k > 0 (position 0 is an imaginary first position)
Iteration (i from 1 to N, j from 1 to K):
  V_j(i) = e_j(x_i) · max_k a_kj V_k(i−1)
  Ptr_j(i) = argmax_k a_kj V_k(i−1)
Termination:
  P(x, π*) = max_k V_k(N)
Traceback:
  π_N* = argmax_k V_k(N); π_{i−1}* = Ptr_{π_i*}(i), for i = N, …, 2
Slides from Serafim Batzoglou
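A compact sketch of the algorithm just described, worked in log space (anticipating the underflow note that follows); it reuses the casino tables from above, and the function name is mine:

```python
import math

def viterbi(x, states, start, trans, emit):
    """Return (pi*, log P(x, pi*)): the most likely state sequence for x."""
    # Initialization: V_j(1) = a_0j e_j(x_1), in logs.
    V = [{j: math.log(start[j]) + math.log(emit[j][x[0]]) for j in states}]
    ptr = [{}]
    for i in range(1, len(x)):
        V.append({})
        ptr.append({})
        for j in states:
            # V_j(i) = log e_j(x_i) + max_k [ V_k(i-1) + log a_kj ]
            k = max(states, key=lambda k: V[i - 1][k] + math.log(trans[k][j]))
            V[i][j] = math.log(emit[j][x[i]]) + V[i - 1][k] + math.log(trans[k][j])
            ptr[i][j] = k
    # Termination and traceback.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return "".join(reversed(path)), V[-1][last]

path, logp = viterbi("123456123456626364656626364656", STATES, START, TRANS, EMIT)
print(path)   # expect a run of F's followed by a run of L's (cf. the example below)
```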
Viterbi Algorithm – a practical detail
Underflows are a significant problem:
  P[ x_1, …, x_i, π_1, …, π_i ] = a_{0π_1} a_{π_1π_2} …… a_{π_{i−1}π_i} e_{π_1}(x_1) …… e_{π_i}(x_i)
These products become extremely small – underflow.
Solution: take the logs of all values:
  V_l(i) = log e_l(x_i) + max_k [ V_k(i−1) + log a_kl ]
Slides from Serafim Batzoglou

Example
Let x be a sequence with a portion of ~1/6 6's, followed by a portion of ~1/2 6's:
  x = 123456123456…123456 626364656…1626364656
Then it is not hard to show that the optimal parse is (exercise):
  FFF…………………...F LLL………………………...L
The 6 characters "123456" parsed as F contribute (0.95)^6 (1/6)^6 ≈ 1.6 × 10^−5; parsed as L they contribute (0.95)^6 (1/2)^1 (1/10)^5 ≈ 0.4 × 10^−5.
The 6 characters "162636" parsed as F contribute (0.95)^6 (1/6)^6 ≈ 1.6 × 10^−5; parsed as L they contribute (0.95)^6 (1/2)^3 (1/10)^3 ≈ 9.0 × 10^−5.
Slides from Serafim Batzoglou

Decoding = Extraction
• Add a slide making the connection

Problem 2: Evaluation
Find the likelihood that a sequence is generated by the model.
Slides from Serafim Batzoglou

Why important?

Generating a sequence by the model
Given an HMM, we can generate a sequence of length n as follows:
1. Start at state π_1 according to probability a_{0π_1}
2. Emit letter x_1 according to probability e_{π_1}(x_1)
3. Go to state π_2 according to probability a_{π_1π_2}
4. … until emitting x_n
[Figure: trellis from a start node 0, with edges labeled a_02 and e_2(x_1), out to x_n.]
Slides from Serafim Batzoglou

A couple of questions
Given an observation sequence x:
• What is the probability that x was generated by the model?
• Given a position i, what is the most likely state that emitted x_i?

Example: the dishonest casino
Say x = 12341623162616364616234161221341. For the marked 11-letter box (the 6-rich stretch highlighted on the original slide):
  P(box: FFFFFFFFFFF) = (1/6)^11 × (0.95)^12 = 2.76 × 10^−9 × 0.54 = 1.49 × 10^−9
  P(box: LLLLLLLLLLL) = [ (1/2)^6 × (1/10)^5 ] × (0.95)^10 × (0.05)^2 = 1.56 × 10^−7 × 1.5 × 10^−3 = 0.23 × 10^−9
Most likely path: π = FF……F.
However: the marked letters are more likely to be L than the unmarked letters.
Slides from Serafim Batzoglou

Evaluation
We will develop algorithms that allow us to compute:
• P(x) – the probability of x given the model
• P(x_i…x_j) – the probability of a substring of x given the model
• P(π_i = k | x) – the probability that the i-th state is k, given x
The last is a more refined measure of which states x may be in.
Slides from Serafim Batzoglou

The Forward Algorithm
We want to calculate P(x), the probability of x given the HMM, by summing over all possible ways of generating x:
  P(x) = Σ_π P(x, π) = Σ_π P(x | π) P(π)
To avoid summing over an exponential number of paths π, define
  f_k(i) = P(x_1…x_i, π_i = k)   (the forward probability)
Slides from Serafim Batzoglou

The Forward Algorithm – derivation
Define the forward probability:
  f_k(i) = P(x_1…x_i, π_i = k)
  = Σ_{π_1…π_{i−1}} P(x_1…x_{i−1}, π_1, …, π_{i−1}, π_i = k) · e_k(x_i)
  = Σ_l Σ_{π_1…π_{i−2}} P(x_1…x_{i−1}, π_1, …, π_{i−2}, π_{i−1} = l) · a_lk · e_k(x_i)
  = e_k(x_i) · Σ_l f_l(i−1) a_lk
Slides from Serafim Batzoglou

The Forward Algorithm
We can compute f_k(i) for all k, i, using dynamic programming!
Initialization:
  f_0(0) = 1; f_k(0) = 0 for all k > 0
Iteration:
  f_k(i) = e_k(x_i) Σ_l f_l(i−1) a_lk
Termination:
  P(x) = Σ_k f_k(N) a_k0
where a_k0 is the probability that the terminating state is k (usually a_k0 = a_0k).
Slides from Serafim Batzoglou
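A sketch of the recursion with the rescaling trick noted on the Computational Complexity slide below (multiply each column by a constant and accumulate its log), since the raw f_k(i) values underflow just like Viterbi's products. It assumes a_k0 = a_0k for the terminating probabilities, as the slide suggests:

```python
import math

def forward_logprob(x, states, start, trans, emit):
    """Return log P(x) via the forward algorithm with per-column rescaling."""
    f = {k: start[k] * emit[k][x[0]] for k in states}       # f_k(1) = a_0k e_k(x_1)
    logp = 0.0
    for i in range(1, len(x)):
        # f_k(i) = e_k(x_i) * sum_l f_l(i-1) a_lk
        f = {k: emit[k][x[i]] * sum(f[l] * trans[l][k] for l in states)
             for k in states}
        c = sum(f.values())                                 # rescaling constant
        f = {k: v / c for k, v in f.items()}
        logp += math.log(c)
    # Termination: P(x) = sum_k f_k(N) a_k0, with a_k0 = a_0k here.
    return logp + math.log(sum(f[k] * start[k] for k in states))

print(forward_logprob("1215621624", STATES, START, TRANS, EMIT))
```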
Relation between Forward and Viterbi
VITERBI:
  Initialization: V_0(0) = 1; V_k(0) = 0 for all k > 0
  Iteration: V_j(i) = e_j(x_i) · max_k V_k(i−1) a_kj
  Termination: P(x, π*) = max_k V_k(N)
FORWARD:
  Initialization: f_0(0) = 1; f_k(0) = 0 for all k > 0
  Iteration: f_l(i) = e_l(x_i) · Σ_k f_k(i−1) a_kl
  Termination: P(x) = Σ_k f_k(N) a_k0
Slides from Serafim Batzoglou

Motivation for the Backward Algorithm
We want to compute P(π_i = k | x), the probability distribution of the i-th state, given x.
We start by computing
  P(π_i = k, x) = P(x_1…x_i, π_i = k, x_{i+1}…x_N)
  = P(x_1…x_i, π_i = k) · P(x_{i+1}…x_N | x_1…x_i, π_i = k)
  = P(x_1…x_i, π_i = k) · P(x_{i+1}…x_N | π_i = k)
  = f_k(i) · b_k(i)   (forward × backward)
Then P(π_i = k | x) = P(π_i = k, x) / P(x).
Slides from Serafim Batzoglou

The Backward Algorithm – derivation
Define the backward probability:
  b_k(i) = P(x_{i+1}…x_N | π_i = k)
  = Σ_{π_{i+1}…π_N} P(x_{i+1}, x_{i+2}, …, x_N, π_{i+1}, …, π_N | π_i = k)
  = Σ_l Σ_{π_{i+2}…π_N} P(x_{i+1}, x_{i+2}, …, x_N, π_{i+1} = l, π_{i+2}, …, π_N | π_i = k)
  = Σ_l e_l(x_{i+1}) a_kl Σ_{π_{i+2}…π_N} P(x_{i+2}, …, x_N, π_{i+2}, …, π_N | π_{i+1} = l)
  = Σ_l e_l(x_{i+1}) a_kl b_l(i+1)
Slides from Serafim Batzoglou

The Backward Algorithm
We can compute b_k(i) for all k, i, using dynamic programming.
Initialization:
  b_k(N) = a_k0, for all k
Iteration:
  b_k(i) = Σ_l e_l(x_{i+1}) a_kl b_l(i+1)
Termination:
  P(x) = Σ_l a_0l e_l(x_1) b_l(1)
Slides from Serafim Batzoglou
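Putting Forward and Backward together gives the posterior P(π_i = k | x). A sketch, unscaled for brevity (fine for short sequences; long ones need the rescaling mentioned on the complexity slide below), again assuming a_k0 = a_0k:

```python
def posteriors(x, states, start, trans, emit):
    """Return a list of dicts: position i -> {state k: P(pi_i = k | x)}."""
    N = len(x)
    # Forward: f_k(i) = e_k(x_i) * sum_l f_l(i-1) a_lk
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for i in range(1, N):
        f.append({k: emit[k][x[i]] * sum(f[-1][l] * trans[l][k] for l in states)
                  for k in states})
    # Backward: b_k(N) = a_k0 (= a_0k here); b_k(i) = sum_l e_l(x_{i+1}) a_kl b_l(i+1)
    b = [{k: start[k] for k in states}]
    for i in range(N - 2, -1, -1):
        b.insert(0, {k: sum(emit[l][x[i + 1]] * trans[k][l] * b[0][l] for l in states)
                     for k in states})
    px = sum(f[N - 1][k] * start[k] for k in states)        # P(x)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(N)]

for i, dist in enumerate(posteriors("1215621624", STATES, START, TRANS, EMIT), 1):
    print(i, {k: round(p, 3) for k, p in dist.items()})
```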
Computational Complexity
What is the running time and space required for Forward and Backward?
  Time: O(K²N)   Space: O(KN)
Useful implementation techniques to avoid underflows:
• Viterbi: sum of logs
• Forward/Backward: rescaling at each position by multiplying by a constant
Slides from Serafim Batzoglou

Information Extraction
• We want specific info from text documents
• For example, from colloquium announcement emails, we want:
  – Speaker name
  – Location
  – Start time
© Daniel S. Weld

Simple HMM for Job Titles
[Figure-only slide.]
© Daniel S. Weld

HMMs for Info Extraction
For sparse extraction tasks:
• Separate HMM for each type of target
• Each HMM should
  – Model the entire document
  – Consist of target and non-target states
  – Not necessarily be fully connected
Given the HMM, how do we extract info?
© Daniel S. Weld / Slide by Okan Basegmez

How to Learn the HMM?
• Two questions: structure & parameters
© Daniel S. Weld

Simplest Case
• Fix the structure
• Learn transition & emission probabilities
• Training data…?
  – Label each word as target or non-target
• Challenges
  – Sparse training data
  – Unseen words have…
  – Smoothing!
© Daniel S. Weld

Problem 3: Learning
• So far we have assumed we know the underlying model λ = (A, B, π)
• Often these parameters are estimated on annotated training data, which has two drawbacks:
  – Annotation is difficult and/or expensive
  – Training data is different from the current data
• We want to maximize the parameters with respect to the current data, i.e., we're looking for a model λ′ such that λ′ = argmax_λ P(O | λ)
© Daniel S. Weld / Slide by Bonnie Dorr

Problem 3: Learning
• Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ′ such that λ′ = argmax_λ P(O | λ)
• But it is possible to find a local maximum!
• Given an initial model λ, we can always find a model λ′ such that P(O | λ′) ≥ P(O | λ)
© Daniel S. Weld / Slide by Bonnie Dorr

Parameter Re-estimation
• Use a hill-climbing algorithm
  – Called the forward-backward (or Baum-Welch) algorithm
• Idea:
  – 0. Use an initial parameter instantiation
  – 1. Loop:
    • Iteratively re-estimate the parameters
    • Improve the probability that the given observations are generated by the new parameters
  – 2. Until the estimates don't change much
© Daniel S. Weld / Slide by Bonnie Dorr

Parameter Re-estimation
• Three parameters need to be re-estimated:
  – Initial state distribution: π_i
  – Transition probabilities: a_{i,j}
  – Emission probabilities: e_i(o_t)
© Daniel S. Weld / Slide by Bonnie Dorr
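A single re-estimation step of the forward-backward (Baum-Welch) loop, sketched for one training sequence over the casino alphabet; this illustrates the idea above and is not the exact code behind the slides. For brevity it uses b_k(N) = 1 (no explicit terminating probabilities) and no rescaling:

```python
def baum_welch_step(x, states, start, trans, emit, alphabet="123456"):
    """One EM step: return re-estimated (start, trans, emit) from sequence x."""
    N = len(x)
    # Forward and backward tables (unscaled; see the underflow notes above).
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for i in range(1, N):
        f.append({k: emit[k][x[i]] * sum(f[-1][l] * trans[l][k] for l in states)
                  for k in states})
    b = [dict.fromkeys(states, 1.0)]
    for i in range(N - 2, -1, -1):
        b.insert(0, {k: sum(emit[l][x[i + 1]] * trans[k][l] * b[0][l] for l in states)
                     for k in states})
    px = sum(f[N - 1][k] for k in states)                   # P(x | current params)
    # gamma_i(k) = P(pi_i = k | x): expected state occupancies.
    g = [{k: f[i][k] * b[i][k] / px for k in states} for i in range(N)]
    new_start = dict(g[0])
    new_trans, new_emit = {}, {}
    for k in states:
        occ = sum(g[i][k] for i in range(N - 1))
        # Expected transitions k -> l: sum_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)
        new_trans[k] = {l: sum(f[i][k] * trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l]
                               for i in range(N - 1)) / (px * occ)
                        for l in states}
        tot = sum(g[i][k] for i in range(N))
        new_emit[k] = {sym: sum(g[i][k] for i in range(N) if x[i] == sym) / tot
                       for sym in alphabet}
    return new_start, new_trans, new_emit

# Iterate until the estimates stop changing much (the loop described above).
params = (START, TRANS, EMIT)
for _ in range(20):
    params = baum_welch_step("123456123456626364656626364656", STATES, *params)
```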
More details
• Explain remainder
• Also connect evaluation to learning

We Want More than an Atomic View of Words
We would like a richer representation of text: many arbitrary, overlapping features of the words, e.g.:
• identity of word
• ends in "-ski"
• is capitalized
• is part of a noun phrase
• is "Wisniewski"
• is in a list of city names
• is under node X in WordNet
• is in bold font
• is indented
• is in hyperlink anchor
• last person name was female
• next two words are "and Associates"
[Figure: these features attached to the observation nodes o_{t−1}, o_t, o_{t+1} of the sequence model s_{t−1}, s_t, s_{t+1}.]
Slides from Cohen & McCallum

Problems with Richer Representation and a Joint Model
These arbitrary features are not independent:
– Multiple levels of granularity (chars, words, phrases)
– Multiple dependent modalities (words, formatting, layout)
– Past & future
Two choices:
• Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
• Ignore the dependencies. This causes "over-counting" of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
Slides from Cohen & McCallum

Conditional Sequence Models
• We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(s | o) instead of P(s, o):
  – Can examine features, but is not responsible for generating them
  – Don't have to explicitly model their dependencies
  – Don't "waste modeling effort" trying to generate what we are given at test time anyway
Slides from Cohen & McCallum

Conditional Finite State Sequence Models  [McCallum, Freitag & Pereira, 2000] [Lafferty, McCallum, Pereira 2001]
From HMMs to CRFs: let s = s_1, s_2, …, s_n and o = o_1, o_2, …, o_n.
Joint:
  P(s, o) = ∏_{t=1}^{|o|} P(s_t | s_{t−1}) P(o_t | s_t)
Conditional:
  P(s | o) = (1 / P(o)) ∏_{t=1}^{|o|} P(s_t | s_{t−1}) P(o_t | s_t)
           = (1 / Z(o)) ∏_{t=1}^{|o|} Φ_s(s_t, s_{t−1}) Φ_o(o_t, s_t)
where Φ_o(o_t, s_t) = exp( Σ_k λ_k f_k(s_t, o_t) ).
(A super-special case of Conditional Random Fields.)
Slides from Cohen & McCallum

Conditional Random Fields  [Lafferty, McCallum, Pereira 2001]
1. FSM special case: linear chain among unknowns, parameters tied across time steps:
  P(s | o) = (1 / Z(o)) exp( Σ_{t=1}^{|o|} Σ_k λ_k f_k(s_t, s_{t−1}, o, t) )
2. In general: CRFs = "conditionally-trained Markov network", arbitrary structure among unknowns
3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]: parameters tied across hits from SQL-like queries ("clique templates")
4. Markov Logic Networks [Richardson & Domingos]: weights on arbitrary logical sentences
Slides from Cohen & McCallum

Feature Functions
Example f_k(s_t, s_{t−1}, o, t):
  f_{Capitalized, s_i, s_j}(s_t, s_{t−1}, o, t) = 1 if Capitalized(o_t) ∧ s_{t−1} = s_i ∧ s_t = s_j, 0 otherwise
o = "Yesterday Pedro Domingos spoke this example sentence." = o_1 o_2 … o_7
[Figure: the sentence with a state sequence s_1 s_2 s_3 s_4 … drawn above the words.]
For instance, f_{Capitalized, s_1, s_3}(s_1, s_2, o, 2) = 1: the feature fires at t = 2, where o_2 = "Pedro" is capitalized.
Slides from Cohen & McCallum
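The indicator above translates directly into code. A sketch (the state names and 0-indexed position are mine; the slide's example uses 1-indexed t = 2 for "Pedro"):

```python
def make_capitalized_feature(si, sj):
    """f_{Capitalized, si, sj}(s_t, s_{t-1}, o, t): fires when o_t is capitalized
    and the transition s_{t-1} = si -> s_t = sj is taken."""
    def f(s_t, s_prev, o, t):
        return 1 if o[t][0].isupper() and s_prev == si and s_t == sj else 0
    return f

o = "Yesterday Pedro Domingos spoke this example sentence.".split()
f = make_capitalized_feature("OTHER", "PERSON")    # hypothetical state names
print(f("PERSON", "OTHER", o, 1))                  # 1: o[1] = "Pedro" is capitalized
print(f("PERSON", "OTHER", o, 4))                  # 0: "this" is not capitalized
```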
Learning Parameters of CRFs
Maximize the log-likelihood of parameters Λ = {λ_k} given training data D:
  L(Λ) = Σ_{(s,o)∈D} log [ (1/Z(o)) exp( Σ_{t=1}^{|o|} Σ_k λ_k f_k(s_t, s_{t−1}, o, t) ) ] − Σ_k λ_k² / (2σ²)
Log-likelihood gradient:
  ∂L/∂λ_k = Σ_{(s,o)∈D} #_k(s, o) − Σ_{(s,o)∈D} Σ_{s′} P_Λ(s′ | o) #_k(s′, o) − λ_k / σ²
where #_k(s, o) = Σ_t f_k(s_{t−1}, s_t, o, t).
Methods:
• iterative scaling (quite slow) [Sha & Pereira 2002] & [Malouf 2002]
• conjugate gradient (much faster)
• limited-memory quasi-Newton methods, BFGS (super fast)
Slides from Cohen & McCallum

General CRFs vs. HMMs
• More general and expressive modeling technique
• Comparable computational efficiency
• Features may be arbitrary functions of any or all observations
• Parameters need not fully specify generation of observations; they require less training data
• Easy to incorporate domain knowledge
• State means only "state of process", vs. "state of process" plus "observational history I'm keeping"
Slides from Cohen & McCallum

Person Name Extraction  [McCallum 2001, unpublished]
[Two figure-only slides.]
Slides from Cohen & McCallum

Features in Experiment
[The slide lists the feature set in two columns; flattened here.]
• Capitalized: Xxxxx
• Mixed caps: XxXxxx
• All caps: XXXXX
• Initial cap: X….
• Contains digit: xxx5
• All lowercase: xxxx
• Initial: X
• Punctuation: .,:;!(), etc.
• Period: .
• Comma: ,
• Apostrophe: '
• Dash: -
• Preceded by HTML tag
• Character n-gram classifier says string is a person name (80% accurate)
• Hand-built FSM person-name extractor says yes (precision/recall ~ 30/95)
• In stopword list (the, of, their, etc.)
• In honorific list (Mr, Mrs, Dr, Sen, etc.)
• In person suffix list (Jr, Sr, PhD, etc.)
• In name particle list (de, la, van, der, etc.)
• In Census lastname list; segmented by P(name)
• In Census firstname list; segmented by P(name)
• In locations lists (states, cities, countries)
• In company name list ("J. C. Penny")
• In list of company suffixes (Inc, & Associates, Foundation)
• Conjunctions of all previous feature pairs, evaluated at the current time step
• Conjunctions of all previous feature pairs, evaluated at the current step and one step ahead
• All previous features, evaluated two steps ahead
• All previous features, evaluated one step behind
Total number of features ≈ 500k
Slides from Cohen & McCallum

Training and Testing
• Trained on 65k words from 85 pages, 30 different companies' web sites
• Training takes 4 hours on a 1 GHz Pentium
• Training precision/recall is 96% / 96%
• Tested on a different set of web pages with similar size characteristics
• Testing precision is 92–95%, recall is 89–91%
Slides from Cohen & McCallum

Limitations of Finite State Models
• Finite state models have a linear structure
• Web documents have a hierarchical structure
  – Are we suffering by not modeling this structure more explicitly?
• How can one learn a hierarchical extraction model?
Slides from Cohen & McCallum

IE Resources
• Data
  – RISE, http://www.isi.edu/~muslea/RISE/index.html
  – Linguistic Data Consortium (LDC)
    • Penn Treebank, Named Entities, Relations, etc.
  – http://www.biostat.wisc.edu/~craven/ie
  – http://www.cs.umass.edu/~mccallum/data
• Code
  – TextPro, http://www.ai.sri.com/~appelt/TextPro
  – MALLET, http://www.cs.umass.edu/~mccallum/mallet
  – SecondString, http://secondstring.sourceforge.net/
• Both
  – http://www.cis.upenn.edu/~adwait/penntools.html
Slides from Cohen & McCallum

References
• [Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; Weischedel, R.: Nymble: a high-performance learning name-finder. In Proceedings of ANLP'97, pp. 194-201.
• [Califf & Mooney 1999] Califf, M.E.; Mooney, R.: Relational Learning of Pattern-Match Rules for Information Extraction. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
• [Cohen, Hurst, Jensen, 2002] Cohen, W.; Hurst, M.; Jensen, L.: A flexible learning system for wrapping tables and lists in HTML documents. In Proceedings of the Eleventh International World Wide Web Conference (WWW-2002).
• [Cohen, Kautz, McAllester 2000] Cohen, W.; Kautz, H.; McAllester, D.: Hardening soft information sources. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000).
• [Cohen, 1998] Cohen, W.: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity. In Proceedings of ACM SIGMOD-98.
• [Cohen, 2000a] Cohen, W.: Data Integration using Similarity Joins and a Word-based Information Representation Language. ACM Transactions on Information Systems, 18(3).
• [Cohen, 2000b] Cohen, W.: Automatically Extracting Features for Concept Learning from the Web. In Machine Learning: Proceedings of the Seventeenth International Conference (ML-2000).
• [Collins & Singer 1999] Collins, M.; Singer, Y.: Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
• [De Jong 1982] De Jong, G.: An Overview of the FRUMP System. In: Lehnert, W. & Ringle, M. H. (eds.), Strategies for Natural Language Processing. Lawrence Erlbaum, 1982, pp. 149-176.
• [Freitag 98] Freitag, D.: Information extraction from HTML: application of a general machine learning approach. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98).
• [Freitag, 1999] Freitag, D.: Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon University.
• [Freitag 2000] Freitag, D.: Machine Learning for Information Extraction in Informal Domains. Machine Learning 39(2/3): 99-101 (2000).
• [Freitag & Kushmerick, 1999] Freitag, D.; Kushmerick, N.: Boosted Wrapper Induction. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
• [Freitag & McCallum 1999] Freitag, D.; McCallum, A.: Information extraction using HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.
• [Kushmerick, 2000] Kushmerick, N.: Wrapper Induction: efficiency and expressiveness. Artificial Intelligence, 118: 15-68.
• [Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; Pereira, F.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML-2001.
• [Leek 1997] Leek, T. R.: Information extraction using hidden Markov models. Master's thesis, UC San Diego.
• [McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; Pereira, F.: Maximum entropy Markov models for information extraction and segmentation. In Proceedings of ICML-2000.
• [Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R.: A Novel Use of Statistical Parsing to Extract Information from Text. In Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), pp. 226-233.
Slides from Cohen & McCallum