Information Extraction, Data Mining and Topic Discovery with Probabilistic Models
Andrew McCallum, Computer Science Department, University of Massachusetts Amherst
Joint work with Charles Sutton, Aron Culotta, Wei Li, Xuerui Wang, Andres Corrada, Ben Wellner, Chris Pal, Michael Hay, Natasha Mohanty, David Mimno, Gideon Mann.

Goal: Mine actionable knowledge from unstructured text.

Extracting Job Openings from the Web
Example (from an HR office: jobs, but not HR jobs):
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.htm
OtherCompanyJobs: foodscience.com-Job1

A Portal for Job Openings; Data Mining the Extracted Job Information

IE from Research Papers [McCallum et al '99]
Mining Research Papers [Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004]

IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents, several centuries old: Qing Dynasty archives, memos, newspaper articles, diaries

Why prefer "knowledge base search" over "page search"?
• Targeted, restricted universe of hits
– Don't show resumes when I'm looking for job openings.
• Specialized queries
– Topic-specific
– Multi-dimensional
– Based on information spread over multiple pages.
• Get correct granularity
– Site, page, paragraph
• Specialized display
– Super-targeted hit summarization in terms of DB slot values
• Ability to support sophisticated data mining
Information Extraction is needed to automatically build the knowledge base.

What is "Information Extraction"?
As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying...

IE fills the database slots:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Software Foundation

As a family of techniques:
Information Extraction = segmentation + classification + association + clustering
Applied to the same story, segmentation and classification yield the labeled segments:
Microsoft Corporation / CEO / Bill Gates / Microsoft / Gates / Microsoft / Bill Veghte / Microsoft / VP / Richard Stallman / founder / Free Software Foundation

Larger Context
Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining
Document collection → Database → Actionable knowledge
Discover patterns: entity types, links/relations, events. Prediction, outlier detection, decision support.

Landscape of IE Tasks (1/4): Pattern Feature Domain
From text paragraphs without formatting, to grammatical sentences with some formatting & links, to non-grammatical snippets with rich formatting & links, to tables.
Example (grammatical prose): "Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR."

Landscape of IE Tasks (2/4): Pattern Scope
• Web site specific (formatting): Amazon.com book pages
• Genre specific (layout): resumes
• Wide, non-specific (language): university names

Landscape of IE Tasks (3/4): Pattern Complexity
• Closed set, e.g. U.S. states: "He was born in Alabama..." "The big Wyoming sky..."
• Regular set, e.g. U.S. phone numbers: "Phone: (413) 545-1323" "The CALD main office can be reached at 412-268-1299"
• Complex pattern, e.g. U.S.
postal addresses: "University of Arkansas, P.O. Box 140, Hope, AR 71802" "Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210"
• Ambiguous patterns, needing context and many sources of evidence, e.g. person names: "...was among the six houses sold by Hope Feldman that year." "Pawel Opalinski, Software Engineer at WhizBang Labs."

Landscape of IE Tasks (4/4): Pattern Combinations
"Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt."
• Single entity ("named entity" extraction): Person: Jack Welch; Person: Jeffrey Immelt; Location: Connecticut
• Binary relationship: Relation: Person-Title, Person: Jack Welch, Title: CEO; Relation: Company-Location, Company: General Electric, Location: Connecticut
• N-ary record: Relation: Succession, Company: General Electric, Title: CEO, Out: Jack Welch, In: Jeffrey Immelt

Evaluation of Single Entity Extraction
TRUTH: Michael Kearns and Sebastian Seung will start Monday's tutorial, followed by Richard M. Karpe and Martin Cooke.
PRED: Michael Kearns and Sebastian Seung will start Monday's tutorial, followed by Richard M. Karpe and Martin Cooke.
(The slide marks 4 true segments and 6 predicted segments, 2 of which match exactly.)
Precision = (# correctly predicted segments) / (# predicted segments) = 2/6
Recall = (# correctly predicted segments) / (# true segments) = 2/4
F1 = harmonic mean of precision & recall = 1 / [ ((1/P) + (1/R)) / 2 ]

State of the Art Performance
• Named entity recognition (Person, Location, Organization, ...): F1 in the high 80's or low- to mid-90's
• Binary relation extraction, e.g. Contained-in(Location1, Location2), Member-of(Person1, Organization1): F1 in the 60's, 70's, or 80's
• Wrapper induction: extremely accurate performance obtainable, but human effort (~30 min) required on each site

Outline
• Examples of IE and Data Mining
• IE with Hidden Markov Models
• Introduction to Conditional Random Fields (CRFs)
• Examples of IE with CRFs
• Sequence Alignment with CRFs
• Semi-supervised Learning

Hidden Markov Models
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, ...
Finite state model (graphical model) over states S = {s1, s2, ...}, generating a state sequence s = s1...s|o| and an observation sequence o = o1...o|o|:
P(s, o) = ∏_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)
Parameters: start state probabilities P(s_1), transition probabilities P(s_t | s_{t-1}), and observation (emission) probabilities P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet.
Training: maximize the probability of training observations (with a prior).

IE with Hidden Markov Models
Given a sequence of observations: "Yesterday Pedro Domingos spoke this example sentence."
and a trained HMM with states such as person name, location name, and background,
find the most likely state sequence (Viterbi): argmax_s P(s, o).
Any words said to be generated by the designated "person name" state are extracted as a person name: Person name: Pedro Domingos

HMM Example: "Nymble" [Bikel et al 1998], [BBN "IdentiFinder"]
Task: named entity extraction. States: start-of-sentence, end-of-sentence, Person, Org, Other (plus five other name classes).
Train on ~500k words of news wire text.
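The segment-level precision, recall, and F1 defined above can be sketched in a few lines (a minimal illustration using the counts from the evaluation example; a segment counts as correct only if its boundaries and type match exactly):

```python
# Segment-level evaluation: a predicted segment is correct only if it
# exactly matches a true segment (same boundaries, same type).
def prf1(num_correct, num_predicted, num_true):
    precision = num_correct / num_predicted
    recall = num_correct / num_true
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the slide: 2 correct out of 6 predicted and 4 true segments.
p, r, f = prf1(num_correct=2, num_predicted=6, num_true=4)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.333 0.5 0.4
```

Note that F1 rewards balance: here precision 1/3 and recall 1/2 combine to 0.4, below their arithmetic mean.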
Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), backing off to P(s_t | s_{t-1}), then P(s_t).
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), backing off to P(o_t | s_t), then P(o_t).

Results:
Case    Language   F1
Mixed   English    93%
Upper   English    91%
Mixed   Spanish    90%

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum '99]

We Want More than an Atomic View of Words
Would like a richer representation of text: many arbitrary, overlapping features of the words, e.g.:
• identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is "Wisniewski"; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; last person name was female; next two words are "and Associates"

Problems with Richer Representation and a Generative Model
These arbitrary features are not independent:
– Multiple levels of granularity (chars, words, phrases)
– Multiple dependent modalities (words, formatting, layout)
– Past & future
Two choices:
• Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
• Ignore the dependencies. This causes "over-counting" of evidence (a la naive Bayes). Big problem when combining evidence, as in Viterbi!

Conditional Sequence Models
• We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o):
– Can examine features, but is not responsible for generating them.
– Don't have to explicitly model their dependencies.
– Don't "waste modeling effort" trying to generate what we are given at test time anyway.
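The Viterbi decoding step from the HMM slides above (find argmax_s P(s, o), then extract words emitted by the "person name" state) can be sketched with a toy two-state model. All probabilities and the tiny name gazetteer below are made up for illustration; a real extractor like Nymble learns them from training text:

```python
import math

# Toy HMM: states, probabilities, and the name lexicon are illustrative.
states = ["background", "person"]
start = {"background": 0.9, "person": 0.1}
trans = {"background": {"background": 0.8, "person": 0.2},
         "person":     {"background": 0.6, "person": 0.4}}

def emit(state, word):
    # Stand-in emission model: a tiny gazetteer of name tokens.
    in_lexicon = word in {"Pedro", "Domingos"}
    if state == "person":
        return 0.45 if in_lexicon else 0.01
    return 0.001 if in_lexicon else 0.14

def viterbi(words):
    # delta[s] = best log-probability of any state sequence ending in s.
    delta = {s: math.log(start[s]) + math.log(emit(s, words[0])) for s in states}
    back = []
    for word in words[1:]:
        prev, step = delta, {}
        back.append({})
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            back[-1][s] = best
            step[s] = prev[best] + math.log(trans[best][s]) + math.log(emit(s, word))
        delta = step
    # Trace back the best path.
    path = [max(states, key=lambda s: delta[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

words = "Yesterday Pedro Domingos spoke this example sentence .".split()
print(viterbi(words))  # "Pedro" and "Domingos" come out tagged 'person'
```

The words decoded into the "person" state are then extracted as the person name, exactly as on the slide.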
Outline
• Examples of IE and Data Mining
• IE with Hidden Markov Models
• Introduction to Conditional Random Fields (CRFs)
• Examples of IE with CRFs
• Sequence Alignment with CRFs
• Semi-supervised Learning

From HMMs to Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
s = s1, s2, ..., sn;  o = o1, o2, ..., on
Joint:
P(s, o) = ∏_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)
Conditional:
P(s | o) = (1 / P(o)) ∏_{t=1}^{|o|} P(s_t | s_{t-1}) P(o_t | s_t)
         = (1 / Z(o)) ∏_{t=1}^{|o|} Φ_s(s_t, s_{t-1}) Φ_o(o_t, s_t)
where Φ_o(t) = exp( Σ_k λ_k f_k(s_t, o_t) ).
(A super-special case of Conditional Random Fields.) Set parameters by maximum likelihood, using an optimization method on the gradient of L.

Linear-Chain Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
Markov on s, conditional dependency on o:
P(s | o) = (1 / Z_o) exp( Σ_{t=1}^{|o|} Σ_j λ_j f_j(s_t, s_{t-1}, o, t) )
The Hammersley-Clifford-Besag theorem stipulates that the CRF has this form: an exponential function of the cliques in the graph. Assuming the dependency structure of the states is tree-shaped (a linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|^2), just like HMMs.

CRFs vs. HMMs
• More general and expressive modeling technique
• Comparable computational efficiency
• Features may be arbitrary functions of any or all observations
• Parameters need not fully specify generation of observations; require less training data
• Easy to incorporate domain knowledge
• State means only "state of process", vs. "state of process" plus "observational history I'm keeping"

Training CRFs
Maximize the log-likelihood of the parameters given the training data {<o, s>^(i)}:
∂L/∂λ_k = Σ_i C_k(s^(i), o^(i)) - Σ_i Σ_s P_Λ(s | o^(i)) C_k(s, o^(i)) - λ_k/σ^2
where C_k(s, o) = Σ_t f_k(o, t, s_{t-1}, s_t).
That is: (feature count using correct labels) - (expected feature count using predicted labels) - (smoothing penalty).

Outline
• Examples of IE and Data Mining
• IE with Hidden Markov Models
• Introduction to Conditional Random Fields (CRFs)
• Examples of IE with CRFs
• Sequence Alignment with CRFs
• Semi-supervised Learning

Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995, at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers. An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk, with the remainder consumed in producer households.
Milk Cows and Production of Milk and Milkfat: United States, 1993-95

Year   Number of      Per Milk Cow 2/      Percentage of Fat    Total 2/
       Milk Cows 1/   Milk      Milkfat    in All Milk          Milk      Milkfat
       (1,000 Head)   (Pounds)  (Pounds)   Produced (Percent)   (Million Pounds)
1993   9,589          15,704    575        3.66                 150,582   5,514.4
1994   9,500          16,175    592        3.66                 153,664   5,623.7
1995   9,461          16,451    602        3.66                 155,644   5,694.3

1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.

Table Extraction from Government Reports [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
100+ documents from www.fedstats.gov
Labels (12 in all):
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ...
Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• ...
• Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
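Line-level features like these are cheap functions of the raw text. A sketch of a few of them follows; the feature names, thresholds, and the alignment heuristic are illustrative stand-ins, not the exact feature set from the paper:

```python
def line_features(line, prev_line=""):
    """Compute a few illustrative line-level features for table extraction."""
    chars = [c for c in line if not c.isspace()]
    n = max(len(chars), 1)
    feats = {
        "pct_digit": sum(c.isdigit() for c in chars) / n,
        "pct_alpha": sum(c.isalpha() for c in chars) / n,
        "indented": line.startswith(" "),
        "five_spaces": "     " in line,  # 5+ consecutive spaces
    }
    # Crude whitespace alignment with the previous line: do runs of
    # spaces start at any of the same columns?
    def space_run_starts(s):
        return {i for i, c in enumerate(s)
                if c == " " and (i == 0 or s[i - 1] != " ")}
    feats["aligns_with_prev"] = bool(space_run_starts(line) &
                                     space_run_starts(prev_line))
    return feats

row = "1993   9,589   15,704   575   3.66   150,582   5,514.4"
f = line_features(row)
print(f["pct_digit"] > 0.5)  # mostly digits: likely a Table Data Row
```

A CRF then scores label sequences over whole lines using such features (plus their conjunctions at small time offsets), rather than classifying each line independently, which is what lets it recover coherent table regions.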
Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, 2003 SIGIR]

Method             Line labels,      Table segments,
                   percent correct   F1
HMM                65%               64%
Stateless MaxEnt   85%               -
CRF                95%               92%

IE from Research Papers [McCallum et al '99]
Field-level F1:
Hidden Markov Models (HMMs) [Seymore, McCallum, Rosenfeld, 1999]: 75.6
Support Vector Machines (SVMs) [Han, Giles, et al, 2003]: 89.7
Conditional Random Fields (CRFs) [Peng, McCallum, 2004]: 93.9 (error reduced ~40%)

Chinese Word Segmentation [McCallum & Feng 2003]
~100k words of data, Penn Chinese Treebank.
Lexicon features: adjective-ending characters, adverb-ending characters, building words, Chinese number characters, Chinese period, cities and regions, countries, dates, department characters, digit characters, foreign name chars, function words, job titles, locations, money, negative characters, organization indicators, preposition characters, provinces, punctuation chars, Roman alphabetics, Roman digits, stopwords, surnames, symbol characters, verb chars, wordlist (188k lexicon).

Chinese Word Segmentation Results [McCallum & Feng 2003]
Precision and recall of segments with perfect boundaries:

Method    # training sentences   prec.   recall   F1
[Peng]    ~5M                    75.1    74.0     74.2
[Ponte]   ?                      84.4    87.8     86.0
[Teahan]  ~40k                   ?       ?        94.4
[Xue]     ~10k                   95.2    95.1     95.2   (prev. world's best)
CRF       2805                   97.3    97.8     97.5   (error reduced ~50%)
CRF       140                    95.4    96.0     95.7
CRF       56                     93.9    95.0     94.4

Named Entity Recognition
CRICKET - MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.
Labels and examples:
PER:  Yayuk Basuki, Innocent Butare
ORG:  3M, KDP, Cleveland
LOC:  Cleveland, Nirmal Hriday, The Oval
MISC: Java, Basque, 1,000 Lakes Rally

Automatically Induced Features [McCallum & Li, 2003, CoNLL]

Index   Feature
0       inside-noun-phrase (o_{t-1})
5       stopword (o_t)
20      capitalized (o_{t+1})
75      word=the (o_t)
100     in-person-lexicon (o_{t-1})
200     word=in (o_{t+2})
500     word=Republic (o_{t+1})
711     word=RBI (o_t) & header=BASEBALL
1027    header=CRICKET (o_t) & in-English-county-lexicon (o_t)
1298    company-suffix-word (firstmention_{t+2})
4040    location (o_t) & POS=NNP (o_t) & capitalized (o_t) & stopword (o_{t-1})
4474    word=the (o_{t-2}) & word=of (o_t)
4945    moderately-rare-first-name (o_{t-1}) & very-common-last-name (o_t)

Named Entity Extraction Results [McCallum & Li, 2003, CoNLL]
Method                                                 F1
HMMs: BBN's IdentiFinder                               73%
CRFs without feature induction                         83%
CRFs with feature induction based on likelihood gain   90%

Related Work
• CRFs are widely used for information extraction, including more complex structures, like trees:
– [Zhu, Nie, Zhang, Wen, ICML 2007]: Dynamic Hierarchical Markov Random Fields and their Application to Web Data Extraction
– [Viola & Narasimhan]: Learning to Extract Information from Semi-structured Text using a Discriminative Context Free Grammar
– [Jousse et al 2006]: Conditional Random Fields for XML Trees

Outline
• Examples of IE and Data Mining
• IE with Hidden Markov Models
• Introduction to Conditional Random Fields (CRFs)
• Examples of IE with CRFs
• Sequence Alignment with CRFs
• Semi-supervised Learning

String Edit Distance
• Distance between sequences x and y: the "cost" of the lowest-cost sequence of edit operations that transforms string x into y.
• Applications:
– Database record deduplication, e.g. "Apex International Hotel, Grassmarket Street" vs. "Apex Internat'l, Grasmarket Street": are these records duplicates of the same hotel?
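With unit costs for insert, delete, and substitute (and cost 0 for copy), this lowest-cost distance is the classic Levenshtein dynamic program developed in the next slides; a minimal sketch:

```python
def levenshtein(x, y):
    # D[i][j] = cost of the best alignment of x[:i] with y[:j],
    # with copy cost 0 and insert/delete/substitute cost 1.
    D = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        D[i][0] = i                          # delete all of x[:i]
    for j in range(1, len(y) + 1):
        D[0][j] = j                          # insert all of y[:j]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            subst = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + subst,  # copy / substitute
                          D[i - 1][j] + 1,          # delete
                          D[i][j - 1] + 1)          # insert
    return D[len(x)][len(y)]

print(levenshtein("William W. Cohon", "Willleam Cohen"))  # 6, as on the slide
```

The table is filled row by row in O(|x| |y|) time; the bottom-right cell holds the distance, and tracing the minimizing choices backwards recovers the alignment itself.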
– Biological sequences, e.g. AGCTCTTACGATAGAGGACTCCAGA vs. AGGTCTTACCAAAGAGGACTTCAGA
– Machine translation, e.g. "Il a acheté une pomme" vs. "He bought an apple"
– Textual entailment, e.g. "He bought a new car last night" vs. "He purchased a brand new automobile yesterday evening"

Levenshtein Distance [1966]
Edit operations:
copy    Copy a character from x to y          (cost 0)
insert  Insert a character into y             (cost 1)
delete  Delete a character from y             (cost 1)
subst   Substitute one character for another  (cost 1)

Align two strings, e.g. x1 = "William W. Cohon", x2 = "Willleam Cohen". Summing the per-operation costs of the lowest-cost alignment (a sequence of copies, substitutions, deletions, and an insertion) gives total cost = 6 = the Levenshtein distance.

Dynamic program: D(i,j) = score of the best alignment from x1...xi to y1...yj.
D(i,j) = min of:
  D(i-1,j-1) + d(xi, yj)   (copy/subst: d = 0 if xi = yj, else 1)
  D(i-1,j) + 1             (delete)
  D(i,j-1) + 1             (insert)

Filling the table for "Willleam" (rows) vs. "William" (columns); the bottom-right cell is the total cost = distance:

      W  i  l  l  i  a  m
   0  1  2  3  4  5  6  7
W  1  0  1  2  3  4  5  6
i  2  1  0  1  2  3  4  5
l  3  2  1  0  1  2  3  4
l  4  3  2  1  0  1  2  3
l  5  4  3  2  1  1  2  3
e  6  5  4  3  2  2  2  3
a  7  6  5  4  3  3  2  3
m  8  7  6  5  4  4  4  2

(A repeated delete is cheaper than the insert-plus-substitute alternatives along the way.)

Levenshtein Distance with Markov Dependencies
The cost of each edit operation depends on the previous operation (the cost after a copy, insert, delete, or subst can each differ), and these costs are learned from training data. The DP table becomes 3D: position in x, position in y, and previous operation.

Ristad & Yianilos (1997)
Essentially a pair-HMM, generating an edit/state/alignment sequence a and the two strings:
p(a, x1, x2) = ∏_t p(a_t | a_{t-1}) p(x1[a_t.i1], x2[a_t.i2] | a_t)   (complete-data likelihood)
Match score = p(x1, x2), the incomplete-data likelihood: the sum of the above over all alignments consistent with x1 and x2.
Given a training set of matching string pairs, the objective function is O = ∏_j p(x1^(j), x2^(j)).
Learn via EM:
• Expectation step: calculate the likelihood of alignment paths.
• Maximization step: make those paths more likely.

Ristad & Yianilos Regrets
• Limited features of input strings
– Examine only a single character pair at a time
– Difficult to use upcoming string context, lexicons, ...
– Example: "Senator John Green" vs. "John Green"
• Limited edit operations
– Difficult to generate arbitrary jumps in both strings
– Example: "UMass" vs. "University of Massachusetts"
• Trained only on positive match data
– Doesn't include information-rich "near misses"
– Example: "ACM SIGIR" ≠ "ACM SIGCHI"
So, consider a model trained by conditional probability.

Conditional Probability (Sequence) Models
• We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(y|x) instead of P(y,x):
– Can examine features, but is not responsible for generating them.
– Don't have to explicitly model their dependencies.

CRF String Edit Distance
Joint complete-data likelihood (pair-HMM): p(a, x1, x2) = ∏_t p(a_t | a_{t-1}) p(x1[a_t.i1], x2[a_t.i2] | a_t)
Conditional complete-data likelihood: p(a | x1, x2) = (1 / Z_{x1,x2}) ∏_t Φ(a_{t-1}, a_t, x1, x2)
Want to train from a set of string pairs, each labeled one of {match, non-match}:
match:     "William W. Cohon" / "Willlleam Cohen"
non-match: "Bruce D'Ambrosio" / "Bruce Croft"
match:     "Tommi Jaakkola" / "Tommi Jakola"
match:     "Stuart Russell" / "Stuart Russel"
non-match: "Tom Deitterich" / "Tom Dean"

CRF String Edit Distance FSM
Two copies of the edit FSM (states: copy, subst, delete, insert), one for the match class (m=1) and one for the non-match class (m=0), joined at a Start state.
Conditional incomplete-data likelihood: p(m | x1, x2) = (1 / Z_{x1,x2}) Σ_{a ∈ S_m} ∏_t Φ(a_{t-1}, a_t, x1, x2),
i.e., the probability summed over all alignments passing through the match (or non-match) submodel.
Example: x1 = "Tommi Jaakkola", x2 = "Tommi Jakola": probability summed over alignments in match states 0.8; in non-match states 0.2.
Example: x1 = "Tom Dietterich", x2 = "Tom Dean": match 0.1; non-match 0.9.

Parameter Estimation
Given a training set of string pairs and match/non-match labels, the objective function is the incomplete log-likelihood:
O = Σ_j log p(m^(j) | x1^(j), x2^(j))
The complete log-likelihood: Σ_j log Σ_a p(m^(j) | a, x1^(j), x2^(j)) p(a | x1^(j), x2^(j))
Expectation Maximization:
• E-step: estimate the distribution over alignments, p(a | x1^(j), x2^(j)), using current parameters.
• M-step: change parameters to maximize the complete (penalized) log-likelihood, with an iterative quasi-Newton method (BFGS).
This is "conditional EM", but it avoids the complexities of [Jebara 1998], because there is no need to solve the M-step in closed form.

Efficient Training
• The dynamic programming table is 3D; with |x1| = |x2| = 100 and |S| = 12, ~120,000 entries.
• Use beam search during the E-step [Pal, Sutton, McCallum 2005].
• Unlike completely observed CRFs, the objective function is not convex.
• Initialize parameters not at zero, but so as to yield a reasonable initial edit distance.

What Alignments are Learned?
Example alignments, shown on the slides as paths through the match / non-match FSM:
• x1 = "Tommi Jaakkola", x2 = "Tommi Jakola"
• x1 = "Bruce Croft", x2 = "Tom Dean"
• x1 = "Jaime Carbonell", x2 = "Jamie Callan"

Summary of Advantages
• Arbitrary features of the input strings
– Examine past and future context
– Use lexicons, WordNet
• Extremely flexible edit operations
– A single operation may make arbitrary jumps in both strings, of size determined by input features
• Discriminative training
– Maximize ability to predict match vs. non-match
• Easy-to-label data
– Match/non-match labels only; no need for labeled alignments

Experimental Results: Data Sets
• Restaurant name, restaurant address
– 864 records, 112 matches
– E.g. "Abe's Bar & Grill, E. Main St" vs. "Abe's Grill, East Main Street"
• People names, UIS DB generator (synthetic noise)
– E.g. "John Smith" vs. "Snith, John"
• CiteSeer citations, in four sections: Reason, Face, Reinforce, Constraint
– E.g.
"Rusell & Norvig, 'Artificial Intelligence: A Modern...'" vs. "Russell & Norvig, 'Artificial Intelligence: An Intro...'"

Experimental Results: Features
• same, different
• same-alphabetic, different-alphabetic
• same-numeric, different-numeric
• punctuation1, punctuation2
• alphabet-mismatch, numeric-mismatch
• end-of-1, end-of-2
• same-next-character, different-next-character

Experimental Results: Edit Operations
• insert, delete, substitute/copy
• swap-two-characters
• skip-word-if-in-lexicon
• skip-parenthesized-words
• skip-any-word
• substitute-word-pairs-in-translation-lexicon
• skip-word-if-present-in-other-string

Experimental Results (baselines from [Bilenko & Mooney 2003])
F1 (average of precision and recall):

Distance metric    Restaurant  Restaurant  CiteSeer
                   name        address     Reason  Face   Reinf  Constraint
Levenshtein        0.290       0.686       0.927   0.924  0.952  0.893
Learned Leven.     0.354       0.712       0.938   0.941  0.966  0.907
Vector             0.365       0.380       0.897   0.923  0.922  0.903
Learned Vector     0.433       0.532       0.924   0.913  0.875  0.808
CRF Edit Distance  0.448       0.783       0.964   0.918  0.917  0.976

Experimental Results
Data set: person names, with word-order noise added.
                                          F1
Without skip-if-present-in-other-string   0.856
With skip-if-present-in-other-string      0.981

Related Work
• Learned edit distance
– [Bilenko & Mooney 2003], [Cohen et al 2003], ...
– [Joachims 2003]: Max-margin, trained on alignments
• Conditionally-trained models with latent variables
– [Jebara 1999]: "Conditional Expectation Maximization"
– [Quattoni, Collins, Darrell 2005]: CRF for visual object recognition, with latent classes for object sub-patches
– [Zettlemoyer & Collins 2005]: CRF for mapping sentences to logical form, with latent parses

Outline
• Examples of IE and Data Mining
• IE with Hidden Markov Models
• Introduction to Conditional Random Fields (CRFs)
• Examples of IE with CRFs
• Sequence Alignment with CRFs
• Semi-supervised Learning

Semi-Supervised Learning
How to train with limited labeled data? Augment with lots of unlabeled data.
"Expectation Regularization" [Mann, McCallum, ICML 2007]

Supervised Learning
Creation of labeled instances requires extensive human effort. What if labeled data is limited? [Figure: decision boundary fit to a small amount of labeled data.]

Semi-Supervised Learning: Labeled & Unlabeled Data
Augment limited labeled data by using unlabeled data: a small amount of labeled data plus a large amount of unlabeled data.

More Semi-Supervised Algorithms than Applications
[Bar chart: number of papers per year, 1998-2006, algorithms vs. applications. Compiled from [Zhu, 2007].]

Weakness of Many Semi-Supervised Algorithms
• Difficult to implement: significantly more complicated than supervised counterparts
• Fragile: meta-parameters hard to tune
• Lacking in scalability: O(n²) or O(n³) in the unlabeled data

"EM will generally degrade [tagging] accuracy, except when only a limited amount of hand-tagged text is available." [Merialdo, 1994]

"When the percentage of labeled data increases from 50% to 75%, the performance of [Label Propagation with Jensen-Shannon divergence] and SVM become almost same, while [Label Propagation with cosine distance] performs significantly worse than SVM." [Niu, Ji, Tan, 2005]

Families of Semi-Supervised Learning
1. Expectation Maximization
2. Graph-Based Methods
3. Auxiliary Functions
4. Decision Boundaries in Sparse Regions

Family 1: Expectation Maximization [Dempster, Laird, Rubin, 1977]
Fragile: often worse than supervised.

Family 2: Graph-Based Methods [Szummer, Jaakkola, 2002] [Zhu, Ghahramani, 2002]
Lacking in scalability; sensitive to choice of metric.

Family 3: Auxiliary-Task Methods [Ando and Zhang, 2005]
Complicated to find appropriate auxiliary tasks.

Family 4: Decision Boundary in Sparse Region
• Transductive SVMs [Joachims, 1999]: sparsity measured by margin
• Entropy Regularization [Grandvalet and Bengio, 2005]: ...by label entropy
Minimal entropy solution!

How do we know the minimal entropy solution is wrong?
We suspect at least some of the data is in the second class! In fact we often have prior knowledge of the relative class proportions.
[Bar charts of class proportions:]
• 0.8 Student / 0.2 Professor
• 0.1 Gene Mention / 0.9 Background
• 0.6 Person / 0.4 Organization

Families of Semi-Supervised Learning
1. Expectation Maximization
2. Graph-Based Methods
3. Auxiliary Functions
4. Decision Boundaries in Sparse Regions
5. Expectation Regularization

Family 5: Expectation Regularization
Favor decision boundaries that lie in a low-density region and match the prior class proportions. [Bar chart of class proportions omitted.]

Label Regularization is the special case of Expectation Regularization using p(y); the general case uses p(y|feature).

Expectation Regularization
• Simple: easy to implement
• Robust: meta-parameters need little or no tuning
• Scalable: linear in the number of unlabeled examples

Discriminative Models
• Predict class boundaries directly
• Do not directly estimate class densities
• Make few assumptions (e.g. independence) on features
• Are trained by optimizing conditional log-likelihood

Expectation Regularization (XR)
Constraints expressed as expectations, applied here to logistic regression.
Objective = log-likelihood + XR term: the KL-divergence between a prior distribution and the model's expected distribution over the unlabeled data.
• Prior distribution: provided from supervised training or estimated on the labeled data
• Model's expected distribution: computed on the unlabeled data
After training, the model matches the prior distribution (supervised-only does not).
Gradient for logistic regression: the XR term's gradient is 0 when the model's expected distribution matches the prior. [Equations omitted.]

XR Results for Classification: Secondary Structure Prediction
Accuracy by number of labeled examples (2 / 100 / 1000):
• SVM (supervised): 55.41%, 66.29%
• Cluster Kernel SVM: 57.05%, 65.97%
• QC Smartsub: 57.68%, 59.16%
• Naïve Bayes (supervised): 52.42%, 57.12%, 64.47%
• Naïve Bayes EM: 50.79%, 57.34%, 57.60%
• Logistic Regression (supervised): 52.42%, 56.74%, 65.43%
• Logistic Regression + Ent. Reg.: 48.56%, 54.45%, 58.28%
• Logistic Regression + XR: 57.08%, 58.51%, 65.44%

XR Results for Classification: Sliding Window Models
• CoNLL03 Named Entity Recognition shared task
• BioCreativeII 2007 gene/gene-product extraction
• Wall Street Journal part-of-speech tagging
• SRAA simulated/real auto/aviation text classification
[Results figures omitted.]

Noise in Prior Knowledge
What happens when users' estimates of the class proportions are in error?
[Figure: CoNLL03 NER shared task with a noisy prior distribution; 20% change in probability of the majority class.]

Conclusion
Expectation Regularization is an effective, robust method of semi-supervised training which can be applied to discriminative models, such as logistic regression.

Ongoing and Future Work
• Applying Expectation Regularization to other discriminative models, e.g. conditional random fields
• Experimenting with priors other than class-label priors

End of Part 1

Joint Inference in Information Extraction & Data Mining
Andrew McCallum, Computer Science Department, University of Massachusetts Amherst
Joint work with Charles Sutton, Aron Culotta, Wei Li, Xuerui Wang, Andres Corrada, Ben Wellner, Chris Pal, Michael Hay, Natasha Mohanty, David Mimno, Gideon Mann.

From Text to Actionable Knowledge
Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links/relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)

Knowledge Discovery
Problem: Combined in serial juxtaposition, IE and DM are unaware of each other's weaknesses and opportunities.
1) DM begins from a populated DB, unaware of where the data came from, or its inherent errors and uncertainties.
2) IE is unaware of emerging patterns and regularities in the DB.
The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

Solution: Uncertainty Info
Same pipeline (Spider → Filter → IE → Database → Data Mining → Actionable knowledge), but IE passes uncertainty info into the database, and emerging patterns flow back from data mining to IE.

Solution: Unified Model
Replace the IE → Database → Data Mining chain with a single probabilistic model: discriminatively-trained undirected graphical models.
• Conditional Random Fields [Lafferty, McCallum, Pereira]
• Conditional PRMs [Koller...], [Jensen...], [Getoor...], [Domingos...]
Complex inference and learning: just what we researchers like to sink our teeth into!

Scientific Questions
• What model structures will capture salient dependencies?
• Will joint inference actually improve accuracy?
• How to do inference in these large graphical models?
• How to do parameter estimation efficiently in these models, which are built from multiple large components?
• How to do structure discovery in these models?
Outline
• The need for joint inference
• Examples of joint inference
– Joint Labeling of Cascaded Sequences (Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution (Graph Partitioning)
– Joint Segmentation and Co-ref (Sparse BP)
– Joint Relation Extraction and Data Mining (ICM)
– Probability + First-order Logic, Co-ref on Entities (MCMC)

Cascaded Predictions
Layers: named-entity tag, part-of-speech, segmentation, Chinese character (the input observation). First predict segmentation; then POS with segmentation as input; then NE tags with POS and segmentation as inputs.
But errors cascade: must be perfect at every stage to do well.

Joint Prediction: Cross-Product over Labels
3 x 45 x 11 = 1485 possible states, e.g. state label = (Word-begin, Noun, Person).
O(|V| x 1485²) parameters; O(|o| x 1485²) running time.
Segmentation+POS+NE predicted as one output chain over the Chinese characters.

Joint Prediction: Factorial CRF
O(|V| x 2785) parameters. Named-entity tag, part-of-speech, and segmentation are all output chains over the Chinese-character input observations.

Linear-Chain to Factorial CRFs: Model Definition
Linear-chain:
p(y|x) = (1/Z(x)) ∏_{t=1..T} Φ_y(y_t, y_{t-1}) Φ_{xy}(x_t, y_t)
Factorial (chains u, v, w over input x):
p(y|x) = (1/Z(x)) ∏_{t=1..T} Φ_u(u_t, u_{t-1}) Φ_v(v_t, v_{t-1}) Φ_w(w_t, w_{t-1}) Φ_{uv}(u_t, v_t) Φ_{vw}(v_t, w_t) Φ_{wx}(w_t, x_t)
where Φ(·) = exp( Σ_k λ_k f_k(·) )
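The parameter counts on these two slides can be checked by simple arithmetic. The decomposition of the factorial count below (within-chain transition factors plus cotemporal factors between neighboring chains) is my reading of the slide's 2785 figure, not stated explicitly in the talk:

```python
# Label-set sizes for the three cascaded tasks on the slide.
n_seg, n_pos, n_ner = 3, 45, 11

# Cross-product encoding: one chain whose states are (seg, pos, ner) triples.
cross_states = n_seg * n_pos * n_ner        # 3 x 45 x 11 = 1485, as on the slide

# Its transition table grows with the square of the state count...
cross_transitions = cross_states ** 2

# ...while a factorial CRF keeps three chains, paying only for
# within-chain transitions plus cotemporal seg-pos and pos-ner links:
factorial_params = (n_seg**2 + n_pos**2 + n_ner**2) + (n_seg*n_pos + n_pos*n_ner)
# (9 + 2025 + 121) + (135 + 495) = 2785
```

So the factorial factorization replaces a 1485²-sized transition structure with 2785 pairwise parameters per feature.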
Dynamic CRFs
Undirected conditionally-trained analogue to Dynamic Bayes Nets (DBNs): factorial, higher-order, hierarchical.

Training CRFs
Maximize the log-likelihood L of parameters {λ_k} given training data {⟨o, s⟩^(i)}:
∂L/∂λ_k = Σ_i C_k(s^(i), o^(i)) − Σ_i Σ_s P_Λ(s | o^(i)) C_k(s, o^(i)) − λ_k/σ²
where C_k(s, o) = Σ_t f_k(o, t, s_{t-1}, s_t)
i.e. feature count using correct labels, minus feature count using predicted labels, minus a smoothing penalty.

Training DCRFs
Same form as general CRFs, but counts sum over cliques c:
C_k(s, o) = Σ_t Σ_c f_k(o, t, c)

Experiments
Simultaneous noun-phrase & part-of-speech tagging (NP tags / POS tags):
B I I B I I O O O / N N N O N N V O V
"Rockwell International Corp. 's Tulsa unit said it signed"
B I I O B I O B I / O J N V O N O N N
"a tentative agreement extending its contract with Boeing Co."
• Data from CoNLL Shared Task 2000 (newswire)
– 8936 training instances
– 45 POS tags, 3 NP tags
– Features: word identity, capitalization, regexes, lexicons, ...
Two experiments:
• Compare exact and approximate inference
• Compare noun-phrase segmentation F1 of cascaded CRF+CRF, cascaded Brill+CRF, and joint factorial DCRFs

Comparison:
            CRF+CRF   Brill+CRF   Joint FCRF
POS acc     98.28     N/A         98.92
Joint acc   95.56     N/A         96.48
NP F1       93.10     93.33       93.87

CRF+CRF and FCRF trained on 8936 CoNLL sentences. Brill tagger trained on 30,000+ sentences, including the CoNLL test set!
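The training gradient on the slide (empirical feature counts, minus expected feature counts under the model, minus λ_k/σ²) has the same form for any conditional log-linear model. A minimal sketch for chain length 1 (i.e. plain logistic regression over binary labels), with made-up feature functions, just to make the three terms concrete:

```python
import math

def gradient(lmbda, data, features, sigma2=10.0):
    """Gradient of the L2-penalized conditional log-likelihood of a
    log-linear model over labels {0, 1}: for each weight k,
    (empirical count of f_k) - (expected count of f_k) - lambda_k/sigma^2.
    Same form as the slide's CRF gradient, restricted to length-1 chains."""
    labels = [0, 1]
    grad = [0.0] * len(lmbda)
    for x, y in data:
        scores = [sum(l * f(x, c) for l, f in zip(lmbda, features)) for c in labels]
        z = sum(math.exp(s) for s in scores)
        for k, f in enumerate(features):
            grad[k] += f(x, y)                                # count w/ correct label
            grad[k] -= sum(math.exp(scores[c]) / z * f(x, c)  # count w/ predicted labels
                           for c in labels)
    return [g - l / sigma2 for g, l in zip(grad, lmbda)]      # smoothing penalty
```

A full (D)CRF computes the expected-count term with forward-backward over the chain (or BP over cliques) instead of an explicit sum over labelings.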
A 20% reduction in joint-accuracy error.

Accuracy by Training Set Size
Joint prediction of part-of-speech and noun-phrase in newswire, matching accuracy with only 50% of the training data. [Learning-curve figure omitted.]

Outline
• The need for joint inference
• Examples of joint inference
– Joint Labeling of Cascaded Sequences (Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution (Graph Partitioning)
– Joint Segmentation and Co-ref (Sparse BP)
– Joint Relation Extraction and Data Mining (ICM)
– Probability + First-order Logic, Co-ref on Entities (MCMC)

Jointly Labeling Distant Mentions: Skip-chain CRFs [Sutton, McCallum, SRL 2004]
"... Senator Joe Green said today ... . Green ran for ..."
Without skip edges, the dependency among similar, distant mentions is ignored. With them: 14% reduction in error on the most repeated field in email seminar announcements.
Inference: tree-reparameterized BP [Wainwright et al, 2002]. See also [Finkel, et al, 2005].

Outline
(as above; next: Joint Co-reference Resolution)

Joint Co-reference Among All Pairs: Affinity Matrix CRF
"Entity resolution"; "object correspondence".
[Figure: mentions ". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . ." with pairwise Y/N variables and affinities 45, 99, 11.]
~25% reduction in error on co-reference of proper nouns in newswire.
Inference: correlational clustering / graph partitioning [Bansal, Blum, Chawla, 2002] [McCallum, Wellner, IJCAI WS 2003, NIPS 2004]

Coreference Resolution
AKA "record linkage", "database record deduplication", "citation matching", "object correspondence", "identity uncertainty".
Input: a news article with named-entity "mentions" tagged, e.g.
"Today Secretary of State Colin Powell met with ... he ... Condoleezza Rice ... Mr Powell ... she ... Powell ... President Bush ... Rice ... Bush ..."
Output: number of entities N = 3:
#1 Secretary of State Colin Powell; he; Mr. Powell; Powell
#2 Condoleezza Rice; she; Rice
#3 President Bush; Bush

Inside the Traditional Solution: Pair-wise Affinity Metric
Mention (3) ". . . Mr Powell . . ." vs. mention (4) ". . . Powell . . ." : Y/N?

Feature                                             Fires?   Weight
Two words in common                                 N        29
One word in common                                  Y        13
"Normalized" mentions are string identical          Y        39
Capitalized word in common                          Y        17
> 50% character tri-gram overlap                    Y        19
< 25% character tri-gram overlap                    N        -34
In same sentence                                    Y        9
Within two sentences                                Y        8
Further than 3 sentences apart                      N        -1
"Hobbs Distance" < 3                                Y        11
Number of entities in between two mentions = 0      N        12
Number of entities in between two mentions > 4      N        -3
Font matches                                        Y        1
Default                                             Y        -19
OVERALL SCORE = 98 > threshold = 0

The Problem
[Figure: pairwise affinities among "Mr Powell", "Powell", and "she": 98 (Y), 104 (Y), 11 (N).]
Pair-wise merging decisions are being made independently from each other. Affinity measures are noisy and imperfect; decisions should be made in relational dependence with each other.
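The pair-wise metric above is just a weighted sum over fired features compared against a threshold. A small sketch reproducing the slide's "Mr Powell" / "Powell" computation (feature names and the Y/N pattern are transcribed from the slide; the dictionary structure is my own):

```python
# Feature weights from the "Inside the Traditional Solution" slide.
AFFINITY_WEIGHTS = {
    "two words in common": 29,
    "one word in common": 13,
    "normalized mentions string identical": 39,
    "capitalized word in common": 17,
    "> 50% character tri-gram overlap": 19,
    "< 25% character tri-gram overlap": -34,
    "in same sentence": 9,
    "within two sentences": 8,
    "further than 3 sentences apart": -1,
    "Hobbs distance < 3": 11,
    "entities in between = 0": 12,
    "entities in between > 4": -3,
    "font matches": 1,
    "default": -19,
}

def affinity(fired):
    """Sum the weights of the features that fired; merge the pair
    when the overall score clears the threshold (0 on the slide)."""
    return sum(AFFINITY_WEIGHTS[f] for f in fired)

# Features marked Y on the slide for "Mr Powell" vs "Powell":
fired = ["one word in common", "normalized mentions string identical",
         "capitalized word in common", "> 50% character tri-gram overlap",
         "in same sentence", "within two sentences", "Hobbs distance < 3",
         "font matches", "default"]
score = affinity(fired)  # 13+39+17+19+9+8+11+1-19 = 98, as on the slide
```

The slides' critique still applies: each such score is computed independently per pair, which is exactly what the MRF formulation below repairs.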
A Generative Model Solution [Russell 2001], [Pasula et al 2002]
(Applied to citation matching, and object correspondence in vision.)
[Directed model over ids, context words, surname distance, fonts, gender, age, ...]
Issues:
1) A generative model makes it difficult to use complex features.
2) The number of entities is hard-coded into the model structure, but we are supposed to predict the number of entities! Thus we must modify model structure during inference: MCMC.

A Markov Random Field for Co-reference (MRF) [McCallum & Wellner, 2003, ICML]
Make pair-wise merging decisions in dependent relation to each other by:
• calculating a joint probability,
• including all edge weights,
• adding dependence on consistent triangles.

P(y|x) = (1/Z_x) exp( Σ_{i,j} Σ_l λ_l f_l(x_i, x_j, y_ij) + Σ_{i,j,k} λ′ f′(y_ij, y_jk, y_ik) )

[Figure: mentions "Mr Powell", "Powell", "she" with pairwise affinities 45, 30, 11. An inconsistent triangle of Y/N decisions is driven to score −∞ by the triangle factor; a consistent assignment (Y, N, N) scores 64.]

Inference in these MRFs = Graph Partitioning
[Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]
log P(y|x) ∝ Σ_{i,j} Σ_l λ_l f_l(x_i, x_j, y_ij) = Σ_{i,j within partitions} w_ij − Σ_{i,j across partitions} w_ij
[Figure: four mentions "Mr Powell", "Powell", "Condoleezza Rice", "she" with edge weights 45, 106, 30, 134, 11, 10.]
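The partitioning objective just above (within-partition weights minus across-partition weights) can be sketched directly. The mention names and edge weights below are illustrative, not the slide's actual figure:

```python
def partition_score(weights, clusters):
    """Correlational-clustering objective from the slide: sum of edge
    weights within partitions minus sum of edge weights across
    partitions (proportional to log P(y|x))."""
    cluster_of = {m: i for i, c in enumerate(clusters) for m in c}
    within = across = 0
    for (a, b), w in weights.items():
        if cluster_of[a] == cluster_of[b]:
            within += w
        else:
            across += w
    return within - across

# Hypothetical signed affinities: "Mr Powell"/"she" repel each other.
weights = {("Mr Powell", "Powell"): 45,
           ("Powell", "she"): 11,
           ("Mr Powell", "she"): -30}
merged = partition_score(weights, [{"Mr Powell", "Powell", "she"}])
split = partition_score(weights, [{"Mr Powell", "Powell"}, {"she"}])
# With these weights, splitting off "she" scores higher than merging all three.
```

Inference searches over partitions for the highest score, so a noisy positive edge (Powell/she: 11) can be overruled by the rest of the graph.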
Two candidate partitions of the same graph score Σ_within − Σ_across = 22 and 314 respectively; inference seeks the highest-scoring partition.

Co-reference Experimental Results [McCallum & Wellner, 2003]
Proper noun co-reference.

DARPA ACE broadcast news transcripts, 117 stories:
                           Partition F1   Pair F1
Single-link threshold      16%            18%
Best prev match [Morton]   83%            89%
MRFs                       88%            92%
(error reduction: 30% partition, 28% pair)

DARPA MUC-6 newswire article corpus, 30 stories:
                           Partition F1   Pair F1
Single-link threshold      11%            7%
Best prev match [Morton]   70%            76%
MRFs                       74%            80%
(error reduction: 13% partition, 17% pair)

Joint Co-reference for Multiple Entity Types [Culotta & McCallum 2005]
People: Stuart Russell / Stuart Russell / S. Russel, with pairwise Y/N variables.
Organizations: University of California at Berkeley / Berkeley / Berkeley, likewise.
Linking the people decisions with the organization decisions reduces error by 22%.

Outline
• The need for joint inference
• Examples of joint inference
– Joint Labeling of Cascaded Sequences (Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution (Graph Partitioning)
– Joint Segmentation and Co-ref (Sparse BP)
– Joint Relation Extraction and Data Mining (ICM)
– Probability + First-order Logic, Co-ref on Entities (MCMC)

Joint Segmentation and Co-reference
Extraction from and matching of research paper citations.
[Model figure: observed citations o, segmentations s, citation attributes c, co-reference decisions y, database field values, world knowledge p.]
Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.
35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
Inference: Sparse Generalized Belief Propagation [Pal, Sutton, McCallum, 2005]
[Wellner, McCallum, Peng, Hay, UAI 2004]; see also [Marthi, Milch, Russell, 2003]

Joint IE and Coreference from Research Paper Citations
Textual citation mentions (noisy, with duplicates) → paper database with fields, clean, duplicates collapsed:
AUTHORS: Cowell, Dawid, ...   TITLE: Probab...     VENUE: Springer
AUTHORS: Montemerlo, Thrun, ...   TITLE: FastSLAM...   VENUE: AAAI...
AUTHORS: Kjaerulff   TITLE: Approxi...   VENUE: Technic...

Citation Segmentation and Coreference
Laurel, B. Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface Design , T. Smith (ed) , Addison-Wesley , 1990 .
Brenda Laurel . Interface Agents: Metaphors with Character , in Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .
1) Segment citation fields
2) Resolve coreferent citations (Y ? N)
3) Form canonical database record, resolving conflicts:
AUTHOR = Brenda Laurel
TITLE = Interface Agents: Metaphors with Character
PAGES = 355-366
BOOKTITLE = The Art of Human-Computer Interface Design
EDITOR = T. Smith
PUBLISHER = Addison-Wesley
YEAR = 1990
Perform all three steps jointly.
IE + Coreference Model
• A CRF segmentation s over each observed citation x, e.g. "J Besag 1986 On the..." labeled AUT AUT YR TITL TITL.
• Citation-mention attributes c derived from the segmentation: AUTHOR = "J Besag", YEAR = "1986", TITLE = "On the...".
• One such structure per citation mention ("Smyth , P Data mining...", "Smyth . 2001 Data Mining...", "J Besag 1986 On the...").
• Binary coreference variables for each pair of mentions (e.g. y, n, n).
• Research-paper entity attribute nodes, e.g. AUTHOR = "P Smyth", YEAR = "2001", TITLE = "Data Mining...".
Such a highly connected graph makes exact inference intractable, so...

Approximate Inference 1
• Loopy Belief Propagation: messages m_i(v_j) passed between nodes.
• Generalized Belief Propagation: messages passed between regions.
Here, a message is a conditional probability table passed among nodes. But message size grows exponentially with region size!
Approximate Inference 2
• Iterated Conditional Modes (ICM) [Besag 1986]: hold all variables but one constant and update
v_i ← argmax_{v_i} P(v_i | v \ v_i)
cycling through the variables. But greedy, and easily falls into local minima.
• "Iterated Conditional Sampling" or "Sparse Belief Propagation": instead of passing only the argmax, pass a sample of top values of P(v_i | v \ v_i), e.g. an N-best list (the top N values). Can use a "generalized" version of this, doing exact inference on a region of several nodes at once. Here, a "message" grows only linearly with region size and N!

Inference by Sparse "Generalized BP" [Pal, Sutton, McCallum 2005]
• Exact inference on the linear-chain segmentation regions.
• From each chain, pass an N-best list into coreference.
• Approximate inference by graph partitioning, integrating out uncertainty in the samples of extraction; made to scale to 1M citations with canopies [McCallum, Nigam, Ungar 2000].

Inference: Sample = N-best List from CRF Segmentation
Name        Title                                        Book Title                                    Year
Laurel, B.  Interface Agents: Metaphors with Character   The Art of Human Computer Interface Design    1990
(plus alternative candidate segmentations that split the fields differently)
When calculating similarity with another citation, we have more opportunity to find correct, matching fields.
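The ICM update above is easy to state in code. A minimal sketch on a toy pairwise model (the unary/agreement score structure is illustrative, not the talk's actual model):

```python
def icm(unary, agree, n_vars, labels=(0, 1), sweeps=10):
    """Iterated Conditional Modes [Besag 1986]: repeatedly set each
    variable to its best label given the current values of the rest.
    unary[i][y] scores label y at node i; agree[(i, j)] is a reward
    added when nodes i and j take the same label. Greedy: may stop
    in a local maximum."""
    v = [max(labels, key=lambda y: unary[i][y]) for i in range(n_vars)]
    for _ in range(sweeps):
        changed = False
        for i in range(n_vars):
            def score(y):
                s = unary[i][y]
                for (a, b), w in agree.items():
                    if i in (a, b):
                        other = v[b] if a == i else v[a]
                        s += w if y == other else 0.0
                return s
            best = max(labels, key=score)
            changed |= best != v[i]
            v[i] = best
        if not changed:
            break
    return v
```

Replacing the single argmax with an N-best list of high-scoring values is the step from ICM to the "iterated conditional sampling" / sparse-BP variant described above.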
[N-best segmentations of the matching citation, with fields split differently:
"Laurel, B / Interface Agents: Metaphors with Character / The ...",
"Laurel, B. / Interface Agents: Metaphors with Character / ...",
"Laurel, B. / Interface Agents Metaphors with Character / ..." ... 1990; coreference decision y?n]

Inference by Sparse "Generalized BP" [Pal, Sutton, McCallum 2005]
• Exact (exhaustive) inference over entity attributes.
• Revisit exact inference on the IE linear chain, now conditioned on entity attributes.

Parameter Estimation: Piecewise Training [Sutton & McCallum 2005]
Divide-and-conquer parameter estimation:
• IE linear-chain: exact MAP
• Coref graph edge weights: MAP on individual edges
• Entity attribute potentials: MAP, pseudo-likelihood
In all cases: climb the MAP gradient with a quasi-Newton method.

Results on 4 Sections of CiteSeer Citations
Coreference F1 performance:
N         Reinforce   Face    Reason   Constraint
1         0.946       0.967   0.945    0.961
3         0.950       0.979   0.961    0.960
7         0.948       0.979   0.951    0.971
9         0.982       0.967   0.960    0.971
Optimal   0.995       0.992   0.994    0.988
Average error reduction is 35%. "Optimal" makes best use of the N-best list by using true labels, indicating that even more improvement can be obtained.

Joint Segmentation and Co-reference [Wellner, McCallum, Peng, Hay, UAI 2004]
Extraction from and matching of research paper citations. [Model figure as before: observed citations o, segmentations s, citation attributes c, co-reference decisions y, database field values, world knowledge p.]
Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.
35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
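The N-best idea above (more candidate segmentations give the matcher more chances to line up correctly-segmented fields) can be sketched as a max over candidate pairs. The field-similarity function here is a deliberately crude stand-in for the talk's learned affinity:

```python
def field_sim(r1, r2):
    """Fraction of fields present in both records whose values match
    exactly (a toy stand-in for a learned field-affinity score)."""
    shared = set(r1) & set(r2)
    if not shared:
        return 0.0
    return sum(r1[f] == r2[f] for f in shared) / len(shared)

def nbest_sim(cands1, cands2):
    """Compare two citations via their N-best segmentations: take the
    best field-level similarity over all candidate pairs."""
    return max(field_sim(a, b) for a in cands1 for b in cands2)

# Two candidate segmentations of one citation; only the first splits
# the Name/Title boundary correctly.
a = [{"Name": "Laurel, B.", "Title": "Interface Agents: Metaphors with Character"},
     {"Name": "Laurel, B. Interface", "Title": "Agents: Metaphors with Character"}]
b = [{"Name": "Laurel, B.", "Title": "Interface Agents: Metaphors with Character"}]
```

Here the best segmentation pair matches perfectly even though the alternative segmentation matches on no field; a single 1-best segmentation could have missed the match entirely.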
Inference: Sparse Belief Propagation [Pal, Sutton, McCallum, 2005]

Outline
• The need for joint inference
• Examples of joint inference
– Joint Labeling of Cascaded Sequences (Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution (Graph Partitioning)
– Joint Segmentation and Co-ref (Sparse BP)
– Joint Relation Extraction and Data Mining (ICM)
– Probability + First-order Logic, Co-ref on Entities (MCMC)

Data
• 270 Wikipedia articles, 1000 paragraphs, 4700 relations, 52 relation types
– JobTitle, BirthDay, Friend, Sister, Husband, Employer, Cousin, Competition, Education, ...
• Targeted for density of relations: Bush/Kennedy/Manning/Coppola families and friends

Example: Cousin = Father's Sister's Son
George W. Bush: "...his father George H. W. Bush...", "...his cousin John Prescott Ellis..."
George H. W. Bush: "...his sister Nancy Ellis Bush..."
Nancy Ellis Bush: "...her son John Prescott Ellis..."
George H. W. Bush and Nancy Ellis Bush are siblings, each with a son, so person X (George W. Bush) and John Prescott Ellis are likely cousins.

John Kerry: "...celebrated with Stuart Forbes..."
Name / Son: Rosemary Forbes, John Kerry; James Forbes, Stuart Forbes
Name / Sibling: Rosemary Forbes, James Forbes
Rosemary Forbes and James Forbes are siblings, each with a son: John Kerry and Stuart Forbes are cousins.

Iterative DB Construction
Joseph P. Kennedy, Sr "... son John F. Kennedy with Rose Fitzgerald"
Name / Son / Wife: Joseph P. Kennedy, John F. Kennedy, Rose Fitzgerald; further rows: John F. Kennedy; Ronald Reagan; George W.
Bush
Fill the DB with a "first-pass" CRF; use relational features mined from the DB with a "second-pass" CRF (0.3).

Results
          ME      CRF     RCRF    RCRF .9   RCRF .5   RCRF Truth   RCRF Truth .5
F1        .5489   .5995   .6100   .6008     .6136     .6791        .6363
Prec      .6475   .7019   .6799   .7177     .7095     .7553        .7343
Recall    .4763   .5232   .5531   .5166     .5406     .6169        .5614
ME = maximum entropy; CRF = conditional random field; RCRF = CRF + mined features.

Examples of Discovered Relational Features
• Mother: Father-Wife
• Cousin: Mother-Husband-Nephew
• Friend: Education-Student
• Education: Father-Education
• Boss: Boss-Son
• MemberOf: Grandfather-MemberOf
• Competition: PoliticalParty-Member-Competition

Outline
• The need for joint inference
• Examples of joint inference
– Joint Labeling of Cascaded Sequences (Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution (Graph Partitioning)
– Joint Segmentation and Co-ref (Sparse BP)
– Joint Relation Extraction and Data Mining (ICM)
– Probability + First-order Logic, Co-ref on Entities (MCMC)

Sometimes graph partitioning with pairwise comparisons is not enough.
• Entities have multiple attributes (name, email, institution, location); we need to measure "compatibility" among them.
• Having 2 "given names" is common, but not 4.
– e.g. Howard M. Dean / Martin, Dean / Howard Martin
• Need to measure the size of the clusters of mentions.
• A pair of last-name strings that differ > 5?
We need measures on hypothesized "entities": we need first-order logic.

Pairwise Co-reference
[Figure: mentions Howard Dean, Dean Martin, Howard Martin with pairwise questions SamePerson(Dean Martin, Howard Dean)?, SamePerson(Howard Dean, Howard Martin)?, SamePerson(Dean Martin, Howard Martin)?]
Pairwise features: StringMatch(x1, x2), EditDistance(x1, x2).
Toward High-Order Representations: Identity Uncertainty

First-Order Features
• maximum edit distance between any pair is < 0.5
• number of distinct names < 3
• all mentions have the same gender
• there exists a number mismatch
• only pronouns
SamePerson(Howard Dean, Howard Martin, Dean Martin)?
Mentions: Howard Dean, Dean Martin, Howard Martin

Weighted Logic
• This model brings together the two main (long-separated) branches of Artificial Intelligence: Logic and Probability.

Toward High-Order Representations: Identity Uncertainty
Combinatorial Explosion!
SamePerson(x1,x2) …
SamePerson(x1,x2,x3) …
SamePerson(x1,x2,x3,x4) …
SamePerson(x1,x2,x3,x4,x5) …
SamePerson(x1,x2,x3,x4,x5,x6) …
Mentions: Dean Martin, Howard Dean, Howard Martin, Dino, Howie Martin, …
This space complexity is common in first-order probabilistic models.

Markov Logic as a Template to Construct a Markov Network using First-Order Logic
[Richardson & Domingos 2005] [Paskin & Russell 2002]
Grounding the Markov network requires space O(n^r), where n = number of constants and r = highest clause arity.
How can we perform inference and learning in models that cannot be grounded?

Inference in First-Order Models: SAT Solvers
• Weighted SAT solvers [Kautz et al 1997]
– Require complete grounding of the network
• LazySAT [Singla & Domingos 2006]
– Saves memory by only storing clauses that may become unsatisfied
– Initialization still requires time O(n^r) to visit all ground clauses

Inference in First-Order Models: MCMC
• Gibbs Sampling
– Difficult to move between high-probability configurations by changing single variables
– Although, consider MC-SAT! [Poon & Domingos '06]
• An alternative: Metropolis-Hastings sampling [Culotta & McCallum 2006]
– Can be extended to partial configurations: only instantiate relevant variables
– Key advantage: can design arbitrary "smart" jumps
– Successfully used in BLOG models [Milch et al 2005]
– Two parts: a proposal distribution and an acceptance distribution
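To make the contrast with pairwise features concrete, here is a small sketch of first-order (cluster-level) features: predicates evaluated over an entire hypothesized entity rather than a mention pair. The gender lexicon and feature names are illustrative assumptions, not the talk's actual feature set.

```python
# Illustrative first-order features over a whole cluster of mentions.
# GENDER is a stand-in lexicon; a real system would use a much larger one.
GENDER = {"he": "m", "she": "f", "Howard": "m"}

def first_order_features(cluster):
    """Predicates over an entire hypothesized entity, not a mention pair."""
    genders = {GENDER[w] for m in cluster for w in m.split() if w in GENDER}
    return {
        # e.g. "number of distinct names < 3"
        "num_distinct_names<3": len(set(cluster)) < 3,
        # e.g. "all mentions have the same gender" (vacuously true if unknown)
        "all_same_gender": len(genders) <= 1,
        # e.g. a quantified feature: "there exists a pronoun in the cluster"
        "exists_pronoun": any(m.lower() in ("he", "she", "it") for m in cluster),
    }

print(first_order_features(["Howard Dean", "Howard Martin", "he"]))
```

Features like these can only be scored once a whole cluster is hypothesized, which is exactly why the model must reason over sets of mentions rather than pairs.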
Model
"First-order features" f_w: SamePerson(x) scored over whole hypothesized entities, e.g. the clusters
{Howard Dean, Governor, Howie}, {Dean Martin, Dino}, {Howard Martin, Howie Martin}.
Z_X: sum over all possible configurations!

Inference with Metropolis-Hastings
• p(y')/p(y): likelihood ratio
– Ratio of P(Y|X)
– Z_X cancels!
• q(y'|y): proposal distribution
– Probability of proposing the move y → y'
• What is nice about this?
– Can design arbitrary "smart" proposal distributions

Proposal Distribution
Example moves over the mentions Dean Martin, Howie Martin, Howard Martin, Dino: each proposal maps a clustering y to a new clustering y', e.g. by moving a single mention such as Dino or Howie Martin from one cluster to another.

Feature List
• Exact Match/Mis-Match
– Entity type
– Gender (requires lexicon)
– Number
– Case
– Entity Text
– Entity Head
– Entity Modifier/Numerical Modifier
– Sentence
– WordNet: hypernym, synonym, antonym
• Other
– Relative pronoun agreement
– Sentence distance in bins
– Partial text overlaps
• Quantification
– Existential: a gender mismatch; three different first names
– Universal: NER type match; named mentions string-identical
• Filters (limit quantifiers to mention type)
– None
– Pronoun
– Nominal (description)
– Proper (name)

Learning the Likelihood Ratio
Can't normalize over all possible y's.
...temporarily consider the following...
Ad hoc training: maximize p(b|y,x), where b in {TRUE, FALSE}

Error-Driven Training Motivation
Where to get training examples?
• Generate all possible partial clusters: intractable
• Sample uniformly?
• Sample from clusters visited during inference?
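A minimal sketch of one Metropolis-Hastings step over clusterings may help. This is an illustration of the idea, not the talk's system: `score(y)` is any unnormalized log-score log p~(y|x), and because acceptance only needs the ratio p(y')/p(y), the partition function Z_X cancels. For brevity the toy proposal is treated as symmetric; an asymmetric "smart" proposal would add log q(y|y') − log q(y'|y) to the acceptance exponent.

```python
import math
import random

def mh_step(y, score, propose, rng):
    """One MH step: propose y -> y', accept with prob min(1, p(y')/p(y))."""
    y_new = propose(y, rng)
    log_accept = score(y_new) - score(y)      # Z_X cancels in this ratio
    if math.log(max(rng.random(), 1e-300)) < log_accept:
        return y_new                          # accept the jump
    return y                                  # reject: keep current config

def move_one_mention(y, rng):
    """Toy proposal: move one random mention to another (possibly new) cluster."""
    clusters = [set(c) for c in y]
    src = rng.choice([c for c in clusters if c])
    mention = rng.choice(sorted(src))
    src.remove(mention)
    dst = rng.choice(clusters + [None])       # None means "open a new cluster"
    if dst is None:
        dst = set()
        clusters.append(dst)
    dst.add(mention)
    return [frozenset(c) for c in clusters if c]

rng = random.Random(0)
y = [frozenset({"Howard Dean", "Howie"}), frozenset({"Dean Martin"})]
y = mh_step(y, lambda cfg: -float(len(cfg)), move_one_mention, rng)
```

The key point from the slides survives in this sketch: `score` never has to be normalized, so arbitrarily expressive first-order features can be used inside it.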
• Error driven
– Focus learning on the examples that need it most

Error-Driven Training Results

                  B-cubed F1
Non-error-driven  69
Error-driven      72

Learning the Likelihood Ratio
Given a pair of configurations, learn to rank the "better" configuration higher.

Rank-Based Training
• Instead of training
[Powell, Mr. Powell, he] → TRUE
[Powell, Mr. Powell, she] → FALSE
• ...rather...
[Powell, Mr. Powell, he] > [Powell, Mr. Powell, she]
[Powell, Mr. Powell, he] > [Powell, Mr. Powell]
[Powell, Mr. Powell, George, he] > [Powell, Mr. Powell, George, she]

Rank-Based Training Results

                               B-cubed F1
Non-rank-based (error-driven)  72
Rank-based (error-driven)      79
Previous best in literature    68

Experimental Results
• ACE 2005 newswire coreference
• All entity types
– Proper nouns, common nouns, pronouns
• 443 documents

                                                                 B-cubed
Previous best results, 1997                                      65
Previous best results, 2002                                      67
Previous best results, 2005                                      68
Our new results [Culotta, Wick, Hall, McCallum, NAACL/HLT 2007]  79

Learning the Proposal Distribution by Tying Parameters
• Proposal distribution q(y'|y): a "cheap" approximation to p(y)
• Reuse a subset of the parameters in p(y)
• E.g. in the identity uncertainty model
– Sample two clusters
– Stochastic agglomerative clustering to propose a new configuration

Scalability
• Currently running on >2 million author name mentions
• Canopies
• Schedule of different proposal distributions

Weighted Logic Summary
• Brings together
– Logic
– Probability
• Inference and learning are incredibly difficult.
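The tables above report B-cubed F1. As a reference, here is a sketch of the standard B-cubed computation (the textbook definition, not code from the talk): per-mention precision and recall against the gold clusters, averaged over mentions.

```python
def b_cubed(system, gold):
    """B-cubed F1. `system` and `gold` are lists of (frozen)sets of mentions
    that partition the same set of mentions."""
    def cluster_of(partition):
        return {m: c for c in partition for m in c}
    sys_of, gold_of = cluster_of(system), cluster_of(gold)
    mentions = list(sys_of)
    # For each mention m: |system cluster ∩ gold cluster| over |system cluster|
    p = sum(len(sys_of[m] & gold_of[m]) / len(sys_of[m]) for m in mentions) / len(mentions)
    # ... and over |gold cluster| for recall.
    r = sum(len(sys_of[m] & gold_of[m]) / len(gold_of[m]) for m in mentions) / len(mentions)
    return 2 * p * r / (p + r)

gold = [frozenset({"Powell", "Mr. Powell", "he"}), frozenset({"she"})]
system = [frozenset({"Powell", "Mr. Powell"}), frozenset({"he", "she"})]
print(round(b_cubed(system, gold), 3))  # → 0.706
```

Because scores are averaged per mention, B-cubed penalizes both over-merging and over-splitting of large clusters more heavily than a pairwise metric would.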
• Our recommendation:
– MCMC for inference
– Error-driven, rank-based training

Outline
• The need for joint inference
• Examples of joint inference
– Joint Labeling of Cascaded Sequences (Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution (Graph Partitioning)
– Joint Segmentation and Co-ref (Sparse BP)
– Joint Relation Extraction and Data Mining (ICM)
– Probability + First-order Logic, Co-ref on Entities (MCMC)

End of Part 2

Sparse Belief Propagation
"Beam search" in arbitrary graphical models, with a different pruning strategy based on variational inference.
1. Compute messages m_st(x_t) and m_vt(x_t) from neighbors x_s and x_v
2. Form the marginal b(x_t) → b'(x_t)
3. Retain 1 − ε of the mass
"Sparse Belief Propagation" used during training [Pal, Sutton, McCallum, ICASSP 2006]
Like beam search, but with modifications from variational methods of inference.
[Plot: accuracy vs. training time, sparse BP compared with a traditional beam]

[Pipeline diagram]
Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (discover patterns: entity types, links/relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)
Joint inference among detailed steps.

Leveraging Text in Social Network Analysis

Outline: Social Network Analysis with Topic Models
• Role Discovery (Author-Recipient-Topic Model, ART)
• Group Discovery (Group-Topic Model, GT)
• Enhanced Topic Models
– Correlations among Topics (Pachinko Allocation, PAM)
– Time-Localized Topics (Topics-over-Time Model, TOT)
– Markov Dependencies in Topics (Topical N-Grams Model, TNG)
• Bibliometric Impact Measures enabled by Topics

Multi-Conditional Mixtures
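The pruning step above (retain 1 − ε of the marginal mass) can be sketched as follows. This is an illustration of the idea only, under the assumption that beliefs are given as a state→probability table; the authors' implementation operates inside forward-backward message passing.

```python
def sparsify(belief, eps=0.1):
    """Keep the fewest highest-probability states covering 1 - eps of the
    mass; zero out the rest and renormalize. `belief` maps state -> prob."""
    kept, mass = {}, 0.0
    for state, p in sorted(belief.items(), key=lambda kv: -kv[1]):
        kept[state] = p
        mass += p
        if mass >= 1.0 - eps - 1e-12:   # small tolerance for float rounding
            break
    z = sum(kept.values())
    return {s: p / z for s, p in kept.items()}

b = {"PER": 0.6, "ORG": 0.3, "LOC": 0.07, "O": 0.03}
print(sparsify(b, eps=0.1))  # keeps PER and ORG, renormalized
```

Unlike a fixed-width beam, the number of retained states adapts per variable: a peaked marginal keeps one state, an uncertain one keeps many.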