Unified Models of
Information Extraction and Data Mining
with Application to Social Network Analysis
Andrew McCallum
Information Extraction and Synthesis Laboratory
Computer Science Department
University of Massachusetts Amherst
Joint work with David Jensen
Knowledge Discovery and Dissemination (KDD) Conference
September 2004
Intelligence Technology Innovation Center
ITIC
Goal:
Improve the state-of-the-art in our ability
to mine actionable knowledge
from unstructured text.
Extracting Job Openings from the Web
foodscience.com-Job2
Employer: foodscience.com
JobTitle: Ice Cream Guru
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
Data Mining the Extracted Job Information
IE from
Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k+ documents
several millennia old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries
Traditional Pipeline
[Pipeline diagram: Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Knowledge Discovery (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)]
Problem:
[Same pipeline diagram: Document collection → IE (Segment, Classify, Associate, Cluster) → Database → Knowledge Discovery (Discover patterns: entity types, links / relations, events) → Actionable knowledge]
Combined in serial juxtaposition,
IE and KD are unaware of each other’s
weaknesses and opportunities.
1) KD begins from a populated DB, unaware of where
the data came from, or its inherent uncertainties.
2) IE is unaware of emerging patterns and
regularities in the DB.
The accuracy of both suffers, and significant mining
of complex text sources is beyond reach.
Solution:
[Pipeline diagram with feedback: Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support); IE passes Uncertainty Info forward into the database, and Data Mining feeds Emerging Patterns back to IE.]
Research & Approach:
Unified Model
[Pipeline diagram: Document collection → Spider → Filter → a single Probabilistic Model spanning IE (Segment, Classify, Associate, Cluster) and Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)]
Conditionally-trained undirected graphical models:
Conditional Random Fields
[Lafferty, McCallum, Pereira]
Conditional PRMs
[Koller…], [Jensen…], [Getoor…], [Domingos…], …
Complex Inference and Learning
Just what we researchers like to sink our teeth into!
Accomplishments, Discoveries & Results:
• Extracting answers, and also uncertainty/confidence.
– Formally justified as marginalization in graphical models
– Applications to new word discovery in Chinese word segmentation,
and correction propagation in interactive IE
• Joint inference, with efficient methods
– Multiple, cascaded label sequences (Factorial CRFs)
– Multiple distant, but related mentions (Skip-chain CRFs)
– Multiple co-reference decisions (Affinity Matrix CRF)
– Integrating extraction with co-reference (Graphs & chains)
• Put it into a large-scale, working system
– Social network analysis from Email and the Web
– A new portal: research, people, connections.
Types of Uncertainty in
Knowledge Discovery from Text
• Confidence that extractor correctly obtained
statements the author intended.
• Confidence that what was written is truthful
– Author could have had misconceptions.
– …or have been purposefully trying to mislead.
• Confidence that the emerging, discovered
pattern is a reliable fact or generalization.
1. Labeling Sequence Data
Linear-chain CRFs
[Lafferty, McCallum, Pereira 2001]
Undirected graphical model,
trained to maximize conditional probability of outputs given inputs
Finite state model
Graphical model
[Figure: linear-chain CRF shown as finite state model and graphical model. FSM states OTHER, PERSON, PERSON, ORG, TITLE label the output sequence y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}; the input sequence of observations x_{t-1} … x_{t+3} is “said Arden Bement NSF Director …”.]

p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \Lambda_y(y_t, y_{t-1}) \, \Lambda_{xy}(x_t, y_t)

where \Lambda(\cdot) = \exp\big(\sum_k \lambda_k f_k(\cdot)\big)
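Below is a minimal Python sketch of how the quantities in this formula can be computed for a toy model: the potentials \Lambda are built from hypothetical feature functions and weights, Z(x) comes from the forward algorithm, and p(y | x) follows directly. It is an illustration of the linear-chain CRF equation above, not the authors' implementation.

import numpy as np

def log_potentials(x, weights, features, n_states):
    """log_phi[t, i, j] = sum_k lambda_k f_k(y_{t-1}=i, y_t=j, x, t).
    `features` is a list of hypothetical feature functions f(i, j, x, t)."""
    T = len(x)
    log_phi = np.zeros((T, n_states, n_states))
    for t in range(T):
        for i in range(n_states):
            for j in range(n_states):
                log_phi[t, i, j] = sum(w * f(i, j, x, t)
                                       for w, f in zip(weights, features))
    return log_phi

def log_partition(log_phi):
    """Forward algorithm: log Z(x), summing over all label sequences.
    Assumes a fixed dummy start state 0 for y_0."""
    T, S, _ = log_phi.shape
    alpha = log_phi[0, 0, :]
    for t in range(1, T):
        # log-sum-exp over the previous state for each current state
        alpha = np.logaddexp.reduce(alpha[:, None] + log_phi[t], axis=0)
    return np.logaddexp.reduce(alpha)

def log_prob(y, log_phi):
    """log p(y | x) = sum_t log phi(y_{t-1}, y_t, x_t) - log Z(x)."""
    score = log_phi[0, 0, y[0]]
    score += sum(log_phi[t, y[t - 1], y[t]] for t in range(1, len(y)))
    return score - log_partition(log_phi)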
Segmenting tables in textual gov’t reports, 85% reduction in error over HMMs.
Noun phrase, Named entity [HLT’03], [CoNLL’03]
Protein structure prediction [ICML’04]
IE from Bioinformatics text [Bioinformatics ‘04],…
Asian word segmentation [COLING’04], [ACL’04]
IE from research papers [HLT’04]
Object classification in images [CVPR ‘04]
Confidence Estimation in
Linear-chain CRFs
[Culotta, McCallum 2004]
Finite State Lattice
[Figure: lattice of FSM states (ORG, OTHER, PERSON, TITLE, …) over the output sequence y_{t-1} … y_{t+3}, with the input sequence “said Arden Bement NSF Director …” as observations x_{t-1} … x_{t+3}.]

p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \Lambda_y(y_t, y_{t-1}) \, \Lambda_{xy}(x_t, y_t)
Confidence Estimation in
Linear-chain CRFs
[Culotta, McCallum 2004]
Constrained Forward-Backward
[Figure: the same lattice, with paths constrained so that “Arden Bement” is labeled PERSON.]

p(\text{Arden Bement} = \text{PERSON} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \sum_{\mathbf{y} \in C} \prod_{t=1}^{T} \Lambda_y(y_t, y_{t-1}) \, \Lambda_{xy}(x_t, y_t)

where C is the set of label sequences consistent with the constraint.
Forward-Backward Confidence Estimation
improves accuracy/coverage
[Plot: accuracy vs. coverage for our forward-backward confidence, traditional token-wise confidence, and no use of confidence.]
Confidence Estimation Applied
• New word discovery in Chinese word segmentation
[Peng, Feng, McCallum, COLING 2004]
– Improves segmentation accuracy by ~25%
• Highlighting fields for Interactive Information Extraction
[Kristjansson, Culotta, Viola, McCallum, AAAI 2004]
Honorable Mention Award
– After fixing the least confident field,
constrained Viterbi automatically reduces error by
another 23%.
Accomplishments, Discoveries & Results:
• Extracting answers, and also uncertainty/confidence.
– Formally justified as marginalization in graphical models
– Applications to new word discovery in Chinese word segmentation,
and correction propagation in interactive IE
• Joint inference, with efficient methods
– Multiple, cascaded label sequences (Factorial CRFs)
– Multiple distant, but related mentions (Skip-chain CRFs)
– Multiple co-reference decisions (Affinity Matrix CRF)
– Integrating extraction with co-reference (Graphs & chains)
• Put it into a large-scale, working system
– Social network analysis from Email and the Web
– A new portal: research, people, connections.
1. Jointly labeling cascaded sequences
Factorial CRFs
[Sutton, Rohanimanesh,
McCallum, ICML 2004]
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words
But errors cascade: each stage must be nearly perfect to do well.
Joint prediction of part-of-speech and noun-phrase in newswire,
matching accuracy with only 50% of the training data.
Inference:
Tree reparameterization BP
[Wainwright et al, 2002]
2. Jointly labeling distant mentions
Skip-chain CRFs [Sutton, McCallum, SRL 2004]
[Figure: two linear-chain fragments, “… Senator Joe Green said today …” and “… Green ran for …”, modeled independently.]
Dependency among similar, distant mentions ignored.
2. Jointly labeling distant mentions
Skip-chain CRFs [Sutton, McCallum, SRL 2004]
[Figure: the same two fragments, now with a skip edge connecting the two mentions of “Green” so they are labeled jointly.]
14% reduction in error on most repeated field
in email seminar announcements.
Inference:
Tree reparameterization BP
[Wainwright et al, 2002]
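A small sketch of the graph-construction idea (in Python): connect identical capitalized tokens with skip edges so the CRF can share evidence between distant mentions. The heuristic shown (identical title-case words) is an assumption for illustration, not necessarily the exact rule used in the SRL 2004 paper.

def skip_edges(tokens):
    """Return (i, j) pairs of positions to connect with skip edges."""
    edges, seen = [], {}
    for t, w in enumerate(tokens):
        if w.istitle():                      # candidate named-entity word
            for u in seen.get(w, []):
                edges.append((u, t))         # skip edge between distant mentions
            seen.setdefault(w, []).append(t)
    return edges

# skip_edges("Senator Joe Green said today ... Green ran for ...".split())
# -> [(2, 6)], linking the two mentions of "Green"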
3. Joint co-reference among all pairs
Affinity Matrix CRF
“Entity resolution”
“Object correspondence”
[Figure: mentions “Mr Powell”, “Powell”, and “she”, with pairwise Y/N coreference variables and affinity scores (99, 45, 11) on the edges.]
25% reduction in error on co-reference of proper nouns in newswire.
Inference:
Correlational clustering
graph partitioning
[Bansal, Blum, Chawla, 2002]
[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]
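For intuition, here is a greedy Python sketch of coreference as graph partitioning over pairwise affinities: repeatedly merge the pair of clusters with the largest total affinity while any merge has positive score. The actual work uses correlational clustering [Bansal, Blum, Chawla, 2002]; this greedy variant is only an illustrative approximation.

def greedy_partition(mentions, score):
    """score(a, b): affinity (e.g. log-odds) that a and b are coreferent."""
    clusters = [{m} for m in mentions]
    def gain(c1, c2):
        return sum(score(a, b) for a in c1 for b in c2)
    while True:
        best, pair = 0.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                g = gain(clusters[i], clusters[j])
                if g > best:
                    best, pair = g, (i, j)
        if pair is None:                     # no positive-affinity merge left
            return clusters
        i, j = pair
        clusters[i] |= clusters[j]
        del clusters[j]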
4. Joint segmentation and co-reference
Joint IE and Coreference from Research Paper Citations
[Figure: textual citation mentions (noisy, with duplicates) are resolved into a paper database with fields, clean, duplicates collapsed.]

AUTHORS                TITLE        VENUE
Cowell, Dawid…         Probab…      Springer
Montemerlo, Thrun…     FastSLAM…    AAAI…
Kjaerulff              Approxi…     Technic…
Citation Segmentation and Coreference
Laurel, B.
Interface Agents: Metaphors with Character , in
The Art of Human-Computer Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Brenda Laurel . Interface Agents: Metaphors with Character , in
Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .
Citation Segmentation and Coreference
Laurel, B.
Interface Agents: Metaphors with Character , in
The Art of Human-Computer Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Brenda Laurel . Interface Agents: Metaphors with Character , in
Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .
1) Segment citation fields
Citation Segmentation and Coreference
Laurel, B. Interface Agents: Metaphors with Character, in
The Art of Human-Computer Interface Design, T. Smith (ed),
Addison-Wesley, 1990.
[Y / ? / N coreference decision]
Brenda Laurel. Interface Agents: Metaphors with Character, in
Smith, The Art of Human-Computr Interface Design, 355-366, 1990.
1) Segment citation fields
2) Resolve coreferent citations
Citation Segmentation and Coreference
Laurel, B. Interface Agents: Metaphors with Character, in
The Art of Human-Computer Interface Design, T. Smith (ed),
Addison-Wesley, 1990.
[Y / ? / N coreference decision]
Brenda Laurel. Interface Agents: Metaphors with Character, in
Smith, The Art of Human-Computr Interface Design, 355-366, 1990.

AUTHOR    = Brenda Laurel
TITLE     = Interface Agents: Metaphors with Character
PAGES     = 355-366
BOOKTITLE = The Art of Human-Computer Interface Design
EDITOR    = T. Smith
PUBLISHER = Addison-Wesley
YEAR      = 1990

Perform all three steps jointly, resolving conflicts:
1) Segment citation fields
2) Resolve coreferent citations
3) Form canonical database record
IE + Coreference Model
[Figure, built up in stages:
– Each observed citation x, e.g. “J Besag 1986 On the…”, has a CRF segmentation s over its tokens (AUT AUT YR TITL TITL …).
– Above the segmentation sit citation mention attributes c, e.g. AUTHOR = “J Besag”, YEAR = “1986”, TITLE = “On the…”.
– The same structure (x, s, c) is repeated for each citation mention, e.g. “Smyth , P … Data mining…”, “Smyth . 2001 Data Mining…”, and “J Besag 1986 On the…”.
– Binary coreference variables connect each pair of mentions (y between the two Smyth mentions, n for the pairs involving Besag).
– Research paper entity attribute nodes, e.g. AUTHOR = “P Smyth”, YEAR = “2001”, TITLE = “Data Mining…”, sit above the coreferent mentions.]
Such a highly connected graph makes
exact inference intractable, so…
Approximate Inference 1
• Loopy Belief Propagation
• Generalized Belief Propagation
[Figure: left, nodes v1…v6 with messages m1(v2), m2(v1), m2(v3), m3(v2) passed between nodes; right, a 3×3 grid of nodes v1…v9 with messages passed between regions.]
Here, a message is a conditional probability table passed among nodes.
But message size grows exponentially with the size of the overlap between regions!
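To make the message-passing remark concrete, here is a compact Python sketch of sum-product loopy BP on a pairwise model; the dictionaries of unary and pairwise potentials are hypothetical toy inputs, and damping, scheduling, and the generalized (region-based) variant are omitted.

import numpy as np

def loopy_bp(unary, pairwise, edges, n_iters=20):
    """unary[v]: (S,) node potential; pairwise[(u,v)]: (S,S) edge potential;
    edges: list of (u, v) pairs (one direction per edge)."""
    msgs = {(a, b): np.ones(len(unary[b]))
            for (u, v) in edges for (a, b) in [(u, v), (v, u)]}
    for _ in range(n_iters):
        for (u, v) in list(msgs):
            # product of u's potential and messages into u from all neighbors except v
            incoming = [msgs[(w, x)] for (w, x) in msgs if x == u and w != v]
            prod = unary[u] * (np.prod(incoming, axis=0) if incoming else 1.0)
            psi = pairwise[(u, v)] if (u, v) in pairwise else pairwise[(v, u)].T
            m = psi.T @ prod                 # marginalize over u's states
            msgs[(u, v)] = m / m.sum()
    beliefs = {}
    for v in unary:
        inc = [msgs[(w, x)] for (w, x) in msgs if x == v]
        b = unary[v] * (np.prod(inc, axis=0) if inc else 1.0)
        beliefs[v] = b / b.sum()
    return beliefs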
Approximate Inference 2
• Iterated Conditional Modes (ICM)
[Besag 1986]
[Figure: nodes v1…v6; all variables but one are held constant while that one is updated, cycling through v6, v5, v4, ….]

v_6^{(i+1)} = \arg\max_{v_6} P(v_6 \mid \mathbf{v} \setminus v_6), and likewise for v_5, v_4, …

Structured inference scales well here, but it is greedy and easily falls into local minima.
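A minimal Python sketch of ICM, assuming a `conditional(v, assignment)` callback that returns a dict of value → probability for variable v given all others; the variable ordering and initialization are arbitrary choices for illustration.

def icm(variables, domains, conditional, max_sweeps=50):
    """Coordinate-wise maximization of P(v | all other variables)."""
    assignment = {v: domains[v][0] for v in variables}   # arbitrary initialization
    for _ in range(max_sweeps):
        changed = False
        for v in variables:
            probs = conditional(v, assignment)            # P(v | v \ {v})
            best = max(probs, key=probs.get)
            if best != assignment[v]:
                assignment[v] = best
                changed = True
        if not changed:                                   # local optimum reached
            break
    return assignment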
Approximate Inference 2
• Iterated Conditional Modes (ICM)
[Besag 1986]
v_4^{(k+1)} = \arg\max_{v_4} P(v_4 \mid \mathbf{v} \setminus v_4)
• Iterated Conditional Sampling (ICS)
(our name)
Instead of selecting only the argmax, keep a sample of the highest-probability values of P(v_4 \mid \mathbf{v} \setminus v_4),
e.g. an N-best list (the top N values).
[Figure: nodes v1…v6 grouped into regions.]
Can use a “generalized version” of this,
doing exact inference on a region of
several nodes at once.
Here, a “message” grows only linearly
with overlap region size and N!
IE + Coreference Model
[Figure: the three citation mentions as before.]
Exact inference on these linear-chain regions.
From each chain, pass an N-best list into coreference.
IE + Coreference Model
[Figure: coreference decisions among the mentions.]
Approximate inference by graph partitioning, integrating out uncertainty in samples of extraction.
Scale to 1M citations with Canopies [McCallum, Nigam, Ungar 2000].
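A brief Python sketch of canopy clustering [McCallum, Nigam, Ungar 2000] as used for scaling: a cheap distance and two thresholds pre-group citations, and the expensive pairwise coreference model is run only within each canopy. The cheap distance and thresholds here are illustrative assumptions.

def canopies(items, cheap_dist, t_loose, t_tight):
    """t_tight < t_loose. Items within t_tight of a canopy center
    are removed from the pool of future centers."""
    remaining = set(range(len(items)))
    result = []
    while remaining:
        center = remaining.pop()
        canopy = {center}
        for i in list(remaining):
            d = cheap_dist(items[center], items[i])
            if d < t_loose:
                canopy.add(i)
            if d < t_tight:
                remaining.discard(i)
        result.append(canopy)
    return result   # run expensive coreference only within each canopy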
IE + Coreference Model
[Figure: entity attribute nodes above the coreferent mentions.]
Exact (exhaustive) inference over entity attributes.
IE + Coreference Model
[Figure: the same model.]
Revisit exact inference on the IE linear chains, now conditioned on entity attributes.
Parameter Estimation
Separately for different regions:
– IE linear-chain: exact MAP
– Coref graph edge weights: MAP on individual edges
– Entity attribute potentials: MAP, pseudo-likelihood
In all cases: climb the MAP gradient with a quasi-Newton method.
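For concreteness, a short Python sketch of MAP training with a quasi-Newton optimizer: maximize conditional log-likelihood plus a Gaussian prior by minimizing its negative with L-BFGS. The `log_likelihood_and_grad` callback (e.g. built from forward-backward over the relevant region) and the prior variance are assumptions for illustration.

import numpy as np
from scipy.optimize import minimize

def map_estimate(theta0, log_likelihood_and_grad, sigma2=10.0):
    """Return MAP parameters under a spherical Gaussian prior N(0, sigma2 I)."""
    def objective(theta):
        ll, grad = log_likelihood_and_grad(theta)
        prior = np.dot(theta, theta) / (2.0 * sigma2)
        # minimize negative penalized log-likelihood; return value and gradient
        return -(ll - prior), -(grad - theta / sigma2)
    result = minimize(objective, theta0, jac=True, method="L-BFGS-B")
    return result.x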
4. Joint segmentation and co-reference
[Wellner, McCallum,
Peng, Hay, UAI 2004]
Extraction from and matching of research paper citations.
[Figure: graphical model over two example citations,
“Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.” and
“Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.”,
with nodes for the observed citations (o), Segmentation (s), Citation attributes (c), Co-reference decisions (y), Database field values, and World Knowledge.]
35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
Inference:
Variant of Iterated Conditional Modes
[Besag, 1986]
Accomplishments, Discoveries & Results:
• Extracting answers, and also uncertainty/confidence.
– Formally justified as marginalization in graphical models
– Applications to new word discovery in Chinese word segmentation,
and correction propagation in interactive IE
• Joint inference, with efficient methods
– Multiple, cascaded label sequences (Factorial CRFs)
– Multiple distant, but related mentions (Skip-chain CRFs)
– Multiple co-reference decisions (Affinity Matrix CRF)
– Integrating extraction with co-reference (Graphs & chains)
• Put it into a large-scale, working system
– Social network analysis from Email and the Web
– A new portal: research, people, connections.
One Application Project:
Workplace effectiveness ~ Ability to leverage network of acquaintances
“The power of your little black book”
But filling the Contacts DB by hand is tedious and incomplete.
[Figure: an Email Inbox and the WWW automatically populate a Contacts DB.]
System Overview
[Flowchart: Email → CRF person name extraction → names → name coreference and homepage retrieval from the WWW → contact info and person name extraction → keyword extraction → social network analysis.]
An Example
To: “Andrew McCallum” [email protected]
Subject ...
Search for new people
First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governor’s Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis,…
Key Words: Information extraction, social network,…
Summary of Results

Person               Example keywords extracted
William Cohen        Logic programming, Text categorization, Data integration, Rule learning
Daphne Koller        Bayesian networks, Relational models, Probabilistic models, Hidden variables
Deborah McGuinness   Semantic web, Description logics, Knowledge representation, Ontologies
Tom Mitchell         Machine learning, Cognitive states, Learning apprentice, Artificial intelligence

Contact info and name extraction performance (25 fields):
       Token Acc   Field Prec   Field Recall   Field F1
CRF    94.50       85.73        76.33          80.76
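(For reference, Field F1 is the harmonic mean of field precision and recall: F1 = 2 × 85.73 × 76.33 / (85.73 + 76.33) ≈ 80.76.)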
Expert Finding:
When solving some task, find friends-of-friends with relevant expertise.
Avoid “stove-piping” in large org’s by automatically suggesting collaborators.
Given a task, automatically suggest the right team for the job. (Hiring aid!)
Social Network Analysis:
Understand the social structure of your organization.
Suggest structural changes for improved efficiency.
Main Application Project:
[Figure: entity-relation schema for a research-paper portal, with entities Research Paper, Person, Grant, Venue, University, Groups, Expertise, and relations such as Cites.]
Main Application Project:
Status:
• Spider running. Over 1.5M PDFs in hand.
• Best-in-world published results in IE from
research paper headers and references.
• First version of multi-entity co-reference running.
• First version of Web servlet interface up.
• Well-engineered: Java, servlets, SQL, Lucene,
SOAP, etc.
• Public launch this Fall.
Software Infrastructure
MALLET:
Machine Learning for Language Toolkit
• ~80k lines of Java
• Document classification, information extraction, clustering, coreference, POS tagging, shallow parsing, relational classification, …
• New package: Graphical models and modern inference methods.
– Variational, tree-reparameterization, stochastic sampling, contrastive divergence, …
• New documentation and interfaces.
• Unlike other toolkits (e.g. Weka), MALLET scales to millions of features and hundreds of thousands of training examples, as needed for NLP.
Released as Open Source Software.
http://mallet.cs.umass.edu
In use at UMass, MIT, CMU,
Stanford, Berkeley, UPenn,
UT Austin, Purdue…
Publications and Contact Info
• Conditional Models of Identity Uncertainty with Application to Noun Coreference. Andrew McCallum
and Ben Wellner. Neural Information Processing Systems (NIPS), 2004.
• An Integrated, Conditional Model of Information Extraction and Coreference with Application to
Citation Matching. Ben Wellner, Andrew McCallum, Fuchun Peng, Michael Hay. Conference on
Uncertainty in Artificial Intelligence (UAI), 2004.
• Collective Segmentation and Labeling of Distant Entities in Information Extraction. Charles Sutton
and Andrew McCallum. ICML workshop on Statistical Relational Learning, 2004.
• Extracting Social Networks and Contact Information from Email and the Web. Aron Culotta, Ron
Bekkerman and Andrew McCallum. Conference on Email and Spam (CEAS) 2004.
• Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting
Sequence Data. Charles Sutton, Khashayar Rohanimanesh and Andrew McCallum. ICML 2004.
• Interactive Information Extraction with Constrained Conditional Random Fields. Trausti
Kristjansson, Aron Culotta, Paul Viola and Andrew McCallum. AAAI 2004.
(Winner of Honorable Mention Award.)
• Accurate Information Extraction from Research Papers using Conditional Random Fields. Fuchun
Peng and Andrew McCallum. HLT-NAACL, 2004.
• Chinese Segmentation and New Word Detection using Conditional Random Fields. Fuchun Peng,
Fangfang Feng, and Andrew McCallum. International Conference on Computational Linguistics
(COLING 2004), 2004.
• Confidence Estimation for Information Extraction. Aron Culotta and Andrew McCallum. HLT-NAACL, 2004.
http://www.cs.umass.edu/~mccallum
End of Talk