Slide 1 - Peking University
Information Extraction
PengBo, Dec 2, 2010

Topics of today:
- IE: Information Extraction Techniques
- Wrapper Induction
- Sliding Windows
- From FST to HMM

What is IE? Example: the problem. Martin Baker, a person; a genomics job; employers' job-posting forms.

Example: a solution, extracting job openings from the Web:

foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.htm
OtherCompanyJobs: foodscience.com-Job1

Job openings can then be queried (Category = Food Services, Keyword = Baker, Location = Continental U.S.), and the extracted job information can be data mined.

Two ways to manage information
(Slide diagram: a retrieval system matches a query against raw documents and returns passages; with IE, facts such as advisor(wc,vc), advisor(yh,tm), affil(wc,mld), affil(vc,lti), fn(wc,"William"), fn(vc,"Vitor") are first extracted, so that a query like X: advisor(wc,Y) & affil(X,lti) can be answered by inference: {X=em; X=vc}.)

What is Information Extraction?
Recovering structured data from formatted text:
- Identifying fields (e.g. named entity recognition)
- Understanding relations between fields (e.g. record association)
- Normalization and deduplication
Today we focus mostly on field identification, and a little on record association.

Applications
IE from research papers; Chinese documents regarding weather (Chinese Academy of Sciences): 200k+ documents, several millennia old: Qing Dynasty archives, memos, newspaper articles, diaries.

Wrapper Induction

"Wrappers": if we think of things from the database point of view, we want to be able to run database-style queries, but the data sits in some horrid textual form or content-management system that does not allow such querying. We need to "wrap" the data in a component that understands database-style querying; hence the term "wrappers". Example record:

Title: Schulz and Peanuts: A Biography
Author: David Michaelis
List Price: $34.95

Wrappers: simple extraction patterns
Specify an item to extract for a slot using a regular expression pattern, e.g. a price pattern: "\b\$\d+(\.\d{2})?\b". A preceding (pre-filler) pattern and a succeeding (post-filler) pattern may be required to identify the end of the filler. Amazon list price:
Pre-filler pattern: "<b>List Price:</b> <span class=listprice>"
Filler pattern: "\b\$\d+(\.\d{2})?\b"
Post-filler pattern: "</span>"

Wrapper toolkits: specialized programming environments for writing & debugging wrappers by hand. Some resources: wrapper development tools such as LAPIS.

Wrapper induction
Task: learn extraction rules based on labeled examples. Hand-writing rules is tedious, error-prone, and time-consuming; learning wrappers automatically is wrapper induction.

Rule induction: formal rules are extracted from a set of observations; the rules extracted may represent a full scientific model of the data, or merely local patterns in it.
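The pre-filler / filler / post-filler patterns listed above can be combined into a small extraction function. A minimal sketch in Python; the function name and the sample HTML line are our own illustrations, not from a real page:

```python
import re

# Hedged sketch of a slot-filling wrapper built from the slide's
# pre-filler, filler, and post-filler patterns for the Amazon list price.
PRE = re.escape("<b>List Price:</b> <span class=listprice>")
FILLER = r"\$\d+(?:\.\d{2})?"
POST = re.escape("</span>")
PRICE = re.compile(PRE + r"\s*(" + FILLER + r")\s*" + POST)

def extract_price(html):
    """Return the list price if the page matches the wrapper, else None."""
    m = PRICE.search(html)
    return m.group(1) if m else None

page = '<b>List Price:</b> <span class=listprice> $34.95 </span>'
print(extract_price(page))  # -> $34.95
```

The pre- and post-filler patterns anchor the slot, so the filler regex only has to describe the value itself.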
Inputs: labeled examples (training & testing data); admissible rules (the hypothesis space); a search strategy. Desired output: a rule that performs well on both training and testing data.

Wrapper induction assumes highly regular source documents and relatively simple extraction patterns, so an efficient learning algorithm exists. Build a training set of documents paired with human-produced filled extraction templates, then learn extraction patterns for each slot using an appropriate machine learning algorithm. Goal: learn from a human teacher how to extract certain database records from a particular web site; the user gives the first K positive (and thus many implicit negative) examples to the learner.

Kushmerick's WIEN system
The earliest wrapper-learning system (published at IJCAI '97). Special things about WIEN: it treats the document as a string of characters; it learns to extract a relation directly, rather than extracting fields and then associating them; each example is a completely labeled page.

WIEN system: a sample wrapper. Learning LR wrappers from labeled pages (the slide shows the country-codes page being labeled one tuple at a time):

<HTML><HEAD>Some Country Codes</HEAD>
<B>Spain</B> <I>34</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Congo</B> <I>242</I><BR>
</BODY></HTML>

The wrapper is l1, r1, …, lK, rK. Example: find 4 strings <B>, </B>, <I>, </I>, i.e. l1, r1, l2, r2. An LR wrapper has left delimiters L1 = "<B>", L2 = "<I>" and right delimiters R1 = "</B>", R2 = "</I>".

LR: finding r1
<HTML><TITLE>Some Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>

LR: finding l1, l2 and r2
<HTML><TITLE>Some Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>

WIEN system
Assumes items are always in a fixed, known order:
… Name: J. Doe; Address: 1 Main; Phone: 111-1111. <p> Name: E. Poe; Address: 10 Pico; Phone: 777-1111. <p> …
Introduces several types of wrappers (LR among them).

Learning LR extraction rules
Admissible rules: prefixes & suffixes of the items of interest. Search strategy: start with the shortest prefix & suffix, and expand until correct.

Summary of WIEN
Advantages: fast to learn & extract. Drawbacks: cannot handle permutations and missing items; must label entire pages; requires a large number of examples.

Sliding Windows

Extraction by Sliding Window
E.g. looking for the seminar location ("3:30 pm", "7500 Wean Hall"):

GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
(CMU UseNet Seminar Announcement; the slide is repeated several times as the window slides over candidate fields.)

A "Naïve Bayes" Sliding Window Model [Freitag 1997]
… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
The window "Wean Hall Rm 5409" has prefix tokens w_{t-m} … w_{t-1}, content tokens w_t … w_{t+n}, and suffix tokens w_{t+n+1} … w_{t+n+m}.
Estimate Pr(LOCATION | window) using Bayes' rule. Try all "reasonable" windows (varying length and position); assume independence of the length, prefix words, suffix words, and content words; estimate quantities like Pr("Place" in prefix | LOCATION) from data. If Pr("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract it.
Equivalently, create a dataset of examples like
+ (prefix00, …, prefixColon, contentWean, contentHall, …, suffixSpeaker, …)
- (prefixColon, …, prefixWean, contentHall, …, contentSpeaker, suffixColon, …)
train a Naive Bayes classifier, and predict that a content window is a location when Pr(class=+ | prefix, contents, suffix) > threshold.
To think about: what if the extracted entities aren't consistent, e.g. if the location overlaps with the speaker?

"Naïve Bayes" Sliding Window Results
Domain: CMU UseNet Seminar Announcements (the announcement above, with the name, location, and time fields highlighted).
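The window-scoring procedure described above can be sketched as a toy Naive Bayes classifier over prefix/content/suffix features. The two-example "training set" and the feature naming below are our own illustrations, not Freitag's implementation:

```python
import math
from collections import defaultdict

# Toy sketch of a Freitag-style Naive Bayes sliding-window scorer.

def featurize(tokens, start, end, k=2):
    """Prefix / content / suffix word features for tokens[start:end]."""
    feats = ["pre=" + w for w in tokens[max(0, start - k):start]]
    feats += ["in=" + w for w in tokens[start:end]]
    feats += ["suf=" + w for w in tokens[end:end + k]]
    return feats

def train(examples):
    """examples: (feature list, is_location) pairs -> smoothed log-prob fn."""
    counts = {True: defaultdict(int), False: defaultdict(int)}
    totals = {True: 0, False: 0}
    for feats, label in examples:
        for f in feats:
            counts[label][f] += 1
            totals[label] += 1
    vocab = {f for c in counts.values() for f in c}
    def logp(label, f):  # add-one smoothed log P(feature | class)
        return math.log((counts[label][f] + 1) / (totals[label] + len(vocab)))
    return logp

def score(logp, feats):
    """Log-odds that the window is a LOCATION (uniform class prior)."""
    return sum(logp(True, f) - logp(False, f) for f in feats)

tokens = "Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
logp = train([(featurize(tokens, 2, 6), True),     # "Wean Hall Rm 5409"
              (featurize(tokens, 8, 10), False)])  # "Sebastian Thrun"
print(score(logp, featurize(tokens, 2, 6)))   # positive: looks like a location
print(score(logp, featurize(tokens, 8, 10)))  # negative: does not
```

A real system would score every window of every "reasonable" length and extract those above a threshold; this sketch only compares two fixed windows.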
Field        F1
Person Name  30%
Location     61%
Start Time   98%

Finite State Transducers

Finite State Transducers for IE
A basic method for extracting relevant information. IE systems generally use a collection of specialized FSTs: company name detection, person name detection, relationship detection.

Example sentence: "Frodo Baggins works for Hobbit Factory, Inc."
Text analyzer output: Frodo - ProperName; Baggins - ProperName; works - Verb; for - Prep; Hobbit - UnknownCap; Factory - NounCap; Inc - CompAbbr.

A regular expression for finding company names ("some capitalized words, maybe a comma, then a company abbreviation indicator"):
CompanyName = (ProperName | SomeCap)+ Comma? CompAbbr

Company Name Detection FSA and FST
(Slide diagrams: a four-state automaton whose arcs are labeled word, (CAP | PN), comma, and CAB; the transducer version additionally outputs CN on the CAB arc, and is non-deterministic. CAP = SomeCap, CAB = CompAbbr, PN = ProperName, CN = CompanyName, ε = empty string.)

Several FSTs, or a more complex FST, can be used to find one type of information (e.g. company names). FSTs are often compiled from regular expressions; probabilistic (weighted) FSTs are also used.

Finite State Transducers for IE
FSTs mean different things to different researchers in IE.
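The CompanyName pattern above can be approximated directly as a character-level regular expression. A hedged sketch; the abbreviation alternatives stand in for the CompAbbr lexical class and are not an exhaustive lexicon:

```python
import re

# Sketch of CompanyName = (ProperName | SomeCap)+ Comma? CompAbbr as a
# plain regex over capitalized words; the abbreviation list is illustrative.
CAP_WORD = r"[A-Z][A-Za-z]*"
COMP_ABBR = r"(?:Inc|Corp|Ltd|LLC)\.?"
COMPANY = re.compile(r"(?:%s[,\s]+)+%s" % (CAP_WORD, COMP_ABBR))

text = "Frodo Baggins works for Hobbit Factory, Inc."
print(COMPANY.search(text).group(0))  # -> Hobbit Factory, Inc.
```

"Frodo Baggins" is skipped because no company-abbreviation indicator follows it, mirroring the non-deterministic FST that only commits on seeing CAB.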
They may be based on lexical items (words), on statistical language models, or on deep syntactic/semantic analysis.

Example: FASTUS (Finite State Automaton Text Understanding System, SRI International). Cascading FSTs: recognize names; recognize noun groups, verb groups, etc.; construct complex noun/verb groups; identify patterns of interest; identify and merge event structures.

Hidden Markov Models

HMM formalism: an HMM is a probabilistic FSA, given by states s1, s2, … (with a special start state s1 and a special end state sn), a token alphabet a1, a2, …, state transition probabilities P(si | sj), and token emission probabilities P(ai | sj). HMMs are widely used in many language processing tasks, e.g. speech recognition [Lee, 1989], POS tagging [Kupiec, 1992], and topic detection [Yamron et al., 1998].

Applying HMMs to IE
The document is generated by a stochastic process modelled by an HMM: a token is a word, and a state is the "reason/explanation" for a given token. A 'Background' state emits tokens like 'the', 'said', …; a 'Money' state emits tokens like 'million', 'euro', …; an 'Organization' state emits tokens like 'university', 'company', … Extraction is done via the Viterbi algorithm, a dynamic programming technique for efficiently computing the most likely sequence of states that generated a document.

HMM for research papers: transitions and emissions [Seymore et al., '99]
(Slide diagrams show the learned state-transition structure and sample emissions per state, e.g. note: "submission to…", "to appear in…", "supported in part…", "copyright…"; institution: "carnegie mellon university…", "university of california", "dartmouth college"; title: "stochastic optimization…", "reinforcement learning…", "model building mobile robot…"; plus tokens like "ICML 1997".) Trained on 2 million words of BibTeX data from the Web.

What is an HMM?
Graphical model representation: variables by time; circles indicate states; arrows indicate probabilistic dependencies between states. Green circles are hidden states, each dependent only on the previous state (a Markov process: "the past is independent of the future given the present").

What is an HMM?
Purple nodes are observed states, each dependent only on its corresponding hidden state.

HMM formalism: {S, K, Π, A, B}
S : {s1 … sN} are the values for the hidden states.
K : {k1 … kM} are the values for the observations.
Π = {πi} are the initial state probabilities.
A = {aij} are the state transition probabilities.
B = {bik} are the observation (emission) probabilities.

(Slide diagram: a small bibliographic HMM over states Author, Year, Title, Journal, with example transition probabilities such as 0.9, 0.8, 0.5, 0.2 and emission tables such as Author: Y 0.1, A 0.1, C 0.8; Title: A 0.6, B 0.3, C 0.1; Year: dddd 0.8, dd 0.2; Journal: X 0.4, B 0.2, Z 0.4.) The structure of the HMM and the vocabulary must be provided; the model is trained with the Baum-Welch algorithm. Efficient dynamic programming algorithms exist for finding Pr(K), and for finding the highest-probability path S that maximizes Pr(K, S) (Viterbi).

Using the HMM to segment: find the highest-probability path through the HMM with Viterbi, a quadratic dynamic programming algorithm. (Slide example: segmenting the address "115 Grant street Mumbai 400070" over a lattice of House, Road, City, Pin states.)

Most likely path for a given sequence
The probability that the path π0 … πN is taken and the sequence x1 … xL is generated is

Pr(x1 … xL, π0 … πN) = a(0, π1) × ∏_{i=1..L} [ b_{πi}(x_i) × a(π_i, π_{i+1}) ]

with transition probabilities a and emission probabilities b.

Example
(Slide diagram: a small DNA HMM with begin state 0, emitting states 1-4, and end state 5, each emitting A, C, G, T with the probabilities shown.) For the sequence AAC and the path 1, 1, 3:

Pr(AAC, π) = a(0,1) · b1(A) · a(1,1) · b1(A) · a(1,3) · b3(C) · a(3,5) = 0.5 × 0.4 × 0.2 × 0.4 × 0.8 × 0.3 × 0.6

Finding the most probable path
Given observations o1 … oT, find the state sequence that best explains them: argmax_X P(X | O) (Viterbi algorithm, 1967).

Viterbi algorithm
Define δ_j(t) = max over x1 … x_{t-1} of P(x1 … x_{t-1}, o1 … o_{t-1}, x_t = j, o_t): the probability of the state sequence which maximizes the probability of seeing the observations up to time t-1, landing in state j, and seeing the observation at time t. The recursive computation is

δ_j(t+1) = max_i δ_i(t) · a_{ij} · b_{j, o_{t+1}}
ψ_j(t+1) = argmax_i δ_i(t) · a_{ij} · b_{j, o_{t+1}}

This is dynamic programming over the lattice of states and positions (e.g. the House / Road / City / Pin lattice for "115 Grant street Mumbai 400070"). Backtracking then recovers the most likely state sequence, working backwards:

X̂_T = argmax_i δ_i(T), and X̂_t = ψ_{X̂_{t+1}}(t+1).

Hidden Markov Models summary
A popular technique to detect and classify a linear sequence of information in text; the disadvantage is the need for large amounts of training data.

Related work
A system for extraction of gene names and locations from scientific abstracts (Leek, 1997); NERC (Bikel et al., 1997); McCallum et al.
(1999) extracted document segments that occur in a fixed or partially fixed order (title, author, journal); Ray and Craven (2001) extracted proteins, locations, genes and disorders and their relationships.

IE Technique Landscape

IE with Symbolic Techniques
Conceptual Dependency Theory (Schank, 1972; Schank, 1975): mainly aimed to extract semantic information about individual events from sentences at a conceptual level (i.e., the actor and an action).
Frame Theory (Minsky, 1975): a frame stores the properties or characteristics of an entity, action or event; it typically consists of a number of slots to refer to the properties named by the frame.
Berkeley FrameNet project (Baker, 1998; Fillmore and Baker, 2001): an online lexical resource for English, based on frame semantics and supported by corpus evidence.
FASTUS (Finite State Automaton Text Understanding System; Hobbs, 1996): uses a cascade of FSAs in a frame-based information extraction approach.

IE with Machine Learning Techniques
Training data: documents marked up with ground truth. In contrast to text classification, local features are crucial: features of the contents, of the text just before the item, of the text just after the item, and of the begin/end boundaries.

Good Features for Information Extraction
Creativity and domain knowledge required!
Example word features:
- identity of word; is in all caps; ends in "-ski"; is part of a noun phrase
- is in a list of city names; is under node X in WordNet or Cyc
- is in bold font; is in hyperlink anchor
- features of past & future; last person name was female; next two words are "and Associates"

Example orthographic and layout features:
- contains-question-mark, contains-question-word, ends-with-question-mark
- begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject
- first-alpha-is-capitalized, indented, indented-1-to-4, indented-5-to-10
- blank, contains-alphanum, more-than-one-third-space, contains-bracketed, only-punctuation
- prev-is-blank, prev-begins-with-ordinal, shorter-than-30, contains-http, contains-non-space, contains-number, contains-pipe

More good features (creativity and domain knowledge required!):
- Is Capitalized, Is Mixed Caps, Is All Caps, Initial Cap, Contains Digit, All lowercase, Is Initial
- Punctuation: Period, Comma, Apostrophe, Dash; Preceded by HTML tag
- Character n-gram classifier says string is a person name (80% accurate)
- In stopword list (the, of, their, etc); in honorific list (Mr, Mrs, Dr, Sen, etc); in person suffix list (Jr, Sr, PhD, etc); in name particle list (de, la, van, der, etc)
- In Census lastname list, segmented by P(name); in Census firstname list, segmented by P(name)
- In locations lists (states, cities, countries); in company name list ("J. C. Penny"); in list of company suffixes (Inc, & Associates, Foundation)
- Word features: lists of job titles, lists of prefixes, lists of suffixes, 350 informative phrases
- HTML/formatting features: {begin, end, in} x {<b>, <i>, <a>, <hN>} x {lengths 1, 2, 3, 4, or longer}; {begin, end} of line

Landscape of ML Techniques for IE: Classify Candidates
(Slide diagram over the sentence "Abraham Lincoln was born in Kentucky.")
Sliding window: a classifier asks "which class?" for each window, trying alternate window sizes.
Boundary models: classifiers mark BEGIN and END entity boundaries, asking "which class?" at each candidate position.
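A few of the word-level features listed above can be computed with a simple extractor. A sketch in which the function name, the tiny honorific list, and the feature inventory are our own illustrative choices:

```python
# Illustrative token-feature extractor in the spirit of the feature lists
# above; the feature names and the honorific list are our own.
HONORIFICS = {"Mr", "Mrs", "Dr", "Sen"}

def word_features(tokens, i):
    w = tokens[i]
    return {
        "identity": w.lower(),
        "is_capitalized": w[:1].isupper(),
        "is_all_caps": w.isupper(),
        "contains_digit": any(c.isdigit() for c in w),
        "in_honorific_list": w.rstrip(".") in HONORIFICS,
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }

tokens = "Dr. Smith spoke in Wean Hall".split()
print(word_features(tokens, 0)["in_honorific_list"])  # -> True
```

In practice such feature dictionaries would feed the candidate classifiers of the landscape above (sliding windows, boundary models, or state machines).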
Finite state machines: "Abraham Lincoln was born in Kentucky." Which is the most likely state sequence (with BEGIN/END states marking the entity)?
Wrapper induction: "<b><i>Abraham Lincoln</i></b> was born in Kentucky." Learn and apply a pattern for a website, e.g. <b> <i> PersonName.
Any of these models can be used to capture words, formatting, or both.

IE History
Pre-Web: mostly news articles. De Jong's FRUMP [1982]: a hand-built system to fill Schank-style "scripts" from news wire. Message Understanding Conference (MUC), DARPA ['87-'95], TIPSTER ['92-'96]: most early work was dominated by hand-built models, e.g. SRI's FASTUS with hand-built FSMs. But by the 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek '97], BBN [Bikel et al. '98].
Web: AAAI '94 Spring Symposium on "Software Agents"; much discussion of ML applied to the Web (Maes, Mitchell, Etzioni). Tom Mitchell's WebKB, '96: build KBs from the Web. Wrapper induction: initially hand-built, then ML: [Soderland '96], [Kushmerick '97], …

Summary
Sliding window: "Abraham Lincoln was born in Kentucky." A classifier asks "which class?", trying alternate window sizes.
Finite state machines: "Abraham Lincoln was born in Kentucky." Most likely state sequence?
Topics covered: Sliding Windows; From FST (Finite State Transducer) to HMM; Wrapper Induction (wrapper toolkits, LR wrappers).

Readings
[1] I. Muslea, S. Minton, and C. Knoblock, "A hierarchical approach to wrapper induction," in Proceedings of the Third Annual Conference on Autonomous Agents, Seattle, Washington, United States: ACM, 1999.

Thank You! Q&A
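Backup: the path probability and Viterbi decoding from the HMM section can be sketched in Python. Only the parameters used by the slides' Pr("AAC") worked example are filled in, and the address-segmentation probabilities are invented for illustration; this is a sketch, not the lecture's code:

```python
# Joint path probability: Pr(x, pi) = a(0, pi1) * prod_i b_pi_i(x_i) * a(pi_i, pi_{i+1})
def path_prob(obs, path, a, b, start=0, end=5):
    p = a[(start, path[0])]
    for i, (x, s) in enumerate(zip(obs, path)):
        nxt = path[i + 1] if i + 1 < len(path) else end
        p *= b[s][x] * a[(s, nxt)]
    return p

# Only the parameters used by the slide's Pr("AAC", path 1-1-3) example:
a = {(0, 1): 0.5, (1, 1): 0.2, (1, 3): 0.8, (3, 5): 0.6}
b = {1: {"A": 0.4}, 3: {"C": 0.3}}
print(path_prob("AAC", [1, 1, 3], a, b))  # 0.5*0.4*0.2*0.4*0.8*0.3*0.6 = 0.002304

# Viterbi: delta_j(t+1) = max_i delta_i(t) * a_ij * b_j(o_{t+1}), with backpointers.
def viterbi(obs, states, start_p, trans_p, emit_p):
    delta = {s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}
    back = []
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for j in states:
            i_best = max(states, key=lambda i: prev[i] * trans_p[i][j])
            delta[j] = prev[i_best] * trans_p[i_best][j] * emit_p[j].get(o, 0.0)
            ptr[j] = i_best
        back.append(ptr)
    path = [max(states, key=lambda s: delta[s])]
    for ptr in reversed(back):       # work backwards through the backpointers
        path.append(ptr[path[-1]])
    return path[::-1]

# Invented toy in the spirit of the "115 Grant ... 400070" address example:
states = ["House", "Road"]
start_p = {"House": 0.8, "Road": 0.2}
trans_p = {"House": {"House": 0.2, "Road": 0.8},
           "Road":  {"House": 0.1, "Road": 0.9}}
emit_p = {"House": {"115": 0.9, "Grant": 0.05, "street": 0.05},
          "Road":  {"115": 0.05, "Grant": 0.5, "street": 0.45}}
print(viterbi(["115", "Grant", "street"], states, start_p, trans_p, emit_p))
# -> ['House', 'Road', 'Road']
```

The decoder runs in time quadratic in the number of states per observation, matching the "quadratic dynamic programming" remark in the slides.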