Download Relation Extraction - Computer Science Department

Relation Extraction
Pierre Bourreau
LSI-UPC, PLN-PTM

Plan
- Relation Extraction description
- Sampling templates
- Reducing deep analysis errors…
- Conclusion

Relation Extraction Description
- Finding relations between entities in a text
- Filling pre-defined template slots
  - One-value-per-field
  - Multi-value
- Depends on analysis:
  - Chunking
  - Tokenization
  - Sentence Parsing…

Plan
- Relation Extraction description
- Sampling templates (Cox, Nicolson, Finkel, Manning)
- Reducing deep analysis errors…
- Conclusion

First Example: Sampling Templates
- Example: workshop announcement
- PASCAL corpus
- Relations to extract:
  - Dates of events
  - Workshop and conference names, acronyms and URLs
- Domain knowledge:
  - Constraints on dates
  - Constraints on names

PASCAL Corpus: semi-structured corpus
  <[email protected]>
  Type: cmu.andrew.academic.bio
  Topic: "MHC Class II: A Target for Specific Immunomodulation of the Immune Response"
  Dates: 3-May-95
  Time: <stime>3:30 PM</stime>
  Place: <location>Mellon Institute Conference Room</location>
  PostedBy: Helena R. Frey on 26-Apr-95 at 11:09 from andrew.cmu.edu
  Abstract:
  Seminar: Departments of Biological Sciences Carnegie Mellon and University of Pittsburgh
  Name: <speaker>Dr. Jeffrey D.
Hermes</speaker>
  Affiliation: Department of Autoimmune Diseases Research & Biophysical Chemistry
               Merck Research Laboratories
  Title: "MHC Class II: A Target for Specific Immunomodulation of the Immune Response"
  Host/e-mail: Robert Murphy, [email protected]
  Date: Wednesday, May 3, 1995
  Time: <stime>3:30 p.m.</stime>
  Place: <location>Mellon Institute Conference Room</location>
  Sponsor: MERCK RESEARCH LABORATORIES
  Schedule for 1995 follows: (as of 4/26/95)
  Biological Sciences Seminars 1994-1995
  Date      Speaker          Host
  April 26  Helen Salz       Javier L~pez
  May 3     Jefferey Hermes  Bob Murphy
  MERCK RESEARCH LABORATORIES

PASCAL Corpus: semi-structured corpus
  <[email protected]>
  Type: cmu.andrew.org.heinz.great-lake
  Topic: Re: Presentation
  CC:
  Dates: 25-Oct-93
  Time: <stime>12:30</stime>
  PostedBy: Richard Florida on 21-Oct-93 at 17:00 from andrew.cmu.edu
  Abstract:
  Folks:
  <paragraph> <sentence>Our client has requested that the presentation be postponed until Monday during regular class-time</sentence>. <sentence>He has been asked to make a presentaion for the Governor of Michigan and Premier of Ontario tommorrow morning in Canada, and was afraid he could not catch a plane in time to make our presentation</sentence>. <sentence>After consulting with Rafael and a sub group of project managers, it was decided that Monday was the best feasible presentation alternative</sentence>. <sentence>Greg has been able to secure Room 2503 in Hamburg Hall for our presentation Monday during regular class-time</sentence>. </paragraph>
  <paragraph><sentence>We will meet tommmorow in <location>2110</location> at <stime>12:30</stime> (lunch provided) to finalize presentation and briefing book</sentence>. <sentence>Also, the client has faxed a list of reactions and questions for discussion which we should review</sentence>. <sentence>Thanks very much for your hard work and understanding</sentence>.
<sentence>Look forward to seeing you tommorrow</sentence>.</paragraph>
  Richard

Idea
- Sampling templates:
  - Generate all available templates
  - Give a probability to each of them
- Relational model:
  - Constraints on dates: order
    1. submission dates
    2. acceptance dates
    3. workshop dates / camera-ready dates
  - Constraints on names:
    - Slots: name, acronym, URL
    - URL is generated from acronyms

Baselines
- CRF
  - Cliques: max=2
  - Viterbi algorithm
  - Token => GATE tokenization
- CMM
  - Same setup as the CRF
  - Window of the four previous tokens

Templates sampling
- Tokens: p(L_i | L_{i-1}) or p(L_i | L_{i-1}, …, L_{i-4}) on 100 of documents
- Template: each slot holds one/no filler value
  -> date templates: SUB_DATE, ACC_DATE, WORK_DATE, CAMREADY_DATE
  -> name templates: CONF_NAME, CONF_ACRO, CONF_URL, WORK_NAME, WORK_ACRO, WORK_URL

Templates sampling
- D: a distribution of these templates over the training set => LOCAL MODEL (P_L)

Templates scoring: Date Model
- P_A/P: probability of present/absent fields, set from training data
- P_o: ordering probability; constraint violations are penalized
- P_rel = P_A/P * P_o

Templates scoring: Name Model
- Name->Acronym: independent module (likelihood score, Chang 2002): P_nam->acr
- Acronym->URL: empirical probability from training: P_acr->url
- Problem: missing entries give an advantage to incomplete templates.
- P_A/P: weights the templates (in training, most values are filled)
- P_rel = P_nam->acr * P_acr->url * P_A/P

Results: 300 documents

Results
- No improvement over CRF
  - CRF accepts variation (e.g. names)
  - Rel. model does not improve CRF (not on graph) => lower recall
  - Small CRF window => less information in the distribution.
- Substantial improvement over CMM (5%)

Plan
- Relation Extraction description
- Sampling templates
- Reducing deep analysis errors (Zhao, Grishman)
- Conclusion

Problem
- Use different levels of syntactic analysis for the task:
  - Tokenization
  - Chunking
  - Sentence parsing
  - …
- The more information they give, the less accurate they are.
- => combine them to correct errors

ACE task… remember
- Entities: PERson, ORGanisation, FACility, GeoPoliticalEntity, LOCation, WEApon, VEHicle
- Mentions: NAM (proper), NOM (nominal), PRO (pronoun)
- Relations: EMP-ORG, PHYS, GPE-AFF, PER-SOC, DISC, ART, Other

Kernel, SVM … nice properties
- Kernel:
  - Function replacing the scalar (dot) product of vectors
  - Lets us map a problem into a higher-dimensional space to solve it
  - Sums and products of kernels are again kernels.
- SVM:
  - SVM can pick up features for best separation

The relational model
- R=(arg1, arg2, seq, link, path)
  - arg1, arg2: the two entities to compare
  - seq=(t1, …, tn): sequence of intervening tokens
  - link=(t1, …, tm): like seq, but only with the important words
  - path: a dependency path…
- T=(word, pos, base)
  - pos: part-of-speech tag
  - base: morphological base
- E=(tk, type, subtype, mtype)
  - type: according to ACE type
  - subtype: refining
  - mtype: the way it is mentioned
- DT=(T, dseq)
  - dseq=(arc1, …, arcn)
- ARC=(w, dw, label, e)
  - w: current token
  - dw: token connected to w
  - label: role label of this arc
  - e: direction of the arc

The relational model: example
  arg1=((“areas”, “NNS”, “area”, dseq), “LOC”, “region”, “NOM”)
  arg1.dseq=((OBJ, areas, in, 1), (OBJ, areas, controlled, 1))
  path=((OBJ, areas, controlled, 1), (SBJ, controlled, troops, 0))

Kernels
1. Argument kernel: matches two tokens, comparing each of their attributes (word, pos, type…)
2. Bigram kernel: matches tokens over a window of size 1
3. Link sequence kernel: relations often occur in a short context.

Kernels (2)
4. Dependency path kernel: how similar are two paths?
5.
Local dependency kernel: same idea as the path kernel, but more informative. Helpful if the dependency path does not exist.

Results: adding info into SVM
- The more information we give, the better the result.
- Link Sequence Kernel boosts results.

Results: SVM or KNN
- SVM performs better overall
- Polynomial extension has no effect on KNN.
- Training problem in the last three.
- … good results on the official ACE task… secret, no comparison available

Conclusion
- Really simple method
- Nice properties of kernels/SVMs
- This method is generic!!! (tested on annotated text)
- SVM seems to perform better for this task.
- … but it is hard to compare the two methods, as their goals are different.

References
[1] Cox, Nicolson, Finkel, Manning, Langley. Template Sampling for Leveraging Domain Knowledge in Information Extraction. Stanford University.
[2] Zhao, Grishman. Extracting Relations with Integrated Information Using Kernel Methods. New York University. 2005.
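
Illustration: date template scoring
The date model from the template sampling slides (P_rel = P_A/P * P_o) can be sketched roughly as follows. This is only a minimal sketch, not the authors' implementation: the presence probabilities, the penalty value, the rank table (which places workshop and camera-ready dates at the same rank) and all function names are assumptions made up for the example.

```python
from datetime import date
from itertools import combinations

# Assumed reading of the ordering constraint on the slides:
# submission < acceptance < workshop / camera-ready.
SLOT_RANK = {"SUB_DATE": 1, "ACC_DATE": 2, "WORK_DATE": 3, "CAMREADY_DATE": 3}

# Hypothetical presence probabilities; the slides say these are set from training data.
P_PRESENT = {"SUB_DATE": 0.9, "ACC_DATE": 0.7, "WORK_DATE": 0.95, "CAMREADY_DATE": 0.6}

VIOLATION_PENALTY = 0.1  # assumed multiplicative penalty per ordering violation


def presence_prob(template):
    """P_A/P: product over slots of P(present) or P(absent)."""
    p = 1.0
    for slot, p_present in P_PRESENT.items():
        filled = template.get(slot) is not None
        p *= p_present if filled else 1.0 - p_present
    return p


def ordering_prob(template):
    """P_o: 1.0 if all filled dates respect the ranks, penalized per violation."""
    filled = [(s, d) for s, d in template.items() if d is not None]
    violations = 0
    for (s1, d1), (s2, d2) in combinations(filled, 2):
        if SLOT_RANK[s1] < SLOT_RANK[s2] and d1 > d2:
            violations += 1
        elif SLOT_RANK[s2] < SLOT_RANK[s1] and d2 > d1:
            violations += 1
    return VIOLATION_PENALTY ** violations


def relational_score(template):
    """P_rel = P_A/P * P_o."""
    return presence_prob(template) * ordering_prob(template)


# Two candidate templates built from the same dates; the second swaps two slots.
good = {"SUB_DATE": date(2005, 1, 10), "ACC_DATE": date(2005, 2, 20),
        "WORK_DATE": date(2005, 6, 1), "CAMREADY_DATE": date(2005, 3, 15)}
bad = dict(good, SUB_DATE=date(2005, 2, 20), ACC_DATE=date(2005, 1, 10))
best = max([good, bad], key=relational_score)
```

With these made-up numbers, the template whose dates respect the order is preferred, which is the behaviour the slides describe for the ordering penalty.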

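
Illustration: combining kernels
The slides note that sums and products of kernels are again kernels, which is what allows the individual kernels (argument, bigram, link sequence, …) to be combined. The sketch below shows that idea on a heavily simplified version of the relational model. The attribute-matching token kernel, the position-aligned sequence kernel and every name here are assumptions for illustration, not the kernels defined by Zhao and Grishman.

```python
from typing import NamedTuple, Tuple


class Token(NamedTuple):
    word: str
    pos: str
    base: str


class Relation(NamedTuple):
    arg1: Token
    arg2: Token
    seq: Tuple[Token, ...]  # tokens between the two arguments


def token_kernel(t1, t2):
    """Number of attributes (word, pos, base) on which two tokens agree."""
    return sum(1 for a, b in zip(t1, t2) if a == b)


def argument_kernel(r1, r2):
    """Simplified argument kernel: compare arg1 with arg1 and arg2 with arg2."""
    return token_kernel(r1.arg1, r2.arg1) + token_kernel(r1.arg2, r2.arg2)


def sequence_kernel(r1, r2):
    """Simplified sequence kernel: compare intervening tokens position by position."""
    return sum(token_kernel(a, b) for a, b in zip(r1.seq, r2.seq))


def add_kernels(*kernels):
    """The sum of kernels is a kernel."""
    return lambda x, y: sum(k(x, y) for k in kernels)


def multiply_kernels(k1, k2):
    """The product of two kernels is a kernel."""
    return lambda x, y: k1(x, y) * k2(x, y)


composite = add_kernels(argument_kernel, sequence_kernel)
product = multiply_kernels(argument_kernel, sequence_kernel)

r1 = Relation(Token("areas", "NNS", "area"), Token("troops", "NNS", "troop"),
              (Token("controlled", "VBN", "control"), Token("by", "IN", "by")))
r2 = Relation(Token("regions", "NNS", "region"), Token("troops", "NNS", "troop"),
              (Token("held", "VBN", "hold"), Token("by", "IN", "by")))

similarity = composite(r1, r2)
```

A composite function of this kind could be passed as a custom kernel to an SVM or used as the similarity in KNN; which combination works best is an empirical question, as the results slides indicate.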