Parallel Entity and Treebank Annotation
Ann Bies, Linguistic Data Consortium*
Seth Kulick, Institute for Research in Cognitive Science*
Mark Mandel, Linguistic Data Consortium*
*University of Pennsylvania
New Frontiers in Corpus Annotation Workshop, 6/29/05

Mining the Bibliome: Information Extraction from the Biomedical Literature
• NSF ITR grant EIA-0205448
• Collaboration with Division of Oncology, Children's Hospital of Philadelphia
• PubMed abstracts: mining cancer literature for associations that link variations in genes with malignancies
• http://bioie.ldc.upenn.edu - release 0.9 available
• 1157 abstracts entity annotated, 318 also treebanked

Outline
• Entity Annotation
• Treebank Annotation
  • Modifications from Penn Treebank guidelines
• Annotation Process and Merged Format
• Entity-Constituent Mapping
  • How successful?

Entity Annotation
• Gene X with genomic Variation event Y is correlated with Malignancy Z
• Gene: composite entity, can refer to gene or protein: Gene-generic, Gene-protein, Gene-RNA
• (Malignancy: under development, not included in release 0.9)
• Variation Event: relation between entities representing different aspects of a variation

Entity Annotation - Variations
• Variation: a relation between variation component entities
• "a single nucleotide substitution at codon 249, predicting a serine to cysteine amino acid substitution"
  • Var-type: substitution
  • Var-location: codon 249
  • Var-state-orig: serine
  • Var-state-altered: cysteine

A Change in Tokenization
• Tokenization: many hyphenated words are treated as separate tokens
• "New York-based"
  • Old (Penn Treebank) tokenization: [New] [York-based]
  • New tokenization: [New][York][-][based]

Discontinuous Entities
• E.g.: "K- and N-ras"
• Tokenization: [K][-][and][N][-][ras]
• Entity annotation:
  • [K][-]… [ras]: "chain" of discontinuous tokens
  • [N][-][ras]: contiguous tokens
• Splitting up not always done, depends on
coordination

Treebank Annotation
• Default NP right-branching structure
• (NP (JJ primary) (NN liver) (NN cancer))
• Simplifies multi-token nominal annotation
• Allows recovery of implicit constituents:
  • (NP (JJ primary) (newnode (NN liver) (NN cancer)))
• Entities sometimes map to such implicit constituents

Treebank Annotation
• Exceptions to right-branching marked by NML
• So: any two or more non-final elements that form a constituent are a NML
• (ADJP (NML (NNP New) (NNP York)) (HYPH -) (VBN based))
• (ADJP (NML (NN breast) (NN cancer)) (HYPH -) (VBN associated))
• (NP (NML (NN human) (NN liver) (NN tumor)) (NN analysis))

Treebank Annotation
• Placeholder *P* for distributed material in coordinated nominal structures
• "K- and N-ras"
  [Tree diagram: coordinated NP. First conjunct: (NN K) (HYPH -) and an NML-1 holding the placeholder (-NONE- *P*); second conjunct: (NN N) (HYPH -) and an NML-1 over "ras".]

Treebank Annotation
• To the left or right
• "codon 12 or 13"
  (NP (NP (NML-1 (NN codon)) (CD 12))
      (CC or)
      (NP (NML-1 (-NONE- *P*)) (CD 13)))

First Release
• Goal: let users choose how to handle the integration of entity and treebank levels
• Standoff annotation for entity and treebank
• Identical tokenization
• Merged representation
• Penn Treebank style
• (POSTag:[from..to] terminal)
• Entity listing before each tree.
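The merged representation named on the First Release slide, `(POSTag:[from..to] terminal)` with an entity listing before each tree, can be read with two regular expressions. The following is a minimal sketch, not the project's own tooling: the function names are hypothetical, and it assumes entity lines have the form `;[from..to]:type:"text"`. Discontinuous ("chain") entities are not handled, since their listing format is not given here.

```python
import re

# Assumed terminal form: (POSTag:[from..to] token)
TERMINAL = re.compile(r'\(([^\s()]+):\[(\d+)\.\.(\d+)\]\s+([^\s()]+)\)')
# Assumed entity-listing form: ;[from..to]:entity-type:"text"
ENTITY = re.compile(r'^;\[(\d+)\.\.(\d+)\]:([\w-]+):\s*"(.*)"$')


def read_terminals(tree_text):
    """Return (pos_tag, start, end, token) for each terminal in a merged tree."""
    return [(tag, int(a), int(b), tok)
            for tag, a, b, tok in TERMINAL.findall(tree_text)]


def read_entities(lines):
    """Return (start, end, entity_type, text) for each entity listing line."""
    entities = []
    for line in lines:
        m = ENTITY.match(line.strip())
        if m:  # lines such as ";In the present study ..." are skipped
            entities.append((int(m.group(1)), int(m.group(2)),
                             m.group(3), m.group(4)))
    return entities


tree = ('(NP (DT:[369..372] the) (NN:[373..378] K-ras) '
        '(NML (NN:[379..383] exon) (CD:[384..385] 2)))')
print(read_terminals(tree)[1])   # ('NN', 373, 378, 'K-ras')
print(read_entities([';[379..385]:variation-location:"exon 2"']))
```

Because both levels share one tokenization, the character offsets returned by the two functions can be compared directly without any realignment step.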

Merged Output Example
sentence 4 Span:331..605
;In the present study, we screened for
;the K-ras exon 2 point mutations in a
;group of 87 gynecological neoplasms
;[373..378]:gene-rna:"K-ras"
;[379..385]:variation-location:"exon 2"
;[386..401]:variation-type:"point mutations"

Merged Output Example
[…]
((VP (VBD:[356..364] screened)
     (PP-CLR (IN:[365..368] for)
             (NP (DT:[369..372] the)
                 (NN:[373..378] K-ras)
                 (NML (NN:[379..383] exon)
                      (CD:[384..385] 2))
                 (NN:[386..391] point)
                 (NNS:[392..401] mutations)))
[…]

Merged Output Example
;[373..378]:gene-rna:"K-ras"
;[379..385]:variation-location:"exon 2"
;[386..401]:variation-type:"point mutations"
((VP (VBD:[356..364] screened)
     (PP-CLR (IN:[365..368] for)
             (NP (DT:[369..372] the)
                 (NN:[373..378] K-ras)
                 (NML (NN:[379..383] exon)
                      (CD:[384..385] 2))
                 (NN:[386..391] point)
                 (NNS:[392..401] mutations)))

Entity-Constituent Mapping: Exact Match
• Exact Match: a node in the tree yields exactly the entity:
;[379..385]:variation-location:"exon 2"
(NP (DT:[369..372] the)
    (NN:[373..378] K-ras)
    (NML (NN:[379..383] exon)
         (CD:[384..385] 2))
    (NN:[386..391] point)
    (NNS:[392..401] mutations))

Entity-Constituent Mapping: Missing Node
• Missing Node: it is possible to add a node that yields exactly the entity
;[386..401]:variation-type:"point mutations"
(NP (DT:[369..372] the)
    (NN:[373..378] K-ras)
    (NML (NN:[379..383] exon)
         (CD:[384..385] 2))
    (NN:[386..391] point)
    (NNS:[392..401] mutations))

Entity-Constituent Mapping: Missing Node
(NP (DT:[369..372] the)
    (NN:[373..378] K-ras)
    (NML (NN:[379..383] exon)
         (CD:[384..385] 2))
    (newnode (NN:[386..391] point)
             (NNS:[392..401] mutations)))
• Done for internal research purposes, not in release (implicit constituents)
• NML already in release (explicit constituents)

Entity-Constituent Mapping: Crossing
• Crossing: the entity cuts across constituent boundaries, so no node can be added that yields it
• Typical case: entity containing text
corresponding to a prepositional phrase
One ER showed a G-to-T mutation in the second position of codon 12
[1280..1307]: variation-location: "second position of codon 12"

Entity-Constituent Mapping: Crossing
[1280..1307]: variation-location: "second position of codon 12"
(NP (NP (DT:[1276..1279] the)
        (JJ:[1280..1286] second)
        (NN:[1287..1295] position))
    (PP (IN:[1296..1298] of)
        (NP (NN:[1299..1304] codon)
            (CD:[1305..1307] 12))))
• Crossing: the determiner is in the NP but not in the entity.
• Could relax matching, or modify entity or treebank annotation. This was not done.

Entity-Constituent Mapping: Chain Exact Match
• "codon 12 or 13"
• Entities: "codon 12", "codon..13"
  (NP (NP (NML-1 (NN codon)) (CD 12))
      (CC or)
      (NP (NML-1 (-NONE- *P*)) (CD 13)))

Entity-Constituent Mapping: Chain Not an Exact Match
• "specific codons (12, 13, and 61)"
• Entities: "codons..12", "codons..13", "codons..61"
  (NP (JJ specific)
      (NNS codons)
      (PRN (-LRB- -LRB-)
           (NP (NP (CD 12)) (, ,)
               (NP (CD 13)) (, ,)
               (CC and)
               (NP (CD 61)))
           (-RRB- -RRB-)))

Multiple Token Entities (Non-Chained)
Entity Type         Total   Exact Match   Missing Node   Crossing
Gene-generic            6             4              1          1
Gene-protein          349           236            103         10
Gene-RNA              156           115             35          6
Var-location          445           348             68         29
Var-state-orig          5             3              1          1
Var-state-altered      10             8              0          2
Var-type              271           123            142          6
Total                1242           837            350         55 (4.4%)

Multiple Token Entities (Chained)
Entity Type         Total   Exact Match   Not Exact Match
Gene-generic            0             0                 0
Gene-protein            6             4                 2
Gene-RNA               36            29                 7
Var-location          125           103                22
Var-state-orig          0             0                 0
Var-state-altered       0             0                 0
Var-type                1             0                 1
Total                 168           136                32 (19%)

Conclusion
• Annotation of entities and treebank done together
• Identical tokenization for entities and trees, with standoff annotation
• Allows flexibility in use of integrated annotation
• Only 6.2% of the entities cannot be mapped to an implicit or explicit constituent node (87 of the 1410 multi-token entities: 55 crossing plus 32 chained non-matches)
• Changes in Treebank guidelines
• Use of Relations for potentially large entities
• Next: Relation annotation and integrated
taggers

References
• Ryan's tagger
• Dan's parser
• Web page again

Entity Annotation - Variations
• "(S249C)"
  • Var-type: none
  • Var-location: 249
  • Var-state-orig: S
  • Var-state-altered: C
• Gene-{RNA,generic,protein} disambiguates gene metonymy
• Var-{type,location,state-orig,state-altered} are different kinds of entities

Entities
                                    -- Multiple Tokens --
Entity Type         Single Tokens   Nonchains      Chains
Gene-generic                  104           6           0
Gene-protein                  921         349           6
Gene-RNA                     1987         156          36
Var-location                   95         445         125
Var-state-orig                151           5           0
Var-state-altered             162          10           0
Var-type                      235         271           1

Introduction
• Corpus for biomedical IE with several levels of annotation:
  • Entity
  • Syntactic Structure (Treebank)
  • Relations (McDonald et al, ACL 2005)
• Ideal: entities mapped to treebank constituents
• Allow users to choose how to integrate the levels

Annotation Process
• Tokenization, Entity, POS, Treebanking, Merged Representation
• Minimal requirement: identical tokenization for entity and treebank annotation
• Did not require an entity/constituent correspondence, but how did it work out?
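
The three categories from the Entity-Constituent Mapping slides (exact match, missing node, crossing) can be expressed as a span check over a merged tree. This is only a sketch under stated assumptions, not the procedure the authors used: `parse` and `classify` are hypothetical names, trees are assumed to contain no empty elements such as (-NONE- *P*), and chained entities are out of scope. A "missing node" is taken to mean that a contiguous run of sibling nodes covers exactly the entity span.

```python
import re


def parse(text):
    """Parse one merged tree into nested (start, end, children) tuples."""
    tokens = re.findall(r'\(|\)|[^\s()]+', text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                          # consume "("
        label = tokens[pos]
        pos += 1
        m = re.fullmatch(r'(.+):\[(\d+)\.\.(\d+)\]', label)
        if m:                             # terminal: (TAG:[from..to] word)
            pos += 2                      # consume the word and ")"
            return (int(m.group(2)), int(m.group(3)), [])
        children = []
        while tokens[pos] == '(':
            children.append(node())
        pos += 1                          # consume ")"
        # a constituent spans from its first child's start to its last child's end
        return (children[0][0], children[-1][1], children)

    return node()


def classify(tree, start, end):
    """Return 'exact match', 'missing node', or 'crossing' for an entity span."""
    missing = False
    stack = [tree]
    while stack:
        s, e, kids = stack.pop()
        if (s, e) == (start, end):
            return "exact match"
        # entity starts at one child and ends at a later sibling of the same parent
        if start in {k[0] for k in kids} and end in {k[1] for k in kids}:
            missing = True
        stack.extend(kids)
    return "missing node" if missing else "crossing"


np_tree = parse('(NP (DT:[369..372] the) (NN:[373..378] K-ras) '
                '(NML (NN:[379..383] exon) (CD:[384..385] 2)) '
                '(NN:[386..391] point) (NNS:[392..401] mutations))')
print(classify(np_tree, 379, 385))   # exact match
print(classify(np_tree, 386, 401))   # missing node
```

Applied to the "second position of codon 12" tree with the span 1280..1307, the same function returns "crossing", because the entity starts inside one NP and ends inside the following PP.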