Customizing Gene Taggers for BeeSpace
Jing Jiang ([email protected])
March 9, 2005

Entity Recognition in BeeSpace
• Types of entities we are interested in:
  – Genes
  – Sequences
  – Proteins
  – Organisms
  – Behaviors
  – …
• Currently, we focus on genes
Mar 9, 05 BeeSpace 2

Input and Output
• Input: free text (with simple XML tags)
  – <?xml version="1.0" encoding="UTF-8"?><Document id="1">…We have cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and examined its responses to JH. …</Document>
• Output: tagged text (XML format)
  – <?xml version="1.0" encoding="UTF-8"?><Document id="1">…<Sent><NP>We</NP> have <VP>cloned</VP> and <VP>sequenced</VP> <NP>a cDNA encoding <Gene>Apis mellifera ultraspiracle</Gene></NP> (<Gene>AMUSP</Gene>) and <VP>examined</VP> <NP>its responses to JH</NP>.</Sent>…</Document>

Challenges
• No complete gene dictionary
• Many variations:
  – Acronyms: hyperpolarization-activated ion channel (Amih)
  – Synonyms: octopamine receptor (oa1, oar, amoa1)
  – Common English words: at (arctops), by (3R-B)
• Different genes, or a gene and its protein, may share the same name/symbol

Automatic Gene Recognition: Characteristics of Gene Names
• Capitalization (especially acronyms)
• Numbers (gene families)
• Punctuation: -, /, :, etc.
• Context:
  – Local: surrounding words such as "gene", "encoding", "regulation", "expressed", etc.
  – Global: the same noun phrase occurs several times in the same article

Existing Tools
• KeX (Fukuda)
  – Based on hand-crafted rules
  – Recognizes proteins and other entities
  – Requires human effort; not easy to modify
• ABNER & YAGI (Settles)
  – Based on conditional random fields (CRFs) that learn the "rules"
  – ABNER identifies and classifies different entities, including proteins, DNAs, RNAs, and cells
  – YAGI recognizes genes and gene products
  – No training

Existing Tools (cont.)
• LingPipe (Alias-i, Inc.)
  – Uses a generative statistical model based on word trigrams and tag bigrams
  – Can be trained
  – Has two trained models
• Others
  – NLProt (SVM)
  – AbGene (rule-based)
  – GeneTaggerCRF (CRFs)

Comparison of Existing Tools
• Performance on a few manually annotated, public data sets (protein names):
  – GENIA (2,000 abstracts on "human & blood cell & transcription factor")
  – Yapex (99 abstracts on "protein binding & interaction & molecular")
  – UTexas (750 abstracts on "human")
• Performance on a honeybee sample data set:
  – Biosis search "apis mellifera gene"

Comparison of Existing Tools (cont.)

  Tool      Data set   P        R        F1
  KeX       GENIA      0.3644   0.4191   0.3898
  KeX       Yapex      0.3451   0.3931   0.3675
  KeX       UTexas     0.1775   0.3445   0.2343
  ABNER     GENIA      0.7876   0.7485   0.7675
  ABNER     Yapex      0.4351   0.4441   0.4396
  ABNER     UTexas     0.3916   0.4314   0.4105
  LingPipe  GENIA      0.9298   0.7388   0.8234
  LingPipe  Yapex      0.4168   0.4619   0.4382
  LingPipe  UTexas     0.3633   0.3918   0.3770

Comparison of Existing Tools (cont.)
• KeX on honeybee data
  – False positives: company names, country names, etc.
  – Does not differentiate between genes, proteins, and other chemicals
• YAGI on honeybee data
  – False negatives: not all occurrences of the same gene name are tagged
  – Problems with entity types and boundary detection
• LingPipe on honeybee data
  – Similar to YAGI

Lessons Learned
• Machine learning methods outperform hand-crafted rule-based systems
• Machine learning methods have an over-fitting problem
• Existing tools need to be customized for BeeSpace
  – LingPipe is a good choice
• There is still room for better feature selection
  – E.g., global context

Customization
• Train LingPipe on a better training data set
  – Use fly (Drosophila) genes
  – F1 increased from 0.2207 to 0.7226 on held-out fly data
  – Tested on honeybee data: results
    • Some gene names are learned (Record 13)
    • Some false positives are removed (proteins, RNAs)
    • Some false positives are introduced
  – The noisy training data can be further cleaned
    • E.g., exclude common English words

Customization (cont.)
• Exploit more features such as global context
  – Occurrences of the same word/phrase should be tagged all positive or all negative
• Differentiate between domain-independent and domain-specific features
  – E.g., the prefix "Am" is domain-specific for Apis mellifera
  – Features can be weighted based on their contribution across domains

Maximum Entropy Model for Gene Tagging
• Given an observation (a token or a noun phrase) together with its context, denoted x
• Predict y ∈ {gene, non-gene}
• Maximum entropy model: P(y|x) = K · exp(Σᵢ λᵢ fᵢ(x, y)), where K normalizes over y
• Typical f:
  – y = gene & candidate phrase starts with a capital letter
  – y = gene & candidate phrase contains digits
• Estimate λᵢ from training data

Plan: Customization with Feature Adaptation
• λᵢ: trained on a large data set in domain A (e.g., human or fly)
• μᵢ: trained on a small data set in domain B (e.g., bee)
• λᵢ′ = αᵢ·λᵢ + (1 − αᵢ)·μᵢ: used for domain B
• αᵢ: based on how useful fᵢ is across different domains
  – Large αᵢ if fᵢ is domain-independent
  – Small αᵢ if fᵢ is domain-specific

Issues to Discuss
• Definition of gene names:
  – Gene families? (e.g., cb1 gene family)
  – Entities with a gene name? (e.g., Ks-1 transcripts)
• Difference between genes and proteins?
  – E.g., "CREB (cAMP response element binding protein)" vs. "AmCREB"?
• How to evaluate performance on honeybee data?

The End
• Questions?
• Thank You!
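Appendix: the maximum-entropy scoring and the feature-adaptation interpolation from the last slides can be sketched in a few lines of Python. This is a minimal illustration only, not the BeeSpace implementation: the two feature functions, every weight value, and the α settings below are hypothetical.

```python
# Illustrative sketch of a binary maximum-entropy classifier over
# candidate phrases, plus per-feature weight interpolation between a
# source domain (e.g. fly) and a target domain (e.g. bee).
import math

# Binary indicator features f_i(x, y): fire only for the "gene" label,
# mirroring the "typical f" examples on the maxent slide.
def f_capitalized(x, y):
    return 1.0 if y == "gene" and x[:1].isupper() else 0.0

def f_has_digit(x, y):
    return 1.0 if y == "gene" and any(c.isdigit() for c in x) else 0.0

FEATURES = [f_capitalized, f_has_digit]

def p_gene(x, weights):
    """P(y = gene | x) = K * exp(sum_i w_i * f_i(x, gene)),
    with K normalizing over y in {gene, non-gene}."""
    scores = {}
    for y in ("gene", "non-gene"):
        s = sum(w * f(x, y) for w, f in zip(weights, FEATURES))
        scores[y] = math.exp(s)
    return scores["gene"] / sum(scores.values())

def adapt(lambdas, mus, alphas):
    """Feature adaptation: lambda'_i = alpha_i*lambda_i + (1-alpha_i)*mu_i."""
    return [a * l + (1 - a) * m for a, l, m in zip(alphas, lambdas, mus)]

fly_weights = [1.2, 0.8]   # hypothetical, trained on large fly set
bee_weights = [0.4, 1.5]   # hypothetical, trained on small bee set
alphas = [0.9, 0.3]        # large alpha = domain-independent feature

adapted = adapt(fly_weights, bee_weights, alphas)
print(p_gene("AmCREB1", adapted))   # high: capitalized and has a digit
print(p_gene("abc", adapted))       # 0.5: no gene feature fires
```

Interpolating per feature rather than per model lets a domain-independent feature (large αᵢ) keep its well-estimated fly weight while a domain-specific feature falls back to the bee-trained weight, which is the point of the feature-adaptation plan.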