Customizing Gene Taggers for BeeSpace
Jing Jiang ([email protected])
March 9, 2005

Entity Recognition in BeeSpace
• Types of entities we are interested in:
  – Genes
  – Sequences
  – Proteins
  – Organisms
  – Behaviors
  – …
• Currently, we focus on genes
Mar 9, 05 BeeSpace 2

Input and Output
• Input: free text (with simple XML tags)
  – <?xml version="1.0" encoding="UTF-8"?><Document id="1">…We have cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and examined its responses to JH. …</Document>
• Output: tagged text (XML format)
  – <?xml version="1.0" encoding="UTF-8"?><Document id="1">…<Sent><NP>We</NP> have <VP>cloned</VP> and <VP>sequenced</VP> <NP>a cDNA encoding <Gene>Apis mellifera ultraspiracle</Gene></NP> (<Gene>AMUSP</Gene>) and <VP>examined</VP> <NP>its responses to JH</NP>.</Sent>…</Document>

Challenges
• No complete gene dictionary
• Many variations:
  – Acronyms: hyperpolarization-activated ion channel (Amih)
  – Synonyms: octopamine receptor (oa1, oar, amoa1)
  – Common English words: at (arctops), by (3R-B)
• Different genes, or a gene and its protein, may share the same name/symbol

Automatic Gene Recognition: Characteristics of Gene Names
• Capitalization (especially acronyms)
• Numbers (gene families)
• Punctuation: -, /, :, etc.
• Context:
  – Local: surrounding words such as "gene", "encoding", "regulation", "expressed", etc.
  – Global: the same noun phrase occurs several times in the same article

Existing Tools
• KeX (Fukuda)
  – Based on hand-crafted rules
  – Recognizes proteins and other entities
  – Requires human effort; not easy to modify
• ABNER & YAGI (Settles)
  – Based on conditional random fields (CRFs) that learn the "rules"
  – ABNER identifies and classifies different entities, including proteins, DNAs, RNAs, and cells
  – YAGI recognizes genes and gene products
  – No training

Existing Tools (cont.)
• LingPipe (Alias-i, Inc.)
  – Uses a generative statistical model based on word trigrams and tag bigrams
  – Can be trained
  – Has two trained models
• Others
  – NLProt (SVM)
  – AbGene (rule-based)
  – GeneTaggerCRF (CRFs)

Comparison of Existing Tools
• Performance on a few manually annotated, public data sets (protein names):
  – GENIA (2,000 abstracts on "human & blood cell & transcription factor")
  – Yapex (99 abstracts on "protein binding & interaction & molecular")
  – UTexas (750 abstracts on "human")
• Performance on a honeybee sample data set:
  – Biosis search "apis mellifera gene"

Comparison of Existing Tools (cont.)

  Tool      Data set   P        R        F1
  KeX       GENIA      0.3644   0.4191   0.3898
  KeX       Yapex      0.3451   0.3931   0.3675
  KeX       UTexas     0.1775   0.3445   0.2343
  ABNER     GENIA      0.7876   0.7485   0.7675
  ABNER     Yapex      0.4351   0.4441   0.4396
  ABNER     UTexas     0.3916   0.4314   0.4105
  LingPipe  GENIA      0.9298   0.7388   0.8234
  LingPipe  Yapex      0.4168   0.4619   0.4382
  LingPipe  UTexas     0.3633   0.3918   0.3770

Comparison of Existing Tools (cont.)
• KeX on honeybee data
  – False positives: company names, country names, etc.
  – Does not differentiate between genes, proteins, and other chemicals
• YAGI on honeybee data
  – False negatives: not all occurrences of the same gene name are tagged
  – Problems with entity types and boundary detection
• LingPipe on honeybee data
  – Similar to YAGI

Lessons Learned
• Machine learning methods outperform hand-crafted rule-based systems
• Machine learning methods have an over-fitting problem
• Existing tools need to be customized for BeeSpace
  – LingPipe is a good choice
• There is still room for better feature selection
  – E.g., global context

Customization
• Train LingPipe on a better training data set
  – Use fly (Drosophila) genes
  – F1 increased from 0.2207 to 0.7226 on held-out fly data
  – Tested on honeybee data: results
    • Some gene names are learned (Record 13)
    • Some false positives are removed (proteins, RNAs)
    • Some false positives are introduced
  – The noisy training data can be further cleaned
    • E.g., exclude common English words

Customization (cont.)
• Exploit more features such as global context
  – Occurrences of the same word/phrase should be tagged all positive or all negative
• Differentiate between domain-independent and domain-specific features
  – E.g., the prefix "Am" is domain-specific for Apis mellifera
  – Features can be weighted based on their contribution across domains

Maximum Entropy Model for Gene Tagging
• Given an observation (a token or a noun phrase) together with its context, denoted x
• Predict y ∈ {gene, non-gene}
• Maximum entropy model: P(y|x) = K · exp(Σᵢ λᵢ fᵢ(x, y)), where K normalizes over y
• Typical f:
  – y = gene & candidate phrase starts with a capital letter
  – y = gene & candidate phrase contains digits
• Estimate λᵢ from training data

Plan: Customization with Feature Adaptation
• λᵢ: trained on a large data set in domain A (e.g., human or fly)
• μᵢ: trained on a small data set in domain B (e.g., bee)
• λᵢ′ = αᵢ·λᵢ + (1 − αᵢ)·μᵢ: used for domain B
• αᵢ: based on how useful fᵢ is across different domains
  – Large αᵢ if fᵢ is domain-independent
  – Small αᵢ if fᵢ is domain-specific

Issues to Discuss
• Definition of gene names:
  – Gene families? (e.g., cb1 gene family)
  – Entities with a gene name? (e.g., Ks-1 transcripts)
• Difference between genes and proteins?
  – E.g., "CREB (cAMP response element binding protein)" vs. "AmCREB"?
• How to evaluate performance on honeybee data?

The End
• Questions?
• Thank You!
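Appendix: the maximum-entropy scoring and the feature-adaptation interpolation from the last slides can be sketched in a few lines of Python. This is a minimal illustration only, not the BeeSpace implementation: the two feature functions, every weight value, and the α settings below are hypothetical.

```python
# Illustrative sketch of a binary maximum-entropy classifier over
# candidate phrases, plus per-feature weight interpolation between a
# source domain (e.g. fly) and a target domain (e.g. bee).
import math

# Binary indicator features f_i(x, y): fire only for the "gene" label,
# mirroring the "typical f" examples on the maxent slide.
def f_capitalized(x, y):
    return 1.0 if y == "gene" and x[:1].isupper() else 0.0

def f_has_digit(x, y):
    return 1.0 if y == "gene" and any(c.isdigit() for c in x) else 0.0

FEATURES = [f_capitalized, f_has_digit]

def p_gene(x, weights):
    """P(y = gene | x) = K * exp(sum_i w_i * f_i(x, gene)),
    with K normalizing over y in {gene, non-gene}."""
    scores = {}
    for y in ("gene", "non-gene"):
        s = sum(w * f(x, y) for w, f in zip(weights, FEATURES))
        scores[y] = math.exp(s)
    return scores["gene"] / sum(scores.values())

def adapt(lambdas, mus, alphas):
    """Feature adaptation: lambda'_i = alpha_i*lambda_i + (1-alpha_i)*mu_i."""
    return [a * l + (1 - a) * m for a, l, m in zip(alphas, lambdas, mus)]

fly_weights = [1.2, 0.8]   # hypothetical, trained on large fly set
bee_weights = [0.4, 1.5]   # hypothetical, trained on small bee set
alphas = [0.9, 0.3]        # large alpha = domain-independent feature

adapted = adapt(fly_weights, bee_weights, alphas)
print(p_gene("AmCREB1", adapted))   # high: capitalized and has a digit
print(p_gene("abc", adapted))       # 0.5: no gene feature fires
```

Interpolating per feature rather than per model lets a domain-independent feature (large αᵢ) keep its well-estimated fly weight while a domain-specific feature falls back to the bee-trained weight, which is the point of the feature-adaptation plan.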