Entity Recognition: Current Status and Summer Plan
Jing Jiang
May 12, 2006

Update since last meeting
• Met with Nyla (the biologist) to talk about training/evaluation data
  – Most annotated genes in the BioCreative data set are reasonable
  – Will manually annotate a sample set of bee literature for evaluation and tuning purposes
• Tagged some other collections (fly-bcb, songbird, Wnt pathway)
• Identified some common errors and came up with some heuristics to fix them

Current performance
• On BIOSIS honey bee: waiting to hear from Nyla for judgment on the honey bee sample
• On Wnt pathway full-text articles (a sample of 100 sentences, judged by Xin)
  – Precision: 92% (207 / 224)
  – Recall: 84% (207 / 245)
• Examples:
  – fly, songbird, Wnt pathway

Common errors and heuristics
• Same word/phrase tagged differently within the same article
  – Because of the different contexts
  – Heuristic: force the tagging to be consistent
• Long form and its abbreviation tagged differently
  – E.g.: …a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and…
  – Heuristic: force the tagging to be consistent
• Easily detectable false positives
  – E.g.: Roughly half of Drosophila genes currently…
  – Heuristic: compile a list (of species names, chemical names, etc.) and some heuristic rules

Common errors and heuristics (cont.)
• Conjunctive words/phrases tagged differently
  – E.g.: …three cbl genes (c-cbl, cblb, and cblc) which…
  – Heuristic: use some rules to capture such conjunctive words, and tag them consistently
• Tokenization errors
  – E.g.: There is no difference in AmTRP-expressing cells among worker, …
  – Heuristic: compile a list of typical suffixes (such as “expressing”, “-dependent”, etc.) that should be separated from their prefixes

Common errors and heuristics (cont.)
• Mistakes caused by citations
  – Only in certain texts (the Wnt pathway collection has this problem; the BIOSIS collections don’t)
  – E.g.: Among the downstream targets of PI 3-kinase are phospholipase C (6-9), protein kinase C (10, 11), Rac (12-14), and…
  – Heuristic: remove these citations (?)
• Controversial cases: domain, subunit, etc.
  – E.g.: Alternating proline / alanine sequence of beta B1 subunit originates…
  – The BioCreative data set tags these as part of gene names

Summer plan
• Evaluate the performance on honey bee data based on Nyla’s judgments
• Implement and tune the heuristics to capture the common errors, and evaluate their effectiveness
  – Some heuristics may cause new errors
  – Tune on the annotated sample honey bee data
• Based on the needs of BeeSpace, find a good balance between precision and recall
• Work with Todd on the input/output format of the entity recognizer

Discussion
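
For discussion, three of the heuristics listed under "Common errors and heuristics" could be sketched roughly as follows. This is only an illustration under stated assumptions, not the recognizer's actual code: the function names, the two-entry suffix list, the citation pattern, and the majority-vote rule (ties resolved in favor of the gene tag) are all hypothetical choices made for this sketch. The slides say only "force the tagging to be consistent" and do not specify how conflicts should be resolved.

```python
import re
from collections import Counter

# Hypothetical pattern for numeric citations such as "(6-9)" or "(10, 11)".
# Assumption: citations are parenthesized numbers joined by hyphens or commas.
CITATION = re.compile(r"\s*\(\d+(?:\s*[-,]\s*\d+)*\)")

# Hypothetical suffix list; the slides name only these two examples.
SUFFIXES = ("-expressing", "-dependent")


def strip_numeric_citations(text):
    """Remove parenthesized numeric citations before tagging."""
    return CITATION.sub("", text)


def split_suffix(token):
    """Split a known suffix off a token, e.g. 'AmTRP-expressing'."""
    lowered = token.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(token) > len(suffix):
            cut = len(token) - len(suffix)
            return [token[:cut], token[cut:]]
    return [token]


def enforce_consistency(tokens, tags):
    """Give every occurrence of a token within one article the same tag.

    tags[i] is True when tokens[i] was tagged as (part of) a gene name.
    Assumption: majority vote per token, ties resolved toward True.
    """
    votes = {}
    for token, tag in zip(tokens, tags):
        votes.setdefault(token, Counter())[tag] += 1
    majority = {t: c[True] >= c[False] for t, c in votes.items()}
    return [majority[t] for t in tokens]
```

In this sketch, citation stripping and suffix splitting would run before tagging, and the consistency pass would run on the tagger's output for a single article. As noted in the summer plan, each heuristic may cause new errors: the citation pattern, for instance, would also delete a legitimate parenthesized number, so each rule would need to be evaluated on the annotated sample before being adopted.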