
A Cascaded Approach to Normalising Gene Mentions in Biomedical Literature

Hui Yang, Goran Nenadic, John Keane
School of Computer Science
Manchester Interdisciplinary BioCentre
[email protected]

Identification of gene names
• Gene/protein names are essential for integrating and exploring bio-literature
  – e.g. building/browsing regulatory networks
• Two-step process
  1) recognise gene mentions in text
  2) map these to a referent database
• Gene name variability and ambiguity
  “biologists would rather share a toothbrush than a gene name”

Outline
1. Overview – why a cascaded approach
2. Dictionary for matching
3. Exact-like matching
4. Approximate matching
5. Experiments
6. Summary and conclusions

Why a cascaded approach?
• Overall aim: improve recall, but with a controlled loss of precision
• Give your best shot first
  – intuitively: try “exact” and exact-like matches first, and then try approximations
  – experimentally: find an optimal sequence
• Apply further (less reliable) steps only to (still) unmatched gene mentions

Dictionary re-engineering
• Automatically generate gene name synonyms from existing DBs (e.g. Entrez Gene, UniProt)
  – use a set of (generic, non-organism-specific) rules to generate canonical representations of synonyms

    Original synonym      Canonical form
    alphaCP-4 protein     alphaCP4_protein
    RP13-16H11.4          RP13_6H11.4
    Rev-ErbAalpha         Rev_ErbAalpha
    ST3GALVI              ST3GALVI

• Two versions: preserve original synonyms as well as normalised canonical forms

Pre-processing gene mentions
• Generate a set of canonical representations of a gene mention
  – analogous to dictionary re-engineering
  – but with some add-ons and differences
    • resolving potential acronyms: interleukin (IL)-17E → interleukin -17E, IL-17E
    • resolving gene name coordinations, e.g. ORP 3 to 6; ORP 3 and 6
  – token-based normalisation
    • Roman numbers, acronyms, Greek letters

1st stage: exact matching
• Step E1: Match original dictionary and original mentions
• Step E2: Match normalised dictionary and normalised mentions
• Step E3: Match normalised dictionary and token-based normalised mentions

2nd stage: approximate matching
• Component-based comparisons
  – relevant (specific) component classes (Digit, Greek-Letter, Roman-Number, Chemical etc. tokens)
• Step A1a: Component permutation (order) is ignored
• Step A1b: Non-relevant components missing from a synonym are ignored

2nd stage: approximate matching (continued)
• Step A2a: One non-relevant extra component in a synonym is ignored
• Step A2b: One non-relevant extra component in a synonym is ignored if all relevant components are matched

[Figure: the cascade. Dictionary side: original synonyms, normalised synonyms. Mention side: original mentions, normalised mentions, token-normalised mentions. Steps in order: Original vs. Original; Normalised vs. Normalised; Normalised vs. Token-normalised; Ignore word permutations; Ignore one missing non-relevant component; Ignore one extra non-relevant component; Ignore one extra non-relevant component if all relevant components are matched.]

Experiments

Experimental context
• BioCreative II data set
• Map human genes to Entrez Gene

                          # abstracts   # gene mentions   # matched gene identifiers
    Set-1 (training data)         281               985                          995
    Set-2 (test data)             262              1092                         1100
    Set-0 (total)                 543              2077                         2095

Results: exact-like matching

                                          TP    FP   prec   recall
    Original vs. Original               1044    24   0.98     0.50
    Normalised vs. Normalised
      Organism prefix (hSPRY1)            28     0   1.00     0.01
      Coordinations                       50    23   0.69     0.02
      Parentheses                         18     9   0.67     0.01
      Canonical forms                    148     3   0.98     0.07
      total                              244    35   0.86     0.12
    Normalised vs. Token-normalised       47     9   0.84     0.02
    total                               1335    68   0.95     0.64

Results: approximate matching

                                                      TP    FP   prec   recall
    Ignore component permutations                     23     1   0.96     0.01
    Ignore one missing non-relevant component         16     6   0.73     0.01
    Ignore one extra non-relevant component           65    38   0.63     0.03
    Ignore one extra non-relevant component
      if all relevant components are matched          13     2   0.87     0.01
    total                                            117    47   0.71     0.06

Cumulative performance
• Precision: 0.93
• Recall: 0.69
• F-measure: 0.79
• For comparison (BioCreative II test data)
  – Precision: 0.94
  – Recall: 0.72
  – F-measure: 0.81

Some conclusions
• Exact-like matching achieves 0.76 F-measure (0.96 P, 0.64 R)
• Approximate matching improves recall by only 10-15%
  – ignoring word order is effective (both recall- and precision-wise), as is ignoring one extra non-relevant component (recall)
• Some approaches are consistent across different test sets, some are not
  – e.g. precision of approximate matching: 0.63 to 0.78
  – e.g. recall of exact matching: 0.59 to 0.68

Summary
• Simple yet effective approach
  – cascaded approach with reliable matching strategies that can be switched on and off
  – some are good for precision, some for recall
  – can easily be used for other species
• More work needed on
  – gene name coordination and enumerations
  – acronyms/symbols embedded in mentions
  – species identification

Acknowledgements
• Partially funded by UK BBSRC (Project “Mining Term Associations from Literature to Support Knowledge Discovery in Biology”)
• Manchester Interdisciplinary Biocentre (Irena Spasic)
• Faculty of Life Sciences (Casey Bergman)
• National Centre for Text Mining (NaCTeM)
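
As a rough illustration of the cascade idea in the slides (most reliable steps first, later steps applied only to mentions that are still unmatched), here is a minimal Python sketch. The step labels E1, E2, A1a and A2a come from the slides, but everything else is an assumption made for illustration: the `components` tokeniser, the small `GREEK` and `ROMAN` sets, the matching functions and the gene identifiers `G1` to `G4` are hypothetical simplifications, not the authors' rules or data.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny token classes (the slides also list Chemical etc.)
GREEK = {"alpha", "beta", "gamma", "delta"}
ROMAN = {"ii", "iii", "iv", "vi"}

def components(name):
    """Split a name into lower-cased alphabetic and numeric components."""
    return re.findall(r"[a-z]+|\d+", name.lower())

def is_relevant(token):
    """Assumed notion of a 'relevant' component: digit, Greek letter or Roman number."""
    return token.isdigit() or token in GREEK or token in ROMAN

def exact(mention, synonym):                      # in the spirit of Step E1
    return mention == synonym

def normalised(mention, synonym):                 # in the spirit of Step E2
    return components(mention) == components(synonym)

def permuted(mention, synonym):                   # in the spirit of Step A1a
    return Counter(components(mention)) == Counter(components(synonym))

def one_extra_non_relevant(mention, synonym):     # in the spirit of Step A2a
    extra = Counter(components(synonym)) - Counter(components(mention))
    missing = Counter(components(mention)) - Counter(components(synonym))
    extras = list(extra.elements())
    return not missing and len(extras) == 1 and not is_relevant(extras[0])

# Ordered from most to least reliable; steps can be switched on/off by editing the list.
CASCADE = [("E1", exact), ("E2", normalised),
           ("A1a", permuted), ("A2a", one_extra_non_relevant)]

def run_cascade(mentions, dictionary, cascade=CASCADE):
    """dictionary maps synonym -> gene id. Returns (matched, unmatched)."""
    matched, unmatched = {}, list(mentions)
    for step, matches in cascade:
        still_unmatched = []
        for mention in unmatched:
            ids = {gid for syn, gid in dictionary.items() if matches(mention, syn)}
            if ids:
                matched[mention] = (ids, step)    # stop at the first step that matches
            else:
                still_unmatched.append(mention)   # passed on to the next, weaker step
        unmatched = still_unmatched
    return matched, unmatched

if __name__ == "__main__":
    # Hypothetical dictionary with placeholder gene ids
    demo_dictionary = {"IL-17E": "G1", "Rev-ErbAalpha": "G2",
                       "kinase 2 alpha": "G3", "ORP 3 protein": "G4"}
    demo_mentions = ["IL-17E", "rev erbAalpha", "alpha kinase 2", "ORP 3", "XYZ 9"]
    print(run_cascade(demo_mentions, demo_dictionary))
```

Keeping the steps in an ordered list mirrors the point in the summary that individual strategies can be switched on and off: dropping a less reliable step is a one-line change, and each mention records which step produced its match, which makes per-step precision and recall easy to tally.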
