Research Opportunities in Biomedical Text Mining
Kevin Bretonnel Cohen
Biomedical Text Mining Group Lead
[email protected]
http://compbio.ucdenver.edu/Hunter_lab/Cohen

More projects than people
• Ongoing:
  – Coreference resolution
  – Software engineering perspectives on natural language processing
  – Odd problems of full text
  – Tuberculosis and translational medicine
  – Discourse analysis annotation
  – OpenDMAP
• In need of fresh blood:
  – Metagenomics/microbiome studies
  – Temporality in clinical documents
  – Translational medicine from the clinical side
  – Summarization
  – Negation
  – Question-answering: Why?
  – Nominalizations
  – Metamorphic testing for natural language processing

Tuberculosis and translational medicine
• Pathogen/host interactions
  – Pathogens
  – Hosts
  – Genes
  – ?
• Information retrieval alone is difficult if strain must be considered
• Current status: evaluating dictionary-based approaches to gene name recognition; 0.69 and climbing

Tuberculosis and translational medicine
• How different are the eukaryotic and prokaryotic (literature) domains, really?
  – Kullback-Leibler divergence to measure the difference
  – Log likelihood to see what makes them different
• Effector protein prediction
  – Retrieve lists of effector proteins and other proteins from one or more pathogens
  – Build language models for each: what would…

Tuberculosis and translational medicine
• Feature types (non-clever):
  – Words/stems
  – N-grams (bigrams, trigrams)
• Feature types (clever): conceptual
  – Gene Ontology terms (especially related to effector proteins?)
  – Eukaryotic domains
  – Signals
  – Hosts/biomes
  – You tell me

Translational medicine from the clinical side
• Factors affecting inclusion/exclusion from clinical trials
• Sharpening phenotypes (7% of patients in Schwarz's PIF study)
• ICD9-CM prioritization
• Gazillions of named entity recognition problems (drugs, assays, signs, symptoms, vital signs, …)

Translational medicine from the clinical side
• History: foundational
• Practice: difficult, largely because of access issues
• Technical problems related to data availability (e.g. will you have enough for machine learning?)
  – TREC EMR track: yes
  – i2b2 obesity data: probably
• Time for a renaissance
• Strategy: break in via TREC and i2b2; deadline: summer/fall

Translational medicine from the clinical side
• Inclusion/exclusion from effectiveness studies: Text REtrieval Conference (TREC) 2011/2012
  – Given:
    • 110,000 clinical notes
    • List of topics
  – Return: relevant records
• Successful methods: pay attention to document structure; model topics as questions; build more training data
• Hard to beat Lucene! (We did)

Translational medicine from the clinical side
• Patients with hearing loss
• Patients with complicated GERD who receive endoscopy
• Hospitalized patients treated for methicillin-resistant Staphylococcus aureus (MRSA) endocarditis
• Patients diagnosed with localized prostate cancer and treated with robotic surgery
• Patients with dementia
• Patients who had positron emission tomography (PET), magnetic resonance imaging (MRI), or computed tomography (CT) for staging or monitoring of cancer
• Patients with ductal carcinoma in situ (DCIS)
• Patients treated for vascular claudication surgically
• Women with osteopenia
• Patients being discharged from the hospital on hemodialysis
• Patients with chronic back pain who receive an intraspinal pain-medicine pump

Temporality in clinical documents
• Did this patient have a headache within the past ten days?
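Answering a within-N-days question like this decomposes into normalizing each time expression against the document creation date and then testing it against the window. A minimal sketch, with an illustrative pattern set and a crude month approximation (not the group's actual temporal grammar):

```python
# Minimal sketch: map a relative time expression to a within-ten-days
# judgment against the document creation date. The regex and the
# unit-to-days table are illustrative assumptions, not a real grammar.
import re
from datetime import date, timedelta

UNIT_DAYS = {"day": 1, "week": 7, "month": 30}  # crude month approximation

AGO = re.compile(r"\b(\d+)\s+(day|week|month)s?\s+ago\b", re.IGNORECASE)

def within_days(text: str, doc_date: date, window: int = 10) -> bool:
    """True if any '<n> <unit> ago' expression falls inside the window."""
    for n, unit in AGO.findall(text):
        then = doc_date - timedelta(days=int(n) * UNIT_DAYS[unit.lower()])
        if (doc_date - then).days <= window:
            return True
    return False

note = "She had a migraine 2 weeks ago, lasting a few days with headache."
print(within_days(note, date(2012, 3, 1)))  # 14 days ago -> False
```

A real system needs many more patterns (calendar dates, anaphoric expressions like "since then", vague ones like "a few days"), plus the negation and context handling the slides call for.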
"She had a migraine 2 weeks ago, lasting a few days with headache, dizziness, visual changes and vomiting."
• Subtasks:
  – Recognize that she had a headache
  – Figure out the temporal relation to the document creation date
  – Average accuracy: 0.76-0.78

Temporality in clinical documents
All varieties of time expressions in a single clinic visit note:
• Anchored, conventional time: July 2009; in August; in September
• Anchored, logical time: on the day of the visit; present
• Unanchored, conventional time: a few days; in three months; 2 weeks ago
• Unanchored, logical time: since then

Temporality in clinical documents
All varieties of time expressions relevant to current approaches to NLP in a single clinic visit note:
• TIMEX3: July 2009; in August; in September; a few days; in 3 months; on the day of the visit; present; since then; 2 weeks ago
• Events: seen; migraine; headache; dizziness; visual changes; vomiting; period (menses); soreness; activity; malar rash
• TLINKs: soreness BEFORE office visit; malar rash OVERLAP office visit; return AFTER office visit
• ALINKs: continue CONTINUE medications; examination INITIATE BP

Temporality in clinical documents
• Approach to event recognition: use the structure of ontologies and the definitions in ontologies to recognize that an event occurred
  – (Need robust handling of negation and context)
• Regular expressions for temporal expressions
• Logic for translating from temporal expressions to within-10-days-or-not

Summarization
• Task: given one or more documents, produce a shorter version that preserves information
• Difficulties (multi-document): duplication, aggregation, presentation
• Holy grail: abstraction

Extraction vs. abstraction
• Extraction:
  – "Extract" strings from the input text
  – Assemble them to produce the summary
  – "An extract is a summary consisting entirely of material copied from the input." (Mani)
• Abstraction:
  – Find meaning
  – Produce text that communicates the meaning
  – "An abstract is a summary at least some of whose material is not present in the input." (Mani)

Extract, or abstract?
Abstract

Relationship between summarization and generation
• Natural language generation: producing textual output
• Coherence (good)
  – Redundancy (bad)
  – Unresolvable anaphora (bad)
  – Gaps in reasoning (bad)
  – Lack of organization (bad)

Summarization and generation
• GENE: BRCA1
• SPECIES: Hs.
• DISEASE_ASSOC.: breast cancer
• "BRCA1 is found in humans. BRCA1 plays a role in breast cancer."
• "BRCA1 is found in humans. It plays a role in breast cancer."

A multi-document summary
• Caenorhabditis elegans p53: role in apoptosis, meiosis, and stress resistance.
• Bcl-2 and p53: role in dopamine-induced apoptosis and differentiation.
• p53: role in DNA repair and tumorigenesis.

Another multi-document summary
• p53: role in apoptosis, meiosis, and stress resistance, dopamine-induced apoptosis and differentiation, DNA repair and tumorigenesis.

Another multi-document summary
• p53 has a role in apoptosis, meiosis, and stress resistance. It also has a role in dopamine-induced apoptosis and differentiation, DNA repair, and tumorigenesis.

Summarization and generation
• Examples of non-coherent summaries that wouldn't be bad…
  – A table
  – A table of contents?
  – An index?
  – A diagram?

Unique problem in summarization for tuberculosis and host/pathogen interactions
• How do you build a single summary that covers data about two different species?
• Start with relations: bridge sentences, even if extractive
• Ordering: temporal? We know the course of the disease…

Negation
• Classic problem
• Reasonably well studied in the clinical domain (NegEx), but heavily restricted by semantic class
• Biological domain: 0.20-0.43 F-measure
• Pattern-learning for OpenDMAP, machine learning, semantic role labelling…

Semantic role labelling
• Arg1: experiencer; Arg2: origin; Arg3: distance; Arg4: destination
• Figure adapted from Haghighi et al. (2005)

Question-answering: Why?
• Why did David Koresh ask for a typewriter?
• Why did I have a Clif bar for breakfast?
  – versus: Why did I have a Clif bar for breakfast instead of cereal?
• Need for data set collection
• Need novel methods: pattern-matching doesn't work well

Question-answering: Why?
• Overall performance is poor
  – 0.00 MRR, versus 0.69 on birth-year questions (Ravichandran and Hovy 2002)
  – 0.33 MRR, versus 0.75 on location questions (Ravichandran and Hovy 2002)
  – 45% at least partially correct (Higashinaka and Isozaki 2007)
  – 0.35 mean reciprocal rank (Verberne et al. 2010)
• Pattern-based approaches outperformed

Question-answering: Why?
• "…why-questions are one of the most complex types. This is mainly because the answers to why-questions are not named entities (which are in general clearly identifiable), but text passages giving a (possibly implicit) explanation" (Maybury 2002, in Verberne 2007)
• Answers to why-questions cannot be stated in a single phrase; they are passages of text that contain some form of…

Question-answering: Why?
• How can we improve on machine learning methods?
  – Don't try: improve pattern learning instead
  – Apply what we're learning about inference and knowledge representation from Hanalyzer-related work
  – Improved recognition of semantic classes in text (more on this later)

Nominalization
• Nominalization: a noun derived from a verb
  – Verbal nominalization: activation, inhibition, induction
  – Argument nominalization: activator, inhibitor, inducer, mutant

Nominalizations are dominant in biomedical texts
Predicate       Nominalization   All verb forms
Express         2,909            1,233
Develop         1,408            597
Analyze         1,565            364
Observe         185              809
Differentiate   737              166
Describe        10               621
Compare         185              668
Lose            556              74
Perform         86               599
Form            533              511
Data from the CRAFT corpus

Relevant points for text mining
• Nominalizations are an obvious route for scaling up recall
• Nominalizations are more difficult to handle than verbs…
• …but can yield higher precision (Cohen et al. 2008)

Alternations of nominalizations: positions of arguments
• Any combination of the set of positions for each argument of a nominalization
  – Pre-nominal: phenobarbital induction, trkA expression
  – Post-nominal: increases of oxygen
  – No argument present: Induction followed a slower kinetic…
  – Noun-phrase-external: this enzyme can undergo activation

Result 1: attested alternations are extraordinarily diverse
• Inhibition, a 3-argument predicate (arguments 0 and 1 only shown)

Implications for system-building
• The distinction between absent and noun-phrase-external arguments is crucial and difficult; finite state approaches will not suffice; merging data from different clauses and sentences may be useful
• Pre-nominal arguments are undergoers by a ratio of 2.5:1
• For predicates with an agent and a patient, post/post and pre/post patterns predominate, but others are common as well

What can be done?
• External arguments:
  – Semantic role labelling approach
    • …but it is very important to recognize the absent/external distinction, especially with machine learning
  – Pattern-based approach
    • …but approaches to external arguments (RLIMS-P) are so far very predicate-specific

What can be done?
• Pre-nominal arguments:
  – Apply a heuristic that we have identified based on distributional characteristics
  – For the most frequent nominalizations, manual encoding may be tractable

Metagenomics/microbiome studies
• Experiments are not interpretable/comparable without large amounts of metadata
• Metadata lives in various places:
  – GenBank isolation_source field (fielded)
  – GOLD description fields (fielded)
  – Journal articles (full text)

Metagenomics/microbiome studies
• Various standards:
  – MIMS
  – MIMARKS (Nat Biotech, forthcoming)
• Ontology terms
• Continuous variables
• ??
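One way to operationalize such checklists is to flag sample records that lack the contextual metadata the standards call for. A minimal sketch, with field names as illustrative MIMS/MIMARKS-style stand-ins rather than the exact checklist descriptors:

```python
# Minimal sketch: flag metagenomic sample records missing contextual
# metadata. The REQUIRED field names are illustrative stand-ins for
# MIMS/MIMARKS-style descriptors, not the official checklist.
REQUIRED = {"investigation_type", "lat_lon", "collection_date",
            "env_biome", "env_material", "seq_method"}

def missing_metadata(record: dict) -> set:
    """Return the required descriptors that are absent or empty."""
    return {f for f in REQUIRED if not record.get(f)}

sample = {"investigation_type": "metagenome",
          "lat_lon": "39.7 N 105.0 W",
          "collection_date": "2010-06-15"}
print(sorted(missing_metadata(sample)))
# ['env_biome', 'env_material', 'seq_method']
```

The text-mining problem, of course, is filling such records from free text (isolation_source strings, GOLD descriptions, full-text articles) rather than validating hand-entered ones.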
Metagenomics/microbiome studies
• "Metagenomic sequence data that lack an environmental context have no value."
  – Crucial to replication, analysis
• Do microbial gene richness and evenness patterns (at some specific sampling density) correlate with other environmental characteristics?
• Which microbial phylotypes or functional guilds co-occur with high statistical probability in different environments?
• Do specific phylotypes track particular geographic or physico-chemical clines (latitudes, isotherms, isopycnals, etc.)?
• Do specific microbial community ORFs (functionally identified or not) track specific bioenergetic gradients (solar, geothermal, digestive tracts, etc.)?
• What is the percentage of genes with a given role, as a function of some physical feature, e.g. the average temperature of the sample sites?
• Do microbial community protein families, amino acid content, or sequence motifs vary systematically as a function of habitat of origin?
• Are specific protein sequence motifs characteristic of specific habitats?
• What is the "resistome" in soil?
• (Phenotype) Habitat change over time, host-to-host variation, within-host variation: biodefense and forensics applications

Metagenomics/microbiome studies
• Investigation type: eukaryote, bacteria, virus, plasmid, organelle, metagenome
• Experimental factor: Experimental Factor Ontology, Ontology for Biomedical Investigations
• Latitude, longitude, depth, elevation, humidity, CO2, CO, salinity, temperature, …
• Geographic location (country, region, sea): Gaz Ontology
• Collection date/time
• Environment, biome and features, material: Environment Ontology
• Trophic level; aerobe/anaerobe
• Sample collection device or method
• Sample material processing: Ontology for Biomedical Investigations
• Amount or size of sample
• Targeted gene or locus name
• PCR primer, conditions
• Sequencing method
• Chemicals administered: ChEBI
• Diseases: Disease Ontology
• Body site
• Phenotype: PATO

Metagenomics/microbiome studies
• Where do you find this stuff?
  – Text fields in databases
    • isolation_source in GenBank
    • Description in GOLD
    • TBD in microbiome studies, but hopefully coming
  – Full text of journal articles
    • Marine secondary products corpus coming (pharmacogenomics connection)
    • Problem of tables
    • Multiple sentences, coreference

Metamorphic testing for NLP
• Metamorphic testing motivation: situations where the input/output space is intractably large and it is not clear what would constitute right answers
• Use domain knowledge to specify broad categories of changes to output that should occur with broad categories of changes to input

Metamorphic testing for NLP
• Gene regulatory networks:
  – Add an unconnected node: G should be subsumed by G'
• SeqMap:
  – Given a genome p and a set of sequence reads T = {t1, t2, ..., tn}, we form a new genome p' by deleting an arbitrary portion of either the beginning or the end of p.
  – After mapping T to both p and p' independently, all reads in T that are unmappable to p should also be unmappable to p'.

Metamorphic testing for NLP
• Non-linguistic
  – Add a non-informative feature; see if feature selection screens it out
  – Subtract informative features; see if performance goes down
• Linguistic
  – ?

Coreference defined
• Sophia Loren says she will always be grateful to Bono. The actress revealed that the U2 singer helped her calm down when she became scared by a thunderstorm while travelling on a plane.

Coreference resolution
• Sophia Loren says she will always be grateful to Bono. The actress revealed that the U2 singer helped her calm down when she became scared by a thunderstorm while travelling on a plane.

Coreference defined
• Sophia Loren says she will always be grateful to Bono. The actress revealed that the U2 singer helped her calm down when she became scared by a thunderstorm while travelling on a plane.
• Sophia Loren, she, the actress, her, she

Coreference defined
• Sophia Loren says she will always be grateful to Bono. The actress revealed that the U2 singer helped her calm down when she became scared by a thunderstorm while travelling on a plane.
• Bono, the U2 singer

How do humans do this?
• Linguistic factors:
  – Kevin saw Larry. He liked him.
• Knowledge about the world:
  – Sophia Loren will always be grateful to Bono. The actress…
  – Sophia Loren will always be grateful to Bono. The singer…
  – Sophia Loren will always be grateful to Bono. The storm…
• A combination of world knowledge and linguistic factors:
  – Sophia Loren says she will always be grateful to Bono…
  – Sophia Loren says he will always be grateful to Bono…

Computers are bad at this
• Linguistic features don't always help.
  – Each child ate a biscuit. They were delicious.
  – Each child ate a biscuit. They were delighted.
• Programming enough knowledge about the world into a computer has proven to be very difficult.
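The earlier non-linguistic metamorphic check, that adding a non-informative feature should leave feature selection unchanged, can be sketched concretely. The correlation-style selector and data below are toy stand-ins, not any particular system under test:

```python
# Metamorphic test sketch: append a non-informative (constant) feature
# to every instance and check that a simple feature selector still
# screens it out. Selector and data are illustrative toys.

def feature_scores(X, y):
    """Absolute difference of each feature's mean between the two classes."""
    scores = []
    for j in range(len(X[0])):
        pos = [x[j] for x, lab in zip(X, y) if lab == 1]
        neg = [x[j] for x, lab in zip(X, y) if lab == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return scores

def select_top(X, y, k=2):
    scores = feature_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

X = [[1.0, 0.2, 5.0], [0.9, 0.8, 4.8], [0.1, 0.3, 1.0], [0.2, 0.7, 1.2]]
y = [1, 1, 0, 0]
before = select_top(X, y)
X_aug = [row + [3.14] for row in X]        # metamorphic change: useless feature
after = select_top(X_aug, y)
assert before == after and 3 not in after  # selection unchanged; feature screened out
```

No gold-standard answers are needed: the test only checks that the relation between the two runs holds, which is exactly the point of the metamorphic approach.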
Our approach
• Matching semantic categories helps
  – BRCA1, the gene
  – cell proliferation, leukocyte proliferation
• Minimal prior work on using ontologies
  – WordNet (general English, mostly)
  – Replacing the ontology with web search
• We're going to use ontologies, and more than anyone
• First step: broad semantic class assignment

Our approach
• Broad semantic class assignment
  – Coreference resolution benefits from knowing whether semantic classes match
  – Semantic class ≈ what ontology you should belong to
  – Looking at headwords, frequent words, informativeness measures

Why assign broad semantic classes?
• Coreference resolution
• Information extraction
• Document classification

To be clear about what I mean by "broad semantic class": if you were going to be part of an ontology, which ontology would you be part of?

Target semantic classes
• Chosen for relevance to mouse genomics:
  – Gene Ontology
  – Sequence Ontology
  – Foundational Model of Anatomy
  – NCBI Taxonomy
  – Chemical Entities of Biological Interest
  – Phenotypic Quality
  – BRENDA Tissue/Enzyme Source
  – Cell Type Ontology
  – Gene Regulation Ontology
  – Homology Ontology
  – Human Disease Ontology
  – Mammalian Phenotype Ontology
  – Molecule Role Ontology
  – Mouse Adult Gross Anatomy Ontology
  – Mouse Pathology Ontology
  – Protein Modification Ontology
  – Protein-Protein Interaction Ontology
  – Suggested Ontology for Pharmacogenomics
  – Sample Processing and Separation Techniques Ontology

Method for class assignment
• Exact match
• "Stripping"
• Head noun
• Stemmed head noun

"Stripping"
• Delete all non-alphanumeric characters
• Cadmium-binding, cadmium binding → cadmiumbinding

Head nouns: two simple heuristics
• Rightmost word
• "X of…", where of represents any preposition
  – Positive regulation of growth → positive regulation → regulation

Evaluation
• Annotated corpus
• Ontology-against-itself
• Structured test suite

Two potential baselines
• NCBO Annotator
  – Exact match and substring only: not strong enough
• MetaMap
  – Future work

CRAFT corpus
• Colorado Richly Annotated Full Text
• 97 full-text journal articles
• 597,000 words
• Evidence for MGI Gene Ontology annotations
• 119,783 annotations across five ontologies:
  – Gene Ontology
  – Sequence Ontology
  – Cell Type Ontology
  – NCBI Taxonomy
  – ChEBI

Ontology against itself
• Use the terms from the ontologies themselves. Seems obvious, but…
• …every term should return its own ontology as its semantic class.
• Used the head noun technique only…
• …since exact match and stripping are guaranteed to give the right answer.

Structured test suite
• 300 canonical and non-canonical forms
• Categorized according to features of terms and features of changes to terms

Structured test suite
• Non-canonical forms:
  – Ordering and other syntactic variants
  – Inserted text
  – Coordination
  – Singular/plural variants
  – Verbal versus nominal
  – Adjectival versus nominal
  – Unofficial synonyms
• Features of terms:
  – Length
  – Punctuation
  – Presence of stopwords
  – Ungrammatical
  – Presence of numerals
  – Official synonyms
  – Ambiguous terms

Structured test suite
• Syntax
  – induction of apoptosis → apoptosis induction
• Part of speech
  – cell migration → cell migrated
• Inserted text
  – ensheathment of neurons → ensheathment of some neurons

Results on the CRAFT corpus when only CRAFT ontologies are used as input
Ontology                 Annotations   Precision   Recall   F-measure
Gene Ontology            39,626        66.31       73.06    69.52
Sequence Ontology        40,692        63.00       72.21    67.29
Cell Type Ontology       8,383         53.58       87.27    66.40
NCBI Taxonomy            11,775        96.24       92.51    94.34
ChEBI                    19,307        70.07       90.53    79.00
Total (microaveraged)    119,783       67.06       78.49    72.32
Total (macroaveraged)                  69.84       83.12    75.31

Accuracy on the CRAFT corpus when all 20 ontologies are used
Ontology             Exact   Stripped   Head noun   Stemmed head
Gene Ontology        24.26   24.68      59.18       77.12
Sequence Ontology    44.28   47.63      56.63       73.33
Cell Type Ontology   25.26   25.80      70.09       88.38
NCBI Taxonomy        84.67   84.71      90.97       95.73
ChEBI                86.93   87.44      92.43       95.49

Results on ontology-against-itself
• 97-100% for 18/20 ontologies (no surprise), but…
• …found much lower performance on two ontologies (Sequence Ontology and Molecule Role Ontology) due to preprocessing errors and omissions, indicating that…
• …this evaluation method is robust!

Results on structured test suite
• Nota bene: this analysis is at the level of individual terms, but don't lose track of the fact that we're trying to recognize broad semantic classes, not individual terms

Results on structured test suite
• The headword technique works very well in the presence of syntactic variation
  – induction of apoptosis / apoptosis induction
• The headword technique works in the presence of inserted text
  – ensheathment of neurons / ensheathment of some neurons

Results on structured test suite
• Headword stemming allows catching verb phrases
  – cell migration / cells migrate
• Headword stemming fails when the verb/noun relationship is irregular
  – X growth / grows
• Stemming is always necessary for recognizing plurals, regardless of term length
• The Porter stemmer fails on irregular plurals

Results on structured test suite
• The approach handles "ungrammatical" terms like "transposition, DNA-mediated"
  – Important because exact match will always fail on these

Software engineering perspectives on natural language processing

Two paradigms of evaluation
• Traditional approach: use a corpus
  – Expensive
  – Time-consuming to produce
  – Redundancy for some things…
  – …underrepresentation of others (Oepen et al. 1998)
  – Slow run-time (Cohen et al. 2008)
• Non-traditional approach: structured test suite
  – Controls redundancy
  – Ensures representation of all phenomena
  – Easy to evaluate results and do error analysis
  – Used successfully in grammar engineering

Structured test suite
Canonical:
• GO:0000133 Polarisome
• GO:0000108 Repairosome
• GO:0000786 Nucleosome
• GO:0001660 Fever
• GO:0001726 Ruffle
• GO:0005623 Cell
• GO:0005694 Chromosome
• GO:0005814 Centriole
• GO:0005874 Microtubule
Non-canonical:
• GO:0000133 Polarisomes
• GO:0000108 Repairosomes
• GO:0000786 Nucleosomes
• GO:0001660 Fevers
• GO:0001726 Ruffles
• GO:0005623 Cells
• GO:0005694 Chromosomes
• GO:0005814 Centrioles
• GO:0005874 Microtubules

Structured test suite
• Features of terms:
  – Length
  – Punctuation
  – Presence of stopwords
  – Ungrammatical terms
  – Presence of numerals
  – Official synonyms
  – Ambiguous terms
• Types of changes:
  – Singular/plural variants
  – Ordering and other syntactic variants
  – Inserted text
  – Coordination
  – Verbal versus nominal constructions
  – Adjectival versus nominal constructions
  – Unofficial synonyms

Structured test suite
• Syntax
  – induction of apoptosis → apoptosis induction
• Part of speech
  – cell migration → cell migrated
• Inserted text
  – ensheathment of neurons → ensheathment of some neurons

Results
• No non-canonical terms were recognized
• 97.9% of canonical terms were recognized
  – All exceptions contain the word in
• What would it take to recognize that error pattern with canonical terms with a corpus-based approach? Cohen (2010)

Other uses to date
• Broad characterization of successes/failures (JULIE lab)
• Parameter tuning (Hinterberg)
• Semantic class assignment (see coreference resolution)

Weird stuff that comes up with full text: parentheses (background)
• Distinguishing feature of full text (Cohen et al.)
• Confusing to patients/laypeople, useful to us (Elhadad)
• Ignorable in gene names (Cohen et al.)
• Problems for parsers (Jang et al. 2006)
• Problems with hedge scope assignment (Morante and Daelemans)
• Abbreviation definition (Schwartz and Hearst)
• Gene symbol grounding (Lu)
• "Citances" (Nakov et al.)
• 17,063 in the 97-document corpus

Use cases
• Use P-value to set weighting in networks
• Target for information extraction applications
• Coreference resolution within text
• Gene normalization
• Meta-analyses
• Table and figure mentions are often indicators of assertions with experimental validation
  – Mapping text to sub-figures
• Citations are useful for establishing rhetorical relations between papers, synonym identification, and curation data

Weird stuff that comes up with full text: parentheses
Category                      Use case
Gene symbol or abbreviation   Gene normalization, coreference resolution
Citation                      Summaries, high-value sentences, bibliometrics
Data value                    Information extraction
P-value                       Link weighting, meta-analysis
Figure/table pointer          Strong indicator of good evidence
List element                  Mapping sub-figures to text
Singular/plural               Distinguish from other categories
Part of gene name             Gene normalization
Parenthetical statement       Potentially ignorable, or IE target

Discourse annotation
• Want to be able to follow and perform abductive reasoning
• Methods under development for labelling aspects of the structure of an argument
• Currently building a large data set from the CRAFT corpus (97 articles)

More projects than people
• Ongoing:
  – Coreference resolution
  – Software engineering perspectives on natural language processing
  – Tuberculosis and translational medicine
  – Discourse analysis annotation
  – OpenDMAP
• In need of fresh blood:
  – Metagenomics/Microbiome studies
  – Temporality in clinical documents
  – Translational medicine from the clinical side
  – Summarization
  – Negation
  – Question-answering: Why?
  – Nominalizations
  – Metamorphic testing for natural language processing
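As a closing illustration, the class-assignment cascade described earlier (exact match, stripping, head noun, stemmed head noun) can be sketched end to end. The term lists and the suffix stemmer below are illustrative stand-ins; the real system used full ontologies and the Porter stemmer:

```python
# Sketch of the broad semantic class assignment cascade: exact match,
# "stripping" (drop non-alphanumerics), head noun (rightmost word, or
# the word before a preposition in "X of Y" terms), and a crude suffix
# stemmer. Toy term lists stand in for the real ontologies.
import re

ONTOLOGY_TERMS = {
    "Gene Ontology": {"positive regulation of growth", "cell migration"},
    "ChEBI": {"cadmium", "cadmium binding"},
}

def strip_term(term):
    return re.sub(r"[^a-z0-9]", "", term.lower())

def head_noun(term):
    words = term.lower().split()
    for prep in ("of", "in", "to", "by"):
        if prep in words:
            return words[words.index(prep) - 1]  # word before the preposition
    return words[-1]                             # otherwise the rightmost word

def stem(word):
    return re.sub(r"(ing|ion|s)$", "", word)     # crude Porter stand-in

def assign_class(mention):
    """Try each technique in order against each ontology's term list."""
    for onto, terms in ONTOLOGY_TERMS.items():
        if mention.lower() in terms:
            return onto, "exact"
        if strip_term(mention) in {strip_term(t) for t in terms}:
            return onto, "stripped"
        heads = {head_noun(t) for t in terms}
        if head_noun(mention) in heads:
            return onto, "head noun"
        if stem(head_noun(mention)) in {stem(h) for h in heads}:
            return onto, "stemmed head"
    return None, None

print(assign_class("regulation of growth"))  # ('Gene Ontology', 'head noun')
print(assign_class("Cadmium-binding"))       # ('ChEBI', 'stripped')
```

Run against whole ontologies, each technique trades precision for recall exactly as the accuracy table above suggests: exact match is safest, the stemmed head noun catches the most variants.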