Research Opportunities in Biomedical Text Mining
Kevin Bretonnel Cohen
Biomedical Text Mining Group Lead
http://compbio.ucdenver.edu/Hunter_lab/Cohen

More projects than people
• Ongoing:
  – Coreference resolution
  – Software engineering perspectives on natural language processing
  – Odd problems of full text
  – Tuberculosis and translational medicine
  – Discourse analysis annotation
• In need of fresh blood:
  – Metagenomics/Microbiome studies
  – Translational medicine from the clinical side
  – Summarization
  – Negation
  – Question-answering: Why?
  – Nominalizations
  – Metamorphic testing for natural language processing

Metagenomics/microbiome studies
• Experiments not interpretable/comparable without large amounts of metadata
• Metadata in various places
  – (fielded)
  – GenBank isolation_source field
  – GOLD description fields
  – Journal articles (full text)

Metagenomics/microbiome studies
• Various standards:
  – MIMS
  – MIMARKS (Nat Biotech, forthcoming)
• Ontology terms
• Continuous variables
• ??

Metagenomics/microbiome studies
• “Metagenomic sequence data that lack an environmental context have no value.”
  – Crucial to replication, analysis
• Do microbial gene richness and evenness patterns (at some specific sampling density) correlate with other environmental characteristics?
• Which microbial phylotypes or functional guilds co-occur with high statistical probability in different environments?
• Do specific phylotypes track particular geographic or physico-chemical clines (latitudes, isotherms, isopycnals, etc.)?
• Do specific microbial community ORFs (functionally identified or not) track specific bioenergetic gradients (solar, geothermal, digestive tracts, etc.)?
• What is the percentage of genes with a given role, as a function of some physical feature, e.g. the average temperature of the sample sites?
• Do microbial community protein families, amino acid content, or sequence motifs vary systemically as a function of habitat of origin?
• Are specific protein sequence motifs characteristic of specific habitats?
• What is the “resistome” in soil? (Phenotype)
• Habitat change over time, host-to-host variation, within-host variation: biodefense and forensics applications

Metagenomics/microbiome studies
• Investigation type: eukaryote, bacteria, virus, plasmid, organelle, metagenome
• Experimental factor: Experimental Factor Ontology, Ontology for Biomedical Investigations
• Latitude, longitude, depth, elevation, humidity, CO2, CO, salinity, temperature, …
• Geographic location (country, region, sea) from Gaz Ontology
• Collection date/time
• Environment, biome and features, material: Environment Ontology
• Trophic level; aerobe/anaerobe
• Sample collection device or method
• Sample material processing: Ontology for Biomedical Investigations
• Amount or size of sample
• Targeted gene or locus name
• PCR primer, conditions
• Sequencing method
• Chemicals administered: ChEBI
• Diseases: Disease Ontology
• Body site
• Phenotype: PATO

Metagenomics/microbiome studies
• Where do you find this stuff?
  – Text fields in databases
• Isolation_source in GenBank
• Description in GOLD
• TBD in microbiome studies, but hopefully coming
• Full text of journal articles
  – Marine secondary products corpus coming (pharmacogenomics connection)
  – Problem of tables
  – Multiple sentences, coreference
Timeline: July 2011

Translational medicine from the clinical side
• Factors affecting inclusion/exclusion from clinical trials
• Sharpening phenotypes (7% of patients in Schwarz’s PIF study)
• ICD9-CM prioritization
• Gazillions of named entity recognition problems (drugs, assays, signs, symptoms, vital signs, …)

Translational medicine from the clinical side
• History: foundational
• Practice: difficult (access issues)
• Technical problems related to data availability (e.g. will you have enough for machine learning?)
  – TREC EMR track: yes
  – i2b2 obesity data: probably
• Time for a renaissance
• Strategy: break in via TREC; deadline: summer/fall

Summarization
• Task: Given one or more documents, produce a shorter version that preserves information
• Difficulties (multi-document): Duplication, aggregation, presentation
• Holy grail: abstraction

Extraction vs. abstraction
• Extraction:
  – “Extract” strings from the input text
  – Assemble them to produce the summary
• Abstraction:
  – Find meaning
  – Produce text that communicates the meaning
• “An abstract is a summary at least some of whose material is not present in the input.” (Mani)
• “An extract is a summary consisting entirely of material copied from the input.” (Mani)

Extract, or abstract?
Abstract

Relationship between summarization and generation
• Natural language generation: producing textual output
• Coherence (good)
  – Redundancy (bad)
  – Unresolvable anaphora (bad)
  – Gaps in reasoning (bad)
  – Lack of organization (bad)

Summarization and generation
• GENE: BRCA1
• SPECIES:
  – Hs.
• DISEASE_ASSOC.: Breast cancer
BRCA1 is found in humans. BRCA1 plays a role in breast cancer.
BRCA1 is found in humans. It plays a role in breast cancer.

A multi-document summary
Caenorhabditis elegans p53: role in apoptosis, meiosis, and stress resistance.
Bcl-2 and p53: role in dopamine-induced apoptosis and differentiation.
P53 role in DNA repair and tumorigenesis.

Another multi-document summary
P53: role in apoptosis, meiosis, and stress resistance, dopamine-induced apoptosis and differentiation, DNA repair and tumorigenesis.

Another multi-document summary
P53 has a role in apoptosis, meiosis, and stress resistance. It also has a role in dopamine-induced apoptosis and differentiation, DNA repair, and tumorigenesis.

Summarization and generation
• Examples of non-coherent summaries that wouldn’t be bad…
  – A table
  – A table of contents?
  – An index?
  – A diagram?
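
The BRCA1 slide above contrasts two realizations of the same record: one repeats the gene name in every sentence (redundant), the other replaces the repeated subject with a pronoun. A minimal sketch of that step in Python follows. The function name `realize`, the `SPECIES_NAMES` lookup, and the sentence templates are illustrative assumptions, not part of any system described in the talk.

```python
# Hypothetical lookup from species abbreviation to a readable name.
SPECIES_NAMES = {"Hs.": "humans"}


def realize(record, pronominalize=True):
    """Turn a slot/value record into sentences.

    With pronominalize=True, every sentence after the first uses "It"
    as its subject instead of repeating the gene name.
    """
    gene = record["GENE"]
    clauses = []
    if "SPECIES" in record:
        species = SPECIES_NAMES.get(record["SPECIES"], record["SPECIES"])
        clauses.append("is found in " + species)
    if "DISEASE_ASSOC" in record:
        clauses.append("plays a role in " + record["DISEASE_ASSOC"].lower())

    sentences = []
    for position, clause in enumerate(clauses):
        subject = "It" if (pronominalize and position > 0) else gene
        sentences.append(f"{subject} {clause}.")
    return " ".join(sentences)


RECORD = {"GENE": "BRCA1", "SPECIES": "Hs.", "DISEASE_ASSOC": "Breast cancer"}

if __name__ == "__main__":
    print(realize(RECORD, pronominalize=False))
    print(realize(RECORD))
```

A fixed "It" only works here because every sentence shares one singular subject; a real generator would need to check for competing antecedents before pronominalizing, which is the "unresolvable anaphora" risk listed above.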
Timeline: no pressure

Negation
• Classic problem
• Reasonably well-studied in clinical domain (NegEx), but heavily restricted by semantic class
• Biological domain: 0.20-0.43 F-measure
• Pattern-learning for OpenDMAP, machine learning, semantic role labelling…

Semantic role labelling
[Figure: semantic role labels. Arg1: experiencer; Arg2: origin; Arg3: distance; Arg4: destination. Figure adapted from Haghighi et al. (2005)]
Timeline: no pressure

Question-answering: Why?
• Why did David Koresh ask for a typewriter?
• Why did I have a Clif bar for breakfast? versus Why did I have a Clif bar for breakfast instead of cereal?
• Need for data set collection
• Need novel methods: pattern-matching doesn’t work well

Question-answering: Why?
• Overall performance is poor
  – 0.00 MRR versus 0.69 on birthyear (Ravichandran and Hovy 2002)
  – 0.33 MRR versus 0.75 on location (Ravichandran and Hovy 2002)
  – 45% at least partially correct (Higashinaka and Isozaki 2007)
  – 0.35 mean reciprocal rank (Verberne et al. 2010)
• Pattern-based approaches outperformed

Question-answering: Why?
• …why-questions are one of the most complex types. This is mainly because the answers to why-questions are not named entities (which are in general clearly identifiable), but text passages giving a (possibly implicit) explanation (Maybury 2002 in Verberne 2007)
• Answers to why-questions cannot be stated in a single phrase but they are passages of text that contain some form of

Question-answering: Why?
• How can we improve on machine learning methods?
  – Don’t try; improve pattern learning instead
  – Apply what we’re learning about inference and knowledge representation from Hanalyzer-related work
  – Improved recognition of semantic classes in text (more on this later)

Nominalization
• Nominalization: noun derived from a verb
  – Verbal nominalization: activation, inhibition, induction
  – Argument nominalization: activator, inhibitor, inducer, mutant

Nominalizations are dominant in biomedical texts
Predicate       Nominalization   All verb forms
Express         2,909            1,233
Develop         1,408            597
Analyze         1,565            364
Observe         185              809
Differentiate   737              166
Describe        10               621
Compare         185              668
Lose            556              74
Perform         86               599
Form            533              511
Data from CRAFT corpus

Relevant points for text mining
• Nominalizations are an obvious route for scaling up recall
• Nominalizations are more difficult to handle than verbs…
• …but can yield higher precision (Cohen et al. 2008)

Alternations of nominalizations: positions of arguments
• Any combination of the set of positions for each argument of a nominalization
  – Pre-nominal: phenobarbital induction, trkA expression
  – Post-nominal: increases of oxygen
  – No argument present: Induction followed a slower kinetic…
  – Noun-phrase-external: this enzyme can undergo activation

Result 1: attested alternations are extraordinarily diverse
• Inhibition, a 3-argument predicate (Arguments 0 and 1 only shown)

Implications for system-building
• Distinction between absent and noun-phrase-external arguments is crucial and difficult, and finite state approaches will not suffice; merging data from different clauses and sentences may be useful
• Pre-nominal arguments are undergoer by ratio of 2.5:1
• For predicates with agent and patient, post/post and pre/post patterns predominate, but others are common as well

What can be done?
• External arguments:
  – semantic role labelling approach
    • …but, very important to recognize the absent/external distinction, especially with machine learning
  – pattern-based approach
    • …but, approaches to external arguments (RLIMS-P) are so far very predicate-specific

What can be done?
• Pre-nominal arguments:
  – apply heuristic that we have identified based on distributional characteristics
  – for most frequent nominalizations, manual encoding may be tractable
Timeline: no pressure

Metamorphic testing for NLP
• Metamorphic testing motivation: situations where input/output space is intractably large and it’s not clear what would constitute right answers
• Use domain knowledge to specify broad categories of changes to output that should occur with broad categories of changes to input

Metamorphic testing for NLP
• Gene regulatory networks:
  – Add an unconnected node: G should be subsumed by G’
• SeqMap:
  – Given a reference string p and a set of sequence reads T = {t1, t2, ..., tn}, and a genome p, we form a new genome p' by deleting an arbitrary portion of either the beginning or ending of p. After mapping T to both p and p' independently, all reads in T that are unmappable to p should also be

Metamorphic testing for NLP
• Non-linguistic
  – Add non-informative feature, see if feature selection screens it out
  – Subtract informative features, see if performance goes down
• Linguistic
  – ?
Timeline: no pressure

Wide range of projects over the past few years
• Named entity recognition:
  – Shuhei Kinoshita, K. Bretonnel Cohen, Philip V. Ogren, and Lawrence Hunter (2005). BioCreative Task 1A: entity identification with a stochastic tagger. BMC Bioinformatics 6(Suppl. 1):S4.
• Information extraction:
  – William A. Baumgartner, Jr., ...K. Bretonnel Cohen, and Lawrence Hunter (submitted). Leveraging concept recognition to extract protein interaction relations from biomedical text. Genome Biology.
• Summarization:
  – Zhiyong Lu, K. Bretonnel Cohen, and Lawrence Hunter (2006). Finding GeneRIFs via Gene Ontology annotations. Pacific Symposium on Biocomputing 11:52-63.
• Word sense disambiguation:
  – William A. Baumgartner, Jr., ...K. Bretonnel Cohen, and Lawrence Hunter (2007). An integrated approach to concept recognition in biomedical text. Proceedings of BioCreative II.
• Question-answering/IR:
  – J. Gregory Caporaso, William A. Baumgartner Jr., Hyunmin Kim, Zhiyong Lu, Helen L. Johnson, Olga Medvedeva, Anna Lindemann, Lynne Fox, Elizabeth White, K. Bretonnel Cohen, and Lawrence Hunter (2006). Concept recognition, information retrieval, and machine learning in genomics question-answering. Proceedings of the Fifteenth Text Retrieval Conference.
• Document classification/IR:
  – J. Gregory Caporaso, William A. Baumgartner Jr., K. Bretonnel Cohen, Helen L. Johnson, Jesse Paquette, and Lawrence Hunter (2005). Concept recognition and the TREC Genomics tasks. Proceedings of the Fourteenth Text Retrieval Conference, National Institute of Standards and Technology.
• Computational lexical semantics:
  – Philip V. Ogren, K. Bretonnel Cohen, George K. Acquaah-Mensah, Jens Eberlein, and Lawrence Hunter (2004). The compositional structure of Gene Ontology terms. Pacific Symposium on Biocomputing 2004, pp. 214-225.
  – Philip V. Ogren, K. Bretonnel Cohen, and Lawrence Hunter (2005). Implications of compositionality in the Gene Ontology for its curation and usage. Pacific Symposium on Biocomputing 2005, pp. 174-185.
  – Helen L. Johnson, K. Bretonnel Cohen, William A. Baumgartner Jr., Zhiyong Lu, Michael Bada, Todd Kester, Hyunmin Kim, and Lawrence Hunter (2006). Evaluation of lexical methods for detecting relationships between concepts from multiple ontologies. Pacific Symposium on Biocomputing 11:28-39.
• Corpus linguistics:
  – K. Bretonnel Cohen, Lynne Fox, Philip Ogren, and Lawrence Hunter (2005). Empirical data on corpus design and usage in biomedical natural language processing. AMIA 2005 symposium proceedings, pp. 156-160.
  – K. Bretonnel Cohen, Lynne Fox, Philip V. Ogren, and Lawrence Hunter (2005). Corpus design for biomedical natural language processing. Proceedings of the ACL-ISMB workshop on linking biological literature, ontologies and databases, pp. 38-45. Association for Computational Linguistics.

Other recent projects
• Characterizing biomedical language
  – Open Access versus traditional journals
  – Full text versus abstracts
  – Nominalization and alternations
• Biological event extraction
• Ontology quality assurance
• Evaluation from many angles: shared task organization and participation; many angles on testing
• SciKnowMine and BASILISK evaluation (with Ellen Riloff)
• GO term recognition (with Michael and Karin)
• Grant-writing

Coreference defined
• Sophia Loren says she will always be grateful to Bono. The actress revealed that the U2 singer helped her calm down when she became scared by a thunderstorm while travelling on a plane.
  – Sophia Loren, she, The actress, her, she
  – Bono, the U2 singer

How do humans do this?
• Linguistic factors:
  – Kevin saw Larry. He liked him.
• Knowledge about the world:
  – Sophia Loren will always be grateful to Bono. The actress…
  – Sophia Loren will always be grateful to Bono. The singer…
  – Sophia Loren will always be grateful to Bono.
    The storm…
• A combination of world knowledge and linguistic factors:
  – Sophia Loren says she will always be grateful to Bono…
  – Sophia Loren says he will always be grateful to Bono…

Computers are bad at this
• Linguistic features don’t always help.
  – Each child ate a biscuit. They were delicious.
  – Each child ate a biscuit. They were delighted.
• Programming enough knowledge about the world into a computer has proven to be very difficult.

Our approach
• Matching semantic categories helps
  – BRCA1, the gene
  – Cell proliferation, leukocyte proliferation
• Minimal work on using ontologies
  – WordNet (General English, mostly)
  – Replacing ontology with web search
• We’re going to use ontologies, and more than anyone
• First step: broad semantic class assignment

Our approach
• Broad semantic class assignment
  – Coreference resolution benefits from knowing whether semantic classes match
  – Semantic class ≈ what ontology you should belong to
  – Looking at headwords, frequent words, informativeness measures
Timeline (coref, not semantic class assignment): this spring

Software engineering perspectives on natural language processing

Two paradigms of evaluation
• Traditional approach: use a corpus
  – Expensive
  – Time-consuming to produce
  – Redundancy for some things…
  – …underrepresentation of others (Oepen et al. 1998)
  – Slow run-time (Cohen et al. 2008)
• Non-traditional approach: structured test suite
  – Controls redundancy
  – Ensures representation of all phenomena
  – Easy to evaluate results and do error analysis
  – Used successfully in grammar engineering

Structured test suite
GO ID        Canonical     Non-canonical
GO:0000133   Polarisome    Polarisomes
GO:0000108   Repairosome   Repairosomes
GO:0000786   Nucleosome    Nucleosomes
GO:0001660   Fever         Fevers
GO:0001726   Ruffle        Ruffles
GO:0005623   Cell          Cells
GO:0005694   Chromosome    Chromosomes
GO:0005814   Centriole     Centrioles
GO:0005874   Microtubule   Microtubules

Structured test suite
Features of terms
• Length
• Punctuation
• Presence of stopwords
• Ungrammatical terms
• Presence of numerals
• Official synonyms
• Ambiguous terms
Types of changes
• Singular/plural variants
• Ordering and other syntactic variants
• Inserted text
• Coordination
• Verbal versus nominal constructions
• Adjectival versus nominal constructions
• Unofficial synonyms

Structured test suite
• Syntax
  – induction of apoptosis → apoptosis induction
• Part of speech
  – cell migration → cell migrated
• Inserted text
  – ensheathment of neurons → ensheathment of some neurons

Results
• No non-canonical terms were recognized
• 97.9% of canonical terms were recognized
  – All exceptions contain the word “in”
• What would it take to recognize the error pattern with canonical terms with a corpus-based approach?
Cohen (2010)

Other uses to date
• Broad characterization of successes/failures (JULIE lab)
• Parameter tuning (Hinterberg)
• Semantic class assignment (see coreference resolution)
Timeline: ongoing, lots of work already done

Weird stuff that comes up with full text: parentheses
• Background
  – Distinguishing feature of full text (Cohen et al.)
  – Confusing to patients/laypeople, useful to us (Elhadad)
  – Ignorable in gene names (Cohen et al.)
  – Problems for parsers (Jang et al. 2006)
  – Problems with hedge scope assignment (Morante and Daelemans)
  – Abbreviation definition (Schwartz and Hearst)
  – Gene symbol grounding (Lu)
  – “Citances” (Nakov et al.)
  – 17,063 in 97-document corpus
• Use cases
  – Use P-value to set weighting in networks
  – Target for information extraction applications
  – Coreference resolution within text
  – Gene normalization
  – Meta-analyses
  – Table and figure mentions often indicators of assertions with experimental validation
  – Mapping text to sub-figures
  – Citations useful for establishing rhetorical relations between papers, synonym identification, and curation data

Weird stuff that comes up with full text: parentheses
Category                      Use case
Gene symbol or abbreviation   Gene normalization, coreference resolution
Citation                      Summaries, high-value sentences, bibliometrics
Data value                    Information extraction
P-value                       Link weighting, meta-analysis
Figure/table pointer          Strong indicator of good evidence
List element                  Mapping sub-figures to text
Singular/plural               Distinguish from other categories
Part of gene name             Gene normalization
Parenthetical statement       Potentially ignorable, or IE target
Timeline: Mid-March (AMIA)

Tuberculosis and translational medicine
• Pathogen/host interactions
  – Pathogens
  – Hosts
  – Genes
  – ?
• Information retrieval alone is difficult if strain must be considered
• Current status: evaluating dictionary-based approaches to gene name recognition; 0.69 and climbing
Timeline: Not pressing

Discourse annotation
• Want to be able to follow and perform abductive reasoning
• Methods under development for labelling aspects of the structure of an argument
• Currently building large data set from the CRAFT corpus (97 articles)
Timeline: Soon for annotation, September for doing something with it

More projects than people
• Ongoing:
  – Coreference resolution (spring)
  – Software engineering perspectives on natural language processing
  – Odd problems of full text (mid-March)
  – Semantic classification (March)
  – Tuberculosis and translational medicine
  – Discourse analysis annotation (months)
  – SciKnowMine and BASILISK (with Ellen Riloff)
• In need of fresh blood:
  – Metagenomics/Microbiome studies (July)
  – Translational medicine from the clinical side (summer work, due fall)
  – Summarization
  – Negation
  – Question-answering: Why?
  – Nominalizations
  – Metamorphic testing for natural language processing
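
As a concrete illustration of the last item, the non-linguistic metamorphic relation from the metamorphic testing slides ("add non-informative feature, see if feature selection screens it out") can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the toy correlation-based selector `select_features` stands in for whatever system is actually under test, and the data, threshold, and function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either variable has no variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(rows, labels, threshold=0.5):
    """Toy system under test: keep columns whose |correlation| with the
    label is at least `threshold`. Returns the selected column indices."""
    selected = []
    for col in range(len(rows[0])):
        values = [row[col] for row in rows]
        if abs(pearson(values, labels)) >= threshold:
            selected.append(col)
    return selected


def add_noninformative_feature(rows, value=1.0):
    """Metamorphic transformation of the input: append a constant column."""
    return [row + [value] for row in rows]


def relation_holds(rows, labels):
    """Metamorphic relation: adding a constant column must not change
    which of the original columns are selected, and the new column
    must not be selected. No gold-standard output is needed."""
    before = select_features(rows, labels)
    after = select_features(add_noninformative_feature(rows), labels)
    return before == after


# Hypothetical data: column 0 tracks the label, column 1 does not.
ROWS = [[1.0, 5.0], [2.0, 3.0], [3.0, 5.0], [4.0, 3.0]]
LABELS = [0, 0, 1, 1]

if __name__ == "__main__":
    print(select_features(ROWS, LABELS))
    print(relation_holds(ROWS, LABELS))
```

The check never states what the right set of features is; it only states how the output should (not) change when the input changes in a known way, which is the motivation given for metamorphic testing when right answers are unclear.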