Download Molecular Biology Databases

Biomedical natural language processing and text mining
Kevin Bretonnel Cohen, Ph.D.
Instructor, Department of Pharmacology, University of Colorado School of Medicine
Adjunct Assistant Professor, Department of Linguistics, University of Colorado at Boulder
http://compbio.ucdenver.edu/Hunter_lab/Cohen

What is natural language processing?
• NLP, text mining, computational linguistics
  – Computational modeling of human language
  – Access to knowledge in linguistic form
    • Information retrieval
    • Information extraction
    • Document classification
    • Machine translation
    • Summarization
    • …

Why Biomedical NLP? Exponential knowledge growth
[Figure: PubMed Growth Rate, 1986–2008; axes: Total Entries (millions) and New Entries (thousands); fitted curves labeled y = ~e^{0.0418x} (R² = 0.99) and y = ~e^{0.031x} (R² = 0.95)]
• 1,170 peer-reviewed gene-related databases in 2009 NAR db issue
• 804,399 PubMed entries in 2008 (> 2,200/day)
• Breakdown of disciplinary boundaries; more of it relevant to each of us
• “Like drinking from a firehose” – Jim Ostell
Slide from Larry Hunter

The Biological Data Cycle
[Diagram labels: Experimental Data; Ontologies; Databases; Genbank; SwissProt; Literature Collections; MEDLINE; Expert Curation]
Bottleneck: getting knowledge from literature to databases
Solution: text mining

Model Organism Curation Pipeline
1. Select papers (MEDLINE)
2. List genes for curation
3. Curate genes from paper
From Hirschman et al., BMC Bioinformatics 2005, 6(Suppl 1):S1

The world’s best justification for BioNLP
Baumgartner et al. (2007b)

Descriptive versus prescriptive approaches to language
• Descriptive: Goal is to describe language accurately and understand how it works.
  – Not easy.
• Prescriptive: Goal is to tell people how language should be.
  – Completely irrelevant to computational approaches to language.

Two families of approaches
• Rule-based
  – Grammars
  – Patterns
• Machine learning / statistical
  – Labelled training data
  – Noisy/unlabelled training data

Why they’re difficult
• Rule-based:
  – Some expertise in linguistics and in the domain is very useful
  – Domain adaptation is time-consuming
  – Complete solutions may be out of reach
• Machine learning:
  – Large amounts of training data required and expensive
  – Difficult to do anything novel

Why they work
• Rule-based:
  – Rules are in some sense psychologically and empirically real
  – Patterns actually exist in the real world
  – Large training sets not necessary
• Machine learning:
  – Some things that we care about occur often enough to be tractable

How to pick one
• Money
• Most approaches are actually hybrids
  – Rule-based for post-processing
  – Rule-based feature extraction

Text mining improves biological data analysis
• Leverage information from the literature in the biological data mining process
• Homology searches:
  – Filter unlikely sequence alignments through assessment of literature similarity
  – Score literature similarity independently of sequence similarity, and combine into unified score
• Subcellular localization
  – Build literature term vectors based on PubMed/MEDLINE abstracts or SWISS-PROT textual annotations
• Gene expression clusters:
  – Assign biological explanations through extraction of significant literature terms for genes in cluster
  – Measure literature correlations independently, and combine with microarray correlations before clustering

Evaluation of NLP systems
• Precision (aka positive predictive value) and recall (aka sensitivity). Tradeoffs between them.
• Against a “gold standard” of human-generated representations of texts
  – Humans don’t always agree, therefore calculate inter-annotator agreement
• Post-hoc judgments (particularly of IR relevance)
• “Shared task” paradigm
  – TREC Genomics (IR)
  – BioCreative (IE)

Evaluation of NLP systems
• Precision:
  – True positives / (True positives + False positives)
• Recall:
  – True positives / (True positives + False negatives)
• F-measure: “harmonic mean” of precision and recall

Evaluation of NLP systems
• Formal definition:
  \[ F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}} \]
• Typical definition: β = 1, so…

Evaluation of NLP systems
• Typical definition:
  \[ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]
• …or just F: β is usually assumed to be 1

Evaluation of NLP systems
• β allows you to weight precision and recall differently
  – Increasing β weights recall more highly
  – Decreasing β weights precision more highly
• Rarely used, but designated by value of β, e.g. F_{0.5} or F_2

Chang et al.’s improvement on PSI-BLAST (2001)
Ng (2006)
Significant improvement in precision

                        P      R
  Standard PSI-BLAST   .84    .33
  Chang et al.         .95    .32

Goal: Predict subcellular localization to understand function
• Signal peptides and other sequences are indicative of localization
• Machine learning based predictors are moderately accurate
• Try adding text…

Subcellular localization (Stapley et al. 2002, Eskin and Agichtein 2004)
• Single SVM
• Build specialized amino acid and text kernels, then build combined kernel
Ng (2006)

Text improves clustering of gene expression profiles, too
• Create per-gene distance matrices based on expression data
• Create per-gene distance matrices based on literature data
• Combine using Fisher’s omnibus
• …then cluster

Matrix merging (Glenisson et al. 2003)
Ng (2006)

More sophisticated text analysis can improve these results
See the YouTube Hanalyzer demo for a better sense of the process
Leach et al.
(2009)

APPLICATIONS

TextPresso

Chilibot (www.chilibot.net)
Chen and Sharp (2004)

iHop (http://www.ihop-net.org/UniPub/iHOP)

Reflect (www.reflect.ws)
• Firefox plug-in
• Recognises proteins and small molecules mentioned in a web page, and links them to information-rich summaries.
Karin Verspoor

GoPubMed
Doms, A. et al. Nucl. Acids Res. 2005 33:W783-W786; doi:10.1093/nar/gki470

BIOMEDICAL LANGUAGE PROCESSING

Surely Shuy jests...
“There is little reason for the data on which a linguist works to have the right to name that work.”

Tokenization is different
• Commas
  – 2,6-diaminohexanoic acid
  – tricyclo(3.3.1.1^{3,7})decanone
• Hyphens
  – “Syntactic” (Calcium-dependent, Hsp-60)
  – Knocked-out gene: lush-- flies
  – Negation: -fever
  – Electric charge: Cl-
• PMID: 10516078 B-cell-CD4(+)-T-cell interactions

Named Entity Recognition is different
• Genes have names?
  – Descriptive names:
    • Breast cancer 1 (BRCA1)
    • Ribosomal protein S27
    • p53
    • Heat shock protein 110
    • Mitogen activated protein kinase 15
    • Mitogen activated protein kinase kinase kinase 5
    • sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A [SEMA5A]
  – Whimsical names:
    • lot, white, maggie
    • scott of the antarctic
    • ring
    • always early -> british rail
    • asp -> cleopatra
    • tudor -> vasa -> gustavus
    • nanos -> smaug
    • pray for elves
    • to, the, there, a, I, …
Karin Verspoor

It really is different on every level
• Corpus construction
• Semantic representation
• …
Ultimately, we need specific knowledge of the domain to do a good job with the language.

Linguistic Levels of Analysis
From Hunter & Cohen, Biomedical Language Processing: What’s Beyond PubMed?, Molecular Cell 21, 589–594, 2006 DOI 10.1016/j.molcel.2006.02.012

SUBTASKS AND TOOLS

Information Retrieval
• Retrieving from a collection of indexed documents
  – Indices based on
    • Words (perhaps without “stop words”)
    • Stems (e.g.
expresses, expressed, expression ⇒ express)
    • Synonyms and expansions
    • Meta-data fields (e.g. author, title)
    • Keywords or “controlled vocabularies” (e.g. MeSH)
  – Retrieval rankings based on
    • Number of matching terms
    • TF*IDF
    • Independent document characteristics (citations, links, etc.)
• Familiar as Google, PubMed, etc.
Karin Verspoor

TF*IDF
• Term frequency * Inverse Document Frequency
  – TF = how many times a term appears in a document
  – IDF = reciprocal of number of times a term appears in all documents
• Measure of how informative a term is
  – Occurrence of a rare term is more informative than that of a widely used term
  – Terms used frequently in a document are more informative than terms used only once
• Lots of variants
Karin Verspoor

Documents as queries
• Use a whole document to define a query (find things similar to…)
• Represent the document as:
  – “Bag of words”
    • Binary or frequency based vector of words or stems
  – Can add bigrams or trigrams
  – Reduced dimensionality (Latent Semantic Analysis)
• Calculate distance to all other documents in a collection (various metrics)
Karin Verspoor

Named entity recognition
• HSP60
• Hsp-60
• heat shock protein 60
• Cerberus
• wingless
• Ken and Barbie
• the

Entity normalization
Entity normalization: find concepts in text and map them to unique identifiers

A locus has been found, an allele of which causes a modification of some allozymes of the enzyme esterase 6 in Drosophila melanogaster. There are two alleles of this locus, one of which is dominant to the other and results in increased electrophoretic mobility of affected allozymes. The locus responsible has been mapped to 3-56.7 on the standard genetic map (Est-6 is at 3-36.8). Of 13 other enzyme systems analyzed, only leucine aminopeptidase is affected by the modifier locus. Neuraminidase incubations of homogenates altered the electrophoretic mobility of esterase 6 allozymes, but the mobility differences found are not large enough to conclude that esterase 6 is sialylated.
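As a rough illustration of the normalization task just defined, the sketch below looks up gene names in a small synonym dictionary and maps every mention to an identifier. It is a minimal dictionary-lookup sketch, not the method of any system cited in these slides: the function name `find_mentions` and the three-entry dictionary are assumptions made for the example, and the two FlyBase identifiers are the ones this deck gives for esterase 6 and leucine aminopeptidase. The text is an excerpt of the abstract above.

```python
import re

# Hypothetical synonym dictionary: lowercase surface form -> gene identifier.
SYNONYMS = {
    "esterase 6": "FBgn0000592",
    "est-6": "FBgn0000592",
    "leucine aminopeptidase": "FBgn0026412",
}


def find_mentions(text, synonyms):
    """Return (offset, matched text, identifier) for every dictionary hit."""
    # Try longer names first so a long synonym is preferred over a shorter
    # one that happens to be contained in it.
    names = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return [
        (match.start(), match.group(0), synonyms[match.group(0).lower()])
        for match in pattern.finditer(text)
    ]


abstract = (
    "A locus has been found, an allele of which causes a modification of "
    "some allozymes of the enzyme esterase 6 in Drosophila melanogaster. "
    "The locus responsible has been mapped to 3-56.7 on the standard genetic "
    "map (Est-6 is at 3-36.8). Of 13 other enzyme systems analyzed, only "
    "leucine aminopeptidase is affected by the modifier locus. Neuraminidase "
    "incubations of homogenates altered the electrophoretic mobility of "
    "esterase 6 allozymes, but the mobility differences found are not large "
    "enough to conclude that esterase 6 is sialylated."
)

mentions = find_mentions(abstract, SYNONYMS)
gene_ids = {gene_id for _, _, gene_id in mentions}
print(len(mentions), "mentions map to", len(gene_ids), "identifiers")
```

A lookup like this only works when the dictionary already contains the surface form; it does nothing about ambiguous names or unseen spelling variants, which is where most of the real difficulty lies.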

Entity normalization
• Perfect named entity recognition finds 5 mentions in the abstract above; they correspond to just 2 genes:
  – FBgn0000592 (esterase 6)
  – FBgn0026412 (leucine aminopeptidase)

Entity normalization
• Partial list of synonyms for FBgn0000592:
  – Esterase 6
  – Carboxyl ester hydrolase
  – CG6917
  – Est6
  – Est-D
  – Est-5

Biological Nomenclature: “V-SNARE”
V-SNARE
Vesicle SNARE
SNAP Receptor
Soluble NSF Attachment Protein
N-Ethylmaleimide-Sensitive Fusion Protein
Maleic acid N-ethylimide
Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor
(Alex Morgan, MITRE)

Domain-specific stopword lists may be necessary
• Stopword list for species/taxonomy identification:
  – bear
  – seal
• Stopword list for environments and habitats for metagenomics studies:
  – spring
  – well
  – range
Examples from Evangelos Pafilis

Information/relation extraction
Information extraction: relationships between things
BINDING_EVENT
  Binder:
  Bound:

Information/relation extraction
Met28 binds to DNA.
BINDING_EVENT
  Binder: Met28
  Bound: DNA

Document clustering
• For browsing large numbers of relevant documents
  – In biomedicine, unlike most Google searches, the goal is not one relevant document, but many
• Statistical measures of document distance
  – Cosine distance over term (or stem) vectors
  – PubMed document neighbors (TF*IDF clustering)
  – Latent Semantic Analysis (LSA)
• Knowledge-based approaches:
  – Mapping documents to a predefined set of types
  – Use information extraction as basis for clustering
Karin Verspoor

Automated summarization
• Useful for browsing retrieved documents
• Multidocument summarization can characterize document clusters
• Select the “best” sentence/passage
  – Based on appearance of query terms (a la Google)
  – Other useful criteria:
    • Cues (“we conclude”, “demonstrating that”…)
    • Presence of supporting data (“Figure 6 shows that…”)
    • Sentence position (last sentence of abstract)
  – Frequency in multiple documents
Karin Verspoor

Document zoning
• Different “sections” or zones of a document
  – Introduction vs. methods vs. references, etc.
• Many want to focus on (or exclude) certain zones from search or other processing
• No straightforward way to identify zones
  – Journals often have their own structures
  – Section titles, HTML/XML/SGML formatting helps (PubMedCentral DTD)
  – Treat as discrimination problem
Karin Verspoor

Extracting factual information from text
• Information extraction (IE) involves parsing text for patterns encoding particular facts
  – Biomedical literature is full of useful information potentially amenable to IE (e.g. consequences of mutations)
  – BioCreative 2006/2009 Competitions on extracting protein-protein interaction statements from literature
• Subtasks:
  – Entity identification / normalization
  – Finding relationships
  – Filling in predefined schemata
Karin Verspoor

Named entity recognition (again)
• Finding references to particular concepts (e.g.
genes, drugs, diseases) in text
  – Difficult because of ambiguity [genes with normal English names, variations in expression, anaphoric reference, etc.]
Karin Verspoor
http://www.ploscompbiol.org/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.1000361&imageURI=info:doi/10.1371/journal.pcbi.1000361.g003

Variation in expression
• There are a very large number of ways of expressing a single “concept”
  – Morphological: NF Kappa B1, NFKB1, NFKB-1, NFkB1, NFkB-1, NFK-B1, NFKB-I, NFkB(1)
  – Synonyms: KBF1, EBP-1, MGC54151, NFKBp50, NFKB-p105, NF-kappa-B, DKFZp686C01211
  – Syntactic: “X regulates Y”, “Y is regulated by X”, “regulation of Y by X”, “X regulation of Y”, “Y-regulating X”
• People don’t even tend to notice these…
Karin Verspoor

Ambiguity
• Most important problem in NLP
• Example: Hunk
  – Cell type: HUman Natural Killer cells
  – Gene: Hormonally Upregulated Neu-associated Kinase
  – English word: piece or lump of substance
• Correct construal requires knowledge to interpret:
  – “Hunk expression” versus “Hunk phagocytosis”
• Can be structural, too, e.g. “regulation of cell proliferation and motility”
Karin Verspoor

Gene Normalization (again)
• Mapping a gene or protein name to an identifier (e.g. in GenBank)
• Very important task for using extracted information (more useful than just a name)
• Ambiguity
  – with English words (“to” “dunce” “wingless”)
  – in naming (1168 genes in Entrez named “p60”)
  – in species (949 species have a gene named “p53”)
Karin Verspoor

Normalization methods
• Heuristic approach is necessary
  – Edit distance is too coarse (some characters matter more than others)
• Some heuristics that appear to work
  – Ignore hyphens, commas, some other interrupting punctuation (but not, e.g., ' )
  – Ignore parenthetical elements
  – Consider translations among arabic/roman numerals, and latin/greek letters
  – Special words for compound noun phrases: receptor, precursor, mRNA, gene, protein, greek letter names, etc.
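The heuristics listed above can be sketched as a function that reduces a name to a comparison key. This is an illustrative sketch only: `normalize_gene_name` and the two small lookup tables are hypothetical, they cover just a few Greek letters and Roman numerals, and the parenthetical-removal and special-word heuristics are left out. The variant strings are the morphological variants of NFKB1 listed earlier in these slides.

```python
import re

# Hypothetical, deliberately tiny translation tables.
GREEK_TO_LATIN = {"alpha": "a", "beta": "b", "gamma": "g", "delta": "d", "kappa": "k"}
ROMAN_TO_ARABIC = {"i": "1", "ii": "2", "iii": "3", "iv": "4", "v": "5"}


def normalize_gene_name(name):
    """Reduce a gene or protein name to a key for approximate matching."""
    # Keep runs of letters, runs of digits, and apostrophes; hyphens, commas,
    # spaces and brackets are dropped because they never match the pattern.
    tokens = re.findall(r"[a-z]+|\d+|'", name.lower())
    key_parts = []
    for position, token in enumerate(tokens):
        if token in GREEK_TO_LATIN:
            token = GREEK_TO_LATIN[token]
        elif position > 0 and token in ROMAN_TO_ARABIC:
            # Only non-initial tokens are read as Roman numerals, so that the
            # "V" in a name such as V-SNARE is not turned into a 5.
            token = ROMAN_TO_ARABIC[token]
        key_parts.append(token)
    return "".join(key_parts)


variants = [
    "NF Kappa B1", "NFKB1", "NFKB-1", "NFkB1",
    "NFkB-1", "NFK-B1", "NFKB-I", "NFkB(1)",
]
keys = {normalize_gene_name(variant) for variant in variants}
print(keys)
```

Collapsing names this aggressively trades precision for recall: it merges spelling variants, but it can also merge names that are genuinely different, so the resulting key is better treated as a candidate generator than as a final answer.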
Karin Verspoor

Other entities
• Genes (and their products) are particularly valuable to recognize, but are not the only entities of interest:
  – Diseases
  – Drugs and other treatments
  – Anatomical and other locations
  – Time and temporal relationships
  – Methods and evidence
Karin Verspoor

Recognize what?
• To map texts to unambiguous representations, we need an underlying set of concepts to recognize.
• An Ontology is a set of concepts in a subsumption hierarchy
  – If all instances of concept X are also instances of concept Y, then Y subsumes X. This is the “is a” relationship
  – Subsumption is a many-to-many relationship
  – E.g. “nucleus” is-a “cell component”
Karin Verspoor

Open Biological Ontologies
• The Gene Ontology (GO) project started in 2001
  – Model organism database annotators agreed on common representation to facilitate sharing
• OBO is extending this to other topics
  – Sequence features, cell types, mammalian phenotypes, etc.

From ontologies to knowledge-bases
• Knowledge-bases (KBs)
  – Provide horizontal relationships (“slots”) among concepts (not just is-a, part-of), e.g.:
    • Regulation of cell cycle controls cell cycle
    • DNA transcription takes place in the nucleus
  – Can be used for inference beyond just inheritance
    • E.g. relationships between molecular function and subcellular localization can be used to infer missing information
• Many of these relationships can be extracted semiautomatically (need manual verification)

Syntactic parsing
• Groups together words, tags parts of speech.
“This effect of cyclosporin A or herbimycin A on the downregulation of ERCC-1 correlates with enhanced cytotoxicity of cisplatin in this system.”

[this effect]NP
[of [cyclosporin A]NP]PrepP
[or]CONJ
[herbimycin A]NP
[on [the down-regulation]NP]PrepP
[of [ERCC-1]NP]PrepP
[correlates]V
[with [enhanced cytotoxicity]NP]PrepP
[of [cisplatin]NP]PrepP
[in [this system]NP]PrepP
Karin Verspoor

Syntax helps
• 125I-labeled C3b was covalently deposited on CR2, when hemolytically active 125I-labeled C3 was added to Raji cells preincubated with iC3, factor B, properdin, and factor D, thus proving functionality of CR2-bound C3 convertase.
  <cr2> BINDS <c3 convertase>
• CD8alpha(alpha) binds one HLA-A2/peptide molecule, interfacing with the alpha2 and alpha3 domains of HLA-A2 and also contacting beta2-microglobulin.
  <cd8alpha ( alpha )> BINDS <hla a2 / peptide molecule>
• The binding of 109Cd to metallothionein and the thiol density of the protein were determined after incubation of a purified Zn/Cd-metallothionein preparation with either hydrogen peroxide alone, or with a number of free radical generating systems.
  <109cd> BINDS <metallothionein>
• Although these shifts in alpha3 may provide a synergistic modulation of affinity, the binding of CD8 to MHC is clearly consistent with an avidity-based contribution from CD8 to TCR-peptide-MHC interactions.
  <Cd8> BINDS <major histocompatibility complex>
Larry Hunter

Coordination is particularly hard
In contrast both the S4GGnM-R and the Man-R are able to bind Man-BSA.
  <mannose receptor> BINDS <man bsa>
  <s4ggnm - r> BINDS <man bsa>
Purified recombinant NC1, like authentic NC1, also bound specifically to fibronectin, collagen type I, and a laminin 5/6 complex.
  <authentic nc1> BINDS <laminin 5 / 6 complex>
  <authentic nc1> BINDS <collagen type I>
  <authentic nc1> BINDS <fibronectin>
  <purified recombinant nc1> BINDS <laminin 5 / 6 complex>
  <purified recombinant nc1> BINDS <collagen type I>
  <purified recombinant nc1> BINDS <fibronectin>
The nonvisual arrestins, beta-arrestin and arrestin3, but not visual arrestin, bind specifically to a glutathione S-transferase-clathrin terminal domain fusion protein.
  * <Arrestin3> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>
  <beta arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>
  <nonvisual arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>

Documents as evidence of function or other relationships
• Co-occurrence statistics. How often are two or more genes (or other entities) mentioned in the same document?
  – PubGene is a large database of co-occurrence statistics
    • http://www.pubgene.org
• Functional coherence measure (Altman et al.)
  – For each article mentioning a gene from a putatively functional group, score the article's relevance based on whether similar articles also mention genes in the group
  – Compare the number of high-scoring articles that a group generates to the number expected from random genes.

Literature-based groupings combined with other data
• Using literature-based assessments of groupings or coherence can improve quality of other clustering tasks
  – Chang et al. use literature similarity measures to improve the quality of PSI-BLAST searches for distant homologs
  – Blaschke's GEISHA system associates clusters of genes from expression array experiments with MEDLINE abstracts, extracting keywords to annotate the gene clusters.
  – Masys et al. use UMLS to score subtrees of various hierarchical medical ontologies, based on how frequently genes in an expression array cluster are tied to them.
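The co-occurrence statistics mentioned above (how often two genes are mentioned in the same document) can be sketched in a few lines. This is a toy sketch under stated assumptions, not how PubGene is built: it assumes gene mentions have already been recognized and normalized, so each document is just a set of gene symbols, and both the function name `cooccurrence_counts` and the three example documents are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for every unordered gene pair, the documents mentioning both."""
    counts = Counter()
    for genes in documents:
        # Sort so each pair has one canonical order; set() guards against
        # a gene being listed twice for the same document.
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return counts


# Made-up input: one set of normalized gene symbols per document.
documents = [
    {"BRCA1", "TP53"},
    {"BRCA1", "TP53", "HSP60"},
    {"HSP60"},
]

counts = cooccurrence_counts(documents)
for pair, n_docs in counts.most_common():
    print(pair, n_docs)
```

Raw counts favor frequently mentioned genes, so in practice they would be compared against what is expected by chance, in the same spirit as the comparison to random genes in the functional coherence measure.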

Knowledge-based data analysis
• 3R systems
  – Reading: integrate multiple databases & extract knowledge from the literature
  – Reasoning: infer additional knowledge and relate the knowledge to data
  – Reporting: provide information that helps biologists explain the phenomena in their data and generate new hypotheses
Slide from Larry Hunter

More sophisticated text analysis can improve these results
See the YouTube Hanalyzer demo for a better sense of the process
Leach et al. (2009)

More projects than people
• Ongoing:
  – Coreference resolution
  – Temporal relations in journal articles and clinical records
  – Epilepsy outcome prediction
  – Suicide
  – Proteomics of spinal cord injury and regeneration
  – Software engineering perspectives on natural language processing
  – Odd problems of full text
  – Tuberculosis and translational medicine
  – Discourse analysis annotation
  – OpenDMAP
• In need of fresh blood:
  – Metagenomics/Microbiome studies
  – Translational medicine from the clinical side
  – Summarization
  – Negation
  – Question-answering: Why?
  – Nominalizations
  – Metamorphic testing for natural language processing
