Biomedical Informatics
Data ➜ Information ➜ Knowledge
Biomedical Named Entity Recognition
Ramakanth Kavuluru
NLP Seminar – 8/21/2012
BMI

What are named entities?
• The benefits of taking cholesterol lowering statin drugs outweigh the risks even among people who are likely to develop diabetes.
  – Entity types marked: Biologically Active Substance, Drug, Disorder
  – Also marked: Cholesterol lowering drugs (Drug)
• Acute exposure to resveratrol inhibits AMPK activity in human skeletal muscle cells.
  – Entity types marked: Organic Chemical, Enzyme, Cell
  – Also marked: Biological Function

Why do we need to extract them?
• To provide effective semantic search
  – Find all discharge summaries of patients that have a history of diabetes and obesity and have taken statins as part of their treatment. (Clinical Trial Recruitment)
  – Find all biomedical articles that discuss the dopamine neurotransmitter in the context of depressive disorders. (Literature Review)

Why do we need to extract them? (continued)
• To use as features in machine learning for effective text classification
• To build semantic clusters of textual documents to understand evolving themes
• Reduce noise by avoiding key words that are not indicative of the classes or clusters
• Recently, as a first step in relation extraction and hence in knowledge discovery

A major task in text mining
• Extract information from textual data
• Use this information to solve problems
• What type of information?
  – Relevant concepts: a medical condition or finding, a drug, a gene or protein, an emotion (hope, love, …)
  – Relevant (binary) relations: drug TREATS a condition, protein CAUSES a disease
• What are the typical questions?
  – Does a pathology report indicate a reportable case?
  – Which patients satisfy the criteria for a clinical trial?

Knowledge Discovery
• VIP Peptide – increases – Catecholamine Biosynthesis (in cattle)
• Catecholamines – induce – β-adrenergic receptor activity (in rats)
• β-adrenergic receptors – are involved in – fear conditioning (in humans)
• VIP Peptide – affects – fear conditioning?

Clinical NER
Concept Type: Attributes
• Disorder/Symptom: present/historical/absent; acute? uncertain?
• Medication: present/historical/future
• Procedures

Why is NER Hard?

Linguistic Variation
• Derivational variation: cranial, cranium
• Inflectional variation: coughed, coughing
• Synonymy
  – neurofibromin 2, merlin, NF2 protein, and schwannomin
  – Addison’s disease, adrenal insufficiency, hypocortisolism, bronzed disease
  – Feeding problems in newborn
  – The mother said she was having trouble feeding the baby.
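
The synonymy examples above suggest the simplest possible normalization step: a lookup table from surface strings to a shared concept label. The sketch below is illustrative only. The table, the concept labels (NF2_PROTEIN, ADDISON_DISEASE), and the function names are hypothetical; they are not UMLS identifiers and are not part of any tool discussed in this talk.

```python
# Hypothetical synonym table built from the examples on the slide.
# The concept labels are made-up placeholders, NOT real UMLS CUIs.
SYNONYMS = {
    "NF2_PROTEIN": [
        "neurofibromin 2", "merlin", "NF2 protein", "schwannomin",
    ],
    "ADDISON_DISEASE": [
        "Addison's disease", "adrenal insufficiency",
        "hypocortisolism", "bronzed disease",
    ],
}


def normalize(text):
    """Lowercase, unify apostrophes, and collapse runs of whitespace."""
    return " ".join(text.lower().replace("\u2019", "'").split())


def build_index(synonyms):
    """Invert the table: normalized surface string -> concept label."""
    index = {}
    for concept, names in synonyms.items():
        for name in names:
            index[normalize(name)] = concept
    return index


INDEX = build_index(SYNONYMS)


def lookup(mention):
    """Return the concept label for a mention, or None if it is not listed."""
    return INDEX.get(normalize(mention))


if __name__ == "__main__":
    print(lookup("Schwannomin"))
    print(lookup("Addison\u2019s  disease"))
    print(lookup("trouble feeding the baby"))
```

A table like this only resolves synonyms that someone has listed. It does nothing for the paraphrase case (the mother having trouble feeding the baby), which is one reason exact dictionary matching alone is not enough.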

Polysemy
• Merlin: both a bird and a protein in UMLS
• Discharge
  – Patient was prescribed codeine upon discharge
  – The discharge was yellow and purulent
• Abbreviations
  – APC: activated protein C, adenomatosis polyposis coli, antigen presenting cell, aerobic plate count, advanced pancreatic cancer, age period cohort, antibody producing cells, atrial premature complex

Negation
• Nearly half of all clinical concepts in dictated narratives are negated
  – There is no maxillary sinus tenderness
• Implied absence without negation
  – Lungs are clear upon auscultation
  So,
  – Rales: Absent
  – Rhonchi: Absent
  – Wheezing: Absent

Controlled Terminologies
• Controlled vocabularies or taxonomies
  – Gene Ontology (gene products)
    • Most cited, 450 per year in PubMed
    • Total of 33,000+ terms
  – SNOMED CT (300K+ concepts)
  – NCI Thesaurus, ICD-9/10, ICD-O-3, LOINC, MedlinePlus
  – UMLS Metathesaurus (integration of 140+ vocabularies)
    • 2.3 million concepts

More on the Metathesaurus
• CUIs (concept unique identifiers)
• LUIs (lexical unique identifiers)
• SUIs (string unique identifiers)
• AUIs (atom unique identifiers)

Semantic Types and Relations
• NLM Semantic Network, the type system behind the UMLS Metathesaurus
  – Semantic Types (135)
    • Semantic Groups (15)
  – Semantic Relations (54)
• SPECIALIST Lexicon
  – Malaria, malarial
  – Hyperplasia, hyperplastic

How do we extract named entities?

MetaMap from NLM
• Identify phrases: use the SPECIALIST parser
• Map to CUIs: use the SPECIALIST Lexicon, Metathesaurus, and Semantic Network

Output of syntactic analysis
• Syntactic analysis
  – “ocular complications of myasthenia gravis”
  – Ocular (adj), complications (noun), of (prep), myasthenia (noun), gravis (noun)
  – Gives noun phrases (NP): “ocular complications” and “myasthenia gravis”
  – Prepositions are ignored
  – In a given NP, you have a head and modifiers:
    • Ocular (mod) and complications (head)
    • How about “male pattern baldness”?
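
As a rough illustration of the phrase-level analysis above, the following sketch splits an input string into noun phrases at prepositions and then picks a head and modifiers for each phrase. It is not the SPECIALIST parser. It assumes whitespace tokenization, a small hypothetical preposition list, and the heuristic that the head is the last word of the phrase. That heuristic matches "ocular complications" and gives "baldness" for "male pattern baldness", but it may be wrong for Latin-order terms such as "myasthenia gravis".

```python
# Small, hypothetical preposition list; a real parser uses a lexicon.
PREPOSITIONS = {"of", "in", "on", "with", "for", "to", "from", "by"}


def split_noun_phrases(text):
    """Split a string into token lists, cutting at (and dropping) prepositions."""
    phrases, current = [], []
    for token in text.split():
        if token.lower() in PREPOSITIONS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(token)
    if current:
        phrases.append(current)
    return phrases


def head_and_modifiers(tokens):
    """Naive heuristic: the last token is the head, the rest are modifiers."""
    return tokens[-1], tokens[:-1]


if __name__ == "__main__":
    for phrase in split_noun_phrases("ocular complications of myasthenia gravis"):
        head, mods = head_and_modifiers(phrase)
        print(" ".join(phrase), "-> head:", head, "modifiers:", mods)
    print(head_and_modifiers("male pattern baldness".split()))
```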

Variant Generation

Candidate identification
• Look for all variants in Metathesaurus strings and identify those candidate concepts (CUIs) that contain at least one variant as a substring
• Example: for “ocular complication”, obtain all Metathesaurus strings that contain any of the following as substrings
  – Optic complication
  – Eyes complication
  – Ophthalmic complicated
  – …

Mapping and Evaluation
• So now we have a bunch of candidate CUIs based on the presence of variants of the given phrase in Metathesaurus strings. How do we select the best candidate?
• Use several measures to compute a rank
  – Centrality (involvement of head)
  – Variation (average of inverse distance scores)
  – Coverage
  – Cohesiveness

Final Score

MetaMap Options
• Types of variants: include or exclude derivational variants
• Word sense disambiguation
  – Discharge (bodily secretion vs. releasing the patient)
• Concept gaps
  – “Obstructive apnea” mapping to “obstructive sleep apnea” or “obstructive neonatal apnea”
• Term processing
  – Process the input string as a single concept, that is, don’t split it into noun phrases

Output options
• Human readable format
• XML format
• Restrictions based on certain vocabularies: consider only ICD-9
• Restrictions based on certain types: consider only pharmacological substances (i.e., drugs)

DEMO TIME: Daniel Harris

References
• An Overview of MetaMap: Historical Perspective and Recent Advances, Alan Aronson and Francois Lang
• Effective Mapping of Biomedical Text to the UMLS Metathesaurus: The MetaMap Program, Alan Aronson
• Comparison of LVG and MetaMap Functionality, Alan Aronson
• Lexical, Terminological, and Ontological Resources for Biological Text Mining, Olivier Bodenreider