Biomedical Informatics
Data ➜ Information ➜ Knowledge
Biomedical Named Entity Recognition
Ramakanth Kavuluru
NLP Seminar – 8/21/2012
BMI
What are named entities?
• The benefits of taking cholesterol lowering statin
drugs outweigh the risks even among people
who are likely to develop diabetes.
• Acute exposure to resveratrol inhibits AMPK
activity in human skeletal muscle cells
What are named entities?
• The benefits of taking cholesterol [Biologically Active Substance] lowering statin
drugs [Drug] outweigh the risks even among people
who are likely to develop diabetes [Disorder].
• Acute exposure to resveratrol [Organic Chemical] inhibits AMPK [Enzyme]
activity in human skeletal muscle cells [Cell]
What are named entities?
• The benefits of taking [cholesterol lowering statin
drugs] (Cholesterol lowering drugs; Drug) outweigh the risks even among people
who are likely to develop diabetes.
• Acute exposure to resveratrol inhibits [AMPK
activity] (Biological Function) in human skeletal muscle cells
Why do we need to extract them?
• To provide effective semantic search
– Clinical trial recruitment: find all discharge summaries of patients that have a
history of diabetes and obesity and have taken statins
as part of their treatment.
– Literature review: find all biomedical articles that discuss the dopamine
neurotransmitter in the context of depressive
disorders.
Why do we need to extract them?
• To use as features in machine learning for
effective text classification
• To build semantic clusters of textual documents
to understand evolving themes
• To reduce noise by avoiding keywords that are not
indicative of the classes or clusters
• More recently, as a first step in relation extraction and
hence in knowledge discovery
A major task in text mining
• Extract information from textual data
• Use this information to solve problems
• What type of information?
– Relevant concepts: a medical condition or finding, a
drug, a gene or protein, an emotion (hope, love, …)
– Relevant (binary) relations: drug TREATS a
condition, protein CAUSES a disease
• What are the typical questions?
– Does a pathology report indicate a reportable case?
– Which patients satisfy the criteria for a clinical trial?
Knowledge Discovery
• VIP Peptide – increases – Catecholamine Biosynthesis (in cattle)
• Catecholamines – induce – β-adrenergic receptor activity (in rats)
• β-adrenergic receptors – are involved in – fear conditioning (in humans)
• Hypothesis: VIP Peptide – affects – fear conditioning?
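The chain above can be sketched as a search over extracted (subject, relation, object) triples. This is only a toy illustration: the triples are the three from the slide, and the entity names are normalized by hand (for example, "Catecholamine Biosynthesis" and "Catecholamines" are treated as one node), which is an assumption that a real system would have to earn through concept normalization. The function name `find_chain` is hypothetical.

```python
from collections import deque

# Triples from the slide; entity names are normalized by hand (an assumption).
TRIPLES = [
    ("VIP Peptide", "increases", "Catecholamines"),
    ("Catecholamines", "induce", "β-adrenergic receptors"),
    ("β-adrenergic receptors", "are involved in", "fear conditioning"),
]

def find_chain(triples, source, target):
    """Breadth-first search for a chain of triples linking source to target."""
    outgoing = {}
    for subj, rel, obj in triples:
        outgoing.setdefault(subj, []).append((rel, obj))
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for rel, obj in outgoing.get(node, []):
            if obj not in seen:
                seen.add(obj)
                queue.append((obj, path + [(node, rel, obj)]))
    return None  # no chain found

chain = find_chain(TRIPLES, "VIP Peptide", "fear conditioning")
for subj, rel, obj in chain:
    print(f"{subj} - {rel} - {obj}")
```

A chain found this way is only a candidate hypothesis: the three facts come from different species, so the inferred link still needs experimental confirmation.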
Clinical NER
Concept Type: Attributes
• Disorder/Symptom: Present/historical/absent, Acute? Uncertain?
• Medication: Present/historical/future
• Procedures
Why is NER Hard?
Linguistic Variation
• Derivational variation: cranial, cranium
• Inflectional variation: coughed, coughing
• Synonymy
– neurofibromin 2, merlin, NF2 protein, and
schwannomin.
– Addison’s disease, adrenal insufficiency,
hypocortisolism, bronzed disease
– Feeding problems in newborn – The mother said she
was having trouble feeding the baby.
Polysemy
• Merlin – both a bird and a protein in UMLS
• Discharge
– Patient was prescribed codeine upon discharge
– The discharge was yellow and purulent
• Abbreviations
– APC: Activated protein C, Adenomatosis polyposis
coli, antigen presenting cell, aerobic plate count,
advanced pancreatic cancer, age period cohort,
antibody producing cells, atrial premature complex
Negation
• Nearly half of all clinical concepts in dictated
narratives are negated
– There is no maxillary sinus tenderness
• Implied absence without negation
– Lungs are clear upon auscultation
So,
– Rales: Absent
– Rhonchi: Absent
– Wheezing: Absent
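A minimal sketch of trigger-based negation detection, in the spirit of rule-based approaches such as NegEx. The trigger list, the window size and the function name `is_negated` are illustrative assumptions, not a published rule set. It only covers explicit negation; implied absence such as "Lungs are clear upon auscultation" needs domain knowledge that this sketch does not have.

```python
import re

# Illustrative trigger phrases (an assumption, not a complete list).
NEGATION_TRIGGERS = ("no", "denies", "without", "negative for")
WINDOW = 5  # tokens after a trigger in which a concept may start (assumption)

def is_negated(sentence, concept):
    """Return True if `concept` starts within WINDOW tokens after a trigger."""
    tokens = re.findall(r"\w+", sentence.lower())
    concept_tokens = re.findall(r"\w+", concept.lower())
    n = len(concept_tokens)
    for trigger in NEGATION_TRIGGERS:
        trig = trigger.split()
        for i in range(len(tokens) - len(trig) + 1):
            if tokens[i:i + len(trig)] != trig:
                continue
            start = i + len(trig)
            # Look for the concept starting inside the window after the trigger.
            for j in range(start, min(start + WINDOW, len(tokens) - n + 1)):
                if tokens[j:j + n] == concept_tokens:
                    return True
    return False

print(is_negated("There is no maxillary sinus tenderness",
                 "maxillary sinus tenderness"))
```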
Controlled Terminologies
Controlled vocabularies or taxonomies
– Gene Ontology (gene products)
• most cited, 450 per year in PubMed
• Total of 33000+ terms
– SNOMED CT (300K+ concepts)
– NCI Thesaurus, ICD-9/10, ICD-O-3, LOINC,
MedlinePlus
– UMLS Metathesaurus (integration of 140+ vocabularies)
• 2.3 million concepts
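More on the Metathesaurus
• CUIs: Concept Unique Identifiers (one per meaning)
• LUIs: Lexical (term) Unique Identifiers (one per group of lexical variants)
• SUIs: String Unique Identifiers (one per distinct string)
• AUIs: Atom Unique Identifiers (one per occurrence of a string in a source vocabulary)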
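The four identifier levels nest inside one another: a concept groups terms, a term groups strings that are lexical variants of each other, and each string has one atom per source vocabulary it appears in. The sketch below models that nesting with plain dataclasses. All identifier values and vocabulary names are placeholders rather than real Metathesaurus entries, and the grouping of the two names under one concept simply follows the synonymy example from the earlier slide.

```python
from dataclasses import dataclass, field

@dataclass
class Atom:
    aui: str      # one atom per (string, source vocabulary) occurrence
    source: str

@dataclass
class String:
    sui: str      # one per distinct string (case and spelling matter)
    text: str
    atoms: list = field(default_factory=list)

@dataclass
class Term:
    lui: str      # groups strings that are lexical variants
    strings: list = field(default_factory=list)

@dataclass
class Concept:
    cui: str      # groups terms that share one meaning
    terms: list = field(default_factory=list)

    def all_strings(self):
        return [s.text for t in self.terms for s in t.strings]

    def atom_count(self):
        return sum(len(s.atoms) for t in self.terms for s in t.strings)

# Placeholder identifiers and vocabulary names (not real UMLS data).
concept = Concept("C_DEMO_1", [
    Term("L_DEMO_1", [
        String("S_DEMO_1", "Addison's disease",
               [Atom("A_DEMO_1", "VOCAB_A"), Atom("A_DEMO_2", "VOCAB_B")]),
        String("S_DEMO_2", "Addison's Disease", [Atom("A_DEMO_3", "VOCAB_C")]),
    ]),
    Term("L_DEMO_2", [
        String("S_DEMO_3", "hypocortisolism", [Atom("A_DEMO_4", "VOCAB_A")]),
    ]),
])

print(concept.all_strings(), concept.atom_count())
```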
Semantic Types and Relations
• NLM Semantic Network, the type system
behind UMLS Metathesaurus
– Semantic Types (135)
• Semantic Groups (15)
– Semantic Relations (54)
• Specialist Lexicon
– Malaria, malarial
– Hyperplasia, hyperplastic
How do we extract named entities?
MetaMap from NLM
• Identify phrases: use the SPECIALIST parser
• Map to CUIs: use the SPECIALIST Lexicon, Metathesaurus and
Semantic Network
Output of syntactic analysis
• Syntactic Analysis – “ocular complications of
myasthenia gravis”
– Ocular (adj), complications (noun), of (prep),
myasthenia (noun), gravis (noun)
– gives noun phrases (NP): “Ocular complications” and
“Myasthenia gravis”
– Prepositions are ignored
– In a given NP, you have a head and modifiers:
• Ocular (mod) and complications (head)
• How about “male pattern baldness”?
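The phrase splitting and head/modifier steps can be sketched as follows. This is not the SPECIALIST parser: the part-of-speech tags are supplied by hand, phrases are simply split at prepositions, and the head is taken to be the last word of each phrase, which is a simplifying assumption. Both function names are hypothetical.

```python
def split_noun_phrases(tagged_tokens):
    """Split (word, tag) pairs into phrases at prepositions, which are dropped."""
    phrases, current = [], []
    for word, tag in tagged_tokens:
        if tag == "prep":
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(word)
    if current:
        phrases.append(current)
    return phrases

def head_and_modifiers(phrase):
    """Last-word heuristic: final word is the head, the rest are modifiers."""
    return phrase[-1], phrase[:-1]

# Tags as given on the slide.
tagged = [("ocular", "adj"), ("complications", "noun"), ("of", "prep"),
          ("myasthenia", "noun"), ("gravis", "noun")]

phrases = split_noun_phrases(tagged)
for phrase in phrases:
    head, mods = head_and_modifiers(phrase)
    print(" ".join(phrase), "-> head:", head, "| modifiers:", mods)
```

Under this heuristic, "male pattern baldness" gets "baldness" as its head and "male" and "pattern" as modifiers.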
Variant Generation
Candidate identification
• Look for all variants in Metathesaurus strings
and identify those candidate concepts (CUIs)
that contain at least one variant as a substring
• Example: For ocular complication, obtain all
Metathesaurus strings that contain any of the
following as substrings
– Optic complication
– Eyes complication
– Ophthalmic complicated
– ….
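A rough sketch of the candidate lookup, assuming a tiny in-memory table in place of the Metathesaurus. The CUIs are placeholders, the strings are made up for illustration, and `find_candidates` is a hypothetical name. Matching here is a plain case-insensitive substring test, so a variant such as "eye" would also hit unrelated strings like "eyelid"; a real system would match on word boundaries.

```python
# Toy stand-in for Metathesaurus strings: placeholder CUI -> strings.
METATHESAURUS = {
    "C_DEMO_1": ["Complications of the eye"],
    "C_DEMO_2": ["Optic neuritis"],
    "C_DEMO_3": ["Myasthenia gravis"],
}

def find_candidates(variants, metathesaurus):
    """Return {cui: matching strings} for strings containing any variant."""
    lowered = [v.lower() for v in variants]
    candidates = {}
    for cui, strings in metathesaurus.items():
        hits = [s for s in strings if any(v in s.lower() for v in lowered)]
        if hits:
            candidates[cui] = hits
    return candidates

# Word-level variants for the phrase "ocular complication" (illustrative).
variants = ["ocular", "optic", "eye", "ophthalmic", "complication", "complicated"]
candidates = find_candidates(variants, METATHESAURUS)
print(candidates)
```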
Mapping and Evaluation
• So now we have a set of candidate CUIs
based on the presence of variants of the given
phrase in Metathesaurus strings. How do we
select the best candidate?
• Use several measures to compute a rank
– Centrality (involvement of head)
– Variation (average of inverse distance scores)
– Coverage
– Cohesiveness
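Two of these measures are simple enough to sketch. Centrality is treated as 1 when the candidate string involves the head of the phrase and 0 otherwise. For variation, each matched word contributes an inverse distance score, taken here as 4 / (D + 4) for a variant distance D; that constant follows the description in Aronson's MetaMap paper as best recalled, so treat it as an assumption. Coverage and cohesiveness depend on how phrase words align with the candidate string and are not sketched.

```python
def centrality(head_matched: bool) -> float:
    """1.0 if the candidate involves the head of the phrase, else 0.0."""
    return 1.0 if head_matched else 0.0

def variation(distances) -> float:
    """Average of inverse distance scores 4 / (D + 4) over matched words.

    A distance of 0 means the word matched exactly; larger distances mean
    the match went through more remote variants.
    """
    return sum(4 / (d + 4) for d in distances) / len(distances)

print(centrality(True), variation([0, 0]), variation([0, 4]))
```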
Final Score
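A sketch of how the four measures can be combined, assuming the weighting described in Aronson's MetaMap paper: a weighted average in which coverage and cohesiveness count twice as much as centrality and variation, scaled to the range 0 to 1000. The function name and the rounding step are illustrative choices.

```python
def final_score(centrality, variation, coverage, cohesiveness):
    """Weighted average of the four measures (each in [0, 1]), scaled to 0-1000."""
    weighted = (centrality + variation + 2 * coverage + 2 * cohesiveness) / 6
    return round(1000 * weighted)

print(final_score(1.0, 1.0, 1.0, 1.0))   # perfect match on every measure
print(final_score(1.0, 1.0, 0.5, 0.5))   # head matched exactly, partial coverage
```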
MetaMap Options
• Types of variants: include or exclude
derivational variants
• Word sense disambiguation
– Discharge (bodily secretion vs. release of the patient)
• Concept gaps
– Obstructive apnea mapping to “obstructive sleep
apnea” or “obstructive neonatal apnea”
• Term processing
– Process the input string as a single concept, that is,
don’t split it into noun phrases
Output options
• Human readable format
• XML format
• Restrictions based on certain vocabularies:
consider only ICD-9
• Restrictions based on certain types: consider
only pharmacological substances (i.e., drugs)
DEMO TIME: Daniel Harris
References
• An Overview of MetaMap: Historical Perspective and Recent
Advances, Alan Aronson and Francois Lang
• Effective Mapping of Biomedical Text to the UMLS Metathesaurus:
The MetaMap Program, Alan Aronson
• Comparison of LVG and MetaMap Functionality, Alan Aronson
• Lexical, Terminological, and Ontological Resources for Biological
Text Mining, Olivier Bodenreider