Biomedical natural language
processing and text mining
Kevin Bretonnel Cohen, Ph.D.
Instructor, Department of Pharmacology
University of Colorado School of Medicine
Adjunct Assistant Professor
Department of Linguistics
University of Colorado at Boulder
http://compbio.ucdenver.edu/Hunter_lab/Cohen
What is natural language
processing?
NLP, text mining, computational linguistics
– Computational modeling of human language
– Access to knowledge in linguistic form
• Information retrieval
• Information extraction
• Document classification
• Machine translation
• Summarization
•…
[Figure: PubMed growth rate, 1986–2008. Axes: total entries (millions) and new entries (thousands). Exponential fits shown on the plot: y ≈ e^(0.0418x), R² = 0.99; y ≈ e^(0.031x), R² = 0.95.]
Why Biomedical NLP?
Exponential knowledge growth
• 1,170 peer-reviewed gene-related databases in 2009 NAR db issue
• 804,399 PubMed entries in 2008 (> 2,200/day)
• Breakdown of disciplinary boundaries; more of it relevant to each of us
• “Like drinking from a firehose” – Jim Ostell
Slide from Larry Hunter
The Biological Data Cycle
Experimental
Data
Ontologies
Databases
Genbank
SwissProt
Literature
Collections
MEDLINE
Expert
Curation
Bottleneck: getting
knowledge from literature to
databases
Solution: text mining
Model Organism Curation Pipeline
1. Select papers
2. List genes for curation
3. Curate genes from paper
MEDLINE
From Hirschman et al. BMC Bioinformatics 2005 6(Suppl 1):S1
The world’s best justification for
BioNLP
Baumgartner et al. (2007b)
Descriptive versus prescriptive
approaches to language
• Descriptive: Goal is to describe language
accurately and understand how it works.
– Not easy.
• Prescriptive: Goal is to tell people how
language should be.
– Completely irrelevant to computational
approaches to language.
Two families of approaches
• Rule-based
– Grammars
– Patterns
• Machine learning / statistical
– Labelled training data
– Noisy/unlabelled training data
Why they’re difficult
• Rule-based:
– Some expertise in linguistics and in the domain is
very useful
– Domain adaptation is time-consuming
– Complete solutions may be out of reach
• Machine learning:
– Large amounts of training data are required, and that data is
expensive
– Difficult to do anything novel
Why they work
• Rule-based:
– Rules are in some sense psychologically and
empirically real
– Patterns actually exist in the real world
– Large training sets not necessary
• Machine learning:
– Some things that we care about occur often
enough to be tractable
How to pick one
• Money
• Most approaches are actually hybrids
– Rule-based for post-processing
– Rule-based feature extraction
Text mining improves biological
data analysis
• Leverage information from the literature in the biological data mining process
• Homology searches:
– Filter unlikely sequence alignments through assessment of literature similarity
– Score literature similarity independently of sequence similarity, and combine into unified score
• Subcellular localization:
– Build literature term vectors based on PubMed/MEDLINE abstracts or SWISS-PROT textual annotations
• Gene expression clusters:
– Assign biological explanations through extraction of significant literature terms for genes in cluster
– Measure literature correlations independently, and combine with microarray correlations before clustering
Evaluation of NLP systems
• Precision (aka positive predictive value) and recall (aka sensitivity). Tradeoffs between them.
• Against a “gold standard” of human-generated representations of texts
– Humans don’t always agree, therefore calculate inter-annotator agreement
• Post-hoc judgments (particularly of IR relevance)
• “Shared task” paradigm
– TREC Genomics (IR)
– BioCreative (IE)
Evaluation of NLP systems
• Precision:
– True positives / (True positives + False positives)
• Recall:
– True positives / (True positives + False negatives)
• F-measure: “harmonic mean” of precision and
recall
Evaluation of NLP systems
• Formal definition:
$$F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}$$
• Typical definition: β = 1, so
$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
• …or just F: β is usually assumed to be 1
Evaluation of NLP systems
• β allows you to weight precision and recall
differently
– Increasing β weights recall more highly
– Decreasing β weights precision more highly
• Rarely used, but designated by value of β,
e.g. F0.5 or F2
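The definitions on the preceding slides can be sketched in a few lines of Python. The counts below are made up for illustration, and the function names are not from any particular toolkit.

```python
def precision(tp: int, fp: int) -> float:
    """True positives / (true positives + false positives)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """True positives / (true positives + false negatives)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta = 1 gives F1)."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# Hypothetical counts: 8 correct answers, 2 spurious, 8 missed.
p = precision(tp=8, fp=2)   # 0.8
r = recall(tp=8, fn=8)      # 0.5
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {f_beta(p, r, beta):.3f}")
```

Because precision is higher than recall in this toy case, F0.5 (which favours precision) comes out above F1, and F2 (which favours recall) comes out below it.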
Chang et al.’s improvement on PSI-BLAST
(2001)
Ng (2006)
Significant improvement in precision
                     P     R
Standard PSI-BLAST   .84   .33
Chang et al.         .95   .32
Goal: Predict subcellular localization to
understand function
• Signal peptides and other sequences are indicative of localization
• Machine learning based predictors are moderately accurate
• Try adding text…
Subcellular localization (Stapley et al. 2002,
Eskin and Agichtein 2004)
Single SVM
Build specialized amino acid and text kernels, then build combined kernel
Ng (2006)
Text improves clustering of gene expression
profiles, too
• Create per-gene distance matrices based on expression data
• Create per-gene distance matrices based on literature data
• Combine using Fisher’s omnibus
• …then cluster
Matrix merging
(Glenisson et al. 2003)
Ng (2006)
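A minimal sketch of the "combine using Fisher's omnibus" step, assuming each data source has already been turned into a p-value for a given gene pair. The slide does not say how the distances are converted to p-values, so that conversion is left out here, and the function name is made up. Fisher's statistic is X = -2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis; for even degrees of freedom the tail probability has a closed form, so no statistics library is needed.

```python
import math


def fisher_omnibus(p_values):
    """Combine k independent p-values into one using Fisher's method."""
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    half_x = -sum(math.log(p) for p in p_values)  # X / 2
    # Chi-square survival function with 2k degrees of freedom:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


# Hypothetical p-values for one gene pair: expression-based and literature-based.
print(fisher_omnibus([0.04, 0.10]))
```

The combined values for all gene pairs would then form the merged matrix that is handed to the clustering step.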
More sophisticated text analysis
can improve these results
See the YouTube Hanalyzer demo for
a better sense of the process
Leach et al. (2009)
APPLICATIONS
TextPresso
Chilibot (www.chilibot.net)
Chen and Sharp (2004)
iHop (http://www.ihop-net.org/UniPub/iHOP)
Reflect (www.reflect.ws)
• Firefox plug-in
• Recognises proteins and small molecules
mentioned in a web page, and links them to
information-rich summaries.
Karin Verspoor
GoPubMed
Doms, A. et al. Nucl. Acids Res. 2005 33:W783-W786; doi:10.1093/nar/gki470
BIOMEDICAL LANGUAGE
PROCESSING
Surely Shuy jests...
“There is little reason
for the data on which
a linguist works to
have the right to
name that work.”
Tokenization is different
• Commas
– 2,6-diaminohexanoic acid
– tricyclo(3.3.1.13,7)decanone
• Hyphens
– “Syntactic”(Calcium-dependent, Hsp-60)
– Knocked-out gene: lush-- flies
– Negation: -fever
– Electric charge: Cl-
• PMID: 10516078
B-cell-CD4(+)-T-cell interactions
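To make the point concrete, here is a small Python comparison of two toy tokenizers on the examples above. Neither is a recommended biomedical tokenizer; both function names are invented for this sketch.

```python
import re


def naive_tokenize(text):
    """Split on whitespace, commas and hyphens, as a general-English tokenizer might."""
    return [t for t in re.split(r"[\s,\-]+", text) if t]


def whitespace_tokenize(text):
    """Split on whitespace only; strip trailing sentence punctuation,
    but keep internal commas and hyphens and a trailing '-' or '+'."""
    tokens = (t.rstrip(".,;:") for t in text.split())
    return [t for t in tokens if t]


print(naive_tokenize("2,6-diaminohexanoic acid"))
# ['2', '6', 'diaminohexanoic', 'acid']
print(naive_tokenize("tricyclo(3.3.1.13,7)decanone"))
# ['tricyclo(3.3.1.13', '7)decanone']
print(naive_tokenize("Cl-"))
# ['Cl']  (the electric charge is lost)
print(whitespace_tokenize("B-cell-CD4(+)-T-cell interactions."))
# ['B-cell-CD4(+)-T-cell', 'interactions']
```

The second tokenizer keeps the chemical names intact but also leaves "B-cell-CD4(+)-T-cell" as a single token, which shows why neither simple rule is enough on its own.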
Named Entity Recognition is different
• Genes have names?
– lot
– white
– maggie
– scott of the antarctic
– ring
– always early -> british rail
– asp -> cleopatra
– tudor -> vasa -> gustavus
– nanos -> smaug
– pray for elves
– to, the, there, a, I, …
– Breast cancer 1 (BRCA1)
– Ribosomal protein S27
– p53
– Heat shock protein 110
– Mitogen activated protein kinase 15
– Mitogen activated protein kinase kinase kinase 5
– sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A [SEMA5A]
Karin Verspoor
It really is different on every level
•Corpus construction
•Semantic representation
…
Ultimately, we need specific knowledge of the
domain to do a good job with the language.
Linguistic Levels of Analysis
From Hunter & Cohen, Biomedical Language Processing: What’s Beyond PubMed?, Molecular Cell 21, 589–594, 2006 DOI 10.1016/j.molcel.2006.02.012
SUBTASKS AND TOOLS
Information Retrieval
• Retrieving from a collection of indexed documents
– Indices based on
• Words (perhaps without “stop words”)
• Stems (e.g. expresses, expressed, expression ⇒ express)
• Synonyms and expansions
• Meta-data fields (e.g. author, title)
• Keywords or “controlled vocabularies” (e.g. MeSH)
– Retrieval rankings based on
• Number of matching terms
• TF*IDF
• Independent document characteristics (citations, links, etc.)
• Familiar as Google, PubMed, etc.
Karin Verspoor
TF*IDF
• Term frequency * Inverse Document Frequency
– TF = how many times a term appears in a document
– IDF = inverse of the number of documents in the collection that contain the term (usually log-scaled)
• Measure of how informative a term is
– Occurrence of a rare term is more informative than that of a widely used term
– Terms used frequently in a document are more informative than terms used only once
• Lots of variants
Karin Verspoor
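One common variant, sketched in Python under the assumption that TF is a raw count and IDF is log(N / document frequency). The three "documents" are invented toy sentences, and real systems typically add smoothing and length normalization.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        tf = Counter(tokens)
        weights.append({t: c * math.log(n_docs / doc_freq[t]) for t, c in tf.items()})
    return weights


# Toy collection (hypothetical sentences, already tokenized).
docs = [
    "hsp60 binds hsp10 in the mitochondria".split(),
    "hsp60 expression in the heat shock response".split(),
    "expression of p53 in the cell".split(),
]
for term, weight in sorted(tf_idf(docs)[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:15s} {weight:.3f}")
```

Terms that occur in every document ("in", "the") get weight zero, while a term unique to one document ("hsp10") outranks one shared by two documents ("hsp60").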
Documents as queries
• Use a whole document to define a query (find things similar to…)
• Represent the document as:
– “Bag of words”
• Binary or frequency based vector of words or stems
– Can add bigrams or trigrams
– Reduced dimensionality (Latent Semantic Analysis)
• Calculate distance to all other documents in a
collection (various metrics)
Karin Verspoor
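A minimal sketch of the document-as-query idea, assuming a frequency-based bag of words and cosine similarity as the metric (one of the "various metrics" the slide alludes to). The texts and function names are illustrative only.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Frequency-based bag of words over lower-cased whitespace tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_doc, collection):
    """Return (score, document) pairs, most similar first."""
    q = bag_of_words(query_doc)
    scored = [(cosine_similarity(q, bag_of_words(d)), d) for d in collection]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


collection = ["Met28 binds to DNA", "esterase 6 allozymes", "DNA repair"]
for score, doc in rank_by_similarity("Met28 binds DNA", collection):
    print(f"{score:.3f}  {doc}")
```

Swapping raw counts for TF*IDF weights, or adding bigrams as extra vector dimensions, changes only `bag_of_words`; the ranking code stays the same.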
Named entity recognition
HSP60
Hsp-60
heat shock protein 60
Cerberus
wingless
Ken and Barbie
the
Entity normalization
Entity normalization: find concepts in text and
map them to unique identifiers
A locus has been found, an allele of which causes a modification of some allozymes of the
enzyme esterase 6 in Drosophila melanogaster. There are two alleles of this locus, one of
which is dominant to the other and results in increased electrophoretic mobility of affected
allozymes. The locus responsible has been mapped to 3-56.7 on the standard genetic map
(Est-6 is at 3-36.8). Of 13 other enzyme systems analyzed, only leucine aminopeptidase
is affected by the modifier locus. Neuraminidase incubations of homogenates altered the
electrophoretic mobility of esterase 6 allozymes, but the mobility differences found are
not large enough to conclude that esterase 6 is sialylated.
Entity normalization
• Perfect named entity recognition finds 5 mentions; they
correspond to just 2 genes:
– FBgn0000592 (esterase 6)
– FBgn0026412 (leucine aminopeptidase)
Entity normalization
• Partial list of synonyms for FBgn0000592:
– Esterase 6
– Carboxyl ester hydrolase
– CG6917
– Est6
– Est-D
– Est-5
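As a rough illustration of the mapping step, the sketch below does a dictionary lookup over the synonyms listed above plus the leucine aminopeptidase identifier from the earlier slide. The matching rule (case-fold, drop spaces and hyphens) is an assumption chosen so that "Est-6" in the abstract matches the listed synonym "Est6"; it is not the method of any particular system.

```python
SYNONYMS = {
    "FBgn0000592": ["Esterase 6", "Carboxyl ester hydrolase", "CG6917",
                    "Est6", "Est-D", "Est-5"],
    "FBgn0026412": ["leucine aminopeptidase"],
}


def lookup_key(name):
    """Case-fold and drop spaces and hyphens so 'Est-6' and 'Est6' collide."""
    return name.lower().replace("-", "").replace(" ", "")


def build_index(synonyms):
    """Map every normalized synonym to its gene identifier."""
    return {lookup_key(n): gene_id for gene_id, names in synonyms.items() for n in names}


def normalize(mentions, index):
    """Return the identifier for each mention, or None if it is unknown."""
    return [index.get(lookup_key(m)) for m in mentions]


# The five gene mentions in the Drosophila abstract above.
mentions = ["esterase 6", "Est-6", "leucine aminopeptidase", "esterase 6", "esterase 6"]
ids = normalize(mentions, build_index(SYNONYMS))
print(sorted(set(ids)))
# ['FBgn0000592', 'FBgn0026412']
```

Five mentions collapse to two identifiers, which is the difference between named entity recognition and entity normalization.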
Biological Nomenclature: “V-SNARE”
V-SNARE
Vesicle SNARE
SNAP Receptor
Soluble NSF Attachment Protein
N-Ethylmaleimide-Sensitive Fusion Protein
Maleic acid N-ethylimide
Vesicle Soluble Maleic acid N-ethylimide Sensitive
Fusion Protein Attachment Protein Receptor
(Alex Morgan, MITRE)
Domain-specific stopword
lists may be necessary
• Stopword list for species/taxonomy
identification:
– bear
– seal
• Stopword list for environments and habitats
for metagenomics studies:
– spring
– well
– range
Examples from Evangelos Pafilis
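A toy dictionary tagger shows why such lists matter. The species dictionary and the example sentence are invented for this sketch; only "bear" and "seal" as stopwords come from the slide.

```python
import re

SPECIES_NAMES = {"mouse", "zebrafish", "bear", "seal"}  # hypothetical dictionary
SPECIES_STOPWORDS = {"bear", "seal"}                    # from the slide


def tag_species(text, dictionary, stopwords=frozenset()):
    """Return dictionary terms found in the text, skipping stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t in dictionary and t not in stopwords]


sentence = "These results bear on how mouse cells seal membrane wounds."
print(tag_species(sentence, SPECIES_NAMES))
# ['bear', 'mouse', 'seal']
print(tag_species(sentence, SPECIES_NAMES, SPECIES_STOPWORDS))
# ['mouse']
```

The stopword list trades recall for precision: a paper that really is about bears or seals will no longer be tagged by those words.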
Information/relation extraction
Information extraction: relationships between
things
BINDING_EVENT
Binder:
Bound:
Information/relation extraction
Met28 binds to DNA.
BINDING_EVENT
Binder: Met28
Bound: DNA
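A minimal rule-based sketch of filling this frame, assuming a single hand-written pattern for the active-voice "X binds (to) Y" construction. The class and pattern names are invented, and a real extractor would need many more patterns plus entity recognition.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BindingEvent:
    binder: str
    bound: str


# One toy pattern: "<binder> binds [to] <bound>", stopping at punctuation.
BINDS_PATTERN = re.compile(r"(?P<binder>\S+)\s+binds\s+(?:to\s+)?(?P<bound>[^.,;]+)")


def extract_binding(sentence: str) -> Optional[BindingEvent]:
    """Fill a BINDING_EVENT frame from a sentence, or return None."""
    match = BINDS_PATTERN.search(sentence)
    if match is None:
        return None
    return BindingEvent(match.group("binder"), match.group("bound").strip())


print(extract_binding("Met28 binds to DNA."))
# BindingEvent(binder='Met28', bound='DNA')
print(extract_binding("The binding of CD8 to MHC was measured."))
# None  (the nominalization "binding of X to Y" needs its own pattern)
```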
Document clustering
• For browsing large numbers of relevant documents
– In biomedicine, unlike most Google searches, the goal is
not one relevant document, but many
• Statistical measures of document distance
– Cosine distance over term (or stem) vectors
– PubMed document neighbors (TF*IDF clustering)
– Latent Semantic Analysis (LSA)
• Knowledge-based approaches:
– Mapping documents to a predefined set of types
– Use information extraction as basis for clustering
Karin Verspoor
Automated summarization
• Useful for browsing retrieved documents
• Multidocument summarization can characterize document clusters
• Select the “best” sentence/passage
– Based on appearance of query terms (a la Google)
– Other useful criteria:
• Cues (“we conclude”, “demonstrating that”…)
• Presence of supporting data (“Figure 6 shows that…”)
• Sentence position (last sentence of abstract)
– Frequency in multiple documents
Karin Verspoor
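The sentence-selection criteria above can be sketched as a simple scoring function. The weights (1 per query term, 2 per cue phrase, 1 for a figure reference, 1 for final position) are arbitrary assumptions, and the example sentences use placeholder names (GeneX, ProteinY) rather than real findings.

```python
import re

CUE_PHRASES = ("we conclude", "demonstrating that")  # cues listed on the slide


def score_sentence(sentence, position, n_sentences, query_terms):
    """Score one sentence by query terms, cue phrases, supporting data and position."""
    text = sentence.lower()
    words = set(re.findall(r"[a-z0-9\-]+", text))
    score = sum(1 for term in query_terms if term.lower() in words)
    score += sum(2 for cue in CUE_PHRASES if cue in text)
    if re.search(r"\bfigure \d+ shows\b", text):
        score += 1
    if position == n_sentences - 1:  # last sentence of the abstract
        score += 1
    return score


def best_sentence(sentences, query_terms):
    """Return the highest-scoring sentence; ties go to the earlier sentence."""
    n = len(sentences)
    scored = [(score_sentence(s, i, n, query_terms), -i, s) for i, s in enumerate(sentences)]
    return max(scored)[2]


abstract = [
    "GeneX was studied in cultured cells.",
    "Figure 2 shows that GeneX binding increases.",
    "We conclude that GeneX binding requires ProteinY.",
]
print(best_sentence(abstract, ["GeneX", "binding"]))
```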
Document zoning
• Different “sections” or zones of a document
– Introduction vs. methods vs. references, etc.
• Many want to focus on (or exclude) certain zones from search or other processing
• No straightforward way to identify zones
– Journals often have their own structures
– Section titles, HTML/XML/SGML formatting helps
(PubMedCentral DTD)
– Treat as discrimination problem
Karin Verspoor
Extracting factual
information from text
• Information extraction (IE) involves parsing text for
patterns encoding particular facts
– Biomedical literature is full of useful information
potentially amenable to IE (e.g. consequences of
mutations)
– BioCreative 2006/2009 Competitions on extracting
protein-protein interaction statements from literature
• Subtasks:
– Entity identification / normalization
– Finding relationships
– Filling in predefined schemata
Karin Verspoor
Named entity recognition
(again)
• Finding references to particular concepts (e.g.
genes, drugs, diseases) in text
– Difficult because of ambiguity [genes with normal English names,
variations in expression, anaphoric reference, etc.]
Karin Verspoor
http://www.ploscompbiol.org/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.1000361&imageURI=info:doi/10.1371/journal.pcbi.1000361.g003
Variation in expression
• There are a very large number of ways of
expressing a single “concept”
– Morphological: NF Kappa B1, NFKB1, NFKB-1,
NFkB1, NFkB-1, NFK-B1, NFKB-I, NFkB(1)
– Synonyms: KBF1, EBP-1, MGC54151, NFKBp50, NFKB-p105, NF-kappa-B, DKFZp686C01211
– Syntactic: “X regulates Y”, “Y is regulated by
X”, “regulation of Y by X”, “X regulation of Y”,
“Y-regulating X”
• People don’t even tend to notice these…
Karin Verspoor
Ambiguity
• Most important problem in NLP
• Example: Hunk
– Cell type: HUman Natural Killer cells
– Gene: Hormonally Upregulated Neu-associated Kinase
– English word: piece or lump of substance
• Correct construal requires knowledge to interpret:
– “Hunk expression” versus “Hunk phagocytosis”
• Can be structural, too, e.g.
“regulation of cell proliferation and motility”
Karin Verspoor
Gene Normalization (again)
• Mapping a gene or protein name to an identifier (e.g. in GenBank)
• Very important task for using extracted information (more useful than just a name)
• Ambiguity
– with English words (“to” “dunce” “wingless”)
– in naming (1168 genes in Entrez named “p60”)
– in species (949 species have a gene named “p53”)
Karin Verspoor
Normalization methods
• Heuristic approach is necessary
– Edit distance is too coarse (some characters matter more
than others)
• Some heuristics that appear to work
– Ignore hyphens, commas, some other interrupting
punctuation (but not, e.g., ' )
– Ignore parenthetical elements
– Consider translations among Arabic/Roman numerals, and
Latin/Greek letters
– Special words for compound noun phrases: receptor,
precursor, mRNA, gene, protein, Greek letter names, etc.
Karin Verspoor
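A hedged sketch of a few of these heuristics, applied to the NFKB1 spelling variants listed earlier. The Greek and Roman tables are deliberately tiny, the function name is made up, and one choice differs from the slide: instead of ignoring parenthetical elements entirely, this version drops only the brackets so that "NFkB(1)" still matches the other variants.

```python
import re

GREEK = {"alpha": "a", "beta": "b", "gamma": "g", "kappa": "k"}  # partial table
ROMAN = {"I": "1", "II": "2", "III": "3"}                        # partial table


def normalization_key(name):
    """Reduce a gene name to a key that ignores common spelling variation."""
    # Spelled-out Greek letters -> single Latin letters.
    s = re.sub(r"\b(alpha|beta|gamma|kappa)\b",
               lambda m: GREEK[m.group(1).lower()], name, flags=re.IGNORECASE)
    # Trailing Roman numeral after a space or hyphen -> Arabic numeral.
    s = re.sub(r"(?<=[\s\-])(III|II|I)$", lambda m: ROMAN[m.group(1)], s)
    # Case-fold, then drop spaces, hyphens, commas and brackets.
    return re.sub(r"[\s\-,()]", "", s.lower())


variants = ["NF Kappa B1", "NFKB1", "NFKB-1", "NFkB1",
            "NFkB-1", "NFK-B1", "NFKB-I", "NFkB(1)"]
print({normalization_key(v) for v in variants})
# {'nfkb1'}
```

Plain edit distance would treat "NFKB1" and "NFKB2" as closer to each other than "NFKB1" and "NF Kappa B1", which is why character-specific rules like these tend to work better.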
Other entities
• Genes (and their products) are particularly
valuable to recognize, but are not the only
entities of interest:
– Diseases
– Drugs and other treatments
– Anatomical and other locations
– Time and temporal relationships
– Methods and evidence
Karin Verspoor
Recognize what?
• To map texts to unambiguous representations, we need an underlying set of concepts to recognize.
• An Ontology is a set of concepts in a subsumption hierarchy
– If all instances of concept X are also instances of
concept Y, then Y subsumes X. The “is a”
relationship
– Subsumption is a many-to-many relationship
– E.g. “nucleus” is-a “cell component”
Karin Verspoor
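Because a concept can have more than one parent, a subsumption hierarchy is a directed acyclic graph rather than a tree. The sketch below checks subsumption by walking up the is-a links; apart from the slide's "nucleus" and "cell component", the concept names are a made-up toy hierarchy, not the actual Gene Ontology.

```python
# Child -> set of direct parents (is-a links). Hypothetical toy hierarchy.
PARENTS = {
    "nucleus": {"intracellular organelle"},
    "intracellular organelle": {"organelle", "intracellular part"},
    "organelle": {"cell component"},
    "intracellular part": {"cell component"},
}


def subsumes(ancestor, concept, parents=PARENTS):
    """True if `concept` is-a `ancestor`, directly or through intermediate concepts."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return False


print(subsumes("cell component", "nucleus"))  # True
print(subsumes("nucleus", "cell component"))  # False
```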
Open Biological Ontologies
• The Gene Ontology (GO) project started in 2001
– Model organism database annotators agreed on common representation to facilitate sharing
• OBO is extending this to other topics
– Sequence features, cell types, mammalian phenotypes, etc.
From ontologies to
knowledge-bases
• Knowledge-bases (KBs)
– Provide horizontal relationships (“slots”) among concepts
(not just is-a, part-of), e.g.:
• Regulation of cell cycle controls cell cycle
• DNA transcription takes place in the nucleus
– Can be used for inference beyond just inheritance
• E.g. Relationships between molecular function and subcellular
localization can be used to infer missing information
• Many of these relationships can be extracted semiautomatically (need manual verification)
Syntactic parsing
• Groups together words, tags parts of speech.
“This effect of cyclosporin A or herbimycin A on the down-regulation of ERCC-1 correlates with enhanced cytotoxicity of
cisplatin in this system.”
[this effect]NP [of [cyclosporin A]NP]PrepP [or]CONJ [herbimycin A]NP
[on [the down-regulation]NP]PrepP
[of [ERCC-1]NP]PrepP [correlates]V
[with [enhanced cytotoxicity]NP]PrepP
[of [cisplatin]NP]PrepP [in [this system]NP]PrepP
Karin Verspoor
Syntax helps
• 125I-labeled C3b was covalently deposited on CR2, when
hemolytically active 125I-labeled C3 was added to Raji cells
preincubated with iC3, factor B, properdin, and factor D,
thus proving functionality of CR2-bound C3 convertase.
<cr2> BINDS <c3 convertase>
• CD8alpha(alpha) binds one HLA-A2/peptide molecule,
interfacing with the alpha2 and alpha3 domains of HLA-A2 and
also contacting beta2-microglobulin.
<cd8alpha ( alpha )> BINDS <hla a2 / peptide molecule>
• The binding of 109Cd to metallothionein and the thiol density
of the protein were determined after incubation of a purified
Zn/Cd-metallothionein preparation with either hydrogen
peroxide alone, or with a number of free radical generating
systems.
<109cd> BINDS <metallothionein>
• Although these shifts in alpha3 may provide a synergistic
modulation of affinity, the binding of CD8 to MHC is clearly
consistent with an avidity-based contribution from CD8 to
TCR-peptide-MHC interactions.
<Cd8> BINDS <major histocompatibility complex>
Larry Hunter
Coordination is
particularly hard
In contrast both the S4GGnM-R and the Man-R are able to bind Man-BSA.
<mannose receptor> BINDS <man bsa>
<s4ggnm - r> BINDS <man bsa>
Purified recombinant NC1, like authentic NC1, also bound specifically
to fibronectin, collagen type I, and a laminin 5/6 complex.
<authentic nc1> BINDS <laminin 5 / 6 complex>
<authentic nc1> BINDS <collagen type I>
<authentic nc1> BINDS <fibronectin>
<purified recombinant nc1> BINDS <laminin 5 / 6 complex>
<purified recombinant nc1> BINDS <collagen type I>
<purified recombinant nc1> BINDS <fibronectin>
The nonvisual arrestins, beta-arrestin and arrestin3, but not visual
arrestin, bind specifically to a glutathione S-transferase-clathrin
terminal domain fusion protein. *
<Arrestin3> BINDS <glutathione s-transferase-clathrin terminal domain
fusion protein>
<beta arrestin> BINDS <glutathione s-transferase-clathrin terminal
domain fusion protein>
<nonvisual arrestin> BINDS <glutathione s-transferase-clathrin
terminal domain fusion protein>
Documents as evidence of
function or other relationships
• Cooccurrence statistics. How often are two or more genes
(or other entities) mentioned in the same document?
– PubGene is a large database of co-occurrence statistics
(http://www.pubgene.org)
• Functional coherence measure (Altman et al.)
– For each article mentioning a gene from a putatively functional
group, score the article's relevance based on whether similar
articles also mention genes in the group
– Compare the number of high scoring articles that a group
generates to an expected number from random genes.
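The co-occurrence idea can be sketched as pair counting over per-document gene sets. This assumes gene mentions have already been recognized and normalized; the three "documents" are invented, and this is not a description of how PubGene itself is built.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(doc_genes):
    """Count, for each unordered gene pair, the documents mentioning both genes."""
    counts = Counter()
    for genes in doc_genes:
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return counts


# Hypothetical gene sets, one per document.
documents = [
    {"BRCA1", "p53"},
    {"p53", "BRCA1", "HSP60"},
    {"HSP60"},
]
counts = cooccurrence_counts(documents)
print(counts[("BRCA1", "p53")])    # 2
print(counts[("BRCA1", "HSP60")])  # 1
```

Raw counts favour heavily studied genes, so in practice they are usually compared against an expected value, in the spirit of the functional coherence measure above.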
Literature-based groupings
combined with other data
• Using literature-based assessments of groupings or
coherence can improve quality of other clustering tasks
– Chang et al. use literature similarity measures to improve
quality of PSI-BLAST searches for distant homologs
– Blaschke's GEISHA system associates clusters of genes from
expression array experiments with MEDLINE abstracts, extracting
keywords to annotate the gene clusters.
– Masys et al. use UMLS to score subtrees of various hierarchical
medical ontologies, based on how frequently genes in an
expression array cluster are tied to them.
Knowledge-based data analysis
• 3R systems
– Reading: integrate multiple databases & extract knowledge from the literature
– Reasoning: infer additional knowledge and relate the knowledge to data
– Reporting: provide information that helps biologists explain the
phenomena in their data and generate new hypotheses
Slide from Larry Hunter
More projects than people
• Ongoing:
– Coreference resolution
– Temporal relations in journal articles and clinical records
– Epilepsy outcome prediction
– Suicide
– Proteomics of spinal cord injury and regeneration
– Software engineering perspectives on natural language processing
– Odd problems of full text
– Tuberculosis and translational medicine
– Discourse analysis annotation
– OpenDMAP
• In need of fresh blood:
– Metagenomics/Microbiome studies
– Translational medicine from the clinical side
– Summarization
– Negation
– Question-answering: Why?
– Nominalizations
– Metamorphic testing for natural language processing