Using Weakly Labeled Data to
Learn Models for Extracting
Information from Biomedical Text
Mark Craven
Department of Biostatistics & Medical Informatics
Department of Computer Sciences
University of Wisconsin
U.S.A.
[email protected]
www.biostat.wisc.edu/~craven
The Information Extraction Task
Analysis of Yeast PRP20 Mutations and Functional Complementation by the
Human Homologue RCC1, a Protein Involved in the Control of Chromosome
Condensation
Fleischmann M, Clark M, Forrester W, Wickens M, Nishimoto T, Aebi M
Mutations in the PRP20 gene of yeast show a pleiotropic phenotype, in which both mRNA
metabolism and nuclear structure are affected. SRM1 mutants, defective in the same
gene, influence the signal transduction pathway for the pheromone response . . .
By immunofluorescence microscopy the PRP20 protein was localized in the nucleus.
Expression of the RCC1 protein can complement the temperature-sensitive phenotype
of PRP20 mutants, demonstrating the functional similarity of the yeast and
mammalian proteins
protein(PRP20)
subcellular-localization(PRP20, nucleus)
Motivation
• assisting in the construction and updating of
databases
• providing structured summaries for queries
What is known about protein X
(subcellular & tissue localization, associations with
diseases, interactions with drugs, …)?
• assisting scientific discovery by detecting previously
unknown relationships, annotating experimental data
Three Themes in Our
IE Research
1. Using “weakly” labeled training data
2. Representing sentence structure in learned models
3. Combining evidence when making predictions
1. Using “Weakly” Labeled Data
• why use machine learning methods in building
information-extraction systems?
– hand-coding IE systems is expensive, time-consuming
– there is a lot of data that can be leveraged
• where do we get a training set?
– by having someone hand-label data (expensive)
– by coupling tuples in an existing database with
relevant documents (cheap)
“Weakly” Labeled Training Data
• to get positive examples, match DB tuples to
passages of text referencing constants in tuples
[Figure: tuples from the YPD database, e.g. (P1, L1), (P2, L2), (P3, L3), are matched to MEDLINE abstracts whose text mentions both constants of a tuple, e.g. …P1…L1…, …P2…L2…, …L3…P3…]
Weakly Labeled Training Data
• the labeling is weak in that many sentences with
co-occurrences wouldn’t be considered positive
examples if we were hand-labeling them
• consider the sentences associated with the relation
subcellular-localization(VAC8p, vacuole) after weak
labeling
VAC8p is a 64-kD protein found on the vacuole membrane, a site
consistent with its role in vacuole inheritance.
In analogy, VAC8p may link the vacuole to actin during vacuole
partitioning.
In addition to its role in early vacuole inheritance, VAC8p is required
to target aminopeptidase I from the cytoplasm to the vacuole.
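As a concrete illustration, here is a minimal sketch of this co-occurrence labeling step (hypothetical helper name and simple substring matching; not the authors' implementation): a sentence is labeled positive for subcellular-localization(P, L) if it mentions both constants of a database tuple.

```python
import re

def weakly_label(sentences, tuples):
    """Label a sentence positive if it mentions both constants of a DB tuple.

    sentences: sentence strings (e.g. drawn from MEDLINE abstracts)
    tuples: (protein, location) pairs from a database such as YPD
    Returns (positives, negatives).
    """
    positives, negatives = [], []
    for sent in sentences:
        matched = any(
            re.search(re.escape(prot), sent, re.IGNORECASE)
            and re.search(re.escape(loc), sent, re.IGNORECASE)
            for prot, loc in tuples
        )
        (positives if matched else negatives).append(sent)
    return positives, negatives

# e.g. the VAC8p/vacuole tuple from the slide above
pos, neg = weakly_label(
    ["VAC8p is a 64-kD protein found on the vacuole membrane.",
     "Our results suggest that Bed1 is found in the ER."],
    [("VAC8p", "vacuole")],
)
```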
Learning Context Patterns for
Recognizing Protein Names
• We use AutoSlog [Riloff ’96] to find “triggers” that commonly
occur before and after tagged proteins in a training corpus
selections from the training corpus
…gene encoding <p>gamma-glutamyl kinase</p> was…
…recognized genes encoding <p>vimentin</p>, heat…
…found that <p>E2F</p> binds specifically…
…<p>IleRS</p> binds to the acceptor…
…of <p>CPB II</p> binds 1 mol of…
…purified C/<p>EBP</p> binds at the same position…
…which interacts with <p>CD4</p>: both…
…14-3-3tau interacts with <p>protein kinase C mu</p>, a subtype…
encoding [X]         2/4
[X] binds            4/5
interacts with [X]   2/6
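A rough sketch of how such pattern statistics might be tallied (the regex templates and counting scheme below are assumptions for illustration; AutoSlog itself induces patterns from syntactic templates): count how often each trigger fires next to a tagged protein versus how often it fires at all.

```python
import re

# Candidate triggers paired with (fires-next-to-a-tagged-protein, fires-at-all) regexes.
# These regex templates are assumptions for illustration only.
PATTERNS = {
    "encoding [X]":       (r"encoding\s+<p>",       r"encoding\s+\S"),
    "[X] binds":          (r"</p>\s+binds",         r"\S\s+binds"),
    "interacts with [X]": (r"interacts with\s+<p>", r"interacts with\s+\S"),
}

def pattern_stats(corpus, patterns=PATTERNS):
    """Return {pattern: (hits on tagged proteins, total firings)} over a tagged corpus."""
    return {
        name: (len(re.findall(tagged_re, corpus)), len(re.findall(any_re, corpus)))
        for name, (tagged_re, any_re) in patterns.items()
    }

corpus = "...gene encoding <p>gamma-glutamyl kinase</p> was... found that <p>E2F</p> binds specifically..."
for name, (hits, total) in pattern_stats(corpus).items():
    print(f"{name}: {hits}/{total}")
```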
“Weak” Labeling Example
SwissProt dictionary
...
D-AKAP-2
D-amino acid oxidase
D-aspartate oxidase
D-dopachrome tautomerase
…
DAG kinase zeta
DAMOX
DASOX
DAT
DB83 protein
…
PubMed abstract
Two distinct forms of oxidases catalysing the
oxidative deamidation of D-alpha-amino
acids have been identified in human tissues:
<p>D-amino acid oxidase</p> and
<p>D-aspartate oxidase</p>. The
enzymes differ in their electrophoretic
properties, tissue distribution, binding with
flavine adenine dinucleotide, heat stability,
molecular size and possibly in subunit
structure. Neither enzyme exhibits genetic
polymorphism in European populations, but
a rare electrophoretic variant phenotype
(<p>DASOX</p> 2-1) was identified
which suggests that the <p>DASOX</p>
locus is autosomal and independent
of the <p>DAMOX</p>
locus.
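A minimal sketch of this dictionary-based tagging (a hypothetical helper using exact string matching, longest names first; the actual system's matching rules are not described here):

```python
import re

def tag_dictionary_names(abstract, dictionary):
    """Wrap exact dictionary matches in <p>...</p> tags.

    Longer names are matched first so that "D-amino acid oxidase" is tagged
    as one name rather than leaving a shorter substring untagged.
    """
    for name in sorted(dictionary, key=len, reverse=True):
        abstract = re.sub(
            r"(?<![\w>])" + re.escape(name) + r"(?![\w<])",
            lambda m: f"<p>{m.group(0)}</p>",
            abstract,
        )
    return abstract

dictionary = ["D-amino acid oxidase", "D-aspartate oxidase", "DASOX", "DAMOX"]
text = "two oxidases identified: D-amino acid oxidase and D-aspartate oxidase; the DASOX locus is autosomal."
print(tag_dictionary_names(text, dictionary))
```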
Protein Name Extraction
Approach
1. select noun phrases that match AutoSlog patterns (encoding [X], [X] binds, interacts with [X])
   e.g. "Two distinct forms of oxidases catalysing the oxidative deamidation of D-alpha-amino acids have been identified in human tissues: D-amino acid oxidase and ..."
2. classify noun phrases using a naïve Bayes model
3. extract positive classifications, e.g. D-amino acid oxidase
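A minimal sketch of the classification step (the features, smoothing, and class priors below are assumptions for illustration, not the authors' exact model): a naïve Bayes classifier over simple morphological features of a candidate noun phrase.

```python
import math
from collections import defaultdict

def features(phrase):
    """A few simple morphological features of a candidate name (assumed feature set)."""
    return {
        "has_digit": any(c.isdigit() for c in phrase),
        "has_upper": any(c.isupper() for c in phrase),
        "has_hyphen": "-" in phrase,
        "multi_word": " " in phrase,
    }

class NaiveBayesNameClassifier:
    def fit(self, phrases, labels):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        for phrase, y in zip(phrases, labels):
            self.class_counts[y] += 1
            for f, v in features(phrase).items():
                self.feature_counts[y][(f, v)] += 1
        return self

    def predict(self, phrase):
        scores = {}
        for y, n in self.class_counts.items():
            score = math.log(n)  # unnormalized log prior
            for f, v in features(phrase).items():
                # Laplace-smoothed Pr(feature value | class) for binary features
                score += math.log((self.feature_counts[y][(f, v)] + 1) / (n + 2))
            scores[y] = score
        return max(scores, key=scores.get)

clf = NaiveBayesNameClassifier().fit(
    ["D-amino acid oxidase", "DASOX", "the enzymes", "human tissues"],
    ["protein", "protein", "other", "other"],
)
print(clf.predict("D-aspartate oxidase"))  # expected: "protein"
```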
Experimental Evaluation
• hypothesis: we get more accurate models by using
weakly labeled data in addition to manually labeled
data
• models use Autoslog-induced context patterns +
naïve Bayes on morphological/syntax features of
candidate names
• compare predictive accuracy resulting from
– fixed amount of hand-labeled data
– varying amounts of weakly labeled data + hand-labeled data
Extraction Accuracy:
Yapex Data Set
[Precision-recall plot: precision = TP/(TP+FP) vs. recall = TP/(TP+FN), both from 0 to 1. Curves: NB model only; NB + AutoSlog with 0, 90, 2,000, and 25,100 weakly labeled abstracts.]
Extraction Accuracy:
Texas Data Set
[Precision-recall plot: precision = TP/(TP+FP) vs. recall = TP/(TP+FN), both from 0 to 1. Curves: NB model only; NB + AutoSlog with 0, 1,800, 2,000, and 25,100 weakly labeled abstracts.]
2. Representing Sentence
Structure in Learned Models
• hidden Markov models (HMMs) have proven to be
perhaps the best family of methods for learning IE
models
• typically these HMMs have a “flat” structure, and are
able to represent relatively little about grammatical
structure
• how can we provide HMMs with more information about
sentence structure?
Hidden Markov Models:
Example
[State-transition diagram: states q1–q5 between a start and an end state, each annotated with transition probabilities and word-emission probabilities for words such as "the", "protein", and "bed1"; the example evaluates Pr("... the Bed1 protein ..." | ... q1, q4, q2 ...).]
Hidden Markov Models for
Information Extraction
• there are efficient algorithms for doing the following
with HMMs:
– determining the likelihood of a sentence given a
model
– determining the most likely path through a model
for a sentence
– setting the parameters of the model to maximize
the likelihood of a set of sentences
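As an illustration of the second item, here is a textbook log-space Viterbi decoder for a discrete-emission HMM (a generic sketch, not the models described later); the most probable path is what later determines which phrases fall in the labeled extraction states.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for an observation sequence (log-space Viterbi).

    obs: emitted symbols (e.g. words or phrases)
    start_p[s], trans_p[s][t], emit_p[s][o]: the model's probabilities
    """
    def lp(x):  # log probability, guarding against zero
        return math.log(x) if x > 0 else float("-inf")

    V = [{s: lp(start_p[s]) + lp(emit_p[s].get(obs[0], 0.0)) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: V[t - 1][r] + lp(trans_p[r][s]))
            V[t][s] = V[t - 1][prev] + lp(trans_p[prev][s]) + lp(emit_p[s].get(obs[t], 0.0))
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path, V[-1][last]
```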
Representing Sentences
• we first process sentences by analyzing them with a
shallow parser (Sundance, [Riloff et al., 98])
[Shallow-parse tree for "Our results suggest that protein Bed1 is found in the ER": the sentence splits into two clauses; the clauses into noun, verb, and prepositional phrases; and the phrases into tagged words (adjective, noun, verb, cop, prep, art, unk, c_m).]
Hierarchical HMMs for IE
(Part 1)
[Ray & Craven, IJCAI 01; Skounakis et al., IJCAI 03]
• states have types and emit phrases
• some states have labels (PROTEIN, LOCATION)
• our models have ~25 states at this level
[Phrase-level state diagram: START and END states connected through typed states (NP-SEGMENT, PREP), with one NP-SEGMENT labeled PROTEIN and another labeled LOCATION.]
Hierarchical HMMs for IE
(Part 2)
[Two phrase-level models: a positive model (START and END connected through NP-SEGMENT and PREP states, including the PROTEIN and LOCATION labeled segments) and a null model (START and END connected through unlabeled NP-SEGMENT and PREP states only).]
Hierarchical HMMs for IE
(Part 3)
[Word-level structure nested inside the phrase-level states: an unlabeled NP-SEGMENT expands to START → ALL → END, where ALL holds a word-emission distribution (e.g. Pr(the) = 0.0003, Pr(and) = 0.0002, …, Pr(cell) = 0.0001); the LOCATION-labeled NP-SEGMENT expands to word-level states BEFORE, LOCATION, BETWEEN, and AFTER between START and END.]
Hierarchical HMMs
[Diagram: the hierarchical model emitting ". . . is found in the ER": a VP-SEGMENT emits "is found" through its word-level ALL state, a PP-SEGMENT emits "in", and the LOCATION-labeled NP-SEGMENT emits "the ER" through its word-level states (BEFORE, LOCATION, BETWEEN, AFTER), with "ER" generated by the LOCATION word state.]
Extraction with our HMMs
• extract a relation instance if
– sentence is more probable
under positive model
– Viterbi (most probable)
path goes through special
extraction states
[Diagram: the positive model (START and END connected through NP-SEGMENT and PP-SEGMENT states, including the PROTEIN and LOCATION labeled segments) and the null model (unlabeled NP-SEGMENT and PP-SEGMENT states only).]
Representing More Local Context
• we can have the word-level states represent more about
the local context of each emission
• partition the sentence into overlapping trigrams
"... the/ART Bed1/UNK protein/N is/COP located/V ..."
each position contributes a trigram of word/tag pairs (w_{-1}, p_{-1}), (w_0, p_0), (w_{+1}, p_{+1}) for the previous, current, and next token, so adjacent trigrams overlap
Representing More Local Context
• states emit trigrams t = ⟨w_{-1}, w_0, w_{+1}, p_{-1}, p_0, p_{+1}⟩ with probability:
Pr(t) = Pr(w_{-1}) Pr(w_0) Pr(w_{+1}) Pr(p_{-1}) Pr(p_0) Pr(p_{+1})
• note the independence assumption above: we
compensate for this naïve assumption by using a
discriminative training method [Krogh ’94] to learn
parameters
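A small sketch of this factored emission probability (the per-position distribution layout is an assumption for illustration; computed in log space for stability):

```python
import math

def trigram_log_emission(state, trigram, floor=1e-6):
    """Log-probability of a state emitting ((w-1, p-1), (w0, p0), (w+1, p+1)).

    state["word"][pos] and state["pos"][pos], for pos in (-1, 0, +1), are
    independent distributions over words and part-of-speech tags (assumed layout).
    """
    total = 0.0
    for pos, (word, tag) in zip((-1, 0, 1), trigram):
        total += math.log(state["word"][pos].get(word, floor))
        total += math.log(state["pos"][pos].get(tag, floor))
    return total

# e.g. a toy state emitting the middle trigram of "... the Bed1 protein ..."
state = {
    "word": {-1: {"the": 0.007}, 0: {"bed1": 0.001}, 1: {"protein": 0.02}},
    "pos":  {-1: {"ART": 0.2},   0: {"UNK": 0.1},    1: {"N": 0.3}},
}
print(trigram_log_emission(state, [("the", "ART"), ("bed1", "UNK"), ("protein", "N")]))
```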
Experimental Evaluation
• hypothesis: we get more accurate models by using a
richer representation of sentence structure in HMMs
• compare predictive accuracy of various types of representations (listed from most to least grammatical information):
– hierarchical w/ context features
– hierarchical
– phrases
– tokens w/ part of speech
– tokens
• 5-fold cross validation on 3 data sets
Weakly Labeled Data Sets for
Learning to Extract Relations
• subcellular_localization(PROTEIN, LOCATION)
– YPD database
– 769 positive, 6193 negative sentences
– 939 tuples (402 distinct)
• disorder_association(GENE, DISEASE)
– OMIM database
– 829 positive, 11685 negative sentences
– 852 tuples (143 distinct)
• protein_protein_interaction(PROTEIN, PROTEIN)
– MIPS database
– 5446 positive, 41377 negative sentences
– 8088 tuples (819 distinct)
Extraction Accuracy (YPD)
[Precision-recall plot: precision = TP/(TP+FP) vs. recall = TP/(TP+FN), both from 0 to 1. Curves: Context HHMMs, HHMMs, Phrase HMMs, POS HMMs, Token HMMs.]
Extraction Accuracy (MIPS)
[Precision-recall plot, same axes and curves as above: Context HHMMs, HHMMs, Phrase HMMs, POS HMMs, Token HMMs.]
Extraction Accuracy (OMIM)
[Precision-recall plot, same axes and curves as above: Context HHMMs, HHMMs, Phrase HMMs, POS HMMs, Token HMMs.]
3. Combining Evidence
when Making Predictions
• in processing a large corpus, we are likely to see the
same entities, relations in multiple places
• in making extractions, we should combine evidence
across the different occurrences/contexts in which we see
an entity/relation
Combining Evidence:
Organizing Predictions into Bags
[Table: occurrences grouped into a bag for the name "CAT", each with a predicted and an actual label (the labels are not recoverable here): "CAT is a 64-kD protein…", "…the cat activated the mouse...", "CAT was established to be…", "…were removed from cat brains."]
let n_b be the number of instances in bag b,
    p_b the number of positive predictions, and
    a_b the number of actual positives
Combining Evidence when
Making Predictions
• given a bag of predictions, estimate the probability
that the bag contains at least one actual positive
example:
\Pr(a_b > 0 \mid n_b, p_b) = \frac{\sum_{j=1}^{n_b} \Pr(p_b \mid a_b = j, n_b)\,\Pr(a_b = j \mid n_b)}{\sum_{i=0}^{n_b} \Pr(p_b \mid a_b = i, n_b)\,\Pr(a_b = i \mid n_b)}
Combining Evidence:
Estimating Relevant Probabilities
(the same expression as on the previous slide, with its terms annotated)
• Pr(p_b | a_b = j, n_b): can be modeled with two binomial distributions based on the estimated TP-rate and FP-rate of the model
• Pr(a_b = j | n_b): can do something simple here (e.g. assume uniform priors), or estimate it from data with a few assumptions
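A small sketch of this computation under the binomial model just described (a uniform prior over a_b and fixed TP-rate/FP-rate estimates are assumptions for illustration):

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n independent trials."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_predictions(p_b, a, n_b, tp_rate, fp_rate):
    """Pr(p_b positive predictions | a actual positives among n_b instances):
    sum over how many of the positive predictions are true positives."""
    return sum(
        binom_pmf(k, a, tp_rate) * binom_pmf(p_b - k, n_b - a, fp_rate)
        for k in range(0, min(a, p_b) + 1)
    )

def prob_bag_positive(n_b, p_b, tp_rate=0.7, fp_rate=0.05):
    """Pr(a_b > 0 | n_b, p_b), assuming a uniform prior over a_b in {0, ..., n_b}."""
    prior = 1.0 / (n_b + 1)
    numer = sum(prob_predictions(p_b, j, n_b, tp_rate, fp_rate) * prior for j in range(1, n_b + 1))
    denom = sum(prob_predictions(p_b, i, n_b, tp_rate, fp_rate) * prior for i in range(0, n_b + 1))
    return numer / denom

# e.g. a bag of 4 occurrences, 2 of which were predicted positive
print(prob_bag_positive(n_b=4, p_b=2))
```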
Evidence Combination:
Protein-Protein Interactions
[Precision-recall plot comparing evidence-combination methods: BECIP, Soft-Count, Noisy-OR, WM, Soft-OR, NC.]
Evidence Combination:
Protein Names
[Precision-recall plot comparing the same evidence-combination methods: BECIP, Soft-Count, Noisy-OR, WM, Soft-OR, NC.]
Conclusions
• machine learning methods provide a means for
learning/refining models for information extraction
• learning is inexpensive when unlabeled/weakly labeled
sources can be exploited
– learning context patterns for protein names
– learning HMMs for relation extraction
• we can learn more accurate models by giving HMMs
more information about syntactic structure of
sentences
– hierarchical HMMs
• we can improve the precision of our predictions by
carefully combining evidence across extractions
Acknowledgments
my graduate students
Soumya Ray
Burr Settles
Marios Skounakis
NIH/NLM grant 1R01 LM07050-01
NSF CAREER grant IIS-0093016