Predicting Gene Functions from Text Using a Cross-Species Approach
Emilia Stoica and Marti Hearst
School of Information
University of California, Berkeley
Research Supported by NSF DBI-0317510 and a gift from Genentech
Annotate genes with functional
information derived from journal articles.
Gene Ontology (GO)
• Gene Ontology (GO): a controlled vocabulary for functional annotation
• ~17,600 terms (circa July 2004)
• Organized into 3 distinct acyclic graphs:
  - molecular functions
  - biological processes
  - cellular locations
• More general terms are “parents” of less general ones, e.g. development (GO:0007275) is the parent of embryonic development (GO:0001756)
• GO tokens might not appear explicitly
  Example: PubMed 10692450
  GO:0008285: negative regulation of cell proliferation
  Occurs as:
  inhibition of cell proliferation
• GO tokens might not occur contiguously
  Example: PubMed 10734056
  G-protein coupled receptor protein signaling pathway
  Occurs as:
  Results indicate that CCR1-mediated responses are regulated … in the signaling pathway, by receptor phosphorylation at the level of receptor G/protein coupling … CCR1 binds MIP-1 alpha.
• The simplest strategy (assigning GO codes to genes simply because the GO tokens occur near the gene) yields a large number of false positives.
• Issues:
  a) The text does not contain evidence to support the annotation.
  b) The text contains evidence for the annotation, but the curator knows the gene to be involved in a function that is more general or more specific than the GO code matched in text.
• GO contains hints about what kinds of evidence are required for annotation, e.g.:
  The text should mention co-purification, co-immunoprecipitation experiments
• Requiring these evidence terms does not seem to improve algorithms.
Related Work
• Mainly in the context of the BioCreative competition (2004)
• Chiang and Yu 2003, 2004:
  - Find phrase patterns commonly used in sentences describing gene functions (e.g., “gene plays an important role in”, “gene is involved in”)
  - Final assignments made with a Naïve Bayes classifier
• Ray and Craven 2004, 2005:
  - Learn a statistical model for each GO code (which words are likely to co-occur in the paragraphs containing GO codes)
  - Decide among candidates via a multinomial Naïve Bayes classifier
• Rice et al. 2004:
  - Train an SVM for each GO code
  - Target genes are assigned the best-scoring GO code
Related Work, cont.
• Couto et al. 2004:
  - Determine if the “information content” of the matching GO terms is larger than for all the candidate GO terms
• Verspoor et al. 2004:
  - Expand GO tokens with words that frequently co-occur in a training set; use a categorizer that explores the structure of the Gene Ontology to find best hits
• Ehler and Ruch 2004:
  - Treat each document as a query to be categorized
  - Create a score based on a combination of pattern matching and TF*IDF weighting
  - Annotate gene with top-scoring GO codes
Our Approach
• Two main contributions:
  - Use cross-species information: Cross-Species Match (CSM)
  - Check for biological (in)consistencies: Cross-Species Correlation (CSC)
Cross-Species Match
Main Idea
• Use orthologous genes
  [Genes of different species that have evolved directly from a common ancestor.]
  Since there is an overlap between the genomes of the two species, their orthologs may share some functions, and consequently some GO codes
• Idea: to predict GO codes for target genes in the target species, use the GO codes assigned to their orthologous genes
  We use Mouse vs. Human genes
General procedure
• Analyze text at the sentence level
• Eliminate stop words and punctuation characters, and divide the text into tokens using space as the delimiter
• Normalize and match different variations of gene names using the algorithm of Bhalotia et al. ’03
• For every sentence that contains the target gene:
  A GO code is matched if the fraction of its GO tokens found in the sentence reaches a threshold (0.75 for CSM and 1 for CSC)
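The sentence-level matching step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stop-word list and the function names `tokenize`, `go_match_fraction` and `sentence_matches` are hypothetical, the threshold is read as "at least", and gene-name handling is a plain token lookup rather than the Bhalotia et al. normalization.

```python
import re

# Hypothetical, abbreviated stop-word list; the slides do not say which list was used.
STOP_WORDS = {"a", "an", "and", "at", "by", "in", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace, drop stop words."""
    cleaned = re.sub(r"[^\w\s-]", " ", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def go_match_fraction(sentence, go_term):
    """Fraction of the GO term's tokens that occur anywhere in the sentence."""
    sentence_tokens = set(tokenize(sentence))
    go_tokens = tokenize(go_term)
    if not go_tokens:
        return 0.0
    found = sum(1 for tok in go_tokens if tok in sentence_tokens)
    return found / len(go_tokens)


def sentence_matches(sentence, gene, go_term, threshold):
    """True if the sentence names the gene and reaches the GO-token threshold."""
    if gene.lower() not in set(tokenize(sentence)):
        return False
    return go_match_fraction(sentence, go_term) >= threshold


if __name__ == "__main__":
    s = "vhl inhibits transcription elongation, mRNA stability and PKC activity"
    print(sentence_matches(s, "vhl", "transcription", 1.0))  # True
    # Paraphrased GO terms fall below the threshold:
    print(go_match_fraction("inhibition of cell proliferation",
                            "negative regulation of cell proliferation"))  # 0.5
```

The second call shows why plain token overlap misses paraphrases such as "inhibition of cell proliferation": only half of the GO tokens are present.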
Cross-Species Match
• CSM(g, a): For a target gene g, search in article a for only the GO codes annotated to its ortholog
• If at least 75% of the GO code terms are found in a sentence containing the gene name, the code is assigned
• Note: we must eliminate annotations of orthologs marked with IEA (inferred from electronic annotation) and ISS (inferred from sequence similarity) codes to avoid circularity
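A rough sketch of CSM(g, a) under simplifying assumptions: the ortholog's annotations arrive as (GO id, evidence code) pairs, the article is already split into sentences, and the gene name appears as a single token. The function names, the data layout and the example annotations are hypothetical.

```python
import re

# Ortholog annotations with these evidence codes are dropped before matching.
EXCLUDED_EVIDENCE = {"IEA", "ISS"}


def words(sentence):
    """Set of lowercase word tokens in a sentence (simplified tokenizer)."""
    return set(re.findall(r"[\w-]+", sentence.lower()))


def token_fraction(sentence_words, go_name):
    """Fraction of the GO name's tokens present in the sentence."""
    go_tokens = go_name.lower().split()
    if not go_tokens:
        return 0.0
    return sum(tok in sentence_words for tok in go_tokens) / len(go_tokens)


def csm(gene, sentences, ortholog_annotations, go_names, threshold=0.75):
    """Return the ortholog's GO ids supported by a sentence that names the gene."""
    candidates = {go_id for go_id, evidence in ortholog_annotations
                  if evidence not in EXCLUDED_EVIDENCE}
    assigned = set()
    for sentence in sentences:
        sentence_words = words(sentence)
        if gene.lower() not in sentence_words:
            continue  # only sentences containing the gene name are considered
        for go_id in candidates:
            if token_fraction(sentence_words, go_names[go_id]) >= threshold:
                assigned.add(go_id)
    return assigned


if __name__ == "__main__":
    # Hypothetical ortholog annotations, for illustration only.
    annotations = [("GO:0006350", "IDA"), ("GO:0008285", "IEA"), ("GO:0005576", "TAS")]
    names = {
        "GO:0006350": "transcription",
        "GO:0008285": "negative regulation of cell proliferation",
        "GO:0005576": "extracellular",
    }
    article = [
        "vhl inhibits transcription elongation, mRNA stability and PKC activity.",
        "Cell proliferation was not measured.",
    ]
    print(csm("vhl", article, annotations, names))  # {'GO:0006350'}
```

Restricting the search to the ortholog's codes keeps the candidate set small, which is what limits false positives compared with matching every GO term against every sentence.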
Cross-Species Correlation
Main Idea
• Observation:
  - Since GO codes indicate gene function, it is logical for some to often co-occur in annotations and for others to rarely do so.
• Assumption:
  - If one GO code tends to occur in the orthologous genes’ annotations when another one does not, then assume the second is not a valid assignment for the target species
• Example:
  - If text seems to contain evidence for rRNA transcription (GO:0009303), nucleolus (GO:0005730) and extracellular (GO:0005576), then extracellular is suspicious.
• The algorithm identifies the “suspicious” cases.
Cross-Species Correlation
• For every pair of GO codes in the orthologous genes database, compute a χ² coefficient.
• N: the total number of GO codes
• O11: # of times the ortholog is annotated with both GO1 and GO2
• O12: # of times the ortholog is annotated with GO1 but not GO2
• O21: # of times the ortholog is annotated with GO2 but not GO1
• O22: # of times the ortholog is annotated with neither GO1 nor GO2

$$\chi^2 = \frac{N \, (O_{11} O_{22} - O_{12} O_{21})^2}{(O_{11} + O_{12})(O_{11} + O_{21})(O_{12} + O_{22})(O_{21} + O_{22})}$$
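As a small worked illustration, the coefficient can be computed from the four counts of a 2×2 contingency table. This sketch assumes N is the sum of the four cells, as in the standard 2×2 chi-square statistic; the function name `chi_square` is hypothetical.

```python
def chi_square(o11, o12, o21, o22):
    """Chi-square coefficient for a 2x2 table of GO code co-annotation counts.

    o11: annotated with both GO1 and GO2
    o12: annotated with GO1 but not GO2
    o21: annotated with GO2 but not GO1
    o22: annotated with neither
    """
    n = o11 + o12 + o21 + o22  # assumption: N is the total of the four cells
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denominator == 0:
        return 0.0  # a row or column is empty, so no association can be measured
    return n * (o11 * o22 - o12 * o21) ** 2 / denominator


if __name__ == "__main__":
    print(chi_square(10, 0, 0, 10))  # 20.0: the two codes always occur together
    print(chi_square(5, 5, 5, 5))    # 0.0: no association
```

With one degree of freedom, a value above 3.841 is significant at the 0.05 level; the slides do not state which significance level was used.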
Cross-Species Correlation
• M(g,a) = GO codes matched in article a for gene g
• O(g) = GO codes assigned to the ortholog of g
• o = size of O(g), p = percentage (0.2)
• For every potentially matching GO code GO1 in M(g,a):
  - For every GO code GO2 in O(g):
    Count how often χ²(GO1,GO2) is significant
  - If this count is < p*o then assume GO1 is not valid.
  - Else assign GO1 to g
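The filtering loop above might look like the following sketch. It assumes the pairwise coefficients are precomputed and stored in a dictionary keyed by unordered GO code pairs, and that "significant" means exceeding the 0.05 critical value for one degree of freedom; both the data layout and the significance level are assumptions, and the coefficient value in the example is made up.

```python
CRITICAL_VALUE = 3.841  # chi-square, 1 degree of freedom, alpha = 0.05 (assumed level)


def csc_filter(matched, ortholog_codes, chi2, p=0.2, critical=CRITICAL_VALUE):
    """Split M(g, a) into (assigned, rejected) GO codes.

    matched:        GO codes matched in the article for gene g, i.e. M(g, a)
    ortholog_codes: GO codes assigned to the ortholog of g, i.e. O(g)
    chi2:           dict mapping frozenset({GO1, GO2}) to a precomputed coefficient
    """
    o = len(ortholog_codes)
    assigned, rejected = [], []
    for go1 in matched:
        # Count the ortholog codes whose association with GO1 is significant.
        count = sum(1 for go2 in ortholog_codes
                    if chi2.get(frozenset((go1, go2)), 0.0) > critical)
        if count < p * o:
            rejected.append(go1)  # too few significant pairs: GO1 is suspicious
        else:
            assigned.append(go1)
    return assigned, rejected


if __name__ == "__main__":
    ortholog = ["GO:0009303", "GO:0006350"]
    matched = ["GO:0006350", "GO:0005576"]
    # Hypothetical coefficient, for illustration only.
    chi2 = {frozenset(("GO:0009303", "GO:0006350")): 12.5}
    print(csc_filter(matched, ortholog, chi2))
    # (['GO:0006350'], ['GO:0005576'])
```

Here o = 2 and p*o = 0.4, so a matched code needs at least one significant pairing with the ortholog's codes to be kept.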
Information Flow
Evaluation using BioCreative
• Task 2.2:
  - Annotate 138 human genes with GO codes using 99 full text articles
  - For each annotation, provide the passage of text that the annotation was based upon
• Annotations from participants were manually judged by human curators
• A prediction was considered “perfect” if the text contained the gene name and provided evidence for annotating the gene with the GO code
Results on BioCreative
• Our research was conducted after the competition had passed, so our annotations could not be judged by the same curators
• Used the “perfect predictions” as the gold standard (unfair to our system; ignores relevant predictions we find that other systems do not)
• Our prediction is correct if it matches a perfect prediction (e.g., vhl is annotated with transcription (GO:0006350) in PubMed 12169961: “vhl inhibits transcription elongation, mRNA stability and PKC activity”)
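Scoring against the set of perfect predictions can be sketched as below. The representation of a prediction as a (gene, GO code, PubMed id) triple, the function name `evaluate`, and the use of the balanced F-measure are assumptions; the second gold entry and the second prediction in the example are made up.

```python
def evaluate(predictions, gold):
    """Return (true positives, precision, recall, F-measure) for two sets of triples."""
    tp = len(predictions & gold)  # a prediction is correct only if it is in the gold set
    precision = tp / len(predictions) if predictions else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return tp, precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return tp, precision, recall, f_measure


if __name__ == "__main__":
    gold = {
        ("vhl", "GO:0006350", "12169961"),
        ("genex", "GO:0008285", "10692450"),  # hypothetical gold entry
    }
    predictions = {
        ("vhl", "GO:0006350", "12169961"),
        ("vhl", "GO:0005576", "12169961"),    # hypothetical wrong prediction
    }
    print(evaluate(predictions, gold))  # (1, 0.5, 0.5, 0.5)
```

Because only previously judged perfect predictions count, a correct annotation that no competition system submitted is scored as a false positive, which is the unfairness noted above.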
BioCreative Results
Precision TP (Recall)
16 (0.07)
44 (0.19)
51 (0.21)
Ray and Craven
Chiang and Yu
Ehler and Ruch
52 (0.22)
37 (0.16)
78 (0.33)
Couto et al.
Verspoor et al.
Rice et al.
58 (0.25)
19 (0.08)
16 (0.07)
Results on Larger Dataset
• A much larger test set has been made publicly available by Chiang and Yu.
• EBI human test set
  - 4,410 genes
  - 13,626 GO code annotations
• MGI mouse test set
  - 2,188 genes
  - 6,338 GO code annotations
• Note that Chiang and Yu used the same data for both training and testing.
Results on EBI Human and MGI datasets
• EBI human: 4,410 genes and 5,714 abstracts
• MGI: 2,188 genes and 1,947 abstracts
Dataset System
Precision Recall
Chiang and Yu
Chiang and Yu
Conclusions and Future Work
• We propose an algorithm that annotates genes with GO codes using the information available from other species
• Experimental results on three datasets show that our algorithm consistently achieves higher F-measures than other solutions
• Future improvements to our algorithm:
  - combine, or use a voting scheme between, the predictions our system makes and the predictions of a machine learning approach
  - investigate how effective other genes with sequences similar to the target gene (but not orthologous to it) are for predicting the GO codes
Thank you!