Annotating Gene List From Literature
Xin He
Department of Computer Science, UIUC
Motivation



Biologists often need to understand what a list of genes has in common (e.g. whether the genes are involved in the same pathway).
These gene lists typically come from clustering results on microarray expression data.
Given a list of gene names, is there an automatic way to find the common themes from literature articles?
Related Work



The most popular approach is based on analysis of the GO (Gene Ontology) terms associated with genes.
Method: each gene is associated with a set of GO terms. Find the GO terms that are overrepresented in the input list.
Hypergeometric test: p-value of a GO term

$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}$$

N: total number of genes
M: total number of genes annotated with this term
n: number of genes in the list
k: number of genes in the list annotated with this term
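The p-value above can be computed directly from its definition. The following is a minimal sketch using only the Python standard library; the function name `go_term_pvalue` is hypothetical, and the arguments follow the symbols N, M, n, k defined above.

```python
from math import comb

def go_term_pvalue(N, M, n, k):
    """Hypergeometric p-value P = 1 - sum_{i=0}^{k-1} C(M,i) C(N-M,n-i) / C(N,n).

    N: total number of genes
    M: genes annotated with the GO term
    n: genes in the input list
    k: genes in the input list annotated with the term (assumes k <= n)
    """
    total = comb(N, n)
    # Probability mass of seeing fewer than k annotated genes in the list.
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible configurations contribute nothing.
    below = sum(comb(M, i) * comb(N - M, n - i) for i in range(k))
    return 1 - below / total
```

For example, `go_term_pvalue(6000, 40, 6, 4)` gives the chance of seeing at least 4 annotated genes in a list of 6 when 40 of 6000 genes carry the term (the numbers are made up for illustration). A small p-value suggests the term is overrepresented in the list.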
Problems with GO-based Approach


GO cannot cover all the important concepts in the literature. E.g. GO has relatively low coverage of behavior terms (compared with a specialized behavior ontology).
The associations between genes and concepts change very rapidly. E.g. new functions of known genes are constantly being found.
Text-based Gene List Annotation

Hypothesis testing approach:
    find terms that are overrepresented for each gene: Poisson distribution
    find common terms across the gene list: hypergeometric distribution
Comparative text mining approach: find the common themes in multiple collections (one for each gene)
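The per-gene step of the hypothesis testing approach could look like the sketch below. The slides name only the Poisson distribution, so the null model here is an assumption: a term is taken to occur in a gene's articles at its background rate, and the p-value is the Poisson tail probability of the observed count. The function names and parameters are hypothetical.

```python
from math import exp

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    term = exp(-lam)          # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term           # add P(X = i)
        term *= lam / (i + 1) # advance to P(X = i + 1)
    return max(0.0, 1.0 - cdf)

def term_pvalue(count_in_gene_docs, gene_doc_tokens, background_rate):
    """p-value that a term is overrepresented in one gene's articles.

    Assumed null model: the term count is Poisson with mean
    background_rate * gene_doc_tokens, where background_rate is the
    term's per-token frequency in a background corpus.
    """
    lam = background_rate * gene_doc_tokens
    return poisson_tail(count_in_gene_docs, lam)
```

Terms that pass a significance threshold for a gene can then be treated as that gene's annotations, and the hypergeometric test can be applied across the gene list in the same way as for GO terms.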
Comparative Text Mining



For each gene, find a collection of articles that discuss this gene.
Each article in a collection is modeled as a mixture of two word distributions: a theme common to all collections, and a collection-specific theme.
Parameter estimation in the mixture model: the standard EM algorithm.
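The EM estimation can be illustrated with a simplified sketch. It is not the exact model from the slides: it assumes the articles of each gene are pooled into one bag of words, that the mixing weight `lam` of the common theme is fixed rather than estimated, and all names (`fit_common_theme`, `normalize`) are hypothetical.

```python
from collections import Counter

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    z = sum(weights.values())
    return {w: (v / z if z > 0 else 0.0) for w, v in weights.items()}

def fit_common_theme(collections, lam=0.5, iters=50):
    """EM for a two-component mixture per collection.

    collections: list of word -> count mappings, one per gene.
    lam: assumed fixed probability that a word comes from the common theme.
    Returns (common, specific): the common word distribution and a list
    of collection-specific word distributions.
    """
    pooled = Counter()
    for counts in collections:
        pooled.update(counts)
    # Initialize from relative frequencies.
    common = normalize(pooled)
    specific = [normalize(counts) for counts in collections]

    for _ in range(iters):
        new_common = dict.fromkeys(common, 0.0)
        new_specific = []
        for counts, theme in zip(collections, specific):
            expected = {}
            for w, c in counts.items():
                # E-step: posterior that word w was drawn from the common theme.
                from_common = lam * common[w]
                from_specific = (1.0 - lam) * theme[w]
                denom = from_common + from_specific
                post = from_common / denom if denom > 0 else 0.5
                # Accumulate expected counts for the M-step.
                new_common[w] += c * post
                expected[w] = c * (1.0 - post)
            new_specific.append(normalize(expected))
        # M-step: re-estimate both kinds of themes from expected counts.
        common = normalize(new_common)
        specific = new_specific
    return common, specific
```

Sorting `common` by probability gives the top words of the shared theme. Words that occur in every gene's collection accumulate expected counts from all collections, so they tend to rise in the common theme, while words confined to one gene are absorbed by that gene's specific theme.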
Results: Pelle System


Pelle system in Drosophila: Spätzle, Toll, Pelle, Tube, Cactus, Dorsal
Among the top-50 words: signaling, pathway, receptor, embryo, ventral, dorsoventral, patterning, embryonic
Results: MET cluster


MET cluster from yeast cell-cycle data: MET28, MET14, MET16, MET10, MET2, MUP1
Among the top-50 words: amino, met25, sulphite
Problems and Plan

Many common words (such as stop words) appear in the top list; the word distributions are not properly normalized.
    Using the entire Medline corpus as background: not working
    Hypothesis testing approach as an alternative
Single words are not very suggestive.
    Phrase extraction as a postprocessing step