Annotating Gene List From Literature
Xin He
Department of Computer Science, UIUC

Motivation
Biologists often need to understand what a list of genes has in common (e.g., whether the genes are involved in the same pathway).
These gene lists typically come from clustering results on microarray expression data.
Given a list of gene names, is there an automatic way to find the common themes in the literature?

Related Work
The most popular approach is based on analyzing the GO terms associated with the genes.
Method: each gene is associated with a set of GO terms; find the GO terms that are overrepresented in the input list.
Hypergeometric test: p-value of a GO term

\[
P = 1 - \sum_{i=0}^{k-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}
\]

N: total number of genes
M: total number of genes annotated with this term
n: number of genes in the list
k: number of genes in the list annotated with this term

Problems with GO-based Approach
GO cannot cover all the important concepts in the literature. E.g., GO has relatively low coverage of behavior terms (compared with a specialized behavior ontology).
The associations between genes and concepts change very rapidly. E.g., new functions of known genes are constantly being found.
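As a minimal sketch of the hypergeometric test described under Related Work, the p-value of a GO term can be computed with exact integer arithmetic. The function name `go_term_p_value` and the example numbers are illustrative assumptions, not taken from the slides.

```python
from math import comb


def go_term_p_value(N, M, n, k):
    """Probability of seeing at least k annotated genes in a list of n genes.

    N: total number of genes
    M: total number of genes annotated with the term
    n: number of genes in the list
    k: number of genes in the list annotated with the term
    """
    if not (0 <= M <= N and 0 <= n <= N and 0 <= k <= min(M, n)):
        raise ValueError("need 0 <= M, n <= N and 0 <= k <= min(M, n)")
    total = comb(N, n)
    # Sum of the terms for i = 0 .. k-1, kept as exact integers.
    below = sum(comb(M, i) * comb(N - M, n - i) for i in range(k))
    # Equivalent to 1 - below / total, but avoids floating-point cancellation.
    return (total - below) / total


# Hypothetical example: 6000 genes, 20 annotated with the term,
# a list of 6 genes of which 4 carry the annotation.
print(go_term_p_value(N=6000, M=20, n=6, k=4))
```

A small p-value suggests the term is overrepresented in the list. When many GO terms are tested at once, some multiple-testing correction would normally be applied on top of this.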
Text-based Gene List Annotation
Hypothesis testing approach:
  find terms that are overrepresented for each gene: Poisson distribution
  find terms common across the gene list: hypergeometric distribution
Comparative text mining approach:
  find the common themes in multiple collections (one for each gene)

Comparative Text Mining
For each gene, find a collection of articles that discuss this gene.
Each article in a collection is a mixture of two distributions: a theme common to all collections, and a collection-specific theme.
Parameter estimation in the mixture model: the standard EM algorithm.

Results: Pelle System
Pelle system in Drosophila: Spätzle, Toll, Pelle, Tube, Cactus, Dorsal
Among the top-50 words: signaling, pathway, receptor, embryo, ventral, dorsoventral, patterning, embryonic

Results: MET cluster
MET cluster from yeast cell-cycle data: MET28, MET14, MET16, MET10, MET2, MUP1
Among the top-50 words: amino, met25, sulphite

Problems and Plan
Many common words (such as stop words) appear in the top list; the word scores are not properly normalized.
  Using the entire Medline corpus as background: not working
  Hypothesis testing approach as an alternative
Single words are not very suggestive.
  Phrase extraction as a postprocessing step
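The two-component mixture described under Comparative Text Mining can be sketched roughly as follows. This is a simplified illustration, not the implementation behind the results: it pools word counts per gene, fixes the mixing weight `lam` instead of estimating it, and uses no background model. The function names and the toy counts are hypothetical.

```python
from collections import Counter


def normalize(weights):
    """Scale non-negative weights so they sum to 1 (left as-is if all are zero)."""
    total = sum(weights.values()) or 1.0
    return {w: v / total for w, v in weights.items()}


def fit_common_theme(collections, lam=0.5, iterations=50):
    """EM for a mixture of one common theme and one theme per collection.

    collections: dict mapping gene name -> Counter of word counts pooled
                 over the articles that discuss that gene.
    lam:         assumed fixed probability that a word comes from the common theme.
    """
    pooled = Counter()
    for counts in collections.values():
        pooled.update(counts)
    common = normalize(pooled)
    specific = {g: normalize(c) for g, c in collections.items()}

    for _ in range(iterations):
        new_common = Counter()
        new_specific = {}
        for gene, counts in collections.items():
            spec_weights = {}
            for word, c in counts.items():
                # E-step: posterior that this word was drawn from the common theme.
                a = lam * common[word]
                b = (1 - lam) * specific[gene][word]
                z = a / (a + b)
                # Accumulate expected counts for the M-step.
                new_common[word] += c * z
                spec_weights[word] = c * (1 - z)
            new_specific[gene] = normalize(spec_weights)
        # M-step: re-estimate both kinds of theme from the expected counts.
        common = normalize(new_common)
        specific = new_specific
    return common, specific


# Toy, made-up counts for two genes.
toy = {
    "Toll": Counter({"signaling": 5, "toll": 5}),
    "Pelle": Counter({"signaling": 5, "pelle": 5}),
}
common, specific = fit_common_theme(toy)
print(sorted(common, key=common.get, reverse=True))
```

Words shared across collections gain weight in the common theme, while gene-specific words move to the collection themes. In this sketch nothing prevents stop words from dominating the common theme, which matches the normalization problem noted under Problems and Plan.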