BIOINFORMATICS APPLICATIONS NOTE
Vol. 19 no. 18 2003, pages 2484–2485
DOI: 10.1093/bioinformatics/btg338

GoFigure: Automated Gene Ontology™ annotation

Salim Khan1, Gang Situ1, Keith Decker1 and Carl J. Schmidt2,∗

1 Department of Computer and Information Sciences, University of Delaware, Newark, DE 19717, USA and 2 Department of Animal and Food Sciences, University of Delaware, Newark, DE 19717, USA

∗ To whom correspondence should be addressed.

Received on August 19, 2002; revised on March 31, 2003; accepted on June 5, 2003

ABSTRACT
Summary: We have developed a web tool to predict Gene Ontology (GO) terms. The tool accepts an input DNA or protein sequence, and uses BLAST to identify homologous sequences in GO annotated databases. A graph is returned to the user via email.
Availability: The tool is freely available at: http://udgenome.ags.udel.edu/frm_go.html/
Contact: [email protected]

The application of high throughput methods to modern biology has resulted in an explosion of raw data that has outstripped the pace at which human curation, annotation and interpretation of such data is possible (Riggins and Strausberg, 2001; Sterky and Lundeberg, 2000; Claverie, 1997). This has resulted in the application of computational methods to automate data analysis and annotation. One useful tool to aid computational knowledge acquisition is the Gene Ontology (GO) (The Gene Ontology Consortium, 2001). The purpose of GO is to provide a dynamic controlled vocabulary applicable to all organisms even as understanding of gene and protein roles in cells is accumulating and changing.

We have developed GoFigure, a web-based graphical annotation tool that accepts as input either nucleotide or protein sequences and provides as output a clickable graph (Gansner and North, 2002) of the GO terms for all homologous sequences in the existing GO databases. A 4-step process constructs the GoFigure graph: homolog discovery, minimum covering graph construction, annotation term scoring, and annotation term assignment.

Homolog discovery. Initially, BLAST (Altschul et al., 1990) is used to discover homologs within the GO-annotated databases of the gene or gene product we wish to annotate. Recovering the GO annotations for each hit is straightforward, except that each hit may have multiple GO annotations.

Minimum covering graph construction. If set T is the union of all the annotation sets of the homologs of the uncharacterized sequence, the minimum covering graph (MCG) is a sub-graph of the GO directed acyclic graph (DAG) rooted at a GO term that subsumes all the terms from the set T. The MCG is minimized in that the root of the MCG is the term with the greatest depth from the root of the GO DAG that covers all the terms in set T.

Annotation term scoring. The final GO term(s) used to annotate the uncharacterized sequence are chosen on the basis of a simple voting scheme. Only the terms present within the MCG are eligible for selection, because the score assigned to each candidate is a weighted score of all the hits that map to it as well as the scores of all its children terms. The weighted score is derived from the BLAST results with matches having low E-values receiving a higher weighted score than poorer matches that have relatively higher E-values. Thus, all the terms in the MCG contribute to the score of the root of MCG. Normalizing the score of the MCG root gives us a term score ratio for each term in the MCG.

Annotation term assignment. Though the term score ratio of the MCG root (equal to 1 by normalization) is greater than the term score ratios of all the other terms within the MCG, the root term is also the least informative because it is relatively shallow compared to the other terms. Thus, the optimal assignment may be further down in the MCG.
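The MCG root, the term score ratios and a cutoff-based choice of terms can be illustrated with a small sketch. This is not the GoFigure implementation: the text does not give the weighting formula or the exact selection rule, so the sketch assumes a weight of -log10(E-value) per hit, takes depth as the longest path from the DAG root, credits each hit once to every MCG term that subsumes one of its annotations, and selects the deepest terms whose ratio meets a caller-supplied cutoff. The function names and the toy term names are hypothetical and are not real GO identifiers.

```python
import math
from collections import defaultdict

def ancestors(term, parents):
    """Return `term` plus every term reachable by following parent links."""
    seen, stack = {term}, [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def depth(term, parents):
    """Longest path from the DAG root to `term` (assumed depth definition)."""
    ps = parents.get(term, ())
    return 0 if not ps else 1 + max(depth(p, parents) for p in ps)

def mcg_root(terms, parents):
    """Deepest term that subsumes every term in `terms`."""
    common = set.intersection(*(ancestors(t, parents) for t in terms))
    return max(common, key=lambda t: depth(t, parents))

def hit_weight(e_value):
    """Assumed weighting: lower E-value gives a higher weight."""
    return max(0.0, -math.log10(max(e_value, 1e-300)))

def term_score_ratios(hits, parents):
    """hits: list of (e_value, [terms]). Returns (root, {term: ratio})."""
    root = mcg_root({t for _, terms in hits for t in terms}, parents)
    scores = defaultdict(float)
    for e_value, terms in hits:
        # MCG terms covering this hit: annotated terms and their
        # ancestors that still lie at or below the MCG root.
        covered = {a for t in terms for a in ancestors(t, parents)
                   if root in ancestors(a, parents)}
        for a in covered:
            scores[a] += hit_weight(e_value)
    # Assumes at least one hit has a positive weight (E-value < 1).
    return root, {t: s / scores[root] for t, s in scores.items()}

def assign_terms(ratios, parents, cutoff):
    """Deepest terms whose ratio meets the cutoff (assumed selection rule)."""
    passing = {t for t, r in ratios.items() if r >= cutoff}
    with_passing_child = {p for t in passing for p in parents.get(t, ())}
    return sorted(passing - with_passing_child)

# Toy DAG: child -> list of parents (hypothetical term names).
parents = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "catalytic": ["molecular_function"],
    "receptor": ["binding"],
    "scavenger_receptor": ["receptor"],
    "signaling_receptor": ["receptor"],
}
hits = [
    (1e-50, ["scavenger_receptor"]),
    (1e-30, ["scavenger_receptor"]),
    (1e-10, ["signaling_receptor"]),
    (1e-10, ["receptor"]),
]
root, ratios = term_score_ratios(hits, parents)
print(root)                                 # receptor
print(assign_terms(ratios, parents, 0.2))   # ['scavenger_receptor']
```

In this toy case the MCG root is "receptor" with a ratio of 1, while the more specific "scavenger_receptor" still carries about 0.8 of the total weight and is therefore preferred over the root. Because a term's score includes everything credited to its descendants, ratios never increase when moving down the graph, which is why filtering out terms that have a passing child is enough to find the deepest passing terms.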
From manual inspection of 500 GoFigure results for genes with known GO terms, a score ratio cutoff of 0.2 was found to minimize the error between the GoFigure-assigned annotation and the human-curated annotation.

To test the accuracy of this approach, all genes present in the Saccharomyces Genome Database (SGD) (Selina et al., 2002) were analyzed using GoFigure. In this test, the Saccharomyces database was eliminated from the list of databases searched by GoFigure. Comparing the GoFigure results with the annotations in SGD, we observed that GoFigure matched, or was within one node of, the SGD GO annotation in 65% of the instances. Given the complexity of the Gene Ontology, this result indicates that GoFigure can aid the investigator in assigning GO terms to gene products.

An interface to GoFigure allows submission of either single or multiple DNA or protein sequences. The user chooses one or more ontologies for GoFigure to display and supplies an email address for return of results. In the resulting graphs, colored boxes indicate term hits, with darker color indicating lower E-value. Clicking a colored box brings up a table of all individual database hits and E-values. Links in this table are functional and connect back to the original databases. A link to the AmiGO entry for the resulting GO term is also supplied. This permits users to retrieve all gene products that have been annotated with the given GO term.

As an example, the ontology prediction for the molecular function of the Hensin gene product is shown in Figure 1. In an analysis of genes responding to herpesvirus infection, one responsive cellular gene encoded the protein Hensin (Schmidt, unpublished results). On coming across Hensin in a list of genes responding to infection, many biologists will derive no information about its function from the name. However, inspection of the molecular function graph suggests that Hensin is a scavenger receptor that may also contain peptidase activity. While these results do not replace an exhaustive literature search, they may help the biologist formulate hypotheses and design future experiments.

Fig. 1. Output for the predicted GO molecular function terms for Hensin.

ACKNOWLEDGEMENTS
This work was supported by an award from the NSF (NSF0092336) to C.J.S. and K.D.

REFERENCES
Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Claverie,J.M. (1997) Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet., 6, 1735–1744.
Gansner,E.R. and North,S.C. (2002) An open graph visualization system and its applications to software engineering. Software Practice and Experience, 30, 1203–1233, 2000.
Riggins,G.J. and Strausberg,R.L. (2001) Genome and genetic resources from the Cancer Genome Anatomy Project. Hum. Mol. Genet., 10, 663–667.
Selina,S.D., Midori,A.H., Dolinski,K., Catherine,A.B., Binkley,G., Karen,R.C., Dianna,G.F., Issel-Tarver,L., Schroeder,M., Sherlock,G. et al. (2002) Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res., 30, 69–72.
Sterky,F. and Lundeberg,J. (2000) Sequence analysis of genes and genomes. J. Biotechnol., 76, 1–31.
The Gene Ontology Consortium (2001) Creating the gene ontology resource: design and implementation. Genome Res., 11, 1425–1433.

Bioinformatics 19(18) © Oxford University Press 2003; all rights reserved.