BIOINFORMATICS APPLICATIONS NOTE
Vol. 19 no. 18 2003, pages 2484–2485
DOI: 10.1093/bioinformatics/btg338
GoFigure: Automated Gene Ontology™ annotation
Salim Khan1, Gang Situ1, Keith Decker1 and Carl J. Schmidt2,∗
1 Department of Computer and Information Sciences, University of Delaware, Newark, DE 19717, USA and 2 Department of Animal and Food Sciences, University of Delaware, Newark, DE 19717, USA
Received on August 19, 2002; revised on March 31, 2003; accepted on June 5, 2003
ABSTRACT
Summary: We have developed a web tool to predict Gene
Ontology (GO) terms. The tool accepts an input DNA or
protein sequence, and uses BLAST to identify homologous
sequences in GO annotated databases. A graph is returned to
the user via email.
Availability: The tool is freely available at: http://udgenome.ags.udel.edu/frm_go.html/
Contact: [email protected]
The application of high throughput methods to modern biology has resulted in an explosion of raw data that has
outstripped the pace at which human curation, annotation and
interpretation of such data is possible (Riggins and Strausberg,
2001; Sterky and Lundeberg, 2000; Claverie, 1997). This has
resulted in the application of computational methods to automate data analysis and annotation. One useful tool to aid
computational knowledge acquisition is the Gene Ontology
(GO) (The Gene Ontology Consortium, 2001). The purpose
of GO is to provide a dynamic controlled vocabulary applicable to all organisms even as understanding of gene and
protein roles in cells is accumulating and changing. We have
developed GoFigure, a web-based graphical annotation tool
that accepts as input either nucleotide or protein sequences
and provides as output a clickable graph (Gansner and North,
2002) of the GO terms for all homologous sequences in
the existing GO databases. A 4-step process constructs the
GoFigure graph: homolog discovery, minimum covering
graph construction, annotation term scoring, and annotation
term assignment.
Homolog discovery. First, BLAST (Altschul et al., 1990) is used to search the GO-annotated databases for homologs of the gene or gene product we wish to annotate. Recovering the GO annotations for each hit is straightforward, except that each hit may have multiple GO annotations.
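A minimal sketch of how hits with multiple annotations might be regrouped by GO term is shown here. The function name `group_hits_by_term`, the input shapes, and the toy identifiers are assumptions made for illustration; the paper does not describe GoFigure's internal data structures.

```python
from collections import defaultdict


def group_hits_by_term(hits, annotations):
    """Regroup BLAST hits by the GO terms they carry.

    hits:        iterable of (hit_id, evalue) pairs (hypothetical shape)
    annotations: dict mapping hit_id -> iterable of GO terms
    Returns a dict mapping each GO term to the E-values of the hits
    annotated with it. A hit with several annotations appears under
    each of its terms.
    """
    by_term = defaultdict(list)
    for hit_id, evalue in hits:
        for term in annotations.get(hit_id, ()):
            by_term[term].append(evalue)
    return dict(by_term)


# Toy data, not taken from any real database.
example_hits = [("hitA", 1e-50), ("hitB", 1e-20)]
example_annotations = {
    "hitA": ["scavenger_receptor", "peptidase"],
    "hitB": ["scavenger_receptor"],
}
grouped = group_hits_by_term(example_hits, example_annotations)
```

Keeping the E-values alongside each term means the later scoring step can weight every hit without going back to the BLAST output.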
∗ To whom correspondence should be addressed.
Minimum covering graph construction. If set T is the union of all the annotation sets of the homologs of the uncharacterized sequence, the minimum covering graph (MCG) is a sub-graph of the GO directed acyclic graph (DAG) rooted at a GO term that subsumes all the terms from the set T. The MCG is minimized in that the root of the MCG is the term with the greatest depth from the root of the GO DAG that covers all the terms in set T.
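The MCG definition can be sketched as a deepest-common-ancestor search. The code below is an illustration under stated assumptions, not GoFigure's implementation: the DAG is given as a hypothetical `parents` mapping (term to its parent terms), depth is taken as the longest path from a parentless root, and the MCG node set is taken to be every term lying on a path from the chosen root down to a term in T.

```python
def ancestors(term, parents):
    """Return the term itself plus every term reachable via parent links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def depth(term, parents, cache):
    """Longest path length from a parentless root down to `term` (assumption)."""
    if term not in cache:
        term_parents = parents.get(term, ())
        cache[term] = (
            0 if not term_parents
            else 1 + max(depth(p, parents, cache) for p in term_parents)
        )
    return cache[term]


def minimum_covering_graph(terms, parents):
    """Return (root, nodes) for the set T of annotation terms."""
    terms = list(terms)
    if not terms:
        raise ValueError("T must contain at least one term")
    ancestor_sets = [ancestors(t, parents) for t in terms]
    common = set.intersection(*ancestor_sets)
    if not common:
        raise ValueError("terms share no common ancestor")
    cache = {}
    # Deepest term that subsumes all of T; sorting makes ties deterministic.
    root = max(sorted(common), key=lambda t: depth(t, parents, cache))
    # Keep only terms between the root and the terms in T.
    nodes = {n for n in set.union(*ancestor_sets) if root in ancestors(n, parents)}
    return root, nodes


# Toy ontology fragment with made-up structure.
toy_parents = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "catalytic": ["molecular_function"],
    "receptor": ["binding"],
    "scavenger_receptor": ["receptor"],
    "protein_binding": ["binding"],
    "peptidase": ["catalytic"],
}
```

With this toy fragment, T = {scavenger_receptor, protein_binding} is covered by a graph rooted at `binding`, while adding `peptidase` to T pushes the root up to `molecular_function`.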
Annotation term scoring. The final GO term(s) used to annotate the uncharacterized sequence are chosen on the basis of a simple voting scheme. Only the terms present within the MCG are eligible for selection, because the score assigned to each candidate combines the weighted scores of all the hits that map to it with the scores of all its child terms. The weighted score is derived from the BLAST results, with matches that have low E-values receiving a higher weighted score than poorer matches with relatively higher E-values. Thus, all the terms in the MCG contribute to the score of the root of the MCG. Normalizing each term's score by the score of the MCG root gives a term score ratio for each term in the MCG.
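The voting scheme can be sketched as follows. The paper does not give the weighting formula, so the `-log10(E)` weight used here is purely an assumption, as are the function names and the `children` mapping. To avoid counting a hit twice when a term is reachable through more than one parent, this sketch sums direct hit weights over each term's set of descendants rather than recursing naively.

```python
import math


def hit_weight(evalue, floor=1e-180):
    """Hypothetical weight: -log10(E), so lower E-values weigh more.

    E-values of 0.0 are clamped to `floor`; E-values >= 1 get weight 0.
    """
    return max(0.0, -math.log10(max(evalue, floor)))


def term_score_ratios(root, children, direct_hits):
    """Score every term under `root` and normalize by the root's score.

    children:    dict mapping term -> iterable of child terms within the MCG
    direct_hits: dict mapping term -> list of E-values of hits annotated
                 directly with that term
    """
    def descendants(term):
        seen, stack = set(), [term]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(children.get(current, ()))
        return seen

    direct = {t: sum(hit_weight(e) for e in evs) for t, evs in direct_hits.items()}
    scores = {
        t: sum(direct.get(d, 0.0) for d in descendants(t))
        for t in descendants(root)
    }
    total = scores[root]
    if total == 0:
        raise ValueError("no weighted hits under the MCG root")
    return {t: s / total for t, s in scores.items()}


# Toy MCG rooted at "binding".
toy_children = {
    "binding": ["receptor", "protein_binding"],
    "receptor": ["scavenger_receptor"],
}
toy_hits = {"scavenger_receptor": [1e-50, 1e-30], "protein_binding": [1e-20]}
toy_ratios = term_score_ratios("binding", toy_children, toy_hits)
```

By construction the root's ratio is 1, and a term's ratio can never exceed that of any of its ancestors, which matches the behaviour the text describes.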
Annotation term assignment. Though the term score ratio of
the MCG root (equal to 1 by normalization) is greater than
the term score ratios of all the other terms within the MCG,
the root term is also the least informative because it is relatively shallow compared to the other terms. Thus, the optimal
assignment may be further down in the MCG. From manual inspection of 500 GoFigure results for genes with known GO terms, a score ratio cutoff of 0.2 was found to minimize the error between the GoFigure-assigned annotation and the human-curated annotation.
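One way the cutoff might be applied is sketched here. The text gives the cutoff value (0.2) but not the exact selection rule, so the rule below is an assumption: keep every term whose score ratio reaches the cutoff, then report only the most specific of those, that is, the ones with no child that also passes. The function name and the input shapes are hypothetical.

```python
def assign_terms(ratios, children, cutoff=0.2):
    """Pick the most specific MCG terms whose score ratio meets the cutoff.

    ratios:   dict mapping term -> term score ratio (root has ratio 1)
    children: dict mapping term -> iterable of child terms within the MCG
    """
    passing = {term for term, ratio in ratios.items() if ratio >= cutoff}
    return sorted(
        term
        for term in passing
        if not any(child in passing for child in children.get(term, ()))
    )


# Toy ratios for an MCG rooted at "binding".
toy_children = {
    "binding": ["receptor", "protein_binding"],
    "receptor": ["scavenger_receptor"],
}
toy_ratios = {
    "binding": 1.0,
    "receptor": 0.9,
    "scavenger_receptor": 0.9,
    "protein_binding": 0.1,
}
assigned = assign_terms(toy_ratios, toy_children)
```

In the toy case `protein_binding` falls below the cutoff, so the assignment moves down from the uninformative root to `scavenger_receptor`.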
To test the accuracy of this approach, all genes present in the Saccharomyces Genome Database (SGD) (Selina et al., 2002) were analyzed using GoFigure. In this test, the Saccharomyces database was excluded from the list of databases searched by GoFigure. Comparing the GoFigure results with the annotations in SGD, we observed that GoFigure matched, or was within one node of, the SGD GO annotation in 65% of instances. Given the complexity of the Gene Ontology, this result indicates that GoFigure can aid the investigator in assigning GO terms to gene products.
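The agreement measure can be sketched as below, assuming that "within one node" means the predicted and curated terms are identical or directly connected by a parent-child edge in the GO DAG. That reading, the function names, and the toy data are assumptions; the paper does not spell out the comparison procedure.

```python
def within_one_node(predicted, curated, parents):
    """True if the two terms are equal or adjacent in the DAG."""
    return (
        predicted == curated
        or curated in parents.get(predicted, ())
        or predicted in parents.get(curated, ())
    )


def agreement_rate(pairs, parents):
    """Fraction of (predicted, curated) pairs that agree within one node."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (predicted, curated) pair")
    matches = sum(within_one_node(p, c, parents) for p, c in pairs)
    return matches / len(pairs)


# Toy ontology fragment and toy predictions, not real SGD data.
toy_parents = {
    "binding": ["molecular_function"],
    "catalytic": ["molecular_function"],
    "receptor": ["binding"],
    "scavenger_receptor": ["receptor"],
    "peptidase": ["catalytic"],
}
toy_pairs = [
    ("receptor", "scavenger_receptor"),  # adjacent
    ("binding", "scavenger_receptor"),   # two edges apart
    ("peptidase", "peptidase"),          # exact match
    ("binding", "peptidase"),            # different branch
]
rate = agreement_rate(toy_pairs, toy_parents)
```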
Bioinformatics 19(18) © Oxford University Press 2003; all rights reserved.
An interface to GoFigure allows submission of either single or multiple DNA or protein sequences. The user chooses one or more ontologies for GoFigure to display and supplies an email address for return of results. In the resulting graphs, colored boxes indicate term hits, with darker color indicating lower E-value. Colored boxes can be clicked on to bring up a table of all individual database hits and E-values. Links in this table are functional, and connect back to the original databases. A link to the AmiGO entry for the resulting GO term is also supplied. This permits users to retrieve all gene products that have been annotated with the given GO term. As an example, the ontology prediction for the molecular function of the Hensin gene product is shown in Figure 1. In an analysis of genes responding to herpesvirus infection, one responsive cellular gene encoded the protein Hensin (Schmidt, unpublished results). On coming across Hensin in a list of genes responding to infection, many biologists will derive no information about its function from the name. However, inspection of the molecular function graph suggests that Hensin is a scavenger receptor that may also contain peptidase activity. While these results do not replace an exhaustive literature search, they may help the biologist formulate hypotheses and design future experiments.
Fig. 1. Output for the predicted GO molecular function terms for Hensin.
ACKNOWLEDGEMENTS
This work was supported by an award from the NSF (NSF0092336) to C.J.S. and K.D.
REFERENCES
Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Claverie,J.M. (1997) Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet., 6, 1735–1744.
Gansner,E.R. and North,S.C. (2002) An open graph visualization system and its applications to software engineering. Software Practice and Experience, 30, 1203–1233, 2000.
Riggins,G.J. and Strausberg,R.L. (2001) Genome and genetic resources from the Cancer Genome Anatomy Project. Hum. Mol. Genet., 10, 663–667.
Selina,S.D., Midori,A.H., Dolinski,K., Catherine,A.B., Binkley,G., Karen,R.C., Dianna,G.F., Issel-Tarver,L., Schroeder,M., Sherlock,G. et al. (2002) Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res., 30, 69–72.
Sterky,F. and Lundeberg,J. (2000) Sequence analysis of genes and genomes. J. Biotechnol., 76, 1–31.
The Gene Ontology Consortium (2001) Creating the gene ontology resource: design and implementation. Genome Res., 11, 1425–1433.