
Plant Cell Physiol. 46(1): 63–68 (2005)
doi:10.1093/pcp/pci505, available online at www.pcp.oupjournals.org
JSPP © 2005

Mini Review

Biological Ontologies in Rice Databases. An Introduction to the Activities in Gramene and Oryzabase

Yukiko Yamazaki 1, 3 and Pankaj Jaiswal 2

1 Center for Genetic Resource Information, National Institute of Genetics, Mishima, Shizuoka, 411-8540 Japan
2 Department of Plant Breeding, Cornell University, Ithaca, NY 14853, U.S.A.
3 Corresponding author: E-mail, [email protected]; Fax, +81-55-981-6886.

An enormous amount of information and materials has been accumulating in the field of biology as a result of its rapid development, including nucleotide and amino acid sequences, gene and protein functions, mutants and their phenotypes, and literature references. Effective use of this information may strongly promote biological studies and may lead to many important findings. It is, however, time-consuming and laborious for individual researchers to collect information from the individual original sites and to rearrange it for their own purposes. The concept of ontology has been introduced in biology to support and encourage researchers to share and reuse information among biological databases. An ontology has a glossary, called a dynamic controlled vocabulary, in which relationships between terms are defined. Since each term is strictly defined and identified with an ID number, a set of data represented in a biological ontology is easily accessible to automated information processing, even if the data sets span several databases and/or different organisms. In this mini-review, we introduce activities in Gramene and Oryzabase, which provide biological ontologies for Oryza sativa (rice).

Keywords: Database; Gramene; Ontology; Oryzabase; Oryza sativa (rice).

Abbreviations: GO, gene ontology; GRO, growth stage ontology; PO, plant ontology; QTL, quantitative trait locus; TO, trait ontology.

Introduction

The word ‘ontology’ originated as a term in philosophy, where it means ‘subject of existence’. It has been used in the field of artificial intelligence as an explicit specification of a conceptualization, and bioinformaticians started to use this concept about 5 years ago. The purpose of biological ontology is to share and reuse knowledge among researchers, especially those studying different organisms. The gene ontology (GO) (Gene Ontology Consortium 2004) [see Appendix 1 (1)], for example, is one of the most successful biological ontologies. Most genome databases of model organisms as well as DNA/protein sequence databases such as SWISS-PROT/TrEMBL/Ensembl now share GO-IDs, the identifiers for each term in the gene ontology. When genes from a variety of organisms are associated with a GO-ID, the set of genes associated with that GO-ID can be acquired from the GO database without a sequence similarity search. Furthermore, the sets of genes associated with terms higher or lower in the hierarchy than that ID are also retrievable. Without GO-IDs, it takes a long time to obtain the same results, because genes sharing the same function often have different annotations (gene names or functional descriptions).

Although the term ‘ontology’ is not familiar to biologists, the Enzyme Commission (EC) number and the taxonomic hierarchies, for example, can be regarded as pioneering ontologies. Before the GO started in 2000, FlyBase (FlyBase Consortium 2003) [see Appendix 1 (2)] had started a program called ‘Links to databases of other genomes’ in 1998, in which each Drosophila gene has a lineage of the homologous genes in other organisms, e.g. ECOGENE:EG10396|glpF|FBgn0000180|bib. At that time, the MENDEL Database (Price and Reardon 2001) provided the ‘Designations and families of sequenced genes primarily in plants’. These trials and others have been unified and have grown into the GO, which covers all organisms.

In addition to GO, various types of scientific knowledge have been structured under the name of ontology. While conventional classifications have limitations concerning traditional form, definition and logic, ontology is more flexible. Perhaps it would be more helpful to biologists to show examples instead of a general introduction to ontology. We will review the current status of biological ontologies in Gramene (Jaiswal et al. 2002, Ware et al. 2002) [see Appendix 1 (3)] and Oryzabase (Yamazaki et al. 2003) [see Appendix 1 (4)], two comprehensive rice databases. All URL addresses cited in this review are listed in Appendix 1.

Ontologies in Gramene

Gramene is a comparative genome mapping database for grasses, using the rice genome as an anchor. Both automatic and manual data curation are performed to combine and interrelate information on genomic and cDNA sequences, proteins, various maps (genetic, physical and molecular marker maps), mutant phenotypes, quantitative trait loci (QTLs) and publications. As an information resource, the purpose of Gramene is to provide added-value data from public databases so that researchers can better leverage the rice genomic sequence and genetic information, and can identify and understand characteristics of genes, genome organization, pathways and phenotypes in the cereals. The exponential increase in the quantity and diversity of information, and the desire of the research community to query and extract information from available data resources in a consistent manner, demand standardized methods for describing, classifying and inter-relating objects such as genetic markers and loci, qualitative and quantitative phenotypes, polymorphisms, germplasms and sequences of both DNA and protein. Users of the rice-specific databases including Gramene want to be able to find and extract information in a way that allows them to build useful biological statements based on comparative analyses.
This demand from Gramene users has presented two fundamental challenges in the management of diverse data sets. One challenge involves the design and implementation of an appropriate and robust database structure, and the second is proper attribution of the data types, as a part of the annotation process. The process of attribution is very important, given that every data module in Gramene, such as the mutant, QTL and protein modules, typically provides carefully collected information on the phenotypes and deduced function of the gene product. As a part of the attribution process, these modules link freely to other databases, such as Oryzabase, GenBank (Benson et al. 2004) [see Appendix 1 (5)], Swiss-Prot (Boeckmann et al. 2003) [see Appendix 1 (6)], GenomeNet (Kanehisa 1997) [see Appendix 1 (7)] and PubMed (Thirup and Nielsen 2002) [see Appendix 1 (8)], to give access to additional or original source information that was collected elsewhere. Appropriate attribution is critical in substantiating the validity and original source of the data acquired by a third party, and it also contributes an important dimension to the statements a user can reliably make based on a database query.

Gramene, in collaboration with Oryzabase and various other plant databases, such as MaizeGDB (Lawrence et al. 2004) [see Appendix 1 (9)], GrainGenes (Matthews et al. 2003) [see Appendix 1 (10)], BarleyBase [see Appendix 1 (11)], IRIS (Bruskiewich et al. 2003) [see Appendix 1 (12)] and TAIR (Rhee et al. 2003) [see Appendix 1 (13)], has begun to develop common vocabularies and exchange protocols to enhance the capacity for data sharing, and to provide plant researchers with the possibility of querying across various plant databases. In the Gramene database, we have integrated GO, PO and TO, i.e. gene, plant and trait ontologies.

The plant ontology (PO) in Gramene

The PO (Bruskiewich et al.
2002) provides a framework for comparative collection of phenotypic information across species by using a common vocabulary to describe morphologies and developmental stages of plants. It is also useful for describing the tissue- or developmental stage-specific expression patterns of genes reported so far and gene profiles obtained by microarray analyses. The development of a PO, now in progress by members of the plant research community, aims to provide a common, structured vocabulary for describing features of morphology, growth pattern and developmental stages of flowering plants. Since the community project's growth and developmental stage ontology is still under development, Gramene curators have developed the growth stage ontology (GRO) of cereals, in order to perform the data collection work appropriately as well as to provide a platform for collecting information and comparing growth stages. The present GRO is in an initial stage, such that plant growth stages have been defined only for rice, maize, sorghum, oat, and plants in Triticeae such as wheat and barley. In contrast, Oryzabase, as described below, has defined fine developmental stages of rice in embryogenesis and in the vegetative and reproductive phases, and these have therefore been introduced into the GRO [see Appendix 1 (14)]. We expect to provide detailed information on the developmental stages of other plants, associated with gene expression patterns and mutant phenotypes, by the first quarter of 2005.

The TO was developed to provide a reliable framework for describing the kinds of assays used to evaluate plant phenotypes, thus helping to standardize the descriptions of mutants, strains, polymorphisms and QTLs. This is particularly relevant since breeding and genetics communities with a long history often use different terms for the same trait or phenotype.
The TO developed by Gramene restructures and standardizes the methodology and terminology for evaluation of traits that are familiar to crop breeders around the world (SES, the standard evaluation system for rice [see Appendix 1 (15)]; GRIN [see Appendix 1 (16)]; ICIS [see Appendix 1 (17)]; IRIS, etc.). Terms from the GO, PO and TO are all used to provide a coherent evaluation of a plant phenotype and the function of a gene. Their implementation in the Gramene database allows users to ask questions such as the following. Find all mutant phenotypes evaluated for the trait ‘plant height’, and show which ones have a sequenced gene associated with them. If there is a sequenced gene, what is its (known/putative) molecular function and in what biological process does it work? In addition, what is the location of the gene in the rice genome, and are there any known orthologs in other cereal genomes?

Fig. 1 Ontology map in Oryzabase. The six rectangles correspond to information categories which Oryzabase houses: (i) genetic resources including mutants and wild relatives of rice; (ii) gene dictionary/markers; (iii) linkage, physical and comparative maps; (iv) catalogue of developmental stages and gene expression patterns at those stages; (v) DNA sequences; and (vi) references. Each shape (circle, diamond and hexagon) shows the association of database contents with each ontology, GO, TO and PO. By sharing ontology IDs, Oryzabase has cross-references to Gramene and other external databases.

It is easy for the databases as well as the researchers to keep track of the ontology terms associated with their gene or phenotype of interest. This is accomplished by assigning a unique identifier or accession, similar to a GenBank accession number, to each term in the ontology.
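The first of the questions above (find the mutants evaluated for a trait, then keep those with a sequenced gene) can be illustrated with a small sketch. This is not Gramene's implementation: the record layout, the function names and the parent term `TO:HYPO_HEIGHT` are assumptions made purely for illustration. Only the accessions TO:0000309 (culm length) and GR:0060842 (the sd1 mutant page) are taken from Appendix 1. The sketch also shows why accessions and a term hierarchy matter: a query on a parent term can pick up records annotated with any of its child terms.

```python
from collections import defaultdict

# Child term -> parent terms. TO:0000309 (culm length) is cited in Appendix 1;
# "TO:HYPO_HEIGHT" is a made-up parent term that exists only in this sketch.
IS_A = {
    "TO:0000309": ["TO:HYPO_HEIGHT"],
}

# Hypothetical mutant records. GR:0060842 (sd1) is cited in Appendix 1;
# the second record and the "sequenced" flag are invented for illustration.
MUTANTS = [
    {"id": "GR:0060842", "symbol": "sd1", "to_term": "TO:0000309", "sequenced": True},
    {"id": "GR:HYPO_0001", "symbol": "d-x", "to_term": "TO:HYPO_HEIGHT", "sequenced": False},
]


def descendants(term, is_a):
    """Return the term itself plus every term below it in the hierarchy."""
    children = defaultdict(list)
    for child, parents in is_a.items():
        for parent in parents:
            children[parent].append(child)
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children[current])
    return seen


def mutants_for_trait(term, mutants, is_a, sequenced_only=False):
    """List symbols of mutants annotated with the term or any of its descendants."""
    terms = descendants(term, is_a)
    return [
        m["symbol"]
        for m in mutants
        if m["to_term"] in terms and (m["sequenced"] or not sequenced_only)
    ]
```

Calling `mutants_for_trait("TO:HYPO_HEIGHT", MUTANTS, IS_A, sequenced_only=True)` narrows the result to mutants with a sequenced gene, which mirrors the second half of the question.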
The gene ontology (GO) in Gramene

The GO is an example of a common vocabulary that has been adopted by many genome databases to describe the molecular function, role in a biological process and cellular localization of the gene products from diverse organisms. As of October 2004, Gramene provides information on molecular characteristics of all the rice protein entries in SWISS-PROT and TrEMBL. The information is extracted based on computational analyses as well as manual collection from peer-reviewed publications. The annotations are shared with research communities through databases of the GO Consortium.

Ontologies and mutants in Gramene

The mutant database provides collected information on publicly available mutant stocks of rice. It includes descriptions of phenotypes concerning morphological, developmental and agronomically important traits. The current version of the mutant database houses information on >1,300 mutants, of which about 400 have information collected from the literature and phenotype descriptions from Oryzabase and from T. Kinoshita (personal communication). Researchers can search the database using a gene symbol, gene name or a Gramene accession number as a query [see Appendix 1 (18)]. Alternatively, the mutants can be searched either by browsing gene symbols in alphabetical order or by searching the TO terms by keyword to find the associated mutant genes. For example, if you enter the word semidwarf-1 in the mutant search page of Gramene, you will find a list that contains eight genes (loci) such as semidwarf-1 (sd1), semidwarf-11 and semidwarf-12. The list also includes gene symbols, synonyms and brief descriptions of phenotypes.
Then, if you want to know details of semidwarf-1 (sd1), which has been well characterized among the eight semi-dwarf genes, you can move to the sd1 page, which provides detailed information concerning this gene: name, allele, germplasm, a more detailed description of the phenotype together with two images, GenBank accession number, gene product, map position, associated features and literature references [see Appendix 1 (19)]. The associated feature section tells more about the phenotype by using the concept of a controlled vocabulary (ontology). The TO term (culm length) indicates the trait for which the gene was characterized, and the developmental stage (04-stem elongation stage) and anatomy location (stem) indicate the stage and the part of the plant that the mutation affects, respectively. The links from these controlled vocabulary terms take you to other ontology browsers, where you can find the other mutant phenotypes that have been evaluated for the same trait, e.g. the trait ‘culm length’ [see Appendix 1 (20)]. Similarly, you can find other mutants that are known to express a phenotype at a given developmental stage or in a given plant part. The map position and sequence information section provides information on alleles and the genetic backgrounds in which the alleles were observed. The map position section indicates a rough location of the gene on a genetic map and also links to the CMap pages. If the gene corresponding to the mutation is cloned, you can obtain nucleotide and amino acid sequences and information on the gene and protein through GenBank and Gene Product. To learn more about the molecular characteristics of the protein encoded by the gene, follow the link from Gene Product to the detailed information.

Ontologies and QTLs in Gramene

In addition to the mutants, the Gramene database also provides information on the published QTLs from rice, maize, oat, barley and wild relatives of rice (Ni et al.
in preparation) [see Appendix 1 (21)]. The QTL information includes the following.
(i) Trait name, which is a controlled vocabulary term from the TO.
(ii) Trait symbol (e.g. SDCL, ALSN and DTHD), which is assigned by Gramene in a controlled way.
(iii) Trait category based on agronomic concepts. A trait such as ‘Chalkiness of endosperm’ is classified into the category ‘quality’. The categories are represented as top level parent terms in the TO. The trait browser available from the QTL pages does not link back to the TO, but instead displays only the number of QTL associations with a trait.
(iv) Information on the linkage group and the position of the QTLs on the genetic map, as described in the original articles.

Fig. 2 (a) Overview of GO structure produced by GOALL and hierarchical list of GO terms. A dot in the center indicates the root of the GO term hierarchy. The three organizing principles of GO are molecular function (blue–yellow), biological process (purple) and cellular component (orange–red); each tree extends away from the center in a different direction. All nodes are called ‘GO terms’ and each term has a unique GO-ID. All nodes except the root have a parent node and some nodes have child node(s). Each GO-ID is unique, but two or more nodes can share the same GO-ID because a term can have relationships to more than one term. For example, a gene product has one or more functions, is used in one or more biological processes, and might be associated with one or more cellular components. (b) Comparison of the entire set of genes between two species by GOALL. The details are described in the GOALL section of this mini-review.

An important organizational principle of the collected portion of all modules related to phenotype in Gramene is the utilization of ontologies to provide standardized terms for describing phenotypes.
This is critical, as it provides the backbone required to support the comparative querying functions provided by the database.

Ontologies in Oryzabase

Oryzabase, which originated as a database of rice genetic resources (established in 1995), has now grown into an integrated database of rice science. It provides a central location for integration of rice genetic, genomic and phenotypic data, and houses information such as (i) genetic resources including mutants and wild relatives of rice; (ii) gene dictionary; (iii) linkage, physical and comparative maps; (iv) catalogues of developmental stages and gene expression patterns at those stages; (v) DNA sequences; and (vi) basic information about rice science for students. Among these contents, genes, mutant phenotypes and stage-specific gene expression patterns are directly related to the biological ontologies. The structure of Oryzabase is represented schematically, together with its associations with the biological ontologies, in Fig. 1.

Oryzabase GO

There are three approaches concerning the GO in Oryzabase. The first one is the gene dictionary. The Oryzabase gene dictionary is separated into two main groups, the genes that are characterized biochemically and those related to phenotypic traits. Only the former genes have assigned GO-IDs. The latter genes, together with some genes in the former category, have associations with the existing TO (refer to Kurata et al. in this issue). The second approach is to associate GO terms with predicted genes on the rice genomic sequence. GO-IDs are assigned to each predicted gene according to sequence similarity, by BLAST analysis against the TrEMBL protein sequence database. The third approach is to develop a GO viewer, GOALL, which enables users to search the vast amount of information in the GO data set efficiently.

GOALL

As of October 2004, about 18,000 terms and >1,300,000 associated genes are available from the GO database. Fig.
2(a) shows an overview of the structure of the GO world, i.e. a set of GO-IDs or GO-terms. As shown in Fig. 2(a), all terms in the GO as well as its hierarchy structure can be browsed with GOALL. Users can also retrieve genes associated with a term using the tool. GOALL provides information from both the GO database and the Oryzabase gene dictionary.

GOALL has a characteristic function which allows comparison of the entire set of genes between two species. As shown in Fig. 2(b), for example, a taxonomy data set is uploaded and then the two species to be compared can be selected from the pull-down menu, in this case Arabidopsis as data A and rice as data B. The tool then searches all associated genes in these two species and shows the terms which have at least one associated gene. In this case, there are 1,486 GO terms (green) with associated genes in both species, 1,694 terms (red) found only in Arabidopsis, 95 terms (blue) found only in rice and 14,436 terms not found in either species. The number of terms reported here only reflects the current status of gene annotation, not the actual biological function. For GOALL to support biologists in their studies, rapid progress in precise annotation of gene functions is needed [see Appendix 1 (22)].

Mutant collection in Oryzabase

The mutant collection of Oryzabase contains two classes of mutant. One consists of morphologically and/or physiologically characterized mutants collected over a few decades. The other is a collection from the large-scale mutant population of chemically mutagenized lines (Satoh and Omura 1981). The former collection is classified into seven classes according to the gene classification by Kurata et al. (this issue) and covers >2,000 mutants or variants. All these mutants are characterized with several features including trait gene name, specific phenotype and TO/PO/GO designations.
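Returning to the two-species comparison described in the GOALL section, the four colour classes amount to a partition of the GO terms by set operations. The following is a minimal sketch of that idea, not GOALL's actual code; the function name, the dictionary layout and the placeholder IDs such as `GO:A` are assumptions. The reported counts are taken from the text, and since the four classes partition the term set, they sum to 17,711, which is consistent with the "about 18,000 terms" quoted for October 2004.

```python
def compare_species(all_terms, genes_by_term_a, genes_by_term_b):
    """Partition GO terms by which species has at least one associated gene."""
    in_a = {t for t in all_terms if genes_by_term_a.get(t)}
    in_b = {t for t in all_terms if genes_by_term_b.get(t)}
    return {
        "both": in_a & in_b,       # shown in green in Fig. 2(b)
        "only_a": in_a - in_b,     # red: only in data A
        "only_b": in_b - in_a,     # blue: only in data B
        "neither": set(all_terms) - in_a - in_b,
    }


# Toy data with placeholder term IDs and gene names (all hypothetical).
ALL_TERMS = ["GO:A", "GO:B", "GO:C", "GO:D"]
ARABIDOPSIS = {"GO:A": ["geneA1"], "GO:B": ["geneA2"]}
RICE = {"GO:A": ["geneR1"], "GO:C": ["geneR2"], "GO:D": []}

result = compare_species(ALL_TERMS, ARABIDOPSIS, RICE)

# Counts reported in the text for Arabidopsis (data A) versus rice (data B).
REPORTED = {"both": 1486, "only_a": 1694, "only_b": 95, "neither": 14436}
total_terms = sum(REPORTED.values())
```

A term with an empty gene list (`GO:D` in the toy data) is treated as having no association, matching the rule that a term is shown only if it has at least one associated gene.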
The mutants in the latter collection are classified into three growth stages: seedling, vegetative and reproductive. In each stage, mutants are sorted further into 50 categories according to the abnormalities in phenotype. The Tos17 transposon-tagged lines generated by Hirochika et al. (1996) in the Rice Genome Research Program also share the latter mutant classification scheme. The number of these mutants will grow to 10,000 lines in a few years.

Oryzabase gene expression specific for developmental stages and organs

The definitions of developmental stages and the genes related to development, which are described in detail in this issue (see Itoh et al. and Kurata et al., this issue), are available through Oryzabase. The defined stages in embryo, leaf, stomata, inflorescence, spikelet and ovule development are explained as events characteristic of each stage, together with images. The list of enhancer trap lines, cell markers (genes expressed stage specifically) and mutants, which covers all stages and all organs, is also retrievable from the same web site. Although the current release provides only a limited number of genes and mutants, all items which appear in this issue will be incorporated into the database in the near future. A data submission system for this content is being developed so that researchers can submit their data easily.

Future Direction of Biological Ontologies in Plants

Recent achievements in genomic science for Arabidopsis have brought a new era into the biological ontology of plants. According to the current release of GO, about 3,000 terms and >80,000 associated genes are recorded in Arabidopsis (the estimated number of Arabidopsis genes is approximately 30,000; associated genes in the GO database are derived from different databases, resulting in redundancy) and there are almost half as many in rice.
With the completion of rice genome sequencing, the number of annotated genes is expected to increase rapidly in rice, and these two model plants will become actual anchors for dicot and monocot plant sciences. Gramene will continue to work toward extending its annotation of rice genes, structures, traits and developmental stages in collaboration with other plant databases. An ontology built in one domain is desirable, but it can be greatly improved by international collaborations. Accurate ontologies rely strongly on support from biological researchers. Biological researchers, however, are not familiar with indices or conceptual hierarchies. Also, biologists may prefer a one-stop database to wandering through dispersed websites. To respond to these demands, Oryzabase is now developing a new platform for ontology, which is designed to display as many existing ontologies and ontology-related contents as possible without contradiction. At the same time, several new and challenging approaches are underway in the bioinformatics community for the development of automatic annotation techniques such as text mining and information extraction. Ontologies facilitate not only communication between researchers from different fields but also the use of knowledge compiled by computers. We are on the way to a future where new concepts and knowledge can be extracted from a vast amount of data by applying ontologies.
Appendix 1

(1) GO http://www.geneontology.org/
(2) FlyBase http://flybase.bio.indiana.edu/
(3) Gramene http://www.gramene.org/
(4) Oryzabase http://www.shigen.nig.ac.jp/rice/oryzabase/
(5) GenBank http://www.ncbi.nlm.nih.gov/
(6) Swiss-Prot http://kr.expasy.org/sprot/
(7) GenomeNet http://www.genome.jp/
(8) PubMed http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
(9) MaizeGDB http://www.maizegdb.org/
(10) GrainGenes http://wheat.pw.usda.gov/GG2/
(11) BarleyBase http://www.barleybase.org/
(12) IRIS http://www.iris.irri.org/
(13) TAIR http://www.arabidopsis.org/
(14) GRO http://www.gramene.org/db/ontology/search_term?id=GRO:0007136
(15) SES http://www.riceweb.org/ses/sesidx.htm
(16) GRIN http://www.ars.grin.gov/npgs
(17) ICIS http://www.iris.irri.org/icis/SearchIRIS.htm
(18) http://www.gramene.org/rice_mutant/Mutant_Search1.html
(19) http://www.gramene.org/db/mutant/search_mutant?id=GR:0060842
(20) http://www.gramene.org/db/ontology/search_term?id=TO:0000309
(21) http://www.gramene.org/db/qtl/qtl_display
(22) GOALL http://www.shigen.nig.ac.jp/ontology/

Acknowledgments

We thank Dr. David Mathews of Cornell University for critical reading of the manuscript. Reviewers of the paper provided helpful advice.

References

Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J. and Wheeler, D.L. (2004) Nucleic Acids Res. 32: D23–D26.
Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A. et al. (2003) Nucleic Acids Res. 31: D365–D370.
Bruskiewich, R., Coe, E.H., Jaiswal, P., McCouch, S., Polacco, M., Stein, L., Vincent, L. and Ware, D. (2002) Comp. Funct. Genomics 3: 137–142.
Bruskiewich, R.M., Cosico, A.B., Eusebio, W., Portugal, A.M., Ramos, L.M. et al. (2003) Bioinformatics 19: i63–i65.
FlyBase Consortium (2003) Nucleic Acids Res. 31: 172–175.
Gene Ontology Consortium (2004) Nucleic Acids Res. 32: D258–D261.
Hirochika, H., Sugimoto, K., Otsuki, Y., Tsugawa, H. and Kanda, M. (1996) Proc. Natl Acad. Sci. USA 93: 7783–7788.
Itoh, J.-I., Nonomura, K.-I., Ikeda, K., Yamaki, S., Inukai, Y., Yamagishi, H., Kitano, H. and Nagano, Y. (2005) Plant Cell Physiol. 46 (in this pci501).
Jaiswal, P., Ware, D., Ni, J., Chang, K., Zhao, W. et al. (2002) Comp. Funct. Genomics 3: 132–136.
Kanehisa, M. (1997) Trends Biochem. Sci. 22: 442–444.
Kurata, N., Miyoshi, K., Nonomura, K.-I., Yamazaki, Y. and Ito, Y. (2005) Plant Cell Physiol. 46 (in this pci506).
Lawrence, C.J., Dong, Q., Polacco, M.L., Seigfried, T.E. and Brendel, V. (2004) Nucleic Acids Res. 32: D393–D397.
Matthews, D.E., Carollo, V.L., Lazo, G.R. and Anderson, O.D. (2003) Nucleic Acids Res. 31: D183–D186.
Price, C.A. and Reardon, E.M. (2001) Nucleic Acids Res. 29: D118–D119.
Rhee, S.Y., Beavis, W., Berardini, T.Z., Chen, G., Dixon, D. et al. (2003) Nucleic Acids Res. 31: D224–D228.
Satoh, H. and Omura, T. (1981) Jpn. J. Breed. 31: 316–326.
Thirup, P. and Nielsen, K.H. (2002) Ugeskr Laeger 164: 4553–4554.
Ware, D.H., Jaiswal, P., Ni, J., Yap, I.V. and Pan, X. (2002) Plant Physiol. 130: 1606–1613.
Yamazaki, Y., Nagato, Y. and Kurata, N. (2003) Rice Genet. Newslett. 20: 9–10.

(Received November 1, 2004; Accepted November 13, 2004)
