Bioinformatics

A Community-Based Annotation Framework for Linking Solanaceae Genomes with Phenomes [1][C][OA]

Naama Menda, Robert M. Buels, Isaak Tecle, and Lukas A. Mueller*

Department of Plant Breeding and Genetics, and Boyce Thompson Institute for Plant Research, Cornell University, Ithaca, New York 14853

The amount of biological data available in the public domain is growing exponentially, and there is an increasing need for infrastructural and human resources to organize, store, and present the data in a proper context. Model organism databases (MODs) invest great efforts to functionally annotate genomes and phenomes by in-house curators. The SOL Genomics Network (SGN; http://www.sgn.cornell.edu) is a clade-oriented database (COD), which provides a more scalable and comparative framework for biological information. SGN has recently spearheaded a new approach by developing community annotation tools to expand its curational capacity. These tools effectively allow some curation to be delegated to qualified researchers, while, at the same time, preserving the in-house curators’ full editorial control. Here we describe the background, features, implementation, results, and development road map of SGN’s community annotation tools for curating genotypes and phenotypes. Since the inception of this project in late 2006, interest and participation from the Solanaceae research community has been strong and growing continuously to the extent that we plan to expand the framework to accommodate more plant taxa. All data, tools, and code developed at SGN are freely available to download and adapt.

Biological databases have become one of the principal drivers of research and innovation in biology. For plants, model organism databases (MODs), such as The Arabidopsis Information Resource (TAIR; Swarbreck et al., 2008) and MaizeGDB (Lawrence et al., 2007), contain enormous amounts of high-quality annotated data and have become one of the pillars of modern genome-scale biology.
[1] This work was supported by the National Research Initiative Plant Genome Program of the U.S. Department of Agriculture Cooperative State Research, Education, and Extension Service (BARD grant no. FI–370–2005) and the National Science Foundation (grant no. 2007–02777).
* Corresponding author; e-mail [email protected].
The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Lukas A. Mueller ([email protected]).
[C] Some figures in this article are displayed in color online but in black and white in the print edition.
[OA] Open Access articles can be viewed online without a subscription.
www.plantphysiol.org/cgi/doi/10.1104/pp.108.119560

A complete set of annotation data provides a whole picture for each locus in the genome: its sequence, function, phenotypes and images, literature and controlled vocabulary annotations, gene interactions, and paralogous and orthologous genes. Such sequence annotations are crucial resources for the research community in many endeavors, such as the identification of genes and their products (Stein, 2001).

One of the persistent challenges to any database is to keep it reflective of current knowledge, as new relevant data that augment or replace the existing data are being published rapidly. The early sequenced genomes, such as Mus musculus (Eppig et al., 2007), Drosophila (Crosby et al., 2007), and Arabidopsis (Arabidopsis thaliana; Arabidopsis Genome Initiative, 2000), benefited from large functional annotation efforts that relied on large numbers of professional curators. In plant biology, the Arabidopsis genome annotation was notably successful. Today, it provides a basis for genome annotations in other plants, particularly annotations related to basic cellular and developmental biology. However, for the databases and plant community, two important limitations remain.
First, these model organism systems cannot be used to annotate the specific biology of other plants or plant clades, and, second, the centralized approach is not scalable beyond the existing model organisms without a concomitant scaling up of funding. Therefore, radically different methods must be developed to annotate more organisms, such as those of the Solanaceae clade, and to enhance the quality and scale of curation. The most compelling prototype approaches involve the research community in the annotation process in some way. We refer to these strategies broadly as community annotation. Currently, annotation jamborees are the most commonly practiced form of community annotation (Pennisi, 2000; Ohyanagi et al., 2006; Riley et al., 2006; http://www.sanger.ac.uk/HGP/havana/hawk.shtml). Unfortunately, the time and cost constraints on jamborees usually mean that they do not take full advantage of the rich granular controlled vocabulary terms (ontologies) and phenotypic descriptors. In addition, jamborees require significant logistics to organize, and their infrequent occurrence means that the data may not be current. Therefore, for databases to keep pace with emerging data, there is a need to develop annotation tools that are participatory and user-friendly enough to allow authors to submit their data to relevant databases immediately after publication.

Plant Physiology, August 2008, Vol. 147, pp. 1788–1799, www.plantphysiol.org © 2008 American Society of Plant Biologists. Downloaded from on June 17, 2017 - Published by www.plantphysiol.org Copyright © 2008 American Society of Plant Biologists. All rights reserved.

Herein we describe a community annotation approach for gene and phenotype data that leverages the existing database infrastructure at SOL Genomics Network (SGN), including data from the ongoing International Tomato Sequencing Project (The Tomato Sequencing Consortium, unpublished data).
We think this approach will be successful because: (1) the Solanaceae research community has a well-established tradition of unrestricted collaboration and sharing of data and materials; (2) this community annotation software is written with user-friendliness as a primary design goal, enabling scientists to utilize structured vocabularies and other advanced annotation tools with relatively little training; (3) SGN curators routinely provide necessary guidelines and technical support to community annotators; and, most importantly, (4) there is a significant worldwide social trend toward open collaboration and data sharing on the Web.

We are in the midst of a revolution in how people use computers to share data on the Web, as evidenced by the recent success of social networking sites that have made sharing user-generated content popular. Among those sites are Flickr (www.flickr.com) for photos and YouTube (www.youtube.com) for short videos. Bioinformatics has a long tradition of sharing information, programs, and code; Web sites and projects for hosting Open Source software include SourceForge.net, BioPerl (Stajich et al., 2002), and GMOD (www.gmod.org). This social networking and data-sharing movement, combined with the new paradigm for the Web often termed Web 2.0, which relies heavily on technologies that provide a richer and more user-friendly experience, supplies the critical ingredients for bringing successful community annotation to biology.

The Solanaceae are an excellent system to showcase such community annotation systems. With their exceptionally conserved genomes, yet extremely diverse phenotypic variation and adaptations to natural and agricultural environments, they comprise species such as tomato (Solanum lycopersicum), potato (Solanum tuberosum), pepper (Capsicum annuum), and petunia (Petunia hybrida), which are important model systems for research as well as important food crops or commercial products.
Our system builds upon the bioinformatics platform for addressing Solanaceae diversity, SGN (http://www.sgn.cornell.edu), a clade-oriented database (COD) containing genomic, genetic, and taxonomic information (Mueller et al., 2005). SGN is also the bioinformatics hub for the ongoing international project to fully sequence the euchromatic portion of the tomato genome. This project will provide a high-quality reference to interpret the sequence organization of other Solanaceae crops and serve as the basis for understanding how plants diversify and adapt to new and adverse environments. Thus, the tomato genome coupled with automated and user-contributed gene annotations will reveal novel phenotypes of agronomic and commercial value for the entire Solanaceae and related families of the Asterid clade.

RESULTS

The SGN community annotation effort has produced the necessary software for user-friendly Web interfaces for annotation and data display, back-end data modeling, storage, and auditing. The ease of use of the annotation tools combined with clear annotation guidelines has encouraged the Solanaceae research community to actively participate in the annotation process as measured by the continued increase in number of locus and phenotype annotations.

Community Interest and Participation

At the time of this writing, approximately 12 months after the introduction of community annotation functionalities on SGN, a total of 183 loci have been annotated by the community. Ninety-five of these loci have designated editors, 42 in total, who are experts on the locus or loci. The extent of annotation by the community ranges from creating a new locus or phenotype entry to adding or editing data to an existing entry. The contributed annotations include alleles, sequences, publications, ontology term annotations, images, phenotyped accessions, and locus-locus associations.
The phenotype database also contains user-submitted information, including more than 6,000 phenotyped accessions of 17 distinct populations. Phenotypes are usually batch loaded into the database by SGN curators, and the submitter has editorial privileges in a similar manner to the locus database. Phenotypes can also be added manually via the Web interface (see “Materials and Methods”).

Solanaceae Locus Module

A gene is defined as the genomic sequence corresponding to a transcribed unit in the genome. The Solanaceae, and tomato in particular, have rich historic collections of gene descriptions based on morphological and biochemical phenotypes (Butler, 1952; Eshed and Zamir, 1995), often without a known sequence or gene product. Moreover, further molecular analysis of a given locus may reveal more than one gene product per locus. We decided to use the more general term locus to refer to these genes in an attempt to maintain data consistency in the face of these challenges. Each locus in our database has a unique name and symbol and must be associated with an organism. Currently, our database contains locus information for tomato, potato, pepper, eggplant (Solanum melongena), tobacco (Nicotiana tabacum), and henbane (Hyoscyamus niger; Table I). Locus data include links to GenBank accessions, supporting literature, SGN markers and unigenes, and Gene Ontology (GO) and Plant Ontology (PO) annotations (Fig. 1). To aid the community in locus annotation for a given species, we periodically update the database with selected bulk information from these sources. This allows the community annotators to utilize and complement this information with their manual annotations. To populate the database, we have developed an automated pipeline to process new locus data and upload and update existing links and annotations (see “Materials and Methods”).

Table I. Locus data type by organism
Number of loci, accessions (phenotypes), literature entries, GenBank accessions, SGN unigenes, and GO and PO annotations for each organism in the locus database.

Organism   Loci   Accessions  Publications  GenBank Sequences  SGN Unigenes  GO      PO
Tomato     1,660  6,683       610           1,046              687           3,902   291
Potato     1,038  1           531           2,064              2,919         4,098   611
Pepper     375    0           146           500                504           1,547   257
Petunia    356    0           274           442                563           1,660   265
Coffee     112    0           45            221                112           506     36
Eggplant   48     237         17            53                 13            234     17
Tobacco    10     0           12            8                  0             16      2
Henbane    3      0           3             4                  0             3       1
Datura     2      0           2             2                  0             4       1
Total      3,604  6,921       1,640         4,340              4,798         11,970  1,481

User Interface

Gene/Locus Search. The search application for the locus database allows users to search for loci using locus or allele names, symbols, and/or synonyms. In addition, more advanced search criteria are available for limiting search results to a specific organism; allele-associated phenotype; chromosome; GO or PO term name, synonym, or ID; name of a locus editor; or GenBank accession; or to loci that have an associated gene sequence, a mapped location, or ontology annotation. Search results are displayed as a list of loci matching the search parameters with links to separate pages showing the details of each locus. Because the search application searches both the locus and allele datasets and there may be multiple alleles for each locus, the same locus may be shown in the result table multiple times, once for each allele. Clicking a locus name in the search results displays the locus page with the following sections.

Locus Details. This section contains mandatory fields for the locus name and symbol and optional fields for gene activity, description, chromosome, and chromosome arm.
These fields can be edited by an assigned locus editor or a curator, but not by other users. If the locus has a known chromosomal location, a chromosome glyph is shown, and if the locus is associated with a known marker, the marker is shown on the glyph in its genetic map location (Fig. 2A). The chromosome glyph and the marker name are clickable links to the SGN comparative viewer (Mueller et al., 2008) and SGN marker detail page, respectively. Any SGN submitter has the ability to add and delete locus synonyms. Clicking on the add/remove link leads to the locus synonyms page, where synonyms may be added or removed. If the locus information was obtained from another organization, a corresponding link is displayed. At the bottom of the section is information about the locus editors and the editing history. Every locus entry is assigned one or more editors who have full editing and deleting privileges. The name of each editor is shown as a clickable link leading to the editor’s personal details page (see “Materials and Methods”), where users can find contact details of the editor, followed by the date when the locus was created in the database, the date of its last update, and which editor made the last update.

Figure 1. Graphic representation of the major data types in the SGN schema for representing loci and phenotypes and their associated data. The two central data types are locus for storing gene information and accession for storing phenotype data. Both data types are interlinked and cross-reference to images, genetic map locations, the literature, and controlled vocabulary terms (ontologies). Phenotypes are linked to populations, and loci have sequence annotations to GenBank and SGN unigenes, which link to further information such as genetic markers, bacterial artificial chromosome sequences, and metabolic pathways (SolCyc database). This schema interacts highly with the Chado schema (Mungall et al., 2007). Chado tables are not shown in the figure. [See online article for color version of this figure.]

Figure 2. SGN locus module. A, Web user-editable locus details section. The interface grants edit privileges to locus editors and curators. A chromosome glyph with genetic mapping information is printed on the right. Clicking on the chromosome opens the comparative viewer. Clicking on a marker name opens a marker info page. B, Images are displayed on the locus page and provide links to the phenotype database. C, Metabolic pathway information. Clicking on the chemical reaction glyph opens the SolCyc reaction page. [See online article for color version of this figure.]

Notes and Figures. In this section, submitters can submit graphic figures or photographs about a locus. Because the ability to add figures to locus pages was introduced very recently, only 43 figures have been submitted by community members so far. We expect this number to increase rapidly.

Accessions and Images. This section contains a list of phenotyped accessions that are known to express the locus and, if available, images depicting the phenotype under the control of the locus (Fig. 2B). As of this writing, 657 loci are associated with a total of 4,039 accessions. Accessions or images listed in the locus page are linked to the accession detail page, where the accession’s phenotypic and genotypic properties are displayed. Information on accessions can only be updated by the original locus editors, but new images and other auxiliary information can be added by any SGN submitter/user.

Known Alleles.
This section lists variant forms (alleles) of the locus, accessions that harbor the alleles, and phenotypes of the alleles. Currently, there are 997 loci with a total number of 1,249 alleles; 262 of the alleles are identified in 553 accessions.

Associated Loci. Locus-to-locus associations provide a flexible mechanism for describing the relationship between loci and, by extension, gene networks. The basis for a locus-to-locus relationship includes homology, coexpression, and shared pathway. As of this writing, there are 59 such relationships in the database, representing 28 gene networks from multiple organisms. An interactive tool for browsing these networks is in late-stage development.

SolCyc Links. If a locus codes for a gene involved in small-molecule metabolism, a link is displayed leading to SolCyc, SGN’s biochemical pathway database for the Solanaceae (http://solcyc.sgn.cornell.edu). The links are shown as small glyphs representing the chemical reaction (Fig. 2C). As of this writing, there are 132 biochemical pathways associated with 86 distinct loci. As more loci are annotated with unigene identifiers and more reactions are curated in SolCyc, more pathways will be associated with SGN loci.

Sequence Annotation. The sequence underlying a locus is represented in this section as an internal link to one or more SGN unigene sequences (Mueller et al., 2005), generated from a build of sequenced EST libraries and mRNA or gene sequences from GenBank repositories. The locus-unigene association provides more information about a gene’s predicted peptides, available marker and microarray resources, InterPro domains with GO annotations, and inclusion in gene families. These data can be dynamic because new unigene and gene family builds are constructed periodically with newly submitted sequences. Currently, 2,179 loci have a total of 4,811 unigene annotations, out of which 3,678 unigenes are unique.
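As an illustration of how the locus-to-locus associations described under Associated Loci could be grouped into gene networks, the following minimal Python sketch treats each association as an undirected edge and reports the connected components of the resulting graph. It assumes that a "network" is simply a connected group of loci; the locus symbols, relationship labels, and the function name `gene_networks` are hypothetical and are not taken from the SGN code base.

```python
from collections import defaultdict


def gene_networks(associations):
    """Group loci into networks (connected components of an undirected graph).

    `associations` is an iterable of (locus_a, locus_b, relationship) tuples.
    The relationship label is carried along but not used for grouping.
    """
    neighbours = defaultdict(set)
    for locus_a, locus_b, _relationship in associations:
        neighbours[locus_a].add(locus_b)
        neighbours[locus_b].add(locus_a)

    seen = set()
    networks = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first traversal collects every locus reachable from `start`.
        component = set()
        stack = [start]
        while stack:
            locus = stack.pop()
            if locus in component:
                continue
            component.add(locus)
            stack.extend(neighbours[locus] - component)
        seen |= component
        networks.append(sorted(component))
    return networks


# Hypothetical associations, for illustration only.
associations = [
    ("locusA", "locusB", "homology"),
    ("locusB", "locusC", "coexpression"),
    ("locusD", "locusE", "shared pathway"),
]
networks = gene_networks(associations)
```

With these three made-up associations the sketch yields two networks, one containing locusA, locusB, and locusC and one containing locusD and locusE. Under this reading, the number of networks is always smaller than or equal to the number of stored relationships.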
Also listed in the Sequence Annotation section are genomic and mRNA sequences of the locus retrieved from GenBank. So far, 2,622 loci are annotated with 4,320 unique GenBank sequence accessions.

Literature Annotations. This section displays publications that document the loci or have relevant data on the loci. Currently, the database stores 1,012 publications associated with 1,304 unique locus entries, gleaned from both periodic bulk data uploads and individual submissions by curators and submitters (see “Materials and Methods”). This bulk literature update function helps keep literature annotations up to date by allowing users and curators to load and associate high volumes of literature citations as they are published without making time-consuming individual requests or waiting for the periodic update. Advanced text-mining tools will be adapted in the future for large-scale literature parsing (Muller et al., 2004).

Ontology Annotations. This section is used to describe the functional and phenotypic properties of a locus using structured language (ontologies). Ontologies developed by the GO and PO consortia are used to characterize the biological processes, molecular functions, cellular components, plant anatomy, and plant growth stages in which a locus is involved. Ontology annotations are assigned automatically and manually (Fig. 3). Using both methods of annotation, 2,426 loci are annotated with 11,993 GO and 1,481 PO terms (Fig. 4). Because the same structured language is also used in other major plant databases (Lawrence et al., 2007; Liang et al., 2008; Swarbreck et al., 2008), ontology annotations allow cross-species comparative analyses.

Solanaceae Phenotype Module

A phenotype is the observable trait of an individual. Phenotyping records are kept with individual accessions because the phenotypes of single plants may vary with genetic background, the environment, phenotyping methodologies, and human inconsistencies in scoring for traits.
Each accession in the phenotype database has a unique name and is associated with a population (see “Materials and Methods”). Currently, the database contains 6,921 accessions from 17 populations (Table II). Individual accession data include images, underlying loci and alleles, phenotypic attributes, the genetic makeup of each plant, germplasms, and ontology annotations (Fig. 1). The database is usually populated with batch information for large datasets. Accession entries, for example mutants, can also be added to the database by submitters using the Web interface.

User Interface

Phenotype Search. The phenotype search function allows users to search for accessions using keywords from the name or phenotype descriptors. An advanced search can be done using filters for a specific population, a PO or Solanaceae Phenotype (SP) term, the name of an accession editor, accessions with associated loci, or accessions with ontology annotation. Search results are displayed as a list of accessions matching the search parameters with links to separate pages showing the details of each accession. Clicking an accession name in the search results displays the accession detail page (Fig. 5), divided into the following sections.

Accession Details. Each accession in the database has a unique name, free-text description, population name, the name of the person who submitted the record, and references to loci identified in the accession (Fig. 5A). Each accession may be associated with more than one locus if it carries variation in more than one gene. As of this writing, the community annotation database has information on accessions of tomato and eggplant mutants, cultivars, mapping and quantitative trait loci (QTL) populations, breeder lines, transgenic accessions, and introgression lines.

Figure 3. The ontology term annotation tool, available both on locus and on accession pages. Curators and submitters can select an ontology to browse from a drop-down menu. While typing an ontology term name or ID, a list of matches is displayed in the text area. When selecting a term from the results list, the user is required to choose a relationship type and an evidence code supporting the annotation. The field of evidence description is populated based on the selected evidence code. The fields of evidence with and reference are populated with the object’s associated sequences and literature references. Clicking on the associate ontology button stores the selected information in the database along with the user details and date, and the annotation is displayed on the Web page. [See online article for color version of this figure.]

Images. In this section, images depicting the phenotype are displayed (Fig. 5B). Of the 14,785 images in the SGN images database, 7,963 images are displayed in association with relevant loci and phenotyped accessions.

Phenotype Data. If an accession was part of a population study and has quantitative data on traits (Gonzalo and van der Knaap, 2008), the phenotypic values of the traits for the accession along with the population parameters, including mean and range values for each trait, are displayed (Fig. 5C).

Genotype Data. For accessions with associated mapping data, such as mapped markers for known genes, flanking markers of introgressions, or marker scores for a mapping population, a linkage map is displayed, representing the genotype of the accession (Fig. 5D). Mapping data are currently available for 152 introgression lines (Eshed and Zamir, 1995) and 863 individuals from seven QTL populations (http://sgn.cornell.edu/cview/map.pl?map_id=17).

Alleles. Accessions carrying variation associated with a locus may also be associated with a specific allele.
Each listed allele links back to its locus page and to the allele page (also linked from the locus page), where allele data can be edited. There are 542 accessions associated with 553 allele descriptors of 129 distinct Solanaceae loci; 1,704 accessions have links to loci without a specific allele association (linked to the default allele).

Germplasms. An accession may be available for ordering from a stock center; such accessions have a list of their germplasms, each with a link to where it can be ordered.

Ontology Annotations. The interface for displaying ontology terms associated with the recorded phenotypes is similar to the locus module (Fig. 3). However, additional vocabulary terms can be used for describing SP variation. We have mapped the descriptors developed for categorizing Tomato Genomics Resource Center (TGRC) accessions (http://tgrc.ucdavis.edu) and the tomato monogenic mutant population (Menda et al., 2004) to an OBO format of an SP ontology (ftp://ftp.sgn.cornell.edu/ontology/SP.obo). These terms, in addition to other descriptors used for characterizing Solanaceae traits and phenotypes, are also mapped to the Phenotype and Trait Ontology (PATO; http://www.bioontology.org/wiki/index.php/PATO:Main_Page) with the objective of providing a semantic framework for querying different databases using a common language. Mapping files, annotation association files, and the SP ontology can be downloaded from ftp://ftp.sgn.cornell.edu/ontology.

Figure 4. Locus ontology annotations by species. Number of annotations by controlled vocabulary name for each species. [See online article for color version of this figure.]
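Because the SP ontology is distributed as an OBO flat file, it can be read with a few lines of code. The following minimal Python sketch shows one way to collect the `id`, `name`, and `is_a` tags of `[Term]` stanzas. It handles only those three tags and ignores everything else; the sample term IDs and names are invented for illustration and are not actual entries of SP.obo, and the function name `parse_obo_terms` is hypothetical.

```python
def parse_obo_terms(text):
    """Return {term_id: {"name": str, "is_a": [parent_id, ...]}} for [Term] stanzas."""
    terms = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = {"name": "", "is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # header lines, blank lines, and non-Term stanzas
        tag, _, value = line.partition(":")
        value = value.split(" ! ")[0].strip()  # drop a trailing "! comment"
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["is_a"].append(value)
    return terms


# Invented sample in OBO syntax; not real SP ontology content.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: SP:0000001
name: fruit morphology

[Term]
id: SP:0000002
name: fruit shape
is_a: SP:0000001 ! fruit morphology

[Typedef]
id: part_of
name: part of
"""

terms = parse_obo_terms(SAMPLE_OBO)
```

For production use, an established OBO parser would be a safer choice than this sketch, since real ontology files contain many more tags (synonyms, cross-references, obsolete flags) than are handled here.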
DISCUSSION

We have developed a comprehensive database for the community annotation of loci and phenotypes, providing functionality for extensive annotation based on free-text descriptions, controlled vocabularies, images, sequences, and literature references. Users with submitter accounts can contribute information using easy-to-use Web interfaces. All submitted data are immediately visible to all users, facilitating review and discussion of annotations as they emerge. While only submitters can modify data, all registered users can contribute knowledge using the forum-like comments option available on each page. Quality control pipelines and rigorous submission tracking ensure that only high-quality annotations are published on the site. As of March 2008, the database contains 3,604 loci, 1,014 publications, and 6,921 plant accessions (Table I). There are 42 community submitters who contributed most of the phenotyped accessions and information on approximately 200 loci.

Table II. Phenotypes by population
Number of accessions, associated loci, alleles, images, and SP annotations per population.

Population         Accessions  Associated Loci  Associated Alleles  Images  SP Annotations
Tomato EMS         2,537       27               4                   1,758   4,233
Tomato FN          809         7                0                   478     1,433
Eggplant EMS       237         0                0                   320     356
TGRC               1,962       3,872            426                 879     1,729
ILs                152         4                1                   264     1
Tomato F2 2000     88          0                0                   2,469   1
Tomato cultivars   312         122              122                 858     0
Yellow Stuffer F2  200         0                0                   205     0
Howard German F2   113         0                0                   113     0
Howard German BC1  100         0                0                   101     0
Sausage F2         111         0                0                   132     0
Sun1642 F2         100         0                0                   194     0
Banana Legs F2     99          0                0                   98      0
Rio Grande F1      94          0                0                   94      0
Transgenic lines   3           3                0                   0       0
Breeder lines      3           3                0                   0       0
Mutant lines       1           1                0                   0       0
Total              6,921       4,039            553                 7,963   7,753

Figure 5. SGN accession page. A, The details section contains population and submitter information, followed by underlying loci entries. B, Images. C, Quantitative phenotypes. D, Genotype data. [See online article for color version of this figure.]

This community annotation effort creates a medium and tools for the Solanaceae research community to annotate their genes and phenotypes, thereby helping to ensure that the data in the database are as accurate, current, and accessible as possible. Nevertheless, community annotation is only one aspect of the curational capacity at SGN and complements the larger scale of automated and in-house-curator annotations.

Critical metrics for our system’s success are the number of community annotators and the number of annotations they make. If the current rate of subscription continues, we expect the number of community annotators to grow by about 100 every year. Our outreach program actively solicits contributions from leading scientists through direct e-mail contact, presentations at conferences, and publications in leading journals. Our goal is to have at least 200 annotators by the end of 2009, a number we regard as the critical mass needed for the system to be useful. We predict that, in a few years, online annotations will be a normal part of any biologist’s routine and our system will scale well to thousands of annotators.

The literature contains a vast amount of information on genes and mutants that has yet to be integrated into any electronic database in a format that makes it computationally accessible. The SGN community database aims to help close this gap by providing an easy way for the most knowledgeable members of the community to contribute this information.
Our system allows submitters to edit almost any data type associated with a locus or a phenotype, so even partial data excluded from publication, corrections, or supplemental information can be added to SGN community annotation pages. In 2007 alone, more than 1,300 Solanaceae-related papers were published (about 90% of which were on either tomato or potato) and more than 150 Solanaceae mRNA sequences were submitted to GenBank. Researchers from the community are by far the best resource for reviewing gene information and extracting relevant data from their own publications. Due to space limitations or focus on specific traits or processes, papers and supplementary materials do not always include all useful data gained from experiments. SGN provides the research community with a platform for sharing supplementary information that may be useful for other members of the research community.

The inferred cross-links between phenomes and genomes provide a resource for studying genome evolution and the resulting phenotype variation in plants (Ori et al., 2007; Xiao et al., 2008). While such relationships are most accurately derived from manual curation of published experimental results, large-scale links can be generated by comparative analyses of traits and gene expression patterns in closely related organisms.

In recent years, much progress has been made in defining standard controlled vocabularies for biology, called ontologies, which provide standard machine-readable ways to describe general processes shared by different organisms. Ontologies greatly facilitate meaningful cross-species queries between disparate databases by providing a common semantic framework that can be used in searches and comparisons. Among ontologies, the GO (Gene Ontology Consortium, 2008) and PO (Avraham et al., 2008) are the most extensively used vocabularies for annotation of genes and phenotypes in databases such as Gramene, TAIR, and MaizeGDB.
Gene and phenotype annotation with common controlled vocabularies facilitates finding genes that share similar, but not necessarily identical, attributes in function, morphology, and development. Our user-friendly tools encourage and greatly help users to annotate their genes of interest with controlled vocabularies.

Beyond the borders of the Solanaceae community, several other approaches have recently been developed. For example, EcoliWiki (http://ecoliwiki.net) has deployed an installation of MediaWiki (http://mediawiki.org) as a hub for community annotation of Escherichia coli K-12. Wikis have advantages in that they are simple to set up and maintain, user-friendly, and already familiar to many users due to the popularity of Wikipedia (http://wikipedia.org). For bioinformatics purposes, the most significant limitation of traditional wikis is that the wiki’s content is stored in a mostly unstructured manner and without any semantic metadata. This tends to make large-scale automated analysis of wiki content difficult and error prone at best, which limits the usefulness of such resources because large-scale analyses are the bedrock of modern bioinformatics. In contrast, the SGN community annotation system stores data in a highly structured relational database, an ideal basis for large-scale bioinformatics analyses. Despite its limitations, wiki-style free-text editing can be an excellent option for community editing of information whose structure may not be known in advance. Currently, SGN community annotation pages allow submission of free-text comments at the bottom of each page that can be used for this purpose.

FUTURE DEVELOPMENTS

Ultimately, our objective is to present on each locus and phenotype page the entire story of a gene, including not only its descriptors, synonyms, and functions, but also its history, provenance, mapping, cloning, and sequencing, and all the experimental steps, people, and methods involved in its characterization.
Each page will essentially be presented as a free-standing publication, creating a permanent, yet evolving, entry that can be cited and referenced.

With the growing number of gene descriptors and annotations, MODs are becoming central actors in a community effort to develop a unified gene nomenclature and gold standards for annotation, not only for maintaining similar guidelines within an organism’s research community, but also for comparative searches across taxa. Journals are beginning to collaborate with databases to set nomenclature standards and naming conventions. Since July 2007, manuscripts submitted to Plant Physiology have been required to supply a TAIR locus identifier for Arabidopsis gene data (http://www.plantphysiol.org/misc/ifora.shtml). The benefits of this policy include prevention of nomenclature conflicts (since TAIR arbitrates the nomenclature) and assured availability of up-to-date gene information. We intend to provide the research community with a similar system of stable identifiers, naming conventions, and annotation standards for the Solanaceae.

CONCLUSION

SGN is the first among the major plant databases to put the control of the information directly in the hands of community experts, with SGN curators acting as editors in the annotation process, rather than exclusively as authors. As a result, SGN annotations are more up to date and richer in detailed descriptions, images, and several levels of gene-to-phenotype cross-links than would otherwise be possible without a large curatorial staff. We would be happy to collaborate with other research communities to help start community annotation efforts for other organisms and clades.
MATERIALS AND METHODS

Platform Technologies

SGN stores and indexes most of its data using the open source PostgreSQL database system (http://www.postgresql.org). Most software developed at SGN is written in object-oriented Perl and Javascript. User data submission forms are written using AJAX techniques to provide powerful and user-friendly interfaces. The SGN Web site uses the Apache (http://www.apache.org) Web server with the mod_perl integrated Perl interpreter. All servers and most development machines run the Debian distribution of the GNU/Linux operating system. More information on the database schemas, software, and setup at SGN can be found on the SGN Web site (http://sgn.cornell.edu).

Data Types

The first step toward implementing a system for representing phenotype-to-genotype relationships was to design a database schema for storing Solanaceae loci and phenotypes with cross-references between the two datasets (Fig. 1). The following conceptual data types are used, many of which map directly to Perl classes and/or tables in the PostgreSQL database.

Locus: Central data type representing descriptive genetic information of plausible transcribed units in the genome. A locus has a unique name and symbol, synonyms, allele data, and related sequences. It is annotated with supporting literature records, and its phenotypes are described using controlled vocabularies.

Allele: Alternative form of a locus. It may originate from natural or induced variation. Alleles allow representation of multiple products and phenotypes of a single locus in an organism.

Phenotype: Measurable traits and characteristics of individuals within a defined population. Phenotypes are stored as text descriptors of alleles and individual accessions, annotated images, quantitative measurements, and controlled vocabulary terms.

Accession: Single member of a predefined population. Annotated with phenotypic and genotypic attributes, such as images, locations on a genetic map, and controlled vocabulary terms. Cross-referenced with loci via associations with alleles (accession to allele to locus).

Population: Collection of individuals (accessions) sharing a common genetic background or a common phenotyping or genotyping scheme. A population may be genetically homogeneous, such as mutant collections in a specific background or isogenic inbred lines, or may be a heterogeneous collection of plants of different genetic backgrounds that have been characterized using similar methodologies.

Database Schema

In the PostgreSQL database system, data are represented as tables with rows and columns that hold data and also refer to other tables. To maintain a comprehensive audit trail for every data point change, each user-updateable table associated with the community annotation system stores certain standardized metadata for each database record, such as creation date, modification date, owner ID, the ID of the user who submitted an update, and obsoleteness information (Supplemental Fig. S1). The owner of the record is usually authorized to edit and delete (in practice, mark as obsolete) the information. All users can view the information.

The core structure of the database (Fig. 1) consists of a locus table for storing gene descriptors and an individual table for storing phenotype descriptors of accessions. More than 30 additional tables store related information, such as alleles, image data, and annotations. Plant accessions (individuals) can be linked to an allele through a linking table, allowing many relationships between loci, accessions, and images. Thus, accessions with mutations in several loci can be represented easily. Each locus has a default allele used as a placeholder to allow associating phenotypes with loci in the absence of allele information. Over time, as genes are sequenced and annotated with allele information, the locus-phenotype associations may be refined to include more specific allele information.
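The audit metadata, the obsolete flag, and the default-allele placeholder described above can be illustrated with a small sketch. It uses Python's built-in sqlite3 module in place of PostgreSQL, and every table name, column name, and helper function here (locus, allele, individual, individual_allele, is_default, sp_person_id, add_locus, and so on) is an illustrative assumption, not the actual SGN schema.

```python
import sqlite3

# Audit columns repeated on every user-updateable table (names are assumptions).
AUDIT = (
    "sp_person_id INTEGER NOT NULL, "               # owner of the record
    "updated_by INTEGER, "                          # user who submitted the last update
    "create_date TEXT DEFAULT CURRENT_TIMESTAMP, "
    "modified_date TEXT, "
    "obsolete INTEGER NOT NULL DEFAULT 0"           # 1 = hidden from Web displays
)

SCHEMA = f"""
CREATE TABLE locus (
    locus_id INTEGER PRIMARY KEY,
    locus_name TEXT UNIQUE NOT NULL,
    locus_symbol TEXT UNIQUE NOT NULL,
    {AUDIT});
CREATE TABLE allele (
    allele_id INTEGER PRIMARY KEY,
    locus_id INTEGER NOT NULL REFERENCES locus(locus_id),
    allele_symbol TEXT NOT NULL,
    is_default INTEGER NOT NULL DEFAULT 0,
    {AUDIT});
CREATE TABLE individual (
    individual_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    {AUDIT});
CREATE TABLE individual_allele (
    individual_id INTEGER NOT NULL REFERENCES individual(individual_id),
    allele_id INTEGER NOT NULL REFERENCES allele(allele_id),
    {AUDIT});
"""


def add_locus(db, name, symbol, owner):
    """Insert a locus together with its default (placeholder) allele."""
    cur = db.execute(
        "INSERT INTO locus (locus_name, locus_symbol, sp_person_id) VALUES (?, ?, ?)",
        (name, symbol, owner))
    locus_id = cur.lastrowid
    db.execute(
        "INSERT INTO allele (locus_id, allele_symbol, is_default, sp_person_id) "
        "VALUES (?, ?, 1, ?)", (locus_id, symbol, owner))
    return locus_id


def add_accession(db, name, owner):
    """Insert a phenotyped accession (an 'individual')."""
    cur = db.execute(
        "INSERT INTO individual (name, sp_person_id) VALUES (?, ?)", (name, owner))
    return cur.lastrowid


def link_accession(db, individual_id, locus_id, owner):
    """Associate an accession with a locus through the locus's default allele."""
    (allele_id,) = db.execute(
        "SELECT allele_id FROM allele WHERE locus_id = ? AND is_default = 1",
        (locus_id,)).fetchone()
    db.execute(
        "INSERT INTO individual_allele (individual_id, allele_id, sp_person_id) "
        "VALUES (?, ?, ?)", (individual_id, allele_id, owner))


def loci_for_accession(db, individual_id):
    """Follow accession -> allele -> locus, skipping obsolete rows."""
    rows = db.execute(
        "SELECT l.locus_symbol FROM individual_allele ia "
        "JOIN allele a ON a.allele_id = ia.allele_id "
        "JOIN locus l ON l.locus_id = a.locus_id "
        "WHERE ia.individual_id = ? AND ia.obsolete = 0 AND l.obsolete = 0 "
        "ORDER BY l.locus_symbol", (individual_id,))
    return [row[0] for row in rows]
```

In this sketch an accession carrying mutations in two loci is simply two rows in the linking table, and "deleting" a locus means setting its obsolete column to 1 so that it drops out of queries without being removed.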
Sequences, publications, biochemical pathways, controlled vocabulary terms, and other general data used for annotation are primarily stored using a slightly modified version of the GMOD Chado database schema (Mungall et al., 2007).

User Types and Privileges

From the outset, the community annotation system was designed with an eye toward participation from Web site users in a variety of roles. Each logged-in user has an account type, which is used as the first level of granularity for assigning database access and editing privileges. Web access to view all data is unrestricted and does not require registration. The default user type is user, which carries permission for posting comments on pages for loci and individuals and on other pages on the site. Submitter accounts are granted only to users who have been individually vetted by an SGN curator, since these accounts carry privileges for submitting new data and editing many existing entries. Submitter accounts are generally available to anyone with a legitimate interest and expertise in Solanaceae research, and a request for a submitter-class account is typically granted within 24 h. There is also a third user type for SGN staff, curator, which carries administrative privileges.

Any SGN submitter may add a new locus or request locus editing privileges for the purpose of curation and annotation of genes already existing in the database. To obtain locus editor privileges, a user must first create an account by clicking on the login link from the toolbar on any SGN page (http://www.sgn.cornell.edu/solpeople/login.pl) and following the instructions after clicking the sign up for an account link. This will create an account of type user. User accounts can be upgraded to submitter upon request by e-mailing [email protected] (using the link provided in the footer of every SGN Web page), or by requesting editor privileges for a specific locus by clicking on the request editor privileges link from the relevant locus page.
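The three account types and the per-locus editor privileges can be summarized as a small permission check. The following is a minimal sketch in Python, not SGN's Perl implementation; the class and function names are hypothetical, and the assumption that a curator is the one who approves an editor request (upgrading a user account to submitter in the process) is an interpretation of the text rather than a documented workflow.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class UserType(IntEnum):
    """Account types named in the text, ordered by increasing privilege."""
    USER = 1
    SUBMITTER = 2
    CURATOR = 3


@dataclass
class Account:
    username: str
    user_type: UserType = UserType.USER  # new accounts start as plain users


@dataclass
class Locus:
    symbol: str
    owner: str
    editors: set = field(default_factory=set)  # usernames with editor privileges


def can_comment(account):
    """Any logged-in account may post comments; anonymous visitors may only view."""
    return account is not None


def can_submit(account):
    """Submitters and curators may add loci, alleles, images, and annotations."""
    return account is not None and account.user_type >= UserType.SUBMITTER


def can_edit_locus(account, locus):
    """Curators may edit any locus; submitters need ownership or editor status."""
    if account is None:
        return False
    if account.user_type == UserType.CURATOR:
        return True
    return (account.user_type == UserType.SUBMITTER
            and (account.username == locus.owner
                 or account.username in locus.editors))


def grant_editor(curator, account, locus):
    """Assumed workflow: a curator approves a request for locus editor privileges."""
    if curator.user_type != UserType.CURATOR:
        raise PermissionError("only curators can grant editor privileges")
    # Approving the request also upgrades a plain user account to submitter.
    account.user_type = max(account.user_type, UserType.SUBMITTER)
    locus.editors.add(account.username)
```

Keeping the account type as an ordered enumeration makes the first level of granularity a simple comparison, while the per-locus editor set carries the finer-grained rights.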
SGN People Database

When a user logs into the SGN Web site, they are directed to the site’s central hub for user-based functions, called MySGN (http://www.sgn.cornell.edu/solpeople/top-level.pl), which provides entry points to many of the site’s user-based tools, including the community annotation functions. On this page, users with submitter accounts can find a summary of all loci for which they have editor privileges, as well as a list of their recently annotated loci. It also has a link for viewing all recent changes to the community annotations. Each user’s publicly visible SGN person detail page also shows a list of loci for which they are editors.

Community Annotation Tools

All user-editable data types have designated owners. Loci, alleles, phenotyped accessions, images, and annotations can be submitted by any SGN submitter or curator. The submitter becomes the object owner by default.

Loci

Since loci are complex data types, with several research groups working on different aspects of the same locus, the system allows multiple submitters to be assigned editor privileges for a given locus. When a locus editor is logged in, the edit and delete links on the locus page become active. Clicking the edit link opens an editable form (Fig. 2A), and clicking the delete link brings up a delete confirmation dialog. Any user with a submitter-class account can add synonyms, alleles, sequences, publications, and ontology annotations to a locus record.

Alleles

Allele information is useful for storing sequences and phenotype variation of a gene. In the locus database, an allele must be associated with one locus and have a unique name and symbol.
The mode of inheritance of the allele is designated as recessive, dominant, or partially dominant, and the phenotype of the allele is an optional free-text field. Alleles may also have any number of unique synonyms, editable by any SGN submitter in a similar manner to the locus synonyms, and associated accessions and images, which are usually a subset of the accessions associated with the allele’s locus. The allele-accession association (and image) is a more granular link between gene and phenotype data; however, if the underlying allele of a phenotyped accession is not recorded, the accession is linked only to the default allele of the locus. The allele owner has edit privileges for the allele information and can be different from the locus owner. Any user with a submitter account can add new alleles to an existing locus and, by doing so, become an allele owner. Alleles can also be associated with other data types from external databases, such as GenBank sequences and publications.

Gene Networks

Loci do not act in isolation, but are embedded in a network of other loci, with which they act in a pathway, are regulated, or interact in some other way. To represent these networks, we have introduced a tool for making locus-to-locus associations. This system includes a small ontology describing the interaction, as well as software tools: an easy-to-use AJAX (Asynchronous Javascript and XML) form for browsing the locus database (by name or symbol) and for providing a relationship type, an evidence code, and an optional literature reference that documents the association between loci. The association and its description are displayed on both related loci, regardless of which locus the association was initiated from.

Files and Images

Figures and images can be associated directly with a locus or an accession using an upload function on locus and on accession detail pages (Fig. 2B).
Submitters may upload photos, schematics, and documented notes, including PDF files, to add supplemental information or experimental results. The submitter is considered to be the copyright owner of the uploaded materials and grants a nonexclusive license to SGN to publicly display them. An image-specific detail page is available, with metadata including a description and the user who uploaded the image, who is usually also the owner of the image and has edit and delete privileges. As with loci and individuals, images are never deleted from the file system and the database, but are only set to obsolete in the image table. The image page also contains tags (general text descriptors) that may be added or deleted by any SGN submitter in a similar manner to the locus synonyms. All the objects associated with an image are printed below it, with links to the relevant Web pages. The same image may be associated with one or more individuals and also with other object types in SGN’s database, thus creating a general image object not restrained to a specific data type.

Accessions

Phenotyped accessions are named individuals in the database and have a designated editor, in a similar manner to the locus data type, usually the person who uploaded or contributed the accession to SGN. Editing privileges are granted only to the editor and curators, and Web-based edits are structured as described in the locus section (Fig. 5).

Sequence Annotations

The community annotation tools allow submitters and curators to associate SGN unigenes and GenBank accessions with loci. GenBank accessions are fetched dynamically from the National Center for Biotechnology Information (NCBI) using its eUtils (http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html) tool. If a PubMed citation is embedded in the retrieved sequence entry, the user can also choose to import the publication so that both GenBank accession and PubMed publication will be linked with the annotated locus.
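As a rough illustration of this fetch-and-import step, the sketch below builds an efetch request for a GenBank record and pulls any PubMed IDs out of the returned flat file. It is written in Python rather than SGN's Perl and does not reproduce SGN's code. The base URL is the current public E-utilities address (not the 2008 help page cited above), the function names are hypothetical, and the sample record is a made-up fragment used only to show the parsing.

```python
import re
from urllib.parse import urlencode

# Current public E-utilities endpoint; the paper predates this address.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nuccore", rettype="gb", retmode="text"):
    """Build an efetch URL that returns a GenBank flat-file record as text."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": retmode})
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


def pubmed_ids(genbank_text):
    """Collect the PubMed IDs listed in the REFERENCE blocks of a flat file."""
    return re.findall(r"^\s+PUBMED\s+(\d+)\s*$", genbank_text, flags=re.MULTILINE)


# Made-up record fragment (placeholder accession and PubMed ID), not real data.
SAMPLE_RECORD = """LOCUS       XX000000     1200 bp    mRNA    linear   PLN
REFERENCE   1  (bases 1 to 1200)
  AUTHORS   Doe,J.
  TITLE     Placeholder title
   PUBMED   12345678
ORIGIN
"""

if __name__ == "__main__":
    print(efetch_url("XX000000"))
    print(pubmed_ids(SAMPLE_RECORD))
```

Fetching the URL (for example with urllib.request.urlopen) is left out so the sketch runs without network access; a record with no embedded citation simply yields an empty list, in which case only the sequence would be linked to the locus.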
Literature Annotations

Submitters and curators can fetch publications from PubMed using PubMed IDs. Publications not available from PubMed can be added manually by filling in a form with citation details. Each publication can be linked with a locus and an allele entry.

Ontology Annotations

SGN uses three ontologies for annotating genes and phenotypes: GO, PO, and SP. The latest release of SP is available for download at ftp://ftp.sgn.cornell.edu/ontology/SP.obo. Automated locus annotations with GO and PO terms are obtained by sequence orthology to Arabidopsis (Arabidopsis thaliana), as inferred by sequence similarity (BLAST) and SGN family build membership. GO and PO annotations were downloaded from TAIR (http://Arabidopsis.org) and loaded as annotations on the matching SGN loci. We have developed a manual annotation tool for loci and phenotype accessions, providing an interface for searching the ontology by term name, synonym, or ID. Submitters and curators are prompted to provide supporting information such as literature citations and evidence codes (Fig. 3). User-contributed annotations are posted immediately on the locus page with the name and contact information of the submitter, but are then checked for consistency by SGN curators. Curator-verified annotations are submitted periodically by SGN to the GO and PO consortiums and are available for browsing on their respective Web sites. The SP ontology, developed at SGN, is mapped to PO and PATO terms for entity-quality-value annotations of qualitative and quantitative traits.

User Comments

Anyone with an SGN user account can post user comments on pages without requesting submitter privileges or locus ownership. This comment tool is sometimes useful for posting partial data or other free-text information about the locus, its annotations, and/or supporting data. The locus page can thus be an open forum to discuss what is known about a locus.
When a user posts a comment, an automatic e-mail is sent notifying SGN curators of the post, including the text of the comment. Inappropriate comments are removed immediately.

Data Upload Pipelines

The sources for original bulk data include NCBI (GenBank for sequences and PubMed for literature), the TGRC for loci and plant accessions, and individual labs. To aid the community in locus annotation for a given species, we periodically populate or repopulate the database with selected bulk information from these sources using automated data-processing pipelines. These data serve as seed for the community annotators to refine and build upon and/or complement data already existing in SGN. To populate the database, we have developed an automated processing pipeline to process new locus data and to upload and update existing links and annotations. Newly characterized genes are also added individually to the database by SGN curators or by members of the community as they are published in the public domain.

Biochemical Pathways

Locus-to-pathway associations are generated automatically from the unigene pathway annotations in the SolCyc database (http://solcyc.sgn.cornell.edu; Caspi et al., 2008). Whenever an SGN unigene with SolCyc annotations is associated with a locus, the link to the SolCyc reaction is inferred and displayed on the locus page (Fig. 2C).

Quality Control

Web Data User Edits

SGN has developed several mechanisms to ensure that the community annotation meets quality standards. SGN curators are notified by an e-mail feedback system when users upload, edit, or delete any data point. Details of who, when, and what was modified are sent by e-mail, in real time, to the SGN curators as well as stored in the database.
In addition, designated editors who are experts on the locus or phenotype have the privileges to verify the accuracy of the data. Deleting entries (locus or phenotype) and annotations using the Web system does not remove the information from the database, but only flags the item as obsolete, whereupon it is excluded from Web displays. This means that delete operations can be reverted, thus preventing data loss caused by accidents or malicious users. Back-end administration features allow SGN curators to view all annotation changes organized by date.

An additional layer of control on data input by the community is the restriction on the data type that can be added in a specific field. For example, to describe functions and phenotypes of a locus, annotators are limited to using ontology terms from a browsable list already existing in the SGN database. More generally, data that can be used to annotate, for example, a locus have to exist in an internal or external database. This minimizes random data entry and spamming.

Storing History

Each user-editable table has an associated history table, which stores the entire editing history. In addition to archiving the change history and allowing submitters to view information as it appeared in the past (useful if certain experiments were performed in the past on incomplete annotation), the history information can also be used to revert edits. Only locus editors and SGN curators can view the history information and revert edits. On each update of locus and accession details, the previous version of the information is transferred from primary tables in the relational database to a set of history tables, which are nearly identical in structure to the primary tables. When a locus owner or curator is logged in, the Web interface provides a clickable link to display all the changes previously made, and the name of the person who made each change.
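The copy-on-update pattern described here, together with deletion by obsolete flag, can be sketched in a few lines of Python. This is a simplified in-memory model under stated assumptions, not SGN's PostgreSQL implementation: the class name VersionedTable, its methods, and the field names are all hypothetical.

```python
import copy
from datetime import datetime, timezone


class VersionedTable:
    """In-memory stand-in for a primary table paired with its history table."""

    def __init__(self):
        self.rows = {}     # primary table: row id -> current version
        self.history = []  # history table: archived previous versions

    def _archive(self, row_id):
        """Copy the current version of a row into the history table."""
        self.history.append({
            "row_id": row_id,
            "row": copy.deepcopy(self.rows[row_id]),
            "archived_at": datetime.now(timezone.utc).isoformat(),
        })

    def insert(self, row_id, user, **fields):
        self.rows[row_id] = {**fields, "obsolete": False, "updated_by": user}

    def update(self, row_id, user, **changes):
        """Archive the previous version, then apply the change."""
        self._archive(row_id)
        self.rows[row_id].update(changes, updated_by=user)

    def delete(self, row_id, user):
        """'Delete' only flags the row as obsolete, so it can be reverted."""
        self.update(row_id, user, obsolete=True)

    def revert(self, row_id, user):
        """Restore the most recently archived version of a row."""
        previous = next(
            (h["row"] for h in reversed(self.history) if h["row_id"] == row_id),
            None)
        if previous is None:
            raise KeyError(f"no history for row {row_id}")
        restored = copy.deepcopy(previous)
        self._archive(row_id)  # the reverted state is itself kept in history
        self.rows[row_id] = {**restored, "updated_by": user}

    def visible(self):
        """Rows shown on Web pages: everything not flagged obsolete."""
        return {k: v for k, v in self.rows.items() if not v["obsolete"]}

    def changes(self, row_id):
        """Who made each archived version of a row, oldest first."""
        return [h["row"]["updated_by"] for h in self.history if h["row_id"] == row_id]
```

Because a revert archives the state it replaces, no version is ever lost, which matches the goal of a complete audit trail; a production system would more likely do the archiving in the database itself.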
This history module enables easy tracking and reverting of data, providing an essential undo function for managing community-generated content.

ACKNOWLEDGMENTS

We would like to thank Esther Van Der Knaap, Roger Chetelat, and Dani Zamir and all submitters for contributing data to the phenotyped populations and locus database, and Anuradha Pujar for contributing to the development of the Solanaceae Phenotype Ontology. We would also like to thank two anonymous reviewers for their helpful comments.

Received March 23, 2008; accepted May 9, 2008; published June 6, 2008.

LITERATURE CITED

Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796–815
Avraham S, Tung CW, Ilic K, Jaiswal P, Kellogg EA, McCouch S, Pujar A, Reiser L, Rhee SY, Sachs MM, et al (2008) The Plant Ontology Database: a community resource for plant structure and developmental stages controlled vocabulary and annotations. Nucleic Acids Res 36: D449–D454
Butler L (1952) The linkage map of the tomato. J Hered 43: 25–35
Caspi R, Foerster H, Fulcher CA, Kaipa P, Krummenacker M, Latendresse M, Paley S, Rhee SY, Shearer AG, Tissier C, et al (2008) The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res 36: D623–D631
Crosby MA, Goodman JL, Strelets VB, Zhang P, Gelbart WM, FlyBase Consortium (2007) FlyBase: genomes by the dozen. Nucleic Acids Res 35: D486–D491
Eppig JT, Blake JA, Bult CJ, Kadin JA, Richardson JE, Mouse Genome Database Group (2007) The mouse genome database (MGD): new features facilitating a model system. Nucleic Acids Res 35: D630–D637
Eshed Y, Zamir D (1995) An introgression line population of Lycopersicon pennellii in the cultivated tomato enables the identification and fine mapping of yield-associated QTL. Genetics 141: 1147–1162
Gene Ontology Consortium (2008) The Gene Ontology project in 2008. Nucleic Acids Res 36: D440–D444
Gonzalo MJ, van der Knaap E (2008) A comparative analysis into the genetic bases of morphology in tomato varieties exhibiting elongated fruit shape. Theor Appl Genet 116: 647–656
Lawrence CJ, Schaeffer ML, Seigfried TE, Campbell DA, Harper LC (2007) MaizeGDB’s new data types, resources and activities. Nucleic Acids Res 35: D895–D900
Liang C, Jaiswal P, Hebbard C, Avraham S, Buckler ES, Casstevens T, Hurwitz B, McCouch S, Ni J, Pujar A, et al (2008) Gramene: a growing plant comparative genomics resource. Nucleic Acids Res 36: D947–D953
Menda N, Semel Y, Peled D, Eshed Y, Zamir D (2004) In silico screening of a saturated mutation library of tomato. Plant J 38: 861–872
Muller HM, Kenny EE, Sternberg PW (2004) Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2: e309
Mueller LA, Mills AA, Skwarecki B, Buels RM, Menda N, Tanksley SD (2008) The SGN comparative map viewer. Bioinformatics 24: 422–423
Mueller LA, Solow TH, Taylor N, Skwarecki B, Buels R, Binns J, Lin C, Wright MH, Ahrens R, Wang Y, et al (2005) The SOL Genomics Network: a comparative resource for Solanaceae biology and beyond. Plant Physiol 138: 1310–1317
Mungall CJ, Emmert DB, FlyBase Consortium (2007) A Chado case study: an ontology-based modular schema for representing genome-associated biological information. Bioinformatics 23: i337–i346
Ohyanagi H, Tanaka T, Sakai H, Shigemoto Y, Yamaguchi K, Habara T, Fujii Y, Antonio BA, Nagamura Y, Imanishi T, et al (2006) The Rice Annotation Project Database (RAP-DB): hub for Oryza sativa ssp. japonica genome information. Nucleic Acids Res 34: D741–D744
Ori N, Cohen AR, Etzioni A, Brand A, Yanai O, Shleizer S, Menda N, Amsellem Z, Efroni I, Pekker I, et al (2007) Regulation of LANCEOLATE by miR319 is required for compound-leaf development in tomato. Nat Genet 39: 787–791
Pennisi E (2000) Ideas fly at gene-finding jamboree. Science 287: 2182–2184
Riley M, Abe T, Arnaud MB, Berlyn MK, Blattner FR, Chaudhuri RR, Glasner JD, Horiuchi T, Keseler IM, Kosuge T, et al (2006) Escherichia coli K-12: a cooperatively developed annotation snapshot—2005. Nucleic Acids Res 34: 1–9
Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, et al (2002) The BioPerl toolkit: Perl modules for the life sciences. Genome Res 12: 1611–1618
Stein L (2001) Genome annotation: from sequence to biology. Nat Rev Genet 2: 493–503
Swarbreck D, Wilks C, Lamesch P, Berardini TZ, Garcia-Hernandez M, Foerster H, Li D, Meyer T, Muller R, Ploetz L, et al (2008) The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res 36: D1009–D1014
Xiao H, Jiang N, Schaffner E, Stockinger EJ, van der Knaap E (2008) A retrotransposon-mediated gene duplication underlies morphological variation of tomato fruit. Science 319: 1527–1530
