Bioinformatics
A Community-Based Annotation Framework for Linking
Solanaceae Genomes with Phenomes1[C][OA]
Naama Menda, Robert M. Buels, Isaak Tecle, and Lukas A. Mueller*
Department of Plant Breeding and Genetics, and Boyce Thompson Institute for Plant Research, Cornell
University, Ithaca, New York 14853
The amount of biological data available in the public domain is growing exponentially, and there is an increasing need for
infrastructural and human resources to organize, store, and present the data in a proper context. Model organism databases
(MODs) invest great efforts to functionally annotate genomes and phenomes by in-house curators. The SOL Genomics
Network (SGN; http://www.sgn.cornell.edu) is a clade-oriented database (COD), which provides a more scalable and
comparative framework for biological information. SGN has recently spearheaded a new approach by developing community
annotation tools to expand its curational capacity. These tools effectively allow some curation to be delegated to qualified
researchers, while, at the same time, preserving the in-house curators’ full editorial control. Here we describe the background,
features, implementation, results, and development road map of SGN’s community annotation tools for curating genotypes
and phenotypes. Since the inception of this project in late 2006, interest and participation from the Solanaceae research
community has been strong and growing continuously to the extent that we plan to expand the framework to accommodate
more plant taxa. All data, tools, and code developed at SGN are freely available to download and adapt.
Biological databases have become one of the principal drivers of research and innovation in biology. For
plants, model organism databases (MODs), such as The
Arabidopsis Information Resource (TAIR; Swarbreck
et al., 2008) and MaizeGDB (Lawrence et al., 2007),
contain enormous amounts of high-quality annotated
data and have become one of the pillars of modern
genome-scale biology. A complete set of annotation
data provides a full picture of each locus in the
genome: its sequence, function, phenotypes and images, literature and controlled vocabulary annotations,
gene interactions, and paralogous and orthologous
genes. Such sequence annotations are crucial resources
for the research community in many endeavors, such
as the identification of genes and their products (Stein,
2001). One of the persistent challenges to any database
is to keep it reflective of current knowledge, as new
relevant data that augment or replace the existing data
are being published rapidly.
The early sequenced genomes, such as Mus musculus
(Eppig et al., 2007), Drosophila (Crosby et al., 2007),
and Arabidopsis (Arabidopsis thaliana; Arabidopsis Genome Initiative, 2000), benefited from large functional
annotation efforts that relied on large numbers of
professional curators. In plant biology, the Arabidopsis
genome annotation was notably successful. Today, it
provides a basis for genome annotations in other plants,
particularly annotations related to basic cellular and
developmental biology.

1 This work was supported by the National Research Initiative
Plant Genome Program of the U.S. Department of Agriculture
Cooperative State Research, Education, and Extension Service
(BARD grant no. FI–370–2005) and the National Science Foundation
(grant no. 2007–02777).
* Corresponding author; e-mail [email protected].
The author responsible for distribution of materials integral to the
findings presented in this article in accordance with the policy
described in the Instructions for Authors (www.plantphysiol.org) is:
Lukas A. Mueller ([email protected]).
[C] Some figures in this article are displayed in color online but in
black and white in the print edition.
[OA] Open Access articles can be viewed online without a subscription.
www.plantphysiol.org/cgi/doi/10.1104/pp.108.119560

However, for the databases and plant community,
two important limitations remain. First, these model
organism systems cannot be used to annotate the
specific biology of other plants or plant clades, and,
second, the centralized approach is not scalable beyond the existing model organisms without a concomitant scaling up of funding. Therefore, other radical
methods must be developed for annotating more
organisms, such as the Solanaceae clade, and also to
enhance the quality and scale of curation. The most
compelling prototype approaches involve the research
community in the annotation process in some way. We
refer to these strategies broadly as community annotation. Currently, annotation jamborees are the most commonly practiced form of community annotation (Pennisi, 2000;
Ohyanagi et al., 2006; Riley et al., 2006; http://www.sanger.ac.uk/HGP/havana/hawk.shtml). Unfortunately,
time and cost constraints usually prevent jamborees
from taking full advantage of the rich, granular controlled
vocabulary terms (ontologies) and phenotypic descriptors. In addition, jamborees require significant logistics to organize, and their infrequent occurrence
means that the resulting data may not be current.
Therefore, for databases to keep pace with emerging data, there is a need to develop annotation tools
that are participatory and user-friendly enough
to allow authors to submit their data to relevant
databases immediately after publication.
Plant Physiology, August 2008, Vol. 147, pp. 1788–1799, www.plantphysiol.org © 2008 American Society of Plant Biologists
Copyright © 2008 American Society of Plant Biologists. All rights reserved.
Herein we describe a community annotation approach for gene and phenotype data that leverages
the existing database infrastructure at SOL Genomics
Network (SGN), including data from the ongoing International Tomato Sequencing Project (The Tomato
Sequencing Consortium, unpublished data). We think
this approach will be successful because: (1) the Solanaceae research community has a well-established tradition of unrestricted collaboration and sharing of data
and materials; (2) this community annotation software
is written with user-friendliness as a primary design
goal, enabling scientists to utilize structured vocabularies and other advanced annotation tools with relatively
little training; (3) SGN curators routinely provide necessary guidelines and technical support to community
annotators; and, most importantly, (4) there is a significant worldwide social trend toward open collaboration
and data sharing on the Web.
We are amid a revolution in how people use computers to share data on the Web, as evidenced by the
recent success of social networking sites that have
made sharing user-generated content popular. Among
those sites are Flickr (www.flickr.com) for photos and
YouTube (www.youtube.com) for short videos. Bioinformatics has a long tradition of sharing information,
programs, and code; Web sites and projects that host
Open Source software include SourceForge.net, BioPerl (Stajich et al., 2002), and GMOD (www.gmod.org).
This social networking and data-sharing movement,
combined with the new paradigm for the Web
often termed Web 2.0, which relies heavily on
technologies that provide a richer and
more user-friendly experience, supplies the critical ingredients for bringing successful community annotation to
biology.
The Solanaceae are an excellent system to showcase
such community annotation systems. They combine exceptionally conserved genomes with extremely diverse
phenotypic variation and adaptations to natural and
agricultural environments, and they include
species such as tomato (Solanum lycopersicum), potato
(Solanum tuberosum), pepper (Capsicum annuum), and
petunia (Petunia hybrida) that serve as model
systems for research as well as important food crops or
commercial products. Our system builds upon the
bioinformatics platform for addressing Solanaceae diversity, SGN (http://www.sgn.cornell.edu), a clade-oriented database (COD) containing genomic, genetic,
and taxonomic information (Mueller et al., 2005).
SGN is also the bioinformatics hub for the ongoing
international project to fully sequence the euchromatic
portion of the tomato genome. This project will provide
a high-quality reference to interpret the sequence organization of other Solanaceae crops and serve as the basis
for understanding how plants diversify and adapt to
new and adverse environments. Thus, the tomato genome coupled with automated and user-contributed
gene annotations will reveal novel phenotypes of agronomic and commercial value for the entire Solanaceae
and related families of the Asterid clade.
RESULTS
The SGN community annotation effort has produced the necessary software for user-friendly Web
interfaces for annotation and data display, back-end
data modeling, storage, and auditing. The ease of use
of the annotation tools combined with clear annotation
guidelines has encouraged the Solanaceae research
community to actively participate in the annotation
process as measured by the continued increase in
number of locus and phenotype annotations.
Community Interest and Participation
At the time of this writing, approximately 12 months
after the introduction of community annotation functionalities on SGN, a total of 183 loci have been annotated by the community. Ninety-five of these loci have
designated editors, 42 in total, who are experts on the
locus or loci. The extent of annotation by the community ranges from creating a new locus or phenotype
entry to adding or editing data to an existing entry. The
contributed annotations include alleles, sequences,
publications, ontology term annotations, images, phenotyped accessions, and locus-locus associations. The
phenotype database also contains user-submitted information, including more than 6,000 phenotyped accessions of 17 distinct populations. Phenotypes are usually
batch loaded into the database by SGN curators and
the submitter has editorial privileges in a similar manner to the locus database. Phenotypes can also be
added manually via the Web interface (see ‘‘Materials
and Methods’’).
Solanaceae Locus Module
A gene is defined as the genomic sequence corresponding to a transcribed unit in the genome. The
Solanaceae and tomato, in particular, have rich historic
collections of gene descriptions based on morphological and biochemical phenotypes (Butler, 1952; Eshed
and Zamir, 1995), often without a known sequence or
gene product. Moreover, further molecular analysis of
a given locus may reveal more than one gene product
per locus. We decided to use the more general term
locus to refer to these genes in an attempt to maintain
data consistency in the face of these challenges.
Each locus in our database has a unique name and
symbol and must be associated with an organism.
Currently, our database contains locus information of
tomato, potato, pepper, eggplant (Solanum melongena),
tobacco (Nicotiana tabacum), and henbane (Hyoscyamus
niger; Table I). Locus data include links to GenBank
accessions, supporting literature, SGN markers and
unigenes, and Gene Ontology (GO) and Plant Ontology
(PO) annotations (Fig. 1). To aid the community in locus
annotation for a given species, we periodically update
the database with selected bulk information from these
sources. This allows the community annotators to
utilize and complement this information with their
manual annotations. To populate the database, we have
developed an automated pipeline to process new locus
data and upload and update existing links and annotations (see ‘‘Materials and Methods’’).

Table I. Locus data type by organism
Number of loci, accessions (phenotypes), literature entries, GenBank accessions, SGN unigenes, and
GO and PO annotations for each organism in the locus database.

Organism   Loci    Accessions  Publications  GenBank Sequences  SGN Unigenes  GO      PO
Tomato     1,660   6,683       610           1,046              687           3,902   291
Potato     1,038   1           531           2,064              2,919         4,098   611
Pepper     375     0           146           500                504           1,547   257
Petunia    356     0           274           442                563           1,660   265
Coffee     112     0           45            221                112           506     36
Eggplant   48      237         17            53                 13            234     17
Tobacco    10      0           12            8                  0             16      2
Henbane    3       0           3             4                  0             3       1
Datura     2       0           2             2                  0             4       1
Total      3,604   6,921       1,640         4,340              4,798         11,970  1,481

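The constraints stated for locus records (a unique name and symbol, and a mandatory organism) can be illustrated with a minimal in-memory sketch. This is not SGN's implementation; the class names, the field names, and the choice to enforce uniqueness per organism and case-insensitively are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Locus:
    name: str
    symbol: str
    organism: str
    genbank_accessions: List[str] = field(default_factory=list)


class LocusRegistry:
    """Hypothetical in-memory stand-in for a locus table."""

    def __init__(self) -> None:
        self._by_name: Dict[Tuple[str, str], Locus] = {}
        self._by_symbol: Dict[Tuple[str, str], Locus] = {}

    def add(self, locus: Locus) -> Locus:
        # Every locus must be associated with an organism.
        if not locus.organism:
            raise ValueError("a locus must be associated with an organism")
        # Assumption: uniqueness is checked per organism, ignoring case.
        name_key = (locus.organism, locus.name.lower())
        symbol_key = (locus.organism, locus.symbol.lower())
        if name_key in self._by_name:
            raise ValueError(f"duplicate locus name: {locus.name}")
        if symbol_key in self._by_symbol:
            raise ValueError(f"duplicate locus symbol: {locus.symbol}")
        self._by_name[name_key] = locus
        self._by_symbol[symbol_key] = locus
        return locus

    def __len__(self) -> int:
        return len(self._by_name)
```

In a relational database the same rules would normally be expressed as NOT NULL and UNIQUE constraints rather than application code.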
User Interface
Gene/Locus Search. The search application for the
locus database allows users to search for loci using
locus or allele names, symbols, and/or synonyms. In
addition, more advanced search criteria are available
for limiting search results to a specific organism, allele-associated phenotype, chromosome, GO or PO term
(by name, synonym, or ID), name of a locus editor, or GenBank
accession, or to loci that have an associated gene sequence,
a mapped location, or ontology annotation.
Search results are displayed as a list of loci matching
the search parameters with links to separate pages
showing the details of each locus. Because the search
application searches both the locus and allele datasets
and there may be multiple alleles for each locus, the
same locus may be shown in the result table multiple
times, once for each allele.
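The reason a locus can appear several times in the result table is that the search behaves like a join between loci and alleles. The sketch below shows that behavior on plain dictionaries; the record layout and the substring matching are assumptions for illustration, not the actual SGN query.

```python
def search_loci(loci, query):
    """Return (locus_name, allele_symbol) rows for a keyword query.

    A row is produced per allele, so a matching locus with several
    alleles is listed several times. allele_symbol is None for a
    matching locus that has no alleles.
    """
    q = query.lower()
    rows = []
    for locus in loci:
        locus_terms = [locus["name"], locus["symbol"], *locus.get("synonyms", [])]
        locus_hit = any(q in term.lower() for term in locus_terms)
        # Left-join style: a locus without alleles still yields one candidate row.
        alleles = locus.get("alleles") or [None]
        for allele in alleles:
            allele_hit = allele is not None and (
                q in allele["name"].lower() or q in allele["symbol"].lower()
            )
            if locus_hit or allele_hit:
                rows.append((locus["name"], allele["symbol"] if allele else None))
    return rows
```

Collapsing the rows to one per locus would only require deduplicating on the first element of each tuple.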
Clicking a locus name in the search results displays
the locus page with the following sections.
Locus Details. This section contains mandatory
fields for the locus name and symbol and optional
fields for gene activity, description, chromosome, and
chromosome arm. These fields can be edited by an assigned locus editor or a curator, but not by other users.
If the locus has a known chromosomal location, a
chromosome glyph is shown, and if the locus is associated with a known marker, the marker is shown on
the glyph in its genetic map location (Fig. 2A). The
chromosome glyph and the marker name are clickable
links to the SGN comparative viewer (Mueller et al.,
2008) and SGN marker detail page, respectively.
Any SGN submitter has the ability to add and delete
locus synonyms. Clicking on the add/remove link
leads to the locus synonyms page, where synonyms
may be added or removed.
If the locus information was obtained from another
organization, a corresponding link is displayed.
Figure 1. Graphic representation of the
major data types in the SGN schema for representing loci and phenotypes and their associated data. The two central data types are
locus for storing gene information and accession for storing phenotype data. Both data
types are interlinked and cross-reference
to images, genetic map locations, the literature, and controlled vocabulary terms
(ontologies). Phenotypes are linked to populations, and loci have sequence annotations to GenBank and SGN unigenes,
which link to further information such as
genetic markers, bacterial artificial chromosome sequences, and metabolic pathways
(SolCyc database). This schema interacts
highly with the Chado schema (Mungall
et al., 2007). Chado tables are not shown in
the figure. [See online article for color
version of this figure.]

Figure 2. SGN locus module. A, Web user-editable locus details section. The interface grants edit privileges to locus editors and
curators. A chromosome glyph with genetic mapping information is printed on the right. Clicking on the chromosome opens the
comparative viewer. Clicking on a marker name opens a marker info page. B, Images are displayed on the locus page and provide
links to the phenotype database. C, Metabolic pathway information. Clicking on the chemical reaction glyph opens the SolCyc
reaction page. [See online article for color version of this figure.]

At the bottom of the section is information about the
locus editors and the editing history. Every locus entry
is assigned one or more editors who have full editing
and deleting privileges. The name of each editor is
shown as a clickable link leading to the editor’s personal details page (see ‘‘Materials and Methods’’)
where users can find contact details of the editor,
followed by the date when the locus was created in the
database, the date of its last update, and which editor
made the last update.
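The editing rules described for the locus details section can be summarized in a short sketch: locus details are editable only by assigned editors and curators, synonyms by any submitter, and each change records who made it and when. The role labels, class names, and function names are hypothetical and chosen only to mirror the text.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set


@dataclass
class User:
    name: str
    role: str  # assumed labels: "user", "submitter", or "curator"


@dataclass
class LocusEntry:
    name: str
    editors: Set[str] = field(default_factory=set)  # assigned locus editors
    synonyms: Set[str] = field(default_factory=set)
    description: str = ""
    created: date = field(default_factory=date.today)
    modified: Optional[date] = None
    modified_by: Optional[str] = None


def can_edit_details(user: User, locus: LocusEntry) -> bool:
    # Only an assigned locus editor or a curator may change locus details.
    return user.role == "curator" or user.name in locus.editors


def can_edit_synonyms(user: User) -> bool:
    # Any submitter (and, by assumption, any curator) may manage synonyms.
    return user.role in ("submitter", "curator")


def update_description(user, locus, text, today=None):
    if not can_edit_details(user, locus):
        raise PermissionError(f"{user.name} may not edit {locus.name}")
    locus.description = text
    # Record the editing history shown at the bottom of the locus page.
    locus.modified = today or date.today()
    locus.modified_by = user.name


def add_synonym(user, locus, synonym):
    if not can_edit_synonyms(user):
        raise PermissionError(f"{user.name} may not edit synonyms")
    locus.synonyms.add(synonym)
```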
Notes and Figures. In this section, submitters can
submit figures or photographs relating to a locus.
Because the ability to add figures to locus pages was
introduced very recently, only 43 figures have been submitted by community members so far. We expect this number to increase rapidly.
Accessions and Images. This section contains a list of
phenotyped accessions that are known to express the
locus and, if available, images depicting the phenotype under the control of the locus (Fig. 2B). As of this
writing, a total of 657 loci are associated with 4,039
accessions. Accessions or images listed in the locus
page are linked to the accession detail page, where the
accession’s phenotypic and genotypic properties are
displayed. Information on accessions can only be
updated by the original locus editors, but new images
and other auxiliary information can be added by any
SGN submitter/user.
Known Alleles. This section lists variant forms (alleles) of the locus, accessions that harbor the alleles,
and phenotypes of the alleles. Currently, there are 997
loci with a total number of 1,249 alleles; 262 of the
alleles are identified in 553 accessions.
Associated Loci. Locus-to-locus associations provide a flexible mechanism for describing the relationship between loci and, by extension, gene networks.
The basis for a locus-to-locus relationship includes
homology, coexpression, and shared pathway. As of
this writing, there are 59 such relationships in the database, representing 28 gene networks from multiple
organisms. An interactive tool for browsing these
networks is in late-stage development.
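One plausible way to derive gene networks from pairwise locus-to-locus associations is to treat each association as an undirected edge and each connected component as a network. The text does not state that SGN defines networks this way, so the sketch below is an assumption, with hypothetical locus names.

```python
from collections import defaultdict


def gene_networks(associations):
    """Group loci into networks.

    associations: iterable of (locus_a, locus_b, basis) tuples, where
    basis is e.g. "homology", "coexpression", or "shared pathway".
    Returns a list of sets, one per connected component.
    """
    neighbors = defaultdict(set)
    for locus_a, locus_b, _basis in associations:
        neighbors[locus_a].add(locus_b)
        neighbors[locus_b].add(locus_a)

    seen = set()
    networks = []
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from start.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(neighbors[node] - seen)
        networks.append(component)
    return networks
```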
SolCyc Links. If a locus codes for a gene involved in
small-molecule metabolism, a link is displayed leading to SolCyc, SGN’s biochemical pathway database
for the Solanaceae (http://solcyc.sgn.cornell.edu). The
links are shown as small glyphs representing the
chemical reaction (Fig. 2C). As of this writing, there
are 132 biochemical pathways associated with 86
distinct loci. As more loci are annotated with unigene
identifiers and more reactions are curated in SolCyc,
more pathways will be associated with SGN loci.
Sequence Annotation. The sequence underlying a
locus is represented in this section as an internal link
to one or more SGN unigene sequences (Mueller et al.,
2005), generated from a build of sequenced EST libraries
and mRNA or gene sequences from GenBank repositories. The locus-unigene association provides more information about a gene’s predicted peptides, available
marker and microarray resources, InterPro domains
with GO annotations, and inclusion in gene families.
These data can be dynamic because new unigene and
gene family builds are constructed periodically with
newly submitted sequences. Currently, 2,179 loci have
a total of 4,811 unigene annotations, out of which 3,678
unigenes are unique. Also listed in this section are
genomic and mRNA sequences of the locus retrieved
from GenBank. So far, 2,622 loci are annotated with
4,320 unique GenBank sequence accessions.
Literature Annotations. This section displays publications that document the loci or have relevant data on
the loci. Currently, we store in the database 1,012
publications associated with 1,304 unique locus entries
gleaned from both periodic bulk data uploads and
individual submissions by curators and submitters (see
‘‘Materials and Methods’’). This bulk literature update
function helps keep literature annotations up to date by
allowing users and curators to load and associate high
volumes of literature citations as they are published
without making time-consuming individual requests
or waiting for the periodic update. Advanced text-mining tools will be adapted in the future for large-scale literature parsing (Muller et al., 2004).
Ontology Annotations. This section is used to describe the functional and phenotypic properties of a
locus using structured language (ontologies). Ontologies developed by the GO and PO consortia are used
to characterize the biological processes, molecular
functions, cellular components, plant anatomy, and
plant growth stages in which a locus is involved.
Ontology annotations are assigned automatically and
manually (Fig. 3). Using both methods of annotation,
2,426 loci are annotated with 11,993 GO and 1,481 PO
terms (Fig. 4). Because the same structured language is
also used in other major plant databases (Lawrence
et al., 2007; Liang et al., 2008; Swarbreck et al., 2008),
ontology annotations allow cross-species comparative
analyses.
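A shared vocabulary makes cross-species comparison a matter of grouping annotations by term. The following sketch indexes (organism, locus, term) triples and reports the terms annotated in more than one organism; the term identifiers and locus names are placeholders, not real GO or PO entries.

```python
from collections import defaultdict


def loci_by_term(annotations):
    """Map ontology term ID -> {organism: sorted list of locus names}."""
    index = defaultdict(lambda: defaultdict(set))
    for organism, locus, term_id in annotations:
        index[term_id][organism].add(locus)
    return {
        term: {org: sorted(loci) for org, loci in by_org.items()}
        for term, by_org in index.items()
    }


def cross_species_terms(annotations):
    """Keep only the terms that are annotated in more than one organism."""
    return {
        term: by_org
        for term, by_org in loci_by_term(annotations).items()
        if len(by_org) > 1
    }
```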
Solanaceae Phenotype Module
A phenotype is an observable trait of an individual.
Phenotyping records are kept with individual accessions because the phenotype of a single plant
may vary with genetic background, the environment,
phenotyping methodologies, and human inconsistencies in scoring for traits. Each accession in the
phenotype database has a unique name and is associated with a population (see ‘‘Materials and Methods’’).
Currently, the database contains 6,921 accessions from
17 populations (Table II). Individual accession data
include images, underlying loci and alleles, phenotypic attributes, the genetic makeup of each plant,
germplasms, and ontology annotations (Fig. 1). The
database is usually populated with batch information
for large datasets. Accession entries, for example mutants, can also be added to the database by submitters
using the Web interface.
User Interface
Phenotype Search. The phenotype search function
allows users to search for accessions using keywords
from the name or phenotype descriptors. An advanced
search can be done using filters for a specific population, PO, or Solanaceae Phenotype (SP) term, name of
an accession editor, accessions with associated loci, or
accessions with ontology annotation.
Search results are displayed as a list of accessions
matching the search parameters with links to separate
pages showing the details of each accession.
Clicking an accession name in the search results
displays the accession detail page (Fig. 5), divided into
the following sections.
Accession Details. Each accession in the database
has a unique name, free-text description, population
name, the name of the person who submitted the
record, and references to loci identified in the accession (Fig. 5A). Each accession may be associated with
more than one locus if it carries variation in more than
one gene. As of this writing, the community annotation
database has information on accessions of tomato and
eggplant mutants, cultivars, mapping and quantitative
trait loci (QTL) populations, breeder lines, transgenic
accessions, and introgression lines.

Figure 3. The ontology term annotation tool, available both on locus and on accession pages. Curators and submitters can select
an ontology to browse from a drop-down menu. While typing an ontology term name or ID, a list of matches is displayed in the
text area. When selecting a term from the results list, the user is required to choose a relationship type and an evidence code
supporting the annotation. The field of evidence description is populated based on the selected evidence code. The fields of
evidence with and reference are populated with the object’s associated sequences and literature references. Clicking on the
associate ontology button stores the selected information in the database along with the user details and date, and the annotation
is displayed on the Web page. [See online article for color version of this figure.]

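The annotation workflow of the ontology term tool (Fig. 3) can be sketched as a record that requires a relationship type and an evidence code, fills in the evidence description from the code, and stores the annotator and date. The evidence codes shown are a small subset of the standard GO evidence codes; the function name, the relationship label used in the usage note, and the identifier formats are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

# A few standard GO evidence codes and their expansions.
EVIDENCE_CODES = {
    "IDA": "inferred from direct assay",
    "IMP": "inferred from mutant phenotype",
    "ISS": "inferred from sequence or structural similarity",
    "IEA": "inferred from electronic annotation",
    "TAS": "traceable author statement",
}


@dataclass(frozen=True)
class OntologyAnnotation:
    object_id: str  # locus or accession being annotated
    term_id: str
    relationship: str
    evidence_code: str
    evidence_description: str
    reference: str
    annotated_by: str
    annotated_on: date


def annotate(object_id, term_id, relationship, evidence_code, reference,
             user, today=None):
    """Build an annotation; relationship and evidence code are mandatory."""
    if not relationship:
        raise ValueError("a relationship type is required")
    if evidence_code not in EVIDENCE_CODES:
        raise ValueError(f"unknown evidence code: {evidence_code}")
    return OntologyAnnotation(
        object_id=object_id,
        term_id=term_id,
        relationship=relationship,
        evidence_code=evidence_code,
        # The description is derived from the selected code.
        evidence_description=EVIDENCE_CODES[evidence_code],
        reference=reference,
        annotated_by=user,
        annotated_on=today or date.today(),
    )
```

For example, `annotate("locus:1", "TERM:0001", "involved_in", "IMP", "ref:1", "alice")` would yield a record whose evidence description reads "inferred from mutant phenotype".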
Images. In this section, images depicting the phenotype are displayed (Fig. 5B). Of the 14,785 images in
the SGN images database, 7,963 images are displayed
in association with relevant loci and phenotyped accessions.
Phenotype Data. If an accession was part of a population study that recorded quantitative trait data
(Gonzalo and van der Knaap, 2008), the phenotypic
values of the traits for the accession are displayed along with the
population parameters, including mean and range
values for each trait (Fig. 5C).
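The population parameters shown next to an accession's trait value are simple summary statistics. A minimal sketch, assuming trait values are held in a dictionary keyed by accession name (the accession names and values in any usage are made up):

```python
from statistics import mean


def trait_summary(population_values, accession):
    """Summarize one trait for an accession within its population.

    population_values: dict mapping accession name -> numeric trait value.
    Returns the accession's own value plus the population mean and range.
    """
    values = list(population_values.values())
    return {
        "value": population_values[accession],
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }
```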
Genotype Data. For accessions with associated
mapping data, such as mapped markers for known
genes, flanking markers of introgressions, or marker
scores for a mapping population, a linkage map is
displayed, representing the genotype of the accession (Fig. 5D). Mapping data are currently available
for 152 introgression lines (Eshed and Zamir, 1995)
and 863 individuals from seven QTL populations
(http://sgn.cornell.edu/cview/map.pl?map_id517).
Alleles. Accessions carrying variation associated
with a locus may also be associated with a specific
allele. Each represented allele links back to its
locus page and to its allele page, which is also reachable
from the locus page and is where allele data can be edited.
There are 542 accessions associated with 553 allele
descriptors of 129 distinct Solanaceae loci; 1,704 accessions have links to loci without a specific allele association (linked to the default allele).
Germplasms. An accession may be available for
ordering from a stock center; such accessions list
their germplasms with a link to where each can be
ordered.
Figure 4. Locus ontology annotations by species.
Number of annotations by controlled vocabulary
name for each species. [See online article for color
version of this figure.]

Ontology Annotations. The interface for displaying
ontology terms associated with the recorded phenotypes is similar to the locus module (Fig. 3). However,
additional vocabulary terms can be used for describing SP variation. We have mapped the descriptors
developed for categorizing Tomato Genetics Resource Center (TGRC) accessions (http://tgrc.ucdavis.edu)
and the tomato monogenic mutant population
(Menda et al., 2004) to an OBO format of an SP
ontology (ftp://ftp.sgn.cornell.edu/ontology/SP.obo).
These terms, in addition to other descriptors used for
characterizing Solanaceae traits and phenotypes, are
also mapped to the Phenotype and Trait Ontology
(PATO; http://www.bioontology.org/wiki/index.php/PATO:Main_Page)
with the objective of providing a
semantic framework for querying different databases
using a common language. Mapping files, annotation
association files, and the SP ontology can be downloaded from ftp://ftp.sgn.cornell.edu/ontology.
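Because the SP ontology is distributed in OBO format, its terms can be read with a few lines of code. The sketch below is a deliberately minimal reader that extracts only the id, name, and is_a tags of [Term] stanzas; it is not a complete OBO parser, and the sample text uses invented identifiers and names rather than real SP terms.

```python
SAMPLE_OBO = """format-version: 1.2

[Term]
id: SP:0000001
name: example fruit trait

[Term]
id: SP:0000002
name: example fruit shape
is_a: SP:0000001 ! example fruit trait

[Typedef]
id: part_of
name: part of
"""


def parse_obo_terms(text):
    """Return a list of dicts with 'id', 'name', and 'is_a' for each [Term]."""
    terms = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A stanza header ends the previous stanza; keep only [Term].
            current = {"is_a": []} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
            continue
        if current is None or not line or line.startswith("!"):
            continue
        tag, _, value = line.partition(":")
        # Drop the trailing "! comment" that OBO allows after a value.
        value = value.split(" ! ")[0].strip()
        if tag == "is_a":
            current["is_a"].append(value)
        elif tag in ("id", "name"):
            current[tag] = value
    return terms
```

Calling `parse_obo_terms(SAMPLE_OBO)` yields two term records and skips the header line and the [Typedef] stanza.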
DISCUSSION
We have developed a comprehensive database for
the community annotation of loci and phenotypes, providing functionality for extensive annotation based on
free-text descriptions, controlled vocabularies, images,
sequences, and literature references. Users with submitter accounts can contribute information using
easy-to-use Web interfaces. All submitted data are immediately visible to all users, facilitating review and
discussion of annotations as they emerge. While only
submitters can modify data, all registered users can
contribute knowledge using the forum-like comments
option available on each page. Quality control pipelines
and rigorous submission tracking ensure that only
high-quality annotations are published on the site.
As of March 2008, the database contains 3,604 loci,
1,014 publications, and 6,921 plant accessions (Table I).
Table II. Phenotypes by population
Number of accessions, associated loci, alleles, images, and SP annotations per population.

Population          Accessions  Associated Loci  Associated Alleles  Images  SP Annotations
Tomato EMS          2,537       27               4                   1,758   4,233
Tomato FN           809         7                0                   478     1,433
Eggplant EMS        237         0                0                   320     356
TGRC                1,962       3,872            426                 879     1,729
ILs                 152         4                1                   264     1
Tomato F2 2000      88          0                0                   2,469   1
Tomato cultivars    312         122              122                 858     0
Yellow Stuffer F2   200         0                0                   205     0
Howard German F2    113         0                0                   113     0
Howard German BC1   100         0                0                   101     0
Sausage F2          111         0                0                   132     0
Sun1642 F2          100         0                0                   194     0
Banana Legs F2      99          0                0                   98      0
Rio Grande F1       94          0                0                   94      0
Transgenic lines    3           3                0                   0       0
Breeder lines       3           3                0                   0       0
Mutant lines        1           1                0                   0       0
Total               6,921       4,039            553                 7,963   7,753

Figure 5. SGN accession page. A, The details section contains population and submitter information, followed by underlying
loci entries. B, Images. C, Quantitative phenotypes. D, Genotype data. [See online article for color version of this figure.]
There are 42 community submitters who contributed most of the phenotyped accessions and information on approximately 200 loci. This community
annotation effort provides a medium and tools for the
Solanaceae research community to annotate their
genes and phenotypes, helping to keep the
data in the database as accurate, current,
and accessible as possible. Nevertheless, community
annotation is only one aspect of the curational capacity
at SGN and complements the larger-scale
automated and in-house curator annotations.
Critical metrics for our system’s success are the number of community annotators and the number of
annotations they make. If the current rate of subscription continues, we expect the number of community
annotators to grow by about 100 every year. Our outreach program actively solicits contributions from
leading scientists through direct e-mail contact, presentations at conferences, and publications in leading
journals. Our goal is to have at least 200 annotators by the end of 2009; we see this number as the critical
mass needed for the system to be useful. We predict that, in a
few years, online annotations will be a normal part of
any biologist’s routine and our system will scale well
to thousands of annotators.
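The growth expectation above amounts to a linear projection. As a rough worked example, assuming the 42 community submitters reported in this article as the starting point and a constant gain of 100 annotators per year (both simplifications made here for illustration):

```python
def projected_annotators(current, per_year, years):
    """Linear projection of the number of annotators after `years`."""
    return current + per_year * years


def years_to_reach(target, current, per_year):
    """Years needed to reach `target` at a constant yearly gain."""
    return max(0.0, (target - current) / per_year)
```

Under these assumptions, `years_to_reach(200, 42, 100)` is about 1.6 years, which is consistent with a target of 200 annotators by the end of 2009 when counted from March 2008.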
The literature contains a vast amount of information
on genes and mutants that has yet to be integrated into
any electronic database in a format that makes it computationally accessible. The SGN community database
aims to help close this gap by providing an easy way
for the most knowledgeable members of the community to
contribute this information. Our system allows submitters to edit almost any data type associated with a
locus or a phenotype, so even partial data excluded
from publication, corrections, or supplemental information can be added to SGN community annotation
pages.
In 2007 alone, more than 1,300 Solanaceae-related
papers were published (about 90% of which were on
either tomato or potato) and more than 150 Solanaceae
mRNA sequences were submitted to GenBank. Researchers from the community are by far the best
resource for reviewing gene information and extracting
relevant data from their own publications. Due to space
limitations or focus on specific traits or processes,
papers and supplementary materials do not always
include all useful data gained from experiments. SGN
provides the research community with a platform for
sharing supplementary information that may be useful
for other members of the research community.
The inferred cross-links between phenomes and
genomes provide a resource for studying genome
evolution and the resulting phenotype variation in
plants (Ori et al., 2007; Xiao et al., 2008). While such
relationships are most accurately derived from manual curation of published experimental results, large-scale links can be generated by comparative analyses
of traits and gene expression patterns in closely related
organisms.
In recent years, much progress has been made in
defining ontologies: standard, machine-readable
controlled vocabularies that describe general
processes shared by different
organisms. Ontologies greatly facilitate meaningful cross-species queries between disparate databases by providing a common semantic
framework that can be used in searches and comparisons. Among ontologies, the GO (Gene Ontology
Consortium, 2008) and PO (Avraham et al., 2008) are
the most extensively used vocabularies for annotation
of genes and phenotypes in databases such as Gramene, TAIR, and MaizeGDB. Gene and phenotype
annotation with common controlled vocabularies facilitates finding genes that share similar, but not necessarily
identical, attributes in function, morphology,
and development. Our user-friendly tools encourage
and greatly help users to annotate their genes of
interest with controlled vocabularies.
Beyond the borders of the Solanaceae community,
several other approaches have recently been developed. For example, EcoliWiki (http://ecoliwiki.net)
has deployed an installation of MediaWiki (http://mediawiki.org)
as a hub for community annotation of
Escherichia coli K-12. Wikis have advantages in that
they are simple to set up and maintain, user-friendly,
and already familiar to many users due to the popularity of Wikipedia (http://wikipedia.org).
For bioinformatics purposes, the most significant
limitation of traditional wikis is that the wiki’s content
is stored in a mostly unstructured manner and without
any semantic metadata. This tends to make large-scale
automated analysis of wiki content difficult and error
prone at best. This limits the usefulness of such resources because such large-scale analyses are the bedrock of modern bioinformatics. In contrast, the SGN
community annotation system stores data in a highly
structured relational database, an ideal basis for large-scale bioinformatics analyses. Despite its limitations,
wiki-style free-text editing can, however, be an excellent option for community editing of information
whose structure may not be known in advance. Currently, SGN community annotation pages allow submission of free-text comments at the bottom of each
page that can be used for this purpose.
FUTURE DEVELOPMENTS
Ultimately, our objective is to present on each locus
and phenotype page the entire story of a gene, including not only its descriptors, synonyms, and functions,
but also its history, provenance, mapping, cloning, and
sequencing, and all the experimental steps, people,
and methods involved in its characterization. Each
page will essentially be presented as a free-standing
publication, creating a permanent, yet evolving, entry
that can be cited and referenced.
With the growing number of gene descriptors and
annotations, MOD databases are becoming central
actors in a community effort to develop a unified
gene nomenclature and gold standards for annotation,
not only for maintaining similar guidelines within an
organism’s research community, but also for comparative searches across taxa. Journals are beginning to
collaborate with databases to set nomenclature standards and naming conventions. Since July 2007, manuscripts for the publication Plant Physiology have been
required to supply a TAIR locus identifier for Arabidopsis gene data (http://www.plantphysiol.org/misc/ifora.shtml).
The benefits of this policy include the prevention of nomenclature conflicts (since TAIR arbitrates the nomenclature) and the assured availability of
up-to-date gene information. We intend to provide the
research community with a similar system of stable
identifiers, naming conventions, and annotation standards for the Solanaceae.
CONCLUSION
SGN is the first among the major plant databases to
put the control of the information directly in the hands
of community experts, with SGN curators acting as
editors in the annotation process, rather than exclusively as authors. As a result, SGN annotations are
more up to date and richer in detailed descriptions,
images, and several levels of gene-to-phenotype cross-links than would be possible without a
large curatorial staff.
We would be happy to collaborate with other research communities to help start community annotation efforts of other organisms and clades.
MATERIALS AND METHODS
Platform Technologies
SGN stores and indexes most of its data using the open source PostgreSQL
database system (http://www.postgresql.org). Most software developed at
SGN is written in object-oriented Perl and JavaScript. User data submission
forms are written using AJAX techniques to provide powerful and user-friendly interfaces. The SGN Web site uses the Apache (http://www.apache.org)
Web server with the mod_perl integrated Perl interpreter. All servers and
most development machines run the Debian distribution of the GNU/Linux
operating system. More information on the database schemas, software, and
setup at SGN can be found on the SGN Web site (http://sgn.cornell.edu).
Data Types
The first step toward implementing a system for representing phenotype-to-genotype relationships was to design a database schema for storing
Solanaceae loci and phenotypes with cross-references between the two
datasets (Fig. 1). The following conceptual data types are used, many of which
map directly to Perl classes and/or tables in the PostgreSQL database.
Locus: Central data type representing descriptive genetic information of
plausible transcribed units in the genome. A locus has a unique name and
symbol, synonyms, allele data, and related sequences. It is annotated with supporting literature records, and its phenotypes are described using controlled vocabularies.
Allele: Alternative form of a locus. It may originate from natural or induced
variation. Alleles allow representation of multiple products and phenotypes of a
single locus in an organism.
Phenotype: Measurable traits and characteristics of individuals within a
defined population. Phenotypes are stored as text descriptors of alleles and
individual accessions, annotated images, quantitative measurements, and
controlled vocabulary terms.
Accession: Single member of a predefined population. Annotated with
phenotypic and genotypic attributes, such as images, locations on a genetic
map, and controlled vocabulary terms. Cross-referenced with loci via associations with alleles (accession to allele to locus).
Population: Collection of individuals (accessions) sharing a common genetic
background or a common phenotyping or genotyping scheme. A population
may be genetically homogeneous, such as mutant collections in a specific
background or isogenic inbred lines, or may be a heterogeneous collection of
plants of different genetic backgrounds that have been characterized using
similar methodologies.
Database Schema
In the PostgreSQL database system, data are represented as tables with
rows and columns that hold data and also refer to other tables. To maintain a
comprehensive audit trail for every data point change, each user-updateable
table associated with the community annotation system stores certain standardized metadata for each database record such as creation date, modification date, owner ID, the ID of the user who submitted an update, and
obsoleteness information (Supplemental Fig. S1). The owner of the record is
usually authorized to edit and delete (actually, obsolete) the information. All
users can view the information. The core structure of the database (Fig. 1)
consists of a locus table for storing gene descriptors and an individual table for
storing phenotype descriptors of accessions. More than 30 additional tables
store related information, such as alleles, image data, and annotations. Plant
accessions (individuals) can be linked to an allele through a linking table
allowing many-to-many relationships between loci, accessions, and images. Thus,
accessions with mutations in several loci can be represented easily. Each locus
has a default allele used as a place holder to allow associating phenotypes to
loci in the absence of allele information. Over time, as genes are sequenced
and annotated with allele information, the locus-phenotype associations may
be refined to include more specific allele information. Sequences, publications,
biochemical pathways, controlled vocabulary terms, and other general data
used for annotation are primarily stored using a slightly modified version of
the GMOD Chado database schema (Mungall et al., 2007).
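
The core layout described above can be sketched in a few tables. The following is a minimal illustration only, using Python's built-in sqlite3 module in place of PostgreSQL. All table and column names (locus, allele, individual, individual_allele, is_default, obsolete, and so on) are assumptions chosen for readability and are not taken from the SGN schema.

```python
import sqlite3

# Hypothetical, simplified schema; names are illustrative, not SGN's actual tables.
SCHEMA = """
CREATE TABLE locus (
    locus_id      INTEGER PRIMARY KEY,
    locus_symbol  TEXT NOT NULL UNIQUE,
    owner_id      INTEGER NOT NULL,             -- audit metadata: record owner
    create_date   TEXT DEFAULT CURRENT_TIMESTAMP,
    modified_date TEXT,
    obsolete      INTEGER NOT NULL DEFAULT 0    -- deleted rows are only flagged
);
CREATE TABLE allele (
    allele_id     INTEGER PRIMARY KEY,
    locus_id      INTEGER NOT NULL REFERENCES locus(locus_id),
    allele_symbol TEXT NOT NULL,
    is_default    INTEGER NOT NULL DEFAULT 0    -- placeholder allele of the locus
);
CREATE TABLE individual (
    individual_id INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    description   TEXT
);
CREATE TABLE individual_allele (                -- linking table
    individual_id INTEGER NOT NULL REFERENCES individual(individual_id),
    allele_id     INTEGER NOT NULL REFERENCES allele(allele_id),
    PRIMARY KEY (individual_id, allele_id)
);
"""


def add_locus(conn, symbol, owner_id):
    """Insert a locus together with its placeholder default allele."""
    cur = conn.execute(
        "INSERT INTO locus (locus_symbol, owner_id) VALUES (?, ?)",
        (symbol, owner_id),
    )
    locus_id = cur.lastrowid
    conn.execute(
        "INSERT INTO allele (locus_id, allele_symbol, is_default) VALUES (?, ?, 1)",
        (locus_id, symbol),
    )
    return locus_id


def link_accession_to_locus(conn, individual_id, locus_id):
    """Associate an accession with a locus through the locus's default allele."""
    (allele_id,) = conn.execute(
        "SELECT allele_id FROM allele WHERE locus_id = ? AND is_default = 1",
        (locus_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO individual_allele (individual_id, allele_id) VALUES (?, ?)",
        (individual_id, allele_id),
    )


def loci_for_accession(conn, individual_id):
    """Follow accession -> allele -> locus and return the locus symbols."""
    rows = conn.execute(
        """SELECT l.locus_symbol
           FROM individual_allele ia
           JOIN allele a ON a.allele_id = ia.allele_id
           JOIN locus l ON l.locus_id = a.locus_id
           WHERE ia.individual_id = ?
           ORDER BY l.locus_symbol""",
        (individual_id,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

In this sketch, an accession carrying mutations in two loci is simply two rows in the linking table. Once a specific allele is characterized, a link to the default allele could be replaced by a link to that allele.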
User Types and Privileges
From the outset, the community annotation system was designed with an
eye toward participation from Web site users in a variety of roles. Each logged-in
user has an account type, which is used as the first level of granularity for
assigning database access and editing privileges. Web access to view all data is
unrestricted and does not require registration. The default user type is user,
which carries permission for posting comments on pages for loci and
individuals and on other pages on the site. Submitter accounts are granted
only to users who have been individually vetted by an SGN curator, since
these accounts carry privileges for submitting new data and editing many
existing entries. Submitter accounts are generally available to anyone with a
legitimate interest and expertise in Solanaceae research, and a request for a
submitter-class account is typically granted within 24 h. There is also a third
user type for SGN staff, curator, which carries administrative privileges.
Any SGN submitter may add a new locus or request locus editing privileges
for the purpose of curation and annotation of genes already existing in the
database. To obtain locus editor privileges, a user must first create an account by
clicking on the login link from the toolbar on any SGN page (http://www.sgn.cornell.edu/solpeople/login.pl)
and following the instructions after clicking
the sign up for an account link. This will create an account of type user. User
accounts can be upgraded to submitter upon request by e-mailing SGN (using the link provided in the footer of every SGN
Web page), or by requesting editor privileges for a specific locus by clicking on
the request editor privileges link from the relevant locus page.
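
The three account types and their privileges can be summarized in a few lines of code. The sketch below is illustrative only: the AccountType enum and the helper functions are hypothetical names, not part of the SGN code base (which is written in Perl), and the rules encode only what is stated above.

```python
from enum import IntEnum
from typing import Optional, Set


class AccountType(IntEnum):
    """Hypothetical encoding of the three SGN account types."""
    USER = 1       # default type: may post comments
    SUBMITTER = 2  # vetted by a curator: may submit and annotate data
    CURATOR = 3    # SGN staff: administrative privileges


def can_view(account: Optional[AccountType]) -> bool:
    # Viewing is unrestricted and needs no registration (account may be None).
    return True


def can_comment(account: Optional[AccountType]) -> bool:
    # Any logged-in account may post comments.
    return account is not None


def can_submit(account: Optional[AccountType]) -> bool:
    # Submitting new data requires at least a submitter account.
    return account is not None and account >= AccountType.SUBMITTER


def can_edit_locus(account: Optional[AccountType], user_id: int,
                   locus_editors: Set[int]) -> bool:
    # Curators may edit any locus; submitters only loci they are editors of.
    if account is None:
        return False
    if account == AccountType.CURATOR:
        return True
    return account == AccountType.SUBMITTER and user_id in locus_editors
```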
SGN People Database
When a user logs into the SGN Web site, they are directed to the site’s
central hub for user-based functions called MySGN (http://www.sgn.cornell.edu/solpeople/top-level.pl),
which provides entry points to many of the site’s
user-based tools, including the community annotation functions. On this
page, users with submitter accounts can find a summary of all loci for which
they have editor privileges, as well as a list of their recently annotated loci. It
also has a link for viewing all recent changes to the community annotations.
Each user’s publicly visible SGN person detail page also shows a list of loci for
which they are editors.
Community Annotation Tools
All user-editable data types have designated owners. Loci, alleles, phenotyped accessions, images, and annotations can be submitted by any SGN
submitter or curator. The submitter becomes the object owner by default.
Loci
Since loci are complex data types, with several research groups working on
different aspects of the same locus, the system allows multiple submitters to
be assigned editor privileges for a given locus.
When a locus editor is logged in, the edit and delete links on the locus page
become active. Clicking the edit link opens an editable form (Fig. 2A), and
clicking the delete link brings up a delete confirmation dialog. Any user with a
submitter-class account can add synonyms, alleles, sequences, publications,
and ontology annotations to a locus record.
Alleles
Allele information is useful for storing sequences and phenotype variation
of a gene. In the locus database an allele must be associated with one locus and
have a unique name and symbol. The mode of inheritance of the allele is
designated as recessive, dominant, or partially dominant, and the phenotype
of the allele is an optional free-text field. Alleles may also have any number of
unique synonyms, editable by any SGN submitter in a similar manner to the
locus synonyms, and associated accessions and images, which are usually a
subset of the accessions associated with the allele’s locus. The allele-accession
association (and image) is a more granular link between gene and phenotype data;
however, if the underlying allele of a phenotyped accession is not
recorded, there is only a link to the default allele of the locus. The allele owner
has edit privileges for the allele information and can be different from the
locus owner. Any user with a submitter account can add new alleles to an
existing locus and, by doing so, become an allele owner. Alleles can also be
associated with other data types from external databases, such as GenBank
sequences and publications.
Gene Networks
Loci do not act in isolation, but are embedded in a network of other loci,
with which they act in a pathway, are regulated, or interact in some other way.
To represent these networks, we have introduced a tool for making locus-to-locus associations. This system includes a small ontology describing the
interaction types, as well as an easy-to-use AJAX (Asynchronous
JavaScript and XML) form for browsing the locus database (by name or
symbol) and providing a relationship type, an evidence code, and an optional
literature reference that documents the association between loci. The association and its description are displayed on both related loci, regardless of
the locus on which the association was initiated.
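
One way to obtain the symmetric display described above is to store each association once and look it up from either side. The following sketch assumes a hypothetical LocusAssociation record. The field names, locus symbols, and the relationship term "interacts_with" are placeholders, not terms from the SGN interaction ontology.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LocusAssociation:
    """One stored locus-to-locus link (hypothetical record layout)."""
    subject: str                     # locus on which the link was created
    object: str                      # the associated locus
    relationship: str                # term from a small interaction ontology
    evidence_code: str               # evidence supporting the association
    reference: Optional[str] = None  # optional literature reference


def associations_for(locus: str,
                     associations: List[LocusAssociation]) -> List[LocusAssociation]:
    """Return every association involving the locus, whichever side it is on."""
    return [a for a in associations if locus in (a.subject, a.object)]


# Placeholder data: one link, created from the page of "locus_a".
links = [LocusAssociation("locus_a", "locus_b", "interacts_with", "IMP")]
```

Because the lookup checks both ends of the record, the single stored link is returned for "locus_a" and for "locus_b" alike.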
Files and Images
Figures and images can be associated directly with a locus or an accession
using an upload function on locus and on accession detail pages (Fig. 2B).
Submitters may upload photos, schematics, and documented notes including
PDF files to add supplemental information or experimental results. The
submitter is considered to be the copyright owner of the uploaded materials
and grants a nonexclusive license to SGN to publicly display it.
An image-specific detail page is available, with metadata including a description and the name of the user who uploaded the image. The uploader is usually also the owner of
the image and has edit and delete privileges. As with loci and individuals,
images are never deleted from the file system and the database, but are only
set to obsolete in the image table. The image page also contains tags—general
text descriptors—that may be added or deleted by any SGN submitter in a
similar manner to the locus synonyms. All the objects associated with an image
are printed below it, with links to the relevant Web pages. The same image may
be associated with one or more individuals and also with other object types in
SGN’s database, thus creating a general image object not restrained to a specific
data type.
Accessions
Phenotyped accessions are named individuals in the database and have a
designated editor, in a similar manner to the locus data type, usually the
person who uploaded or contributed the accession to SGN. Editing privileges
are granted only to the editor and curators, and Web-based edits are structured as described in the locus section (Fig. 5).
Sequence Annotations
The community annotation tools allow submitters and curators to associate
SGN unigenes and GenBank accessions with loci. GenBank accessions are
fetched dynamically from the National Center for Biotechnology Information
(NCBI) using its eUtils (http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html)
tool. If a PubMed citation is embedded in the retrieved sequence
entry, the user can also choose to import the publication so that both GenBank
accession and PubMed publication will be linked with the annotated locus.
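
As an illustration of this kind of dynamic retrieval, the sketch below builds request URLs for NCBI's EFetch utility. It is an assumption-laden example rather than SGN's implementation: it uses the present-day E-utilities base address, the helper names are hypothetical, and it only constructs the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Present-day E-utilities base address (an assumption; the paper cites the
# help page that was current in 2008).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db: str, record_id: str, rettype: str, retmode: str) -> str:
    """Build an EFetch request URL for one record."""
    query = urlencode(
        {"db": db, "id": record_id, "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


def genbank_record_url(accession: str) -> str:
    """URL for a nucleotide record in GenBank flat-file format."""
    return efetch_url("nuccore", accession, "gb", "text")


def pubmed_record_url(pmid: str) -> str:
    """URL for a PubMed citation as plain-text abstract."""
    return efetch_url("pubmed", pmid, "abstract", "text")
```

A caller would pass the resulting URL to an HTTP client and parse the returned record; error handling and NCBI's request-rate guidelines are outside the scope of this sketch.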
Literature Annotations
Submitters and curators can fetch publications from PubMed using
PubMed IDs. Publications not available from PubMed can be added manually
by filling in a form with citation details. Each publication can be linked with a
locus and an allele entry.
Ontology Annotations
SGN uses three ontologies for annotating genes and phenotypes: GO, PO,
and SP. The latest release of SP is available for download at ftp://ftp.sgn.cornell.edu/ontology/SP.obo.
Automated locus annotations with GO and PO
terms are obtained by sequence orthology to Arabidopsis (Arabidopsis thaliana), as inferred by sequence similarity (BLAST) and SGN family build
membership. GO and PO annotations were downloaded from TAIR (http://Arabidopsis.org)
and loaded as annotations on the matching SGN loci. We
have developed a manual annotation tool for loci and phenotype accessions,
providing an interface for searching the ontology by term name, synonym, or
ID. Submitters and curators are prompted to provide supporting information
such as literature citations and evidence codes (Fig. 3). User-contributed
annotations are posted immediately on the locus page with the name and
contact information of the submitter, but are then checked for consistency by
SGN curators.
Curator-verified annotations are submitted periodically by SGN to the GO
and PO consortiums and are available for browsing on their respective Web
sites. The SP ontology, developed at SGN, is mapped to PO and PATO terms
for entity-quality-value annotations of qualitative and quantitative traits.
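
Searching an ontology by term name, synonym, or ID can be sketched as follows. This is a simplified illustration, not the SGN tool: the parser handles only the id, name, and synonym tags of OBO-style [Term] stanzas, and the sample terms and "XX:" identifiers are invented placeholders rather than real GO, PO, or SP entries.

```python
import re

# Invented sample in OBO-style syntax; the IDs are placeholders.
OBO_SAMPLE = """
[Term]
id: XX:0000001
name: fruit shape
synonym: "fruit morphology" EXACT []

[Term]
id: XX:0000002
name: leaf color
"""


def parse_obo(text):
    """Parse [Term] stanzas into dicts with id, name, and synonyms."""
    terms = []
    for stanza in text.split("[Term]")[1:]:
        term = {"id": None, "name": None, "synonyms": []}
        for line in stanza.strip().splitlines():
            key, _, value = line.partition(":")
            value = value.strip()
            if key == "id":
                term["id"] = value
            elif key == "name":
                term["name"] = value
            elif key == "synonym":
                match = re.match(r'"([^"]*)"', value)  # text inside the quotes
                if match:
                    term["synonyms"].append(match.group(1))
        terms.append(term)
    return terms


def search_terms(terms, query):
    """Return IDs of terms whose ID equals, or whose name or synonym
    contains, the query (case-insensitive)."""
    q = query.strip().lower()
    hits = []
    for term in terms:
        labels = [term["name"] or ""] + term["synonyms"]
        id_match = (term["id"] or "").lower() == q
        if id_match or any(q in label.lower() for label in labels):
            hits.append(term["id"])
    return hits


TERMS = parse_obo(OBO_SAMPLE)
```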
User Comments
Anyone with an SGN user account can post user comments on pages
without requesting submitter privileges or locus ownership. This comment
tool is sometimes useful for posting partial data or other free-text information
about the locus, its annotations, and/or supporting data. The locus page can
thus be an open forum to discuss what is known about a locus. When a user
posts a comment, an automatic e-mail is sent notifying SGN curators of the
post, including the text of the comment. Inappropriate comments are removed
immediately.
Data Upload Pipelines
The sources for original bulk data include NCBI (GenBank for sequences
and PubMed for literature), the TGRC for loci and plant accessions, and individual
labs. To aid the community in locus annotation for a given species, we
periodically populate or repopulate the database with selected bulk information from these sources using automated data-processing pipelines. These
data serve as seed for the community annotators to refine and build upon
and/or complement data already existing in SGN. To populate the database,
we have developed an automated processing pipeline to process new locus
data and upload and update existing links and annotations.
Newly characterized genes are also added individually to the database by
SGN curators or by members of the community as they are published in the
public domain.
Biochemical Pathways
Locus-to-pathway associations are generated automatically from the unigene pathway annotations in the SolCyc database (http://solcyc.sgn.cornell.edu;
Caspi et al., 2008). Whenever an SGN unigene with SolCyc annotations
is associated with a locus, the link to the SolCyc reaction is inferred and
displayed on the locus page (Fig. 2C).
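
The inference is a two-step lookup, from locus to unigene and from unigene to reaction. A minimal sketch, assuming the two mappings are available as plain dictionaries (the function name and all identifiers used with it are hypothetical):

```python
def reactions_for_locus(locus, locus_unigenes, unigene_reactions):
    """Collect the reactions of every unigene associated with a locus."""
    reactions = set()
    for unigene in locus_unigenes.get(locus, ()):
        # A unigene without pathway annotation contributes nothing.
        reactions.update(unigene_reactions.get(unigene, ()))
    return sorted(reactions)
```

No locus-to-reaction link is stored in this sketch; it is derived at lookup time, so a newly associated unigene is reflected on the next query.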
Quality Control
Web Data User Edits
SGN has developed several mechanisms to ensure that the community
annotation meets quality standards. SGN curators are notified by an e-mail
feedback system when users upload, edit, or delete any data point. Details
of who, when, and what was modified are sent by e-mail, in real time, to the
SGN curators and are also stored in the database. In addition, designated editors
who are experts on the locus or phenotype have privileges to verify the
accuracy of the data.
Deleting entries (locus or phenotype) and annotations using the Web
system does not remove the information from the database, but only flags the
item as obsolete, whereupon it is excluded from Web displays. This means
that delete operations can be reverted, thus preventing data loss caused by
accidents or malicious users. Back-end administration features allow SGN
curators to view all annotation changes organized by date.
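
The flag-instead-of-delete behavior can be sketched with a single obsolete column. The example below uses Python's sqlite3 module for brevity. The table, column, and function names are assumptions for illustration and do not reproduce SGN's PostgreSQL schema or Perl code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE locus (
           locus_id     INTEGER PRIMARY KEY,
           locus_symbol TEXT NOT NULL,
           obsolete     INTEGER NOT NULL DEFAULT 0
       )"""
)


def delete_locus(conn, locus_id):
    """'Delete' a locus by flagging it; the row itself is kept."""
    conn.execute("UPDATE locus SET obsolete = 1 WHERE locus_id = ?", (locus_id,))


def restore_locus(conn, locus_id):
    """Revert a delete by clearing the flag."""
    conn.execute("UPDATE locus SET obsolete = 0 WHERE locus_id = ?", (locus_id,))


def visible_loci(conn):
    """Symbols shown on the Web site: obsolete rows are filtered out."""
    rows = conn.execute(
        "SELECT locus_symbol FROM locus WHERE obsolete = 0 ORDER BY locus_symbol"
    )
    return [row[0] for row in rows]
```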
An additional layer of control on data input by the community is the restriction of the data types that can be added in a specific field. For example, to
describe functions and phenotypes of a locus, annotators are limited to using
ontology terms from a browsable list already existing in the SGN database. More
generally, data used to annotate an object such as a locus must already exist
in an internal or external database. This minimizes random data entry and
spamming.
Storing History
Each user-editable table has an associated history table, which stores the
entire editing history. In addition to archiving the change history and allowing
submitters to view information as it appeared in the past (useful if certain
experiments were performed in the past on incomplete annotation), the
history information can also be used to revert edits. Only locus editors and
SGN curators can view the history information and revert edits.
On each update of locus and accession details, the previous version of the
information is transferred from primary tables in the relational database to a
set of history tables, which are nearly identical in structure to the primary
tables. When a locus owner or curator is logged in, the Web interface provides
a clickable link to display all the changes previously made, and the name of
the person who made each change. This history module enables easy tracking
and reverting of data, providing an essential undo function for managing
community-generated content.
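
The copy-on-update pattern described in this section can be sketched as follows, again with Python's sqlite3 module standing in for PostgreSQL. The table and column names and both functions are hypothetical. In this sketch a revert is itself recorded as an update, so the audit trail stays complete.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE locus (
        locus_id    INTEGER PRIMARY KEY,
        description TEXT,
        updated_by  INTEGER
    );
    -- History table: nearly identical in structure to the primary table.
    CREATE TABLE locus_history (
        history_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        locus_id    INTEGER NOT NULL,
        description TEXT,
        updated_by  INTEGER
    );
    """
)


def update_locus(conn, locus_id, description, user_id):
    """Archive the current row in the history table, then apply the update."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            """INSERT INTO locus_history (locus_id, description, updated_by)
               SELECT locus_id, description, updated_by
               FROM locus WHERE locus_id = ?""",
            (locus_id,),
        )
        conn.execute(
            "UPDATE locus SET description = ?, updated_by = ? WHERE locus_id = ?",
            (description, user_id, locus_id),
        )


def revert_locus(conn, locus_id, user_id):
    """Restore the most recently archived version; False if none exists."""
    row = conn.execute(
        """SELECT description FROM locus_history
           WHERE locus_id = ? ORDER BY history_id DESC LIMIT 1""",
        (locus_id,),
    ).fetchone()
    if row is None:
        return False
    update_locus(conn, locus_id, row[0], user_id)
    return True
```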
ACKNOWLEDGMENTS
We would like to thank Esther Van Der Knaap, Roger Chetelat, and Dani
Zamir and all submitters for contributing data to the phenotyped populations
and locus database, and Anuradha Pujar for contributing to the development
of the Solanaceae Phenotype Ontology. We would also like to thank two
anonymous reviewers for their helpful comments.
Received March 23, 2008; accepted May 9, 2008; published June 6, 2008.
LITERATURE CITED
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of
the flowering plant Arabidopsis thaliana. Nature 408: 796–815
Avraham S, Tung CW, Ilic K, Jaiswal P, Kellogg EA, McCouch S, Pujar A,
Reiser L, Rhee SY, Sachs MM, et al (2008) The Plant Ontology Database:
a community resource for plant structure and developmental stages
controlled vocabulary and annotations. Nucleic Acids Res 36: D449–D454
Butler L (1952) The linkage map of the tomato. J Hered 43: 25–35
Caspi R, Foerster H, Fulcher CA, Kaipa P, Krummenacker M, Latendresse
M, Paley S, Rhee SY, Shearer AG, Tissier C, et al (2008) The MetaCyc
Database of metabolic pathways and enzymes and the BioCyc collection
of Pathway/Genome Databases. Nucleic Acids Res 36: D623–D631
Crosby MA, Goodman JL, Strelets VB, Zhang P, Gelbart WM, FlyBase
Consortium (2007) FlyBase: genomes by the dozen. Nucleic Acids Res
35: D486–D491
Eppig JT, Blake JA, Bult CJ, Kadin JA, Richardson JE, Mouse Genome
Database Group (2007) The mouse genome database (MGD): new
features facilitating a model system. Nucleic Acids Res 35: D630–D637
Eshed Y, Zamir D (1995) An introgression line population of Lycopersicon
pennellii in the cultivated tomato enables the identification and fine
mapping of yield-associated QTL. Genetics 141: 1147–1162
Gene Ontology Consortium (2008) The Gene Ontology project in 2008.
Nucleic Acids Res 36: D440–D444
Gonzalo MJ, van der Knaap E (2008) A comparative analysis into the
genetic bases of morphology in tomato varieties exhibiting elongated
fruit shape. Theor Appl Genet 116: 647–656
Lawrence CJ, Schaeffer ML, Seigfried TE, Campbell DA, Harper LC
(2007) MaizeGDB’s new data types, resources and activities. Nucleic
Acids Res 35: D895–D900
Liang C, Jaiswal P, Hebbard C, Avraham S, Buckler ES, Casstevens T,
Hurwitz B, McCouch S, Ni J, Pujar A, et al (2008) Gramene: a growing
plant comparative genomics resource. Nucleic Acids Res 36: D947–D953
Menda N, Semel Y, Peled D, Eshed Y, Zamir D (2004) In silico screening of
a saturated mutation library of tomato. Plant J 38: 861–872
Muller HM, Kenny EE, Sternberg PW (2004) Textpresso: an ontology-based
information retrieval and extraction system for biological literature. PLoS Biol 2: e309
Mueller LA, Mills AA, Skwarecki B, Buels RM, Menda N, Tanksley SD
(2008) The SGN comparative map viewer. Bioinformatics 24: 422–423
Mueller LA, Solow TH, Taylor N, Skwarecki B, Buels R, Binns J, Lin C,
Wright MH, Ahrens R, Wang Y, et al (2005) The SOL Genomics
Network: a comparative resource for Solanaceae biology and beyond.
Plant Physiol 138: 1310–1317
Mungall CJ, Emmert DB, FlyBase Consortium (2007) A Chado case study:
an ontology-based modular schema for representing genome-associated
biological information. Bioinformatics 23: i337–i346
Ohyanagi H, Tanaka T, Sakai H, Shigemoto Y, Yamaguchi K, Habara T,
Fujii Y, Antonio BA, Nagamura Y, Imanishi T, et al (2006) The Rice
Annotation Project Database (RAP-DB): hub for Oryza sativa ssp.
japonica genome information. Nucleic Acids Res 34: D741–D744
Ori N, Cohen AR, Etzioni A, Brand A, Yanai O, Shleizer S, Menda N,
Amsellem Z, Efroni I, Pekker I, et al (2007) Regulation of LANCEOLATE by miR319 is required for compound-leaf development in tomato.
Nat Genet 39: 787–791
Pennisi E (2000) Ideas fly at gene-finding jamboree. Science 287: 2182–2184
Riley M, Abe T, Arnaud MB, Berlyn MK, Blattner FR, Chaudhuri RR,
Glasner JD, Horiuchi T, Keseler IM, Kosuge T, et al (2006) Escherichia
coli K-12: a cooperatively developed annotation snapshot—2005.
Nucleic Acids Res 34: 1–9
Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C,
Fuellen G, Gilbert JG, Korf I, Lapp H, et al (2002) The BioPerl toolkit:
Perl modules for the life sciences. Genome Res 12: 1611–1618
Stein L (2001) Genome annotation: from sequence to biology. Nat Rev
Genet 2: 493–503
Swarbreck D, Wilks C, Lamesch P, Berardini TZ, Garcia-Hernandez M,
Foerster H, Li D, Meyer T, Muller R, Ploetz L, et al (2008) The
Arabidopsis Information Resource (TAIR): gene structure and function
annotation. Nucleic Acids Res 36: D1009–D1014
Xiao H, Jiang N, Schaffner E, Stockinger EJ, van der Knaap E (2008) A
retrotransposon-mediated gene duplication underlies morphological
variation of tomato fruit. Science 319: 1527–1530
Plant Physiol. Vol. 147, 2008
1799
Downloaded from on June 17, 2017 - Published by www.plantphysiol.org
Copyright © 2008 American Society of Plant Biologists. All rights reserved.