Download Plant Genome Resources at the National Center for Biotechnology

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Neuronal ceroid lipofuscinosis wikipedia , lookup

Copy-number variation wikipedia , lookup

Whole genome sequencing wikipedia , lookup

No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup

Gene therapy wikipedia , lookup

Vectors in gene therapy wikipedia , lookup

Non-coding DNA wikipedia , lookup

Protein moonlighting wikipedia , lookup

Minimal genome wikipedia , lookup

Gene desert wikipedia , lookup

NEDD9 wikipedia , lookup

Gene expression profiling wikipedia , lookup

Genome (book) wikipedia , lookup

Gene wikipedia , lookup

Human genome wikipedia , lookup

Point mutation wikipedia , lookup

RNA-Seq wikipedia , lookup

Genetic engineering wikipedia , lookup

Public health genomics wikipedia , lookup

Gene nomenclature wikipedia , lookup

Genomic library wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Designer baby wikipedia , lookup

Microevolution wikipedia , lookup

Metagenomics wikipedia , lookup

Pathogenomics wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

History of genetic engineering wikipedia , lookup

Genome editing wikipedia , lookup

Genomics wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Genome evolution wikipedia , lookup

Helitron (biology) wikipedia , lookup

Transcript
Bioinformatics
Plant Genome Resources at the National Center
for Biotechnology Information
David L. Wheeler*, Brian Smith-White, Vyacheslav Chetvernin, Sergei Resenchuk, Susan M. Dombrowski,
Steven W. Pechous, Tatiana Tatusova, and James Ostell
National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland 20894
The National Center for Biotechnology Information (NCBI) integrates data from more than 20 biological databases through
a flexible search and retrieval system called Entrez. A core Entrez database, Entrez Nucleotide, includes GenBank and is tightly
linked to the NCBI Taxonomy database, the Entrez Protein database, and the scientific literature in PubMed. A suite of more
specialized databases for genomes, genes, gene families, gene expression, gene variation, and protein domains dovetails with
the core databases to make Entrez a powerful system for genomic research. Linked to the full range of Entrez databases is the
NCBI Map Viewer, which displays aligned genetic, physical, and sequence maps for eukaryotic genomes including those of
many plants. A specialized plant query page allow maps from all plant genomes covered by the Map Viewer to be searched in
tandem to produce a display of aligned maps from several species. PlantBLAST searches against the sequences shown in the
Map Viewer allow BLAST alignments to be viewed within a genomic context. In addition, precomputed sequence similarities,
such as those for proteins offered by BLAST Link, enable fluid navigation from unannotated to annotated sequences,
quickening the pace of discovery. NCBI Web pages for plants, such as Plant Genome Central, complete the system by providing
centralized access to NCBI’s genomic resources as well as links to organism-specific Web pages beyond NCBI.
DATABASES COVERED BY THE NATIONAL CENTER
FOR BIOTECHNOLOGY INFORMATION’S
ENTREZ SYSTEM
The National Center for Biotechnology Information
(NCBI) provides a data-rich environment in support of
genomic research by integrating the data from more
than 20 biological databases through a flexible search
and retrieval system called Entrez. A core database in
Entrez, Entrez Nucleotide, includes GenBank (Benson
et al., 2005), a primary database of nucleotide sequences
built and maintained by NCBI and synchronized daily
with the DNA Databank of Japan (Tateno et al., 2004)
and the European Molecular Biological Laboratory
(Kanz et al., 2004) databases. Entrez Nucleotide contains almost 12 million plant-derived sequences. Entrez databases that are coupled tightly with the
nucleotide sequences in Entrez Nucleotide are the
NCBI Taxonomy, Protein, PubMed, and PubMed Central databases. The NCBI Taxonomy database covers
over 160,000 organisms and includes almost 60,000
plant taxa. Entrez Protein comprises sequences taken
from several sources (Boeckmann et al., 2003; Wu et al.,
2003; Deshpande et al., 2004), including the conceptual
translations of coding regions annotated on GenBank
records, and contains over 350,000 plant-derived sequences. This primary sequence data is coupled to the
scientific literature via links to PubMed and PubMed
Central. In addition to the core databases mentioned,
the Entrez system also covers a suite of more special* Corresponding author; e-mail [email protected]; fax
301–480–9241.
www.plantphysiol.org/cgi/doi/10.1104/pp.104.058842.
1280
ized databases using a variety of database schemas. A
few of particular relevance to genomics are Entrez
Genomes, containing genomic sequence and annotations for over 1,000 organisms, and including 39 complete sequences for plant chromosomes, plastids, and
mitochondria; UniGene, containing gene-oriented sequence clusters for over 40 organisms, and including
more than 200,000 clusters for 20 plants; Entrez Gene
(Maglott et al., 2004), a gene-centered resource covering over a million loci, including over 81,000 loci for
plants; HomoloGene, containing clusters of homologous genes for over a dozen model organisms, including almost 16,000 clusters for plants; RefSeq
(Pruitt et al., 2005), released bimonthly via FTP,
updated daily in Entrez, and containing reference
nucleotide and protein sequences for almost 3,000 organisms, including over 65,000 transcript sequences
and more then 69,000 protein sequences for 40 plants;
UniSTS, a nonredundant database of sequence-tagged
sites (STSs) from over 200 organisms, including over
8,000 STSs from 30 plants; dbSNP (Sherry et al., 2001),
a database of nucleotide variations from over 20
organisms, including almost four million single nucleotide polymorphisms (SNPs) from six plants; the
Conserved Domain Database (CDD; Marchler-Bauer
et al., 2004), containing over 10,000 alignment-based
protein domain profiles, including over 250 that are
specific to plants; and the Gene Expression Omnibus
(GEO; Barrett et al., 2004) database of gene expression,
with data from over 20 organisms, including 38 datasets and over 600,000 expression profiles for plant
tissues. A network of links allows fluid navigation
between these databases. Links may interconnect
a sequence in Entrez Nucleotide to the abstract in
Plant Physiology, July 2005, Vol. 138, pp. 1280–1288, www.plantphysiol.org Ó 2005 American Society of Plant Biologists
Downloaded from on August 3, 2017 - Published by www.plantphysiol.org
Copyright © 2005 American Society of Plant Biologists. All rights reserved.
Plant Genome Resources
PubMed of the paper in which it is cited. If a sequence
in Entrez Nucleotide bears annotated coding regions,
then it will be interconnected to the sequences in
Entrez Protein derived from the respective conceptual
translations. Many Entrez links are computationally
derived, such as links between protein sequences made
on the basis of sequence similarity. These precomputed
protein alignments, updated daily, are available for
viewing using the BLAST Link (BLink) resource. To see
the variety of links to other databases or resources
available for any record in Entrez, click on the ‘‘Links’’
link in the upper right-hand corner of the Entrez
display.
The NCBI Map Viewer (http://www.ncbi.nlm.nih.
gov/mapview/), used to display genomic maps for
many plant and animal genomes, takes advantage of
the same database technology as Entrez and supports
text queries with Boolean logic. Searches of complete
genomic sequences from organisms ranging from
microbes to higher plants and animals may be performed via genomic BLAST searches that lead to Map
Viewer displays in which the genomic context of the
hits can be seen.
The NCBI genomics environment offers a unique
opportunity for researchers. While specialized genomic resources provide excellent coverage of the data
for a single organism or a closely related group of
organisms, the environment at NCBI enables the comparison of genomic data from species across the entire
taxonomic range. This wide taxonomic coverage greatly
enhances the utility of the data by allowing advances
in the understanding of the genomics of one organism
to be applied to a broader genomic context spanning
many organisms.
In the descriptions that follow, addresses for the key
NCBI Web pages appear directly in the text and are
also compiled in the ‘‘URLs for NCBI Resources and
Download Sites’’ section below. For a more detailed
overview of NCBI resources, see the NCBI home page
(http://www.ncbi.nlm.nih.gov) and the review in
Wheeler et al. (2005). Details of the creation and
curation of NCBI RefSeqs can be found at http://
www.ncbi.nlm.nih.gov/RefSeq/, and a synopsis of
the HomoloGene build procedure can be found at
http://www.ncbi.nlm.nih.gov/HomoloGene/HTML/
homologene_buildproc.html.
SOURCES AND ORGANIZATION OF THE PLANT
GENOMIC DATA AT NCBI
The bulk of the primary plant genomic data available at NCBI falls into one of three categories that
mirror the types of projects currently conducted by the
plant research community; these are genomic assemblies, batches of Expressed Sequence Tags (ESTs), and
genetic or physical genomic maps. Genomic assemblies
include that of the Arabadopsis (Arabidopsis thaliana)
genome produced by the Arabidopsis Genome Initiative (2000), the assembly of Oryza sativa subsp. japonica
resulting from the International Rice Genome Sequencing Project (Yu et al., 2002) and yielding assemblies for
chromosomes 1 and 10, and that of O. sativa L. subsp.
indica produced by the Chinese Academy of Sciences in
Beijing (Sasaki et al., 2002). Emerging data from several
plant genome sequencing projects, including projects
for Lycopersicon esculentum, Medicago sativa and Medicago truncatula, Lotus corniculatus var. japonicus, Populus
balsamifera, Poncirus trifoliata, Vitis vinifera, and many
others, can be accessed in the NCBI Genome Project
Database (http://www.ncbi.nlm.nih.gov/entrez/query.
fcgi?CMD5search&DB5genomeprj), which provides
a good overview of current and planned sequencing
efforts.
Data resulting from large-scale EST sequencing
projects for over 70 plants have been deposited into
GenBank and are integrated with other sequence data
at NCBI in a number of ways. Organisms for which
more than 70,000 EST sequences have been deposited
are automatically entered into the UniGene database,
in which case their EST sequences are combined with
other transcript sequences in GenBank and partitioned
into gene-oriented clusters. In this manner, ESTs from
20 plant species have been used to produce more than
200,000 transcript clusters. The UniGene clusters
themselves are subjected to further analysis to link
them to the STSs in UniSTS, genes in Entrez Gene,
gene homologs in HomoloGene, and proteins in Entrez Protein. As a consequence, it is often possible to
simply query Entrez with a GenBank EST accession
number and wind up with a gene location, protein sequence, and estimate of sequence conservation across
organisms in a single step.
ESTs arising from several plant species as well as the
UniGene clusters derived from them are aligned to the
well-assembled Arabidopsis and O. sativa genomes.
ESTs from the Liliopsida are aligned to the O. sativa
genome; these include Hordeum vulgare, O. sativa,
Sorghum bicolor, Triticum aestivum, and Zea mays. Those
from the Eudicotyledons are aligned to the Arabidopsis genome and include Arabidopsis itself, Glycine
max, Lactuca sativa, L. corniculatus, Malus 3 domestica,
M. truncatula, Populus tremula 3 Populus tremuloides,
Solanum tuberosum, and V. vinifera. These alignments
make an important link between EST data, which is
relatively inexpensive to obtain, and assembled, wellannotated genomic sequence, a more expensive commodity. In addition to these ESTs, more than three
million SNPs, submitted to NCBI’s dbSNP database,
have been mapped to the O. sativa genome at NCBI.
Genetic mapping projects are under way for a number of plants, and maps resulting from these projects
are displayed in the Map Viewer discussed below.
Such nonsequence data is imported by NCBI from
many publicly available databases in formats ranging
from Structured Query Language table dumps to flat
files with unique formats. NCBI processes the data to
link feature locations with GenBank records, with data
in related databases, and with URLs that point back to
the source of the data. Once processed, the data is
Plant Physiol. Vol. 138, 2005
1281
Downloaded from on August 3, 2017 - Published by www.plantphysiol.org
Copyright © 2005 American Society of Plant Biologists. All rights reserved.
Wheeler et al.
entered into the Map Viewer database for display.
Sources of nonsequence data shown in the Map
Viewer include maps from MaizeGDB (Lawrence
et al., 2004), Gramene (Ware et al., 2002), GrainGenes
(2005, http://wheat.pw.usda.gov/GG2/index.shtml),
the Solanaceae Genomics Network (2005, http://
www.sgn.cornell.edu), and BarleyDB (BarleyDB at
Cropnet, 2005, http://ukcrop.net/barley.html). The
map names in Map Viewer displays link to descriptions of the source of each map data with references to
the primary data. A summary of the scope of the plant
genomic data available at NCBI with links to associated resources can be found on the Plant Genome
Central Web page.
ACCESSING THE DATA
The Map Viewer provides a central interface to plant
genomic data at NCBI, serving as an interactive tool
for viewing an ensemble of aligned genetic, physical,
or sequence-based maps with an adjustable focus
ranging from that of a complete chromosome to that
of a portion of a gene. The maps displayed in the Map
Viewer may be derived from a single organism or from
multiple organisms; map alignments are performed on
the basis of shared markers.
There are many species of plant for which Map
Viewer displays are available: Arabidopsis, oat (Avena
sativa), G. max, H. vulgare, L. esculentum, O. sativa, S.
bicolor, T. aestivum, Triticum monococcum, Z. mays, and
L. corniculatus var. japonicus. The creation of displays
for M. sativa and M. truncatula is in progress. A number of genetic or physical maps are usually available
for each organism displayed in the Map Viewer. For
two organisms, O. sativa and Arabidopsis, sequence
maps are available with annotated genes, and for
these two genomes EST and UniGene alignments are
available, as discussed above. Annotations on the
plant genomic assemblies displayed in the Map
Viewer are those produced by the appropriate model
organism community annotation effort. References
and links to the projects generating these annotations
are shown on Map Viewer genome overview pages.
Navigation to a target Map Viewer display is via one
of three primary routes: a Map Viewer text query,
an Entrez text query, or a PlantBLAST search. When
a match is found through one of these avenues, it is
possible to navigate to a Map Viewer display from
which many routes to genomic comparison and analysis are available. The usage scenarios outlined below
illustrate these three entry mechanisms.
USAGE SCENARIO: BEGINNING WITH A MAP
VIEWER TEXT QUERY
The Plant Genomes Query page (http://www.
ncbi.nlm.nih.gov/mapview/map_search.cgi?chr5plants.
inf) enables one to simultaneously search physical,
genetic, and sequence maps shown in Map Viewer
for several plant genomes. Queries may include clone
names or aliases, gene names or symbols, locus identifiers, gene products or descriptions, genetic marker
or GenBank identifiers. Multiple search terms can be
combined using Boolean logic.
Since the early 1990s, various lines of research have
shown that large-scale genome structure is conserved
in blocks across the grasses (Ahn and Tanksley, 1993;
Devos et al., 1994; Kurata et al., 1994; Van Deynze et al.,
1995). Locus nomenclature is organism specific and is
unreliable as a query method between species; however, the regular nomenclature of plasmids (Lederberg,
1986) is not influenced by how the plasmid or insert is
used. The data for the plant maps available through
Map Viewer include the marker-locus relationships,
based on experimental evidence and provided by
model organism annotators external to NCBI, for
each locus where the allelic state is identified by the
nucleic acid sequence. This information enables the
rendering of visual connection between those mapped
loci in adjacently displayed maps that were identified
by the same probe. In principle, this locus-probe
relationship allows a cross-species text search using
the probe name as the query string as depicted in
Figure 1. The text query used was ‘‘cdo718’’ (Fig. 1A),
the name of a plasmid with an oat cDNA insert. This
probe was used to map loci in nine maps available in
Map Viewer: the AxH-92 map in oat; the Cons map in
H. vulgare; the RC94, RW99, R, RC00, and RC01 maps
in O. sativa; the S-0 map in T. aestivum; and the RW99
map in Z. mays. The red lines between each map
connect the loci identified by the probe. The gray lines
connect the other loci in adjacent maps that have been
identified by the same probe. Note that the Map
Viewer display of Figure 1 shows full marker names
for each track because the compress maps check box
(Fig. 1B), checked by default to accommodate a large
number of maps, has been unchecked. Mouseovers
reveal more information about each marker (Fig. 1C)
and links to external resources, such as MaizeGDB
(Lawrence et al., 2004), are found in the links column
(Fig. 1D).
USAGE SCENARIO: BEGINNING WITH A
TEXT-BASED SEARCH IN ENTREZ
The entire suite of Entrez databases may be searched
in tandem using the Entrez Global Query page
(http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi).
Using Global Query, it is not necessary to know in
advance in which databases the data will be found,
and, as such, it is an excellent starting point for an
initial survey of a new research area. As an example,
consider a search for salt tolerance-related genes in
plants. One may begin with the following query: ‘‘salt
tolerance’’ AND viridiplantae [organism].
The query consists of two parts, linked by a Boolean
‘‘AND,’’ which is always given in capital letters. For
1282
Plant Physiol. Vol. 138, 2005
Downloaded from on August 3, 2017 - Published by www.plantphysiol.org
Copyright © 2005 American Society of Plant Biologists. All rights reserved.
Plant Genome Resources
Figure 1. Map Viewer displays resulting from a search for marker ‘‘cdo718’’ showing aligned maps from several plants. The text
‘‘cdo718,’’ shown near A, is highlighted on each map with lines between maps connecting the highlighted markers (C). The
Compress maps check box (B) has been unchecked to show full locus names on the maps. The Symbol links under D lead to the
MaizeGDB (Lawrence et al., 2004) Web resource.
the query to be successful, the quoted phrase ‘‘salt
tolerance’’ must be found in some part of a database
record. The second portion of the query limits the
search to the taxon viridiplantae within the Entrez
organism field, given in the square brackets. Hence,
for databases, such as the sequence databases, that
classify records by organism, only records for the
viridiplantae will be returned.
The entries identified with the query are displayed
on the Global Entrez results page (Fig. 2H). The search
matches entries in the Entrez Nucleotide, Protein,
Gene, UniGene, HomoloGene, and GEO databases.
Navigating directly to the matches in the Gene database by clicking on the link in Global Query generates
a display of summaries (data not shown) for six genes.
Clicking on the link to the first of these Entrez Gene
records, the salt-tolerance (STO) gene of Arabidopsis,
yields the report shown in the main area of Figure 2.
The report begins with the Entrez Gene GeneID and
name of the gene, shown in section A. Following this is
a graphical representation of the structure of the gene,
section B, showing two splice variants, a two-exon
form, and a three-exon form. In section C, the genomic
context of STO is shown, followed, in section D, by its
type, aliases, taxonomic lineage, and links to the
literature in PubMed. Gene Ontology (Gene Ontology
Consortium, 2004) terms are given in section E. Gene
Ontology terms are taken directly from the Gene
Ontology database without modification. Section F,
titled ‘‘NCBI Reference Sequences,’’ gives links to
transcript and protein reference sequences as well as
links to precomputed analyses showing the existence
of a Z-box zinc finger domain from the CDD in the
predicted protein. Many links to other NCBI databases
such as UniGene and HomoloGene, to other NCBI
resources such as Map Viewer, and to other organismspecific sites such as The Arabidopsis Information
Resource (TAIR; Garcia-Hernandez et al., 2002) and
Munich Information Center for Protein Sequence
(MIPS) Arabidopsis Database (Schoof et al., 2002) are
available from the Links pulldown menu shown in
section G.
Plant Physiol. Vol. 138, 2005
1283
Downloaded from on August 3, 2017 - Published by www.plantphysiol.org
Copyright © 2005 American Society of Plant Biologists. All rights reserved.
Wheeler et al.
Figure 2. Entrez Gene report for the STO gene showing many links to related databases and resources. The Entrez Gene report for
STO shows the GeneID and name of the gene (A), the structure of the gene (B), its genomic context (C), gene-type, aliases,
taxonomic lineage, and links to the literature. Gene Ontology (Gene Ontology Consortium, 2004) terms are listed in section E,
while links to RefSeqs and NCBI Conserved Domains detected within gene products are given in section F. Many links to other
NCBI databases such as UniGene and HomoloGene as well as to other organisms-specific sites such as TAIR (Garcia-Hernandez
et al., 2002) and MIPS (Schoof et al., 2002) are available from the Links pulldown menu.
Returning to the Global Query page, it may be of
interest to see if additional genes related to salt tolerance can be found using a different route. For example, clicking on the Global Query link to the Entrez
Protein database yields a list of the first 20 matching
proteins. Among these entries is a SwissProt-derived
record for Arabidopsis SAL1 phosphatase, accession
number Q42546. Clicking on the accession number to
see the GenBank flat file view of this entry reveals
that the first literature reference (Quintero et al., 1996)
within the record is titled ‘‘The SAL1 gene of Arabidopsis, encoding an enzyme with 3#(2#),5#bisphosphate nucleotidase and inositol polyphosphate
1-phosphatase activities, increases salt tolerance in
yeast.’’ Navigating from the protein record to Entrez
Gene using the a Links pulldown menu similar to that
shown in Figure 2G returns a report for a gene not
identified in Entrez Gene by a direct text query because
‘‘salt tolerance’’ is not among its annotations. The
report for this gene lists Gene Ontology terms suggesting an involvement with stress responses other than
salt concentration, such as cold stress, so it is possible
that the gene has a role in a general stress-response
pathway.
USAGE SCENARIO: BEGINNING
WITH A SEQUENCE-BASED SEARCH
USING PLANTBLAST
The suite of BLAST sequence similarity programs
offers an alternative to a text query as a method of
entry into the Entrez system. Using BLAST, a novel
sequence can be quickly linked to known database
sequences to which it is similar. Links from these
1284
Plant Physiol. Vol. 138, 2005
Downloaded from on August 3, 2017 - Published by www.plantphysiol.org
Copyright © 2005 American Society of Plant Biologists. All rights reserved.
Plant Genome Resources
sequence records to others in Entrez allow fluid
navigation between NCBI resources. As an example,
consider an unknown European white birch (Betula
pendula) EST, GenBank accession number AJ606366. A
simple Entrez query reveals that this EST does not
belong to any UniGene cluster and, therefore, no geneoriented information is available. In addition, since
European white birch ESTs are not among those
routinely aligned to the O. sativa or Arabidopsis
genomic sequence by NCBI, it will not be possible to
simply look this one up in Map Viewer. However,
using Plant Genome BLAST, it may still be possible to
place this EST on the Arabidopsis genome.
The PlantBLAST Page (http://www.ncbi.nlm.nih.
gov/BLAST/Genome/PlantBlast.shtml) is reached
via the Plants link on the BLAST homepage. The
query form allows the selection of the species of plant
to BLAST against, the database to search, and the
BLAST algorithm to use. For example, one may select
Arabidopsis, one of 10 available plant species, as the
organism, the set of protein sequences derived from its
genomic annotation as the database, and the BLASTX
program from the pulldown menus, using the form
shown in Figure 3A. Alternatively, mapped sequences
from all available plants may be selected as the target
of the search. The BLASTX program allows one to
compare the six-frame protein translations of a nucleotide query to the sequences in a protein database and
is a very sensitive way to perform a cross-species
search. As a query, either the EST sequence or its
GenBank accession number, AJ606366, may be used.
Such a search returns significant matches to several
members of the chitinase protein family, among them
NCBI RefSeq accession number NP_172076. Clicking
Figure 3. Pathway beginning with a PlantBLAST search of an EST sequence. Armed with an EST sequence, one may navigate to
a Map Viewer display for a sequenced-anchored organism by performing a translated PlantBLAST search against mapped
proteins from Arabidopsis (A). The BLAST program chosen is BLASTX in which the six-frame translations of the EST sequence will
be compared to sequences from the selected protein database. Matches to Arabidopsis proteins mapping to the genome are seen
in the overview PlantBLAST graphic (B); clicking on the ‘‘NP_172076’’ link to the first hit in the table generates a Map Viewer
display (C) of the Arabidopsis Genes_seq track showing the position of the hit. The ‘‘At1g05850’’ link immediately to the right of
the Genes_seq track in the Map Viewer display leads to Entrez Gene and is followed by links to other NCBI and external
organism-specific resources.
Plant Physiol. Vol. 138, 2005
1285
Downloaded from on August 3, 2017 - Published by www.plantphysiol.org
Copyright © 2005 American Society of Plant Biologists. All rights reserved.
Wheeler et al.
on the Genome View button in the BLAST results and
adding the gene map as the rightmost map (data not
shown) invokes a graphical Arabidopsis Map Viewer
overview revealing many hits across the Arabidopsis
genome (Fig. 3B). The ‘‘NP_172076’’ link in the overview leads to the detailed display the alignment of
NP_172076 to the Arabidopsis genome of Figure 3C
with links; ‘‘At1g05850’’ leads to an Entrez Gene report
for the corresponding gene, and ‘‘TAIR’’ leads to The
Arabidopsis Information Resource. The Entrez Gene
report contains links to a publication (Zhong et al.,
2002) in PubMed titled ‘‘Mutation of a chitinase-like
gene causes ectopic deposition of lignin, aberrant cell
shapes, and overproduction of ethylene.’’ The Gene
report also contains a link to the BLink view (Fig. 4A)
of alignments between the sequence of the At1g05850
gene product and similar protein sequences. Using the
BLink report CDD-Search button, the CDD view of
Figure 4B can be reached. The BLink report, a portion
of which is shown in Figure 4A, presents 200 alignments to other plant proteins including proteins from
O. sativa. Returning to the Map Viewer and following
the link to the TAIR Web page for the gene (data not
shown) also shows that this gene has two known
mutant alleles leading to ectopic deposition of lignin in
pith.
Consider as a final example an EST with GenBank
accession number AU251269 from Italian rye grass
(Lolium multiflorum). A BLASTX PlantBLAST search
against O. sativa proteins, performed as in the preceding case, reveals a putative protein homolog in
O. sativa with RefSeq accession number NP_909676.
Navigating to the BLink report for NP_909676 shows
conservation across all taxonomy branches (Fig. 5A)
and includes an alignment to human protein
NP_003932, Wiskott-Aldrich syndrome gene-like protein (Fig. 5B). The Entrez Gene report for this curated
RefSeq contains links to a large collection of articles in
PubMed and a number of Gene Ontology terms
pertaining to actin activity (Fig. 5, C and D) as well
as an impressive array of links to other resources
(Fig. 5E).
Figure 4. BLink and CDD reports for the protein product of At1g05850. The BLink report shown in A presents precomputed
alignments between the sequence of the At1g05850 gene product and similar sequences in the Entrez Protein database. The
CDD-Search button in the BLink report leads to a conserved domain view (B) of the At1g05850 gene product.
1286
Plant Physiol. Vol. 138, 2005
Downloaded from on August 3, 2017 - Published by www.plantphysiol.org
Copyright © 2005 American Society of Plant Biologists. All rights reserved.
Plant Genome Resources
Figure 5. BLink report for an O. sativa protein and the Entrez Gene report of a similar human protein. O. sativa protein
NP_909676 aligns well to proteins from an array of organisms, as shown in the BLink report header (A). Among these proteins is
a human protein, NP_003932, associated with Wiskott-Aldrich syndrome and found in position (B). The Entrez Gene report for
the human protein provides information that may be helpful in understanding the O. sativa protein such as literature references in
PubMed and Gene Ontology terms relating to actin (C and D). Links to many other NCBI and gene-related resources are given in
the Links pulldown menu (E).
URLS FOR NCBI RESOURCES AND
DOWNLOAD SITES
Resources
NCBI Homepage: http://www.ncbi.nlm.nih.gov
The Reference Sequence Database: http://www.
ncbi.nlm.nih.gov/RefSeq/
Entrez Global Query: http://www.ncbi.nlm.nih.
gov/gquery/gquery.fcgi
Map
Viewer:
http://www.ncbi.nlm.nih.gov/
mapview/
Plant Genomes Central: http://www.ncbi.nlm.nih.
gov/genomes/PLANTS/PlantList.html
The Plant Genomes Query page: http://www.
ncbi.nlm.nih.gov/mapview/map_search.cgi?chr5
plants.inf
PlantBLAST:
http://www.ncbi.nlm.nih.gov/
BLAST/Genome/PlantBlast.shtml
Plant Genome sequencing projects database:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?
CMD5search&DB5genomeprj
Downloads
Genome records can be downloaded from the following FTP directories: ftp://ftp.ncbi.nih.gov/
genbank/;
ftp://ftp.ncbi.nih.gov/genomes/;
ftp://ftp.ncbi.nih.gov/refseq/
Arabidopsis data in several file formats for each
chromosome available at: ftp://ftp.ncbi.nih.
gov/genomes/Arabidopsis_thaliana
GenBank sequence identifiers (NCBI gis) used to
create the databases available in PlantBLAST:
ftp://ftp.ncbi.nih.gov/genomes/PLANTS/
BLASTDB/
Plant Physiol. Vol. 138, 2005
1287
Downloaded from on August 3, 2017 - Published by www.plantphysiol.org
Copyright © 2005 American Society of Plant Biologists. All rights reserved.
Wheeler et al.
For details and the statistics for the NCBI rice
genome assembly, see: http://www.ncbi.nlm.nih.
gov/genomes/PLANTS/rice_0.html
FASTA-format file of BAC-end GenBank records
organized by organism: ftp://ftp.ncbi.nih.gov/
genomes/BACENDS/
ACKNOWLEDGMENTS
The authors would like to thank Pavel Bolotov, Andrei Kochergin, Igor
Tolstoy, and Boris Kiryutin for their expertise and diligence in the maintenance of many of the databases highlighted in this article.
Received December 22, 2004; revised April 7, 2005; accepted April 22, 2005;
published July 11, 2005.
LITERATURE CITED
Ahn SN, Tanksley SD (1993) Comparative linkage maps of the rice and
maize genomes. Proc Natl Acad Sci USA 90: 7980–7984
Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, Rudnev
D, Lash AE, Fujibuchi W, Edgar R (2004) NCBI GEO: mining millions of
expression profiles: database and tools. Nucleic Acids Res 33: D562–D566
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2005)
GenBank: update. Nucleic Acids Res 33: D34–D38
Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A,
Gasteiger E, Martin MJ, Michoud K, O’Donovan C, Phan I, et al (2003)
The SWISS-PROT protein knowledgebase and its supplement TrEMBL
in 2003. Nucleic Acids Res 31: 365–370
Deshpande N, Addess KJ, Bluhm WF, Merino-Ott JC, Townsend-Merino
W, Zhang Q, Knezevich C, Xie L, Chen L, Feng Z, et al (2004) The RCSB
Protein Data Bank: a redesigned query system and relational database
based on the mmCIF schema. Nucleic Acids Res 33: D233–D237
Devos KM, Chao S, Li QY, Simonetti MC, Gale MD (1994) Relationship
between chromosome 9 of maize and wheat homeologous group 7
chromosomes. Genetics 138: 1287–1292
Garcia-Hernandez M, Berardini TZ, Chen G, Crist D, Doyle A, Huala E,
Knee E, Lambrecht M, Miller N, Mueller LA, et al (2002) TAIR:
a resource for integrated Arabidopsis data. Funct Integr Genomics 2: 239
Gene Ontology Consortium (2004) The Gene Ontology (GO) database and
informatics resource. Nucleic Acids Res 32: D258–D261
Kanz C, Aldebert P, Althorpe N, Baker W, Baldwin A, Bates K, Browne P,
van den Broek A, Castro M, Cochrane G, et al (2004) The EMBL
nucleotide sequence database. Nucleic Acids Res 33: D29–D33
Kurata N, Moore G, Nagamura Y, Foote T, Yano M, Minobe Y, Gale MD
(1994) Conservation of genome structure between rice and wheat.
Biotechnology (N Y) 12: 276–278
Lawrence CJ, Dong Q, Polacco ML, Seigfried TE, Brendel V (2004)
MaizeGDB, the community database for maize genetics and genomics.
Nucleic Acids Res 32: D393–D397
Lederberg EM (1986) Plasmid prefix designations registered by the
Plasmid Reference Center 1977-1985. Plasmid 1: 57–92
Maglott D, Ostell J, Pruitt KD, Tatusova T (2004) Entrez Gene: genecentered information at NCBI. Nucleic Acids Res 33: D54–D58
Marchler-Bauer A, Anderson JB, Cherukuri PF, DeWeese-Scott C, Geer
LY, Gwadz M, He S, Hurwitz DI, Jackson JD, Ke Z, et al (2004) CDD:
a Conserved Domain Database for protein classification. Nucleic Acids
Res 33: D192–D196
Pruitt KD, Tatusova T, Maglott DR (2005) NCBI Reference Sequence
(RefSeq): a curated non-redundant sequence database of genomes,
transcripts and proteins. Nucleic Acids Res 33: D501–D504
Quintero FJ, Garciadeblas B, Rodriguez-Navarro A (1996) The SAL1 gene
of Arabidopsis, encoding an enzyme with 3#(2#),5#-bisphosphate nucleotidase and inositol polyphosphate 1-phosphatase activities, increases salt tolerance in yeast. Plant Cell 8: 529–537
Sasaki T, Matsumoto T, Yamamoto K, Sakata K, Baba T, Katayose Y, Wu J,
Niimura Y, Cheng Z, Nagamura Y, et al (2002) The genome sequence
and structure of rice chromosome 1. Nature 420: 312–316
Schoof H, Zaccaria P, Gundlach H, Lemcke K, Rudd S, Kolesov G, Arnold
R, Mewes HW, Mayer KF (2002) MIPS Arabidopsis thaliana Database
(MAtDB): an integrated biological knowledge resource based on the first
complete plant genome. Nucleic Acids Res 30: 91–93
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM,
Sirotkin K (2001) dbSNP: the NCBI database of genetic variation.
Nucleic Acids Res 29: 308–311
Tateno Y, Saitou N, Okubo K, Sugawara H, Gojobori T (2004) DDBJ in
collaboration with mass-sequencing teams on annotation. Nucleic Acids
Res 33: D25–D28
The Arabidopsis Genome Initiative (2000) Analysis of the genome
sequence of the flowering plant Arabidopsis thaliana. Nature 408:
796–815
Van Deynze AE, Nelson JC, O’Donoughue LS, Ahn SN, Siripoonwiwat
W, Harrington SE, Yglesias ES, Braga DP, McCouch SR, Sorrells ME
(1995) Comparative mapping in grasses: oat relationships. Mol Gen
Genet 249: 349–356
Ware D, Jaiswal P, Ni J, Pan X, Chang K, Clark K, Teytelman L, Schmidt S,
Zhao W, Cartinhour S, et al (2002) Gramene: a resource for comparative
grass genomics. Nucleic Acids Res 30: 103–105
Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Church DM,
DiCuccio M, Edgar R, Federhen S, Helmberg W, et al (2005) Database
resources of the National Center for Biotechnology Information. Nucleic
Acids Res 33: D39–D45
Wu CH, Yeh LSL, Huang H, Arminski L, Castro-Alvear J, Chen Y, Hu Z,
Kourtesis P, Ledley RS, Suzek BE, et al (2003) The Protein Information
Resource. Nucleic Acids Res 31: 345–347
Yu J, Hu S, Wang J, Wong GK, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X,
et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp.
indica). Science 296: 79–92
Zhong R, Kays SJ, Schroeder BP, Ye ZH (2002) Mutation of a chitinase-like
gene causes ectopic deposition of lignin, aberrant cell shapes, and
overproduction of ethylene. Plant Cell 14: 165–179
1288
Plant Physiol. Vol. 138, 2005
Downloaded from on August 3, 2017 - Published by www.plantphysiol.org
Copyright © 2005 American Society of Plant Biologists. All rights reserved.