articles
Analysis of the genome sequence of the
flowering plant Arabidopsis thaliana
The Arabidopsis Genome Initiative*
* Authorship of this paper should be cited as 'The Arabidopsis Genome Initiative'. A full list of contributors appears at the end of this paper.
The flowering plant Arabidopsis thaliana is an important model system for identifying genes and determining their functions.
Here we report the analysis of the genomic sequence of Arabidopsis. The sequenced regions cover 115.4 megabases of the
125-megabase genome and extend into centromeric regions. The evolution of Arabidopsis involved a whole-genome duplication,
followed by subsequent gene loss and extensive local gene duplications, giving rise to a dynamic genome enriched by lateral gene
transfer from a cyanobacterial-like ancestor of the plastid. The genome contains 25,498 genes encoding proteins from 11,000
families, similar to the functional diversity of Drosophila and Caenorhabditis elegans, the other sequenced multicellular
eukaryotes. Arabidopsis has many families of new proteins but also lacks several common protein families, indicating that the sets
of common proteins have undergone differential expansion and contraction in the three multicellular eukaryotes. This is the first
complete genome sequence of a plant and provides the foundations for more comprehensive comparison of conserved processes
in all eukaryotes, identifying a wide range of plant-specific gene functions and establishing rapid systematic ways to identify
genes for crop improvement.
The plant and animal kingdoms evolved independently from
unicellular eukaryotes and represent highly contrasting life forms.
The genome sequences of C. elegans1 and Drosophila2 reveal that
metazoans share a great deal of genetic information required for
developmental and physiological processes, but these genome
sequences represent a limited survey of multicellular organisms.
Flowering plants have unique organizational and physiological
properties in addition to ancestral features conserved between
plants and animals. The genome sequence of a plant provides a
means for understanding the genetic basis of differences between
plants and other eukaryotes, and provides the foundation for
detailed functional characterization of plant genes.
Arabidopsis thaliana has many advantages for genome analysis,
including a short generation time, small size, large number of
offspring, and a relatively small nuclear genome. These advantages
promoted the growth of a scientific community that has investigated the biological processes of Arabidopsis and has characterized
many genes3. To support these activities, an international collaboration (the Arabidopsis Genome Initiative, AGI) began sequencing
the genome in 1996. The sequences of chromosomes 2 and 4 have
been reported4,5, and the accompanying Letters describe the
sequences of chromosomes 1 (ref. 6), 3 (ref. 7) and 5 (ref. 8).
Here we report analysis of the completed Arabidopsis genome
sequence, including annotation of predicted genes and assignment
of functional categories. We also describe chromosome dynamics
and architecture, the distribution of transposable elements and
other repeats, the extent of lateral gene transfer from organelles,
and the comparison of the genome sequence and structure to that of
other Arabidopsis accessions (distinctive lines maintained by single-seed descent) and plant species. This report is the summation of
work by experts interested in many biological processes selected to
illuminate plant-specific functions including defence, photomorphogenesis, gene regulation, development, metabolism, transport
and DNA repair.
The identification of many new members of receptor families,
cellular components for plant-specific functions, genes of bacterial origin whose functions are now integrated with typical
eukaryotic components, independent evolution of several families
of transcription factors, and suggestions of as yet uncharacterized
metabolic pathways are a few more highlights of this work. The
implications of these discoveries are not only relevant for plant
biologists, but will also affect agricultural science, evolutionary
biology, bioinformatics, combinatorial chemistry, functional and
comparative genomics, and molecular medicine.
Overview of sequencing strategy
We used large-insert bacterial artificial chromosome (BAC), phage
(P1) and transformation-competent artificial chromosome (TAC)
libraries9–12 as the primary substrates for sequencing. Early stages of
genome sequencing used 79 cosmid clones. Physical maps of the
genome of accession Columbia were assembled by restriction
fragment 'fingerprint' analysis of BAC clones13, by hybridization14
or polymerase chain reaction (PCR)15 of sequence-tagged sites and
by hybridization and Southern blotting16. The resulting maps were
integrated (http://nucleus/cshl.org/arabmaps/) with the genetic
map and provided a foundation for assembling sets of contigs
into sequence-ready tiling paths. End sequence (http://www.tigr.org/tdb/at/abe/bac_end_search.html) of 47,788 BAC clones
was used to extend contigs from BACs anchored by marker content
and to integrate contigs.
Ten contigs representing the chromosome arms and centromeric
heterochromatin were assembled from 1,569 BAC, TAC, cosmid and
P1 clones (average insert size 100 kilobases (kb)). Twenty-two PCR
products were amplified directly from genomic DNA and
sequenced to link regions not covered by cloned DNA or to optimize
the minimal tiling path. Telomere sequence was obtained from
specific yeast artificial chromosome (YAC) and phage clones, and
from inverse polymerase chain reaction (IPCR) products derived
from genomic DNA. Clone fingerprints, together with BAC end
sequences, were generally adequate for selection of clones for
sequencing over most of the genome. In the centromeric regions,
these physical mapping methods were supplemented with genetic
mapping to identify contig positions and orientation17.
Selected clones were sequenced on both strands and assembled
using standard techniques. Comparison of independently derived
sequence of overlapping regions and independent reassembly of
sequenced clones revealed accuracy rates between 99.99 and
99.999%. Over half of the sequence differences were between
genomic and BAC clone sequence. All available sequenced genetic
markers were integrated into sequence assemblies to verify sequence
contigs4–8. The total length of sequenced regions, which extend from
either the telomeres or ribosomal DNA repeats to the 180-base-pair
© 2000 Macmillan Magazines Ltd
NATURE | VOL 408 | 14 DECEMBER 2000 | www.nature.com
(bp) centromeric repeats, is 115,409,949 bp (Table 1). Estimates of
the unsequenced centromeric and rDNA repeat regions measure
roughly 10 megabases (Mb), yielding a genome size of about
125 Mb, in the range of the 50–150 Mb haploid content estimated
by different methods18. In general, features such as gene density,
expression levels and repeat distribution are very consistent across
the five chromosomes (Fig. 1), and these are described in detail in
reports on individual chromosomes4–8 and in the analysis of
centromere, telomere and rDNA sequences.
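The relationship between the sequenced length and the estimated genome size can be checked with a few lines of arithmetic. The sketch below sums the top-arm and bottom-arm lengths listed in Table 1 and compares the result with the approximate 125-Mb genome size quoted in the text. The variable and function names are illustrative only, and the 125-Mb figure is an estimate rather than a measured value.

```python
# Arm lengths (bp) as listed in Table 1: (top arm, bottom arm) per chromosome.
ARM_LENGTHS_BP = {
    1: (14_449_213, 14_655_898),
    2: (3_607_091, 16_039_854),
    3: (13_590_268, 9_582_349),
    4: (3_052_108, 14_497_759),
    5: (11_132_192, 14_803_217),
}

# Approximate genome size quoted in the text (an estimate, not a measurement).
ESTIMATED_GENOME_BP = 125_000_000


def sequenced_total(arms):
    """Sum top and bottom arm lengths over all chromosomes."""
    return sum(top + bottom for top, bottom in arms.values())


sequenced_bp = sequenced_total(ARM_LENGTHS_BP)
unsequenced_bp = ESTIMATED_GENOME_BP - sequenced_bp
coverage = sequenced_bp / ESTIMATED_GENOME_BP

print(f"sequenced: {sequenced_bp:,} bp")
print(f"estimated unsequenced: {unsequenced_bp:,} bp")
print(f"fraction of estimated genome sequenced: {coverage:.3f}")
```

Under these inputs the arm lengths add up to the 115,409,949 bp reported as the total sequenced length, leaving roughly 10 Mb for the unsequenced centromeric and rDNA repeat regions.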
We used tRNAscan-SE 1.21 (ref. 19) and manual inspection to
identify 589 cytoplasmic transfer RNAs, 27 organelle-derived
tRNAs and 13 pseudogenes, more than in any other genome
sequenced to date. All 46 tRNA families needed to decode all
possible 61 codons were found, defining the completeness of the
functional set. Several highly amplified families of tRNAs were
found on the same strand6; excluding these, each amino acid is
decoded by 10–41 tRNAs.
The spliceosomal RNAs (U1, U2, U4, U5, U6) have all been
experimentally identified in Arabidopsis. The previously identified
sequences for all RNAs were found in the genome, except for U5
where the most similar counterpart was 92% identical. Between 10
and 16 copies of each small nuclear RNA (snRNA) were found
across all chromosomes, dispersed as singletons or in small groups.
The small nucleolar RNAs (snoRNAs) consist of two subfamilies,
the C/D box snoRNAs, which include 36 Arabidopsis genes, and the
H/ACA box snoRNAs, for which no members have been identified
in Arabidopsis. U3 is the most numerous of the C/D box snoRNAs,
with eight copies found in the genome. We identified forty-five
additional C/D box snoRNAs using software (www.rna.wustl.edu/snoRNAdb/)
that detects snoRNAs that guide ribose methylation of
ribosomal RNA.
A combination of algorithms, all optimized with parameters
based on known Arabidopsis gene structures, was used to define
gene structure. We used similarities to known protein and expressed
sequence tag (EST) sequence to refine gene models. Eighty per cent
of the gene structures predicted by the three centres involved were
completely consistent, 93% of ESTs matched gene models, and less
than 1% of ESTs matched predicted non-coding regions, indicating
[Figure 1 graphic. Scale bar: 100 kb. Chromosome lengths shown: Chr. 1, 29.1 Mb; Chr. 2, 19.6 Mb; Chr. 3, 23.2 Mb; Chr. 4, 17.5 Mb; Chr. 5, 26.0 Mb. Tracks for each chromosome: Genes, ESTs, TEs, MT/CP, RNAs. Pseudo-colour spectra run from high density to low density.]
Figure 1 Representation of the Arabidopsis chromosomes. Each chromosome is
represented as a coloured bar. Sequenced portions are red, telomeric and centromeric
regions are light blue, heterochromatic knobs are shown black and the rDNA repeat
regions are magenta. The unsequenced telomeres 2N and 4N are depicted with dashed
lines. Telomeres are not drawn to scale. Images of DAPI-stained chromosomes were
kindly supplied by P. Fransz. The frequency of features was given pseudo-colour
assignments, from red (high density) to deep blue (low density). Gene density (`Genes')
ranged from 38 per 100 kb to 1 gene per 100 kb; expressed sequence tag matches
(`ESTs') ranged from more than 200 per 100 kb to 1 per 100 kb. Transposable element
densities (`TEs') ranged from 33 per 100 kb to 1 per 100 kb. Mitochondrial and
chloroplast insertions (`MT/CP') were assigned black and green tick marks, respectively.
Transfer RNAs and small nucleolar RNAs (`RNAs') were assigned black and red tick
marks, respectively.
that most potential genes were identified. The sensitivity and
selectivity of the gene prediction software used in this report have
been comprehensively and independently assessed20.
The 25,498 genes predicted (Table 1) is the largest gene set
published to date: C. elegans1 has 19,099 genes and Drosophila2
13,601 genes. Arabidopsis and C. elegans have similar gene density,
whereas Drosophila has a lower gene density; Arabidopsis also has a
significantly greater extent of tandem gene duplications and
segmental duplications, which may account for its larger gene set.
The rDNA repeat regions on chromosomes 2 and 4 were not
sequenced because of their known repetitive structure and content.
The centromeric regions are not completely sequenced owing to
large blocks of monotonic repeats such as 5S rDNA and 180-bp
repeats. The sequence continues to be extended further into
centromeric and other regions of complex sequence.
Characterization of the coding regions
To assess the similarities and differences of the Arabidopsis gene
complement compared with other sequenced eukaryotic genomes,
we assigned functional categories to the complete set of Arabidopsis
genes. For chromosome 4 genes and the yeast genome, predicted
functions were previously manually assigned5,21. All other predicted
proteins were automatically assigned to these functional
categories22, assuming that conserved sequences reflect common
functional relationships.
The functions of 69% of the genes were classified according to
sequence similarity to proteins of known function in all organisms;
only 9% of the genes have been characterized experimentally
(Fig. 2a). Generally similar proportions of gene products were
predicted to be targeted to the secretory pathway and mitochondria
in Arabidopsis and yeast, and up to 14% of the gene products are
Table 1 Summary statistics of the Arabidopsis genome

(a) The DNA molecules

Feature                         Chr. 1       Chr. 2       Chr. 3       Chr. 4       Chr. 5       Σ
Length (bp)                     29,105,111   19,646,945   23,172,617   17,549,867   25,953,409   115,409,949
Top arm (bp)                    14,449,213   3,607,091    13,590,268   3,052,108    11,132,192
Bottom arm (bp)                 14,655,898   16,039,854   9,582,349    14,497,759   14,803,217
Base composition (%GC)
  Overall                       33.4         35.5         35.4         35.5         34.5
  Coding                        44.0         44.0         44.3         44.1         44.1
  Non-coding                    32.4         32.9         33.0         32.8         32.5
Number of genes                 6,543        4,036        5,220        3,825        5,874        25,498
Gene density (kb per gene)      4.0          4.9          4.5          4.6          4.4
Average gene length (bp)        2,078        1,949        1,925        2,138        1,974
Average peptide length (bp)     446          421          424          448          429
Exons
  Number                        35,482       19,631       26,570       20,073       31,226       132,982
  Total length (bp)             8,772,559    5,100,288    6,654,507    5,150,883    7,571,013    33,249,250
  Average per gene              5.4          4.9          5.1          5.2          5.3
  Average size (bp)             247          259          250          256          242
Introns
  Number                        28,939       15,595       21,350       16,248       25,352       107,484
  Total length (bp)             4,828,766    2,768,430    3,397,531    3,030,649    4,030,045    18,055,421
  Average size (bp)             168          177          159          186          159
Number of genes with ESTs (%)   60.8         56.9         59.8         61.4         61.4
Number of ESTs                  30,522       14,989       20,732       16,605       22,885       105,733

(b) The proteome

Classification/function                     Chr. 1          Chr. 2          Chr. 3          Chr. 4          Chr. 5          Σ
Total proteins                              6,543           4,036           5,220           3,825           5,874           25,498
With INTERPRO domains                       4,194 (64.1%)   1,205 (29.9%)   2,989 (57.8%)   1,545 (40.4%)   3,136 (53.4%)   13,069 (51.3%)
Genes containing at least one TM domain     2,334 (35.7%)   1,322 (32.8%)   1,615 (30.9%)   1,402 (36.7%)   1,940 (33.0%)   8,613 (33.8%)
Genes containing at least one SCOP domain   2,513 (38.4%)   1,424 (35.3%)   1,664 (31.9%)   1,304 (34.1%)   2,121 (36.1%)   9,026 (35.4%)
With putative signal peptides
  Secretory pathway                         1,242 (19.0%)   675 (16.7%)     877 (17.0%)     659 (17.2%)     1,014 (17.3%)   4,467 (17.6%)
    >0.95 specificity                       1,146 (17.5%)   632 (15.7%)     813 (15.7%)     632 (16.5%)     964 (16.4%)     4,167 (16.4%)
  Chloroplast                               866 (13.2%)     535 (13.2%)     754 (14.6%)     532 (13.9%)     887 (15.1%)     3,574 (14.0%)
    >0.95 specificity                       602 (9.2%)      290 (7.2%)      420 (8.1%)      298 (7.8%)      475 (8.1%)      2,085 (8.2%)
  Mitochondria                              901 (13.8%)     425 (10.5%)     554 (10.7%)     390 (10.2%)     627 (10.7%)     2,897 (11.4%)
    >0.95 specificity                       113 (1.7%)      49 (1.2%)       63 (1.2%)       59 (1.5%)       65 (1.1%)       349 (1.4%)
Functional classification
  Cellular metabolism                       1,188 (22.7%)   620 (23.3%)     745 (22.8%)     588 (22.9%)     868 (21.1%)     4,009 (22.5%)
  Transcription                             880 (16.8%)     474 (17.8%)     566 (17.3%)     335 (13.1%)     763 (18.6%)     3,018 (16.9%)
  Plant defence                             640 (12.2%)     276 (10.4%)     354 (10.8%)     295 (11.5%)     490 (11.9%)     2,055 (11.5%)
  Signalling                                573 (11.0%)     296 (11.1%)     356 (10.9%)     210 (8.2%)      420 (10.2%)     1,855 (10.4%)
  Growth                                    542 (10.4%)     263 (9.9%)      357 (10.9%)     448 (17.5%)     469 (11.4%)     2,079 (11.7%)
  Protein fate                              520 (9.9%)      273 (10.2%)     314 (9.6%)      264 (10.3%)     395 (9.6%)      1,766 (9.9%)
  Intracellular transport                   435 (8.3%)      214 (8.9%)      269 (8.2%)      220 (8.6%)      334 (8.1%)      1,472 (8.3%)
  Transport                                 236 (4.5%)      139 (5.2%)      155 (4.7%)      113 (4.4%)      206 (5.0%)      849 (4.8%)
  Protein synthesis                         216 (4.1%)      111 (4.2%)      148 (4.5%)      90 (3.5%)       165 (4.0%)      730 (4.1%)
  Total                                     5,230           2,666           3,264           2,563           4,110           17,833

The features of Arabidopsis chromosomes 1–5 and the complete nuclear genome are listed. Specialized searches used the following programs and databases: INTERPRO23; transmembrane (TM) domains
by ALOM2 (unpublished); SCOP domain database121; functional classification by the PEDANT analysis system22. Signal peptide prediction (secretory pathway, targeted to chloroplast or mitochondria) was
performed using TargetP122 and http://www.cbs.dtu.dk/services/TargetP/.
* Default value.
likely to be targeted to the chloroplast (Table 1). The significant
proportion of genes with predicted functions involved in metabolism, gene regulation and defence is consistent with previous
analyses5. Roughly 30% of the 25,498 predicted gene products
(Fig. 2a), comprising both plant-specific proteins and proteins with
similarity to genes of unknown function from other organisms,
could not be assigned to functional categories.
To compare the functional categories in more detail, we compared data from the complete genomes of Escherichia coli23,
Synechocystis sp.24, Saccharomyces cerevisiae21, C. elegans1 and
Drosophila2, and a non-redundant protein set of Homo sapiens,
with the Arabidopsis genome data (Fig. 2b), using a stringent
BLASTP threshold value of E < 10^-30. The proportion of
Arabidopsis proteins having related counterparts in eukaryotic
genomes varies by a factor of 2 to 3 depending on the functional
category. Only 8–23% of Arabidopsis proteins involved in transcription have related genes in other eukaryotic genomes, reflecting the
independent evolution of many plant transcription factors. In
contrast, 48–60% of genes involved in protein synthesis have
counterparts in the other eukaryotic genomes, reflecting highly
conserved gene functions. The relatively high proportion of
matches between Arabidopsis and bacterial proteins in the categories
`metabolism' and `energy' reflects both the acquisition of bacterial
genes from the ancestor of the plastid and high conservation of
sequences across all species. Finally, a comparison between unicellular and multicellular eukaryotes indicates that Arabidopsis
genes involved in cellular communication and signal transduction
have more counterparts in multicellular eukaryotes than in yeast,
reflecting the need for sets of genes for communication in multicellular organisms.
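The per-category comparison described above reduces to counting, for each functional subset, how many Arabidopsis proteins have a sufficiently strong match in a reference genome. The following is a minimal sketch of that calculation, assuming the best BLASTP E value per protein has already been extracted; the function name, the protein identifiers and the E values are hypothetical and serve only as an illustration. The cut-off is applied as E ≤ 10^-30, following the Figure 2 legend.

```python
def conserved_fraction(best_evalues, threshold=1e-30):
    """Fraction of proteins whose best hit meets the E-value threshold.

    best_evalues maps a protein identifier to the E value of its best
    BLASTP hit in the reference genome, or to None if there was no hit.
    """
    if not best_evalues:
        return 0.0
    matched = sum(
        1 for e in best_evalues.values() if e is not None and e <= threshold
    )
    return matched / len(best_evalues)


# Hypothetical subset of proteins from one functional category.
transcription_subset = {
    "AtTF_1": 1e-45,  # strong match, counted
    "AtTF_2": 1e-12,  # weak match, not counted
    "AtTF_3": None,   # no hit in the reference genome
    "AtTF_4": 3e-31,  # just below the threshold, counted
}

fraction = conserved_fraction(transcription_subset)
print(f"{fraction:.0%} of the subset has a counterpart")
```

Running the same function once per category and per reference genome would give the kind of values plotted in Fig. 2b.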
Pronounced redundancy in the Arabidopsis genome is evident in
segmental duplications and tandem arrays, and many other genes
with high levels of sequence conservation are also scattered over the
genome. Sequence similarity exceeding a BLASTP value of E < 10^-20
and extending over at least 80% of the protein length was used as
the criterion to identify protein families (Table 2). A total of 11,601
protein types were identified. Thirty-five per cent of the predicted
proteins are unique in the genome, and the proportion of proteins
belonging to families of more than five members is substantially
higher in Arabidopsis (37.4%) than in Drosophila (12.1%) or
[Figure 2 graphic. Panel a, functional categories: Cell growth, cell division and DNA synthesis; Metabolism; Transcription; Cell rescue, defence, cell death, ageing; Cellular communication/signal transduction; Protein destination; Intracellular transport; Unclassified; Cellular biogenesis; Transport facilitation; Energy; Protein synthesis; Ionic homeostasis. Panel b, series: E. coli, Synechocystis, S. cerevisiae, C. elegans, Drosophila, Human; y axis from 0 to 0.7; x axis, functional categories: Metabolism; Energy; Cell growth, cell division and DNA synthesis; Transcription; Protein synthesis; Protein destination; Transport facilitation; Intracellular transport; Cellular biogenesis; Cellular communication/signal transduction; Cell rescue, defence, cell death, ageing; Ionic homeostasis; Classification not yet clear-cut; Unclassified.]
Figure 2 Functional analysis of Arabidopsis genes. a, Proportion of predicted Arabidopsis
genes in different functional categories. b, Comparison of functional categories between
organisms. Subsets of the Arabidopsis proteome containing all proteins that fall into a
common functional class were assembled. Each subset was searched against the
complete set of translations from Escherichia coli, Synechocystis sp. PCC6803,
Saccharomyces cerevisiae, Drosophila, C. elegans and a Homo sapiens non-redundant
protein database. The percentage of Arabidopsis proteins in a particular subset that had a
BLASTP match with E ≤ 10^-30 to the respective reference genome is shown. This reflects
the measure of sequence conservation of proteins within this particular functional
category between Arabidopsis and the respective reference genome. y axis, 0.1 = 10%.
Table 2 Proportion of genes in different organisms present as either singletons or in paralogous families

                  No. of singletons and               Gene families containing
Organism          distinct gene families   Unique     2 members   3 members   4 members   5 members   >5 members
H. influenzae     1,587                    88.8%      6.8%        2.3%        0.7%        0.0%        1.4%
S. cerevisiae     5,105                    71.4%      13.8%       3.5%        2.2%        0.7%        8.4%
D. melanogaster   10,736                   72.5%      8.5%        3.4%        1.9%        1.6%        12.1%
C. elegans        14,177                   55.2%      12.0%       4.5%        2.7%        1.6%        24.0%
Arabidopsis       11,601                   35.0%      12.5%       7.0%        4.4%        3.6%        37.4%

The number of genes in the genomes of Haemophilus influenzae, S. cerevisiae, Drosophila, C. elegans and Arabidopsis that are present either as singletons or in gene families with two or more members are
listed. To be grouped in a gene family, two genes had to show similarity exceeding a BLASTP value E < 10^-20 and a FASTA alignment over at least 80% of the protein length. In column 1, the number of genes
that are unique plus the number of gene families are listed. Columns 2 to 6 give the percentage of genes present as singletons or in gene families of n members.
C. elegans (24.0%). The absolute number of Arabidopsis gene
families and singletons (types) is in the same range as the other
multicellular eukaryotes, indicating that a proteome of 11,000–15,000
types is sufficient for a wide diversity of multicellular life.
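The counting of protein 'types' (singletons plus distinct families) can be illustrated with a small clustering sketch. The text gives only the pairwise criteria (E < 10^-20 and alignment over at least 80% of the protein length); the single-linkage grouping used below, in which any qualifying pair joins two proteins into the same family, is an assumption made for illustration, as are the function name and the toy hit list.

```python
def build_families(proteins, hits, max_evalue=1e-20, min_coverage=0.8):
    """Group proteins into families by single linkage over qualifying hits.

    hits is an iterable of (protein_a, protein_b, evalue, coverage), where
    coverage is the aligned fraction of the protein length (0 to 1).
    """
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, evalue, coverage in hits:
        if evalue < max_evalue and coverage >= min_coverage:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    families = {}
    for p in proteins:
        families.setdefault(find(p), []).append(p)
    return list(families.values())


# Hypothetical proteins and pairwise hits.
proteins = ["p1", "p2", "p3", "p4", "p5"]
hits = [
    ("p1", "p2", 1e-50, 0.95),  # qualifies
    ("p2", "p3", 1e-25, 0.85),  # qualifies, so p3 joins p1 and p2
    ("p4", "p5", 1e-40, 0.50),  # significant but covers too little length
    ("p1", "p4", 1e-10, 0.90),  # long alignment but E value too weak
]

families = build_families(proteins, hits)
n_types = len(families)
unique_fraction = sum(1 for f in families if len(f) == 1) / len(proteins)
print(n_types, f"{unique_fraction:.0%}")
```

In this toy case there are three types (one family of three and two singletons), and 40% of the proteins are unique, which corresponds to the quantities reported in columns 1 and 2 of Table 2.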
The proportion of gene families with more than two members is
considerably more pronounced in Arabidopsis than in other eukaryotes (Fig. 3). As segmental duplication is responsible for 6,303 gene
duplications (see below), the extent of tandem gene duplications
accounts for a significant proportion of the increased family size.
These features of the Arabidopsis, and presumably other plant
genomes, may indicate more relaxed constraints on genome size
in plants, or a more prominent role of unequal crossing over to
generate new gene copies.
Conserved protein domains revealed more informative differences through INTERPRO25 analysis of the predicted gene products
from Arabidopsis, S. cerevisiae, C. elegans and Drosophila. Statistically over-represented domains, and those that are absent from the
Arabidopsis genome, indicate domains that may have been gained or
lost during the evolution of plants (Supplementary Information
Table 1). Proteins containing the Pro-Pro-Arg repeat, which is
involved in RNA stabilization and RNA processing, are over-represented as compared to yeast, fly and worm; 400 proteins
containing this signature were detected in Arabidopsis compared
with only 10 in total in yeast, Drosophila and C. elegans. Protein
kinases and associated domains, 169 proteins containing a disease
resistance protein signature, and the Toll/IL-1R (TIR) domain, a
component of pathogen recognition molecules26, are also relatively
abundant. This suggests that pathways transducing signals in
response to pathogens and diverse environmental cues are more
abundant in plants than in other organisms.
The RING zinc finger domain is relatively over-represented in
Arabidopsis compared with yeast, Drosophila and C. elegans, whereas
the F-box domain is over-represented as compared with yeast and
Drosophila only. These domains are involved in targeting proteins to
the proteasome27 and ubiquitinylation28 pathways of protein degradation, respectively. In plants many processes such as hormone and
defence responses, light signalling, and circadian rhythms and
pattern formation use F-box function to direct negative regulators
[Figure 3 graphic: histogram. y axis, number of arrays (0 to 1,200); x axis, number of tandemly repeated genes per gene array. Bar values: 2 genes, 1,052 arrays; 3, 249; 4, 108; 5, 57; 6, 36; 7, 20; 8, 18; 9, 17; 10, 15; 11–15, 6; 16–20, 2; 21–23, 2.]
Figure 3 Distribution of tandemly repeated gene arrays in the Arabidopsis genome.
Tandemly repeated gene arrays were identified using the BLASTP program with a
threshold of E < 10^-20. One unrelated gene among cluster members was tolerated. The
histogram gives the number of clusters in the genome containing 2 to n similar gene units
in tandem.
to the ubiquitin degradation pathway. This mode of regulation
appears to be more prevalent in plants and may account for a higher
representation of the F box than in Drosophila and for the over-representation of the ubiquitin domain in the Arabidopsis genome.
RING finger domain proteins in general have a role in ubiquitin
protein ligases, indicating that proteasome-mediated degradation is
a more widespread mode of regulation in plants than in other
kingdoms.
Most functions identified by protein domains are conserved in
similar proportions in the Arabidopsis, S. cerevisiae, Drosophila and
C. elegans genomes, pointing to many ubiquitous eukaryotic pathways. These are illustrated by comparing the list of human disease
genes29 to the complete Arabidopsis gene set using BLASTP. Out
of 289 human disease genes, 139 (48%) had hits in Arabidopsis
using a BLASTP threshold E < 10^-10. Sixty-nine (24%) exceeded an
E < 10^-40 threshold, and 26 (9.3%) had scores better than E < 10^-100
(Table 3). There are at least 17 human disease genes more similar to
Arabidopsis genes than yeast, Drosophila or C. elegans genes
(Table 3).
This analysis shows that, although numerous families of proteins
are shared between all eukaryotes, plants contain roughly 150
unique protein families. These include transcription factors, structural proteins, enzymes and proteins of unknown function. Members of the families of genes common to all eukaryotes have
undergone substantial increases or decreases in their size in
Arabidopsis. Finally, the transfer of a relatively small number of
cyanobacteria-related genes from a putative endosymbiotic ancestor of the plastid has added to the diversity of protein structures
found in plants.
Genome organization and duplication
The Arabidopsis genome sequence provides a complete view of
chromosomal organization and clues to its evolutionary history.
Gene families organized in tandem arrays of two or more units have
been described in C. elegans1 and Drosophila2. Analysis of the
Arabidopsis genome revealed 1,528 tandem arrays containing
4,140 individual genes, with arrays ranging up to 23 adjacent
members (Fig. 3). Thus 17% of all genes of Arabidopsis are arranged
in tandem arrays.
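The detection of tandem arrays can be sketched as a scan along the gene order of a chromosome. The sketch below assumes that pairwise similarity has already been decided (for example by the BLASTP threshold of E < 10^-20 given in the Figure 3 legend) and interprets the tolerance of 'one unrelated gene among cluster members' as allowing at most one intervening gene between two linked members. That interpretation, the function name and the toy gene names are assumptions for illustration only.

```python
from collections import Counter


def tandem_arrays(gene_order, similar, max_gap=1):
    """Return arrays of similar genes that lie adjacent on a chromosome.

    gene_order lists genes in chromosomal order; similar(a, b) reports
    whether two genes pass the similarity threshold; max_gap is the number
    of unrelated genes tolerated between two linked array members.
    """
    n = len(gene_order)
    parent = list(range(n))

    def find(i):
        # Root lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        # Compare gene i with the next max_gap + 1 genes downstream.
        for j in range(i + 1, min(n, i + max_gap + 2)):
            if similar(gene_order[i], gene_order[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(gene_order[i])
    return [members for members in groups.values() if len(members) >= 2]


# Toy chromosome: genes sharing a first letter are treated as similar.
gene_order = ["a1", "a2", "x", "a3", "b1", "y", "z", "b2"]
arrays = tandem_arrays(gene_order, lambda g, h: g[0] == h[0])
size_histogram = Counter(len(a) for a in arrays)
print(arrays, dict(size_histogram))
```

Here a1, a2 and a3 form one array because only one unrelated gene separates a2 from a3, whereas b1 and b2 are not linked because two genes lie between them. A histogram of array sizes built in this way is the quantity plotted in Fig. 3.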
Large segmental duplications were identified either by directly
aligning chromosomal sequences or by aligning proteins and
searching for tracts of conserved gene order. All five chromosomes
were aligned to each other in both orientations using MUMmer30,
and the results were filtered to identify all segments at least 1,000 bp
in length with at least 50% identity (Supplementary Information
Fig. 1). These revealed 24 large duplicated segments of 100 kb or
larger, comprising 65.6 Mb or 58% of the genome. The only
duplicated segment in the centromeric regions was a 375-kb
segment on chromosome 4. Many duplications appear to have
undergone further shuffling, such as local inversions after the
duplication event.
We used TBLASTX5 to identify collinear clusters of genes residing
in large duplicated chromosomal segments. The duplicated regions
encompass 67.9 Mb, 60% of the genome, slightly more than was
© 2000 Macmillan Magazines Ltd
NATURE | VOL 408 | 14 DECEMBER 2000 | www.nature.com
articles
found in the DNA-based alignment (Fig. 4), and these data extend
earlier findings4,5,31. The extent of sequence conservation of the
duplicated genes varies greatly, with 6,303 (37%) of the 17,193 genes
in the segments classified as highly conserved (E < 10^-30) and a
further 1,705 (10%) showing less significant similarity up to
E < 10^-5. The proportion of homologous genes in each duplicated
segment also varies widely, between 20% and 47% for the highly
conserved class of genes. In many cases, the number of copies of a
gene and its counterpart differ (for example, one copy on one
chromosome and multiple copies on the other; see Supplementary
Information Fig. 2); this could be due to either tandem duplication
or gene loss after the segmental duplication.
What does the duplication in the Arabidopsis genome tell us
about the ancestry of the species? Polyploidy occurs widely in plants
and is proposed to be a key factor in plant evolution32. As the
majority of the Arabidopsis genome is represented in duplicated
(but not triplicated) segments, it appears most likely that
Arabidopsis, like maize, had a tetraploid ancestor33. A comparative
sequence analysis of Arabidopsis and tomato estimated that a
duplication occurred ~112 Myr ago to form a tetraploid34. The
degrees of conservation of the duplicated segments might be due to
divergence from an ancestral autotetraploid form, or might reflect
differences present in an allotetraploid ancestor. It is also possible,
however, that several independent segmental duplication events
took place instead of tetraploid formation and stabilization.
The diploid genetics of Arabidopsis and the extensive divergence
of the duplicated segments have masked its evolutionary history.
The determination of Arabidopsis gene functions must therefore be
pursued with the potential for functional redundancy taken into
account. The long period of time over which genome stabilization
has occurred has, however, provided ample opportunity for the
divergence of the functions of genes that arose from duplications.
Comparative analysis of Arabidopsis accessions
Comparing the multiple accessions of Arabidopsis allows us to
identify commonly occurring changes in genome microstructure.
It also enables the development of new molecular markers for
genetic mapping. High rates of polymorphism between
Arabidopsis accessions, including both DNA sequence and copy
number of tandem arrays, are prevalent at loci involved in disease
resistance35. This has been observed for other plant species, and such
loci are thought to serve as templates for illegitimate recombination
[Figure 4 graphic: scale markers at 5 Mb and 10 Mb.]
Figure 4 Segmentally duplicated regions in the Arabidopsis genome. Individual
chromosomes are depicted as horizontal grey bars (with chromosome 1 at the top),
centromeres are marked black. Coloured bands connect corresponding duplicated
to create new pathogen response speci®cities36. We carried out a
comparative analysis between 82 Mb of the genome sequence of
Arabidopsis accession Columbia (Col-0) and 92.1 Mb of nonredundant low-pass (twofold redundant) sequence data of the
genomic DNA of accession Landsberg erecta (Ler). We identi®ed
two classes of differences between the sequences: single nucleotide
polymorphisms (SNPs), and insertion±deletions (InDels). As we
used high stringency criteria, our results represent a minimum
estimate of numbers of polymorphisms between the two genomes.
In total, we detected 25,274 SNPs, representing an average density
of 1 SNP per 3.3 kb. Transitions (A/T±G/C) represented 52.1% of
the SNPs, and transversions accounted for the remainder: 17.3% for
A/T±T/A, 22.7% for A/T±C/G and 7.9% for C/G±G/C. In total, we
detected 14,570 InDels at an average spacing of 6.1 kb. They ranged
from 2 bp to over 38 kilobase-pairs, although 95% were smaller than
50 bp. Only 10% of the InDels were co-located with simple sequence
repeats identi®ed with the program Sputnik. An analysis of 416
relative insertions greater than 250 bp in Col-0 showed that 30%
matched transposon-related proteins, indicating that a substantial
proportion of the large InDels are the result of transposon insertion
or excision. Many InDels contained entire active genes not related to
transposons. Half of such genes absent from corresponding positions in the Col-0 sequence were found elsewhere on the genome of
Ler. This indicates that genes have been transferred to new genomic
locations.
Gene structures are often affected by small InDels and SNPs. The
positions of SNPs and InDels were mapped relative to 87,427 exons
and 70,379 introns annotated in the Col-0 sequence. SNPs were
found in exons, introns and intergenic regions at frequencies of 1
SNP per 3.1, 2.2 and 3.5 kb, respectively. The frequencies for InDels
were 1 per 9.3, 3.1 and 4.3 kb, respectively. Polymorphisms were
detected in 7% of exons, and alter the spliced sequences of 25% of
the predicted genes. For InDels in exons, insertion lengths divisible
by three are prevalent for small insertions (, 50 bp), indicating that
many proteins can withstand small insertions or deletions of amino
acids without loss of function.
Our analyses show that sequence polymorphisms between accessions of Arabidopsis are common, and that they occur in both
coding and non-coding regions. We found evidence for the relocation of genes in the genome, and for changes in the complement of
transposable elements. The data presented here are available at
http://www.arabidopsis.org/cereon/.
15 Mb
20 Mb
25 Mb
30 Mb
15 Mb
20 Mb
25 Mb
30 Mb
segments. Similarity between the rDNA repeats are excluded. Duplicated segments in
reversed orientation are connected with twisted coloured bands. The scale is
in megabases.
© 2000 Macmillan Magazines Ltd
801
articles
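The summary statistics reported for the Col-0 and Ler comparison can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: the function and variable names are ours and are not part of the original analysis. It uses the rounded 82 Mb of Col-0 sequence as the denominator, which is an assumption, because the exact aligned length behind the reported densities is not stated. With that denominator the SNP spacing comes out near 3.2 kb, close to the reported 3.3 kb. The last function expresses the reading-frame argument: an InDel whose length is divisible by three adds or removes whole codons.

```python
# Figures quoted in the text for the Col-0 / Ler comparison.
COL0_COMPARED_KB = 82_000   # assumption: the rounded 82 Mb is used as denominator
SNP_COUNT = 25_274

# Reported share of SNPs in each substitution class (percent).
TRANSITION_PERCENT = 52.1                      # A/T-G/C
TRANSVERSION_PERCENT = {
    "A/T-T/A": 17.3,
    "A/T-C/G": 22.7,
    "C/G-G/C": 7.9,
}


def kb_per_event(region_kb: float, events: int) -> float:
    """Average spacing between polymorphisms, in kb."""
    return region_kb / events


def ts_tv_ratio(transitions: float, transversions: dict[str, float]) -> float:
    """Ratio of transitions to the summed transversion classes."""
    return transitions / sum(transversions.values())


def preserves_reading_frame(indel_length_bp: int) -> bool:
    """True if an exonic InDel of this length leaves the codon frame intact."""
    return indel_length_bp % 3 == 0


snp_spacing_kb = kb_per_event(COL0_COMPARED_KB, SNP_COUNT)
total_percent = TRANSITION_PERCENT + sum(TRANSVERSION_PERCENT.values())
ratio = ts_tv_ratio(TRANSITION_PERCENT, TRANSVERSION_PERCENT)

print(f"SNP spacing: {snp_spacing_kb:.2f} kb")
print(f"Substitution classes sum to {total_percent:.1f}%")
print(f"Transition/transversion ratio: {ratio:.2f}")
print(preserves_reading_frame(6), preserves_reading_frame(4))
```

The four substitution classes account for all SNPs, and transitions slightly outnumber transversions even though only one of the four classes is a transition.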
Comparison of Arabidopsis and other plant genera
Comparative genetic mapping can reveal extensive conservation of
genome organization between closely related species37,38. The comparative analysis of plant genome microstructure reveals much
about the evolution of plant genomes and provides unprecedented
opportunities for crop improvement by establishing the detailed
structures of, and relationships between, the genomes of crops and
Arabidopsis.
The lineages leading to Arabidopsis and Capsella rubella (shepherd's
purse) diverged between 6.2 and 9.8 Myr ago, and the gene content
and genome organization of C. rubella are very similar to those of
Arabidopsis39, including the large-scale duplications. Alignment of
Arabidopsis complementary DNA and EST sequences with genomic
DNA sequences of Arabidopsis and C. rubella showed conservation
of exon length and intron positions. Coding sequences predicted
from these alignments differed from the annotated Arabidopsis gene
sequences in two out of ®ve cases.
The ancestral lineages of Arabidopsis and the Brassica (cabbage
and mustard) genera diverged 12.2–19.2 Myr ago40. Brassica genes
show a high level of nucleotide conservation with their Arabidopsis
orthologues, typically more than 85% in coding regions40. The
structure of Brassica genomes resembles that of Arabidopsis, but
with extensive triplication and rearrangement41, and extensive
divergence of microstructure (Supplementary Information Fig. 3).
The divergence between the genomes of Arabidopsis and Brassica
oleracea is in striking contrast to that observed between Arabidopsis
and C. rubella, although the time since divergence is only twofold
greater. This accelerated rate of change in triplicated segments of the
genome of B. oleracea indicates that polyploidy fosters rapid
chromosomal evolution.
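The statement that the Brassica divergence is only about twofold older than the Capsella divergence follows directly from the two date ranges given above. A minimal Python sketch of that comparison, with names chosen here for illustration:

```python
# Divergence-time ranges from the text, in millions of years (Myr).
CAPSELLA_DIVERGENCE_MYR = (6.2, 9.8)
BRASSICA_DIVERGENCE_MYR = (12.2, 19.2)


def fold_difference(older: tuple[float, float],
                    younger: tuple[float, float]) -> tuple[float, float]:
    """Ratio of the lower bounds and of the upper bounds of two ranges."""
    return (older[0] / younger[0], older[1] / younger[1])


lower_ratio, upper_ratio = fold_difference(BRASSICA_DIVERGENCE_MYR,
                                           CAPSELLA_DIVERGENCE_MYR)
print(f"lower bounds: {lower_ratio:.2f}-fold, upper bounds: {upper_ratio:.2f}-fold")
```

Both ratios are just under two, so the much greater structural divergence seen in B. oleracea cannot be attributed to elapsed time alone.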
The Arabidopsis and tomato lineages diverged roughly 150 Myr
ago, and comparative sequence analysis of segments of their
genomes has revealed complex relationships34. Four regions of the
Arabidopsis genome are related to each other and to one region in
the tomato genome, suggesting that two rounds of duplication may
have occurred in the Arabidopsis lineage. The extensive duplication
described here supports the proposal that the more recent of these
duplications, estimated to have occurred ~112 Myr ago, was the
result of a polyploidization event. The lineages of Arabidopsis and
rice diverged ~200 Myr ago42. Three regions of the genome of
Arabidopsis were related to each other and to one region in the rice
genome, providing further evidence for multiple duplication
events43,44.
The frequent occurrence of tandem gene duplications and the
apparent deletion of single genes, or small groups of adjacent genes,
from duplicated regions suggests that unequal crossing over may be
a key mechanism affecting the evolution of plant genome microstructure. However, the segmental inversions and gene translocations in the genomes of both rice and B. oleracea that are not found
in Arabidopsis indicate that additional mechanisms may be
involved40.
Integration of the three genomes in the plant cell
The three genomes in the plant cell, those of the nucleus, the
plastids (chloroplasts) and the mitochondria, differ markedly in
gene number, organization and stability. Plastid genes are densely
packed in an order highly conserved in all plants45, whereas
mitochondrial genes46 are widely dispersed and subjected to extensive recombination.
Organellar genomes are remnants of independent organisms:
plastids are derived from the cyanobacterial lineage and mitochondria from the α-Proteobacteria. The remaining genes in plastids
include those that encode subunits of the photosystem and the
electron transport chain, whereas the genes in mitochondria encode
essential subunits of the respiratory chain. Both organelles contain
sets of specific membrane proteins that, together with housekeeping
proteins, account for 61% of the genes in the chloroplast and 88%
in the mitochondrion (Table 4). The remainder are involved in
transcription and translation.
Table 3 Arabidopsis genes with similarities to human disease genes
Human disease gene | E value | Gene code | Arabidopsis hit
Darier–White, SERCA | 5.9 × 10^-272 | T27I1_16 | Putative calcium ATPase
Xeroderma Pigmentosum, D-XPD | 7.2 × 10^-228 | F15K9_19 | Putative DNA repair protein
Xeroderma pigment, B-ERCC3 | 9.6 × 10^-214 | AT5g41360 | DNA excision repair cross-complementing protein
Hyperinsulinism, ABCC8 | 7.1 × 10^-188 | F20D22_11 | Multidrug resistance protein
Renal tubul. acidosis, ATP6B1 | 1.0 × 10^-182 | AT4g38510 | Probable H+-transporting ATPase
HDL deficiency 1, ABCA1 | 2.4 × 10^-181 | At2g41700 | Putative ABC transporter
Wilson, ATP7B | 7.6 × 10^-181 | AT5g44790 | ATP-dependent copper transporter
Immunodeficiency, DNA Ligase 1 | 8.2 × 10^-172 | T6D22_10 | DNA ligase
Stargardt's, ABCA4 | 2.8 × 10^-168 | At2g41700 | Putative ABC transporter
Ataxia telangiectasia, ATM | 3.1 × 10^-168 | AT3g48190 | Ataxia telangiectasia mutated protein AtATM
Niemann–Pick, NPC1 | 1.2 × 10^-166 | F7F22_1 | Niemann–Pick C disease protein-like protein
Menkes, ATP7A | 1.1 × 10^-153 | F2K11_17 | ATP-dependent copper transporter, putative
HNPCC*, MLH1 | 1.5 × 10^-150 | AT4g09140 | MLH1 protein
Deafness, hereditary, MYO15 | 2.7 × 10^-150 | At2g31900 | Putative unconventional myosin
Fam, cardiac myopathy, MYH7 | 6.5 × 10^-147 | T1G11_14 | Putative myosin heavy chain
Xeroderma Pigmentosum, F-XPF | 1.4 × 10^-146 | AT5g41150 | Repair endonuclease (gb|AAF01274.1)
G6PD deficiency, G6PD | 7.6 × 10^-137 | AT5g40760 | Glucose-6-phosphate dehydrogenase
Cystic fibrosis, ABCC7 | 2.3 × 10^-135 | AT3g62700 | ABC transporter-like protein
Glycerol kinase defic, GK | 7.9 × 10^-135 | T21F11_21 | Putative glycerol kinase
HNPCC, MSH3 | 6.6 × 10^-134 | AT4g25540 | Putative DNA mismatch repair protein
HNPCC, PMS2 | 5.1 × 10^-128 | AT4g02460 | No title
Zellweger, PEX1 | 4.1 × 10^-125 | AT5g08470 | Putative protein
HNPCC, MSH6 | 9.6 × 10^-122 | AT4g02070 | G/T DNA mismatch repair enzyme
Bloom, BLM | 4.4 × 10^-109 | T19D16_15 | DNA helicase isolog
Finnish amyloidosis, GSN | 2.2 × 10^-107 | AT5g57320 | Villin
Chediak–Higashi, CHS1 | 5.8 × 10^-99 | F10O3_11 | Putative transport protein
Xeroderma Pigmentosum, G-XPG | 7.1 × 10^-89 | AT3g28030 | Hypothetical protein
Bare lymphocyte, ABCB3 | 1.3 × 10^-84 | AT5g39040 | ABC transporter-like protein
Citrullinemia, type I, ASS | 3.2 × 10^-83 | AT4g24830 | Argininosuccinate synthase-like protein
Coffin–Lowry, RPS6KA3 | 5.2 × 10^-81 | AT3g08720 | Putative ribosomal-protein S6 kinase (ATPK19)
Keratoderma, KRT9 | 8.5 × 10^-81 | AT3g17050 | Unknown protein
Myotonic dystrophy, DM1 | 1.4 × 10^-76 | At2g20470 | Putative protein kinase
Bartter's, SLC12A1 | 1.6 × 10^-75 | F26G16_9 | Cation-chloride co-transporter, putative
Dents, CLCN5 | 3.3 × 10^-74 | AT5g26240 | CLC-d chloride channel protein
Diaphanous 1, DAPH1 | 1.9 × 10^-73 | 68069_m00158 | Hypothetical protein
AKT2 | 6.9 × 10^-72 | AT3g08730 | Putative ribosomal-protein S6 kinase (ATPK6)
The number of proteins encoded in the nucleus likely to be found
in organelles was predicted using default settings on TargetP
(Table 1). Many nuclear gene products that are targeted to either
(or both) organelles were originally encoded in the organelle
genomes and were transferred to the nuclear genome during
evolutionary history. A large number also appear to be of eukaryotic
origin, with functions such as protein import components, which
were probably not required by the free-living ancestors of the
endosymbionts.
To identify nuclear genes of possible organellar ancestry, we
compared all predicted Arabidopsis proteins to all proteins from
completed genomes including those from plastids and mitochondria (Supplementary Information Table 2). This search identified
proteins encoded by the Arabidopsis nuclear genome that are most
similar to proteins encoded by other species' organelle genomes (14
mitochondrial and 44 plastid). These represent organelle-to-nuclear gene transfers that have occurred sometime after the
divergence of the organelle-containing lineages47. There is a great
excess of nuclear encoded proteins most similar to proteins from the
cyanobacterium Synechocystis (Supplementary Information Fig. 4;
806 Arabidopsis predicted proteins matching 404 different Synechocystis proteins, providing further evidence of a genome duplication). These 806 Arabidopsis predicted proteins, and many others of
greatly diverse function, are possibly of plastid descent. Through
searches against proteins from other cyanobacteria (with incompletely sequenced genomes), we identified 69 additional genes of
possibly plastid descent. Only 25% of these putatively plastid-derived proteins displayed a target peptide predicted by TargetP,
indicating potential cytoplasmic functions for most of these genes.
The difference between predicted plastid-targeted and predicted
plastid-derived genes probably reflects overestimation by ab initio
targeting prediction methods and their lack of
resolution with respect to destination organelles, the possible
extensive divergence of some endosymbiont-derived genes in the
nuclear genome, the co-opting of nuclear genes for targeting to
organelles, and cytoplasmic functions for cyanobacteria-derived
proteins. Clearly, more refined tools and extensive experimentation
are required to catalogue plastid proteins.
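The logic of the search described above, in which each nuclear-encoded protein is assigned to the source of its most similar match, can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the analysis: the record layout, the source labels, the identifiers and the use of bit score as the ranking criterion are all hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    query: str        # Arabidopsis nuclear-encoded protein (hypothetical ID)
    subject: str      # matched protein in another genome (hypothetical ID)
    source: str       # label for the subject genome, e.g. "plastid"
    bit_score: float  # assumed ranking criterion for "most similar"


def best_hits(hits: Iterable[Hit]) -> dict[str, Hit]:
    """Keep, for each query protein, the single highest-scoring hit."""
    best: dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.query)
        if current is None or hit.bit_score > current.bit_score:
            best[hit.query] = hit
    return best


def tally_sources(best: dict[str, Hit]) -> Counter:
    """Count how many query proteins are most similar to each source."""
    return Counter(hit.source for hit in best.values())


# Toy input; identifiers and scores are invented for illustration.
example_hits = [
    Hit("protA", "syn_0001", "cyanobacterium", 310.0),
    Hit("protA", "yeast_0042", "eukaryote", 120.0),
    Hit("protB", "cp_0010", "plastid", 450.0),
    Hit("protB", "syn_0100", "cyanobacterium", 400.0),
    Hit("protC", "worm_0007", "eukaryote", 95.0),
]
counts = tally_sources(best_hits(example_hits))
print(dict(counts))
```

A best-hit criterion of this kind suggests ancestry but does not prove it, which is one reason the plastid-derived and plastid-targeted sets need not coincide.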
The transfer of genes between genomes still continues (Supplementary Information Table 3). Plastid DNA insertions in the
nucleus (17 insertions totalling 11 kb) contain full-length genes
encoding proteins or tRNAs, fragments of genes and an intron as
well as intergenic regions. Subsequent reshuffling in the nucleus is
illustrated by the atpH gene, which was originally transferred
completely, but is now in two pieces separated by 2 kb. The 13
small mitochondrial DNA insertions total 7 kb in addition to the
large insertion close to the centromere of chromosome 2 (ref. 3).
The high level of recombination in the mitochondrial genome may
account for these events.
Table 4 General features of genes encoded by the three genomes in Arabidopsis
Feature | Nucleus/cytoplasm | Plastid | Mitochondria
Genome size | 125 Mb | 154 kb | 367 kb
Genome equivalent/cell | 2 | 560 | 26
Duplication | 60% | 17% | 10%
Number of protein genes | 25,498 | 79 | 58
Gene order | Variable, but syntenic | Conserved | Variable
Density (kb per protein gene) | 4.5 | 1.2 | 6.25
Average coding length | 1,900 nt | 900 nt | 860 nt
Genes with introns | 79% | 18.4% | 12%
Genes/pseudogenes | 1/0.03 | 1/0 | 1/0.2–0.5
Transposons (% of total genome size) | 14% | 0% | 4%
Transposable elements
Transposons, which were originally identified in maize by Barbara
McClintock, have been found in all eukaryotes and prokaryotes. A
subset of transposons replicate through an RNA intermediate (class
I), whereas others move directly through a DNA form (class II).
Transposons are further classified by similarity either between their
mobility genes or between their terminal and/or internal motifs, as
well as by the size and sequence of their target site. Internally deleted
elements can often be mobilized in trans by fully functional
elements.
Transposons in Arabidopsis account for at least 10% of the
genome, or about one-fifth of the intergenic DNA. The
Arabidopsis genome has a wealth of class I (2,109) and II
(2,203) elements, including several new groups (1,209 elements;
Supplementary Information Table 4). Mobile histories for many
elements were obtained by identifying regions of the genome with
significant similarity to `empty' target sites (RESites) thus providing
high-resolution information concerning the termini and target site
duplications48,49. These regions were readily detected because of the
propensity of transposons to integrate into repeats and because of
duplications in the genome sequence. In several cases, genes appear
to have been included as `passengers' in transposable units48. In
some cases, shared sequence similarity, coding capacity and RESites
attest to recent activity of transposable elements in the Arabidopsis
genome. Only about 4% of the complete elements identified
correspond to an EST, however, suggesting that most are not
transcribed.
Transposable elements found in many other plant genomes are
well represented in Arabidopsis, including copia- and gypsy-like long
terminal repeat (LTR) retrotransposons, long interspersed nuclear
elements (LINEs), short interspersed nuclear elements (SINEs),
hobo/Activator/Tam3 (hAT)-like elements, CACTA-like elements
and miniature inverted-repeat transposable elements (MITEs).
Although usually small in size, some larger Tourist-like MITEs
contain open reading frames (ORFs) with similarity to the transposases of bacterial insertion sequences48. Basho and many Mutator-like elements (MULEs), first discovered in the Arabidopsis sequence,
represent structurally unique transposons48–50. Basho elements have
a target site preference for mononucleotide `A' and wide distribution
among plants48,51. MULEs exhibit a high level of sequence diversity
and members of most groups lack long terminal inverted repeats
(TIRs). Phylogenetic analysis of the Arabidopsis MURA-like transposases suggests that TIR-containing MULEs are more closely
related to one another than to MULEs lacking TIRs49,52.
For many plants with large genomes, class I retrotransposons
contribute most of the nucleotide content53. In the small Arabidopsis
genome, class I elements are less abundant and primarily occupy the
centromere. In contrast, Basho elements and class II transposons
such as MITEs and MULEs predominate on the periphery of
pericentromeric domains (Fig. 5). Among class II transposons, MULEs
and CACTA elements are clustered near centromeres and heterochromatic knobs, whereas MITEs and hAT elements have a less
pronounced bias. The distribution pattern of transposable elements
observed in Arabidopsis may reflect different types of pericentromeric heterochromatin regions and may be similar to those found
in animals.
Numerous centromeric satellite repeats are located between
each chromosome arm and have not yet been sequenced, but
are represented in part by unanchored BAC contigs (R. Martienssen
and M. Marra, unpublished data). End sequence suggests that these
domains contain many more class I than class II elements, consistent with the distribution reported here (K. Lemcke and R.
Martienssen, unpublished data). We do not know the significance
of the apparent paucity of elements in telomeric regions and in the
region flanking the rDNA repeats on chromosome 4 (but not on
chromosome 2).
Overall, transposon-rich regions are relatively gene-poor and
have lower rates of recombination and EST matches, indicating a
correlation between low gene expression, high transposon density
and low recombination51. The role of transposons in genome
organization and chromosome structure can now be addressed in a
model organism known to undergo DNA methylation and other
forms of chromatin modification thought to regulate
transposition52.
rDNA, telomeres and centromeres
Nucleolar organizers (NORs) contain arrays of unit repeats encoding the 18S, 5.8S and 25S ribosomal RNA genes and are transcribed
by RNA polymerase I. Together with 5S RNA, which is transcribed
by RNA polymerase III, these rRNAs form the structural and
catalytic cores of cytoplasmic ribosomes. In Arabidopsis, the
NORs juxtapose the telomeres of chromosomes 2 and 4, and
comprise uninterrupted 18S, 5.8S and 25S units all orientated on
the chromosomes in the same direction54. In contrast, the 5S rRNA
genes are localized to heterogeneous arrays in the centromeric
regions of chromosomes 3, 4 and 5 (ref. 55; and Fig. 6). Both
NORs are roughly 3.5–4.0 megabase-pairs and comprise ~350–400
highly methylated rRNA gene units, each ~10 kb (ref. 54). The
sequence between the euchromatic arms and NORs has been
determined. Elsewhere in the genome, only one other 18S, 5.8S,
25S rRNA gene unit was identified in centromere 3. Although minor
variations in sequence length and composition occur in the NOR
repeats, these variants are highly clustered, supporting a model of
sequence maintenance through concerted evolution55.
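The quoted size of each NOR is consistent with the quoted unit count and unit length, as the short Python check below shows. The names are illustrative, and the unit length is the approximate 10 kb figure from the text.

```python
# Approximate figures from the text for one nucleolar organizer (NOR).
RRNA_UNIT_KB = 10             # each rRNA gene unit is roughly 10 kb
UNITS_PER_NOR = (350, 400)    # reported range of unit copies per NOR


def array_size_mb(unit_count: int, unit_kb: float) -> float:
    """Total length of a tandem array in Mb, ignoring any spacer variation."""
    return unit_count * unit_kb / 1000


nor_size_range_mb = tuple(array_size_mb(n, RRNA_UNIT_KB) for n in UNITS_PER_NOR)
print(nor_size_range_mb)
```

The result spans 3.5 to 4.0 Mb, matching the size range given for both NORs.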
Figure 5 Distribution of class I, II and Basho transposons in Arabidopsis chromosomes. The frequencies of class I retroelements (green), class II DNA transposons (blue) and Basho elements (purple) are shown at 100-kb intervals along the five chromosomes (a–e) of Arabidopsis. Axes: frequency against position (Mb).
Arabidopsis telomeres are composed of CCCTAAA repeats and
average ~2–3 kb (ref. 56). For TEL4N (telomere 4 North), consensus repeats are adjacent to the NOR; the remaining telomeres are
typically separated from coding sequences by repetitive subtelomeric regions measuring less than 4 kb. Imperfect telomere-like
arrays of up to 24 kb are found elsewhere in the genome, particularly
near centromeres. These arrays might affect the expression of nearby
genes and may have resulted from ancient rearrangements, such as
inversions of the chromosome arms.
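Tandem arrays of the CCCTAAA telomere unit can be located in a sequence with a simple pattern search. The Python sketch below is only an illustration of the idea and is not the method used in this work: the function name, the minimum copy number and the toy sequence are our assumptions. It finds perfect repeats on the given strand only, so the imperfect telomere-like arrays mentioned above, and arrays written as the reverse complement (TTTAGGG), would need a more tolerant search.

```python
import re

TELOMERE_UNIT = "CCCTAAA"


def find_telomere_arrays(sequence: str,
                         unit: str = TELOMERE_UNIT,
                         min_copies: int = 3) -> list[tuple[int, int, int]]:
    """Return (start, end, copies) for each run of at least min_copies units.

    Coordinates are 0-based and end-exclusive; only perfect tandem copies
    on the given strand are reported.
    """
    pattern = re.compile(f"(?:{re.escape(unit)}){{{min_copies},}}")
    return [
        (m.start(), m.end(), (m.end() - m.start()) // len(unit))
        for m in pattern.finditer(sequence.upper())
    ]


# Toy sequence: a five-copy array, then a two-copy run that is too short.
toy_sequence = "ACGT" + "CCCTAAA" * 5 + "GGATCC" + "CCCTAAA" * 2 + "TT"
arrays = find_telomere_arrays(toy_sequence)
print(arrays)
```

Raising or lowering `min_copies` trades sensitivity against spurious matches to short chance occurrences of the unit.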
Centromere DNA mediates chromosome attachment to the
meiotic and mitotic spindles and often forms dense heterochromatin. Genetic mapping of the regions that confer centromere function
provided the markers necessary to precisely place BAC clones at
individual centromeres17; 69 clones were targeted for sequencing,
resulting in over 5 Mb of DNA sequence from the centromeric
regions. The unsequenced regions of centromeres are composed
primarily of long, homogeneous arrays that were characterized
previously with physical57 and genetic mapping17 and contain over
3 Mb of repetitive arrays, including the 180-bp repeats and 5S
rDNA51 (Fig. 6).
Arabidopsis centromeres, like those of many higher eukaryotes,
contain numerous repetitive elements including retroelements,
transposons, microsatellites and middle repetitive DNA17. These
repeats are rare in the euchromatic arms and often most abundant
in pericentromeric DNA. The repeats, affinity for DNA-binding
dyes, dense methylation patterns and inhibition of homologous
recombination indicate that the centromeric regions are highly
heterochromatic, and such regions are generally viewed as very
poor environments for gene expression. Unexpectedly, we found at
least 47 expressed genes encoded in the genetically defined centromeres of Arabidopsis (http://preuss.bsd.uchicago.edu/arabidopsis.genome.html). In several cases, these genes reside on islands of
unique sequence flanked by repetitive arrays, such as 180-bp or 5S
rDNA repeats. Among the genes encoded in the centromeres are
members of 11 of the 16 functional categories that comprise the
proteome. The centromeres are not subject to recombination;
consequently, genes residing in these regions probably exhibit
unique patterns of molecular evolution.
The function of higher eukaryotic centromeres may be specified
by proteins that bind to centromere DNA, by epigenetic
modifications, or by secondary or higher order structures. A
pairwise comparison of the non-repetitive portions of all five
centromeres showed they share limited (1–7%) sequence similarity.
Forty-one families of small, conserved centromere sequences
(AtCCS, see http://preuss.bsd.uchicago.edu/arabidopsis.genome.html) are enriched in the centromeric and pericentromeric regions
and differ from sequences found in the centromeres of other
eukaryotes. Molecular and genetic assays will be required to
determine whether these conserved motifs nucleate Arabidopsis
centromere activity. Apart from the AtCCS sequences, most centromere DNA is not shared between chromosomes, complicating
efforts to derive clear evolutionary relationships. In contrast, genetic
and cytological assays indicate that homologous centromeres are
highly conserved among Arabidopsis accessions, albeit subject to
rearrangements such as inversions to form knobs5,58,59 and
insertions4. Further investigation of centromere DNA promises to
yield information on the evolutionary forces that act in regions of
limited recombination, as well as an improved understanding of the
role of DNA sequence patterns in chromosome segregation.
Membrane transport
Transporters in the plasma and intracellular membranes of
Arabidopsis are responsible for the acquisition, redistribution and
compartmentalization of organic nutrients and inorganic ions, as
well as for the efflux of toxic compounds and metabolic end
products, energy and signal transduction, and turgor generation.
Previous genomic analyses of membrane transport systems in
S. cerevisiae and C. elegans led to the identification of over 100
distinct families of membrane transporters60,61. We compared
membrane transport processes between Arabidopsis, animals,
fungi and prokaryotes, and identified over 600 predicted membrane
transport systems in Arabidopsis (http://www-biology.ucsd.edu/~ipaulsen/transport/), a similar number to that of C. elegans
(~700 transporters) and over twofold greater than either
S. cerevisiae or E. coli (~300 transporters).
We compared the transporter complement of Arabidopsis,
C. elegans and S. cerevisiae in terms of energy coupling mechanisms
(Fig. 7a). Unlike animals, which use a sodium ion P-type ATPase
pump to generate an electrochemical gradient across the plasma
membrane, plants and fungi use a proton P-type ATPase pump to
form a large membrane potential (-250 mV)62. Consequently, plant
secondary transporters are typically coupled to protons rather than
to sodium63. Compared with C. elegans, Arabidopsis has a surprisingly high percentage of primary ATP-dependent transporters (12%
and 21% of transporters, respectively), reflecting increased numbers
of P-type ATPases involved in metal ion transport and ABC ATPases
proposed to be involved in sequestering unusual metabolites and
drugs in the vacuole or in other intracellular compartments. These
processes may be necessary for pathogen defence and nutrient
storage.
About 15% of the transporters in Arabidopsis are channel proteins, five times more than in any single-celled organism but half the
number in C. elegans (Fig. 7b). Almost half of the Arabidopsis
channel proteins are aquaporins, and Arabidopsis has 10-fold more
major intrinsic protein (MIP) family water channels than
any other sequenced organism. This abundance emphasizes the
importance of hydraulics in a wide range of plant processes,
including sugar and nutrient transport into and out of the vasculature, opening of stomatal apertures, cell elongation and epinastic
movements of leaves and stems. Although Arabidopsis has a diverse
range of metal cation transporters, C. elegans has more, many of
which function in cell–cell signalling and nerve signal transduction.
Arabidopsis also possesses transporters for inorganic anions such as
phosphate, sulphate, nitrate and chloride, as well as for metal cation
channels that serve in signal transduction or cell homeostasis.
Compared with other sequenced organisms, Arabidopsis has 10-fold more predicted peptide transporters, primarily of the proton-dependent oligopeptide transport (POT) family, emphasizing the
importance of peptide transport or indicating that there is broader
substrate specificity than previously realized. There are nearly 1,000
Arabidopsis genes encoding Ser/Thr protein kinases, suggesting that
peptides may have an important role in plant signalling64.
Virtually no transporters for carboxylates, such as lactate and
pyruvate, were identified in the Arabidopsis genome. About 12% of
the transporters were predicted to be sugar transporters, mostly
consisting of paralogues of the MFS family of hexose transporters.
Notably, S. cerevisiae, C. elegans and most prokaryotes use
APC family transporters as their principal means of amino-acid
transport, but Arabidopsis appears to rely primarily on the AAAP
family of amino-acid and auxin transporters. More than 10% of the
transporters in Arabidopsis are homologous to drug efflux pumps;
these probably represent transporters involved in the sequestration
into vacuoles of xenobiotics, secondary metabolites, and breakdown
products of chlorophyll.
Surprisingly, Arabidopsis has close homologues of the human
ABC TAP transporters of antigenic peptides for presentation to the
major histocompatibility complex (MHC). In Arabidopsis, these
transporters may be involved in peptide efflux, or more speculatively, in some form of cell-recognition response. Arabidopsis also
has 10-fold more members of the multi-drug and toxin extrusion
(MATE) family than any other sequenced organism; in bacteria,
these transporters function as drug efflux pumps. Curiously,
Arabidopsis has several homologues of the Drosophila RND transporter family Patched protein, which functions in segment polarity,
and more than ten homologues of the Drosophila ABC family eye
pigment transporters. In plants, these are presumably involved in
intracellular sequestration of secondary metabolites.
DNA repair and recombination
DNA repair and recombination pathways have many functions in
different species such as maintaining genomic integrity, regulating
mutation rates, chromosome segregation and recombination,
genetic exchange within and between populations, and immune
system development. Comparing the Arabidopsis genome with
other species65 indicates that Arabidopsis has a similar set of DNA
repair and recombination (RAR) genes to most other eukaryotes.
The pathways represented include photoreactivation, DNA ligation,
non-homologous end joining, base excision repair, mismatch
excision repair, nucleotide excision repair and many aspects of
DNA recombination (Supplementary Information Table 5). The
Arabidopsis RAR genes include homologues of many DNA repair
genes that are defective in different human diseases (for example,
hereditary breast cancer and non-polyposis colon cancer, xeroderma pigmentosum and Cockayne's syndrome).
One feature that sets Arabidopsis apart from other eukaryotes is
the presence of additional homologues of many RAR genes. This is
seen for almost every major class of DNA repair, including recombination (four RecA), DNA ligation (four DNA ligase I), photoreactivation (one class II photolyase and five class I photolyase
homologues) and nucleotide excision repair (six RPA1, two RPA2,
two Rad25, three TFB1 and four Rad23). This is most striking for
genes with probable roles in base excision repair. Arabidopsis
encodes 16 homologues of DNA base glycosylases (enzymes that
recognize abnormal DNA bases and cleave them from the sugar-phosphate backbone), more than any other species known. This
includes several homologues of each of three families of alkylation
damage base glycosylases: two of the S. cerevisiae MPG; six of the E.
coli TagI; and two of the E. coli AlkA. Arabidopsis also encodes three
homologues of the apurinic-apyrimidinic (AP) endonuclease Xth.
AP endonucleases continue the base excision repair started by
glycosylases by cleaving the DNA backbone at abasic sites.
Figure 6 Predicted centromere composition. Genetically defined centromere boundaries are indicated by filled circles; fully and partially assembled BAC sequences are represented by solid and dashed black lines, respectively. Estimates of repeat sizes within the centromeres were derived from consideration of repeat copy number, physical mapping and cytogenetic assays. Panels: CEN1 to CEN5. Key: 180 bp, 160 bp, mitochondrial, 5S rDNA. Scale bar: 100 kb.
Evolutionary analysis indicates that some of the extra copies of
RAR genes in Arabidopsis originated through relatively recent gene
duplications, because many of the sets of genes are more closely
related to each other than to their homologues in any other species.
As duplication is frequently accompanied by functional divergence,
the duplicate (paralogous) genes may have different repair specificities or may have evolved functions that are outside RAR functions
(as is the case for two of the five class I photolyase homologues,
which function as blue-light receptors). In most cases, it is not
known whether the paralogous gene copies have different functions.
The presence of multiple paralogues might also allow functional
redundancy or a greater repair or recombination capacity.
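The inference of recent duplication rests on a simple comparison: copies within Arabidopsis are closer to each other than to any homologue from another species. The following Python sketch illustrates that comparison only. The gene names, the `At_` prefix convention, the distance values and both function names are hypothetical, and a real analysis would rely on a phylogenetic tree rather than on raw pairwise distances.

```python
def split_distances(distances, focal_prefix="At_"):
    """Separate within-species distances from focal-versus-other distances."""
    within, between = [], []
    for (a, b), d in distances.items():
        n_focal = a.startswith(focal_prefix) + b.startswith(focal_prefix)
        if n_focal == 2:
            within.append(d)
        elif n_focal == 1:
            between.append(d)
        # Pairs with no focal-species gene are ignored.
    return within, between


def suggests_recent_duplication(within, between):
    """True when every within-species pair is closer than any cross-species pair."""
    return bool(within) and bool(between) and max(within) < min(between)


# Invented distances (arbitrary units) for two hypothetical Arabidopsis copies
# and two hypothetical homologues from other species; not real data.
distances = {
    ("At_copy1", "At_copy2"): 0.12,
    ("At_copy1", "Sc_hom"): 0.85,
    ("At_copy2", "Sc_hom"): 0.88,
    ("At_copy1", "Ec_hom"): 1.10,
    ("At_copy2", "Ec_hom"): 1.07,
    ("Sc_hom", "Ec_hom"): 1.20,
}

within, between = split_distances(distances)
print(suggests_recent_duplication(within, between))
```

With these invented values the two Arabidopsis copies group together, which is the pattern described in the text for many of the RAR gene sets.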
The multiplicity of RAR genes in Arabidopsis is also partly due to
the transfer of genes from the organellar genomes to the nucleus.
Repair gene homologues that appear to be of chloroplast origin
(Supplementary Information Tables 2 and 5) include the recombination proteins RecA, RecG and SMS, two class I photolyase
homologues, Fpg, two MutS2 proteins, and the transcription-repair coupling factor Mfd. Two of these (RecA and Fpg) are
involved in RAR functions in the plastid, suggesting that the
others may be as well. The finding of an Mfd orthologue of
cyanobacterial descent is surprising. In E. coli, Mfd couples nucleotide excision repair carried out by UvrABC to transcription, leading
to the rapid repair of DNA damage on the transcribed strand of
transcribed genes66. The absence of orthologues of UvrABC in
Arabidopsis renders the function of Mfd difficult to predict. The
presence of Mfd but not UvrABC has been reported for only one
other species, a bacterial endosymbiont of the pea aphid.
Other nuclear-encoded Arabidopsis DNA repair gene homologues are evolutionarily related to genes from α-Proteobacteria, and
thus may be of mitochondrial descent. In particular, the six homologues of the alkyl-base glycosylase TagI appear to be the result of a
large expansion in plants after transfer from the mitochondrial
genome. Whether any of these TagI homologues function in the
repair and maintenance of mitochondrial DNA has not been
determined. More detailed phylogenetic analysis may reveal additional Arabidopsis RAR genes to be of organellar ancestry.
There are some notable absences of proteins important for RAR
in other species, including alkyltransferases, MSH4, RPA3 and many
components of TFIIH (TFB2, TFB3, TFB4, CCL1, Kin28). Nevertheless, Arabidopsis shows many similarities to the set of DNA repair
genes found in other eukaryotes, and therefore offers an experimental system for determining the functions of many of these
proteins, in part through characterization of mutants defective in
DNA repair67.
Gene regulation
Eukaryotic gene expression involves many nuclear proteins that
modulate chromatin structure, contribute to the basal transcription
machinery, or mediate gene regulation in response to developmental, environmental or metabolic cues. As predicted by sequence
similarity, more than 3,000 such proteins may be encoded by the
Arabidopsis genome, suggesting that it has a comparable complexity
of gene regulation to other eukaryotes. Arabidopsis has an additional
level of gene regulation, however, with DNA methylation potentially
mediating gene silencing and parental imprinting.
Plants have evolved several variations on chromatin remodelling
proteins, such as the family of HD2 histone deacetylases68. Although
Arabidopsis possesses the usual number of SNF2-type chromatin
remodelling ATPases, which regulate the expression of nearly all
genes, there are significant structural differences between yeast and
metazoan SNF2-type genes and their orthologues in Arabidopsis.
DDM1, a member of the SNF2 superfamily, and MOM1, a gene with
similarity to the SNF2 family, are involved in transcriptional gene
silencing in Arabidopsis. MOM1 has no clear orthologue in fungal or
metazoan genomes.
Consistent with its methylated DNA, Arabidopsis possesses
eight DNA methyltransferases (DMTs). Two of the three types
are orthologous to mammalian DMT69 whereas one, chromomethyltransferase70, is unique to plants. No DMTs are found in
yeast or C. elegans, although two DMT-like genes are found in
Drosophila71. Arabidopsis also encodes eight proteins with methyl-DNA-binding domains (MBDs). Despite lacking methylated DNA,
Drosophila encodes four MBD proteins and C. elegans has two.
These differences in chromatin components are likely to
reflect important differences in chromatin-based regulatory
control of gene expression in eukaryotes (Supplementary Information Table 6; http://Ag.Arizona.Edu/chromatin/chromatin.html).
The Arabidopsis genome encodes transcription machinery for the
three nuclear DNA-dependent RNA polymerase systems typical of
eukaryotes (Supplementary Information Table 6). Transcription by
RNA polymerases II and III appears to involve the same machinery
as is used in other eukaryotes; however, most transcription factors for
RNA polymerase I are not readily identified. Only two polymerase I
regulators (other than polymerase subunits and TATA-binding
protein) are apparent in Arabidopsis, namely homologues of yeast
RRN3 and mouse TTF-1. All eukaryotes examined to date have
distinct genes for the largest and second largest subunits of polymerase I, II and III. Unexpectedly, Arabidopsis has two genes
encoding a fourth class of largest subunit and second-largest
subunit (Supplementary Information Fig. 5). It will be interesting
to determine whether the atypical subunits comprise a polymerase
that has a plant-specific function. Four genes encoding single-subunit plastid or mitochondrial RNA polymerases have been
identified in Arabidopsis (Supplementary Information Table 6).
Genes for the bacterial β-, β′- and α-subunits of RNA polymerase
are also present, as are homologues of various σ-factors, and these
proteins may regulate chloroplast gene expression. Mutations in the
Sde-1 gene, encoding RNA-dependent RNA polymerase (RdRp),
lead to defective post-transcriptional gene silencing72. We also
identified five more closely related RdRp genes.
Our analysis, using both similarity searches and domain matches,
has identified 1,709 proteins with significant similarity to known
classes of plant transcription factors classified by conserved DNA-binding domains. This analysis used a consistent conservative
threshold that probably underestimates the size of families of
diverse sequence. This class of protein is the least conserved
among all classes of known proteins, showing only 8-23% similarity to transcription factors in other eukaryotes (Fig. 2b). This
reduced similarity is due to the absence of certain classes of
transcription factors in Arabidopsis and large numbers of plant-specific transcription factors. We did not detect any members of
several widespread families of transcription factors, such as the REL
(Rel-like DNA-binding domain) homology region proteins, nuclear
steroid receptors and forkhead-winged helix and POU (Pit-1, Oct and Unc-86) domain families of developmental regulators. Conversely, of 29 classes of Arabidopsis transcription factors, 16 appear
to be unique to plants (Supplementary Information Table 6).
Several of these, such as the AP2/EREBP-RAV, NAC and ARF-AUX/IAA families, contain unique DNA-binding domains, whereas
others contain plant-specific variants of more widespread domains,
such as the DOF and WRKY zinc-finger families and the two-repeat
MYB family.
Functional redundancy among members of large families of
closely related transcription factors in Arabidopsis is a significant
potential barrier to their characterization73. For example, in the
SHATTERPROOF and SEPALLATA families of MADS box transcription factors, all genes must be defective to produce visible
mutant phenotypes74,75. These functionally redundant genes are
found on the segmental duplications described above. Our analyses,
together with the significant sequence similarity found in large
families of transcription factors such as the R2R3-repeat MYB and
WRKY families, suggest that strategies involving overexpression will
be important in determining the functions of members of transcription factor families.
Arabidopsis has two times as many transcription factors as have been identified in Drosophila29 and more than three times as many
as in C. elegans1. The significantly greater extent of segmental chromosomal and local tandem
duplications in the Arabidopsis genome generates larger gene families,
including transcription factors. The partly overlapping functions
defined for a few transcription factors are also likely to be much
more widespread, implicating many sequence-related transcription
factors in the same cellular processes. Finally, the expanded number
of genes involved in metabolism, defence and environmental interaction in Arabidopsis (Fig. 2a), which have few counterparts in
Drosophila and C. elegans, all require additional numbers and classes
of transcription factors to integrate gene function in response to a
vast range of developmental and environmental cues.
Cellular organization
Plant cells differ from animal cells in many features such as plastids,
vacuoles, Golgi organization, cytoskeletal arrays, plasmodesmata
linking cytoplasms of neighbouring cells, and a rigid polysaccharide-rich extracellular matrix, the cell wall. Because the cell wall
maintains the position of a cell relative to its neighbours, both
changes in cell shape and organized cell divisions, involving cytoskeleton reorganization and membrane vesicle targeting, have major
roles in plant development. Plant cytokinesis is also unique in
that the partitioning membrane is formed de novo by vesicle fusion.
We compared the Arabidopsis genome with those of C. elegans,
Drosophila and yeast to glimpse the genetic basis of plant-cell-specific features.
Figure 7 Comparison of the transport capabilities of Arabidopsis, C. elegans and
S. cerevisiae. Pie charts show the percentage of transporters in each organism according
to bioenergetics (a) and substrate specificity (b). Categories in a: channels; secondary transport; primary transport; uncharacterized. Categories in b: cations (inorganic); amines, amides and polyamines; anions (inorganic); peptides; water; sugars and derivatives; carboxylates; amino acids; bases and derivatives; vitamins and cofactors; drugs and toxins; macromolecules; unknown.
The principal components of the plant cytoskeleton are microtubules (MTs) and actin filaments (AFs); intermediate filaments
(IFs) have not been described in plants. Arabidopsis appears to lack
genes for cytokeratin or vimentin, the main components of animal
IFs, but has several variants of actin, α- and β-tubulin. The
Arabidopsis genome also encodes homologues of chaperones that
mediate the folding of tubulin and actin polypeptides in yeast and
animal cells, such as the prefoldin and cytosolic chaperonin complexes and tubulin-folding cofactors. The dynamic stability of MTs
and AFs is influenced by MT-associated proteins and actin-binding
proteins, respectively, several of which are encoded by Arabidopsis
genes. These include the MT-severing ATPase katanin, AF-crosslinking/bundling proteins, such as fimbrins and villins, and AF-disassembling proteins, such as profilin and actin-depolymerizing
factor/cofilin. The Arabidopsis proteome appears to lack homologues of proteins that, in animal cells, link the actin cytoskeleton
across the plasma membrane to the extracellular matrix, such as
integrin, talin, spectrin, α-actinin, vitronectin or vinculin. This
apparent lack of 'anchorage' proteins is consistent with the different
composition of the cell wall and with a prominence of cortical MTs
at the expense of cortical AFs in plant cells.
Plant-specific cytoskeletal arrays include interphase cortical MTs
mediating cell shape, the preprophase band marking the cortical site
of cell division, and the phragmoplast assisting in cytokinesis76.
Although plant cells lack structural counterparts of the yeast spindle
pole body and the animal centrosome, Arabidopsis has homologues
of core components of the MT-nucleating γ-tubulin ring complex,
such as γ-tubulin, Spc97/hGCP2 and Spc98/hGCP3. Arabidopsis
has numerous motor molecules, both kinesins and dyneins with
associated dynactin complex proteins, which are presumably
involved in the dynamic organization of MTs and in transporting
cargo along MT tracks. There are also myosin motors that may be
involved in AF-supported organelle trafficking. Essential features of
the eukaryotic cytoskeleton appear to be conserved in Arabidopsis.
The Arabidopsis genome encodes homologues of proteins
involved in vesicle budding, including several ARFs and ARF-related small G-proteins, large but not small ARF GEFs (ADP-ribosylation
factor guanine nucleotide exchange factors), adapter
proteins, and coat proteins of the COP and non-COP types.
Arabidopsis also has homologues of proteins involved in vesicle
docking and fusion, including SNAP receptors (SNAREs), N-ethylmaleimide-sensitive factor (NSF) and Cdc48-related ATPases,
accessory proteins such as Sec1 and soluble NSF attachment protein
(SNAP), and Rab-type GTPases. The large number of Arabidopsis
SNAREs can be grouped by sequence similarity to yeast and animal
counterparts involved in specific trafficking pathways, and some
have been localized to the trans-Golgi and the pre-vacuolar
pathway77. Arabidopsis also has a receptor for retention of proteins
in the endoplasmic reticulum, a cargo receptor for transport to the
vacuole and several phragmoplastins related to animal dynamin
GTPases. Thus, plant cells appear to use the same basic machinery
for vesicle trafficking as yeast and animal cells.
Animal cells possess many functionally diverse small G-proteins
of the Ras superfamily involved in signal transduction, AF reorganization, vesicle fusion and other processes. Surprisingly,
Arabidopsis appears to lack genes for G-proteins of the Ras, Rho,
Rac and Cdc42 subfamilies but has many Rab-type G-proteins
involved in vesicle fusion and several Rop-type G-proteins, one of
which has a role in actin organization of the tip-growing pollen
tube78. The significance of this divergent amplification of different
subfamilies of small G-proteins in plants and animals remains to be
determined.
Arabidopsis possesses cyclin-dependent kinases (CDKs), including a plant-specific Cdc2b kinase expressed in a cell-cycle-dependent manner, several cyclin subtypes, including a D-type cyclin that
mediates cytokinin-stimulated cell-cycle progression79, a retinoblastoma-related protein and components of the ubiquitin-dependent
proteolytic pathway of cyclin degradation. In yeast and animal cells,
chromosome condensation is mediated by condensins, sister chromatids are held together by cohesins such as Scc1, and metaphase-anaphase
transition is triggered by separin/Esp1 endopeptidase
proteolysis of Scc1 on APC-mediated degradation of its inhibitor,
securin/Pds1. Related proteins are encoded by the Arabidopsis
genome. Thus, the basic machinery of cell-cycle progression,
genome duplication and segregation appears to be conserved in
plants. By contrast, entry into M phase, M-phase progression and
cytokinesis seem to be modi®ed in plant cells. Arabidopsis does not
appear to have homologues of Cdc25 phosphatase, which activates
Cdc2 kinase at the onset of mitosis, or of polo kinase, which
regulates M-phase progression in yeast and animals. Conversely,
plant-specific mitogen-activated protein (MAP) kinases appear to be
involved in cytokinesis.
Cytokinesis partitions the cytoplasm of the dividing cell. Yeast
and animal cells expand the membrane from the surface towards the
centre in a cleavage process supported by septins and a contractile
ring of actin and type II myosin. By contrast, plant cytokinesis starts
in the centre of the division plane and progresses laterally. A
transient membrane compartment, the cell plate, is formed de
novo by fusion of Golgi-derived vesicles trafficking along the
phragmoplast MTs80. Consistent with the unique mode of plant
cytokinesis, Arabidopsis appears to lack genes for septins and type II
myosin. Conversely, cell-plate formation requires a cytokinesis-specific syntaxin that has no close homologue in yeast and animals.
Although syntaxin-mediated membrane fusion occurs in animal
cytokinesis and cellularization, the vesicles are delivered to the base
of the cleavage furrow. Thus, the plant-specific mechanism of cell
division is linked to conserved eukaryotic cell-cycle machinery.
Two main conclusions are suggested by this comparative analysis.
First, Arabidopsis and other eukaryotic cells have common features related
to intracellular activities, such as vesicle trafficking, cytoskeleton
and cell cycle. Second, evolutionarily divergent features, such as
organization of the cytoskeleton and cytokinesis, appear to relate to
the plant cell wall.
Development
The regulation of development in Arabidopsis, as in animals,
involves cell-cell communication, hierarchies of transcription factors, and the regulation of chromatin state; however, there is no
reason to suppose that the complex multicellular states of plant and
animal development have evolved by elaborating the same general
processes during the 1.6 billion years since the last common unicellular ancestor of plants and animals81,82. Our genome analyses
reflect the long, independent evolution of many processes contributing to development in the two kingdoms.
Plants and animals have converged on similar processes of pattern
formation, but have used and expanded different transcription
factor families as key causal regulators. For example, segmentation
in insects and differentiation along the anterior-posterior and limb
axes in mammals both involve the spatially specific activation of a
series of homeobox gene family members. The pattern of activation
is causal in the later differentiation of body and limb axis regions. In
plants the pattern of floral whorls (sepals, petals, stamens, carpels) is
also established by the spatially specific activation of members of a
family of transcription factors, but in this instance the family is the
MADS box family. Plants also have homeobox genes and animals
have MADS box genes, implying that each lineage invented separately its mechanism of spatial pattern formation, while converging
on actions and interactions of transcription factors as the mechanism. Other examples show even greater divergence of plant and
animal developmental control. Examples are the AP2/EREBP and
NAC families of transcription factors, which have important roles in
flower and meristem development; both families are so far found
only in plants (Supplementary Information Table 6).
A similar story can be told for cell-cell communication. Plants do
not seem to have receptor tyrosine kinases, but the Arabidopsis
genome has at least 340 genes for receptor Ser/Thr kinases, belonging to many different families, defined by their putative extracellular
domains (Supplementary Information Table 7). Several families
have members with known functions in cell-cell communication,
such as the CLV1 receptor involved in meristem cell signalling, the
S-glycoprotein homologues involved in signalling from pollen to
stigma in self-incompatible Brassica species, and the BRI1 receptor
necessary for brassinosteroid signalling83. Animals also have receptor Ser/Thr kinases, such as the transforming growth factor-β
(TGF-β) receptors, but these act through SMAD proteins that are
absent from Arabidopsis. The leucine-rich repeat (LRR) family of
Arabidopsis receptor kinases shares its extracellular domain with
many animal and fungal proteins that do not have associated kinase
domains, and there are at least 122 Arabidopsis genes that code for
LRR proteins without a kinase domain. Other Arabidopsis receptor
kinase families have extracellular domains that are unfamiliar in
animals. Thus, evolution is modular, and the plant and animal
lineages have expanded different families of receptor kinases for a
similar set of developmental processes.
Several Arabidopsis genes of developmental importance appear to
be derived from a cyanobacteria-like genome (Supplementary
Information Table 2), with no close relationship to any animal or
fungal protein. One salient example is the family of ethylene
receptors; another gene family of apparent chloroplast origin is
the phytochromes, light receptors involved in many developmental decisions (see below). Whereas the land plant phytochromes
show clear homology to the cyanobacterial light receptors, which
are typical prokaryotic histidine kinases, the plant phytochromes
are histidine kinase paralogues with Ser/Thr specificity84. Similarly
to the ethylene receptors, the proteins that act downstream of plant
phytochrome signalling are not found in cyanobacteria, and thus it
appears that a bacterial light receptor entered the plant genome
through horizontal transfer, altered its enzymatic activity, and
became linked to a eukaryotic signal transduction pathway. This
infusion of genes from a cyanobacterial endosymbiont shows that
plants have a richer heritage of ancestral genes than animals, and
unique developmental processes that derive from horizontal gene
transfer.
Signal transduction
Being generally sessile organisms, plants have to respond to local
environmental conditions by changing their physiology or redirecting their growth. Signals from the environment include light and
pathogen attack, temperature, water, nutrients, touch and gravity.
In addition to local cellular responses, some stimuli are communicated across the plant body, with plant hormones and peptides
acting as secondary messengers. Some hormones, such as auxin, are
taken up into the cell, whereas others, such as ethylene and
brassinosteroids, and the peptide CLV3, act as ligands for receptor
kinases on the plasma membrane. No matter where the signal is
perceived by the cell, it is transduced to the nucleus, resulting in
altered patterns of gene expression.
Comparative genome analysis between Arabidopsis, C. elegans
and Drosophila supports the idea that plants have evolved their own
pathways of signal transduction85. None of the components of the
widely adopted signalling pathways found in vertebrates, flies or
worms, such as Wingless/Wnt, Hedgehog, Notch/lin12, JAK/STAT,
TGF-β/SMADs, receptor tyrosine kinase/Ras or the nuclear steroid
hormone receptors, is found in Arabidopsis. By contrast, brassinosteroids are ligands of the BRI1 Ser/Thr kinase, a member of the
largest recognizable class of transmembrane sensors encoded by
340 receptor-like kinase (RLK) genes in the Arabidopsis genome
(Supplementary Information Table 7). With a few notable exceptions, such as CLV1, the types of ligands sensed by RLKs are
completely unknown, providing an enormous future challenge for
plant biologists. G-protein-coupled receptors (GPCRs)/seven-transmembrane proteins are an abundant class of proteins in
mammalian genomes, instrumental in signal transduction. INTERPRO detected 27 GPCR-related domains in Arabidopsis (Supplementary Information Table 1), although there is no direct
experimental evidence for these. Arabidopsis contains a family of
18 seven-transmembrane proteins of the mildew resistance (Mlo)
class, several of which are involved in defence responses. Notably,
only single Gα (GPA1) and Gβ (AGB1) subunits are found in
Arabidopsis, both previously known86.
Although cyclic GMP has been proposed to be involved in signal
transduction in Arabidopsis87, a protein containing a guanylate
cyclase domain was not identified in our analyses. Nevertheless,
cyclic nucleotide-binding domains were detected in various proteins, indicating that cNMPs may have a role in plant signal
transduction. Thus, although cNMP-binding domains appear to
have been conserved during evolution, cNMP synthesis in
Arabidopsis may have evolved independently.
We were unable to identify a protein with significant similarity to
known Gγ subunits, but recent biochemical studies suggest that a
protein with this functional capacity is likely to be present in plant
cells (H. Ma, personal communication). Therefore, there is potential for the formation of only a single heterotrimeric G-protein
complex; however, its functional interaction with any of the potential GPCR-related proteins remains to be determined.
Modules of cellular signal pathways from bacteria and animals
have been combined and new cascades have been innovated in
plants. A pertinent example is the response to the gaseous plant
hormone ethylene88. Ethylene is perceived and its signal transmitted
by a family of receptors related to bacterial-type two-component
histidine kinases (HKs). In bacteria, yeast and plants, these proteins
sense many extracellular signals and function in a His-to-Asp
phosphorelay network89. In turn, these proteins physically interact
with the genetically downstream protein CTR1, a Raf/MAPKKK-related kinase, revealing the juxtaposition of bacterial-type two-component receptors and animal-type MAP kinase cascades. Unlike
animals, however, Arabidopsis does not seem to have a Ras protein
to activate the MAP kinase cascade. MAP kinases are found in
abundance in Arabidopsis: we identified ~20, a higher number than
in any other eukaryote. As potentially counteracting components,
we found ~70 putative PP2C protein phosphatases. Although this
group is largely uncharacterized functionally, several members are
related to ABI1/ABI2, key negative regulators in the signalling
pathway for the plant hormone abscisic acid. Additional components of the His-to-Asp phosphorelay system were also found in
Arabidopsis, including authentic response regulators (ARRs), pseudo-response regulators (PRRs) and phosphotransfer intermediate
protein (HPt)90. We found 11 HKs in the proteome (3 new), 16 RRs
(2 new) and 8 PRRs (2 new). The biological roles of most ARRs,
PRRs and HPts are largely unknown, but several have been found
to have diverse functions in plants, including transcriptional activation in response to the plant hormone cytokinin91, and as components of the circadian clock92.
Plants seem to have evolved unique signalling pathways by
combining a conserved MAP kinase cascade module with new
receptor types. In many cases, however, the ligands are unknown.
Conversely, some known signalling molecules, such as auxin, are
still in search of a receptor. Auxin signalling may represent yet
another plant-specific mode of signalling, with protein degradation
through the ubiquitin-proteasome pathway preceding altered gene
expression. With many Arabidopsis genes encoding components of
the ubiquitin-proteasome pathway, elimination of negative regulators may be a more widespread phenomenon in plant signalling.
Recognizing and responding to pathogens
Plants are constantly exposed to pests, parasites and pathogens and
have evolved many defences. In mammals, polymorphism for
parasite recognition encoded in the MHC genes contributes to
resistance. In plants, disease resistance (R) genes that confer parasite
recognition are also extremely polymorphic. This polymorphism
has been proposed to restrict parasites, and its absence may explain
the breakdown of resistance in crop monocultures93. In contrast to
MHC genes, plant resistance genes are found at several loci, and the
complete genome sequence enables analysis of their complement
and structure. Parasite recognition by resistance genes triggers
defence mechanisms through various signalling molecules, such as
protein kinases and adapter proteins, ion fluxes, reactive oxygen
intermediates and nitric oxide. These halt pathogen colonization
through transcriptional activation of defence genes and a form of
programmed cell death called the hypersensitive response94. The
Arabidopsis genome contains diverse resistance genes distributed at
many loci, along with components of signalling pathways, and
many other genes whose role in disease resistance has been inferred
from mutant phenotypes.
Most resistance genes encode intracellular proteins with a nucleotide-binding (NB) site typical of small G proteins, and carboxy-terminal LRRs95. Their amino termini either carry a TIR domain, or
a putative coiled coil (CC). There are 85 TIR-NB-LRR resistance
genes at 64 loci, and 36 CC-NB-LRR resistance genes at 30 loci.
Some NB-LRR resistance genes express neither obvious TIR nor
CC domains at their N termini. This potential class is present seven
times, at six loci. There are 15 truncated TIR-NB genes that lack an
LRR at 10 loci, often adjacent to full TIR-NB-LRR genes. There are
also six CC-NB genes, at five loci. These truncated products may
function in resistance. Intriguingly, two TIR-NB-LRR genes carry
a WRKY domain, found in transcription factors that are implicated
in plant defence, and one of these also encodes a protein kinase
domain.
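The classes named above follow from which domains flank the NB site. As an illustration only, the following Python sketch assigns a class label from a set of detected domains. The function name, the domain labels and the rule that TIR takes precedence over CC when both are reported are assumptions made for this example and do not describe the annotation procedure used in the analysis.

```python
def classify_nb_gene(domains):
    """Label a predicted protein by the domain combinations named in the text.

    `domains` is a set of domain names, for example {"TIR", "NB", "LRR"}.
    Returns None when no NB site is present.
    """
    if "NB" not in domains:
        return None
    # Assumption for this sketch: TIR is checked before CC.
    if "TIR" in domains:
        prefix = "TIR-"
    elif "CC" in domains:
        prefix = "CC-"
    else:
        prefix = ""
    suffix = "-LRR" if "LRR" in domains else ""
    return f"{prefix}NB{suffix}"


# Hypothetical domain sets, one per class mentioned in the paragraph above.
examples = [
    {"TIR", "NB", "LRR"},
    {"CC", "NB", "LRR"},
    {"NB", "LRR"},
    {"TIR", "NB"},
    {"CC", "NB"},
]
for domains in examples:
    print(sorted(domains), "->", classify_nb_gene(domains))
```

Extra domains, such as the WRKY or protein kinase domains noted for two TIR-NB-LRR genes, do not change the label in this sketch.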
Resistance gene evolution may involve duplication and divergence of linked gene families36; however, most (46) resistance genes
are singletons; 50 are in pairs, 21 are in 7 clusters of 3 family
members, with single clusters of 4, 5, 7, 8 and 9 members,
respectively. Of the non-singletons, ~60% of pairs are in direct
repeats, and ~40% are in inverted repeats. Resistance genes are
unevenly distributed between chromosomes, with 49 on chromosome 1; 2 on chromosome 2; 16 on chromosome 3; 28 on chromosome 4; and 55 on chromosome 5.
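The two breakdowns in this paragraph, by cluster size and by chromosome, can be checked against each other with simple arithmetic. The short Python tally below uses only the counts quoted in the paragraph; the variable names are chosen for this example.

```python
# Counts copied from the paragraph above.
singletons = 46
genes_in_pairs = 50
genes_in_triplet_clusters = 21       # 7 clusters of 3 family members
larger_clusters = [4, 5, 7, 8, 9]    # one cluster of each size

total_by_cluster = (
    singletons
    + genes_in_pairs
    + genes_in_triplet_clusters
    + sum(larger_clusters)
)

# Resistance genes per chromosome, as listed in the text.
per_chromosome = {1: 49, 2: 2, 3: 16, 4: 28, 5: 55}
total_by_chromosome = sum(per_chromosome.values())

# Percentage of resistance genes on each chromosome.
share = {c: round(100 * n / total_by_chromosome, 1) for c, n in per_chromosome.items()}

print(total_by_cluster, total_by_chromosome)
print(share)
```

Both breakdowns give the same total of 150 genes, and chromosomes 1 and 5 together carry more than two-thirds of them.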
In other plant species, resistance genes encode both transmembrane receptors for secreted pathogen products and protein kinases,
and some other classes are also found. The Cf genes in tomato
encode extracellular LRRs with a transmembrane domain and short
cytoplasmic domain. Mutation in an Arabidopsis homologue,
CLAVATA2, results in enlarged meristems, but to date no resistance
function has been assigned to the 30 Arabidopsis CLV2 homologues.
CLAVATA1, a transmembrane LRR kinase, is also required for
meristem function. Xa21, a rice LRR-kinase, confers Xanthomonas
resistance, and the Arabidopsis FLS2 LRR kinase confers recognition
of flagellin. It has been proposed that CLV1 and CLV2 function as a
heterodimer; perhaps this is also true for Xa21, FLS2 and Cf
proteins. There are 174 LRR transmembrane kinases in
Arabidopsis, with only FLS2 assigned a role in resistance. A unique
resistance gene, beet Hs1pro-1, which confers nematode resistance,
has two Arabidopsis homologues.
The tomato Pto Ser/Thr kinase acts as a resistance protein in
conjunction with an NB-LRR protein, so similar kinases might do
the same for Arabidopsis NB-LRR proteins. There are 860 Ser/Thr
kinases in the Arabidopsis sequence. Fifteen of these share 50%
identity over the Pto-aligned region. The Toll pathway in Drosophila
and mammals regulates innate immune responses through
LRR/TIR domain receptors that recognize bacterial lipopolysaccharides96. Pto is highly homologous to Drosophila PELLE
and mammalian IRAK protein kinases that mediate the TIR
pathway.
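The 50% identity figure quoted above for Pto-like kinases is a percentage computed over an aligned region. A minimal Python sketch of such a calculation follows. The function name, the convention of skipping gapped columns and the toy sequences are assumptions made for illustration; the text does not state how identity was computed, and the sequences are not Pto or any real kinase.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over ungapped columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip gapped columns (one of several possible conventions)
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy aligned fragments, invented for illustration.
seq_a = "MKV-LSTGA"
seq_b = "MRVALS-GA"

identity = percent_identity(seq_a, seq_b)
print(f"{identity:.1f}% identity over the aligned region")
print("meets a 50% threshold:", identity >= 50.0)
```

Other conventions, such as counting gapped columns as mismatches, give lower values for the same alignment, so a threshold is only meaningful together with the convention used.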
Additional genes have been defined that are required for resistance by our analysis of the genome sequence. The ndr1 mutation
defines a gene required by the CC-NB-LRR genes RPS2 and RPM1.
NDR1 is 1 of 28 Arabidopsis genes that are similar both to each other
and to the tobacco HIN1 gene that is transcriptionally induced early
during the hypersensitive response. EDS1 is a gene required for
TIR-NB-LRR function, and like PAD4, encodes a protein with a
putative lipase motif. EDS1, PAD4 and a third gene comprise the
EDS1/PAD4 family. The NPR1/NIM1/SAI1 gene is required for
systemic acquired resistance, and we found five additional NPR1
homologues. Recessive mutations at both the barley Mlo and
Arabidopsis LSD1 loci confer broad-spectrum resistance and derepress a cell-death program. There are at least 18 Mlo family
members that resemble heterotrimeric GPCRs in Arabidopsis, and
only two LSD1 homologues.
One of the earliest responses to pathogen recognition is the
production of reactive oxygen intermediates. This involves a specialized respiratory burst oxidase protein that transfers an electron
across the plasma membrane to make superoxide. Arabidopsis
encodes eight apparently functional gp91 homologues, called
Atrboh genes. Unlike gp91, they all carry an ~300 amino-acid N-terminal extension carrying an EF-hand Ca2+-binding domain. In
mammals, activation of the respiratory oxidative burst complex in
the neutrophil, which includes gp91, requires the action of Rac
proteins. As no Rac or Ras proteins are found in Arabidopsis,
members of the large rop family of G proteins may carry this out.
Similarly, we did not detect any Arabidopsis homologues of other
mammalian respiratory burst oxidase components (p22, p47, p67,
p40).
There are no clear homologues of many mammalian defence and
cell-death control genes. Although nitric oxide production is
involved in plant defence, there is no obvious homologue of nitric
oxide synthase. Also absent are apparent homologues of the REL
domain transcription factors involved in innate immunity in both
Drosophila and mammals. We found no similarity to proteins
involved in regulating apoptosis in animal cells, such as classical
caspases, bcl2/ced9 and baculovirus p35. There are, however, 36
cysteine proteases. There are also eight homologues of a newly
defined metacaspase family97, two of which, along with LSD1, have a
clear GATA-type zinc-finger.
Photomorphogenesis and photosynthesis
Because nearly all plants are sessile and most depend on photosynthesis, they have evolved unique ways of responding to light.
Light serves as an energy source, as well as a trigger and modulator
of complex developmental pathways, including those regulated by
the circadian clock. Light is especially important during seedling
emergence, where it stimulates chlorophyll production, leaf development, cotyledon expansion, chloroplast biogenesis and the coordinated induction of many nuclear- and chloroplast-encoded genes,
while at the same time inhibiting stem growth. The goal of this
process, called photomorphogenesis, is the establishment of a body
plan that allows the plant to be an efficient photosynthetic machine
under varying light conditions98. The signal transduction cascade
leading to light-induced responses begins with the activation of
photoreceptors. Next, the light signal is transduced via positively
and negatively acting nuclear and cytoplasmic proteins, causing
activation or derepression of nuclear and chloroplast-encoded
photosynthetic genes and enabling the plant to establish optimal
photoautotrophic growth. Although genetic and biochemical studies have defined many of the components in this process, the
genome sequence provides an opportunity to identify comprehensively Arabidopsis genes involved in photomorphogenesis and the
establishment of photoautotrophic growth. We identified at least
100 candidate genes involved in light perception and signalling, and
139 nuclear-encoded genes that potentially function in photosynthesis.
The roles of only 35 of the 100 candidate photomorphogenic
genes have been described (Supplementary Information Table 8).
All of the light photoreceptors had been discovered previously,
including five red/far-red absorbing phytochromes (PHYA-E), two
blue/ultraviolet-A absorbing cryptochromes (CRY1 and CRY2),
one blue-absorbing phototropin (NPH1) and one NPH1-like (or
NPL1). In contrast, we uncovered many new proteins similar to
the photomorphogenesis regulators COP/DET/FUS, PKS1, PIF3,
NDPK2, SPA1, FAR1, GIGANTEA, FIN219, HY5, CCA1, ATHB-2,
ZEITLUPE, FKF1, LKP1, NPH3 and RPT2.
Both the phytochromes and NPH1 contain chromophores for
light sensing coupled to kinase domains for signal transmission.
Phytochromes have an N-terminal chromophore-binding domain,
two PAS domains, and a C-terminal Ser/Thr kinase domain99,
whereas NPH1 has two LOV domains (members of the PAS
domain superfamily) for flavin mononucleotide binding and a
C-terminal Ser/Thr kinase domain100. PAS domains potentially
sense changes in light, redox potential and oxygen energy levels, as
well as mediating protein-protein interactions99,100. We searched
for uncharacterized proteins with the combination of a kinase
domain and either a phytochrome chromophore-binding site or
PAS domains. Although we found no new phytochrome-like
genes, we did identify four predicted proteins that contain PAS
and kinase domains (Supplementary Information Fig. 6). These
proteins share 80% amino-acid identity, but, unlike NPH1 and
NPL1, have only one PAS domain. The combination of potential
signal sensing and transmitting domains makes it tempting to
speculate that these proteins may be receptors for light or other
signals.
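The domain-combination search described above can be sketched as a simple filter over predicted domain architectures. The domain labels, protein names and input layout below are hypothetical; this is an illustration of the selection logic under those assumptions, not the search that was actually run.

```python
from typing import Dict, List, Set

# Assumed domain labels; a real annotation would use database identifiers.
KINASE = "Ser/Thr kinase"
PAS = "PAS"
CHROMOPHORE = "phytochrome chromophore-binding"


def candidate_receptors(domains_by_protein: Dict[str, List[str]],
                        characterized: Set[str]) -> Dict[str, int]:
    """Return uncharacterized proteins that pair a kinase domain with a
    sensor domain (PAS or chromophore-binding), mapped to their PAS count."""
    hits: Dict[str, int] = {}
    for name, domains in domains_by_protein.items():
        if name in characterized:
            continue  # already-known photoreceptors are excluded
        has_kinase = KINASE in domains
        has_sensor = PAS in domains or CHROMOPHORE in domains
        if has_kinase and has_sensor:
            hits[name] = domains.count(PAS)
    return hits


if __name__ == "__main__":
    # Toy architectures; "protA" to "protC" are placeholders, not real gene names.
    toy = {
        "NPH1": [PAS, PAS, KINASE],   # two LOV (PAS-superfamily) domains
        "protA": [PAS, KINASE],       # one PAS domain plus a kinase domain
        "protB": [KINASE],
        "protC": [PAS],
    }
    print(candidate_receptors(toy, characterized={"NPH1"}))
```

Reporting the PAS count alongside each hit makes it easy to separate single-PAS candidates from proteins with the two-domain arrangement seen in NPH1 and NPL1.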
Our screen included searches for components of photosynthetic
reaction centres and light-harvesting complexes, enzymes involved
in CO2 fixation and enzymes in pigment biosynthesis. We identified
11 core proteins of photosystem I, including the eukaryotic-specific
components PsaG and PsaH101, and 8 photosystem II proteins,
including a single member (psbW) of the photosystem II core. We
also found 26 proteins similar to the Chlorophyll-a/b binding
proteins (8 Lhca and 18 Lhcb). Of the seven subunits of the
cytochrome b6f complex (PetA–D, PetG, PetL, PetM), only one
(PetC) was found in the nuclear genome, whereas the remainder are
probably encoded in the chloroplast. Similarly, of the nine subunits
of the chloroplast ATP synthase complex, three are encoded in the
nucleus, including the II-, γ- and δ-subunits; the remaining
subunits (I, III, IV, α, β, ε) are encoded in the chloroplast102. Ten
genes were related to the soluble components of the electron transfer
chain, including two plastocyanins, five ferredoxins and three
ferredoxin/NADP oxidoreductases. Forty genes are predicted to
have a role in CO2 fixation, including all of the enzymes in the
Calvin-Benson cycle. For pigment biosynthesis, 16 genes in chlorophyll biosynthesis and 31 genes in carotenoid biosynthesis were
found (Supplementary Information Table 8). Our analyses have
identified several potential components of the light perception
pathway, and have revealed the complex distribution of components
of the photosynthetic apparatus between nuclear and plastid
genomes.
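The nuclear/plastid split of the cytochrome b6f and ATP synthase subunits listed above can be written out as a small set computation. The subunit names come from the text; the function name and the ASCII spellings of the Greek subunit names are assumptions, and the chloroplast assignment for b6f is the "probable" one stated above rather than a verified result.

```python
from typing import Dict, List, Set

# Subunits as listed in the text (PetA-D expanded to four names).
CYT_B6F: Set[str] = {"PetA", "PetB", "PetC", "PetD", "PetG", "PetL", "PetM"}
CYT_B6F_NUCLEAR: Set[str] = {"PetC"}

ATP_SYNTHASE_NUCLEAR: Set[str] = {"II", "gamma", "delta"}
ATP_SYNTHASE_CHLOROPLAST: Set[str] = {"I", "III", "IV", "alpha", "beta", "epsilon"}


def split_by_genome(all_subunits: Set[str], nuclear: Set[str]) -> Dict[str, List[str]]:
    """Partition a complex into nuclear-encoded subunits and the remainder."""
    unknown = nuclear - all_subunits
    if unknown:
        raise ValueError(f"nuclear subunits not in complex: {sorted(unknown)}")
    return {
        "nuclear": sorted(nuclear),
        "chloroplast": sorted(all_subunits - nuclear),
    }


if __name__ == "__main__":
    print(split_by_genome(CYT_B6F, CYT_B6F_NUCLEAR))
    atp_all = ATP_SYNTHASE_NUCLEAR | ATP_SYNTHASE_CHLOROPLAST
    print(len(atp_all), split_by_genome(atp_all, ATP_SYNTHASE_NUCLEAR))
```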
Metabolism
Arabidopsis is an autotrophic organism that needs only minerals,
light, water and air to grow. Consequently, a large proportion of the
genome encodes enzymes that support metabolic processes, such as
photosynthesis, respiration, intermediary metabolism, mineral
acquisition, and the synthesis of lipids, fatty acids, amino acids,
nucleotides and cofactors103. With respect to these processes,
Arabidopsis appears to contain a complement of genes similar to
those in the photoautotrophic cyanobacterium Synechocystis45, but,
whereas Synechocystis generally has a single gene encoding an enzyme,
Arabidopsis frequently has many. For example, Arabidopsis has at
least seven genes for the glycolytic enzyme pyruvate kinase, with an
additional five for pyruvate kinase-like proteins. Whatever the
reason for this high level of redundancy, it varies from gene to
gene in the same pathway; the 11 enzymes of glycolysis are encoded
by up to 51 genes that are present in as few as one or as many as eight
copies. Similarly, of the 59 genes encoding proteins involved in
glycerolipid metabolism, 39 are represented by more than one
gene104. Genome duplication and expansion of gene families by
tandem duplication have contributed to this diversity.
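As a rough worked example of the redundancy figures quoted above, the sketch below computes the average copy number implied for glycolysis and the share quoted for glycerolipid metabolism. The counts are taken from the text; the helper names are illustrative, and because the text gives "up to 51" genes the glycolysis mean is an upper bound.

```python
def mean_copies_per_enzyme(total_genes: int, enzymes: int) -> float:
    """Average number of genes per enzyme in a pathway."""
    if enzymes <= 0:
        raise ValueError("enzymes must be positive")
    return total_genes / enzymes


def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Counts quoted in the text.
GLYCOLYSIS_ENZYMES = 11
GLYCOLYSIS_GENES_MAX = 51          # "up to 51 genes"
GLYCEROLIPID_TOTAL = 59
GLYCEROLIPID_MULTIGENE = 39

if __name__ == "__main__":
    print(round(mean_copies_per_enzyme(GLYCOLYSIS_GENES_MAX, GLYCOLYSIS_ENZYMES), 1))
    print(round(percent(GLYCEROLIPID_MULTIGENE, GLYCEROLIPID_TOTAL)))
```

The mean of about 4.6 copies per glycolytic enzyme hides the stated spread of one to eight copies, which is why the text stresses that redundancy varies from gene to gene within a pathway.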
This high degree of apparent structural redundancy does not
necessarily imply functional redundancy. For instance, although
there are seven genes for serine hydroxymethyltransferase, a mutation in the gene for the mitochondrial form completely blocks the
photorespiratory pathway105. Although there are 12 genes for
cellulose synthase, mutations in at least 2 of the 12 confer distinct
phenotypes because of tissue-specific gene expression106.
The metabolome of Arabidopsis differs from that of cyanobacteria, or of any other organism sequenced to date, by the presence of
many genes encoding enzymes for pathways that are unique to
vascular plants. In particular, although relatively little is known
about the enzymology of cell-wall metabolism, more than 420 genes
could be assigned probable roles in pathways responsible for the
synthesis and modification of cell-wall polymers. Twelve genes
encode cellulose synthase, and 29 other genes encode 6 families of
structurally related enzymes thought to synthesize other major
polysaccharides106. Roughly 52 genes encode polygalacturonases,
20 encode pectate lyases and 79 encode pectin esterases, indicating a
massive investment in modifying pectin. Similarly, the presence of
39 β-1,3-glucanases, 20 endoxyloglucan transglycosylases, 50 cellulases and other hydrolases, and 23 expansins reflects the importance
of wall remodelling during growth of plant cells. Excluding ascorbate and glutathione peroxidases, there are 69 genes with significant
similarity to known peroxidases and 15 laccases (diphenol oxidases). Their presence in such abundance indicates the importance
of oxidative processes in the synthesis of lignin, suberin and other
cell-wall polymers.
The high degree of apparent redundancy in the genes for cell wall
metabolism might reflect differences in substrate specificity by some
of the enzymes. It is already known that cell types have different wall
compositions, which may require that the relevant enzymes be
subject to cell-type-specific transcriptional regulation. Of the 40 or
so cell types that plants make, almost all can be identified by unique
features of their cell wall107. A large number of genes involved in wall
metabolism have yet to be defined. Although more than 60 genes for
glycosyltransferases can be found in the genome sequence, most of
these are probably involved in protein glycosylation or metabolite
catabolism and do not seem to be adequate to account for the
polysaccharide complexity of the wall. For instance, at least 21
enzymes are required just to produce the linkages of the pectic
polysaccharide RGII, and none of these enzymes has been identified
at present. Thus, if these and related enzymes involved in the
synthesis of other cell-wall polymers are also represented by multiple genes, a substantial number of the genes of currently unknown
function may be involved in cell-wall metabolism.
Higher plants collectively synthesize more than 100,000 secondary metabolites. Because flowering plants are thought to have
similar numbers of genes, it is apparent that a great deal of
enzyme creation took place during the evolution of higher plants.
An important factor in the rapid evolution of metabolic complexity
is the large family of cytochrome P450s that are evident in
Arabidopsis (Supplementary Information Table 1). These enzymes
represent a superfamily of haem-containing proteins, most of which
catalyse NADPH- and O2-dependent hydroxylation reactions. Plant
P450s participate in myriad biochemical pathways including those
devoted to the synthesis of plant products, such as phenylpropanoids, alkaloids, terpenoids, lipids, cyanogenic glycosides and
glucosinolates, and plant growth regulators, such as gibberellins,
jasmonic acid and brassinosteroids. Whereas Arabidopsis has ~286
P450 genes, Drosophila has 94, C. elegans has 73 and yeast has only 3.
This low number in yeast indicates that there are few reactions of
basic metabolism that are catalysed by P450s. It seems likely that
many animal P450s are involved in detoxification of compounds
from food plant sources. The role of endogenous enzymes is poorly
understood; only a few dozen P450 enzymes from plants have been
characterized to any extent. The discrepancy between the number of
known P450-catalysed reactions and the number of genes suggests
that Arabidopsis produces a relatively large number of metabolites
that have yet to be identified.
In addition to the large number of cytochrome P450s, Arabidopsis
has many other genes that suggest the existence of pathways or
processes that are not currently known. For instance, the presence of
19 genes with similarity to anthranilate N-hydroxycinnamoyl/
benzoyl transferase is currently inexplicable. This enzyme is
involved in the synthesis of dianthramide phytoalexins in Caryophyllaceae and Gramineae. No phytoalexins of this class have been
described in Arabidopsis as yet. Similarly, the presence of 12 genes
with sequence similarity to the berberine bridge enzyme, ((S)-reticuline:oxygen oxidoreductase (methylene-bridge-forming); EC
1.5.3.9), and 13 genes with similarity to tropinone reductase,
suggests that Arabidopsis may have the ability to produce alkaloids.
In other plants, the berberine bridge enzyme transforms reticuline
into scoulerine, a biosynthetic precursor to a multitude of species-specific protopine, protoberberine and benzophenanthridine alkaloids. The discovery of these and many other intriguing genes in
the Arabidopsis genome has created a wealth of new opportunities
to understand the metabolic and structural diversity of higher
plants.
Concluding remarks
The twentieth century began with the rediscovery of Mendel's rules
of inheritance in pea108, and it ends with the elucidation of the
complete genetic complement of a model plant, Arabidopsis. The
analysis of the completed sequence of a flowering plant reported
here provides insights into the genetic basis of the similarities and
differences of diverse multicellular organisms. It also creates the
potential for direct and efficient access to a much deeper understanding of plant development and environmental responses, and
permits the structure and dynamics of plant genomes to be assessed
and understood.
Arabidopsis, C. elegans and Drosophila have a similar range of
11,000–15,000 different types of proteins, suggesting this is the
minimal complexity required by extremely diverse multicellular
eukaryotes to execute development and respond to their environment. We account for the larger number of gene copies in
Arabidopsis compared with these other sequenced eukaryotes with
two possible explanations. First, independent amplification of
individual genes has generated tandem and dispersed gene families
to a greater extent in Arabidopsis, and unequal crossing over may be
the predominant mechanism involved. Second, ancestral duplication of the entire genome and subsequent rearrangements have
resulted in segmental duplications. The pattern of these duplications suggests an ancient polyploidy event, and mutant analysis
indicates that at least some of the many duplicate genes are
functionally redundant. Their occurrence in a functionally diploid
genetic model came as a surprise, and is reminiscent of the situation
in maize, an ancient segmental allotetraploid. The remarkable
degree of genome plasticity revealed in the large-scale duplications
may be needed to provide new functions, as alternative promoters
and alternative splicing appear to be less widely used in plants than
they are in animals. Apart from duplicated segments, the overall
chromosome structure of Arabidopsis closely resembles that of
Drosophila; transposons and other repetitive sequences are concentrated in the heterochromatic regions surrounding the centromere,
whereas the euchromatic arms are largely devoid of repetitive
sequences. Conversely, most protein-coding genes reside in the
euchromatin, although a number of expressed genes have been
identified in centromeric regions. Finally, Arabidopsis is the first
methylated eukaryotic genome to be sequenced, and will be invaluable in the study of epigenetic inheritance and gene regulation.
Unlike most animals, plants generally do not move, they can
perpetuate indefinitely, they reproduce through an extended haploid phase, and they synthesize all their metabolites. Our comparison of Arabidopsis, bacterial, fungal and animal genomes starts to
define the genetic basis for these differences between plants and
other life forms. Basic intracellular processes, such as translation or
vesicle trafficking, appear to be conserved across kingdoms, reflecting a common eukaryotic heritage. More elaborate intercellular
processes, including physiology and development, use different sets
of components. For example, membrane channels, transporters and
signalling components are very different in plants and animals, and
the large number of transcription factors unique to plants contrasts
with the conservation of many chromatin proteins across the three
eukaryotic kingdoms. Unexpected differences between seemingly
similar processes include the absence of intracellular regulators of
cell division (Cdc25) and apoptosis (Bcl-2). On the other hand,
DNA repair appears more highly conserved between plants and
mammals than within the animal kingdom, perhaps reflecting
common factors such as DNA methylation. Our analysis also
shows that many genes of the endosymbiotic ancestor of the plastid
have been transferred to the nucleus, and the products of this rich
prokaryotic heritage contribute to diverse functions such as photoautotrophic growth and signalling.
The sequence reported here changes the fundamental nature of
plant genetic analysis. Forward genetics is greatly simplified as
mutations are more conveniently isolated molecularly, but at the
same time extensive gene duplications mean that functional redundancy must be taken into account. At a biochemical level, the
specificity conferred by nucleotide sequence, and the completeness
of the survey allow complex mixtures of RNA and protein to be
resolved into their individual components using micro-arrays and
mass spectrometry. This specificity can also be used in the parallel
analysis of genome-wide polymorphisms and quantitative traits in
natural populations109. Looking ahead, the challenge of determining
the function of the large set of predicted genes, many of which are
plant-speci®c, is now a clear priority, and multinational programs
have been initiated to accomplish this goal using site-selected
mutagenesis among the necessary tools110. Finally, productive
paths of crop improvement, based on enhanced knowledge of
Arabidopsis gene function, will help meet the challenge of sustaining
our food supply in the coming years.
Note added in proof: at the time of publication 17 centromeric BACs
and 5 sequence gaps in chromosome arms are being sequenced.
Methods
The three centres used similar annotation approaches involving in silico gene-finding
methods, comparison to EST and protein databases, and manual reconciliation of those
data. Gene finding involved three steps: (1) analysis of BAC sequences using a computational gene finder; (2) alignment of the sequence to the protein and EST databases; (3)
assignment of functions to each of the genes. Genscan111, GeneMark.HMM112, Xgrail113,
Genefinder (P. Green, unpublished software) and GlimmerA114 were used to analyse BAC
sequences. All of these systems were specially trained for Arabidopsis genes. Splice sites
were predicted using NetGene2 (ref. 115), Splice Predictor116 and GeneSplicer (M. Pertea and
S. Salzberg, unpublished software). For the second step, BACs were aligned to ESTs and to
the Arabidopsis gene index117 using programs such as DDS/GAP2 (ref. 118) or BLASTN119.
Segmental duplications were analysed and displayed using a modified version of
DIALIGN2 (ref. 120).
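The first two annotation steps described above (computational gene finding, then comparison with alignment evidence) can be pictured with a minimal sketch. Everything here is a simplifying assumption: the `GeneModel` record, the 1-based inclusive coordinates, the overlap rule and the finder labels are hypothetical, and the code does not reproduce the centres' pipelines or model the manual reconciliation and function-assignment steps.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive; an assumed convention


@dataclass(frozen=True)
class GeneModel:
    finder: str   # label of the gene finder that produced the model (hypothetical)
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two closed intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def summarize_evidence(models: List[GeneModel],
                       est_hits: List[Interval]) -> List[Dict[str, object]]:
    """For each predicted model, list the finders whose models overlap it
    (including its own) and whether any EST alignment overlaps it."""
    report = []
    for m in models:
        span = (m.start, m.end)
        agreeing = sorted({o.finder for o in models
                           if overlaps(span, (o.start, o.end))})
        est_supported = any(overlaps(span, hit) for hit in est_hits)
        report.append({
            "finder": m.finder,
            "span": span,
            "finders_agreeing": agreeing,
            "est_supported": est_supported,
        })
    return report


if __name__ == "__main__":
    # Toy coordinates on a single BAC; not real predictions.
    models = [
        GeneModel("finder_A", 100, 500),
        GeneModel("finder_B", 120, 480),
        GeneModel("finder_A", 900, 1200),
    ]
    for row in summarize_evidence(models, est_hits=[(450, 520)]):
        print(row)
```

A summary of this kind would only flag models for review; models supported by several finders and by EST evidence are natural starting points for the manual reconciliation that the text describes.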
Received 20 October; accepted 15 November 2000.
1. The C. elegans Sequencing Consortium. Sequence and analysis of the genome of C. elegans. Science
282, 2012–2018 (1998).
2. Adams, M. D. et al. The genome sequence of Drosophila melanogaster. Science 287, 2185–2195 (2000).
3. Meinke, D. W., Cherry, J. M., Dean, C., Rounsley, S. D. & Koornneef, M. Arabidopsis thaliana: a
model plant for genome analysis. Science 282, 662–665 (1998).
4. Lin, X. et al. Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature 402,
761–768 (1999).
5. Mayer, K. et al. Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana. Nature 402,
769–777 (1999).
6. Theologis, A. et al. Sequence and analysis of chromosome 1 of the plant Arabidopsis thaliana. Nature
408, 816–820 (2000).
7. Salanoubat, M. et al. Sequence and analysis of chromosome 3 of the plant Arabidopsis thaliana.
Nature 408, 820–822 (2000).
8. Tabata, S. et al. Sequence and analysis of chromosome 5 of the plant Arabidopsis thaliana. Nature
408, 820–822 (2000).
9. Choi, S. D., Creelman, R., Mullet, J. & Wing, R. A. Construction and characterisation of a bacterial
artificial chromosome library from Arabidopsis thaliana. Weeds World 2, 17–20 (1995).
10. Mozo, T., Fischer, S., Shizuya, H. & Altmann, T. Construction and characterization of the IGF
Arabidopsis BAC library. Mol. Gen. Genet. 258, 562–570 (1998).
11. Liu, Y.-G., Mitsukawa, N., Vazquez-Tello, A. & Whittier, R. F. Generation of a high-quality P1 library
of Arabidopsis suitable for chromosome walking. Plant J. 7, 351–358 (1995).
12. Liu, Y.-G. et al. Complementation of plant mutants with large genomic DNA fragments by a
transformation-competent artificial chromosome vector accelerates positional cloning. Proc. Natl
Acad. Sci. USA 96, 6535–6540 (1999).
13. Marra, M. et al. A map for sequence analysis of the Arabidopsis thaliana genome. Nature Genet. 22,
265–270 (1999).
14. Mozo, T. et al. A complete BAC-based physical map of the Arabidopsis thaliana genome. Nature
Genet. 22, 271–275 (1999).
15. Sato, S. et al. Structural analysis of Arabidopsis thaliana chromosome 5. I. Sequence features of the
1.6 Mb regions covered by twenty physically assigned P1 clones. DNA Res. 4, 215–230 (1997).
16. Bent, E., Johnson, S. & Bancroft, I. BAC representation of two low-copy regions of the genome of
Arabidopsis thaliana. Plant J. 13, 849–855 (1998).
17. Copenhaver, G. P. et al. Genetic definition and sequence analysis of Arabidopsis centromeres. Science
286, 2468–2474 (1999).
18. Meyerowitz, E. M. & Somerville, C. R. Arabidopsis (Cold Spring Harbor Laboratory Press, Cold
Spring Harbor, New York, 1994).
19. Lowe, T. M. & Eddy, S. R. tRNAscan-SE: A program for improved detection of transfer RNA genes in
genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).
20. Pavy, N. et al. Evaluation of gene prediction software using a genomic data set: application to
Arabidopsis thaliana sequences. BioInformatics 15, 887–900 (1999).
21. Mewes, H. W. et al. Overview of the yeast genome. Nature 387 (Suppl.), 7–65 (1997).
22. Frishman, D. et al. Functional and structural genomics using PEDANT. BioInformatics (in the press).
23. Blattner, F. R. et al. The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1462
(1997).
24. Kotani, H. & Tabata, S. Lessons from the sequencing of the genome of a unicellular cyanobacterium,
Synechocystis sp. PCC6803. Annu. Rev. Plant Physiol. Plant Mol. Biol. 49, 151–171 (1998).
25. Apweiler, R. et al. INTERPRO (http://www.ebi.ac.uk/interpro/). Collaborative Computer Project
11 Newsletter no. 10 (Cambridge, 2000).
26. Bent, A. F. et al. RPS2 of Arabidopsis thaliana: a leucine-rich repeat class of plant disease resistance
genes. Science 265, 1856–1860 (1994).
27. Skowyra, D. et al. F box proteins are receptors that recruit phosphorylated substrates to the SCF
ubiquitin-ligase complex. Cell 91, 209–219 (1997).
28. Joazeiro, C. A. P. & Weissman, A. M. RING finger proteins: mediators of ubiquitin ligase activity. Cell
102, 549–552 (2000).
29. Rubin, G. M. et al. Comparative genomics of the eukaryotes. Science 287, 2204–2215 (2000).
30. Delcher, A. L. et al. Alignment of whole genomes. Nucleic Acids Res. 27, 2369–2376 (1999).
31. Blanc, G. et al. Extensive duplication and reshuffling in the Arabidopsis genome. Plant Cell 12,
1093–1102 (2000).
32. Wendel, J. F. Genome evolution in polyploids. Plant Mol. Biol. 42, 225–249 (2000).
33. Gaut, B. S. & Doebley, J. F. DNA sequence evidence for the segmental allotetraploid origin of maize.
Proc. Natl Acad. Sci. USA 94, 6809–6814 (1997).
34. Ku, H.-M., Vision, T., Liu, J. & Tanksley, S. D. Comparing sequenced segments of the tomato and
Arabidopsis genomes: Large-scale duplication followed by selective gene loss creates a network of
synteny. Proc. Natl Acad. Sci. USA 97, 9121–9126 (2000).
35. Noel, L. et al. Pronounced intraspecific haplotype divergence at the RPP5 complex disease resistance
locus of Arabidopsis. Plant Cell 11, 2099–2111 (1999).
36. Ellis, J., Dodds, P. & Pryor, T. Structure, function, and evolution of plant disease resistance genes.
Trends Plant Sci. 3, 278–284 (2000).
37. Tanksley, S. D. et al. High density molecular linkage maps of the tomato and potato genomes.
Genetics 132, 1141–1160 (1992).
38. Moore, G., Devos, K. M., Wang, Z. & Gale, M. D. Grasses, line up and form a circle. Curr. Biol. 5,
737–739 (1995).
39. Acarkan, A., Rossberg, M., Koch, M. & Schmidt, R. Comparative genome analysis reveals extensive
conservation of genome organisation for Arabidopsis thaliana and Capsella rubella. Plant J. 23,
55–62 (2000).
40. Cavell, A., Lydiate, D., Parkin, I., Dean, C. & Trick, M. A 30 centimorgan segment of Arabidopsis
thaliana chromosome 4 has six collinear homologues within the Brassica napus genome. Genome 41,
62–69 (1998).
41. O'Neill, C. & Bancroft, I. Comparative physical mapping of segments of the genome of Brassica
oleracea var alboglabra that are homologous to sequenced regions of the chromosomes 4 and 5 of
Arabidopsis thaliana. Plant J. 23, 233–243 (2000).
42. Wolfe, K. H., Gouy, M., Yang, Y.-W., Sharp, P. M. & Li, W.-H. Date of the monocot-dicot divergence
estimated from the chloroplast DNA sequence data. Proc. Natl Acad. Sci. USA 86, 6201–6205 (1989).
43. van Dodeweerd, A.-M. et al. Identification and analysis of homologous segments of the genomes of
rice and Arabidopsis thaliana. Genome 42, 887–892 (1999).
44. Mayer, K. Sequence level analysis of homologous segments of the genomes of rice and Arabidopsis
thaliana. Genome Res. (submitted).
45. Sato, S. Complete structure of the chloroplast genome of Arabidopsis thaliana. DNA Research 6,
283–290 (1999).
46. Unseld, M., Marienfeld, J., Brandt, P. & Brennicke, A. The mitochondrial genome in Arabidopsis
thaliana contains 57 genes in 366,924 nucleotides. Nature Genet. 15, 57–61 (1997).
47. Palmer, J. D. et al. Dynamic evolution of plant mitochondrial genomes: mobile genes and introns
and highly variable mutation rates. Proc. Natl Acad. Sci. USA 97, 6960–6966 (2000).
48. Le, Q.-H. et al. Transposon diversity in Arabidopsis thaliana. Proc. Natl Acad. Sci. USA 97,
7376–7381 (2000).
49. Yu, Z., Wright, S. & Bureau, T. Mutator-like elements (MULEs) in Arabidopsis thaliana: Structure,
diversity and evolution. Genetics (in the press).
50. Feschotte, C. & Mouches, C. Evidence that a family of miniature inverted-repeat transposable
elements (MITEs) from the Arabidopsis thaliana genome has arisen from a pogo-like DNA
transposon. Mol. Biol. Evol. 17, 730–737 (2000).
51. Martienssen, R. Transposons, DNA methylation and gene control. Trends Genet. 14, 263–264
(1998).
52. Singer, T., Yordan, C. & Martienssen, R. Robertson's Mutator transposons in Arabidopsis are
regulated by the chromatin-remodeling gene Decrease in DNA Methylation (DDM1). Genes Dev. (in
the press).
53. SanMiguel, P. et al. Nested retrotransposons in the intergenic regions of the maize genome. Science
274, 765–768 (1996).
54. Copenhaver, G. P. & Pikaard, C. S. Two-dimensional RFLP analyses reveal megabase-sized clusters of
rRNA gene variants in Arabidopsis thaliana, suggesting local spreading of variants as the mode for
gene homogenization during concerted evolution. Plant J. 9, 273–282 (1996).
55. Fransz, P. et al. Cytogenetics for the model system Arabidopsis thaliana. Plant J. 13, 867–876 (1998).
56. Richards, E. J. & Ausubel, F. M. Isolation of a higher eukaryotic telomere from Arabidopsis thaliana.
Cell 53, 127–136 (1988).
57. Round, E. K., Flowers, S. K. & Richards, E. J. Arabidopsis thaliana centromere regions: genetic map
positions and repetitive DNA structure. Genome Res. 7, 1045–1053 (1997).
58. The CSHL/WUGSC/PEB Arabidopsis Sequencing Consortium. The complete sequence of a
heterochromatic island from a higher eukaryote. Cell 100, 377–386 (2000).
59. Fransz, P. F. et al. Integrated cytogenetic map of chromosome arm 4S of A. thaliana: Structural
organization of heterochromatic knob and centromere region. Cell 100, 367–376 (2000).
60. Paulsen, I. T., Nguyen, L., Sliwinski, M. K., Rabus, R. & Saier, M. H. Jr Microbial genome analyses:
comparative transport capabilities in eighteen prokaryotes. J. Mol. Biol. 301, 75–101 (2000).
61. Paulsen, I. T., Sliwinski, M. K., Nelissen, B., Goffeau, A. & Saier, M. H. Jr Unified inventory of
established and putative transporters encoded within the complete genome of Saccharomyces
cerevisiae. FEBS Lett. 430, 116–125 (1998).
62. Hirsch, R. E., Lewis, B. D., Spalding, E. P. & Sussman, M. R. A role for the AKT1 potassium channel in
plant nutrition. Science 280, 918–921 (1998).
63. Slayman, C. L. & Slayman, C. W. Depolarization of the plasma membrane of Neurospora during
active transport of glucose: evidence for a proton-dependent cotransport system. Proc. Natl Acad.
Sci. USA 71, 1935–1939 (1974).
64. Ryan, C. A. & Pearce, G. Systemin: a polypeptide signal for plant defensive genes. Annu. Rev. Cell.
Dev. Biol. 14, 1–17 (1998).
65. Eisen, J. A. & Hanawalt, P. C. A phylogenomic study of DNA repair genes, proteins, and processes.
Mutat. Res. 435, 171–213 (1999).
66. Selby, C. P. & Sancar, A. Structure and function of transcription-repair coupling factor. Structural
domains and binding properties. J. Biol. Chem. 270, 4882–4889 (1995).
67. Britt, A. B. Molecular genetics of DNA repair in higher plants. Trends Plant Sci. 4, 20–25 (1999).
68. Dangl, M. Response to Aravind, L. & Koonin, E. V. Second Family of Histone Deacetylases. Science
280, 1167 (1998).
69. Cao, X. et al. Conserved plant genes with similarity to mammalian de novo DNA methyltransferases.
Proc. Natl Acad. Sci. USA 97, 4979–4984 (2000).
70. Henikoff, S. & Comai, L. A DNA methyltransferase homologue with a chromodomain exists in
multiple polymorphic forms in Arabidopsis. Genetics 149, 307–318 (1998).
71. Hung, M.-S. et al. Drosophila proteins related to vertebrate DNA (5-cytosine) methyltransferases.
Proc. Natl Acad. Sci. USA 96, 11940–11945 (1999).
72. Dalmay, T., Hamilton, A. J., Rudd, S., Angell, S. & Baulcombe, D. C. An RNA-dependent-RNA
polymerase in Arabidopsis is required for post transcriptional gene silencing mediated by a transgene
but not by a virus – the truth. Cell 101, 543–553 (2000).
73. Riechmann, J. L. & Ratcliffe, O. J. A genomic perspective on plant transcription factors. Curr. Opin.
Plant Biol. 3, 423–434 (2000).
74. Liljegren, S. J. et al. SHATTERPROOF MADS-box genes control seed dispersal in Arabidopsis.
Nature 404, 766–770 (2000).
75. Pelaz, S. et al. B and C floral organ identity functions require SEPALLATA MADS-box genes. Nature
405, 200–203 (2000).
76. Canaday, J., Stoppin-Mellet, V., Mutterer, J., Lambert, A. M. & Schmit, A. C. Higher plant cells:
gamma-tubulin and microtubule nucleation in the absence of centrosomes. Microsc. Res. Technol.
49, 487–495 (2000).
77. Bassham, D. C. & Raikhel, N. V. Unique features of the plant vacuolar sorting machinery. Curr. Opin.
Cell Biol. 12, 491–495 (2000).
78. Zheng, Z. L. & Yang, Z. The Rop GTPase switch turns on polar growth in pollen. Trends Plant Sci. 5,
298–303 (2000).
79. den Boer, B. G. & Murray, J. A. Triggering the cell cycle in plants. Trends Cell Biol. 10, 245–250
(2000).
80. Heese, M., Mayer, U. & Jurgens, G. Cytokinesis in flowering plants: cellular process and
developmental integration. Curr. Opin. Plant Biol. 1, 486–491 (1998).
81. Meyerowitz, E. M. Plants, animals, and the logic of development. Trends Genet. 15, M65–M68
(1999).
82. Wang, D. Y. C. et al. Divergence time estimates for the early history of animal phyla and the origin of
plants, animals and fungi. Proc. R. Soc. Lond. B Bio. 266, 163–171 (1999).
83. Torii, K. Receptor kinase activation and signal transduction in plants: an emerging picture. Curr.
Opin. Plant Biol. 3, 362–367 (2000).
84. Yeh, K. C. & Lagarias, J. C. Eukaryotic phytochromes: Light-regulated serine/threonine protein
kinases with histidine kinase ancestry. Proc. Natl Acad. Sci. USA 95, 13976–13981 (1998).
85. McCarty, D. R. & Chory, J. Conservation and innovation in plant signaling pathways. Cell 103,
201–211 (2000).
86. Weiss, C. A., Garnaat, C., Mukai, K., Hu, Y. & Ma, H. Molecular cloning of cDNAs from maize and
Arabidopsis encoding a G protein beta subunit. Proc. Natl Acad. Sci. USA 91, 9554–9558 (1994).
87. Bowler, C. et al. Cyclic GMP and calcium mediate phytochrome phototransduction. Cell 77, 73–81
(1994).
88. Stepanova, A. & Ecker, J. R. Ethylene signaling: from mutants to molecules. Curr. Opin. Plant Biol. 3,
353–360 (2000).
89. Urao, T., Yamaguchi-Shinozaki, K. & Shinozaki, K. Two-component systems in plant signal
transduction. Trends Plant Sci. 5, 67–74 (2000).
90. Makino, S. et al. Genes encoding pseudo-response regulators: Insight into His-to-Asp phosphorelay
and circadian rhythm in Arabidopsis thaliana. Plant Cell Physiol. 41, 791–803 (2000).
91. D'Agostino, I. B. & Kieber, J. J. Phosphorelay signal transduction: the emerging family of plant
response regulators. Trends Biol. Sci. 24, 452±456 (1999).
92. Strayer, C. et al. Cloning of the Arabidopsis clock gene TOC1, an autoregulatory response regulator
homologue. Science 289, 768±771 (2000).
93. Stahl, E. A. & Bishop, J. G. Plant-Pathogen arms races at the molecular level. Curr. Opin. Plant Biol. 3,
299±304 (2000).
94. McDowell, J. M. & Dangl, J. L. Signal transduction in the plant innate immune response. Trends
Biochem. Sci. 25, 79±82 (2000).
95. Van der Biezen, E. A. & Jones, J. D. Plant disease-resistance proteins and the gene-for-gene concept.
Trends Biochem Sci. 23, 454±456 (1998).
96. Belvin, M. P. & Anderson, K. V. A conserved signaling pathway: the Drosophila toll-dorsal pathway.
Annu. Rev. Cell. Dev. Biol. 12, 393±416 (1996).
97. Uren, A. G. et al. Identi®cation of paracaspases and metacaspases: Two ancient families of caspaselike proteins, one of which plays a key role in MALT lymphoma. Mol. Cell 6, 961±967 (2000).
98. Fankhauser, C. & Chory, J. Light control of plant development. Annu. Rev. Cell. Dev. Biol. 13, 203±
229 (1997).
99. Briggs, W. R. & Huala, E. Blue-light photoreceptors in higher plants. Annu. Rev. Cell. Dev. Biol. 15,
33±62 (1999).
100. Christie, J. M., Salomon, M., Nozue, K., Wada, M. & Briggs, W. R. LOV (light, oxygen, or voltage)
domains of the blue-light photoreceptor phototropin (nph1): binding sites for the chromophore
¯avin mononucleotide. Proc. Natl Acad. Sci. USA 96, 8779±8783 (1999).
101. Golbeck, J. H. Structure and function of photosystem I. Annu. Rev. Plant Physiol. Plant Mol. Biol. 43,
293±324 (1992).
102. Maier, R. M., Neckermann, K., Igloi, G. L. & Kossel, H. Complete sequence of the maize chloroplast
genome: gene content, hotspots of divergence and ®ne tuning of genetic information by transcript
editing. J. Mol. Biol. 251, 614±28 (1995).
103. Buchanan, B. B., Gruissem, W. & Jones, R. L. in Biochemistry and Molecular Biology of Plants 1367
(Am. Soc. Plant Physiol., Rockville, Maryland, 2000).
104. Mekhedov, S., MartõÂnez de IlaÂrduya, O. & Ohlrogge, J. Toward a functional catalog of the plant
genome. A survey of genes for lipid biosynthesis. Plant Physiol. 122, 389±401 (2000).
105. Somerville, C. R., & Ogren, W. L. Photorespiration de®cient mutants of Arabidopsis thaliana lacking
mitochrondrial serine transhydroxymethylase activity. Plant Physiol. 67, 666±671 (1981).
106. Richmond, T., & Somerville, C. R. The cellulose synthase superfamily. Plant Physiol 124, 495±499
(1999).
107. Carpita, N. Vergara C: A recipe for cellulose. Science 279, 672±673 (1998).
108. De Vries, H. Sur la loi de disjonction des hybrides. C. R. Acad. Sci. Paris 130, 845±847 (1900).
109. Alonso-Blanco, C. & Koornneef, M. Naturally occurring variation in Arabidopsis: an underexploited
resource for plant genetics. Trends Plant Sci. 5, 1360±1385 (1999).
110. Chory, J. Functional genomics and the virtual plant. A blueprint for understanding how plants are
built and how to improve them. Plant Physiology 123, 423±425 (2000).
111. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol.
268, 78±94 (1997).
112. Lukashin, A. V. & Borodovsky, M. GeneMark.hmm: new solutions for gene ®nding. Nucleic Acids
Res. 26, 1107±1115 (1998).
113. Uberbacher, E. C. & Mural, R. J. Locating protein-coding regions in human DNA sequences by a
multiple sensor-neural network approach. Proc. Natl Acad. Sci. USA 88, 11261±11265 (1991).
114. Salzberg, S. L., Pertea, M., Delcher, A. L., Gardner, M. J. & Tettelin, H. Interpolated Markov models
for eukaryotic gene ®nding. Genomics 59, 24±31 (1999).
115. Hebsgaard, S. M. et al. Splice site prediction in Arabidopsis thaliana DNA by combining local and
global sequence information. Nucleic Acids Res. 24, 3439±3452 (1996).
116. Brendel, V. & Kleffe, J. Prediction of locally optimal splice sites in plant pre-mRNA with applications
to gene identi®cation in Arabidopsis thaliana genomic DNA. Nucleic Acids Res. 26, 4748±4757
(1998).
117. Quackenbush, J., Liang, F., Holt, I., Pertea, G. & Upton, J. The TIGR gene indices: reconstruction and
representation of expressed gene sequences. Nucleic Acids Res. 28, 141±145 (2000).
118. Huang, X., Adams, M. D., Zhou, H. & Kerlavage, A. R. A tool for analyzing and annotating genomic
sequences. Genomics 46, 37±45 (1997).
119. Altschul, S. F. et al. Basic local alignment search tool. J. Mol. Biol. 215, 403±410 (1990).
120. Morgenstern, B. DIALIGN2: improvement of the segment-to-segment approach to multiple
sequence alignment. BioInformatics 15, 211±218 (1999).
121. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. SCOP: a structural classi®cation of proteins
database for the investigation of sequences and structures. J. Mol. Biol. 247, 536±540 (1995).
122. Emanuelsson, O., Nielsen, H., Brunak, S. & von Heijne, G. Predicting subcellular localization of
proteins based on their N-terminal amino acid sequence. J. Mol. Biol. 300, 1005±1016 (2000).
Supplementary information is available on Nature's World-Wide Web site
(http://www.nature.com) or as paper copy from the London editorial office of Nature.
Acknowledgements
This work was supported by the National Science Foundation (NSF) Cooperative
Agreements (funded by the NSF, the US Department of Agriculture (USDA) and the US
Department of Energy (DOE)), the Kazusa DNA Research Institute Foundation, and by
the European Commission. Additional support from the USDA, Ministère de la
Recherche, GSF-Forschungszentrum f. Umwelt u. Gesundheit, BMBF (Bundesministerium f. Bildung, Forschung und Technologie), the BBSRC (Biotechnology and Biological
Sciences Research Council) and the Plant Research International, Wageningen, is also
gratefully acknowledged. The authors wish to thank E. Magnien, D. Nasser and J. D.
Watson for their continual support and encouragement.
Correspondence and requests for materials should be addressed to The Arabidopsis
Genome Initiative (e-mail: [email protected] or [email protected]).
The Arabidopsis Genome Initiative
Three groups contributed to the work reported here. The Genome Sequencing groups,
arranged here in order of sequence contribution, sequenced and annotated assigned
chromosomal regions. The Genome Analysis group carried out the analyses described.
The Contributing Authors interpreted the genome analyses, incorporating other data and
analyses, with respect to selected biological topics.
Genome Sequencing Groups
Samir Kaul, Hean L. Koo, Jennifer Jenkins, Michael Rizzo, Timothy Rooney, Luke J. Tallon, Tamara Feldblyum, William Nierman,
Maria-Ines Benito, Xiaoying Lin, Christopher D. Town, J. Craig Venter & Claire M. Fraser
The Institute for Genomic Research, 9712 Medical Centre Drive, Rockville, Maryland 20850, USA
Satoshi Tabata, Yasukazu Nakamura, Takakazu Kaneko, Shusei Sato, Erika Asamizu, Tomohiko Kato, Hirokazu Kotani &
Shigemi Sasamoto
Kazusa DNA Research Institute, 1532-3 Yana, Kisarazu, Chiba 292, Japan
Joseph R. Ecker1*†, Athanasios Theologis2*, Nancy A. Federspiel3*†, Curtis J. Palm3, Brian I. Osborne2, Paul Shinn1,
Aaron B. Conway3, Valentina S. Vysotskaia2, Ken Dewar1, Lane Conn3, Catherine A. Lenz2, Christopher J. Kim1, Nancy F. Hansen3,
Shirley X. Liu2, Eugen Buehler1, Hootan Altafi3, Hitomi Sakano2, Patrick Dunn1, Bao Lam3, Paul K. Pham2, Qimin Chao1, Michelle Nguyen3, Guixia
Yu2, Huaming Chen1, Audrey Southwick3, Jeong Mi Lee2, Molly Miranda3, Mitsue J. Toriumi2 & Ronald W. Davis3
1, Plant Science Institute, Department of Biology, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA; 2, Plant Gene Expression
Center/USDA-U.C.Berkeley, 800 Buchanan Street, Albany, California 94710, USA; 3, Stanford Genome Technology Center, 855 California Avenue, Palo Alto, California
94304, USA. * These authors contributed equally to this work. † Present addresses: The Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla,
California 92037, USA (J.R.E.); Exelixis, Inc., 170 Harborway, P.O. Box 511, South San Francisco, California 94083-0511, USA (N.A.F.)
European Union Chromosome 4 and 5 Sequencing Consortium: R. Wambutt1, G. Murphy2, A. Düsterhöft3, W. Stiekema4, T. Pohl5,
K.-D. Entian6, N. Terryn7 & G. Volckaert8
1, AGOWA GmbH, Glienicker Weg 185, D-12489 Berlin, Germany; 2, John Innes Centre, Colney Lane, Norwich NR4 7UH, UK; 3, QIAGEN GmbH, Max-Volmer-Str. 4,
D-40724 Hilden, Germany; 4, Greenomics, Plant Research International, Droevendaalsesteeg 1, NL 6700, AA Wageningen, The Netherlands; 5, GATC GmbH, Fritz-Arnold-Strasse 23, D-78467 Konstanz, Germany; 6, SRD GmbH, Oberurseler Str. 43, Oberursel 61440, Germany; 7, Department for Plant Genetics, (VIB), University of
Gent, K.L. Ledeganckstraat 35, B-9000 Gent, Belgium; 8, Katholieke Universiteit Leuven, Laboratory of Gene Technology, Kardinaal Mercierlaan 92, B-3001 Leuven,
Belgium
European Union Chromosome 3 Sequencing Consortium: M. Salanoubat1, N. Choisne1, M. Rieger2, W. Ansorge3, M. Unseld4,
B. Fartmann5, G. Valle6, F. Artiguenave1, J. Weissenbach1 & F. Quetier1
1, Genoscope and CNRS FRE2231, 2 rue G. Crémieux, 91057 Evry Cedex, France; 2, Genotype GmbH Angelhofweg 39, D-69259 Wilhelmsfeld, Germany; 3, European
Molecular Biology Laboratory, Biochemical Instrumentation Program, Meyerhofstr. 1, D-69117 Heidelberg, Germany; 4, LION Bioscience AG, Im Neuenheimer Feld
515-517, 69120 Heidelberg, Germany; 5, MWG-Biotech AG, Anzinger Strasse 7a, 85560 Ebersberg, Germany; 6, CRIBI, Università di Padova, via G. Colombo 3, Padova
35131, Italy
The Cold Spring Harbor and Washington University Genome Sequencing Center Consortium: Richard K. Wilson1, Melissa de la Bastide2,
M. Sekhon1, Emily Huang2, Lori Spiegel2, Lidia Gnoj2, K. Pepin1, J. Murray1, D. Johnson1, Kristina Habermann2, Neilay Dedhia2,
Larry Parnell2, Raymond Preston2, L. Hillier1, Ellson Chen3, M. Marra2, Robert Martienssen4 & W. Richard McCombie2
1, Washington University Genome Sequencing Center, Washington University in St Louis School of Medicine, 4444 Forest Park Blvd., St. Louis, Missouri 63108 USA;
2, Lita Annenberg Hazen Genome Center, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA; 3, Celera Genomics, 850 Lincoln Center Drive,
Foster City, California 94494, USA; 4, Plant Biology Group, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
Genome Analysis Group
Klaus Mayer1*, Owen White2*, Michael Bevan3, Kai Lemcke1, Todd H. Creasy2, Cord Bielke2, Brian Haas1, Dirk Haase1, Rama Maiti2,
Stephen Rudd1, Jeremy Peterson2, Heiko Schoof1, Dimitrij Frishman1, Burkhard Morgenstern1, Paulo Zaccaria1, Maria Ermolaeva2, Mihaela
Pertea2, John Quackenbush2, Natalia Volfovsky2, Dongying Wu2, Todd M. Lowe4, Steven L. Salzberg2 & Hans-Werner Mewes1
1, GSF-Forschungszentrum f. Umwelt u. Gesundheit, Munich Information Center for Protein Sequences, am Max-Planck-Institut f. Biochemie, Am Klopferspitz 18a,
D-82152, Germany; 2, The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA; 3, Molecular Genetics Department, John Innes
Centre, Colney Lane, Norwich NR4 7UH, UK; 4, Dept Genetics, Stanford University Medical School, Stanford, California 94305-5120, USA. * These authors contributed
equally to this work
Contributing Authors
Comparative analysis of the genomes of A. thaliana accessions. S. Rounsley, D. Bush, S. Subramaniam, I. Levin & S. Norris
Cereon Genomics LLC, 45 Sidney St, Cambridge, Massachusetts 02139, USA
Comparative analysis of the genomes of A. thaliana and other genera. R. Schmidt1, A. Acarkan1 & I. Bancroft2
1, Max-Delbrück-Laboratorium in der Max-Planck-Gesellschaft, Carl-von-Linné-Weg 10, 50829 Cologne, Germany; 2, Brassicas and Oilseeds Research Department,
John Innes Centre, Norwich NR4 7UJ, UK
Integration of the three genomes in the plant cell: the extent of protein and nucleic acid traffic between nucleus, plastids and
mitochondria. F. Quetier1, A. Brennicke2 & J. A. Eisen3.
1, Genoscope, Centre Nationale de Sequencage, 2 rue Gaston Cremieux, CP 5706, 91057 Evry Cedex, France; 2, Molekulare Botanik, Universität Ulm, 89069 Ulm,
Germany; 3, The Institute for Genomic Research, 9712 Medical Centre Drive, Rockville, Maryland 20850, USA
Transposable elements. T. Bureau1, B.-A. Legault1, Q.-H. Le1, N. Agrawal1, Z. Yu1 & R. Martienssen2
1, McGill University, Dept of Biology, 1205 rue Dr Penfield, Montreal, Quebec, H3A 1B1, Canada; 2, Plant Biology Group, Cold Spring Harbor Laboratory, Cold Spring
Harbor, New York 11724, USA
rDNA, telomeres and centromeres. G. P. Copenhaver1, S. Luo1, C. S. Pikaard2 & D. Preuss1
1, Howard Hughes Medical Institute, The University of Chicago, 1103 East 57th Street, Chicago, Illinois, USA; 2, Biology Department, Washington University in St Louis,
St Louis, Missouri 63130, USA
Membrane transport. I. T. Paulsen1 & M. Sussman2
1, The Institute for Genomic Research, 9712 Medical Centre Drive, Rockville, Maryland 20850, USA; 2, University of Wisconsin Biotechnology Center, 425 Henry Mall,
Madison, Wisconsin 53706, USA
DNA repair and recombination. A. B. Britt1 & J. A. Eisen2
1, Section of Plant Biology, University of California, Davis, California 95616, USA; 2, The Institute for Genomic Research, 9712 Medical Centre Drive, Rockville,
Maryland 20850, USA
Gene regulation. D. A. Selinger1, R. Pandey1, D. W. Mount2, V. L. Chandler1, R. A. Jorgensen1 & C. Pikaard3
1, Department of Plant Sciences, University of Arizona, 303 Forbes Hall; and 2, Department of Molecular and Cellular Biology, University of Arizona, Tucson, Arizona
85721, USA; 3, Biology Department, Washington University in St Louis, St Louis, Missouri 63130, USA
Cellular organization. G. Juergens
Entwicklungsgenetik, ZMBP-Centre for Plant Molecular Biology, auf der Morgenstelle 1, Tuebingen D-72076, Germany
Development. E. M. Meyerowitz.
Division of Biology, California Institute of Technology, Pasadena, California 91125, USA
Signal transduction. J. R. Ecker1 & A. Theologis2.
1, The Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, California 92037, USA; 2, Plant Gene Expression Center/USDA-UC Berkeley, 800
Buchanan Street, Albany, California 94710, USA
Recognition of and response to pathogens. J. Dangl1 & J. D. G. Jones2
1, Biology Department, Coker Hall, University of North Carolina, Chapel Hill, North Carolina 27599, USA; 2, Sainsbury Laboratory, John Innes Centre, Colney Lane,
Norwich NR4 7UJ, UK
Photomorphogenesis and photosynthesis. M. Chen & J. Chory
Howard Hughes Medical Institute and Plant Biology Laboratory, The Salk Institute, 10010 North Torrey Pines Road, La Jolla, California 92037, USA
Metabolism. C. Somerville
Carnegie Institution, 260 Panama Street, Stanford, California 94305, USA
NATURE | VOL 408 | 14 DECEMBER 2000 | www.nature.com
© 2000 Macmillan Magazines Ltd
Vol 436|11 August 2005|doi:10.1038/nature03895
ARTICLES
The map-based sequence of the rice
genome
International Rice Genome Sequencing Project*
Rice, one of the world’s most important food plants, has important syntenic relationships with the other cereal species
and is a model plant for the grasses. Here we present a map-based, finished quality sequence that covers 95% of the
389 Mb genome, including virtually all of the euchromatin and two complete centromeres. A total of 37,544 non-transposable-element-related protein-coding genes were identified, of which 71% had a putative homologue in
Arabidopsis. In a reciprocal analysis, 90% of the Arabidopsis proteins had a putative homologue in the predicted rice
proteome. Twenty-nine per cent of the 37,544 predicted genes appear in clustered gene families. The number and
classes of transposable elements found in the rice genome are consistent with the expansion of syntenic regions in the
maize and sorghum genomes. We find evidence for widespread and recurrent gene transfer from the organelles to the
nuclear chromosomes. The map-based sequence has proven useful for the identification of genes underlying agronomic
traits. The additional single-nucleotide polymorphisms and simple sequence repeats identified in our study should
accelerate improvements in rice production.
Rice (Oryza sativa L.) is the most important food crop in the world
and feeds over half of the global population. As the first step in a
systematic and complete functional characterization of the rice
genome, the International Rice Genome Sequencing Project
(IRGSP) has generated and analysed a highly accurate finished
sequence of the rice genome that is anchored to the genetic map.
Our analysis has revealed several salient features of the rice
genome:
• We provide evidence for a genome size of 389 Mb. This size
estimation is ~260 Mb larger than the fully sequenced dicot plant
model Arabidopsis thaliana. We generated 370 Mb of finished
sequence, representing 95% coverage of the genome and virtually
all of the euchromatic regions.
• A total of 37,544 non-transposable-element-related protein-coding sequences were detected, compared with ~28,000–29,000 in
Arabidopsis, with a lower gene density of one gene per 9.9 kb in
rice. A total of 2,859 genes seem to be unique to rice and the other
cereals, some of which might differentiate monocot and dicot
lineages.
• Gene knockouts are useful tools for determining gene function
and relating genes to phenotypes. We identified 11,487 Tos17 retrotransposon insertion sites, of which 3,243 are in genes.
• Between 0.38 and 0.43% of the nuclear genome contains organellar DNA fragments, representing repeated and ongoing transfer of
organellar DNA to the nuclear genome.
• The transposon content of rice is at least 35% and is populated by
representatives from all known transposon superfamilies.
• We have identified 80,127 polymorphic sites that distinguish
between two cultivated rice subspecies, japonica and indica,
resulting in a high-resolution genetic map for rice. Single-nucleotide polymorphism (SNP) frequency varies from 0.53 to 0.78%,
which is 20 times the frequency observed between the Columbia
and Landsberg erecta ecotypes of Arabidopsis.
• A comparison between the IRGSP genome sequence and the
6.3× indica and 6× japonica whole-genome shotgun sequence
assemblies revealed that the draft sequences provided coverage of
69% by indica and 78% by japonica relative to the map-based
sequence.
Rice has played a central role in human nutrition and culture for
the past 10,000 years. It has been estimated that world rice production must increase by 30% over the next 20 years to meet
projected demands from population increase and economic development1. Rice grown on the most productive irrigated land has
achieved nearly maximum production with current strains1.
Environmental degradation, including pollution, increase in night-time
temperature due to global warming2, reductions in suitable
arable land, water, labour and energy-dependent fertilizer provide
additional constraints. These factors make steps to maximize rice
productivity particularly important. Increasing yield potential and
yield stability will come from a combination of biotechnology and
improved conventional breeding. Both will be dependent on a high-quality rice genome sequence.
Rice benefits from having the smallest genome of the major cereals,
dense genetic maps and relative ease of genetic transformation3. The
discovery of extensive genome colinearity among the Poaceae4 has
established rice as the model organism for the cereal grasses. These
properties, along with the finished sequence and other tools under
development, set the stage for a complete functional characterization
of the rice genome.
The International Rice Genome Sequencing Project
The IRGSP, formally established in 1998, pooled the resources of
sequencing groups in ten nations to obtain a complete finished
quality sequence of the rice genome (Oryza sativa L. ssp. japonica
cv. Nipponbare). Finished quality sequence is defined as containing
less than one error in 10,000 nucleotides, having resolved ambiguities, and having made all state-of-the-art attempts to close gaps.
The IRGSP released a high-quality map-based draft sequence in
December 2002. Three completely sequenced chromosomes have
been published5–7, as well as two completely sequenced centromeres8–10. As the IRGSP subscribed to an immediate-release policy,
high-quality map-based sequence has been public for some time.
This has permitted rice geneticists to identify several genes underlying traits, and revealed very large and previously unknown segmental duplications that comprise 60% of the genome11–13. The
public sequence has also revealed new details about the syntenic
relationships and gene mobility between rice, maize and sorghum13–15.
*Lists of participants and affiliations appear at the end of the paper
© 2005 Nature Publishing Group
Physical maps, sequencing and coverage
The IRGSP sequenced the genome of a single inbred cultivar, Oryza
sativa ssp. japonica cv. Nipponbare, and adopted a hierarchical clone-by-clone method using bacterial and P1 artificial chromosome clones
(BACs and PACs, respectively). This strategy used a high-density
genetic map16, expressed-sequence tags (ESTs)17, yeast artificial
chromosome (YAC)- and BAC-based physical maps18–20, BAC-end
sequences21 and two draft sequences22,23. A total of 3,401 BAC/PAC
clones (Table 1) were sequenced to approximately tenfold sequence
coverage, assembled, ordered and finished to a sequence quality of
less than one error per 10,000 bases. A majority of physical gaps in
the BAC/PAC tiling path were bridged using a variety of substrates,
including PCR fragments, 10-kb plasmids and 40-kb fosmid
clones. A total of 62 unsequenced physical gaps, including nine
centromere and 17 telomere gaps, remain on the 12 chromosomes
(Table 2). Chromosome arm and telomere gaps were measured,
and the nine centromere gaps were estimated on the basis of
CentO satellite DNA content. The remaining gaps are estimated to
total 18.1 Mb.
Ninety-seven percent of the BAC/PACs and gap sequences (3,360)
have been submitted as finished quality in the PLN division of
GenBank/DDBJ/EMBL. These and the remaining draft-sequenced
clones were used to construct pseudomolecules representing the 12
chromosomes of rice (Fig. 1). The total nucleotide sequence of the 12
pseudomolecules is 370,733,456 bp, with an N-average continuous
sequence length of 6.9 Mb (see Table 1 for a definition of N-average
length). Sequence quality was assessed by comparing 1.2 Mb of
overlapping sequence produced by different laboratories. The overall
accuracy was calculated as 99.99% (Supplementary Table 2). The
statistics of sequenced PAC/BAC clones and pseudomolecules for
each chromosome are shown in Table 1.
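The quality check described above can be illustrated with a small sketch. It assumes the overlapping sequences from two laboratories have already been aligned to equal length, and the toy sequences are hypothetical, not data from the study. It also converts an error rate to a Phred-scaled score, a common convention that the text itself does not mention.

```python
import math


def overlap_accuracy(seq_a: str, seq_b: str) -> float:
    """Fraction of identical positions in two pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    mismatches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a != b)
    return 1.0 - mismatches / len(seq_a)


def phred_quality(error_rate: float) -> float:
    """Phred-scaled quality: Q = -10 * log10(error rate)."""
    return -10.0 * math.log10(error_rate)


# Hypothetical toy overlap: 10,000 bases with a single substitution.
toy_a = "ACGT" * 2500
toy_b = "T" + toy_a[1:]
accuracy = overlap_accuracy(toy_a, toy_b)

# At the finished-quality threshold of one error per 10,000 bases,
# a 1.2 Mb overlap would be expected to show about 120 discrepancies.
expected_discrepancies = 1_200_000 * 1e-4

print(f"accuracy = {accuracy:.4%}, Q = {phred_quality(1e-4):.0f}")
```

One error per 10,000 bases corresponds to 99.99% accuracy, which is the figure reported for the overlapping sequence.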
The genome size of rice (O. sativa ssp. japonica cv. Nipponbare)
was reported to have a haploid nuclear DNA content of 394 Mb on
the basis of flow cytometry24, and 403 Mb on the basis of lengths of
anchored BAC contigs and estimates of gap sizes20. Table 2 shows the
calculated size for each chromosome and the estimated coverage.
Adding the estimated length of the gaps to the sum of the non-overlapping sequence, the total length of the rice nuclear genome was
calculated to be 388.8 Mb. Therefore, the pseudomolecules are
expected to cover 95.3% of the entire genome and an estimated
98.9% of the euchromatin. An independent measure of genome
coverage represented by the pseudomolecules was obtained by
searching for unique EST markers19; of 8,440 ESTs, 8,391 (99.4%)
were identified in the pseudomolecules.
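The genome-size and coverage figures in this paragraph can be reproduced with simple arithmetic. The sketch below assumes that the 370,733,456 bp pseudomolecule total stands in for the sum of non-overlapping sequence and that the remaining gaps total 18.1 Mb, as quoted earlier in the section.

```python
SEQUENCED_BP = 370_733_456   # total length of the 12 pseudomolecules (from the text)
GAP_BP = 18_100_000          # estimated total of the remaining gaps (18.1 Mb)

# Estimated genome size = sequenced bases + estimated gap length.
genome_bp = SEQUENCED_BP + GAP_BP
coverage = SEQUENCED_BP / genome_bp

# Independent check: unique EST markers recovered in the pseudomolecules.
est_found_fraction = 8_391 / 8_440

print(f"genome size ~ {genome_bp / 1e6:.1f} Mb")
print(f"coverage    ~ {coverage:.1%}")
print(f"ESTs found  ~ {est_found_fraction:.1%}")
```

Under these assumptions the sketch gives 388.8 Mb, 95.3% and 99.4%, matching the values stated in the text.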
Centromere location
Typical eukaryotic centromeres contain repetitive sequences, including satellite DNA at the centre and retrotransposons and transposons
in the flanking regions. All rice centromeres contain the highly
repetitive 155–165 bp CentO satellite DNA, together with centromere-specific retrotransposons25,26. The CentO satellites are located
within the functional domain of the rice centromere10,26. Complete
sequencing of the centromeres of rice chromosomes 4 and 8 revealed
that they consist of 59 kb and 69 kb of clustered CentO repeats
(respectively)8–10, tandemly arrayed head-to-tail within the clusters.
Numerous retrotransposons, including the centromere-specific
RIRE7, are found between and around the CentO repeats. CentO
clusters show differences in length and orientation for the two
centromeres.
BLASTN analysis of the pseudomolecules indicated that about
0.9 Mb of CentO repeats (corresponding to more than 5,800 copies of
the satellite) were sequenced and found to be associated with
centromere-specific retroelements. Locations of all CentO sequences
correspond to genetically identified centromere regions (Supplementary Table 3). Our pseudomolecules cover the centromere regions on
chromosomes 4, 5 and 8, and portions of the centromeres on the
remaining chromosomes (Fig. 1).
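As a rough consistency check on the copy number, the total length of CentO sequence can be divided by the repeat unit length. This is only a sketch: the 0.9 Mb total is a rounded figure, and the helper name is hypothetical.

```python
def copy_number_range(total_bp: int, min_unit_bp: int, max_unit_bp: int) -> tuple[int, int]:
    """Return (fewest, most) whole repeat copies that fit in total_bp."""
    return total_bp // max_unit_bp, total_bp // min_unit_bp


# About 0.9 Mb of CentO repeats, with a unit length of 155-165 bp.
fewest, most = copy_number_range(900_000, 155, 165)
print(f"{fewest} to {most} copies")
```

The upper end of this range, obtained with the 155 bp unit, is in line with the "more than 5,800 copies" quoted above.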
Gene content, expression and distribution
We masked the pseudomolecules for repetitive sequences and used
the ab initio gene finder FGENESH to identify only non-transposable-element-related genes. A total of 37,544 non-transposable-element protein-coding sequences were predicted, resulting in a
density of one gene per 9.9 kb (Supplementary Tables 4 and 5). As
the ability to identify unannotated and transposable-element-related
genes improves, the true protein-coding gene number in rice will
doubtless be revised.
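The quoted gene density follows from dividing sequence length by gene count. The sketch below assumes the density was computed over the full pseudomolecule length of 370,733,456 bp given earlier; the text does not state the denominator explicitly.

```python
PSEUDOMOLECULE_BP = 370_733_456   # total pseudomolecule length (from the text)
PREDICTED_GENES = 37_544          # non-transposable-element protein-coding predictions

# Average spacing between predicted genes, in kilobases.
kb_per_gene = PSEUDOMOLECULE_BP / PREDICTED_GENES / 1_000
print(f"one gene per {kb_per_gene:.1f} kb")
```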
Full-length complementary DNA sequences are available for rice27,
and provide a powerful resource for improving gene model structure
derived from ab initio gene finders28. Of the 37,544 non-transposable-element-related FGENESH models, 17,016 could be supported
by a total of 25,636 full-length cDNAs (Supplementary Table 6).
A total of 22,840 (61%) genes had a high identity match with a rice
EST or full-length cDNA. On average, about 10.7 EST sequences were
present for each expressed rice gene. A total of 2,927 genes aligned
well with ESTs from other cereal species, and 330 of these genes
matched only with a non-rice cereal EST (Supplementary Fig. 1).
Except for the short arms of chromosomes 4, 9 and 10, which are
known to be highly heterochromatic, the density of expressed genes
is greater on the distal portions of the chromosome arms
compared with the regions around the centromeres (Supplementary
Fig. 2).
A total of 19,675 proteins had matches with entries in the Swiss-Prot database; of these, 4,500 had no expression support. Domain
searches revealed a minimum of one motif or domain present in 63%
of the predicted proteins, with a total of 3,328 different domains
present in the predicted rice proteome. The five most abundant
domains were associated with protein kinases (Supplementary
Table 7). Fifty-one per cent of the predicted proteins could be
associated with a biological process (Supplementary Fig. 3a), with
metabolism (29.1%) and cellular physiological processes (11.9%)
representing the two most abundant classes.
Approximately 71% (26,837) of the predicted rice proteins have a
homologue in the Arabidopsis proteome (Supplementary Fig. 4). In a
reciprocal search, 89.8% (26,004) of the proteins from the Arabidopsis genome have a homologue in the rice proteome. Of the 23,170
rice genes with rice EST, cereal EST, or full-length cDNA support,
20,311 (88%) have a homologue in Arabidopsis. Fewer putative
homologues were found in other model species: 38.1% in Drosophila,
40.8% in human, 36.5% in Caenorhabditis elegans, 30.2% in yeast,
17.6% in Synechocystis and 10.2% in Escherichia coli.
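The counts quoted in this section are mutually consistent, which a few lines of arithmetic make explicit. All inputs are taken from the text. The implied size of the Arabidopsis protein set is a derived estimate, not a figure reported by the authors.

```python
predicted_genes = 37_544
rice_est_or_cdna = 22_840           # match a rice EST or full-length cDNA
cereal_est_only = 330               # match only a non-rice cereal EST
with_arabidopsis_homologue = 26_837
supported_with_homologue = 20_311

# Genes with any transcript support (rice EST, cereal EST or full-length cDNA).
supported = rice_est_or_cdna + cereal_est_only

pct_rice_support = 100 * rice_est_or_cdna / predicted_genes
pct_homologue = 100 * with_arabidopsis_homologue / predicted_genes
pct_supported_homologue = 100 * supported_with_homologue / supported

# Supported genes lacking an Arabidopsis homologue.
supported_without_homologue = supported - supported_with_homologue

# Derived estimate: 26,004 proteins are 89.8% of the Arabidopsis set searched.
implied_arabidopsis_proteins = round(26_004 / 0.898)

print(supported, supported_without_homologue, implied_arabidopsis_proteins)
```

The sum of 22,840 and 330 reproduces the 23,170 supported genes, and subtracting the 20,311 with a homologue leaves 2,859, the number of transcribed rice genes without an Arabidopsis homologue.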
There are profound differences in plant architecture and biochemistry between monocotyledonous and dicotyledonous angiosperms.
Only 2,859 rice genes with evidence of transcription lack homologues
in the Arabidopsis genome. We investigated these to learn what
functions they encoded. The vast majority had no matches, or
most closely matched unknown or hypothetical proteins. The grasses
have a class of seed storage proteins called prolamins that is not found
in dicots. There are also families of hormone response proteins and
defence proteins, such as proteinase inhibitors, chitinases, pathogenesis-related proteins and seed allergens, many of which are
tandemly repeated (Supplementary Table 8). Nevertheless, with a
large number of proteins of unknown function, the most interesting
differences between the genome content of these two groups of
angiosperms remain to be discovered.
Tos17 is an endogenous copia-like retrotransposon in rice that is
inactive under normal growth conditions. In tissue culture, it
becomes activated, transposes and is stably inherited when the
plant is regenerated29. There are only two copies of Tos17 in the
rice cultivar Nipponbare. These features, together with its preferential insertion into gene-rich regions, make Tos17 uniquely suitable for
the functional analysis of rice genes by gene disruption. About 50,000
Tos17-insertion lines carrying 500,000 insertions have been produced30. A total of 11,487 target loci were mapped on the 12
pseudomolecules (Supplementary Fig. 5), with at least one insertion
detected in 3,243 genes. The density of Tos17 insertions is higher in
euchromatic regions of the genome30, in contrast to the distribution
of high-copy retrotransposons, which are more frequently found in
pericentromeric regions. A similar target site preference has been
reported for T-DNA insertions in Arabidopsis31.
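Assigning mapped insertion sites to genes is an interval lookup. The sketch below is one simple way to do it for a single chromosome, assuming sorted, non-overlapping gene intervals with 1-based inclusive coordinates. The function name, gene names and positions are hypothetical and do not reflect the pipeline used in the study.

```python
from bisect import bisect_right


def genes_hit(genes, insertions):
    """Count insertions per gene.

    genes: list of (start, end, gene_id), sorted by start, non-overlapping,
           1-based inclusive coordinates on one chromosome.
    insertions: iterable of insertion positions on the same chromosome.
    """
    starts = [start for start, _, _ in genes]
    hits = {}
    for pos in insertions:
        # Index of the last gene whose start is <= pos, or -1 if none.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and genes[i][0] <= pos <= genes[i][1]:
            gene_id = genes[i][2]
            hits[gene_id] = hits.get(gene_id, 0) + 1
    return hits


# Hypothetical toy annotation and insertion positions.
toy_genes = [(100, 500, "geneA"), (900, 1500, "geneB"), (3000, 3800, "geneC")]
toy_insertions = [50, 120, 480, 700, 1000, 2999, 3800]
toy_hits = genes_hit(toy_genes, toy_insertions)
print(toy_hits)
```

Counting the keys of the returned dictionary gives the number of genes with at least one insertion, the quantity reported as 3,243 in the text.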
Tandem gene families
One surprising outcome of the Arabidopsis genome analysis was the
large percentage (17%) of genes arranged in tandem repeats32. When
performing a similar analysis with rice, the percentage was comparable (14%). However, manual curation on rice chromosome 10
showed one gene family encoding a glycine-rich protein with 27
copies and one encoding a TRAF/BTB domain protein with 48
copies33. These tandemly repeated families are interrupted with
other genes and are not included in strictly defined tandem repeats.
We therefore screened for all tandemly arranged genes in 5-Mb
intervals. Using these criteria, 29% of the genes (10,837) are amplified at least once in tandem, and 153 rice gene arrays contained 10–134
members (Supplementary Fig. 6). Sixty-five per cent of the
tandem arrays with over 27 members, and 33% of all the arrays with
over 10 members, contain protein kinase domains (Supplementary
Table 9).
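The relaxed screen described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: it assumes that every gene already carries a family label (for example from a prior BLASTP clustering), and it interprets the 5-Mb criterion as a maximum distance between neighbouring family members. The `Gene` record, the `tandem_arrays` function and the coordinates in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int   # bp position on the pseudomolecule
    family: str  # family label, assumed to come from a prior BLASTP clustering


def tandem_arrays(genes, max_span=5_000_000):
    """Group same-family genes on one chromosome whose neighbouring members
    lie within max_span bp of each other. Unrelated genes may sit in between,
    which is the relaxation relative to strictly defined tandem repeats."""
    by_key = defaultdict(list)
    for gene in genes:
        by_key[(gene.chrom, gene.family)].append(gene)

    arrays = []
    for members in by_key.values():
        members.sort(key=lambda g: g.start)
        current = [members[0]]
        for gene in members[1:]:
            if gene.start - current[-1].start <= max_span:
                current.append(gene)
            else:
                if len(current) > 1:  # an array needs at least two members
                    arrays.append(current)
                current = [gene]
        if len(current) > 1:
            arrays.append(current)
    return arrays
```

For example, three kinase genes at 1.0, 1.4 and 9.0 Mb on one chromosome, with an unrelated gene at 1.2 Mb, would yield a single two-member array, because the third kinase gene lies more than 5 Mb from its nearest relative.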
Non-coding RNA genes
The nucleolar organizer, consisting of 17S–5.8S–25S ribosomal DNA
coding units, is found at the telomeric end of the short arm of
chromosome 9 (ref. 34) in O. sativa ssp. japonica, and is estimated to
comprise 7 Mb (ref. 35). A second 17S–5.8S–25S rDNA locus is
found at the end of the short arm of chromosome 10 in O. sativa ssp.
indica34. A single 5S cluster is present on the short arm of chromosome 11 in the vicinity of the centromere36, and encompasses
0.25 Mb.
A total of 763 transfer RNA genes, including 14 tRNA pseudogenes,
were detected in the 12 pseudomolecules. In comparison, a total of
611 tRNA genes were detected in Arabidopsis32. Supplementary Fig. 7
shows the distribution of these tRNA genes in each chromosome.
Chromosome 4 has a single tRNA cluster6, and chromosome 10 has
two large clusters derived from inserted chloroplast DNA7. Except for
regions of intermediate density on chromosomes 1, 2, 8 and 12, there
seem to be no other large clusters.
MicroRNAs (miRNAs), a class of eukaryotic non-coding RNAs,
are believed to regulate gene expression by interacting with the target
messenger RNA37. miRNAs have been predicted from Arabidopsis38
and rice39, and we mapped 158 miRNAs onto the rice pseudomolecules (Supplementary Table 10). Among other non-coding RNAs, we
identified 215 small nucleolar RNA (snoRNA) and 93 spliceosomal
RNA genes, both showing biased chromosomal distributions, in the
rice genome (Supplementary Table 11).
Organellar insertions in the nuclear genome
Mitochondria and chloroplasts originated from alpha-proteobacterial and cyanobacterial endosymbionts, respectively. A continuous transfer of
organellar DNA to the nucleus has resulted in the presence of
chloroplast and mitochondrial DNA inserted in the nuclear chromosomes. Although the endosymbionts probably contained genomes of
several Mb at the time they were internalized, the organellar genomes
diminished so that the present size of the mitochondrial genome is
less than 600 kb, and that of the chloroplast is only 150 kb. Homology
searches detected 421–453 chloroplast insertions and 909–1,191
mitochondrial insertions, depending upon the stringency adopted
(Supplementary Fig. 8 and Supplementary Table 12). Thus, chloroplast and mitochondrial insertions contribute 0.20–0.24% and
0.18–0.19% of the nuclear genome of rice, respectively, and correspond to 5.3 chloroplast and 1.3 mitochondrial genome equivalents.
The distribution of chloroplast and mitochondrial insertions over
the 12 chromosomes indicates that mitochondrial and chloroplast
transfers occurred independently. Two chromosomes harbour more
insertions than the others (Supplementary Fig. 8 and Supplementary
Table 12), with chromosome 12 containing nearly 1% mitochondrial
DNA and chromosome 10 containing approximately 0.8% chloroplast DNA. It is clear that several successive transfer events have
occurred, as insertions of less than 10 kb have heterogeneous identities. The longest insertions, however, systematically show >98.5%
identity to organellar DNA (Supplementary Table 13), indicating
recent insertions for both chloroplast and mitochondrial genomes.

Table 1 | Classification and distribution of sequenced PAC and BAC clones* on the 12 rice chromosomes

| Chr | Sequencing laboratory† | PAC | BAC | OSJNBa/b | OJ | OSJNO | Others‡ | Total§ | Pseudomolecule (bp) | N-average length‖ (bp) | Accession no. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | RGP, KRGRP | 251 | 77 | 42 | 23 | 4 | 0 | 397 | 43,260,640 | 9,688,259 | AP008207 |
| 2 | RGP, JIC | 117 | 16 | 80 | 142 | 4 | 0 | 359 | 35,954,074 | 7,793,366 | AP008208 |
| 3 | ACWW, TIGR | 1 | 8 | 263 | 47 | 1 | 10 | 330 | 36,189,985 | 5,196,992 | AP008209 |
| 4 | NCGR | 2 | 7 | 275 | 7 | 0 | 0 | 291 | 35,489,479 | 1,427,419 | AP008210 |
| 5 | ASPGC | 67 | 11 | 113 | 87 | 0 | 0 | 278 | 29,733,216 | 3,086,418 | AP008211 |
| 6 | RGP | 169 | 20 | 78 | 14 | 0 | 0 | 281 | 30,731,386 | 8,669,608 | AP008212 |
| 7 | RGP | 102 | 19 | 68 | 97 | 0 | 0 | 286 | 29,643,843 | 14,923,781 | AP008213 |
| 8 | RGP | 113 | 23 | 56 | 83 | 2 | 0 | 277 | 28,434,680 | 14,872,702 | AP008214 |
| 9 | RGP, KRGRP, BIOTEC, BRIGI | 72 | 24 | 72 | 50 | 5 | 0 | 223 | 22,692,709 | 5,219,517 | AP008215 |
| 10 | ACWW, TIGR, PGIR | 1 | 5 | 172 | 6 | 0 | 21 | 205 | 22,683,701 | 2,124,647 | AP008216 |
| 11 | ACWW, TIGR, IIRGS, PGIR, Genoscope | 10 | 6 | 236 | 3 | 2 | 1 | 258 | 28,357,783 | 1,087,274 | AP008217 |
| 12 | Genoscope | 2 | 6 | 179 | 79 | 0 | 2 | 268 | 27,561,960 | 7,600,514 | AP008218 |
| Total | | 907 | 222 | 1634 | 638 | 18 | 34 | 3453 | 370,733,456 | 6,928,182 | |

Chr, chromosome.
* PAC, Rice Genome Research Program PAC; BAC, Rice Genome Research Program BAC; OSJNBa/b, Clemson University Genomics Institute BAC; OJ, Monsanto BAC; OSJNO, Arizona Genomics Institute fosmid (http://www.genome.arizona.edu/orders/direct.html?library=OSJNOa); Others, artificial gap-filling clones designated as OSJNA and OJA.
† ACWW (Arizona Genomics Institute, Cold Spring Harbor Laboratory, Washington University Genome Sequencing Center, University of Wisconsin) Rice Genome Sequencing Consortium; ASPGC, Academia Sinica Plant Genome Center; BIOTEC, National Center for Genetic Engineering and Biotechnology; BRIGI, Brazilian Rice Genome Initiative; IIRGS, Indian Initiative for Rice Genome Sequencing; JIC, John Innes Centre; KRGRP, Korea Rice Genome Research Program; NCGR, National Center for Gene Research; PGIR, Plant Genome Initiative at Rutgers; RGP, Rice Genome Research Program; TIGR, The Institute for Genomic Research.
‡ Constructs derived by joining (mostly from the clone gap regions) sequence from PCR fragments, Monsanto or Syngenta sequences and the neighbouring clone sequences.
§ A total of 2,494 BAC and 907 PAC clones were used for draft and finished sequencing. Monsanto draft-sequenced BACs underlie 638 finished clones. The Syngenta draft sequence contributed to the assemblies of 140 IRGSP clone sequences. Thirty-four sequence submissions are artificial constructs derived by joining a regional sequence (mostly from the clone gap regions) from PCR fragments, Monsanto or Syngenta sequences with the neighbouring clone sequences. This also includes 93 clones submitted as phase 1 or phase 2 to the HTG section of GenBank.
‖ N-average length: the average length of a contiguous segment (without sequence or physical gaps) containing a randomly chosen nucleotide.
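The N-average length defined in the Table 1 footnote has a simple closed form: a randomly chosen nucleotide falls in a segment with probability proportional to that segment's length, so the expected segment length is the sum of squared lengths divided by the sum of lengths. The sketch below is illustrative only; the function name `n_average_length` is hypothetical and the segment lengths in the usage note are invented.

```python
def n_average_length(segment_lengths):
    """Expected length of the contiguous segment that contains a randomly
    chosen nucleotide: sum(L_i ** 2) / sum(L_i)."""
    total = sum(segment_lengths)
    if total == 0:
        raise ValueError("at least one non-empty segment is required")
    return sum(length * length for length in segment_lengths) / total
```

With two segments of 30 and 10 units, for instance, the value is (900 + 100) / 40 = 25, which is larger than the plain mean of 20 because long segments are sampled more often.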
Transposable elements
The rice genome is populated by representatives from all known
transposon superfamilies, including elements that cannot be easily
classified into either class I or II (ref. 40). Previous estimates of the
transposon content in the rice genome range from 10 to 25% (refs 21,
40). However, the increased availability of transposon query
sequences and the use of profile hidden Markov models allow the
identification of more divergent elements41 and indicate that the
transposon content of the O. sativa ssp. japonica genome is at least
35% (Table 3). Chromosomes 8 and 12 have the highest transposon
content (38.0% and 38.3%, respectively), and chromosomes 1
(31.0%), 2 (29.8%) and 3 (29.0%) have the lowest proportion of
transposons. Conversely, elements belonging to the IS5/Tourist and
IS630/Tc1/mariner superfamilies, which are generally correlated with
gene density, are prevalent on the first three chromosomes and least
frequent on chromosomes 4 and 12.
Class II elements, characterized by terminal inverted-repeats and
including the hAT, CACTA, IS256/Mutator, IS5/Tourist, and IS630/
Tc1/mariner superfamilies, outnumber class I elements, which
include long terminal-repeat (LTR) retrotransposons (Ty1/copia,
Ty3/gypsy and TRIM) and non-LTR retrotransposons (LINEs and
SINEs, or long- and short-interspersed nucleotide elements, respectively), by more than twofold (Table 3). However, the nucleotide
contribution of class I is greater than that of class II, due mostly to the
large size of LTR retrotransposons and the small size of IS5/Tourist
and IS630/Tc1/mariner elements. The inverse is the case for maize,
for which class I elements outnumber class II elements42. Given their
larger sizes, differential amplification of LTR elements in maize
compared with rice is consistent with the genomic expansion
found between orthologous regions of rice and maize15,33.
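The contrast between copy number and nucleotide contribution can be checked directly from the class totals in Table 3. The short calculation below is only a worked illustration of that arithmetic; the variable names are hypothetical, and the mean element length is a derived quantity that is not reported in the table.

```python
# Class totals copied from Table 3: copy number in thousands, coverage in kb.
CLASS_I = {"copies_k": 61.9, "coverage_kb": 71734.4}
CLASS_II = {"copies_k": 163.8, "coverage_kb": 48066.6}

# Class II elements are more numerous ...
copy_ratio = CLASS_II["copies_k"] / CLASS_I["copies_k"]

# ... but class I elements contribute more nucleotides.
coverage_ratio = CLASS_I["coverage_kb"] / CLASS_II["coverage_kb"]

# kb per thousand copies is numerically equal to bp per copy.
mean_bp_class_i = CLASS_I["coverage_kb"] / CLASS_I["copies_k"]
mean_bp_class_ii = CLASS_II["coverage_kb"] / CLASS_II["copies_k"]
```

Under these figures class II elements are roughly 2.6 times as numerous, whereas class I elements cover about 1.5 times as much sequence, which is consistent with a much larger average size for class I elements.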
Most class I elements are concentrated in gene-poor, heterochromatic regions such as the centromeric and pericentromeric regions
(Supplementary Table 14). In contrast, members of some transposon
superfamilies, including IS5/Tourist, IS630/Tc1/mariner and LINEs,
have a significant positive correlation with both recombination rate
and gene density. There is an effect of average element length
associated with these patterns: short elements generally show a
positive correlation with recombination rate and gene density, and
are under-represented in the centromere regions, whereas larger
elements have higher centromeric and pericentromeric abundance.
Intraspecific sequence polymorphism
Map-based cloning to identify genes that are associated with agronomic traits is dependent on having a high frequency of polymorphic
markers to order recombination events. In rice, most of the segregating populations are generated from crosses between the two major
subspecies of cultivated rice, Oryza sativa ssp. japonica and O. sativa
ssp. indica. Although several studies on the polymorphisms detected
between japonica and indica subspecies have been reported6,43,44, the
analysis reported here uses an approach that ensures comparison of
orthologous sequences. O. sativa ssp. indica cv. Kasalath and O. sativa
ssp. japonica cv. Nipponbare are the parents of the most densely
mapped rice population16. BAC-end sequences were obtained from a
Kasalath BAC library of 47,194 clones. Only high quality, single-copy
sequences were mapped to the Nipponbare pseudomolecules, and
only paired inverted sequences that mapped within 200 kb were
considered. A total of 26,632 paired Kasalath BAC-end sequences
were mapped to the 12 rice pseudomolecules (Supplementary
Table 15). Kasalath BAC clones spanned 308 Mb or 79% of the
Nipponbare genome. Sequence alignments with a PHRED quality
value of 30 covered 12,319,100 bp (3%) of the total rice genome. A
total of 80,127 sites differed in the corresponding regions in Nipponbare and Kasalath. The frequency of SNPs varied between
chromosomes (0.53–0.78%). Insertions and deletions were also
detected. The ratio of small insertion/deletion site nucleotides (1–14
bases) against the alignment length (0.20–0.27%) was similar
among the different chromosomes, and there was no preference for
the direction of insertions or deletions. The main patterns of base
substitutions observed between Nipponbare and Kasalath are shown
in Supplementary Table 16. Transitions (70%) were the most
prominent substitutions; this is a substantially higher fraction than
found between Arabidopsis ecotypes Columbia and Landsberg erecta32.
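The two summary statistics used in this paragraph, the SNP frequency and the transition fraction, can be sketched as below. This is a minimal illustration, not the analysis code: the function names are hypothetical, and only the genome-wide counts quoted in the text (80,127 differing sites in 12,319,100 aligned bp) are taken from the paper.

```python
# Purine-purine and pyrimidine-pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    return "transition" if (ref, alt) in TRANSITIONS else "transversion"


def snp_frequency_percent(n_sites, aligned_bp):
    """Differing sites as a percentage of the aligned length."""
    return 100.0 * n_sites / aligned_bp
```

Applied to the totals above, `snp_frequency_percent(80127, 12319100)` gives about 0.65%, which falls inside the per-chromosome range of 0.53–0.78% reported in the text.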
Class 1 simple sequence repeats in the rice genome
Class 1 simple sequence repeats (SSRs) are perfect repeats >20
nucleotides in length45 that behave as hypervariable loci, providing
a rich source of markers for use in genetics and breeding. A total of
18,828 Class 1 di-, tri- and tetra-nucleotide SSRs, representing 47
distinctive motif families, were identified and annotated on the rice
genome (Supplementary Fig. 9). Supplementary Table 17 provides
information about the physical positions of all Class 1 SSRs in
relation to widely used restriction-fragment length polymorphisms
(RFLPs)16,46 and previously published SSRs45. There was an average of
51 hypervariable SSRs per Mb, with the highest density of markers
occurring on chromosome 3 (55.8 SSR Mb⁻¹) and the lowest occurring on chromosome 4 (41.0 SSR Mb⁻¹). A summary of information
about the Class 1 SSRs identified in the rice pseudomolecules appears
in Supplementary Table 18. Several thousand of these SSRs have
already been shown to amplify well and be polymorphic in a panel of
diverse cultivars45, and thus are of immediate use for genetic analysis.

Table 2 | Size of each chromosome based on sequence data and estimated gaps

| Chr | Sequenced bases (bp) | Gaps on arm regions: No. | Gaps on arm regions: Length (Mb) | Telomeric gaps* (Mb) | Centromeric gap† (Mb) | rDNA‡ (Mb) | Total (Mb) | Coverage§ (%) | Coverage‖ (%) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 43,260,640 | 5 | 0.33 | 0.06 | 1.40 | | 45.05 | 99.1 | 96.0 |
| 2 | 35,954,074 | 3 | 0.10 | 0.01 | 0.72 | | 36.78 | 99.7 | 97.7 |
| 3 | 36,189,985 | 4 | 0.96 | 0.04 | 0.18 | | 37.37 | 97.3 | 96.8 |
| 4 | 35,489,479 | 3 | 0.46 | 0.20 | | | 36.15 | 98.7 | 98.2 |
| 5 | 29,733,216 | 6 | 0.22 | 0.05 | | | 30.00 | 99.3 | 99.1 |
| 6 | 30,731,386 | 1 | 0.02 | 0.03 | 0.82 | | 31.60 | 99.8 | 97.2 |
| 7 | 29,643,843 | 1 | 0.31 | 0.01 | 0.32 | | 30.28 | 98.9 | 97.9 |
| 8 | 28,434,680 | 1 | 0.09 | 0.05 | | | 28.57 | 99.7 | 99.5 |
| 9 | 22,692,709 | 4 | 0.13 | 0.14 | 0.62 | 6.95 | 30.53 | 98.8 | 74.3 |
| 10 | 22,683,701 | 4 | 0.68 | 0.13 | 0.47 | | 23.96 | 96.6 | 94.7 |
| 11 | 28,357,783 | 4 | 0.21 | 0.04 | 1.90 | 0.25 | 30.76 | 99.1 | 92.2 |
| 12 | 27,561,960 | 0 | 0.00 | 0.05 | 0.16 | | 27.77 | 99.8 | 99.2 |
| All | 370,733,456 | 36 | 3.51 | 0.81 | 6.59 | 7.20 | 388.82 | 98.9 | 95.3 |

* Estimated length including the telomeres, calculated with the average value of 3.2 kb for each chromosome24.
† Estimated length of centromere-specific CentO repeats on each chromosome26.
‡ Represents the estimated length of the 17S–5.8S–25S rDNA cluster on Chr 9 (ref. 35) and the 5S cluster on Chr 11 (ref. 24).
§ Coverage of the pseudomolecules for the euchromatic regions in each chromosome.
‖ Coverage of the pseudomolecules over the full length of each chromosome.

Figure 1 | Maps of the twelve rice chromosomes. For each chromosome
(Chr 1–12), the genetic map is shown on the left and the PAC/BAC contigs
on the right. The position of markers flanking the PAC/BAC contigs (green)
is indicated on the genetic map. Physical gaps are shown in white and the
nucleolar organizer on chromosome 9 is represented with a dotted green
line. Constrictions in the genetic maps and arrowheads to the right of
physical maps represent the chromosomal positions of centromeres for
which rice CentO satellites are sequenced. The maps are scaled to genetic
distances in centimorgans (cM) and the physical maps are depicted in
relative physical lengths. Please refer to Table 2 for estimated lengths of the
chromosomes.
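The definition of a Class 1 SSR given in this section (a perfect di-, tri- or tetra-nucleotide repeat longer than 20 nucleotides) lends itself to a simple pattern search. The sketch below is not the Simple Sequence Repeat Identification Tool named in the Methods; it is a hypothetical stand-in built on regular-expression backreferences, and the function names, the `min_length` parameter and the test sequence are assumptions made for illustration.

```python
import re


def _is_compound(motif):
    """True if the motif is itself a repeat of a shorter unit (e.g. ATAT, AAA)."""
    k = len(motif)
    return any(k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k))


def find_class1_ssrs(seq, min_length=21):
    """Return (start, end, motif, copies) for perfect di-, tri- and
    tetra-nucleotide repeats spanning at least min_length nucleotides.
    Coordinates are 0-based and half-open."""
    seq = seq.upper()
    hits = []
    for k in (2, 3, 4):
        min_copies = -(-min_length // k)  # ceiling division
        # One captured motif followed by at least min_copies - 1 exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if _is_compound(motif):
                continue  # reported under its shorter unit, or a homopolymer
            copies = (match.end() - match.start()) // k
            hits.append((match.start(), match.end(), motif, copies))
    return sorted(hits)
```

A 24-nucleotide (AT)12 tract is reported once, as a dinucleotide repeat, whereas a 15-nucleotide (CAG)5 tract in the same sequence is ignored because it is too short to qualify.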
Genome-wide comparison of draft versus finished sequences
Two whole-genome shotgun assemblies of draft-quality rice
sequence have been published23,47, and reassemblies of both have
just appeared48. One of these is an assembly of 6.28× coverage of O.
sativa ssp. indica cv. 93-11. The second sequence is an approximately 6× coverage
of O. sativa ssp. japonica cv. Nipponbare23,48. These assemblies
predict genome sizes of 433 Mb for japonica and 466 Mb for indica,
which differ from our estimation of a 389 Mb japonica genome.
Contigs from the whole-genome shotgun assembly of 93-11 and
Nipponbare48 were aligned with the IRGSP pseudomolecules. Non-redundant coverage of the pseudomolecules by the indica assembly
varied from 78% for chromosome 3 to 59% for chromosome 12, with
an overall coverage of 69% (Supplementary Table 19). When genes
supported by full-length cDNA coverage were aligned to the covered
regions, we found that 68.3% were completely covered by the indica
sequences. The average size of the indica contigs is 8.2 kb, so it is not
surprising that many did not completely cover the gene models
defined here. The coverage of the Nipponbare whole-genome shotgun assembly varied from 68% to 82%, with an overall coverage of 78%
of the genome, and 75.3% of the full-length cDNA-supported gene
models.
We undertook a detailed comparison of the first Mb of these
assemblies on 1S (the short arm of chromosome 1) with the IRGSP
chromosome 1 (Supplementary Fig. 10 and Supplementary Table
20). The numbers from this comparison agree with the whole-genome comparison described above. In addition, we observed
that a substantial portion of the contigs from each assembly were
non-homologous, misaligned or provided duplicate coverage.
Indeed, the whole-genome shotgun assembly differed by 0.05%
base-pair mismatches for the two aligned regions from the same
Nipponbare cultivar. The two assemblies were further examined for
the presence of the CentO sequence (Supplementary Table 21). Sixty-eight per cent of the copies observed in the 93-11 assembly and 32%
of the CentO-containing contigs in the whole-genome shotgun
Nipponbare assembly were found outside the centromeric regions.
In contrast, the CentO repeats were restricted to the centromeric
regions in the IRGSP pseudomolecules. It is unlikely that there are
dispersed centromeres in indica rice; misassembly of the whole-genome shotgun sequences is a more likely explanation for dispersed
CentO repeats. These observations indicate that the draft sequences,
although providing a useful preliminary survey of the genome, might
not be adequate for gene annotation, functional genomics or the
identification of genes underlying agronomic traits.
Concluding remarks
The attainment of a complete and accurate map-based sequence for
rice is compelling. We now have a blueprint for all of the rice
chromosomes. We know, with a high level of confidence, the
distribution and location of all the main components—the genes,
repetitive sequences and centromeres. Substantial portions of the
map-based sequence have been in public databases for some time,
and the availability of provisional rice pseudomolecules based on this
sequence has provided the scientific community with numerous
opportunities to evaluate the genome, as indicated by the number of
publications in rice biology and genetics over the past few years.
Furthermore, the wealth of SNP and SSR information provided here
and elsewhere will accelerate marker-assisted breeding and positional
cloning, facilitating advances in rice improvement.
The syntenic relationships between rice and the cereal grasses have
long been recognized4. Comparing genome organization, genes and
intergenic regions between cereal species will permit identification of
regions that are highly conserved or rapidly evolving. Such regions
are expected to yield crucial insights into genome evolution, speciation and domestication.
METHODS
Physical map and sequencing. Nine genomic libraries from Oryza sativa ssp.
japonica cultivar Nipponbare were used to establish the physical map of rice
chromosomes by polymerase chain reaction (PCR) screening19, fingerprinting20
and end-sequencing21. The PAC, BAC and fosmid clones on the physical map
were subjected to random shearing and shotgun sequencing to tenfold redundancy, using both universal primers and the dye-terminator or dye-primer
methods. The sequences were assembled using PHRED (http://www.genome.washington.edu/UWGC/analysistools/Phred.cfm) and PHRAP (http://www.genome.washington.edu/UWGC/analysistools/Phrap.cfm) software packages or
using the TIGR Assembler (http://www.tigr.org/software/assembler/).
Sequence gaps were resolved by full sequencing of gap-bridge clones, PCR
fragments or direct sequencing of BACs. Sequence ambiguities (indicated by
PHRAP scores less than 30) were resolved by confirming the sequence data using
alternative chemistries or different polymerases. We empirically determined that
a PHRAP score of 30 or above exceeds the standard of less than one error in
10,000 bp. BAC and PAC assemblies were tested for accuracy by comparing
computationally derived fingerprint patterns with experimentally determined
patterns of restriction enzyme digests. Sequence quality was also evaluated by
comparing independently obtained overlapping sequences.
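For orientation, the standard Phred-style relationship between a quality score and a nominal error probability is Q = -10 log10(P). The sketch below assumes that the scores quoted above follow this scale; the function names are hypothetical. Under the nominal definition a score of 30 corresponds to one error in 1,000 bases, so the one-in-10,000 figure in the text should be read as the empirical accuracy of the assembled consensus, not as the output of this formula.

```python
import math


def phred_to_error_probability(quality):
    """Nominal error probability for a Phred-scaled quality score."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability):
    """Phred-scaled quality score for a given error probability."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in the interval (0, 1]")
    return -10 * math.log10(probability)
```

On this scale an error rate of one in 10,000 corresponds to a nominal score of 40.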
Small physical gaps were filled by long-range PCR. Remaining physical gaps
were measured using fluorescence in situ hybridization analysis. We used the
length of CentO arrays26 to estimate the size of each of the remaining centromere
gaps.
Annotation and bioinformatics. Gene models were predicted using FGENESH
(http://www.softberry.com/berry.phtml?topic=fgenesh) with the monocot-trained
matrix on the native and repeat-masked pseudomolecules. Gene models
with incomplete open reading frames, those encoding proteins of less than 50
amino acids, or those corresponding to organellar DNA were omitted from the
final set. The coordinates of transposable elements, excluding MITEs (miniature
inverted-repeat transposable elements), were used to mask the pseudomolecules.
Conserved domain/motif searches and association with gene ontologies were
performed using InterproScan (http://www.ebi.ac.uk/InterProScan/) in combination with the Interpro2Go program. For biological processes, the number of
detected domains was re-calculated as number of non-redundant proteins.
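The three exclusion rules stated above for the final gene set (incomplete open reading frames, proteins shorter than 50 amino acids, and models corresponding to organellar DNA) amount to a simple filter. The sketch below is hypothetical: the `GeneModel` record and its field names are invented for illustration, and it assumes that ORF completeness and organellar origin have already been determined upstream.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    name: str
    protein_length: int  # amino acids
    complete_orf: bool   # assumed to be decided by the gene predictor
    organellar: bool     # assumed to be decided by a separate homology search


def keep_model(model, min_protein_length=50):
    """Apply the three exclusion rules described in the text."""
    return (
        model.complete_orf
        and model.protein_length >= min_protein_length
        and not model.organellar
    )


def final_gene_set(models, min_protein_length=50):
    """Return only the models that pass all three rules."""
    return [m for m in models if keep_model(m, min_protein_length)]
```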
The predicted rice proteome was searched using BLASTP against the
proteomes of several model species for which a complete genome sequence
and deduced protein set were available. Each rice chromosome was searched
against the TIGR rice gene index (http://www.tigr.org/tdb/tgi/ogi/) and against
gene index entries that aligned to gene models corresponding to expressed genes.
In addition, five cereal gene indices (http://www.tigr.org/tdb/tgi/) were searched
against the rice chromosomes, and gene index matches were recorded. We
searched the Oryza sativa ssp. japonica cv. Nipponbare collection of full-length
cDNAs (ftp://cdna01.dna.affrc.go.jp/pub/data/), after first removing the transposable-element-related sequences, against the FGENESH models.

Table 3 | Transposons in the rice genome

| Element | Copy no. (×10³) | Coverage (kb) | Fraction of genome (%) |
|---|---|---|---|
| Class I | | | |
| LINEs | 9.6 | 4161.3 | 1.12 |
| SINEs | 1.8 | 209.9 | 0.06 |
| Ty1/copia | 11.6 | 14266.7 | 3.85 |
| Ty3/gypsy | 23.5 | 40363.3 | 10.90 |
| Other class I | 15.4 | 12733.3 | 3.43 |
| Total class I | 61.9 | 71734.4 | 19.35 |
| Class II | | | |
| hAT | 1.1 | 1405.9 | 0.38 |
| CACTA | 10.8 | 9987.3 | 2.69 |
| IS630/Tc1/mariner | 67.0 | 8388.3 | 2.26 |
| IS256/Mutator | 8.8 | 13485.7 | 3.64 |
| IS5/Tourist | 57.9 | 12095.8 | 3.26 |
| Other class II | 18.2 | 2703.6 | 0.73 |
| Total class II | 163.8 | 48066.6 | 12.96 |
| Other TEs | 23.6 | 6797.7 | 1.80 |
| Total TEs | 249.3 | 129019.3* | 34.79 |

TE, transposable element.
* Total length; corrected for 2420.7 kb in overlaps of multiple, non-nested elements.
Gene models with rice full-length cDNA, EST or cereal EST matches but
without identifiable homologues in the Arabidopsis genome were searched for
conserved domains/motifs using InterproScan, and for homologues in the
Swiss-Prot database (http://us.expasy.org/sprot/) using BLASTP. All proteins
with positive BLAST matches were further compared with the nr database
(http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html#protein_databases), using
BLASTP to eliminate truncated proteins and those with matches to other dicots.
Tandem gene families. The rice genome was subjected to a BLASTP search as
previously described32. The search was also performed by permitting more than
one unrelated gene within the arrays, and the limit of the search was set to 5-Mb
intervals to exclude large chromosomal duplications.
Non-coding RNAs. Transfer-RNA genes were detected by the program tRNAscan-SE (http://www.genetics.wustl.edu/eddy/tRNAscan-SE/). The miRNA registry in the Rfam database (http://www.sanger.ac.uk/Software/Rfam/) was used
as a reference database for miRNAs. In addition, experimentally validated
miRNAs of other species, excluding Arabidopsis miRNAs, were used for BLASTN
queries against the pseudomolecules. Spliceosomal and snoRNAs were retrieved
from the Rfam database and used for queries. BLASTN was used to find the
location of snoRNAs and spliceosomal RNAs in the pseudomolecules.
Organellar insertions. Oryza sativa ssp. japonica Nipponbare chloroplast
(GenBank NC_001320) and mitochondrial (GenBank BA000029) sequences
were aligned with the pseudomolecules using BLASTN and MUMmer49.
Transposable elements. The TIGR Oryza Repeat Database, together with other
published and unpublished rice transposable element sequences, was used to
create RTEdb (a rice transposable element database)50 and determine transposable element coordinates on the rice pseudomolecules. In the case of hAT, IS256/
Mutator, IS5/Tourist and IS630/Tc1/mariner elements, family-specific profile
hidden Markov models were applied using HMMER41 (http://hmmer.wustl.edu/).
The remaining superfamilies were annotated using RepeatMasker
(http://www.repeatmasker.org/).
Tos17 insertions. Flanking sequences of transposed copies of 6,278 Tos17
insertion lines were isolated by modified thermal asymmetric interlaced
(TAIL)-PCR and suppression PCR, and screened against the pseudomolecule
sequences.
SNP discovery. BAC clones from an O. sativa ssp. indica var. Kasalath BAC
library were end-sequenced. Sequence reads were omitted if they contained more
than 50% nucleotides of low quality or high similarity to known repeats. The
remaining sequences were subjected to BLASTN analysis against the pseudomolecules. Gaps within the alignments were classified as small insertions/
deletions.
SSR loci. The Simple Sequence Repeat Identification Tool (http://www.gramene.org/)
was used to identify simple sequence repeat motifs, and the physical
position of all Class 1 SSRs was recorded. The copy number of SSR markers was
estimated using electronic (e)-PCR to determine the number of independent hits
of primer pairs on the pseudomolecules.
Whole-genome shotgun assembly analysis. Contigs from the BGI 6.28×
whole genome assembly of O. sativa ssp. indica 93-11 (GenBank/DDBJ/EMBL
accession number AAAA02000001–AAAA02050231) and the Syngenta 6×
whole genome assembly of O. sativa ssp. japonica cv. Nipponbare
(AACV01000001–AACV01035047; ref. 48) were aligned with the pseudomolecules using MUMmer49. The number of IRGSP Nipponbare full-length cDNA-supported gene models completely covered by the aligned contigs was tabulated.
The 155-bp CentO consensus sequence was used for BLAST analysis against the
93-11 and Nipponbare whole-genome shotgun contigs, and the coordinates of
the positive hits recorded. Locations of centromeres for each indica chromosome
were obtained with the CentO sequence positions on the IRGSP pseudomolecule
of the corresponding chromosome. A detailed comparison of the BGI-assembled
and -mapped Syngenta contigs (AACV01000001–AACV01000070) and the 93-11
contigs (AAAA02000001–AAAA02000093) was obtained by BLAST analysis
against the IRGSP chromosome 1 pseudomolecule.
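Non-redundant coverage of a pseudomolecule by aligned contigs can be computed by merging overlapping alignment intervals before summing their lengths, so that duplicate coverage is counted once. The sketch below illustrates that idea only; it does not parse MUMmer output, the function name is hypothetical, and intervals are assumed to be 0-based, half-open coordinates on the pseudomolecule.

```python
def nonredundant_coverage(intervals, target_length):
    """Fraction of a target sequence covered by at least one interval.

    intervals: iterable of (start, end) pairs, 0-based and half-open.
    Overlapping or touching intervals are merged so that no base is
    counted twice.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / target_length
```

For a 100-bp target with alignments at 0–50, 40–70 and 90–100, the merged blocks span 80 bp, giving a non-redundant coverage of 0.8 even though the raw alignment lengths sum to 90 bp.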
Detailed procedures for the analyses described above can be found in the
Supplementary Information.
Received 29 December 2004; accepted 25 May 2005.
1. Peng, S., Cassman, K. G., Virmani, S. S., Sheehy, J. & Khush, G. S. Yield potential trends of tropical rice since the release of IR8 and the challenge of increasing rice yield potential. Crop Sci. 39, 1552–1559 (1999).
2. Peng, S. et al. Rice yields decline with higher night temperature from global warming. Proc. Natl Acad. Sci. USA 101, 9971–9975 (2004).
3. Sasaki, T. & Burr, B. International Rice Genome Sequencing Project: the effort to completely sequence the rice genome. Curr. Opin. Plant Biol. 3, 138–141 (2000).
4. Moore, G., Devos, K. M., Wang, Z. & Gale, M. D. Cereal genome evolution: Grasses, line up and form a circle. Curr. Biol. 5, 737–739 (1995).
5. Sasaki, T. et al. The genome sequence and structure of rice chromosome 1. Nature 420, 312–316 (2002).
6. Feng, Q. et al. Sequence and analysis of rice chromosome 4. Nature 420, 316–320 (2002).
7. Rice Chromosome 10 Sequencing Consortium. In-depth view of structure, activity, and evolution of rice chromosome 10. Science 300, 1566–1569 (2003).
8. Wu, J. et al. Composition and structure of the centromeric region of rice chromosome 8. Plant Cell 16, 967–976 (2004).
9. Zhang, Y. et al. Structural features of the rice chromosome 4 centromere. Nucleic Acids Res. 32, 2023–2030 (2004).
10. Nagaki, K. et al. Sequencing of a rice centromere uncovers active genes. Nature Genet. 36, 138–145 (2004).
11. Guyot, R. & Keller, B. Ancestral genome duplication in rice. Genome 47, 610–614 (2004).
12. Simillion, C., Vandepoele, K., Saeys, Y. & Van de Peer, Y. Building genomic profiles for uncovering segmental homology in the twilight zone. Genome Res. 14, 1095–1106 (2004).
13. Paterson, A. H., Bowers, J. E. & Chapman, B. A. Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc. Natl Acad. Sci. USA 101, 9903–9908 (2004).
14. Salse, J., Piegu, B., Cooke, R. & Delseny, M. New in silico insight into the synteny between rice (Oryza sativa L.) and maize (Zea mays L.) highlights reshuffling and identifies new duplications in the rice genome. Plant J. 38, 396–409 (2004).
15. Lai, J. et al. Gene loss and movement in the maize genome. Genome Res. 14, 1924–1931 (2004).
16. Harushima, Y. et al. A high-density rice genetic linkage map with 2275 markers using a single F2 population. Genetics 148, 479–494 (1998).
17. Yamamoto, K. & Sasaki, T. Large-scale EST sequencing in rice. Plant Mol. Biol. 35, 135–144 (1997).
18. Saji, S. et al. A physical map with yeast artificial chromosome (YAC) clones covering 63% of the 12 rice chromosomes. Genome 44, 32–37 (2001).
19. Wu, J. et al. A comprehensive rice transcript map containing 6591 expressed sequence tag sites. Plant Cell 14, 525–535 (2002).
20. Chen, M. et al. An integrated physical and genetic map of the rice genome. Plant Cell 14, 537–545 (2002).
21. Mao, L. et al. Rice transposable elements: a survey of 73,000 sequence-tagged-connectors. Genome Res. 10, 982–990 (2000).
22. Barry, G. F. The use of the Monsanto draft rice genome sequence in research. Plant Physiol. 125, 1164–1165 (2001).
23. Goff, S. A. et al. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296, 92–100 (2002).
24. Ohmido, N., Kijima, K., Akiyama, Y., de Jong, J. H. & Fukui, K. Quantification of total genomic DNA and selected repetitive sequences reveals concurrent changes in different DNA families in indica and japonica rice. Mol. Gen. Genet. 263, 388–394 (2000).
25. Dong, F. et al. Rice (Oryza sativa) centromeric regions consist of complex DNA. Proc. Natl Acad. Sci. USA 95, 8135–8140 (1998).
26. Cheng, Z. et al. Functional rice centromeres are marked by a satellite repeat and a centromere-specific retrotransposon. Plant Cell 14, 1691–1704 (2002).
27. Kikuchi, S. et al. Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice. Science 301, 376–379 (2003).
28. Castelli, V. et al. Whole genome sequence comparisons and “full-length” cDNA sequences: a combined approach to evaluate and improve Arabidopsis genome annotation. Genome Res. 14, 406–413 (2004).
29. Hirochika, H., Sugimoto, K., Otsuki, Y., Tsugawa, H. & Kanda, M. Retrotransposons of rice involved in mutations induced by tissue culture. Proc. Natl Acad. Sci. USA 93, 7783–7788 (1996).
30. Miyao, A. et al. Target site specificity of the Tos17 retrotransposon shows a preference for insertion within genes and against insertion in retrotransposon-rich regions of the genome. Plant Cell 15, 1771–1780 (2003).
31. Alonso, J. M. et al. Genome-wide insertional mutagenesis of Arabidopsis thaliana. Science 301, 653–657 (2003).
32. Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
33. Song, R., Llaca, V. & Messing, J. Mosaic organization of orthologous sequences in grass genomes. Genome Res. 12, 1549–1555 (2002).
34. Shishido, R., Sano, Y. & Fukui, K. Ribosomal DNAs: an exception to the conservation of gene order in rice genomes. Mol. Gen. Genet. 263, 586–591 (2000).
35. Oono, K. & Sugiura, M. Heterogeneity of the ribosomal RNA gene clusters in rice. Chromosoma 76, 85–89 (1980).
36. Kamisugi, Y. et al. Physical mapping of the 5S ribosomal RNA genes on rice chromosome 11. Mol. Gen. Genet. 245, 133–138 (1994).
37. Bartel, D. P. MicroRNAs: Genomics, biogenesis, mechanism, and function. Cell 116, 281–297 (2004).
38. Wang, X. J., Reyes, J. L., Chua, N. H. & Gaasterland, T. Prediction and identification of Arabidopsis thaliana microRNAs and their mRNA targets. Genome Biol. 5, R65 (2004).
39. Wang, J. F., Zhou, H., Chen, Y. Q., Luo, Q. J. & Qu, L. H. Identification of 20 microRNAs from Oryza sativa. Nucleic Acids Res. 32, 1688–1695 (2004).
40. Turcotte, K., Srinivasan, S. & Bureau, T. Survey of transposable elements from rice genomic sequences. Plant J. 25, 169–179 (2001).
41. Eddy, S. R. Profile hidden Markov models. Bioinformatics 14, 755–763 (1998).
42. Messing, J. et al. Sequence composition and genome organization of maize. Proc. Natl Acad. Sci. USA 101, 14349–14354 (2004).
43. Shen, Y. J. et al. Development of genome-wide DNA polymorphism database for map-based cloning of rice genes. Plant Physiol. 135, 1198–1205 (2004).
44. Feltus, F. A. et al. An SNP resource for rice genetics and breeding based on subspecies indica and japonica genome alignments. Genome Res. 14, 1812–1819 (2004).
45. McCouch, S. R. et al. Development and mapping of 2240 new SSR markers for rice (Oryza sativa L.). DNA Res. 9, 257–279 (2002).
46. Causse, M. A. et al. Saturated molecular map of the rice genome based on an interspecific backcross population. Genetics 138, 1251–1274 (1994).
47. Yu, J. et al. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296, 79–92 (2002).
48. Yu, J. et al. The genomes of Oryza sativa: A history of duplications. PLoS Biol. 3, e38 (2005).
49. Delcher, A. L. et al. Alignment of whole genomes. Nucleic Acids Res. 27, 2369–2376 (1999).
50. Juretic, N., Bureau, T. E. & Bruskiewich, R. M. Transposable element annotation of the rice genome. Bioinformatics 20, 155–160 (2004).
Supplementary Information is linked to the online version of the paper at
www.nature.com/nature.
Acknowledgements Work at the RGP was supported by the Ministry of
Agriculture, Forestry and Fisheries of Japan. Work at TIGR was supported by
grants to C.R.B. from the USDA Cooperative State Research, Education and
Extension Service–National Research Initiative, the National Science Foundation
and the US Department of Energy. Work at the NCGR was supported by the
Chinese Ministry of Science and Technology, the Chinese Academy of Sciences,
the Shanghai Municipal Commission of Science and Technology, and the
National Natural Science Foundation of China. Work at Genoscope was
supported by le Ministère de la Recherche, France. Funding for the work at the
AGI and AGCoL was provided by grants to R.A.W. and C.S. from the USDA
Cooperative State Research, Education and Extension Service–National Research
Initiative, the National Science Foundation, the US Department of Energy and
the Rockefeller Foundation. Work at CSHL was supported by grants from the
USDA Cooperative State Research, Education and Extension Service–National
Research Initiative and from the National Science Foundation. Work at the
ASPGC was supported by Academia Sinica, National Science Council, Council of
Agriculture, and Institute of Botany, Academia Sinica. The IIRGS acknowledges
the Department of Biotechnology, Government of India, for financial assistance
and the Indian Council of Agricultural Research, New Delhi, for support. Work at
Rice Gene Discovery was supported by BIOTECH and the Princess Sirindhorn’s
Plant Germplasm Conservation Initiative Program. Work at PGIR was supported
by Rutgers University. The BRIGI was supported by Coordenação de
Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Conselho Nacional de
Desenvolvimento Científico e Tecnológico (CNPq), Financiadora de Estudos e
Projetos - Ministério de Ciência e Tecnologia (FINEP-MCT), Fundação de
Amparo a Pesquisa do Rio Grande do Sul (FAPERGS) and Universidade Federal
de Pelotas (UFPel). Work at McGill and York Universities was supported by the
National Science and Engineering Research Council of Canada and the Canadian
International Development Agency. Funding for H.H. at the National Institute of
Agrobiological Sciences was from the Ministry of Agriculture, Forestry, and
Fisheries of Japan, and the Program for Promotion of Basic Research Activities
for Innovative Biosciences. Funding at Brookhaven National Laboratory was from
The Rockefeller Foundation and the Office of Basic Energy Science of the United
States Department of Energy. We would like to thank G. Barry and S. Goff for
their help in negotiating agreements that permitted the sharing of materials and
sequence with the IRGSP. We also acknowledge the work of G. Barry, S. Goff
and their colleagues in facilitating the transfer of sequence information and
supporting data.
Author Information The genomic sequence is available under accession
numbers AP008207–AP008218 in international databases (DDBJ, GenBank and
EMBL). Reprints and permissions information is available at npg.nature.com/
reprintsandpermissions. The authors declare no competing financial interests.
Correspondence and requests for materials should be addressed to Takuji
Sasaki ([email protected]).
© 2005 Nature Publishing Group
NATURE|Vol 436|11 August 2005
International Rice Genome Sequencing Project (Participants are arranged by area of contribution and then by institution.)
Physical Maps and Sequencing: Rice Genome Research Program (RGP) Takashi Matsumoto1, Jianzhong Wu1, Hiroyuki Kanamori1, Yuichi
Katayose1, Masaki Fujisawa1, Nobukazu Namiki1, Hiroshi Mizuno1, Kimiko Yamamoto1, Baltazar A. Antonio1, Tomoya Baba1, Katsumi Sakata1,
Yoshiaki Nagamura1, Hiroyoshi Aoki1, Koji Arikawa1, Kohei Arita1, Takahito Bito1, Yoshino Chiden1, Nahoko Fujitsuka1, Rie Fukunaka1, Masao
Hamada1, Chizuko Harada1, Akiko Hayashi1, Saori Hijishita1, Mikiko Honda1, Satomi Hosokawa1, Yoko Ichikawa1, Atsuko Idonuma1, Masumi
Iijima1, Michiko Ikeda1, Maiko Ikeno1, Kazue Ito1, Sachie Ito1, Tomoko Ito1, Yuichi Ito1, Yukiyo Ito1, Aki Iwabuchi1, Kozue Kamiya1, Wataru
Karasawa1, Kanako Kurita1, Satoshi Katagiri1, Ari Kikuta1, Harumi Kobayashi1, Noriko Kobayashi1, Kayo Machita1, Tomoko Maehara1,
Masatoshi Masukawa1, Tatsumi Mizubayashi1, Yoshiyuki Mukai1, Hideki Nagasaki1, Yuko Nagata1, Shinji Naito1, Marina Nakashima1, Yuko
Nakama1, Yumi Nakamichi1, Mari Nakamura1, Ayano Meguro1, Manami Negishi1, Isamu Ohta1, Tomoya Ohta1, Masako Okamoto1, Nozomi
Ono1, Shoko Saji1, Miyuki Sakaguchi1, Kumiko Sakai1, Michie Shibata1, Takanori Shimokawa1, Jianyu Song1, Yuka Takazaki1, Kimihiro
Terasawa1, Mika Tsugane1, Kumiko Tsuji1, Shigenori Ueda1, Kazunori Waki1, Harumi Yamagata1, Mayu Yamamoto1, Shinichi Yamamoto1,
Hiroko Yamane1, Shoji Yoshiki1, Rie Yoshihara1, Kazuko Yukawa1, Huisun Zhong1, Masahiro Yano1, Takuji Sasaki (Principal Investigator)1 ;
The Institute for Genomic Research (TIGR) Qiaoping Yuan2, Shu Ouyang2, Jia Liu2, Kristine M. Jones2, Kristen Gansberger2, Kelly Moffat2,
Jessica Hill2, Jayati Bera2, Douglas Fadrosh2, Shaohua Jin2, Shivani Johri2, Mary Kim2, Larry Overton2, Matthew Reardon2, Tamara Tsitrin2,
Hue Vuong2, Bruce Weaver2, Anne Ciecko2, Luke Tallon2, Jacqueline Jackson2, Grace Pai2, Susan Van Aken2, Terry Utterback2, Steve
Reidmuller2, Tamara Feldblyum2, Joseph Hsiao2, Victoria Zismann2, Stacey Iobst2, Aymeric R. de Vazeille2, C. Robin Buell (Principal
Investigator)2; National Center for Gene Research Chinese Academy of Sciences (NCGR) Kai Ying3, Ying Li3, Tingting Lu3, Yuchen
Huang3, Qiang Zhao3, Qi Feng3, Lei Zhang3, Jingjie Zhu3, Qijun Weng3, Jie Mu3, Yiqi Lu3, Danlin Fan3, Yilei Liu3, Jianping Guan3, Yujun
Zhang3, Shuliang Yu3, Xiaohui Liu3, Yu Zhang3, Guofan Hong3, Bin Han (Principal Investigator)3; Genoscope Nathalie Choisne4, Nadia
Demange4, Gisela Orjeda4, Sylvie Samain4, Laurence Cattolico4, Eric Pelletier4, Arnaud Couloux4, Beatrice Segurens4, Patrick Wincker4,
Angelique D’Hont5, Claude Scarpelli4, Jean Weissenbach4, Marcel Salanoubat4, Francis Quetier (Principal Investigator)4; Arizona
Genomics Institute (AGI) and Arizona Genomics Computational Laboratory (AGCol) Yeisoo Yu6, Hye Ran Kim6, Teri Rambo6, Jennifer
Currie6, Kristi Collura6, Meizhong Luo6, Tae-Jin Yang6, Jetty S. S. Ammiraju6, Friedrich Engler6, Carol Soderlund6, Rod A. Wing (Principal
Investigator)6; Cold Spring Harbor Laboratory (CSHL) Lance E. Palmer7, Melissa de la Bastide7, Lori Spiegel7, Lidia Nascimento7, Theresa
Zutavern7, Andrew O’Shaughnessy7, Sujit Dike7, Neilay Dedhia7, Raymond Preston7, Vivekanand Balija7, W. Richard McCombie (Principal
Investigator)7; Academia Sinica Plant Genome Center (ASPGC) Teh-Yuan Chow8, Hong-Hwa Chen9, Mei-Chu Chung8, Ching-San
Chen8, Jei-Fu Shaw8, Hong-Pang Wu8, Kwang-Jen Hsiao10, Ya-Ting Chao8, Mu-kuei Chu8, Chia-Hsiung Cheng8, Ai-Ling Hour8, Pei-Fang
Lee8, Shu-Jen Lin8, Yao-Cheng Lin8, John-Yu Liou8, Shu-Mei Liu8, Yue-Ie Hsing (Principal Investigator)8; Indian Initiative for Rice Genome
Sequencing (IIRGS), University of Delhi South Campus (UDSC) S. Raghuvanshi11, A. Mohanty11, A. K. Bharti11,13, A. Gaur11, V. Gupta11, D.
Kumar11, V. Ravi11, S. Vij11, A. Kapur11, Parul Khurana11, Paramjit Khurana11, J. P. Khurana11, A. K. Tyagi (Principal Investigator)11; Indian
Initiative for Rice Genome Sequencing (IIRGS), Indian Agricultural Research Institute (IARI) K. Gaikwad12, A. Singh12, V. Dalal12, S.
Srivastava12, A. Dixit12, A. K. Pal12, I. A. Ghazi12, M. Yadav12, A. Pandit12, A. Bhargava12, K. Sureshbabu12, K. Batra12, T. R. Sharma12, T.
Mohapatra12, N. K. Singh (Principal Investigator)12; Plant Genome Initiative at Rutgers (PGIR) Joachim Messing (Principal Investigator)13,
Amy Bronzino Nelson13, Galina Fuks13, Steve Kavchok13, Gladys Keizer13, Eric Linton13, Victor Llaca13, Rentao Song13, Bahattin Tanyolac13,
Steve Young13; Korea Rice Genome Research Program (KRGRP) Kim Ho-Il14, Jang Ho Hahn (Principal Investigator)14; National Center for
Genetic Engineering and Biotechnology (BIOTEC) G. Sangsakoo15, A. Vanavichit (Principal Investigator)15; Brazilian Rice Genome
Initiative (BRIGI) Luiz Anderson Teixeira de Mattos16, Paulo Dejalma Zimmer16, Gaspar Malone16, Odir Dellagostin16, Antonio Costa de
Oliveira (Principal Investigator)16; John Innes Centre (JIC) Michael Bevan17, Ian Bancroft17; Washington University School of Medicine
Genome Sequencing Center Pat Minx18, Holly Cordum18, Richard Wilson18; University of Wisconsin–Madison Zhukuan Cheng19, Weiwei
Jin19, Jiming Jiang19, Sally Ann Leong20
Annotation and Analysis: Hisakazu Iwama21, Takashi Gojobori21,22, Takeshi Itoh22,23, Yoshihito Niimura24, Yasuyuki Fujii25, Takuya
Habara25, Hiroaki Sakai23,25, Yoshiharu Sato22, Greg Wilson26, Kiran Kumar27, Susan McCouch26, Nikoleta Juretic28, Douglas Hoen28,
Stephen Wright29, Richard Bruskiewich30, Thomas Bureau28, Akio Miyao23, Hirohiko Hirochika23, Tomotaro Nishikawa23, Koh-ichi
Kadowaki23 & Masahiro Sugiura31
Coordination: Benjamin Burr32
Affiliations for participants: 1National Institute of Agrobiological Sciences/Institute of the Society for Techno-innovation of Agriculture, Forestry and Fisheries, 2-1-2 Kannondai,
Tsukuba, Ibaraki 305-8602, Japan. 2The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA. 3Shanghai Institutes for Biological
Sciences, Chinese Academy of Sciences (CAS), 500 Caobao Road, Shanghai 200233, China. 4Centre National de Séquençage, INRA-URGV, and CNRS UMR-8030, 2, rue Gaston
Crémieux, CP 5706, 91057 EVRY Cedex, France. 5UMR PIA, Cirad-Amis, TA40-03 avenue Agropolis, 34398 Montpellier Cedex 05, France. 6Department of Plant Sciences, BIO5
Institute, The University of Arizona, Tucson, Arizona 85721, USA. 7Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11723, USA. 8Institute of Botany, Academia
Sinica, 128, Sec. 2, Yen-Chiu-Yuan Rd, Nankang, Taipei 11529, Taiwan. 9National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan 701, Taiwan. 10National Yang-Ming
University, 155, Sec. 2, Li-Nong St, Peitou, Taipei 112, Taiwan. 11Department of Plant Molecular Biology, University of Delhi South Campus, New Delhi 110021, India. 12National
Research Centre on Plant Biotechnology, Indian Agricultural Research Institute, New Delhi 110012, India. 13Waksman Institute, Rutgers University, Piscataway, New Jersey
08854, USA. 14National Institute of Agricultural Science and Technology, RDA, Suwon, 441-707 Republic of Korea. 15Rice Gene Discovery Unit, Kasetsart University, Nakron
Pathom 73140, Thailand. 16Centro de Genomica e Fitomelhoramento, UFPel, Pelotas, RS, l 96001-970, Brazil. 17John Innes Centre, Norwich Research Park, Colney, Norwich NR4
7UH, UK. 18Washington University Genome Sequencing Center, 3333 Forest Park Boulevard, St. Louis, Missouri 63108, USA. 19University of Wisconsin, Department of
Horticulture, Madison, Wisconsin 53706, USA. 20University of Wisconsin, Department of Plant Pathology, Madison, Wisconsin 53706, USA. 21Center for Information Biology and
DNA Data Bank of Japan, National Institute of Genetics, Mishima 411-8540, Japan. 22Biological Information Research Center, National Institute of Advanced Industrial Science
and Technology, Koto-ku, Tokyo 135-0064, Japan. 23National Institute of Agrobiological Sciences, Tsukuba, Ibaraki 305-8602, Japan. 24Medical Research Institute, Tokyo
Medical and Dental University, Bunkyo-ku, Tokyo 113-8510, Japan. 25Japan Biological Information Research Center, Japan Biological Informatics Consortium, Koto-ku, Tokyo 135-0064, Japan. 26Plant Breeding Dept, Cornell University, Ithaca, New York 14850-1901, USA. 27Cold Spring Harbor Laboratory, PO Box 100, 1 Bungtown Road, Cold Spring Harbor,
New York 11724, USA. 28Department of Biology, McGill University, 1205 Dr Penfield Avenue, Montreal, Quebec H3A 1B1, Canada. 29Department of Biology, York University,
4700 Keele Street, Toronto, Ontario M3J 1P3, Canada. 30Biometrics and Bioinformatics Unit, International Rice Research Institute, DAPO Box 7777, Metro Manila, Philippines.
31Graduate School of Natural Sciences, Nagoya City University, Nagoya 467-8501, Japan. 32Biology Department, Brookhaven National Laboratory, Upton, New York 11973, USA.
RESEARCH ARTICLES
G. A. Tuskan,1,3* S. DiFazio,1,4† S. Jansson,5† J. Bohlmann,6† I. Grigoriev,9† U. Hellsten,9†
N. Putnam,9† S. Ralph,6† S. Rombauts,10† A. Salamov,9† J. Schein,11† L. Sterck,10† A. Aerts,9
R. R. Bhalerao,5 R. P. Bhalerao,12 D. Blaudez,13 W. Boerjan,10 A. Brun,13 A. Brunner,14
V. Busov,15 M. Campbell,16 J. Carlson,17 M. Chalot,13 J. Chapman,9 G.-L. Chen,2 D. Cooper,6
P. M. Coutinho,19 J. Couturier,13 S. Covert,20 Q. Cronk,7 R. Cunningham,1 J. Davis,22
S. Degroeve,10 A. Déjardin,23 C. dePamphilis,18 J. Detter,9 B. Dirks,24 I. Dubchak,9,25
S. Duplessis,13 J. Ehlting,7 B. Ellis,6 K. Gendler,26 D. Goodstein,9 M. Gribskov,27 J. Grimwood,28
A. Groover,29 L. Gunter,1 B. Hamberger,7 B. Heinze,30 Y. Helariutta,12,31,33 B. Henrissat,19
D. Holligan,21 R. Holt,11 W. Huang,9 N. Islam-Faridi,34 S. Jones,11 M. Jones-Rhoades,35
R. Jorgensen,26 C. Joshi,15 J. Kangasjärvi,32 J. Karlsson,5 C. Kelleher,6 R. Kirkpatrick,11
M. Kirst,22 A. Kohler,13 U. Kalluri,1 F. Larimer,2 J. Leebens-Mack,21 J.-C. Leplé,23 P. Locascio,2
Y. Lou,9 S. Lucas,9 F. Martin,13 B. Montanini,13 C. Napoli,26 D. R. Nelson,36 C. Nelson,37
K. Nieminen,31 O. Nilsson,12 V. Pereda,13 G. Peter,22 R. Philippe,6 G. Pilate,23 A. Poliakov,25
J. Razumovskaya,2 P. Richardson,9 C. Rinaldi,13 K. Ritland,8 P. Rouzé,10 D. Ryaboy,25
J. Schmutz,28 J. Schrader,38 B. Segerman,5 H. Shin,11 A. Siddiqui,11 F. Sterky,39 A. Terry,9
C.-J. Tsai,15 E. Uberbacher,2 P. Unneberg,39 J. Vahala,32 K. Wall,18 S. Wessler,21 G. Yang,21
T. Yin,1 C. Douglas,7‡ M. Marra,11‡ G. Sandberg,12‡ Y. Van de Peer,10‡ D. Rokhsar9,24‡
We report the draft genome of the black cottonwood tree, Populus trichocarpa. Integration of shotgun
sequence assembly with genetic mapping enabled chromosome-scale reconstruction of the genome.
More than 45,000 putative protein-coding genes were identified. Analysis of the assembled genome
revealed a whole-genome duplication event; about 8000 pairs of duplicated genes from that event
survived in the Populus genome. A second, older duplication event is indistinguishably coincident with
the divergence of the Populus and Arabidopsis lineages. Nucleotide substitution, tandem gene
duplication, and gross chromosomal rearrangement appear to proceed substantially more slowly in
Populus than in Arabidopsis. Populus has more protein-coding genes than Arabidopsis, ranging on
average from 1.4 to 1.6 putative Populus homologs for each Arabidopsis gene. However, the relative
frequency of protein domains in the two genomes is similar. Overrepresented exceptions in Populus
include genes associated with lignocellulosic wall biosynthesis, meristem development, disease
resistance, and metabolite transport.
Forests cover 30% (about 3.8 billion ha)
of Earth's terrestrial surface, harbor substantial biodiversity, and provide humanity
with benefits such as clean air and water, lumber,
fiber, and fuels. Worldwide, one-quarter of all
industrial feedstocks have their origins in forest-based resources (1). Large and long-lived forest
trees grow in extensive wild populations across
continents, and they have evolved under selective
pressures unlike those of annual herbaceous plants.
Their growth and development involves extensive
secondary growth, coordinated signaling and distribution of water and nutrients over great distances, and strategic storage and redistribution of
metabolites in concordance with interannual climatic cycles. Their need to survive and thrive in
fixed locations over centuries under continually
changing physical and biotic stresses also sets them
apart from short-lived plants. Many of the features
that distinguish trees from other organisms,
especially their large sizes and long-generation
times, present challenges to the study of the
cellular and molecular mechanisms that underlie
their unique biology. To enable and facilitate such
investigations in a relatively well-studied model
tree, we describe here the draft genome of black
cottonwood, Populus trichocarpa (Torr. & Gray),
and compare it to other sequenced plant genomes.
P. trichocarpa was selected as the model
forest species for genome sequencing not only
because of its modest genome size but also
because of its rapid growth, relative ease of
experimental manipulation, and range of available genetic tools (2, 3). The genus is phenotypically diverse, and interspecific hybrids
facilitate the genetic mapping of economically
important traits related to growth rate, stature,
wood properties, and paper quality. Dozens of
quantitative trait loci have already been mapped
(4), and methods of genetic transformation have
been developed (5). Under appropriate conditions, Populus can reach reproductive maturity in as few as 4 to 6 years, permitting selective
breeding for large-scale sustainable plantation
forestry. Finally, rapid growth of trees coupled
with thermochemical or biochemical conversion of the lignocellulosic portion of the plant
has the potential to provide a renewable energy
resource with a concomitant reduction of greenhouse gases (6–8).
15 SEPTEMBER 2006
VOL 313
SCIENCE
1Environmental Sciences Division, 2Life Sciences Division, Oak
Ridge National Laboratory, Oak Ridge, TN 37831, USA. 3Plant
Sciences Department, University of Tennessee, TN 37996, USA.
4Department of Biology, West Virginia University, Morgantown,
WV 26506, USA. 5Umeå Plant Science Centre, Department of
Plant Physiology, Umeå University, SE-901 87, Umeå, Sweden.
6Michael Smith Laboratories, 7Department of Botany, 8Department of Forest Sciences, University of British Columbia,
Vancouver, BC V6T 1Z4, Canada. 9U.S. Department of Energy,
Joint Genome Institute, Walnut Creek, CA 94598, USA.
10Department of Plant Systems Biology, Flanders Interuniversity
Institute for Biotechnology (VIB), Ghent University, B-9052
Ghent, Belgium. 11Genome Sciences Centre, 100-570 West 7th
Avenue, Vancouver, BC V5Z 4S6, Canada. 12Umeå Plant
Science Centre, Department of Forest Genetics and Plant
Physiology, Swedish University of Agricultural Sciences, SE-901
83 Umeå, Sweden. 13Tree-Microbe Interactions Unit, Institut
National de la Recherche Agronomique (INRA)–Université Henri
Poincaré, INRA-Nancy, 54280 Champenoux, France. 14Department of Forestry, Virginia Polytechnic Institute and State
University, Blacksburg, VA 24061, USA. 15Biotechnology
Research Center, School of Forest Resources and Environmental
Science, Michigan Technological University, Houghton, MI
49931, USA. 16Department of Cell and Systems Biology,
University of Toronto, 25 Willcocks Street, Toronto, Ontario,
M5S 3B2 Canada. 17School of Forest Resources and Huck
Institutes of the Life Sciences, 18Department of Biology, Institute
of Molecular Evolutionary Genetics, and Huck Institutes of Life
Sciences, The Pennsylvania State University, University Park, PA
16802, USA. 19Architecture et Fonction des Macromolécules
Biologiques, UMR6098, CNRS and Universities of Aix-Marseille I
and II, case 932, 163 avenue de Luminy, 13288 Marseille,
France. 20Warnell School of Forest Resources, 21Department of
Plant Biology, University of Georgia, Athens, GA 30602, USA.
22School of Forest Resources and Conservation, Genetics
Institute, and Plant Molecular and Cellular Biology Program,
University of Florida, Gainesville, FL 32611, USA. 23INRA-Orléans, Unit of Forest Improvement, Genetics and Physiology,
45166 Olivet Cedex, France. 24Center for Integrative Genomics,
University of California, Berkeley, CA 94720, USA. 25Genomics
Division, Lawrence Berkeley National Laboratory, Berkeley, CA
94720, USA. 26Department of Plant Sciences, University of
Arizona, Tucson, AZ 85721, USA. 27Department of Biological
Sciences, Purdue University, West Lafayette, IN 47907, USA.
28The Stanford Human Genome Center and the Department of
Genetics, Stanford University School of Medicine, Palo Alto, CA
94305, USA. 29Institute of Forest Genetics, United States
Department of Agriculture, Forest Service, Davis, CA 95616,
USA. 30Federal Research Centre for Forests, Hauptstrasse 7,
A-1140 Vienna, Austria. 31Plant Molecular Biology Laboratory, Institute of Biotechnology, 32Department of Biological and
Environmental Sciences, University of Helsinki, FI-00014
Helsinki, Finland. 33Department of Biology, 200014, University of Turku, FI-20014 Turku, Finland. 34Southern Institute of
Forest Genetics, United States Department of Agriculture,
Forest Service and Department of Forest Science, Texas A&M
University, College Station, TX 77843, USA. 35Whitehead
Institute for Biomedical Research and Department of Biology,
Massachusetts Institute of Technology, Cambridge, MA 02142,
USA. 36Department of Molecular Sciences and Center of Excellence in Genomics and Bioinformatics, University of
Tennessee, Memphis, TN 38163, USA. 37Southern Institute
of Forest Genetics, United States Department of Agriculture,
Forest Service, Saucier, MS 39574, USA. 38Developmental
Genetics, University of Tübingen, D-72076 Tübingen, Germany. 39Department of Biotechnology, KTH, AlbaNova University
Center, SE-106 91 Stockholm, Sweden.
*To whom correspondence should be addressed. E-mail:
[email protected]
†These authors contributed equally to this work as second
authors.
‡These authors contributed equally to this work as senior
authors.
www.sciencemag.org
Downloaded from www.sciencemag.org on February 28, 2010
The Genome of Black Cottonwood,
Populus trichocarpa (Torr. & Gray)
Sequencing and Assembly
A single female genotype, “Nisqually 1,” was
selected and used in a whole-genome shotgun
Table 1. Characterization of polymorphisms
according to their positions relative to predicted
coding sequences, introns, and untranslated regions
(UTRs). Rate shows the percentage of potential sites
of each class that were polymorphic. Most indels
within exons resulted in frame shifts, but we could
not quantify this due to difficulties with assembly
and sequencing of regions containing indels.
Nonsense mutations created stop codons within
predicted exons.
Source              Number of loci    Rate (%)
Noncoding           1,027,322         0.32
Intron              141,199           0.25
3′ UTR              6,731             0.25
5′ UTR              3,306             0.24
Exon                62,656            0.14
Within exons:
  Indels            2,722             0.01
  Nonsense          926               0.02
  Nonsynonymous     32,207            0.10
ing to chloroplast (fig. S5) and mitochondrial
genomes were assembled into circular genomes
of 157 and 803 kb, respectively (9).
We anchored the 410 Mb of assembled
scaffolds to a sequence-tagged genetic map (fig.
S3). In total, 356 microsatellite markers were used
to assign 155 scaffolds (335 Mb of sequence) to
the 19 P. trichocarpa chromosome-scale linkage
groups (13). The vast majority (91%) of the
mapped microsatellite markers were colinear with
the sequence assembly. At the extremes, the
smallest chromosome, LGIX [79 centimorgans
(cM)], is covered by two scaffolds containing 12.5
Mb of assembled sequence, whereas the largest
chromosome, LGI (265 cM), contains 21 scaffolds
representing 35.5 Mb (fig. S3). We also generated
a physical map based on bacterial artificial
chromosome (BAC) fingerprint contigs using a
Nisqually-1 BAC library representing an estimated
9.5-fold genome coverage (fig. S2). Paired BAC-end sequences from most of the physical map were
linked to the large-scale assembly, permitting 2460
of the physical map contigs to be positioned on the
genome assembly. Combining the genetic and
physical map, nearly 385 Mb of the 410 Mb of
assembled sequence are placed on a linkage group.
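The collinearity figure quoted above (91% of mapped markers in the same order as the assembly) can be illustrated with a small sketch. This is not the authors' procedure, which this section does not describe. It assumes one simple definition of collinearity: the largest subset of markers on a scaffold whose physical order agrees with their genetic order, in either orientation. The function names and the marker coordinates are hypothetical.

```python
from bisect import bisect_left


def longest_increasing_run(values):
    """Length of the longest strictly increasing subsequence."""
    tails = []  # tails[k] = smallest tail of an increasing run of length k + 1
    for v in values:
        i = bisect_left(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(tails)


def collinear_fraction(markers):
    """markers: list of (genetic_cM, physical_bp) pairs for one scaffold.

    Returns the fraction of markers whose physical order is consistent
    with the genetic order, allowing the scaffold to be in either
    orientation relative to the linkage group.
    """
    if not markers:
        raise ValueError("at least one marker is required")
    ordered = [bp for _, bp in sorted(markers)]
    forward = longest_increasing_run(ordered)
    reverse = longest_increasing_run([-bp for bp in ordered])
    return max(forward, reverse) / len(ordered)


# Hypothetical markers: one of six is out of order.
example_markers = [(0.0, 100), (5.1, 250), (9.8, 900),
                   (12.0, 400), (15.5, 600), (20.2, 1200)]
example_fraction = collinear_fraction(example_markers)
```

Allowing both orientations matters because a scaffold can be anchored to a linkage group in reverse without any real disagreement between the map and the assembly.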
Unlike Arabidopsis, where predominantly
self-fertilizing ecotypes maintain low levels of
allelic polymorphism, Populus species are predominantly dioecious, which results in obligate
outcrossing. This compulsory outcrossing, along
with wind pollination and wind-dispersed plumose seeds, results in high levels of gene flow
and high levels of heterozygosity (that is, within-individual
genetic polymorphisms). Within the
heterozygous Nisqually-1 genome, we identified
1,241,251 single-nucleotide polymorphisms
(SNPs) or small insertion/deletion polymorphisms (indels) for an overall rate of approximately 2.6 polymorphisms per kilobase. Of these
polymorphisms, the overwhelming majority
(83%) occurred in noncoding portions of the
genome (Table 1). Short indels and SNPs within
exons resulted in frameshifts and nonsense stop
codons within predicted exons, respectively, suggesting that null alleles of these genes exist in one
of the haplotypes. Some of the polymorphisms
may be artifacts from the assembly process,
although these errors were minimized by using
stringent criteria for SNP identification (9).
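The two summary figures in this paragraph can be checked with a few lines of arithmetic. The counts below come from the text and Table 1. The denominator is an assumption: the section does not say which sequence length the per-kilobase rate refers to, and the estimated 485-Mb genome size is used here only because it reproduces the quoted value of about 2.6.

```python
# Counts taken from the text and Table 1.
TOTAL_POLYMORPHISMS = 1_241_251
NONCODING = 1_027_322

# Assumption: the rate is expressed over the estimated 485-Mb genome.
GENOME_SIZE_KB = 485_000


def per_kilobase(count, length_kb):
    """Polymorphisms per kilobase of sequence."""
    return count / length_kb


def percent(part, whole):
    """Share of `whole` made up by `part`, in percent."""
    return 100.0 * part / whole


rate = per_kilobase(TOTAL_POLYMORPHISMS, GENOME_SIZE_KB)
noncoding_share = percent(NONCODING, TOTAL_POLYMORPHISMS)

print(f"{rate:.1f} polymorphisms per kb")   # 2.6 polymorphisms per kb
print(f"{noncoding_share:.0f}% noncoding")  # 83% noncoding
```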
Gene Annotation
We tentatively identified a first-draft reference set
of 45,555 protein-coding gene loci in the Populus
nuclear genome (www.jgi.doe.gov/poplar)
using a variety of ab initio, homology-based,
and expressed sequence tag (EST)–based methods (14–17) (table S5). Similarly, 101 and 52
genes were annotated in the chloroplast and mitochondrial genomes, respectively (9). To aid the
annotation process, 4664 full-length sequences,
from full-length enriched cDNA libraries from
Nisqually 1, were generated and used in
training the gene-calling algorithms. Before
gene prediction, repetitive sequences were characterized (fig. S15 and table S14) and masked;
additional putative transposable elements were
identified and subsequently removed from the
reference gene set (9). Given the current draft
nature of the genome, we expect that the gene
set in Populus will continue to be refined.
About 89% of the predicted gene models had
homology [expectation (E) value ≤ 1 × 10⁻⁸] to
the nonredundant (NR) set of proteins from the
National Center for Biotechnology Information,
including 60% with extensive homology that
spans 75% of both model and NR protein lengths.
Nearly 12% (5248) of the predicted Populus
genes had no detectable similarity to Arabidopsis
genes (E value ≤ 1 × 10⁻³); conversely, in the
more refined Arabidopsis set, only 9% (2321) of
the predicted genes had no similarity to the
Populus reference set. Of the 5248 Populus genes
without Arabidopsis similarity, 1883 have expression evidence from the manually curated Populus
EST data set, and of these, 274 have no hits (E
value ≥ 1 × 10⁻³) to the NR database (9).
Whole-genome oligonucleotide microarray analysis provided evidence of tissue-based expression
for 53% of the reference gene models (Fig. 1). In
addition, a signal was detected from 20% of
genes that were initially annotated and excluded
from the reference set, suggesting that as many
as 4000 additional genes (or gene fragments)
may be present. Within the reference gene set,
we identified 13,019 pairs of orthologs between
sequence and assembly strategy (9). Roughly 7.6
million end-reads representing 4.2 billion high-quality (i.e., Q20 or higher) base pairs were
assembled into 2447 major scaffolds containing
an estimated 410 megabases (Mb) of genomic
DNA (tables S1 and S2). On the basis of the depth
of coverage of major scaffolds (~7.5 depth) and
the total amount of nonorganellar shotgun
sequence that was generated, the Populus
genome size was estimated to be 485 ± 10 Mb
(±SD), in rough agreement with previous
cytogenetic estimates of about 550 Mb (10).
The near completeness of the shotgun assembly
in protein-coding regions is supported by the
identification of more than 95% of known
Populus cDNA in the assembly.
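The reasoning behind a coverage-based size estimate can be sketched as follows: if shotgun reads sample the genome evenly, the total number of sequenced bases divided by the average depth gives the genome length. This is a minimal sketch of that idea, not the authors' calculation. The amount of nonorganellar sequence is not given in this section, so the input to the function below is a placeholder chosen for illustration; only the 410-Mb and 485-Mb figures are taken from the text.

```python
def genome_size_mb(nonorganellar_bases, depth):
    """Estimate genome size in megabases from total bases and mean depth."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return nonorganellar_bases / depth / 1e6


# Figures quoted in the text.
ASSEMBLED_MB = 410
ESTIMATED_MB = 485

unassembled_mb = ESTIMATED_MB - ASSEMBLED_MB      # 75 Mb not in major scaffolds
assembled_fraction = ASSEMBLED_MB / ESTIMATED_MB  # about 0.85

# Placeholder input (hypothetical): 3.6 billion bases at 7.5-fold depth.
example_size = genome_size_mb(3.6e9, 7.5)         # 480.0 Mb
```

The difference between the estimated and assembled sizes is the unassembled portion of the genome that the text goes on to discuss.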
The ~75 Mb of unassembled genomic sequence is consistent with cytogenetic evidence
that ~30% of the genome is heterochromatic (9).
The amount of euchromatin contained within the
Populus genome was estimated in parallel by
subtraction on the basis of direct measurements
of 4′,6′-diamidino-2-phenylindole–stained prophase and metaphase chromosomes (fig. S4).
On average, 69.5 ± 0.3% of the genome consisted
of euchromatin, with a significantly lower proportion of euchromatin in linkage group I (LGI)
(66.4 ± 1.1%) compared with the other 18 chromosomes (69.7 ± 0.03%, P ≤ 0.05). In contrast,
Arabidopsis chromosomes contain roughly 93%
euchromatin (11). The unassembled shotgun sequences were derived from variants of organellar
DNA, including recent nuclear translocations;
highly repetitive genomic DNA; haplotypic
segments that were redundant with short subsegments of the major scaffolds (separated as a result
of extensive sequence polymorphism and allelic
variants); and contaminants of the template DNA,
such as endophytic microbes inhabiting the leaf
and root tissues used for template preparation (12)
(fig. S1 and table S3). The end-reads correspond-
Fig. 1. Whole-genome oligonucleotide microarray
expression data for all predicted gene models in P.
trichocarpa. Values represent the proportion of genes
expressed above negative
controls at a 5% false discovery rate. The x axis represents the subsets of
predicted genes that were
analyzed for the annotated
and promoted P. trichocarpa
gene set (42,373 genes),
chloroplast gene set (49
genes), mitochondria gene set (49 genes), annotated, nonpromoted gene set (10,875 genes), and
microRNAs (48 miRNAs).
genes in Populus and Arabidopsis using the
best bidirectional Basic Local Alignment
Search Tool (BLAST) hits, with average
mutual coverage of these alignments equal to
93%; 11,654 pairs of orthologs had greater
than 90% alignment of gene lengths, whereas
only 156 genes had less than 50% coverage. As
of 1 June 2006, ~10% (4378) of the gene models have
been manually validated and curated.
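The best bidirectional BLAST hit criterion used for the ortholog pairs can be sketched in a few lines: a Populus gene and an Arabidopsis gene are paired only when each is the other's top-scoring hit. The sketch below assumes that hits have already been parsed into dictionaries and that "best" means highest bit score; the section does not state the ranking criterion, and the function name, data layout, and gene identifiers are all hypothetical.

```python
def reciprocal_best_hits(a_to_b, b_to_a):
    """Return sorted (a, b) pairs that are each other's best hit.

    a_to_b and b_to_a map a query gene to a list of
    (subject_gene, bit_score) tuples from the two search directions.
    """
    def best(hits):
        # Keep only the highest-scoring subject for each query.
        return {query: max(subjects, key=lambda s: s[1])[0]
                for query, subjects in hits.items() if subjects}

    best_ab = best(a_to_b)
    best_ba = best(b_to_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical hit tables for three Populus and three Arabidopsis genes.
populus_vs_arabidopsis = {
    "PtA": [("At1", 310.0), ("At2", 120.0)],
    "PtB": [("At1", 250.0)],
    "PtC": [("At3", 90.0)],
}
arabidopsis_vs_populus = {
    "At1": [("PtA", 305.0), ("PtB", 240.0)],
    "At2": [("PtA", 118.0)],
    "At3": [("PtD", 200.0), ("PtC", 88.0)],
}
pairs = reciprocal_best_hits(populus_vs_arabidopsis, arabidopsis_vs_populus)
```

In this toy input only PtA and At1 choose each other, so PtB (a likely paralog of PtA) and PtC are left unpaired, which is the conservative behavior the criterion is meant to give.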
Genome Organization
Genome duplication in the Salicaceae. Populus
and Arabidopsis lineages diverged about 100 to
120 million years ago (Ma). Analysis of the
Populus genome provided evidence of a more
recent duplication event that affected roughly 92%
of the Populus genome. Nearly 8000 pairs of
paralogous genes of similar age (excluding tandem
or local duplications) were identified (Fig. 2). The
relative age of the duplicate genes was estimated
by the accumulated nucleotide divergence at
fourfold synonymous third-codon transversion
position (4DTV) values. A sharp peak in 4DTV
values, corrected for multiple substitutions,
representing a burst of gene duplication, is evident
at 0.0916 ± 0.0004 (Fig. 3A). Comparison of 1825
Populus and Salix orthologous genes derived
from Salix ESTs suggests that both genera share
this whole-genome duplication event (Fig. 3B).
Moreover, the parallel karyotypes and collinear
genetic maps (18) of Salix and Populus also
support the conclusion that both lineages share
the same large-scale genome history.
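To make the 4DTV measure concrete, the sketch below counts transversions at fourfold degenerate third-codon positions for one pair of codon-aligned coding sequences. It is an illustration under stated assumptions, not the pipeline used in the study: the standard genetic code is assumed, gaps are not handled, and the correction for multiple substitutions is assumed to be the common form -0.5 ln(1 - 2p), which the text does not specify.

```python
import math

# Codon prefixes whose third position is fourfold degenerate (standard code).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}
BASES = {"A", "C", "G", "T"}


def four_dtv(cds_a, cds_b):
    """Raw 4DTV: transversions per fourfold degenerate third-codon site.

    cds_a and cds_b must be codon-aligned, gap-free, and equally long.
    """
    if len(cds_a) != len(cds_b) or len(cds_a) % 3:
        raise ValueError("sequences must be codon-aligned and equally long")
    sites = transversions = 0
    for i in range(0, len(cds_a), 3):
        codon_a, codon_b = cds_a[i:i + 3].upper(), cds_b[i:i + 3].upper()
        # Count only codons that share the same fourfold degenerate prefix.
        if codon_a[:2] != codon_b[:2] or codon_a[:2] not in FOURFOLD_PREFIXES:
            continue
        if codon_a[2] not in BASES or codon_b[2] not in BASES:
            continue
        sites += 1
        # A transversion exchanges a purine for a pyrimidine or vice versa.
        if (codon_a[2] in PURINES) != (codon_b[2] in PURINES):
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable fourfold degenerate sites")
    return transversions / sites


def correct_multiple_substitutions(raw):
    """Assumed correction: -0.5 * ln(1 - 2 * raw), defined for raw < 0.5."""
    if raw >= 0.5:
        raise ValueError("raw 4DTV too large to correct")
    return -0.5 * math.log(1.0 - 2.0 * raw)


# Hypothetical three-codon alignment: one transversion, one transition, one match.
raw = four_dtv("CTAGGACCA", "CTCGGGCCA")
corrected = correct_multiple_substitutions(raw)
```

Because transitions are ignored, the measure saturates more slowly than a simple synonymous distance, which is why it is useful for dating older duplications.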
If we naively calibrated the molecular clock
using synonymous rates observed in the Brassicaceae (19) or derived from the Arabidopsis-Oryza
divergence (20), we would conclude that the
genome duplication in Populus is very recent [8
to 13 Ma, as reported by Sterk (21)]. Yet the
fossil record shows that the Populus and Salix
lineages diverged 60 to 65 Ma (22–25). Thus, the
molecular clock in Populus must be ticking at only
Downloaded from www.sciencemag.org on February 28, 2010
Fig. 2. Chromosome-level reorganization of the most recent
genome-wide duplication event in
Populus. Common colors refer to
homologous genome blocks, presumed to have arisen from the
salicoid-specific genome duplication
65 Ma, shared by two chromosomes.
Chromosomes are indicated by their
linkage group number (I to XIX). The
diagram to the left uses the same
color coding and further illustrates
the chimeric nature of most linkage
groups.
Fig. 3. (A) The 4DTV metrics for paralogous gene pairs in Populus-Populus and Populus-Arabidopsis. Three separate genome-wide duplication events are detectable, with the most recent event contained within
the Salicaceae and the middle event apparently shared among the
Eurosids. (B) Percent identity distributions for mutual best EST hit to
Populus trichocarpa CDS.
one-sixth the estimated rate for Arabidopsis (that is,
8 to 13 Ma divided by 60 to 65 Ma). Qualitatively
similar slowing of the molecular clock is found in
the Populus chloroplast and mitochondrial
genomes (9). Because Populus is a long-lived
vegetatively propagated species, it has the potential
to successfully contribute gametes to multiple
generations. A single Populus genotype can
persist as a clone on the landscape for millennia
(26), and we propose that recurrent contributions
of ‘‘ancient gametes’’ from very old individuals
could account for the markedly reduced rate of
sequence evolution. As a result of the slowing of
the molecular clock, the Populus genome most
likely resembles the ancestral eurosid genome.
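The "one-sixth" figure follows directly from the two age ranges quoted above. The short calculation below simply divides the naive clock-based age by the fossil-calibrated age; taking the midpoints of the ranges is an assumption made here for illustration.

```python
# Age ranges in millions of years (Ma), both taken from the text above.
clock_estimate = (8.0, 13.0)    # duplication age from a naive molecular clock
fossil_estimate = (60.0, 65.0)  # Populus-Salix divergence from the fossil record

lowest = clock_estimate[0] / fossil_estimate[1]
highest = clock_estimate[1] / fossil_estimate[0]
midpoint = (sum(clock_estimate) / 2) / (sum(fossil_estimate) / 2)

print(f"rate ratio range: {lowest:.2f} to {highest:.2f}; midpoint {midpoint:.3f}")
```

The midpoint ratio is close to 1/6, and the full range spans roughly one-eighth to one-fifth of the Arabidopsis rate.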
To test whether the burst of gene creation 60 to
65 Ma was due to a single whole-genome event or
to independent but near-synchronous gene duplication events, we used a variant of the algorithm of
Hokamp et al. (27) to identify segments of
conserved synteny within the Populus genome.
The longest conserved syntenic block from the
4DTV ~0.09 epoch spanned 765 pairs of
paralogous genes. In total, 32,577 genes were
contained within syntenic blocks from the salicoid
epoch; half of these genes were contained in
segments longer than 142 paralogous pairs. The
same algorithm, when applied to randomly
shuffled genes, typically yields duplicate blocks
with fewer than 8 to 9 genes, indicating that the
Populus gene duplications occurred as a single
genome-wide event. We refer to this duplication
event as the ‘‘salicoid’’ duplication event.
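The logic of the shuffling control can be sketched as follows. This is a deliberately simplified stand-in and not the Hokamp et al. algorithm: paralogous pairs are treated as anchors given by gene-order indices, a block is a run of anchors that stay within an assumed gap in both copies, and the null model shuffles the partner positions. All data and parameter values are hypothetical.

```python
import random


def longest_collinear_block(anchors, max_gap=5):
    """Length of the longest run of anchors that stay close in both copies.

    anchors: list of (position_a, position_b) gene-order indices for
    paralogous pairs. Orientation is not tracked in this simplification.
    """
    longest = current = 0
    previous = None
    for pos_a, pos_b in sorted(anchors):
        if (previous is not None
                and 0 < pos_a - previous[0] <= max_gap
                and 0 < abs(pos_b - previous[1]) <= max_gap):
            current += 1
        else:
            current = 1
        longest = max(longest, current)
        previous = (pos_a, pos_b)
    return longest


# Hypothetical data: 200 paralogous pairs in perfectly conserved order.
duplicated = [(i, 1000 + i) for i in range(200)]

# Null model: keep the first copy fixed and shuffle the partner positions.
rng = random.Random(0)
partners = [b for _, b in duplicated]
rng.shuffle(partners)
shuffled = [(a, b) for (a, _), b in zip(duplicated, partners)]

real_block = longest_collinear_block(duplicated)
null_block = longest_collinear_block(shuffled)
```

Long blocks in the real gene order combined with only short blocks after shuffling is the pattern that argues for a single genome-wide event rather than many independent duplications.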
Nearly every mapped segment of the Populus
genome had a parallel ‘‘paralogous’’ segment
elsewhere in the genome as a result of the
salicoid event (Fig. 2). The pinwheel patterns can
be understood as a whole-genome duplication
followed by a series of reciprocal tandem terminal
fusions between two separate sets of four chromosomes each—the first involving LGII, V, VII,
and XIV and the second involving LGI, XI, IV,
and IX. In addition, several chromosomes appear
to have experienced minor reorganizational
exchanges. Furthermore, LGI appears to be the
result of multiple rearrangements involving three
major tandem fusions. These results suggest that
the progenitor of Populus had a base chromosome
number of 10. After the whole-genome duplication
event, this base chromosome number experienced
a genome-wide reorganization and diploidization
of the duplicated chromosomes into four pairs of
complete paralogous chromosomes (LGVI, VIII,
X, XII, XIII, XV, XVI, XVIII, and XIX); two sets
of four chromosomes, each containing a terminal
translocation (LGI, II, IV, V, VII, IX, and XI); and
one chromosome containing three terminally
joined chromosomes (LGIII with I or XVII with
VII). The colinearity of genetic maps among
multiple Populus species suggests that the
genome reorganization occurred before the evolution of the modern taxa of Populus.
Genome duplication in a common ancestor of Populus and Arabidopsis. The distribution of 4DTV values for paralogous pairs of genes
also shows that a large fraction of the Populus
genome falls in a set of duplicated segments
anchored by gene pairs with 4DTV at 0.364 ±
0.001, representing the residue of a more ancient,
large-scale, apparently synchronous duplication
event (Fig. 3A). This relatively older duplication
event covers about 59% of the Populus genome
with 16% of genes in these segments present in
two copies. Because this duplication preceded
and is therefore superimposed upon the salicoid
event, each genomic region is potentially
covered by four such segments. Similarly, the
Arabidopsis genome experienced an older
‘‘beta’’ duplication that preceded the Brassicaceae-specific ‘‘alpha’’ event (28–32).
We next asked whether the Arabidopsis
‘‘beta’’ (30, 32) and Populus 4DTV ~0.36 duplication events were (i) independent genome-wide duplications that occurred after the split from
the last common eurosid ancestor (H1) or (ii) a
single shared duplication event that occurred in an
ancestral lineage (i.e., before the divergence of
eurosid lineages I and II) (H2). These two
hypotheses have very different implications for
the interpretation of homology between Populus
and Arabidopsis. Under H1, each genomic
segment in one species is homologous to four
segments in the other; whereas under H2, each
segment is homologous to only two segments in
the other species. These hypotheses were tested
by comparing the relative distances between gene
pairs sampled within and between Populus and
Arabidopsis. H2 was generally supported (9), but
we could not reject H1. We can only conclude
that the Populus genome duplication occurred
very close to the time of divergence of the
eurosid I and II lineages (9), with slight support
for a shared duplication. This coincident timing
raises the possibility of a causal link between
this duplication and rapid diversification early
in eurosid (and perhaps core eudicot) history.
We refer to this older Populus/Arabidopsis
duplication event as the ‘‘eurosid’’ duplication
event. We note that the salicoid duplication
occurred independently of the eurosid duplication observed in the Arabidopsis genome.
Gene Content
Although Populus has substantially more protein-coding genes than Arabidopsis, the relative
frequency of domains represented in protein
databases (Prints, Prosite, Pfam, ProDom, and
SMART) in the two genomes is similar (9).
However, the most common domains occur in
Populus compared with Arabidopsis in a ratio
ranging from 1.4:1 to 1.8:1. Noteworthy outliers
in Populus include genes and gene domains
associated with disease and insect resistance (such
as, in Populus versus Arabidopsis, respectively:
leucine-rich repeats, 1271 versus 527; NB-ARC
domain, 302 versus 141; and thaumatin, 55 versus
24), meristem development (such as NAC
transcription factors, 157 versus 100, respectively), and metabolite and nutrient transport
[such as oligopeptide transporter of the proton-dependent
oligopeptide transporter (POT) and
oligopeptide transporter (OPT) families, 129 versus 61, and potassium transporter, 30 versus 13,
respectively].
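The domain counts quoted above can be turned into Populus:Arabidopsis ratios and compared with the 1.4:1 to 1.8:1 range given for the most common domains. The sketch uses only numbers from the paragraph; the threshold variable name is introduced here for illustration.

```python
# (Populus, Arabidopsis) domain counts quoted in the paragraph above.
domain_counts = {
    "leucine-rich repeat": (1271, 527),
    "NB-ARC": (302, 141),
    "thaumatin": (55, 24),
    "NAC transcription factor": (157, 100),
    "POT/OPT oligopeptide transporter": (129, 61),
    "potassium transporter": (30, 13),
}
TYPICAL_UPPER_RATIO = 1.8  # upper end of the 1.4:1 to 1.8:1 range

ratios = {name: pop / ara for name, (pop, ara) in domain_counts.items()}
above_typical = sorted(
    name for name, ratio in ratios.items() if ratio > TYPICAL_UPPER_RATIO
)

for name, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{name}: {ratio:.2f}:1")
```

By this simple measure five of the six listed domains exceed 1.8:1, whereas the NAC transcription factors (1.57:1) fall inside the typical range.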
Some domains were underrepresented in
Populus compared with Arabidopsis. For example, the F-box domain was twice as prevalent in
Arabidopsis as in Populus (624 versus 303,
respectively). The F-box domain is involved in
diverse and complex interactions involving
protein degradation through the ubiquitin-26S
proteasome pathway (33). Many of the ubiquitin-associated domains are underrepresented in
Populus compared with Arabidopsis (for example, the Ulp1 protease family and the C-terminal
catalytic domain, 10 versus 63, respectively).
Moreover, the RING-finger domains are nearly
equally present in both genomes (503 versus
407, respectively), suggesting that protein degradation pathways in the two organisms are
metabolically divergent.
The common eurosid gene set. The Populus
and Arabidopsis gene sets were compared to infer
the conserved gene complement of their common
eurosid ancestor, integrating information from
nucleotide divergence, synteny, and mutual best
BLAST-hit analysis (9). The ancestral eurosid
genome contained at least 11,666 protein-coding
genes, along with an undetermined number that
were either lost in one or both of the lineages or
whose homology could not be detected. These
ancestral genes were the progenitors of gene
families of typically one to four descendants in
each of the complete plant genomes and account
for 28,257 Populus and 17,521 Arabidopsis
genes. Gene family lists are accessible at www.
phytozome.net. The gene predictions in these
two genomes that could not be accounted for in
the eurosid clusters were often fragmentary or
difficult to categorize, and we could not confidently assign orthology to them. They may
include previously unidentified or rapidly evolving genes in the Populus and/or Arabidopsis
lineages, as well as poorly predicted genes.
Noncoding RNAs. Based on a series of
publicly available RNA detection algorithms
(34), including tRNAScan-SE, INFERNAL, and
snoScan, we identified 817 putative tRNAs; 22
U1, 26 U2, 6 U4, 23 U5, and 11 U6 spliceosomal
small nuclear RNAs (snRNAs); 339 putative C/D
small nucleolar RNAs (snoRNAs); and 88
predicted H/ACA snoRNAs in the Populus
assembly. All 57 possible anticodon tRNAs were
found. One selenocysteine tRNA was detected and
two possible suppressor tRNAs (anticodons that
bind stop codons) were also identified. Populus
has nearly 1.3 times as many tRNA genes as
Arabidopsis. In contrast to Arabidopsis (fig. S7A),
the copy number of tRNA in Populus was significantly and positively correlated with amino acid
occurrence in predicted gene models (fig. S7B).
The ratio of the number of snRNAs in Populus
compared with the number in Arabidopsis is 1.3 to
1.0, yet U1, U2, and U5 are overrepresented in
Populus, whereas U4 is underrepresented. Furthermore,
U14 was not detected in Arabidopsis. The
snRNAs and snoRNAs have not been experimentally verified in Populus.
There are 169 identified microRNA (miRNA)
genes representing 21 families in Populus
(table S7). In Arabidopsis, these 21 families
contain 91 miRNA genes, representing a 1.9X
expansion in Populus, primarily in miR169 and
miR159/319. All 21 miRNA families have
regulatory targets that appear to be conserved
among Arabidopsis and Populus (table S8).
Similar to the miRNA genes themselves, the
number of predicted targets for these miRNA
is expanded in Populus (147) compared with
Arabidopsis (89). Similarly, the genes that
mediate RNA interference (RNAi) are also
overrepresented in Populus (21) compared to
Arabidopsis (11) [e.g., AGO1 class, 7 versus
3; RNA helicase 2 versus 1; HEN, 2 versus 1;
HYL1-like (double-stranded RNA binding
proteins), 9 versus 5, respectively].
Tandem duplications. In Populus there
were 1518 tandemly duplicated arrays of two
or more genes based on a Smith-Waterman
alignment E value ≤ 10^-25 and a 100-kb
window. The total number of genes in such
arrays was 4839 and the total length of
tandemly duplicated segments in Populus was
47.9 Mb, or 15.6% of the genome (fig. S8). By
the same criteria, there are 1366 tandemly
duplicated segments in Arabidopsis, covering
32.4 Mb, or 27% of the genome. By far the
most common number of genes within a single
array was two, with 958 such arrays in Populus
and 805 in Arabidopsis. Arabidopsis had a
larger number of arrays containing six or more
genes than did Populus. Tandem duplications
thus appear to be relatively more common in
Arabidopsis than in Populus. This may in part
be due to difficulties in assembling tandem
repeats from a whole-genome shotgun sequencing approach, particularly when tandemly
duplicated genes are highly conserved. Alternatively, the Populus genome may be undergoing rearrangements at a slower rate than the
Arabidopsis genome, which is consistent with
our observations of reduced chromosomal rearrangements and slower nucleotide substitution rates in Populus.
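The tandem-array definition used above (similar genes within a 100-kb window) can be sketched with a small union-find grouping. This is an illustrative reading of the criteria, assuming the gene pairs supplied have already passed the alignment E-value cutoff; gene identifiers and coordinates are hypothetical.

```python
def tandem_arrays(genes, similar_pairs, window=100_000):
    """Group genes into tandem arrays.

    genes: {gene_id: (chromosome, start_bp)}
    similar_pairs: iterable of (gene_id, gene_id) pairs that already passed
        the alignment E-value cutoff.
    Two similar genes are linked when they lie on the same chromosome within
    `window` bp; arrays are connected components with two or more genes.
    """
    parent = {gene: gene for gene in genes}

    def find(gene):
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for a, b in similar_pairs:
        chrom_a, start_a = genes[a]
        chrom_b, start_b = genes[b]
        if chrom_a == chrom_b and abs(start_a - start_b) <= window:
            parent[find(a)] = find(b)

    groups = {}
    for gene in genes:
        groups.setdefault(find(gene), []).append(gene)
    return sorted(sorted(group) for group in groups.values() if len(group) >= 2)


# Hypothetical coordinates and similarity pairs.
genes = {
    "g1": ("LG_I", 10_000), "g2": ("LG_I", 45_000), "g3": ("LG_I", 120_000),
    "g4": ("LG_I", 900_000), "g5": ("LG_II", 20_000),
}
pairs = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g1", "g5")]
arrays = tandem_arrays(genes, pairs)
```

Linking through intermediate genes lets an array extend beyond a single window, which is one plausible way to obtain arrays of many genes; whether the study chained windows in this way is not stated.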
In some cases, genes were highly duplicated
in both species, and some tandem duplications
predated the Populus-Arabidopsis split (9). The
largest number of tandem repeats in Populus in
a single array was 24 and contained genes with
high homology to S locus–specific glycoproteins. Genes of this class also occur as tandem
repeats in Arabidopsis, with the largest segments containing 14 tandem duplicates on
chromosome 1. One of the InterPro domains
in this protein, IPR008271, a serine/threonine
protein kinase active site, was the most frequent
domain in tandemly repeated genes in both
species (fig. S8). Other common domains in
both species were the leucine-rich repeat
(IPR007090, primarily from tandem repeats of
disease resistance genes), the pentatricopeptide
repeat RNA-binding proteins (IPR002885), and
the uridine diphosphate (UDP)-glucuronosyl/
UDP-glucosyltransferase domain (IPR002213)
(table S9).
In contrast, some genes were highly expanded
in tandem duplicates in one genome and not in
the other (fig. S8). For example, one of the most
frequent classes of tandemly duplicated genes in
Arabidopsis was F-box genes, with a total of
342 involved in tandem duplications, the largest
segment of which contained 24 F-box genes.
Populus contains only 37 F-box genes in tandem duplications, with the largest segment containing only 3 genes.
Postduplication Gene Fate
Functional expression divergence. In Populus,
20 of the 66 salicoid-event duplicate gene pairs
contained in 19 Populus EST libraries (2.3% of
the total) showed differential expression (9)
[displaying significant deviation in EST frequencies per library (Fig. 4)]. Out of 18 eurosid-event
duplicate gene pairs (2.7% of the total), 11 also
displayed significant deviation in EST frequencies
per library. Many of the duplicate gene pairs that
displayed significant overrepresentation in one or
more of the 19 sampled libraries were involved in
protein-protein interactions (such as annexin) or
protein folding (such as cyclophilins). In the
eurosid set, there was a greater divergence in
the best BLAST hit among pairwise sets of genes.
These results support the premise of functional
expression divergence among some duplicated
gene pairs in Populus.
To further test for variation in gene expression among duplicated genes, we examined
whole-genome oligonucleotide microarray data
containing the 45,555 promoted genes (9). There
was significantly lower differential expression
in the salicoid duplicated pairs of genes (mean =
5%) relative to eurosid duplications (mean =
11%), again suggesting that divergence of expression patterns for retained paralogous gene pairs is
an ongoing process that has had more time to
occur in eurosid pairs (Fig. 5). This difference
could also be due to absolute expression level,
which may vary systematically between the two
duplication events. Moreover, differential expression was more evident in the wood-forming
organs. Almost 14 and 13% (2632 pairs of genes)
of eurosid duplicated genes in the nodes and
internodes, respectively, displayed differential
expression, compared with 8% or less in roots
and young leaves (Fig. 5).
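Differential expression between duplicates was declared at a 5% false discovery rate. The text does not name the procedure, so the sketch below assumes the widely used Benjamini-Hochberg step-up method applied to per-pair t-test p-values; the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return booleans marking which tests are significant at `fdr`."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            cutoff_rank = rank
    significant = [False] * m
    for index in order[:cutoff_rank]:
        significant[index] = True
    return significant


# Hypothetical t-test p-values for ten duplicated gene pairs.
p_values = [0.041, 0.001, 0.212, 0.008, 0.039, 0.060, 0.042, 0.074, 0.205, 0.216]
calls = benjamini_hochberg(p_values)
fraction_differential = sum(calls) / len(calls)
```

Note that several p-values below 0.05 are not called significant here; controlling the false discovery rate across many gene pairs is stricter than testing each pair in isolation.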
Single-nucleotide polymorphisms. Populus is a highly polymorphic taxon and substantial numbers of SNPs are present even within a
single individual (Table 1). The ratio of nonsynonymous to synonymous substitution rate
(ω = dN/dS) was calculated as an index of
selective constraints for alleles of individual
genes (9). The overall average dN across all
genes was 0.0014, whereas the dS value was
0.0035, for a total ω of 0.40, suggesting that the
majority of coding regions in the Populus
genome are subject to purifying selection.
There was a significant, negative correlation
between ω and the 4DTV distance to the most
closely related paralog (r = –0.034, P = 0.028),
which is consistent with the expectation of
higher levels of nonsynonymous polymorphism
in recently duplicated genes as a result of functional redundancy (20, 35). Similarly, genes
with recent tandem duplicates (4DTV ≤ 0.2)
had significantly higher ω than did genes with
no recent tandem duplicates (Wilcoxon rank
sum Z = 8.65, P ≤ 0.0001) (table S10).
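The genome-wide dN/dS ratio reported above is a simple quotient of the two averages. The sketch below reproduces it and attaches the conventional interpretation; the tolerance band around 1 is an arbitrary choice made for illustration, not a value from the study.

```python
def selection_ratio(dn, ds):
    """Ratio of nonsynonymous to synonymous substitution rates (dN/dS)."""
    if ds <= 0:
        raise ValueError("dS must be positive")
    return dn / ds


def interpret(ratio, tolerance=0.05):
    """Conventional reading of dN/dS; the tolerance band is arbitrary."""
    if ratio < 1 - tolerance:
        return "purifying selection"
    if ratio > 1 + tolerance:
        return "positive selection"
    return "approximately neutral"


genome_wide = selection_ratio(0.0014, 0.0035)  # averages reported in the text
label = interpret(genome_wide)
```

A ratio of 0.40 means nonsynonymous changes accumulate at well under half the synonymous rate, which is why the text reads it as evidence of purifying selection.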
The results for tandemly duplicated genes
were consistent with expectations for accelerated
evolution of duplicated genes (20). However, this
expectation was not upheld for paralogous pairs
of genes from the whole-genome duplication
events. Relative rates of nonsynonymous substitution were actually lower for genes with paralogs
from the salicoid and eurosid whole-genome
duplication events than for genes with no paralogs
(table S11). One possible explanation for this
Fig. 4. Kolmogorov-Smirnov (K-S) test for differential expression for 5-methyltetrahydropteroyltriglutamate-homocysteine
S-methyltransferase genes
[for descriptions of the EST
data set, see Sterky et al.
(79)]. Results suggest that
the duplicated genes in
Populus are differentially
expressed in alternate tissues. Tissue types include:
cambial zone (1), young
leaves (2), flower buds (3),
tension wood (4), senescing leaves (5), apical shoot
(6), dormant cambium (7),
active cambium (8), cold stressed leaves (9), roots (10), bark (11), shoot meristem (12), male catkins (13),
dormant buds (14), female catkins (15), petioles (16), wood cell death (17), imbibed seeds (18) and infected
leaves (19).
discrepancy is that the apparent single-copy genes
have a corresponding overrepresentation of rapidly evolving pseudogenes. However, this does
not appear to be the case, as demonstrated by an
analysis of gene size, synonymous substitution
rate, and minimum genetic distance to the closest
paralog as covariates in an analysis of variance
with ω as the response variable (table S11).
Therefore, genes with no paralogs from the
salicoid and eurosid duplication events seem to
be under lower selective constraints, and purifying
selection is apparently stronger for genes with
paralogs retained from the whole-genome duplications. Chapman et al. (36) have recently
proposed the concept of functional buffering to
account for similar reduction in detected mutations in paralogs from whole-genome duplications
in Arabidopsis and Oryza. The vegetative
propagation habit of Populus may also favor
the conservation of nucleotide sequences among
duplicated genes, in that complementation
among duplicate pairs of genes would minimize
loss of gene function associated with the
accumulation of deleterious somatic mutations.
Gene family evolution. The expansion of
several gene families has contributed to the
evolution of Populus biology.
Lignocellulosic wall formation. Among the
processes unique to tree biology, one of the most
obvious is the yearly development of secondary
xylem from the vascular cambium. We identified
Populus orthologs of the approximately 20
Arabidopsis genes and gene families involved in
or associated with cellulose biosynthesis. The
Populus genome has 93 cellulose synthesis–
related genes compared with 78 in Arabidopsis.
The Arabidopsis genome encodes 10 CesA genes
belonging to six classes known to participate in
cellulose microfibril biosynthesis (37). Populus
has 18 CesA genes (38), including duplicate
copies of CesA7 and CesA8 homologs. Populus
homologs of Arabidopsis CesA4, CesA7, and
CesA8 are coexpressed during xylem development and tension wood formation (39). Furthermore, one pair of CesA genes appears unique to
Populus, with no homologs found in Arabidopsis (40).
Many other types of genes
associated with cellulose biosynthesis, such as
KOR, SuSY, COBRA, and FRA2, occur in
duplicate pairs in Populus relative to single-copy
Arabidopsis genes (39). For example, COBRA,
a regulator of cellulose biogenesis (41), is a
single-copy gene in Arabidopsis, but in Populus
there are four copies.
The repertoire of acknowledged hemicellulose biosynthetic genes in Populus is generally
similar to that in Arabidopsis. However, Populus
has more genes encoding α-L-fucosidases and
fewer genes encoding α-L-fucosyltransferases
than does Arabidopsis, which is consistent with
the lower xyloglucan fucose content (42) in
Populus relative to Arabidopsis.
Lignin, the second most abundant cell wall
polymer after cellulose, is a complex polymer of
monolignols (hydroxycinnamyl alcohols) that
encrusts and interacts with the cellulose/hemicellulose matrix of the secondary cell wall (43).
The full set of 34 Populus phenylpropanoid and
lignin biosynthetic genes (table S13) was
identified by sequence alignment to the known
Arabidopsis phenylpropanoid and lignin genes
(44, 45). The size of the Populus gene families
that encode these enzymes is generally larger
than in Arabidopsis (34 versus 18, respectively).
The only exception is cinnamyl alcohol dehydrogenase (CAD), which is encoded by a
single gene in Populus and two genes in
Arabidopsis (Fig. 6C); CAD is also encoded
by only a single gene in Pinus taeda (46, 47).
Two lignin-related Populus C4H genes are
strongly coexpressed in tissues related to wood
formation, whereas the three Populus C3H
genes show reciprocally exclusive expression
patterns (48) (Fig. 6, A and B).
Secondary metabolism. Populus trees produce a broad array of nonstructural, carbon-rich
secondary metabolites that exhibit wide variation
in abundance, stress inducibility, and effects on
tree growth and host-pest interactions (49–53).
Shikimate-phenylpropanoid–derived phenolic
esters, phenolic glycosides, and condensed
tannins and their flavonoid precursors comprise
Fig. 5. Proportion of eurosid and
salicoid duplicated gene sets differentially expressed in stems (nodes
and internode), leaves (young and
mature), and whole roots. Samples
from four biological replicates collected from the reference genotype
Nisqually 1 were individually hybridized to whole-genome oligonucleotide
microarrays containing three 60-oligomer oligonucleotide probes for
each gene. Differential expression
between duplicated genes was evaluated in t tests and declared significant at a 5% false discovery rate (9).
the largest classes of these metabolites. Phenolic glycosides and condensed tannins alone can
constitute up to 35% leaf dry weight and are
abundant in buds, bark, and roots of Populus
(50, 54, 55).
The flavonoid biosynthetic genes are well
annotated in Arabidopsis (56) and almost all
(with the exception of flavonol synthase) are
encoded by single-copy genes. In contrast, all
but three such enzymes (chalcone isomerase,
flavonoid 3′-hydroxylase, and flavanone 3-hydroxylase) are encoded by multiple genes in
Populus (53). For example, the chalcone synthase, controlling the committed step to flavonoid
biosynthesis, has expanded to at least six genes
in Populus. In addition, Populus contains two
genes each for flavone synthase II (cytochrome
accession number CYP98B) and flavonoid 3′,5′-hydroxylase (CYP75A12 and CYP75A13), both
of which are absent in Arabidopsis. Furthermore,
three Populus genes encode leucoanthocyanidin
reductase, required for the synthesis of condensed
tannin precursor 2,3-trans-flavan-3-ols, a stereochemical configuration also lacking in Arabidopsis
(57). In contrast to the 32 terpenoid synthase
(TPS) genes of secondary metabolism identified
in the Arabidopsis genome (58), the Populus
genome contains at least 47 TPS genes, suggesting a wide-ranging capacity for the formation of terpenoid secondary metabolites.
A number of phenylpropanoid-like enzymes
have been annotated in the Arabidopsis genome
(44, 45, 59–61). One example is the family encoding CAD. In addition to the single Populus
CAD gene involved in lignin biosynthesis,
several other clades of CAD-like (CADL) genes
are present, most of which fall within larger
subfamilies containing enzymes related to multifunctional alcohol dehydrogenases (Fig. 6).
This comparative analysis makes it clear that
there has been selective expansion and retention
of Populus CADL gene families. For example,
Populus contains seven CADL genes (PoptrCADL1 to PoptrCADL7; Fig. 6C) encoding
enzymes related to the Arabidopsis BAD1 and
BAD2 enzymes with apparent benzyl alcohol
dehydrogenase activities (62). BAD1 and BAD2
are known to be pathogen inducible, suggesting
that this group of Populus genes, including the
Populus SAD gene, previously characterized as
encoding a sinapaldehyde-specific CAD enzyme
(63), may be involved in chemical defense.
Disease resistance. The likelihood that a
perennial plant will encounter a pathogen or
herbivore before reproduction is near unity. The
long-generation intervals for trees make it
difficult for such plants to match the evolutionary rates of a microbial or insect pest. Aside
from the formation of thickened cell walls and
the synthesis of secondary metabolites that
constitute a first line of defense against microbial and insect pests, plants use a variety of
disease-resistance (R) genes.
The largest class of characterized R genes
encodes intracellular proteins that contain a
families, coding for adenosine 5′-triphosphate–binding
cassette proteins (ABC transporters, 226
gene models), major facilitator superfamily
proteins (187 genes), drug/metabolite transporters
(108 genes), amino acid/auxin permeases (95
genes), and POT transporters (90 genes), accounted for more than 40% of the total number of
transporter gene models (fig. S14). Some large
families such as those encoding POT (4.3X relative to Arabidopsis), glutamate-gated ion channels (3.7X), potassium uptake permeases (2.3X),
and ABC transporters (1.9X) are expanded in
Populus. We identified a subfamily of five
putative aquaporins, lacking in Arabidopsis.
Populus also harbors seven transmembrane receptor
genes that have previously only been
found in fungi, and two genes, identified as
mycorrhizal-specific phosphate transporters, that
confirm that the mycorrhizal symbiosis may have
an impact on the mineral nutrition of this long-lived species. This expanded inventory of transporters could conceivably play a role in adaptation
to nutrient-limited forest soils, long-distance transport and storage of water and metabolites,
secretion and movement of secondary metabolites, and/or mediation of resistance to pathogen-produced secondary metabolites or other toxic
compounds.
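The statement that the five largest families account for more than 40% of transporter gene models can be checked from the counts quoted in this section. The sketch uses only numbers given in the text (family sizes and the 1722 versus 959 totals); variable names are introduced here for illustration.

```python
# Gene-model counts for the five largest transporter families, from the text.
largest_families = {
    "ABC transporters": 226,
    "major facilitator superfamily": 187,
    "drug/metabolite transporters": 108,
    "amino acid/auxin permeases": 95,
    "POT transporters": 90,
}
POPULUS_TOTAL = 1722
ARABIDOPSIS_TOTAL = 959

top_five = sum(largest_families.values())
share = top_five / POPULUS_TOTAL
expansion = POPULUS_TOTAL / ARABIDOPSIS_TOTAL

print(f"top five families: {top_five} models, {share:.1%} of Populus transporters")
print(f"overall expansion relative to Arabidopsis: {expansion:.2f}x")
```

The five families sum to 706 gene models, about 41% of the Populus total, and the overall transporter complement is about 1.8 times that of Arabidopsis.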
Phytohormones. Both physiological and molecular studies have indicated the importance of
Fig. 6. Phylogenetic analysis of gene families in Populus, Arabidopsis, and Oryza encoding
selected lignin biosynthetic and related enzymes. (A) Cinnamate-4-hydroxylase (C4H) gene family.
(B) 4-coumaroyl-shikimate/quinate-3-hydroxylase (C3H) gene family. (C) Cinnamyl alcohol dehydrogenase (CAD) and related multifunctional alcohol dehydrogenase gene family. Arabidopsis gene names
are the same as those in Ehlting et al. (80). Populus and Oryza gene names were arbitrarily assigned;
corresponding gene models are listed in table S13. Genes encoding enzymes for which biochemical
data are available are highlighted with a green flash. Yellow circles indicate monospecific clusters of
gene family members.
Table 2. Numbers of genes that encode domains similar to plant R proteins in Populus, Arabidopsis (81),
and Oryza (82). *, BED finger and/or DUF1544 domain; CC, coiled coil; –, not detected.
Predicted protein domains   Letter code   Populus   Arabidopsis   Oryza
TIR-NBS                     TN            10        21            –
TIR-NBS-LRR                 TNL           64        83            –
TIR-NBS-LRR-TIR             TNLT          13        –             –
TIR-NBS-LRR-NBS             TNLN          1         –             –
NBS-LRR-TIR                 NLT           1         –             –
TIR-CC-NBS-LRR              TCNL          2         –             –
CC-NBS                      CN            19        4             7
CC-NBS-LRR                  CNL           119       51            159
BED/DUF1544*-NBS            BN            5         –             –
NBS-BED/DUF1544*            NB            1         –             –
BED/DUF1544*-NBS-LRR        BNL           24        –             –
NBS-LRR                     NL            90        6             40
NBS                         N             49        1             45
Others                      –             –         41            284
Total NBS genes                           398       207           535
nucleotide-binding site (NBS) and carboxy-terminal leucine-rich-repeats (LRR) (64). The
NBS-coding R gene family is one of the largest
in Populus, with 399 members, approximately
twice as high as in Arabidopsis. The NBS family
can be divided into multiple subfamilies with
distinct domain organizations, including 64 TIR-NBS-LRR genes, 10 truncated TIR-NBS that
lack an LRR, 233 non–TIR-NBS-LRR genes,
and 17 unusual TIR-NBS–containing genes that
have not been identified previously in Arabidopsis (TNLT, TNLN, or TCNL domains)
(Table 2). Five gene models coding for TNL
proteins contained a predicted N-terminal
nuclear localization signal (65). The number of
non–TIR-NBS-LRR genes in Populus is also
much higher than that in Arabidopsis (209 versus
57, respectively). Notably, 40 non–TIR-NBS
genes, not found in Arabidopsis, carry an N-terminal BED DNA-binding zinc finger domain
that was also found in the Oryza Xa1 gene. These
findings suggest that domain cooption occurred in
Populus. Most NBS-LRR (about 65%) in Populus occur as singletons or in tandem duplications,
and the distribution of pairwise genetic distances
among these genes suggests a recent expansion of
this family. That is, only 10% of the NBS-LRR
genes are associated with the eurosid and salicoid
duplication events, compared with 55% of the
extracellular LRR receptor-like kinase genes (for
example, fig. S10).
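The subfamily counts quoted in this paragraph can be cross-checked against the Populus column of Table 2. The grouping of letter codes used below (for example, CNL + BNL + NL for the 233 non–TIR-NBS-LRR genes) is an assumption inferred from the numbers, not a mapping stated in the text.

```python
# Populus column of Table 2, keyed by letter code.
populus_nbs = {
    "TN": 10, "TNL": 64, "TNLT": 13, "TNLN": 1, "NLT": 1, "TCNL": 2,
    "CN": 19, "CNL": 119, "BN": 5, "NB": 1, "BNL": 24, "NL": 90, "N": 49,
}


def subtotal(codes):
    """Sum the Populus gene counts for the given letter codes."""
    return sum(populus_nbs[code] for code in codes)


column_total = sum(populus_nbs.values())
non_tir_nbs_lrr = subtotal(["CNL", "BNL", "NL"])          # assumed grouping
unusual_tir = subtotal(["TNLT", "TNLN", "NLT", "TCNL"])   # assumed grouping
cnl_plus_nl = subtotal(["CNL", "NL"])                     # assumed grouping
```

Under these groupings the table reproduces the 233, 17, and 209 quoted above, while the column itself sums to 398, one fewer than the 399 members cited in the text.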
Several conserved signaling components
such as RAR1, EDS1, PAD4, and NPR1, known
to be recruited by R genes, also contain multiple
homologs in Populus. For example, two copies
of the PAD4 gene, which functions upstream of
salicylic acid accumulation, and five copies of
the NPR1 gene, an important regulator of responses downstream of salicylic acid, are found
in Populus. Nearly all genes known to control
disease resistance signaling in Arabidopsis have
putative orthologs in Populus. Populus has a
larger number of β-1,3-glucanase and chitinase
genes than does Arabidopsis (131 versus 73,
respectively). In summary, the structural and
genetic diversity that exists among R genes and
their signaling components in Populus is remarkable and suggests that unlike the rest of the
genome, contemporary diversifying selection has
played an important role in the evolution of
disease resistance genes in Populus. Such diversification suggests that enhanced ability to detect
and respond to biotic challenges through R
gene–mediated signaling may be critical over
a decades-long life span of this genus.
Membrane transporters. Attributes of Populus
biology such as massive interannual, seasonal,
and diurnal metabolic shifts and redeployment
of carbon and nitrogen may require an elaborate array of transporters. Investigation of gene
families coding for transporter proteins (http://
plantst.genomics.purdue.edu/) in the Populus genome revealed a general expansion relative to
Arabidopsis (1722 versus 959, in Populus versus
Arabidopsis, respectively) (table S12). Five gene
gene families have been implicated in the negative regulation of cytokinin signaling (67, 77),
which is consistent with the idea of increased
complexity in regulation of cytokinin signal transduction in Populus.
Populus and Arabidopsis genomes contain
almost identical numbers of genes for the three
enzymes of ethylene biosynthesis, whereas the
number of genes for proteins involved in
ethylene perception and signaling is higher in
Populus. For example, Populus has seven
predicted genes for ethylene receptor proteins
and Arabidopsis has five; the constitutive triple
response kinase that acts just downstream of the
receptor is encoded by four genes in Populus
and only one in Arabidopsis (78). The number
of ethylene-responsive element binding factor
(ERF) proteins (a subfamily of the AP2/ERF
family) is higher in Populus than in Arabidopsis
(172 versus 122, respectively). The increased
variation in the number of ERF transcription factors may be involved in the ethylene-dependent
processes specific to trees, such as tension wood
formation (68) and the establishment of dormancy (71).
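The gene counts quoted above for ethylene perception and signaling can be expressed as fold differences with simple arithmetic. The Python sketch below is illustrative only: the dictionary labels and the helper name `fold_difference` are our own choices, and the only data used are the counts stated in the text.

```python
# Gene counts quoted in the text, as (Populus, Arabidopsis) pairs.
# The family labels are informal names chosen for this sketch.
ETHYLENE_GENE_COUNTS = {
    "ethylene receptors": (7, 5),
    "constitutive triple response kinase": (4, 1),
    "ERF transcription factors": (172, 122),
}


def fold_difference(populus, arabidopsis):
    """Return the Populus-to-Arabidopsis ratio of gene counts."""
    if arabidopsis <= 0:
        raise ValueError("Arabidopsis count must be positive")
    return populus / arabidopsis


for family, (populus, arabidopsis) in ETHYLENE_GENE_COUNTS.items():
    ratio = fold_difference(populus, arabidopsis)
    print(f"{family}: {populus} vs {arabidopsis} ({ratio:.2f}-fold)")
```

On these numbers the receptor and ERF families differ by roughly 1.4-fold, whereas the kinase acting downstream of the receptor differs by 4-fold.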
Conclusions
Our initial analyses provide a flavor of the opportunities for comparative plant genomics made
possible by the generation of the Populus genome
sequence. A complex history of whole-genome
duplications, chromosomal rearrangements and
tandem duplications has shaped the genome that
we observe today. The differences in gene
content between Populus and Arabidopsis have
provided some tantalizing insights into the
possible molecular bases of their strongly contrasting life histories, although factors unrelated
to gene content (such as regulatory elements,
miRNAs, posttranslational modification, or epigenetic modifications) may ultimately be of
equal or greater importance. With the sequence
of Populus, researchers can now go beyond what
could be learned from Arabidopsis alone and
explore hypotheses linking genome sequence
features to wood development, nutrient and water
movement, crown development, and disease
resistance in perennial plants. The availability of
the Populus genome sequence will enable
continuing comparative genomics studies among
species that will shed new light on genome
reorganization and gene family evolution. Furthermore, the genetics and population biology of
Populus make it an immense source of allelic
variation. Because Populus is an obligate
outcrossing species, recessive alleles tend to be
maintained in a heterozygous state. Informatics
tools enabled by the sequence, assembly, and
annotation of the Populus genome will facilitate
the characterization of allelic variation in wild
Populus populations adapted to a wide range of
environmental conditions and gradients over
large portions of the Northern Hemisphere. Such
variants represent a rich reservoir of molecular
resources useful in biotechnological applications,
development of alternative energy sources, and
mitigation of anthropogenic environmental problems. Finally, the keystone role of Populus in
many ecosystems provides the first opportunity
for the application of genomics approaches to
questions with ecosystem-scale implications.
www.sciencemag.org SCIENCE VOL 313
References and Notes
1. Food and Agricultural Organization of the United Nations,
State of the World’s Forests 2003 (FAO, Rome, 2003).
2. R. F. Stettler, H. D. Bradshaw Jr., in Biology of Populus
and Its Implications for Management and Conservation,
R. F. Stettler, H. D. Bradshaw Jr., P. E. Heilman, T. M.
Hinckley, Eds. (NRC Research Press, Ottawa, 1996),
pp. 1–7.
3. G. A. Tuskan, S. P. DiFazio, T. Teichmann, Plant Biol. 6, 2
(2004).
4. T. M. Yin, S. P. DiFazio, L. E. Gunter, D. Riemenschneider,
G. A. Tuskan, Theor. Appl. Genet. 109, 451 (2004).
5. R. Meilan, C. Ma, in Agrobacterium Protocols, vol. 344 of
Methods in Molecular Biology, K. Wang, Ed. (Humana
Press, Totowa, NJ, 2006), pp. 143–151.
6. G. A. Tuskan, Biomass Bioenerg. 14, 307 (1998).
7. G. A. Tuskan, M. Walsh, For. Chron. 77, 259 (2001).
8. S. Wullschleger et al., Can. J. For. Res. 35, 1779
(2005).
9. Materials and methods are available as supporting
material on Science Online.
10. H. D. Bradshaw, R. F. Stettler, Theor. Appl. Genet. 86, 301
(1993).
11. M. Koornneef, P. Fransz, H. de Jong, Chromosome Res.
11, 183 (2003).
12. O. Santamaria, J. J. Diez, For. Pathol. 35, 95 (2005).
13. G. A. Tuskan et al., Can. J. For. Res. 34, 85 (2004).
14. A. A. Salamov, V. V. Solovyev, Genome Res. 10, 516 (2000).
15. E. Birney, R. Durbin, Genome Res. 10, 547 (2000).
16. T. Schiex, A. Moisan, P. Rouzé, in Computational Biology:
Selected Papers from JOBIM’2000, number 2066 in
LNCS (Springer-Verlag, Heidelberg, Germany, 2001),
pp. 118–133.
17. Y. Xu, E. C. Uberbacher, J. Comput. Biol. 4, 325 (1997).
18. S. J. Hanley, M. D. Mallott, A. Karp, Tree Genet. Genomes,
in press.
19. M. A. Koch, B. Haubold, T. Mitchell-Olds, Mol. Biol. Evol.
17, 1483 (2000).
20. M. Lynch, J. S. Conery, Science 290, 1151 (2000).
21. L. Sterck et al., New Phytol. 167, 165 (2005).
22. L. A. Dode, Bull. Soc. Hist. Nat. Autun 18, 161 (1905).
23. R. Regnier, in Revue des Sociétés Savantes de Normandie
(Rouen, France, 1956), vol. 1, pp. 1–36.
24. M. E. Collinson, Proc. R. Soc. Edinburgh B Bio. Sci. 98,
155 (1992).
25. J. E. Eckenwalder, in Biology of Populus and Its Implications
for Management and Conservation, R. F. Stettler,
H. D. Bradshaw Jr., P. E. Heilman, T. M. Hinckley, Eds. (NRC
Research Press, Ottawa, 1996), chap. 1.
26. J. B. Mitton, M. C. Grant, Bioscience 46, 25 (1996).
27. K. Hokamp, A. McLysaght, K. H. Wolfe, J. Struct. Funct.
Genomics 3, 95 (2003).
28. J. E. Bowers, B. A. Chapman, J. K. Rong, A. H. Paterson,
Nature 422, 433 (2003).
29. L. M. Zahn, J. Leebens-Mack, C. W. dePamphilis, H. Ma,
G. Theissen, J. Hered. 96, 225 (2005).
30. S. De Bodt, S. Maere, Y. Van de Peer, Trends Ecol. Evol.
20, 591 (2005).
31. K. L. Adams, J. F. Wendel, Trends Genet. 21, 539 (2005).
32. G. Blanc, K. Hokamp, K. H. Wolfe, Genome Res. 13, 137
(2003).
33. B. A. Schulman et al., Nature 408, 381 (2000).
34. S. Griffiths-Jones et al., Nucleic Acids Res. 33, D121
(2005).
35. S. Lockton, B. S. Gaut, Trends Genet. 21, 60 (2005).
36. B. A. Chapman, J. E. Bowers, F. A. Feltus, A. H. Paterson,
Proc. Natl. Acad. Sci. U.S.A. 103, 2730 (2006).
37. T. A. Richmond, C. R. Somerville, Plant Physiol. 124, 495
(2000).
38. S. Djerbi, M. Lindskog, L. Arvestad, F. Sterky, T. T. Teeri,
Planta 221, 739 (2005).
15 SEPTEMBER 2006
hormonal regulation underlying plant development. Auxin, gibberellin, cytokinin, and ethylene
responses are of particular interest in tree biology.
Many auxin responses (66–71) are controlled
by auxin response factor (ARF) transcription
factors, which work together with cognate
AUX/IAA repressor proteins to regulate auxin-responsive target genes (72, 73). A phylogenetic analysis using the known and predicted
ARF protein sequences showed that Populus
and Arabidopsis ARF gene families have
expanded independently since they diverged
from their common ancestor. Six duplicate ARF
genes in Populus encode paralogs of ARF
genes that are single-copy Arabidopsis genes,
including ARF5 (MONOPTEROS), an important gene required for auxin-mediated signal
transduction and xylem development. Furthermore, five Arabidopsis ARF genes have four or
more predicted Populus ARF gene paralogs. In
contrast to ARF genes, Populus does not contain a notably expanded repertoire of AUX/IAA
genes relative to Arabidopsis (35 versus 29,
respectively) (74). Interestingly, there is a group
of four Arabidopsis AUX/IAA genes with no apparent Populus orthologs, suggesting Arabidopsis-specific functions.
Gibberellins (GAs) are thought to regulate
multiple processes during wood and root development, including xylem fiber length (75).
Among all gibberellin biosynthesis and signaling
genes, the Populus GA20-oxidase gene family is
the only family with approximately two times
the number of genes relative to Arabidopsis,
indicating that most of the duplicated genes that
arose from the salicoid duplication event have
been lost. GA20-oxidase appears to control flux
in the biosynthetic pathway leading to the
bioactive gibberellins GA1 and GA4. The higher
complement of GA20-oxidase genes may have
biological importance in Populus with respect to
secondary xylem and fiber cell development.
Cytokinins are thought to control the identity and proliferation of cell types relevant for
wood formation as well as general cell division
(67). The total number of members in gene
families encoding cytokinin homeostasis related
isopentenyl transferases (IPT) and cytokinin
oxidases is roughly similar between Populus
and Arabidopsis, although there appears to be
lineage-specific expansion of IPT subfamilies.
The cytokinin signal transduction pathway
represents a two-component phosphorelay system, in which a two-component hybrid receptor
initiates a phosphotransfer by means of histidine-containing phosphotransmitters (HPt) to phospho-accepting response regulators (RR). One family
of genes, encoding the two-component receptors (such as CKI1), is notably expanded in
Populus (four versus one in Populus and
Arabidopsis, respectively) (76). Gene families
coding for recently identified pseudo-HPt and
atypical RR are overrepresented in Populus
relative to Arabidopsis (2.5- and 4.0-fold increase in Populus, respectively). Both of these
39. C. P. Joshi et al., New Phytol. 164, 53 (2004).
40. A. Samuga, C. P. Joshi, Gene 334, 73 (2004).
41. F. Roudier et al., Plant Cell 17, 1749 (2005).
42. R. M. Perrin et al., Science 284, 1976 (1999).
43. R. W. Whetten, J. J. Mackay, R. R. Sederoff, Annu. Rev.
Plant Physiol. Plant Mol. Biol. 49, 585 (1998).
44. J. Ehlting et al., Plant J. 42, 618 (2005).
45. J. Raes, A. Rohde, J. H. Christensen, Y. Van de Peer,
W. Boerjan, Plant Physiol. 133, 1051 (2003).
46. D. M. O’Malley, S. Porter, R. R. Sederoff, Plant Physiol.
98, 1364 (1992).
47. J. J. Mackay, W. W. Liu, R. Whetten, R. R. Sederoff,
D. M. Omalley, Mol. Gen. Genet. 247, 537 (1995).
48. J. Schrader et al., Plant Cell 16, 2278 (2004).
49. S. Whitham, S. McCormick, B. Baker, Proc. Natl. Acad.
Sci. U.S.A. 93, 8776 (1996).
50. G. M. Gebre, T. J. Tschaplinski, G. A. Tuskan, D. E. Todd,
Tree Physiol. 18, 645 (1998).
51. G. Arimura, D. P. W. Huber, J. Bohlmann, Plant J. 37, 603
(2004).
52. D. J. Peters, C. P. Constabel, Plant J. 32, 701 (2002).
53. C.-J. Tsai, S. A. Harding, T. J. Tschaplinski, R. L. Lindroth,
Y. Yuan, New Phytol. 172, 47 (2006).
54. M. M. De Sá, R. Subramaniam, F. E. Williams, C. J. Douglas,
Plant Physiol. 98, 728 (1992).
55. R. L. Lindroth, S. Y. Hwang, Biochem. Syst. Ecol. 24, 357
(1996).
56. B. Winkel-Shirley, Curr. Opin. Plant Biol. 5, 218
(2002).
57. G. J. Tanner et al., J. Biol. Chem. 278, 31647 (2003).
58. S. Aubourg, A. Lecharny, J. Bohlmann, Mol. Genet.
Genomics 267, 730 (2002).
59. M. A. Costa et al., Phytochemistry 64, 1097 (2003).
60. D. Cukovic, J. Ehlting, J. A. VanZiffle, C. J. Douglas, Biol.
Chem. 382, 645 (2001).
61. J. M. Shockey, M. S. Fulda, J. Browse, Plant Physiol. 132,
1065 (2003).
62. I. E. Somssich, P. Wernert, S. Kiedrowski, K. Hahlbrock,
Proc. Natl. Acad. Sci. U.S.A. 93, 14199 (1996).
63. L. Li et al., Plant Cell 13, 1567 (2001).
64. B. C. Meyers, S. Kaushik, R. S. Nandety, Curr. Opin. Plant
Biol. 8, 129 (2005).
65. L. Deslandes et al., Proc. Natl. Acad. Sci. U.S.A. 99, 2404
(2002).
66. E. J. Mellerowicz, M. Baucher, B. Sundberg, W. Boerjan,
Plant Mol. Biol. 47, 239 (2001).
67. A. P. Mähönen et al., Science 311, 94 (2006).
68. S. Andersson-Gunneras et al., Plant J. 34, 339 (2003).
69. J. M. Hellgren, K. Olofsson, B. Sundberg, Plant Physiol.
135, 212 (2004).
70. M. G. Cline, K. Dong-Il, Ann. Bot. (London) 90, 417 (2002).
71. R. Ruonala, P. Rinne, M. Baghour, H. Tuominen,
J. Kangasjärvi, Plant J., in press.
72. R. Moyle et al., Plant J. 31, 675 (2002).
73. D. Weijers et al., EMBO J. 24, 1874 (2005).
74. G. Hagen, T. Guilfoyle, Plant Mol. Biol. 49, 373 (2002).
75. M. E. Eriksson, M. Israelsson, O. Olsson, T. Moritz, Nat.
Biotechnol. 18, 784 (2000).
76. T. Kakimoto, Science 274, 982 (1996).
77. T. Kiba, K. Aoki, H. Sakakibara, T. Mizuno, Plant Cell
Physiol. 45, 1063 (2004).
78. T. Nakano, K. Suzuki, T. Fujimura, H. Shinshi, Plant
Physiol. 140, 411 (2006).
79. F. Sterky et al., Proc. Natl. Acad. Sci. U.S.A. 101, 13951
(2004).
80. J. Ehlting et al., Plant J. 42, 618 (2005).
Opposing Activities Protect Against
Age-Onset Proteotoxicity
Ehud Cohen,1* Jan Bieschke,2* Rhonda M. Perciavalle,1 Jeffery W. Kelly,2 Andrew Dillin1†
Aberrant protein aggregation is a common feature of late-onset neurodegenerative diseases, including
Alzheimer’s disease, which is associated with the misassembly of the Aβ1-42 peptide. Aggregation-mediated
Aβ1-42 toxicity was reduced in Caenorhabditis elegans when aging was slowed by decreased insulin/
insulin growth factor–1–like signaling (IIS). The downstream transcription factors heat shock factor 1
and DAF-16 regulate opposing disaggregation and aggregation activities to promote cellular survival in
response to constitutive toxic protein aggregation. Because the IIS pathway is central to the regulation of
longevity and youthfulness in worms, flies, and mammals, these results suggest a mechanistic link
between the aging process and aggregation-mediated proteotoxicity.
Late-onset human neurodegenerative diseases
including Alzheimer’s (AD), Huntington’s,
and Parkinson’s diseases are genetically and pathologically linked to aberrant protein aggregation (1, 2). In AD, formation of
aggregation-prone peptides, particularly Aβ1-42,
by endoproteolysis of the amyloid precursor protein (APP) is associated with the disease through
an unknown mechanism (3, 4). Whether intracellular accumulation or extracellular deposition of
Aβ1-42 initiates the pathological process is a key
unanswered question (5). Typically, individuals
who carry AD-linked mutations present with clinical symptoms during their fifth or sixth decade,
whereas sporadic cases appear after the seventh
decade. Why aggregation-mediated toxicity emerges
late in life and whether it is mechanistically linked
to the aging process remain unclear.
Perhaps the most prominent pathway that
regulates life span and youthfulness in worms,
flies, and mammals is the insulin/insulin growth
factor (IGF)–1–like signaling (IIS) pathway (6).
In the nematode Caenorhabditis elegans, the
sole insulin/IGF-1 receptor, DAF-2 (7), initiates
the transduction of a signal that causes the phosphorylation of the FOXO transcription factor,
DAF-16 (8, 9), preventing its translocation to
the nucleus (10). This negative regulation of
DAF-16 compromises expression of its target
genes, decreases stress resistance, and shortens
the worm’s life span. Thus, inhibition of daf-2
expression creates long-lived, youthful, stress-resistant worms (11). Similarly, suppression of
the mouse DAF-2 ortholog, IGF1-R, creates long-lived mice (12). Recent studies indicate that, in
worms, life-span extension due to reduced daf-2
activity is also dependent upon heat shock factor 1
(HSF-1). Moreover, increased expression of hsf-1
extends worm life span in a daf-16–dependent
manner (13). That the DAF-16 and HSF-1 transcriptomes result in the expression of numerous chaperones (13, 14) suggests that the
integrity of protein folding could play a key role
in life-span determination and the amelioration
of aggregation-associated proteotoxicity. Indeed,
amelioration of Huntington-associated proteotoxicity by slowing the aging process in worms has
been reported (13, 15, 16).
81. B. C. Meyers, S. Kaushik, R. S. Nandety, Curr. Opin. Plant
Biol. 8, 129 (2005).
82. M. W. Jones-Rhoades, D. P. Bartel, Mol. Cell 14, 787 (2004).
83. We thank the U.S. Department of Energy, Office of
Science for supporting the sequencing and assembly
portion of this study; Genome Canada and the Province of
British Columbia for providing support for the BAC end,
BAC genotyping, and full-length cDNA portions of this
study; the Umeå University and the Royal Technological
Institute (KTH) in Stockholm for supporting the EST
assembly and annotation portion of this study; the
membership of the International Populus Genome
Consortium for supplying genetic and genomics resources
used in the assembly and annotation of the genome; the
NSF Plant Genome Program for supporting the development of Web-based tools; T. H. D. Bradshaw and
R. Stettler for input and reviews on draft copies of the
manuscript; J. M. Tuskan for guidance and input during
the analysis and writing of the manuscript; and the
anonymous reviewers who provided critical input and
recommendations on the manuscript. GenBank Accession
Number: AARH00000000.
Supporting Online Material
www.sciencemag.org/cgi/content/full/313/5793/1596/DC1
Materials and Methods
Figs. S1 to S15
Tables S1 to S14
References
13 April 2006; accepted 9 August 2006
10.1126/science.1128691
Reduced IIS activity lowers Aβ1-42 toxicity.
One hypothesis to explain late-onset aggregation-associated toxicity posits that the deposition of
toxic aggregates is a stochastic process, governed
by a nucleated polymerization and requiring
many years to initiate disease. Alternatively,
aging could enable constitutive aggregation to
become toxic as a result of declining detoxification activities. To distinguish between these two
possibilities, we asked what role the aging process plays in Aβ1-42 aggregation-mediated toxicity in a C. elegans model featuring intracellular
Aβ1-42 expression (17). If Aβ1-42 toxicity results
from a non-age-related nucleated polymerization,
animals that express Aβ1-42 and whose life span
has been extended would be expected to succumb
to Aβ1-42 toxicity at the same rate as those with a
natural life span. However, if the aging process
plays a role in detoxifying an ongoing protein aggregation process, alteration of the aging program
1Molecular and Cell Biology Laboratory, Salk Institute for
Biological Studies, 10010 North Torrey Pines Road, La Jolla, CA
92037, USA. 2Department of Chemistry and Skaggs Institute
of Chemical Biology, Scripps Research Institute, 10550 North
Torrey Pines Road, La Jolla, CA 92037, USA.
*These authors contributed equally to this work.
†To whom correspondence should be addressed. E-mail:
[email protected]
Vol 449 | 27 September 2007 | doi:10.1038/nature06148
LETTERS
The grapevine genome sequence suggests ancestral
hexaploidization in major angiosperm phyla
The French–Italian Public Consortium for Grapevine Genome Characterization*
The analysis of the first plant genomes provided unexpected evidence for genome duplication events in species that had previously
been considered as true diploids on the basis of their genetics (refs 1–3).
These polyploidization events may have had important consequences in plant evolution, in particular for species radiation and
adaptation and for the modulation of functional capacities (refs 4–10). Here
we report a high-quality draft of the genome sequence of grapevine
(Vitis vinifera) obtained from a highly homozygous genotype. The
draft sequence of the grapevine genome is the fourth one produced
so far for flowering plants, the second for a woody species and the
first for a fruit crop (cultivated for both fruit and beverage).
Grapevine was selected because of its important place in the cultural heritage of humanity beginning during the Neolithic period (ref. 11).
Several large expansions of gene families with roles in aromatic
features are observed. The grapevine genome has not undergone
recent genome duplication, thus enabling the discovery of ancestral
traits and features of the genetic organization of flowering plants.
This analysis reveals the contribution of three ancestral genomes to
the grapevine haploid content. This ancestral arrangement is common to many dicotyledonous plants but is absent from the genome
of rice, which is a monocotyledon. Furthermore, we explain the
chronology of previously described whole-genome duplication
events in the evolution of flowering plants.
All grapevine varieties are highly heterozygous; preliminary data
showed that there was as much as 13% sequence divergence between
alleles, which would hinder reliable contig assembly when a whole-genome shotgun strategy was used for sequencing. Our consortium
therefore selected the grapevine PN40024 genotype for sequencing.
This line, originally derived from Pinot Noir, has been bred close to
full homozygosity (estimated at about 93%) by successive selfings,
permitting a high-quality whole-genome shotgun assembly.
A total of 6.2 million end-reads were produced by our consortium,
representing an 8.4-fold coverage of the genome. Within the assembly, performed with Arachne (ref. 12), 316 supercontigs represent putative
allelic haplotypes that constitute 11.6 million bases (Mb). These
values are consistent with the 7% residual heterozygosity of
PN40024 assessed by using genetic markers. When considering only
one of the haplotypes in each heterozygous region, the assembly
(Table 1a) consists of 19,577 contigs (N50 = 65.9 kilobases (kb),
where N50 corresponds to the size of the shortest supercontig or
contig in the smallest set of longest sequences that together represent half of the assembly size) and 3,514
supercontigs (N50 = 2.07 Mb) totalling 487 Mb. This value is
close to the 475 Mb previously reported for the grapevine genome
size (ref. 13).
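The N50 statistic defined above can be computed from a list of contig or supercontig lengths. The following Python sketch is a minimal illustration under that definition; the function name `n50` and the toy lengths are assumptions made for this example and are not values from the grapevine assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are taken from longest to shortest until their summed
    length reaches half of the total; the length of the last sequence
    added is the N50.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in kb (not real assembly data).
toy_contigs_kb = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs_kb))
```

A higher N50 means that half of the assembled sequence lies in longer pieces, which is why the statistic is reported separately for contigs and supercontigs.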
Using a set of 409 molecular markers from the reference grapevine
map (ref. 14), 69% of the assembled 487 Mb, arranged into 45 ultracontigs
and 51 single supercontigs, were anchored along the 19 linkage
groups. Thirty-seven ultracontigs and 22 single supercontigs were
oriented, representing 61% of the genome assembly (Supplementary Tables 2 and 3).

Table 1 | Global statistics on the genome of Vitis vinifera

(a) Assembly
              Status                                 Number   N50 (kb)   Longest (kb)   Size (Mb)   Percentage of the assembly
Contigs       All                                    19,577   65.9       557            467.5       –
Supercontigs  All                                    3,514    2,065      12,675         487.1       100
              Anchored on chromosomes                191      3,189      12,675         335.6       68.9
              Anchored on chromosomes and oriented   143      3,827      12,675         296.9       60.9

(b) Annotation
              Number    Median size (bp)   Total length (Mb)   Percentage of the genome   %GC
Gene          30,434    3,399              225.6               46.3                       36.2
Exons CDS     149,351   130                33.6                6.9                        44.5
Introns CDS   118,917   213                178.6               36.7                       34.7
Intergenic    30,453    3,544              261.5               34.7                       33.0
tRNA*         600       73                 0.04                NS                         43.0
miRNA†        164       103.5              0.002               NS                         35.9

(c) Orthology
                            Number of orthologous proteins   Mean identity (%)
P. trichocarpa              12,996                           72.7
A. thaliana                 11,404                           65.5
O. sativa                   9,731                            59.8
Common to eudicotyledons‡   10,547
Common to Magnoliophyta§    8,121

* Transfer RNA (tRNA) values were computed on exons.
† Micro RNAs (miRNAs) are members of known conserved miRNA families.
‡ Eudicotyledons are represented by P. trichocarpa and A. thaliana.
§ Magnoliophyta (most flowering plants) are represented by P. trichocarpa, A. thaliana and O. sativa.

*A list of participants and their affiliations appears at the end of the paper.
©2007 Nature Publishing Group
This assembly has been annotated by using a combination of evidence. The major features of the genome annotation are presented in
Table 1b. The 8.4-fold draft sequence of the grapevine genome contains a set of 30,434 protein-coding genes (an average of 372 codons
and 5 exons per gene). This value is considerably lower than the
45,555 protein-coding genes reported for the poplar (Populus trichocarpa) genome, which has a similar size, at 485 Mb (ref. 1), and even
lower than the 37,544 protein-coding genes identified in the 389 Mb
of the rice genome (ref. 2).
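The comparison above can be made concrete by normalizing gene counts to genome size. The Python sketch below is illustrative only: it uses the gene numbers and sizes quoted in the text (taking 487 Mb as the grapevine assembly size), and the names `GENOMES` and `genes_per_mb` are our own.

```python
# (protein-coding genes, genome size in Mb) as quoted in the text.
GENOMES = {
    "grapevine": (30434, 487),
    "poplar": (45555, 485),
    "rice": (37544, 389),
}


def genes_per_mb(gene_count, size_mb):
    """Return the average number of protein-coding genes per megabase."""
    if size_mb <= 0:
        raise ValueError("genome size must be positive")
    return gene_count / size_mb


density = {
    species: genes_per_mb(genes, size)
    for species, (genes, size) in GENOMES.items()
}

for species, value in sorted(density.items(), key=lambda item: item[1]):
    print(f"{species}: {value:.1f} genes per Mb")
```

On these figures grapevine carries roughly 62 genes per Mb, against roughly 94 for poplar and 97 for rice, which matches the ordering described in the text.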
Three different approaches revealed that 41.4% (average value) of
the grapevine genome is composed of repetitive/transposable elements (TEs), a slightly higher proportion than that identified in the
rice genome, which has a somewhat smaller size (ref. 2). The distribution of
repeats and TEs along the chromosomes is quite uneven (see below).
All classes and superfamilies of TEs are represented in the grapevine
genome, with a large prevalence of class I elements over class II and
helitrons (rolling-circle transposons) (Supplementary Table 7). An
analysis of the distribution of the repetitive elements in the different
fractions of the grapevine genome based on the current annotation
shows that introns are quite rich in repeats and TEs (data not shown).
In addition, 12.4% of the intron sequence contains transposons as
determined using our set of manually annotated elements, most of
which (75%) correspond to LINE (long interspersed element) retrotransposons, which therefore seem to have contributed specifically to
the intron size observed in grapevine (Supplementary Table 8).
In eukaryotes with large genomes, the coding and repeated elements are distributed over the chromosomes and may be more or less
interlaced, hence defining gene-poor and gene-rich regions. It has
previously been noticed that the distribution of the genes along
the chromosomes of rice and Arabidopsis thaliana is fairly homogeneous (refs 2, 3). In contrast, we observe large regions that alternate
between high and low gene density in V. vinifera (Supplementary
Figs 2 and 3). As expected, the density of TEs reflects a pattern
substantially complementary to gene density. We observe a similar
characteristic in the genome sequence of poplar, therefore indicating
a dynamic for the invasion of TEs that is shared with the grapevine
(Supplementary Fig. 3).
A striking feature of the grapevine proteome lies in the existence of
large families related to wine characteristics, which have a higher gene
copy number than in the other sequenced plants. Stilbene synthases
(STSs) drive the synthesis of resveratrol, the grapevine phytoalexin
that has been linked to the health benefits of
moderate consumption of red wine (refs 15, 16). The family of genes encoding
STSs is noticeably expanded: 43 genes have been identified. Of
these, 20 have previously been shown to be expressed after infection
by Plasmopara viticola, thus confirming that they are likely to be
functional. The terpene synthases (TPSs) drive the synthesis of
terpenoids; these secondary metabolites are major components of
resins, essential oils and aromas (their relative abundance is directly
correlated with the aromatic features of wines; ref. 17) and are involved in
plant–environment interactions. In comparison with the 30–40
genes of this family in Arabidopsis, rice and poplar, the grapevine
TPS family is more than twice as large, with 89 functional genes and
27 pseudogenes. Classification based on known plant homologues
reveals that the subclass of putative monoterpene synthases represents only 15% of the Arabidopsis TPS family (ref. 18) whereas this subclass
represents 40% of the grapevine TPS family. This result suggests a
high diversification of grapevine monoterpene synthases that specifically produce C10 terpenoids present in aroma (such as geraniol,
linalool, cineole and α-terpineol). Furthermore, the grapevine genome annotation has also revealed genes encoding homologues to the
two forms of geranyl diphosphate synthases (GPPSs), the enzymes
that produce the substrate for monoterpene synthases: both the
homodimeric GPPS and the heterodimeric form are present; the
latter is present only in plants such as Mentha piperita and Clarkia
breweri, which produce large quantities of monoterpenes (ref. 19). Most of
the STS and TPS genes occur as 20 clusters, including up to 33 paralogous genes located in a 680-kb stretch.
Because global duplication events seem to be frequent in
plant evolution (ref. 20), we searched the genome of V. vinifera for paralogous regions by using protein sequence similarity. Paralogous regions
are defined as chromosome fragments in which homologous genes
are present in clusters. Statistical analysis (ref. 21) of these clusters reveals
that 94.5% have high probability of being paralogous (P < 10⁻⁴;
Supplementary Table 11). Most Vitis gene regions have two different
paralogous regions, which we have grouped together as triplets
(Supplementary Fig. 5; coverage details in Supplementary Table
10). We conclude that the present-day grapevine haploid genome
originated from the contribution of three ancestral genomes. It is
yet to be demonstrated whether this content came from a true hexaploidization event or through successive genome duplications. The
resulting plant had a diploid content that corresponds to the three
full diploid contents of the three ancestors; it may therefore be
described as a ‘palaeo-hexaploid’ organism. A number of rearrangements have affected the original three complements after the formation of the palaeo-hexaploid state. However, the gene order has been
sufficiently conserved to permit the alignment of most regions with
their two siblings.
We explored the time of formation of the palaeo-hexaploid
arrangement by comparing grapevine gene regions with those of
other completely sequenced plant genomes. If the palaeo-hexaploid
complement is present in another species, it should result in a one-for-one pairing of gene regions between the two species considered.
In contrast, if another species' lineage diverged before palaeo-hexaploid formation, it should result in a one-to-three relationship
between the other species and the grapevine genome. The available
genome sequences were those of poplar (ref. 1), Arabidopsis (ref. 3) and rice (Oryza
sativa; ref. 2), of which poplar is considered to be most closely related to
grapevine. All clusters constructed between the orthologues in the
three comparisons have P < 10⁻⁴ (Table 1c). When the gene order in
poplar is compared with that in grapevine, there are two clear distributions. First, the grapevine regions align with two poplar segments, as would be expected from a recent whole-genome
duplication (WGD) in the poplar lineage (ref. 1). Second, each of the three
grapevine regions that form a homologous triplet recognizes different pairs of poplar segments (Fig. 1a and Supplementary Fig. 6). This
shows that the palaeo-hexaploidy observed in grapevine was already
present in its common ancestor with poplar.
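The reasoning in this paragraph can be summarized as a small decision procedure. The Python sketch below is a simplified illustration and not the statistical cluster analysis used for the genome comparison: the function names, the result labels and the segment identifiers are all hypothetical. It asks whether the three members of a grapevine triplet match distinct or identical sets of segments in another species, and how many rounds of WGD a given number of matching segments per grapevine region would imply.

```python
from itertools import combinations


def triplication_timing(triplet_hits):
    """Classify a grapevine triplet from its matches in another species.

    triplet_hits holds three sets of segment identifiers, one set per
    member of the grapevine triplet.
    """
    if len(triplet_hits) != 3 or any(not hits for hits in triplet_hits):
        return "ambiguous"
    pairs = list(combinations(triplet_hits, 2))
    if all(a.isdisjoint(b) for a, b in pairs):
        # Each member has its own counterparts: the other species
        # inherited the triplicated arrangement.
        return "predates divergence"
    if all(a == b for a, b in pairs):
        # All members share the same counterparts: a one-to-three
        # relationship between the other species and grapevine.
        return "postdates divergence"
    return "ambiguous"


def lineage_wgd_rounds(segments_per_region):
    """Return n if a region matches 2**n segments, otherwise None."""
    if segments_per_region < 1:
        return None
    rounds = segments_per_region.bit_length() - 1
    return rounds if 2 ** rounds == segments_per_region else None


# Hypothetical segment labels, for illustration only.
distinct_pairs = [{"P1", "P2"}, {"P3", "P4"}, {"P5", "P6"}]
shared_block = [{"S1", "S2"}, {"S1", "S2"}, {"S1", "S2"}]

print(triplication_timing(distinct_pairs), lineage_wgd_rounds(2))
print(triplication_timing(shared_block))
```

In this toy setting, three grapevine regions that each match a different pair of segments correspond to the pattern reported for poplar: the triplication is shared, and two segments per region imply one later WGD.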
Poplar belongs to the Eurosid I clade. The sister clade to Eurosid I
is that of Eurosid II, which contains the model species Arabidopsis. Its
gene order was compared with that in the grapevine genome. Two
distributions appear: first, most grapevine regions correspond to four
Arabidopsis segments (Supplementary Fig. 7); second, each component of a triplicated group in grapevine recognizes four different
regions in Arabidopsis (Fig. 1b). This shows that the grapevine
palaeo-hexaploidy was present in the common ancestor to
Arabidopsis and grapevine, and therefore that it is a trait common
to all Eurosids. This is confirmed by the homology level distribution
between paralogues of the grapevine, indicating a lower conservation
than between Vitis/Arabidopsis orthologues (Supplementary Fig. 4).
The Eurosid group contains many economically important flowering
plants such as legumes, cotton and Brassicaceae. Our present results
establish these species as having a palaeo-hexaploid common
ancestor. The grapevine/Arabidopsis comparison also reveals that
the Arabidopsis lineage underwent two WGDs after its separation
from the Eurosid I clade (refs 21–24). This contradicts some models based
on more indirect evidence that placed the most ancient of these two
duplications at the base of the Eurosid group, or even earlier (refs 4, 20–22).
Some studies had also suggested a possible third duplication event in
the distant past of the Arabidopsis lineage, potentially at the base of
the angiosperm radiation. The controversy about this third event is
now resolved by the Vitis genome comparisons: this event corresponds to the palaeo-hexaploidy formation that remains evident in
the grapevine genome but has been difficult to characterize in
Arabidopsis and poplar because of the more recent WGDs. In particular, the Arabidopsis genome lineage has undergone many rearrangements and chromosome fusions such that the ancestral gene
order is particularly difficult to deduce from this species (Fig. 2).
Grapevines, like Arabidopsis and poplar, are dicotyledonous plants
that diverged from monocotyledons about 130–240 Myr ago25,26.
Because rice is a monocotyledon, we assessed the presence or absence
of palaeo-hexaploidy in its genome sequence. The observed pattern is
the opposite of that seen for Arabidopsis and poplar: constituents of a
grapevine triplet are generally orthologous to the same group of rice
regions (Fig. 1c and Supplementary Fig. 11). Because rice and grapevine are phylogenetically distant, it is more difficult to detect relations of orthology across the two whole genomes: rearrangements,
duplication and gene loss have affected the gene orders differently in
the two lineages (Supplementary Fig. 10). Even with this limitation,
we observed numerous cases of one-to-three relationships between
Figure 1 | Comparison between three paralogous Vitis genomic regions and
their orthologues in P. trichocarpa, A. thaliana and O. sativa. Orthologous
gene pairs are joined with a different colour for each of the three paralogous
grapevine chromosomes 6 (green), 8 (blue) and 13 (red). a, Orthologous
regions in the poplar genome are different for each of the three Vitis
chromosomes, showing that the triplication predates the poplar/Vitis
separation. One Vitis region recognizes two poplar segments because of a
WGD in the poplar lineage after the separation. b, Orthologous regions with
Arabidopsis are different for each of the three Vitis chromosomes. This
shows that the Arabidopsis/Vitis ancestor had the same palaeo-hexaploid
content. One Vitis region corresponds to four Arabidopsis segments,
indicating the presence of two WGDs in the Arabidopsis lineage after
separation from the Vitis lineage. c, Orthologous regions in rice are the same
for the three paralogous chromosomes. This indicates that the triplication
was not present in the common ancestor of monocotyledons and
dicotyledons. The presence in rice of different homologous blocks is due to
global duplications in the rice lineage after divergence from dicotyledons.
Figure 2 | Schematic representation of paralogous regions derived from
the three ancestral genomes in the karyotypes of V. vinifera, P. trichocarpa
and A. thaliana. Each colour corresponds to a syntenic region between the
three ancestral genomes that were defined by their occurrence as linked
clusters in grapevine, independently of intrachromosomal rearrangements.
The V. vinifera genome (a) is by far the closest to the ancestral arrangement,
whereas that of Arabidopsis (c) is thoroughly rearranged, and P. trichocarpa
(b) presents an intermediate situation. The seven colours probably
correspond to linkage groups at the time of the palaeo-hexaploid ancestor.
rice and grapevine (Supplementary Figs 8, 9 and 11); 23% of orthologous blocks include the paralogous regions that originate from the
grapevine palaeo-hexaploidy. For Arabidopsis, this number is as low
as 1.4% (this difference is significant at 5%: χ² = 8.9; Supplementary
Table 12), despite the fact that the Arabidopsis genome has suffered
many gene losses since its two WGDs. These gene losses would be
expected to obscure the orthologous relations with the grapevine
genome, but they are clearly insufficient to explain the high number
of one-to-three relationships observed in the rice–grapevine comparison. The most probable explanation for this excess is that the rice
ancestor did not exhibit the palaeo-hexaploidy observed in the grapevine, poplar and Arabidopsis.
These findings are summarized in Fig. 3: the triplicated arrangement is apparent after the separation of the monocotyledons and
dicotyledons and before the spread of the Eurosid clade. Future genome sequencing projects for other clades of dicotyledons, such as
Solanaceae or basal eudicots, will help in situating the triplication
event more precisely, and eventually in establishing its precise nature
(hexaploidization or genome duplications at distant times).
Public access to the grapevine genome sequence will help in the
identification of genes underlying the agricultural characteristics of
this species, including domestication traits. A selective amplification
of genes belonging to the metabolic pathways of terpenes and tannins
has occurred in the grapevine genome, in contrast with other plant
genomes. This suggests that it may become possible to trace the
diversity of wine flavours down to the genome level. Grapevine is
also a crop that is highly susceptible to a large diversity of pathogens
including powdery mildew, oidium and Pierce disease. Other Vitis
species such as V. riparia or V. cinerea, which are known to be resistant to several of these pathogens, are interfertile with V. vinifera
and can be used for the introduction of resistance traits by advanced
backcrosses27 or by gene transfer. Access to the Vitis sequence and the
exploitation of synteny will speed up this process of introgression of
pathogen resistance traits. As a consequence of this, it is hoped that it
will also prompt a strong decrease in pesticide use.
The high quality of the assembly, due mainly to the highly homozygous nature of the PN40024 line, enables the discovery of three
ancestral genomes constituting the diploid content of grapevine. The
Greek historian Thucydides wrote that Mediterranean people began
to emerge from ignorance when they learnt to cultivate olives and
grapes. This first characterization of the grapevine genome, with its
indication of a palaeo-hexaploid ancestral genome for many dicotyledonous plants, addresses fundamental questions related to the
origin and importance of this event in the history of flowering plants.
Future work may help in correlating the differential fates of the three
gene complements with phenotypic traits of dicotyledonous species.
METHODS SUMMARY
Gene annotation. Protein-coding genes were predicted by combining ab initio
models, V. vinifera complementary DNA alignments, and alignments of proteins
and genomic DNA from other species. The integration of the data was performed
with GAZE28. Details are given in Supplementary Information.
Paralogous and orthologous gene sets. Statistical testing of homologous regions
was performed as described in ref. 21.
Full Methods and any associated references are available in the online version of
the paper at www.nature.com/nature.
Received 5 April; accepted 7 August 2007.
Published online 26 August 2007.
Figure 3 | Positions of the polyploidization events in the evolution of plants
with a sequenced genome. Each star indicates a WGD (tetraploidization)
event on that branch. The question mark indicates that ancient events are
visible in the rice genome that would require other monocotyledon genome
sequences to be resolved. The formation of the palaeo-hexaploid ancestral
genome occurred after divergence from monocotyledons and before the
radiation of the Eurosids.
1. Tuskan, G. A. et al. The genome of black cottonwood, Populus trichocarpa (Torr. &
Gray). Science 313, 1596–1604 (2006).
2. International Rice Genome Sequencing Project. The map-based sequence of the
rice genome. Nature 436, 793–800 (2005).
3. Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering
plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
4. De Bodt, S., Maere, S. & Van de Peer, Y. Genome duplication and the origin of
angiosperms. Trends Ecol. Evol. 20, 591–597 (2005).
5. Scannell, D. R., Byrne, K. P., Gordon, J. L., Wong, S. & Wolfe, K. H. Multiple rounds
of speciation associated with reciprocal gene loss in polyploid yeasts. Nature 440,
341–345 (2006).
6. Jaillon, O. et al. Genome duplication in the teleost fish Tetraodon nigroviridis
reveals the early vertebrate proto-karyotype. Nature 431, 946–957 (2004).
7. Aury, J. M. et al. Global trends of whole-genome duplications revealed by the
ciliate Paramecium tetraurelia. Nature 444, 171–178 (2006).
8. Maere, S. et al. Modeling gene and genome duplications in eukaryotes. Proc. Natl
Acad. Sci. USA 102, 5454–5459 (2005).
9. Blanc, G. & Wolfe, K. H. Functional divergence of duplicated genes formed by
polyploidy during Arabidopsis evolution. Plant Cell 16, 1679–1691 (2004).
10. Seoighe, C. & Gehring, C. Genome duplication led to highly selective expansion of
the Arabidopsis thaliana proteome. Trends Genet. 20, 461–464 (2004).
11. McGovern, P. E., Hartung, U., Badler, V., Glusker, D. L. & Exner, L. J. The beginnings
of wine making and viniculture in the ancient Near East and Egypt. Expedition 39,
3–21 (1997).
12. Jaffe, D. B. et al. Whole-genome sequence assembly for mammalian genomes:
Arachne 2. Genome Res. 13, 91–96 (2003).
13. Lodhi, M. A., Daly, M. J., Ye, G. N., Weeden, N. F. & Reisch, B. I. A molecular marker
based linkage map of Vitis. Genome 38, 786–794 (1995).
14. Doligez, A. et al. An integrated SSR map of grapevine based on five mapping
populations. Theor. Appl. Genet. 113, 369–382 (2006).
15. Baur, J. A. et al. Resveratrol improves health and survival of mice on a high-calorie
diet. Nature 444, 337–342 (2006).
16. Baur, J. A. & Sinclair, D. A. Therapeutic potential of resveratrol: the in vivo
evidence. Nature Rev. Drug Discov. 5, 493–506 (2006).
17. Mateo, J. J. & Jimenez, M. Monoterpenes in grape juice and wines. J. Chromatogr. A
881, 557–567 (2000).
18. Aubourg, S., Lecharny, A. & Bohlmann, J. Genomic analysis of the terpenoid
synthase (AtTPS) gene family of Arabidopsis thaliana. Mol. Genet. Genomics 267,
730–745 (2002).
19. Tholl, D. et al. Formation of monoterpenes in Antirrhinum majus and Clarkia breweri
flowers involves heterodimeric geranyl diphosphate synthases. Plant Cell 16,
977–992 (2004).
20. Adams, K. L. & Wendel, J. F. Polyploidy and genome evolution in plants. Curr. Opin.
Plant Biol. 8, 135–141 (2005).
21. Simillion, C., Vandepoele, K., Van Montagu, M. C., Zabeau, M. & Van de Peer, Y.
The hidden duplication past of Arabidopsis thaliana. Proc. Natl Acad. Sci. USA 99,
13627–13632 (2002).
22. Bowers, J. E., Chapman, B. A., Rong, J. & Paterson, A. H. Unravelling angiosperm
genome evolution by phylogenetic analysis of chromosomal duplication events.
Nature 422, 433–438 (2003).
23. Vision, T. J., Brown, D. G. & Tanksley, S. D. The origins of genomic duplications in
Arabidopsis. Science 290, 2114–2117 (2000).
24. Blanc, G., Hokamp, K. & Wolfe, K. H. A recent polyploidy superimposed on older
large-scale duplications in the Arabidopsis genome. Genome Res. 13, 137–144
(2003).
25. Wolfe, K. H., Gouy, M., Yang, Y. W., Sharp, P. M. & Li, W. H. Date of the
monocot–dicot divergence estimated from chloroplast DNA sequence data. Proc.
Natl Acad. Sci. USA 86, 6201–6205 (1989).
26. Crane, P. R., Friis, E. M. & Pedersen, K. R. The origin and early diversification of
angiosperms. Nature 374, 27–33 (1995).
27. Eshed, Y. & Zamir, D. An introgression line population of Lycopersicon pennellii in
the cultivated tomato enables the identification and fine mapping of
yield-associated QTL. Genetics 141, 1147–1162 (1995).
28. Howe, K. L., Chothia, T. & Durbin, R. GAZE: a generic framework for the
integration of gene-prediction data by dynamic programming. Genome Res. 12,
1418–1427 (2002).
Supplementary Information is linked to the online version of the paper at
www.nature.com/nature.
Acknowledgements The sequencing of the grapevine genome was launched and
carried out after a scientific cooperation agreement between the Ministry of
Agriculture in France and the Ministry of Agriculture in Italy, involving l’Institut
National de la Recherche Agronomique (INRA), Consiglio per la Ricerca e
Sperimentazione in Agricoltura (CRA) and Friuli Venezia Giulia Region. This work
was financially supported by Consortium National de Recherche en Génomique,
Agence Nationale de la Recherche, INRA, and by MiPAF (VIGNA-CRA), Friuli
Innovazione, Università di Udine, Federazione BCC, Fondazione CRUP, Fondazione
Carigo, Fondazione CRT, Vivai Cooperativi Rauscedo, Eurotech, Livio Felluga,
Marco Felluga, Venica e Venica, Le Vigne di Zamò (IGA). We thank S. Cure for
correcting the manuscript; F. Câmara and R. Guigo for the calibration of the GeneID
gene prediction software, and the Centre Informatique National de l’Enseignement
Supérieur for computing resources.
Author Information The final assembly and annotation are deposited in the EMBL/
Genbank/DDBJ databases under accession numbers CU459218–CU462737 (for
all scaffolds) and CU462738–CU462772 (for chromosome reconstitutions and
unanchored scaffolds). An annotation browser and further information on the
project are available from http://www.genoscope.cns.fr/vitis,
http://www.vitisgenome.it/ and http://www.appliedgenomics.org/. Reprints and
permissions information is available at www.nature.com/reprints. The authors
declare no competing financial interests. Correspondence and requests for
materials should be addressed to P.W. ([email protected]).
The French-Italian Public Consortium for Grapevine Genome Characterization
Olivier Jaillon1*, Jean-Marc Aury1*, Benjamin Noel1, Alberto Policriti2,3, Christian
Clepet4, Alberto Casagrande2,5, Nathalie Choisne1,4, Sébastien Aubourg4, Nicola
Vitulo6,15, Claire Jubin1, Alessandro Vezzi6,15, Fabrice Legeai7, Philippe Hugueney8,
Corinne Dasilva1, David Horner9,15, Erica Mica9,15, Delphine Jublot4, Julie Poulain1,
Clémence Bruyère4, Alain Billault1, Béatrice Segurens1, Michel Gouyvenoux1, Edgardo
Ugarte1, Federica Cattonaro2, Véronique Anthouard1, Virginie Vico1, Cristian Del
Fabbro2,3, Michaël Alaux7, Gabriele Di Gaspero2,5, Vincent Dumas8, Nicoletta Felice2,5,
Sophie Paillard4, Irena Juman2,5, Marco Moroldo4, Simone Scalabrin2,3, Aurélie
Canaguier4, Isabelle Le Clainche4, Giorgio Malacrida6,15, Eléonore Durand7, Graziano
Pesole10,11,15, Valérie Laucou12, Philippe Chatelet13, Didier Merdinoglu8, Massimo
Delledonne14,15, Mario Pezzotti15,16, Alain Lecharny4, Claude Scarpelli1, François
Artiguenave1, M. Enrico Pè9,15, Giorgio Valle6,15, Michele Morgante2,5, Michel
Caboche4, Anne-Françoise Adam-Blondon4, Jean Weissenbach1, Francis Quétier1 &
Patrick Wincker1
*These authors contributed equally to this work.
Affiliations for participants: 1Genoscope (CEA) and UMR 8030
CNRS-Genoscope-Université d’Evry, 2 rue Gaston Crémieux, BP5706, 91057 Evry,
France. 2Istituto di Genomica Applicata, Parco Scientifico e Tecnologico di Udine, Via
Linussio 51, 33100 Udine, Italy. 3Dipartimento di Matematica ed Informatica, Università
degli Studi di Udine, via delle Scienze 208, 33100 Udine, Italy. 4URGV, UMR INRA 1165,
CNRS-Université d’Evry Genomique Végétale, 2 rue Gaston Crémieux, BP5708, 91057
Evry cedex, France. 5Dipartimento di Scienze Agrarie ed Ambientali, Università degli
Studi di Udine, via delle Scienze 208, 33100 Udine, Italy. 6CRIBI, Università degli Studi di
Padova, viale G. Colombo 3, 35121 Padova, Italy. 7URGI, UR1164 Génomique Info, 523,
Place des Terrasses, 91034 Evry Cedex, France. 8UMR INRA 1131, Université de
Strasbourg, Santé de la Vigne et Qualité du Vin, 28 rue de Herrlisheim, BP20507, 68021
Colmar, France. 9Dipartimento di Scienze Biomolecolari e Biotecnologie, Università degli
Studi di Milano, via Celoria 26, 20133 Milano, Italy. 10Dipartimento di Biochimica e
Biologia Molecolare, Università degli Studi di Bari, via Orabona 4, 70125 Bari, Italy.
11Istituto Tecnologie Biomediche, Consiglio Nazionale delle Ricerche, via Amendola 122/D,
70125 Bari, Italy. 12UMR INRA 1097, IRD-Montpellier SupAgro-Univ. Montpellier II,
Diversité et Adaptation des Plantes Cultivées, 2 Place Pierre Viala, 34060 Montpellier
Cedex 1, France. 13UMR INRA 1098, IRD-Montpellier SupAgro-CIRAD, Développement
et Amélioration des Plantes, 2 Place Pierre Viala, 34060 Montpellier Cedex 1, France.
14Dipartimento Scientifico e Tecnologico, Università degli Studi di Verona, Strada Le
Grazie 15 – Ca’ Vignal, 37134 Verona, Italy. 15Dipartimento di Scienze, Tecnologie e
Mercati della Vite e del Vino, Università degli Studi di Verona, via della Pieve, 70 37029 S.
Floriano (VR), Italy. 16VIGNA-CRA Initiative; Consorzio Interuniversitario Nazionale per
la Biologia Molecolare delle Piante, c/o Università degli Studi di Siena, via Banchi di Sotto
55, 53100 Siena, Italy.
doi:10.1038/nature06148
METHODS
Genome sequencing. The V. vinifera PN40024 genome was sequenced with the
use of a whole-genome shotgun strategy. All data were generated by paired-end
sequencing of cloned inserts using Sanger technology on ABI3730xl sequencers.
Supplementary Table 2 gives the number of reads obtained per library.
Genome assembly and chromosome anchoring. All reads were assembled with
Arachne12. We obtained 20,784 contigs that were linked into 3,830 supercontigs
of more than 2 kb. The contig N50 was 64 kb, and the supercontig N50 was 1.9 Mb.
The total supercontig size was 498 Mb, remarkably close to the expected size of
475 Mb. This indicates that the PN40024 has retained few heterozygous regions.
Remaining heterozygosity was assessed by aligning all supercontigs with each
other. We first selected the supercontigs more than 30 kb in size that were
covered over more than 40% of their length by another supercontig with more
than 95% identity. After visual inspection of the alignments, we added to this list
the supercontigs more than 10 kb in size that aligned at more than 40% of their
length with supercontigs identified previously. All potential cases were then
inspected visually to discard potential heterozygous regions (aligning relatively
homogeneously across their complete length) and retained repeated regions
(with more heterogeneous alignments). This treatment identified 11 Mb of
potentially allelic supercontigs. We confirmed that in most cases their coverage
was about half the average of the homozygous supercontigs. Only one supercontig of each allelic pair was therefore conserved in the final assembly, which
consists of 3,514 supercontigs (N50 = 2 Mb) containing 19,577 contigs
(N50 = 66 kb), totalling 487 Mb. If the haploid genome size of 475 Mb is considered correct, then our final assembly contains only about 12 Mb of remaining
heterozygosity, or 2.6%.
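The N50 values quoted above can be illustrated with a minimal Python sketch. It assumes the usual definition of N50 (the length L such that sequences of length at least L together hold at least half of the assembled bases); the function name and the toy lengths are hypothetical and are not taken from the assembly itself.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Assumed definition: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one sequence length")


# Toy lengths (hypothetical): total is 100, so half is 50.
toy_lengths = [40, 25, 15, 10, 5, 5]
print(n50(toy_lengths))  # 40 + 25 = 65 >= 50, so the N50 is 25
```

Applied to the lengths of the supercontigs or contigs of an assembly, the same function would give the kind of summary statistic reported in this section.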
A set of 30,151 bacterial artificial chromosome (BAC) fingerprints of the BAC
clones of a Cabernet–Sauvignon library29 were assembled into 1,763 contigs with
FPC30, v. 8. In parallel, 1,981 markers were anchored on a subset of BAC clones31,
among which 388 markers mapped onto the genetic map, and 77,237 BAC end
sequences were obtained31. Blat32 alignments (90% identity on 80% of the length,
fewer than five hits) were performed with BAC end sequences on the 3,830
supercontigs of sequences with lengths over 2 kb. The results were then filtered
with homemade Perl scripts to keep only the occurrences in which two paired
ends were matching at a distance of less than 300 kb and with a consistent
orientation. Two supercontigs were considered linked to each other if two
BAC links could be found or one BAC link and a BAC contig link. A total number
of 111 ultracontigs were constructed with this procedure.
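The linking rule stated above (two BAC links, or one BAC link plus one BAC contig link) can be sketched in Python as follows. This is an illustration only: the original filtering was done with Perl scripts that are not shown in the paper, and the input layout (pairs of supercontig identifiers, one pair per link that has already passed the distance and orientation filters) and all identifiers are assumptions.

```python
from collections import Counter


def linked_supercontigs(bac_links, bac_contig_links):
    """Return the supercontig pairs considered linked.

    bac_links: iterable of (supercontig, supercontig) pairs, one per BAC whose
        two end sequences map to different supercontigs (assumed pre-filtered).
    bac_contig_links: iterable of (supercontig, supercontig) pairs supported by
        a fingerprint-based BAC contig (assumed input).
    A pair is linked if it has two or more BAC links, or one BAC link and a
    BAC contig link.
    """
    bac_counts = Counter(frozenset(pair) for pair in bac_links)
    contig_pairs = {frozenset(pair) for pair in bac_contig_links}
    linked = []
    for pair, count in bac_counts.items():
        if len(pair) != 2:
            continue  # both ends on the same supercontig: not a link
        if count >= 2 or pair in contig_pairs:
            linked.append(tuple(sorted(pair)))
    return sorted(linked)


# Hypothetical identifiers, for illustration only.
bac_links = [("sc1", "sc2"), ("sc2", "sc1"), ("sc2", "sc3"), ("sc3", "sc4")]
bac_contig_links = [("sc3", "sc2")]
print(linked_supercontigs(bac_links, bac_contig_links))
# [('sc1', 'sc2'), ('sc2', 'sc3')]
```

In this toy input, sc1 and sc2 are joined by two BAC links, sc2 and sc3 by one BAC link and one BAC contig link, and sc3 and sc4 by a single BAC link only, so the last pair is not retained.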
Genome annotation. Several resources were used to build V. vinifera gene models automatically with GAZE28. We used predictions of repetitive regions by
repeatscout33, conserved coding regions predicted by the exofish method34,35,
genewise36 alignments of proteins from Uniprot37, Geneid38 and Snap39 ab initio
gene predictions, and alignments of several cDNA resources (Supplementary
Information).
A weight was assigned to each resource to further reflect its reliability and
accuracy in predicting gene models. This weight acts as a multiplier for the score
of each information source, before being processed by GAZE. When applied to
the entire assembled sequence, GAZE predicted 30,434 gene models.
Paralogous and orthologous gene sets. We identified orthologous genes in
six pairs of genomes from four species: A. thaliana, O. sativa, P. trichocarpa
and V. vinifera. Each pair of predicted gene sets was aligned with the Smith–
Waterman algorithm, and alignments with a score higher than 300 (BLOSUM62;
gapo = 10, gape = 1) were retained. Two genes, A from genome GA and B from
genome GB, were considered orthologues if B was the best match for gene A in
GB and A was the best match for B in GA.
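The reciprocal-best-match criterion described above can be written as a short Python sketch. It assumes that the Smith–Waterman scores are already available as a dictionary keyed by (gene in GA, gene in GB); the gene names and scores are hypothetical, and ties are resolved in favour of the first alignment encountered, which the text does not specify.

```python
def reciprocal_best_hits(scores, min_score=300):
    """Return (a, b) pairs that are each other's best match.

    scores: dict mapping (gene_a, gene_b) -> alignment score, where gene_a is
        from genome GA and gene_b from genome GB (assumed input layout).
    Only alignments with a score strictly higher than min_score are used.
    """
    best_for_a = {}  # gene_a -> (best gene_b, score)
    best_for_b = {}  # gene_b -> (best gene_a, score)
    for (gene_a, gene_b), score in scores.items():
        if score <= min_score:
            continue
        if gene_a not in best_for_a or score > best_for_a[gene_a][1]:
            best_for_a[gene_a] = (gene_b, score)
        if gene_b not in best_for_b or score > best_for_b[gene_b][1]:
            best_for_b[gene_b] = (gene_a, score)
    return sorted(
        (gene_a, gene_b)
        for gene_a, (gene_b, _) in best_for_a.items()
        if best_for_b[gene_b][0] == gene_a
    )


# Hypothetical gene names and scores, for illustration only.
scores = {
    ("Vv1", "At1"): 900,
    ("Vv1", "At2"): 650,
    ("Vv2", "At2"): 700,
    ("Vv3", "At2"): 400,
    ("Vv4", "At3"): 250,  # below the score threshold, ignored
}
print(reciprocal_best_hits(scores))  # [('Vv1', 'At1'), ('Vv2', 'At2')]
```

Here Vv3 has At2 as its best match, but At2 is better matched by Vv2, so the Vv3/At2 pair is not called orthologous.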
For each orthologous gene set with V. vinifera, clusters of orthologous genes
were generated. A single linkage clustering with a euclidean distance was used to
group genes. The distances were calculated with the gene index in each chromosome rather than the genomic position. The minimal distance between two
orthologous genes was adapted in accordance with the selected genomes. Finally,
we retained only clusters that were composed of at least six genes for Arabidopsis
and O. sativa, and eight genes for P. trichocarpa (Supplementary Table 10).
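A minimal Python sketch of this clustering step is given below, assuming that each orthologous pair on a given pair of chromosomes is represented as a point (gene index in V. vinifera, gene index in the other genome). The distance threshold used in the example is hypothetical, because the text states only that it was adapted to the genomes compared; the union-find implementation is one of several equivalent ways to obtain single-linkage groups.

```python
import math


def single_linkage_clusters(points, max_dist, min_genes):
    """Group points by single linkage and keep the large clusters.

    points: list of (index_in_vitis, index_in_other_genome) tuples.
    max_dist: Euclidean distance, in gene-index units, below or at which two
        points are joined (hypothetical parameter).
    min_genes: minimum number of orthologous pairs for a cluster to be kept.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster representative, compressing paths.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= max_dist:
                parent[find(i)] = find(j)

    groups = {}
    for i, point in enumerate(points):
        groups.setdefault(find(i), []).append(point)
    return sorted(sorted(g) for g in groups.values() if len(g) >= min_genes)


# Hypothetical gene indices: one run of six nearby pairs and one isolated pair.
pairs = [(1, 10), (2, 11), (3, 13), (5, 14), (6, 15), (7, 17), (40, 3), (41, 5)]
print(single_linkage_clusters(pairs, max_dist=3, min_genes=6))
# [[(1, 10), (2, 11), (3, 13), (5, 14), (6, 15), (7, 17)]]
```

Working with gene indices rather than base-pair positions, as the text describes, makes the distance insensitive to differences in gene density between genomes.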
To validate the clustering quality we used a method described previously21. For
each cluster we computed the probability of finding this cluster in the gene
homology matrix (Supplementary Table 11). This matrix was constructed from
two compared chromosomes with genes numbered according to their position
on each chromosome, with no reference to physical distances.
Paralogous genes were computed by comparing all-against-all of V. vinifera
proteins by using blastp, and alignments with an expected value of less than 0.1
were retained and realigned with the Smith–Waterman algorithm40. Two genes A
and B were considered paralogues if B was the best match for gene A and A was
the best match for B. Moreover, clusters of paralogous genes were constructed in
the same fashion as orthologous clusters (Supplementary Table 10).
29. Adam-Blondon, A. F. et al. Construction and characterization of BAC libraries
from major grapevine cultivars. Theor. Appl. Genet. 110, 1363–1371 (2005).
30. Soderlund, C., Humphray, S., Dunham, A. & French, L. Contigs built with
fingerprints, markers, and FPC V4.7. Genome Res. 10, 1772–1787 (2000).
31. Lamoureux, D. et al. Anchoring of a large set of markers onto a BAC library for the
development of a draft physical map of the grapevine genome. Theor. Appl. Genet.
113, 344–356 (2006).
32. Kent, W. J. BLAT—the BLAST-like alignment tool. Genome Res. 12, 656–664
(2002).
33. Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in
large genomes. Bioinformatics 21 (Suppl. 1), i351–i358 (2005).
34. Roest Crollius, H. et al. Estimate of human gene number provided by genome-wide
analysis using Tetraodon nigroviridis DNA sequence. Nature Genet. 25, 235–238
(2000).
35. Jaillon, O. et al. Genome-wide analyses based on comparative genomics. Cold
Spring Harb. Symp. Quant. Biol. 68, 275–282 (2003).
36. Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res. 14,
988–995 (2004).
37. Bairoch, A. et al. The Universal Protein Resource (UniProt). Nucleic Acids Res. 33,
D154–D159 (2005).
38. Parra, G., Blanco, E. & Guigo, R. GeneID in Drosophila. Genome Res. 10, 511–515
(2000).
39. Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
40. Smith, T. F. & Waterman, M. S. Identification of common molecular
subsequences. J. Mol. Biol. 147, 195–197 (1981).
Vol 452 | 24 April 2008 | doi:10.1038/nature06856
LETTERS
The draft genome of the transgenic tropical fruit tree
papaya (Carica papaya Linnaeus)
Ray Ming1,2*, Shaobin Hou3*, Yun Feng4,5*, Qingyi Yu1*, Alexandre Dionne-Laporte3, Jimmy H. Saw3, Pavel Senin3,
Wei Wang4,6, Benjamin V. Ly3, Kanako L. T. Lewis3, Steven L. Salzberg7, Lu Feng4,5,6, Meghan R. Jones1,
Rachel L. Skelton1, Jan E. Murray1,2, Cuixia Chen2, Wubin Qian4, Junguo Shen5, Peng Du5, Moriah Eustice1,8,
Eric Tong1, Haibao Tang9, Eric Lyons10, Robert E. Paull11, Todd P. Michael12, Kerr Wall13, Danny W. Rice14,
Henrik Albert15, Ming-Li Wang1, Yun J. Zhu1, Michael Schatz7, Niranjan Nagarajan7, Ricelle A. Acob1,8,
Peizhu Guan1,8, Andrea Blas1,8, Ching Man Wai1,11, Christine M. Ackerman1, Yan Ren4, Chao Liu4, Jianmei Wang4,
Jianping Wang2, Jong-Kuk Na2, Eugene V. Shakirov16, Brian Haas17, Jyothi Thimmapuram18, David Nelson19,
Xiyin Wang9, John E. Bowers9, Andrea R. Gschwend2, Arthur L. Delcher7, Ratnesh Singh1,8, Jon Y. Suzuki15,
Savarni Tripathi15, Kabi Neupane20, Hairong Wei21, Beth Irikura11, Maya Paidi1,8, Ning Jiang22, Wenli Zhang23,
Gernot Presting8, Aaron Windsor24, Rafael Navajas-Pérez9, Manuel J. Torres9, F. Alex Feltus9, Brad Porter8,
Yingjun Li2, A. Max Burroughs7, Ming-Cheng Luo25, Lei Liu18, David A. Christopher8, Stephen M. Mount7,26,
Paul H. Moore15, Tak Sugimura27, Jiming Jiang23, Mary A. Schuler28, Vikki Friedman29, Thomas Mitchell-Olds24,
Dorothy E. Shippen16, Claude W. dePamphilis13, Jeffrey D. Palmer14, Michael Freeling10, Andrew H. Paterson9,
Dennis Gonsalves15, Lei Wang4,5,6 & Maqsudul Alam3,30
Papaya, a fruit crop cultivated in tropical and subtropical regions,
is known for its nutritional benefits and medicinal applications.
Here we report a 3× draft genome sequence of ‘SunUp’ papaya,
the first commercial virus-resistant transgenic fruit tree1 to be
sequenced. The papaya genome is three times the size of the
Arabidopsis genome, but contains fewer genes, including significantly fewer disease-resistance gene analogues. Comparison of the
five sequenced genomes suggests a minimal angiosperm gene set
of 13,311. A lack of recent genome duplication, atypical of other
angiosperm genomes sequenced so far2–5, may account for the
smaller papaya gene number in most functional groups. Nonetheless, striking amplifications in gene number within particular
functional groups suggest roles in the evolution of tree-like habit,
deposition and remobilization of starch reserves, attraction of
seed dispersal agents, and adaptation to tropical daylengths.
Transgenesis at three locations is closely associated with chloroplast insertions into the nuclear genome, and with topoisomerase I
recognition sites. Papaya offers numerous advantages as a system
for fruit-tree functional genomics, and this draft genome sequence
provides the foundation for revealing the basis of Carica’s
distinguishing morpho-physiological, medicinal and nutritional
properties.
Papaya is an exceptionally promising system for the exploration of
tropical-tree genomes and fruit-tree genomics. It has a relatively
small genome of 372 megabases (Mb)6, diploid inheritance with nine
pairs of chromosomes, a well-established transformation system7,
a short generation time (9–15 months), continuous flowering
throughout the year and a primitive sex-chromosome system8. It is
a member of the Brassicales, sharing a common ancestor with
Arabidopsis about 72 million years ago9. Papaya is ranked first on
nutritional scores among 38 common fruits, based on the percentage
of the United States Recommended Daily Allowance for vitamin A,
vitamin C, potassium, folate, niacin, thiamine, riboflavin, iron and
calcium, plus fibre. Consumption of its fruit is recommended for
preventing vitamin A deficiency, a cause of childhood blindness in
tropical and subtropical developing countries. The fruit, stems, leaves
and roots of papaya are used in a wide range of medical applications,
including production of papain, a valuable proteolytic enzyme.
1Hawaii Agriculture Research Center, Aiea, Hawaii 96701, USA. 2Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA. 3Advanced
Studies in Genomics, Proteomics and Bioinformatics, University of Hawaii, Honolulu, Hawaii 96822, USA. 4TEDA School of Biological Sciences and Biotechnology, Nankai University,
Tianjin Economic-Technological Development Area, Tianjin 300457, China. 5Tianjin Research Center for Functional Genomics and Biochip, Tianjin Economic-Technological
Development Area, Tianjin 300457, China. 6Key Laboratory of Molecular Microbiology and Technology of the Ministry of Education, College of Life Sciences, Nankai University, Tianjin
300071, China. 7Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland 20742, USA. 8Department of Molecular Bioscience and
Bioengineering, University of Hawaii, Honolulu, Hawaii 96822, USA. 9Plant Genome Mapping Laboratory, University of Georgia, Athens, Georgia 30602, USA. 10Department of Plant
and Microbial Biology, University of California, Berkeley, California 94720, USA. 11Department of Tropical Plant and Soil Sciences, University of Hawaii, Honolulu, Hawaii 96822, USA.
12Waksman Institute of Microbiology and Department of Plant Biology and Pathology, Rutgers, The State University of New Jersey, Piscataway, New Jersey 08854, USA. 13Department
of Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA. 14Department of Biology, Indiana University, Bloomington, Indiana 47405, USA. 15USDA-ARS,
Pacific Basin Agricultural Research Center, Hilo, Hawaii 96720, USA. 16Department of Biochemistry and Biophysics, 2128 TAMU, Texas A&M University, College Station, Texas
77843, USA. 17The Institute for Genomic Research, Rockville, Maryland 20850, USA. 18W.M. Keck Center for Comparative and Functional Genomics, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA. 19Department of Molecular Sciences, University of Tennessee, Memphis, Tennessee 38163, USA. 20Leeward Community College, University
of Hawaii, Pearl City, Hawaii 96782, USA. 21Wicell Research Institute, Madison, Wisconsin 53707, USA. 22Department of Horticulture, Michigan State University, East Lansing,
Michigan 48824, USA. 23Department of Horticulture, University of Wisconsin, Madison, Wisconsin 53706, USA. 24Department of Biology, Duke University, Durham, North Carolina
27708, USA. 25Department of Plant Sciences, University of California, Davis, California 95616, USA. 26Department of Cell Biology and Molecular Genetics, University of Maryland,
College Park, Maryland 20742, USA. 27Maui High Performance Computing Center, Kihei, Hawaii 96753, USA. 28Departments of Cell and Developmental Biology, Biochemistry and
Plant Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA. 29Applied Biosystems, 850 Lincoln Centre Drive, Foster City, California 94404, USA.
30Department of Microbiology, University of Hawaii, Honolulu, Hawaii 96822, USA.
*These authors contributed equally to this work.
A total of 2.8 million whole-genome shotgun (WGS) sequencing
reads were generated from a female plant of transgenic cultivar
SunUp, which was developed through transformation of Sunset that
had undergone more than 25 generations of inbreeding10. The estimated residual heterozygosity of SunUp is 0.06% (Supplementary
Note 1). After excluding low-quality and organellar reads, 1.6 million
high-quality reads were assembled into contigs containing 271 Mb
and scaffolds spanning 370 Mb including embedded gaps (Supplementary Tables 1 and 2). Of 16,362 unigenes derived from expressed
sequence tags (ESTs), 15,064 (92.1%) matched this assembly. Paired-end reads from 34,065 bacterial artificial chromosome (BAC) clones
provided alignment to a fingerprinted contig (FPC)-based physical
map (Supplementary Note 2). Among 706 BAC end and WGS
sequence-derived simple sequence repeats on the genetic map, 652
(92.4%) could be used to anchor 167 Mb of contigs or 235 Mb of
scaffolds, to the 12 papaya linkage groups in the current genetic map
(Supplementary Fig. 1).
Papaya chromosomes at the pachytene stage of meiosis are
generally stained lightly by 4′,6-diamidino-2-phenylindole (DAPI),
revealing that the papaya genome is largely euchromatic. However,
highly condensed heterochromatin knobs were observed on most
chromosomes (Supplementary Fig. 2), concentrated in the centromeric and pericentromeric regions. The lengths of the pachytene
bivalents that are heavily stained only account for approximately
17% of the genome. However, these cytologically distinct and highly
condensed heterochromatic regions could represent 30–35% of the
genomic DNA11. A large portion of the heterochromatic DNA was
probably not covered by the WGS sequence. The 271 Mb of contig
sequence should represent about 75% of the papaya genome and
more than 90% of the euchromatic regions, which is similar to the
92.1% of ESTs and 92.4% of genetic markers covered by the
assembled genome and to the theoretical 95% coverage by 3× WGS
sequence12.
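The theoretical coverage figure can be checked with a minimal sketch, assuming the standard Lander–Waterman result that the expected fraction of a genome covered by at least one read at depth c is 1 − e^(−c). The function name below is illustrative and is not taken from the paper.

```python
import math


def expected_fraction_covered(coverage_depth: float) -> float:
    """Expected fraction of the genome covered by at least one read,
    under the idealized Lander-Waterman model (random, uniform reads)."""
    return 1.0 - math.exp(-coverage_depth)


depth = 3.0  # 3x whole-genome shotgun coverage, as quoted in the text
print(f"{expected_fraction_covered(depth):.1%}")  # prints 95.0%
```

The model ignores repeats and cloning bias, which is one reason the heterochromatic regions described above may fall short of this idealized figure.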
Gene annotation was carried out using the TIGR Eukaryotic
Annotation Pipeline. The assembled genome was masked based on
similarity to known repeat elements in RepBase and the TIGR Plant
Repeat Database, plus a de novo papaya repeat database (see
Methods). Ab initio gene predictions were combined with spliced
alignments of proteins and transcripts to produce a reference gene
set of 28,629 gene models (Supplementary Table 3). A total of 21,784
(76.1%) of the predicted papaya genes with average length of 1,057
base pairs (bp) have similarity to proteins in the non-redundant
database from the National Center for Biotechnology Information,
with 9,760 (44.8%) of these supported by papaya unigenes. Among
6,845 genes with average length 309 bp that had no hits to the non-redundant proteins, only 515 (7.5%) were supported by papaya unigenes, implying that the number of predicted papaya-specific genes
was inflated. If the 515 genes with unigene support represent 44.8%
of the total, then 1,150 predicted papaya-specific genes may be real,
and the number of predicted genes in the assembled papaya genome
would be 22,934. Considering the assembled genome covers 92.1% of
the unigenes and 92.4% of the mapped genetic markers, the number
of predicted genes in the papaya genome could be 7.9% higher, or
24,746, about 11–20% less than Arabidopsis (based on either the
27,873 protein-coding and RNA genes, or including the 3,241 novel
genes)2,13, 34% less than rice3, 46% less than poplar4 and 19% less
than grape5 (Table 1).
Table 1 | Statistics of sequenced plant genomes
                                   Carica   Arabidopsis  Populus      Oryza sativa  Vitis
                                   papaya   thaliana     trichocarpa  (japonica)    vinifera
Size (Mbp)                         372      125          485          389           487
Number of chromosomes              9        5            19           12            19
G + C content total (%)            35.3     35.0         33.3         43.0          36.2
Gene number                        24,746   31,114*      45,555       37,544        30,434
Average gene length (bp per gene)  2,373    2,232        2,300        2,821         3,399
Average intron length (bp)         479      165          379          412           213
Transposons (%)                    51.9     14           42           34.8          41.4
* The gene number of Arabidopsis is based on the 27,873 protein-coding and RNA genes from
The Arabidopsis Information Resource website (http://www.arabidopsis.org/portals/genAnnotation/genome_snapshot.jsp) and recently published 3,241 novel genes6.
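The gene-number estimate above can be retraced step by step. The sketch below only repeats the arithmetic stated in the text, using the figures quoted there and in Table 1; the variable names are illustrative.

```python
# Figures quoted in the text.
genes_with_hits = 21_784          # predicted genes with database similarity
unigene_support_rate = 0.448      # share of those genes supported by unigenes
no_hit_genes_supported = 515      # no-hit genes that still have unigene support
uncovered_fraction = 0.079        # assembly is taken to miss 7.9% of genes

# If the 515 supported genes are 44.8% of the real no-hit genes:
real_no_hit_genes = round(no_hit_genes_supported / unigene_support_rate)
assembled_estimate = genes_with_hits + real_no_hit_genes
genome_estimate = round(assembled_estimate * (1 + uncovered_fraction))

print(real_no_hit_genes, assembled_estimate, genome_estimate)  # 1150 22934 24746

# Relative difference from the other genomes in Table 1.
others = {
    "Arabidopsis (27,873)": 27_873,
    "Arabidopsis (31,114)": 31_114,
    "rice": 37_544,
    "poplar": 45_555,
    "grape": 30_434,
}
percent_fewer = {name: round(100 * (1 - genome_estimate / n)) for name, n in others.items()}
print(percent_fewer)
```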
Comparison of the papaya genome with that of Arabidopsis
sheds new light on angiosperm evolutionary history in several ways.
Considering only the 200 longest papaya scaffolds, we found 121 co-linear blocks. The papaya blocks range in size from 1.36 Mb containing 181 genes to 0.16 Mb containing 19 genes (a statistical, rather
than a biological, lower limit); the corresponding Arabidopsis regions
range from 0.69 Mb containing 163 genes to 60 kilobases (kb) containing 18 genes. Across the 121 papaya segments for which co-linearity can be detected, 26 show primary correspondence (that is,
excluding the effects of ancient triplication detailed below) to only
one Arabidopsis segment, 41 to two, 21 to three, 30 to four, and only 3
to more than four.
The fact that many papaya segments show co-linearity with two to
four Arabidopsis segments (Fig. 1, and Supplementary Figs 3 and 4) is
most parsimoniously explained if either one or two genome duplications have affected the Arabidopsis lineage since its divergence from
papaya. Although it was suspected that the most recent Arabidopsis
genome duplication, α (ref. 14), might affect only a subset of the
Brassicales15, previous phylogenetic dating of these events15 had suggested that the more ancient β-duplication occurred early in the
eudicot radiation, well before the Arabidopsis–Carica divergence.
This incongruity is under investigation.
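The reasoning behind "one or two genome duplications" can be sketched numerically. The counts below are those reported in the text. The bound of 2^n matching segments after n duplications is a simplifying assumption made for this illustration (each duplication doubles a segment, and later loss can only reduce the count); it is not a method described in the paper.

```python
# Papaya co-linear segments grouped by the number of Arabidopsis segments
# they correspond to (counts from the text; "5+" means "more than four").
matches = {"1": 26, "2": 41, "3": 21, "4": 30, "5+": 3}

total_segments = sum(matches.values())
two_to_four = matches["2"] + matches["3"] + matches["4"]


def max_copies(duplications: int) -> int:
    """Upper bound on matching segments after n whole-genome duplications
    (assumed model: every duplication doubles each segment)."""
    return 2 ** duplications


print(total_segments)                         # 121
print(f"{two_to_four / total_segments:.0%}")  # 76%
print(max_copies(1), max_copies(2))           # 2 4
```

Under this assumption, segments with three or four matches need at least two duplications, whereas segments with one or two matches are compatible with either scenario once duplicate loss is allowed for.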
In contrast, individual Arabidopsis genome segments correspond
to only one papaya segment, indicating that no genome duplication
has occurred in the papaya lineage since its divergence from
Arabidopsis about 72 million years ago5. The lack of relatively recent
papaya genome doubling is further supported by an L-shaped distribution of intra-EST correspondence for papaya (not shown). However, multiple genome/subgenome alignments (see Supplementary
Methods) reveal evidence in papaya of the ancient ‘γ’ genome
duplication shared with Arabidopsis and poplar that is postulated
to have occurred near the origin of angiosperms14. Indeed, both
papaya (with no subsequent duplication) and poplar (with a relatively low rate of duplicate gene loss) suggest that γ was not a duplication but a triplication (Fig. 1), with triplicated patterns evident for
about 25% of the 247 Mb comprising the 200 largest papaya scaffolds.
[Figure 1 image not reproduced. Regions shown, in order: Cp sc29 (0.4–0.1 Mb), Pt sc1 (8.6–8.3 Mb), Pt sc3 (11.3–11.6 Mb), Vv chr2 (4.4–3.9 Mb), At chr1 (4.3–4.3 Mb), At chr1 (23.3–23.4 Mb), At chr4 (7.2–7.1 Mb), At chr4 (12.0–12.1 Mb); Cp sc18 (1.4–1.6 Mb), Pt sc2 (11.3–11.6 Mb), Pt sc14 (1.7–1.9 Mb), Vv chr15 (6.3–5.7 Mb), At chr2 (18.8–18.8 Mb), At chr3 (22.6–22.6 Mb); Cp sc4 (3.2–3.9 Mb), Pt sc12 (13.5–13.3 Mb), Pt sc15 (9.7–9.9 Mb), Vv chr16r (2.6–3.0 Mb), At chr5 (21.1–21.1 Mb). Duplication segment labels: α20, β6, α3, γ7, α11.]
Figure 1 | Alignment of co-linear regions from Arabidopsis (green), papaya
(magenta), poplar (blue) and grape (red). ‘Vv chr16r’ is an unordered
ultracontig that has been assigned to grape chromosome 16. Triangles
represent individual genes with transcriptional orientations. Several
Arabidopsis regions belong to previously identified duplication segments
(α3, α11, α20, β6, γ7, shown to the right)23. The whole syntenic alignment
supports four distinct whole-genome duplication events: α and β within the
Arabidopsis lineage, an independent duplication in poplar, and γ, which is
shared by all four eudicot genomes. Co-linear regions can be grouped into
three γ sub-genomes based on Camin–Sokal parsimony criteria.
[Figure 2 image not reproduced: bar chart of the number of genes (y axis, 0–300) in Arabidopsis and papaya for each transcription-factor family (x axis).]
This is most probably an underestimate that will increase as papaya
contiguity is improved. Triplication in papaya and poplar corresponds closely to the triplication suggested by an independent analysis of the grape genome5.
A few hundred papaya chromosomal segments were aligned using
BLASTZ to their one to four syntenic regions in Arabidopsis, and the
results examined visually using the Genome Evolution (GEvo)
viewer16. The orthologous region of grape was also included5, making
the alignment a six-way comparison. One example is given in
Supplementary Fig. 5: a 500 kb segment of papaya, its four 60 kb
syntenic, orthologous Arabidopsis segments and the 400 kb orthologous segment of grape.
For the homologous Arabidopsis segments that are discernibly co-linear (by MC-SCANNER) to the 200 longest papaya scaffolds,
34.8% of Arabidopsis genes in any one segment correspond to a
papaya gene, whereas only 24.8% of papaya genes in any one segment
correspond to an Arabidopsis gene. Moreover, the Arabidopsis homologous segments contain fewer genes, on average only about 57.9% of
the number in their papaya counterparts.
Papaya provides a useful outgroup necessary to detect subfunctionalization. Supplementary Fig. 6 is a GEvo screenshot of a blastn
alignment illustrating subfunctionalization of conserved non-coding
sequences (CNSs)17 upstream of two syntenic, duplicate Arabidopsis
genes and their single papaya orthologous gene. The α-duplicated
genomes within Arabidopsis are perfect for CNS discovery18.
Comparative analysis of the papaya and Arabidopsis 5′ untranslated regions showed that only 14% of orthologous promoter pairs
exhibit significantly higher levels of sequence identity than random
comparisons (Supplementary Figs 7 and 8). Although some highly
conserved promoters show substantial conservation across much of
their length, sequence similarity for most orthologous papaya promoters is indistinguishable from background.
Global analysis of all inferred protein models from papaya,
Arabidopsis, poplar, grape and rice clusters the 208,901 non-redundant
protein sequences into 39,706 similarity groups, or ‘tribes’19, 11,851 of
which contain two or more genes (see Supplementary Methods). Tribes
with multiple genes in a species typically correspond to families or
subfamilies of genes; however, tribes may also contain just one gene
(‘singleton tribes’). In papaya, 25,312 gene models were classified into
12,958 tribes, 5,669 of which were specific to papaya (Supplementary
Table 4). Of the papaya-specific tribes, 5,314 were singleton tribes. EST
support was markedly lower for genes in papaya-specific tribes (below
14%) than in tribes that included genes from at least one other taxon
(72.4%).
To investigate the smaller number of genes in papaya, we compared tribe membership from each of the five sequenced angiosperm
species (Supplementary Table 5). Among the 6,726 tribes that contain genes from both Arabidopsis and papaya, 3,595 contain equal
numbers of genes from both species. However, tribes with more
Arabidopsis genes outnumber those with more papaya genes by more
than 2:1 (2,153:979). The trend towards fewer papaya genes is
widespread across tribes of all sizes and major functional categories
(Supplementary Table 6 and Supplementary Fig. 9).
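The tribe-by-tribe comparison described above amounts to a simple tally over the tribes shared by two species. The following is a minimal sketch on toy data; the tribe names and counts are hypothetical, and the function is an illustration rather than the PlantTribes procedure.

```python
from collections import Counter


def compare_tribes(counts_a: dict, counts_b: dict) -> Counter:
    """For tribes present in both species, tally which species has more genes."""
    tally = Counter()
    for tribe in counts_a.keys() & counts_b.keys():
        a, b = counts_a[tribe], counts_b[tribe]
        if a > b:
            tally["more_in_a"] += 1
        elif a < b:
            tally["more_in_b"] += 1
        else:
            tally["equal"] += 1
    return tally


# Hypothetical gene counts per tribe (toy data, not from the paper).
arabidopsis = {"tribe_1": 4, "tribe_2": 2, "tribe_3": 1, "tribe_4": 3}
papaya = {"tribe_1": 2, "tribe_2": 2, "tribe_3": 3, "tribe_5": 1}

tally = compare_tribes(arabidopsis, papaya)
print(dict(tally))

# Ratio reported in the text for the real data: 2,153 tribes with more
# Arabidopsis genes against 979 with more papaya genes.
print(f"{2153 / 979:.2f}")  # 2.20
```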
We then examined membership in the 815 tribes with members
identified as being likely transcription factors in the Arabidopsis transcription factor database (http://arabidopsis.med.ohio-state.edu/
AtTFDB/). This set includes 2,897 genes in Arabidopsis and 2,438
in papaya (a ratio of 1.19:1). The details of tribe membership are
illustrated for 25 exemplar families and superfamilies (Fig. 2), where
most transcription-factor tribes have fewer genes in papaya than
Arabidopsis. Some transcription-factor tribes had more genes in
papaya, specifically RWP-RK, MADS-box, Scarecrow, TCP and
Jumonji gene families. Interestingly, the difference in MADS protein
family size appears to be due to expanded numbers for half of the 36
MADS tribes. The other 18 MADS tribes had fewer papaya genes,
including 14 that were not found in papaya.
Figure 2 | Comparison of gene numbers in transcription-factor tribe or
related tribes from Arabidopsis and papaya. Most transcription factors are
represented by fewer genes in papaya than Arabidopsis. Transcription-factor
names are given, with values after the names corresponding to: number of
tribes with genes assigned to transcription factor group, number of tribes
with smaller counts in papaya than Arabidopsis, number of tribes with equal
counts in papaya and Arabidopsis, number of tribes with larger counts in
papaya, and number of tribes with zero members in papaya. Supporting data
are provided in Supplementary Table 8.
Assuming that a generalized angiosperm could potentially require
only the types and minimal numbers of genes that are shared among
divergent plant species, we examined each of the tribes shared among
the five angiosperms with sequenced genomes. The number of genes
required in a minimal flowering plant is based on the observed
minimum number of genes across each of the shared tribes
(Table 2). When the smallest observed number is taken for each
evolutionarily conserved tribe, a minimal angiosperm genome of
13,311 genes is estimated. Papaya has the smallest number of genes
for more tribes than any other sequenced taxon (4,515, or 76% of
5,925 shared tribes), reinforcing the notion that papaya has fewer
genes than any angiosperm sequenced so far.
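The minimal-genome estimate described above takes, for each tribe shared by all five species, the smallest gene count observed in any species, and sums these minima. The sketch below shows that calculation on toy data. The tribe names and counts are hypothetical, and counting ties for every tied species is an assumption of this illustration.

```python
from collections import Counter

# Hypothetical gene counts per tribe and species (toy data, not from the paper).
tribe_counts = {
    "tribe_A": {"papaya": 1, "arabidopsis": 2, "poplar": 3, "rice": 2, "grape": 1},
    "tribe_B": {"papaya": 2, "arabidopsis": 4, "poplar": 5, "rice": 3, "grape": 3},
    "tribe_C": {"papaya": 0, "arabidopsis": 1, "poplar": 1, "rice": 1, "grape": 1},
    "tribe_D": {"papaya": 3, "arabidopsis": 2, "poplar": 4, "rice": 2, "grape": 5},
}


def minimal_gene_number(counts: dict):
    """Sum the per-tribe minimum over tribes present in every species."""
    shared = {t: c for t, c in counts.items() if all(n > 0 for n in c.values())}
    minimal_total = sum(min(c.values()) for c in shared.values())
    # Record which species hold the minimum in each shared tribe (ties count for all).
    holders = Counter()
    for c in shared.values():
        smallest = min(c.values())
        holders.update(species for species, n in c.items() if n == smallest)
    return len(shared), minimal_total, holders


n_shared, minimal_total, holders = minimal_gene_number(tribe_counts)
print(n_shared, minimal_total)  # 3 5
print(holders["papaya"])        # 2
```

In this toy example tribe_C is excluded because papaya has no member, which mirrors the restriction to shared tribes in the text.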
Only 55 nucleotide-binding site (NBS)-containing R genes were
identified in papaya; about 28% of the 200 NBS genes in Arabidopsis20
and less than 10% of the 600 NBS genes in rice21. Resistance proteins
also have a carboxy-terminal leucine-rich repeat (LRR) domain.
These NBS-containing R-gene families can be subdivided into three
classes: NBS–LRR, toll interleukin receptor (TIR)–NBS–LRR, and
coiled-coil (CC)–NBS–LRR on the basis of their amino-terminal
region. Papaya NBS–LRR outnumbered both TIR–NBS–LRR and
CC–NBS–LRR genes, in contrast to both poplar (with more CC–
NBS–LRR genes4) and Arabidopsis (with more TIR–NBS–LRR).
More than 50% of the NBS-type R genes were clustered in about
eight scaffolds, indicating that resistance gene evolution may involve
duplication and divergence of linked gene families.
Table 2 | Deduced potential minimal angiosperm gene number based on species with smallest number of genes for each tribe
                           Shared tribes   Number of       Number of conserved tribes lost
                           with minimum    unique tribes   or missing from each species
Carica papaya              4,515           5,708           405
Arabidopsis thaliana       3,597           2,950           113
Populus trichocarpa        1,548           6,338           28
Oryza sativa (japonica)    3,657           13,003          429
Vitis vinifera             3,597           3,567           175
Shared tribes: 5,925. Minimal gene number: 13,331.
Homologues for genes involved in cellulose biosynthesis are present in papaya and Arabidopsis, with more cellulose synthase genes in
poplar, perhaps associated with wood formation. Papaya has at least
32 putative β-glucosyl transferase (GT1) genes compared with 121 in
Arabidopsis identified using sequence alignment. A total of 38 and 40
cellulose synthase-related genes (GT2) were identified in papaya
using the 48 poplar and 31 Arabidopsis genes as queries, respectively.
These genes include 11 cellulose synthase (CesA) genes, the same
number as in Arabidopsis but 7 fewer than in poplar. Putative cellulose orientation genes (COBRA) were more abundant in
Arabidopsis (12) than in papaya (8).
Papaya also has a similar complement of genes for cell-wall synthesis to that of Arabidopsis, though fewer of them. Papaya and Arabidopsis, respectively,
have 6 and 12 callose synthase genes (GT2); 15 and 15 xyloglucan
α-1,2-fucosyl transferases (GT37); 5 and 7 β-glucuronic acid transferases in families GT43 and GT47; and 27 and 42 in GT8 that includes
galacturonosyl transferases, associated with pectin synthesis.
The cell wall of plants is capable of both plastic and elastic extension, and controls the rate and direction of cell expansion22. Despite
fewer whole-genome duplications, papaya has a similar number of
putative expansin A genes (24) as Arabidopsis (26) and poplar (27),
and more expansin B genes (10) than Arabidopsis (6) and poplar (3).
In contrast to expansion-related genes, papaya has on average
about 25% fewer cell-wall degradation genes than Arabidopsis, in
some cases far fewer. For example, papaya and Arabidopsis, respectively, have 4 and 12 endoxylanase-like genes in glycoside hydrolase
family 10 (GH10); 29 and 67 pectin methyl esterases (carbohydrate
esterase family 8); 28 and 69 polygalacturonases (GH28); 15 and 49
xyloglucan endotransglycosylase/hydrolases (GH16); 18 and 25
β-1,4-endoglucanases (GH9); 42 and 91 β-1,3-glucanases (GH17);
and 15 and 27 pectin lyases (PL1).
A semi-woody giant herb that accumulates lignin in the cell wall at
an intermediate level between Arabidopsis and poplar, papaya generally has intermediate numbers of lignin synthetic genes, fewer than
poplar but more than Arabidopsis despite fewer opportunities for
duplication in papaya. Poplar, papaya and Arabidopsis have 37, 30
and 18 candidate genes for the lignin synthesis pathway, respectively4,23, with papaya having an intermediate number of genes for
the PAL, C4H, 4CL and HCT gene families, and only one COMT and
two C3H genes. In contrast, poplar has three C3H genes, which are
presumed to convert p-coumaroyl quinic acid to caffeoyl shikimic
acid, whereas there are two in papaya and one in Arabidopsis. Papaya,
Arabidopsis and poplar each have two genes in the family
CCoAOMT, which are presumed to convert caffeic acid to ferulic
acid4. Compared with these other plants, papaya has the fewest genes
in the CCR gene family (1 gene) and the most in the F5H (4 genes)
and CAD gene families (18 genes), which all mediate later steps of the
lignin biosynthesis pathway.
More starch-associated genes in papaya, a perennial, may be due to
a greater need for storage in leaves, stem and developing fruit than in
Arabidopsis, an ephemeral that stores oil in the seed. Papaya and
Arabidopsis, respectively, have 13 and 6 putative starch synthase
(GT5) genes; 8 and 3 starch branching genes; 6 and 3 isoamylases
(GH13); and 12 and 9 β-amylases (GH14). Early unloading of fruit
sugar in papaya is probably symplastic24, with five genes for sucrose
synthase/sucrose phosphate synthase (GT4); seven are reported for
Arabidopsis. Five acid invertase (GH32) sequences were found in
papaya whereas 11 have been reported in Arabidopsis. Papaya has
at least seven putative neutral invertase (GH32) genes; Arabidopsis
has six. Wall-associated kinases (WAK) are thought to be involved in
the regulation of vacuolar invertases, with 17 in Arabidopsis and only
10 in papaya. Arabidopsis and papaya have 14 and 7 hexose transporters, respectively. The greater number of genes for sugar accumulation in Arabidopsis may reflect recent genome duplications.
Papaya has undergone particularly striking amplification of genes
involved in volatile development. Papaya and Arabidopsis, respectively, have 18 and 8 genes for cinnamyl alcohol dehydrogenase; 2
and 1 genes for cinnamate-4-hydroxylase; 9 and 3 genes for phenylalanine ammonia lyase; and 24 and 3 limonene cyclase genes.
Papaya ripening is climacteric, with the rise in ethylene production
occurring at the same time as the respiratory increase25. Papaya and
Arabidopsis, respectively, have similar numbers of genes involved in
ethylene synthesis, with four each for S-adenosyl methionine
synthase (SAM synthase); 8 and 13 for aminocyclopropane carboxylic acid (ACC) synthase (ACS); 8 and 12 for ACC oxidase
(ACO); and 42 and 64 for ethylene-responsive binding factors
(AP2/ERF).
Because papaya grows in tropical climates where daily light/dark
cycles do not change much over the year, we can ask if more or fewer
light/circadian genes are required to synchronize with the environment. In fact, there are fewer light/clock genes in the papaya genome
(49% and 34% of poplar and Arabidopsis, respectively; Supplementary Table 7). However, among the core circadian clock genes, the
pseudo-response regulators (PRRs; Supplementary Fig. 10) have
expanded in poplar compared with Arabidopsis, and the papaya
PRR7 cluster has seemingly duplicated with the recent poplar
salicoid-specific genome duplication4 (Supplementary Fig. 11).
Against the backdrop of fewer overall genes, the parallel expansion
of the PRRs is consistent with circadian timing being important in
papaya.
The PAS–FBOX–KELCH genes control light signalling and
flowering time; however, the only papaya orthologue (ZTL) lacks
an obvious KELCH domain compared with Arabidopsis and poplar,
which have five and one KELCH domains, respectively (Supplementary Fig. 10). In fact, the papaya genome contains fewer KELCH
domains (37 compared with 130 and 74 in Arabidopsis and poplar,
respectively). In contrast, there are three constitutive photomorphogenic 1 (COP1) paralogues in the papaya genome compared with
only one in Arabidopsis (Supplementary Tables 7 and 8). A similar
expansion has been noted in moss (Physcomitrella patens), which has
nine COP1 paralogues that are hypothesized to aid in tolerance to
ultraviolet light (Supplementary Fig. 12)26. Both KELCH domains
and the WD-40 of the COP1 family form β-propellers and play a role
in light-mediated ubiquitination. There is not a general expansion of
WD-40 genes in papaya (173 compared with 227 in Arabidopsis).
Perhaps papaya has developed an alternative way of integrating light
or timing information specific to day-neutral plants, such as a strict
adherence to the diel light/dark cycle that is better served by the COP-mediated system.
Sex determination in papaya is controlled by a pair of primitive
sex chromosomes, with a small male-specific region of the Y chromosome (MSY)8. The physical map of the MSY is currently estimated
by chromosome walking to span about 8 Mb (ref. 27). Two scaffolds
in the current female-genome sequence align to the X chromosome
physical map based on BAC end sequences, spanning 4.5 Mb and
including 254 predicted protein-encoding genes, of which 75
(29.5%) have EST support (Supplementary Table 9 and Supplementary Fig. 13). If adjusted for the percentage of unigene validation
for other genes (48.0%), the estimated number of genes in the
X-specific region would be 156. The average gene density would be
one gene per 19.5 kb, lower than the estimated genome average of one
gene per 14.3 kb. By contrast, among seven completely sequenced
MSY BACs totalling 1.2 Mb, a total of four expressed genes were
found on two of the BACs14,28. The somewhat lower-than-average
gene density in the X-specific scaffolds is accompanied by more
repetitive DNA (58.3%) than the genome-wide average, perhaps
because this region is near the centromere28. Re-analysis of
the repetitive DNA content of the MSY BACs, to include the new
papaya-specific repeat families identified herein, increased the average repeat sequence to 85.6%, with 54.1% Gypsy and 1.9% Copia
retro-elements (Supplementary Table 10). This compares with an
earlier estimate of 17.9% using the Arabidopsis repeat database
alone28.
The SunUp genome has presented an opportunity to analyse transgene insertion sites critically. Southern blot analysis was key in the
initial identification of transgenic insertion fragments and was performed with probes spanning the entire 19,567-bp transformation
vector used for bombardment (Supplementary Fig. 14). Among the
identified inserts were the functional coat-protein transgene conferring resistance to papaya ringspot virus, which was found in an intact
9,789-bp fragment of the transformation plasmid, and a 1,533-bp
fragment composed of a truncated, non-functional tetA gene and
flanking vector backbone sequence. The structures of the coat-protein transgene and tetA region insertion sites were determined
from cloned sequences. Southern analysis also confirmed a 290-bp
non-functional fragment of the nptII gene originally identified by
WGS sequence analysis (Supplementary Fig. 15). Five of the six
flanking sequences of the three insertions are nuclear DNA copies
of papaya chloroplast DNA fragments. The integration of the transgenes into chloroplast DNA-like sequences may be related to the
observation that transgenes produced either by Agrobacterium-mediated or biolistic transformation are often inserted in AT-rich
DNA29, as is the chloroplast DNA of papaya and other land plants.
Four of the six insert junctions have sequences that match topoisomerase I recognition sites, which are associated with breakpoints in genomic DNA transgene insertion sites and transgene rearrangements29.
The presence of these inserts was confirmed by high-throughput
MUMmer30 analysis for each region of the transformation vector.
Evidence for the presence of other transgene inserts is not conclusive
(Supplementary Note 3).
Its lower overall gene number notwithstanding, striking variations
in gene number within particular functional groups, superimposed
on the average approximate 20% reduction in papaya gene number
relative to Arabidopsis, may be related to key features of papaya
morphological evolution. Despite a closer evolutionary relationship
to Arabidopsis, papaya shares with poplar an increased number of
genes associated with cell expansion, consistent with larger plant size;
and lignin biosynthesis, consistent with the convergent evolution of
tree-like habit. Amplification of starch-synthesis genes in papaya
relative to Arabidopsis is consistent with a greater need for storage
in leaves, stem and developing fruit of this perennial. Tremendous
amplification in papaya of genes related to volatile development
implies strong natural selection for enhanced attractants that may
be key to fruit (seed) dispersal by animals and which may also have
attracted the attention of aboriginal peoples. This also foreshadows
what we might expect to discover in the genomes of other fragrant-fruited trees, as well as plants with striking fragrance of leaves (herbs),
flowers or other organs.
Arguably, the sequencing of the genome of SunUp papaya makes it
the best-characterized commercial transgenic crop. Because papaya
ringspot virus is widespread in nearly all papaya-growing regions,
SunUp could serve as a transgenic germplasm source that could be
used to breed suitable cultivars resistant to the virus in various parts
of the world. The characterization of the precise transgenic modifications in SunUp papaya should also serve to lower regulatory barriers
currently in place in some countries.
METHODS SUMMARY
Gene annotation. Papaya unigenes from complementary DNA were aligned to
the unmasked genome assembly, which was then used in training ab initio gene
prediction software. Spliced alignments of proteins from the plant division of
GenBank, and transcripts from related angiosperms, were generated. Gene predictions were combined with spliced alignments of proteins and transcripts to
produce a reference gene set. Detailed descriptions are given in Methods.
Full Methods and any associated references are available in the online version of
the paper at www.nature.com/nature.
Received 6 September 2007; accepted 22 February 2008.
1. Gonsalves, D. Control of papaya ringspot virus in papaya: a case study. Annu. Rev. Phytopathol. 36, 415–437 (1998).
2. The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
3. International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature 436, 793–800 (2005).
4. Tuskan, G. A. et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313, 1596–1604 (2006).
5. Jaillon, C. O. et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463–467 (2007).
6. Arumuganathan, K. & Earle, E. D. Nuclear DNA content of some important plant species. Plant Mol. Biol. Rep. 9, 208–218 (1991).
7. Fitch, M. M. M., Manshardt, R. M., Gonsalves, D., Slightom, J. L. & Sanford, J. C. Virus resistant papaya plants derived from tissues bombarded with the coat protein gene of papaya ringspot virus. Bio/technology 10, 1466–1472 (1992).
8. Liu, Z. et al. A primitive Y chromosome in papaya marks incipient sex chromosome evolution. Nature 427, 348–352 (2004).
9. Wikström, N., Savolainen, V. & Chase, M. W. Evolution of the angiosperms: calibrating the family tree. Proc. R. Soc. Lond. B 268, 2211–2220 (2001).
10. Storey, W. B. Papaya. in Outlines of Perennial Crop Breeding in the Tropics (eds Ferwerda, F. P. and Wit, F.) 389–408 (H. Veenman & Zonen, Wageningen, 1969).
11. Li, L. et al. Genome-wide transcription analyses in rice using tiling microarrays. Nature Genet. 38, 124–129 (2006).
12. Lander, E. S. & Waterman, M. S. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2, 231–239 (1988).
13. Hanada, K., Zhang, X., Borevitz, J. O., Li, W.-H. & Shiu, S.-H. A large number of novel coding small open reading frames in the intergenic regions of the Arabidopsis thaliana genome are transcribed and/or under purifying selection. Genome Res. 17, 632–640 (2007).
14. Bowers, J. E., Chapman, B. A., Rong, J. & Paterson, A. H. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422, 433–438 (2003).
15. Schranz, M. E. & Mitchell-Olds, T. Independent ancient polyploidy events in the sister families Brassicaceae and Cleomaceae. Plant Cell 18, 1152–1165 (2006).
16. Lyons, E. & Freeling, M. How to usefully compare homologous plant genes and chromosomes as DNA sequence. Plant J. 53, 661–673 (2008).
17. Inada, D. C. et al. Conserved noncoding sequences in the grasses. Genome Res. 13, 2030–2041 (2003).
18. Thomas, B. C., Rapaka, L., Lyons, E., Pedersen, B. & Freeling, M. Arabidopsis intragenomic conserved noncoding sequence. Proc. Natl Acad. Sci. USA 104, 3348–3353 (2007).
19. Wall, P. K. et al. PlantTribes: a gene and gene family resource for comparative genomics in plants. Nucleic Acids Res. 36, D970–D976 (2008).
20. Meyers, B. C., Morgante, M. & Michelmore, R. W. TIR-X and TIR-NBS proteins: two new families related to disease resistance TIR-NBS-LRR proteins encoded in Arabidopsis and other plant genomes. Plant J. 32, 77–92 (2002).
21. Zhou, T. et al. Genome-wide identification of NBS genes in japonica rice reveals significant expansion of divergent non-TIR NBS-LRR genes. Mol. Genet. Genomics 271, 402–415 (2004).
22. Fry, S. C. Primary cell wall metabolism: tracking the careers of wall polymers in living plant cells. New Phytol. 161, 641–675 (2004).
23. Ehlting, J. et al. Global transcript profiling of primary stems from Arabidopsis thaliana identifies candidate genes for missing links in lignin biosynthesis and transcriptional regulators of fiber differentiation. Plant J. 42, 618–640 (2005).
24. Zhou, L. L. & Paull, R. E. Sucrose metabolism during papaya (Carica papaya) fruit growth and ripening. J. Am. Soc. Hortic. Sci. 126, 351–357 (2001).
25. Paull, R. E. & Chen, N. J. Postharvest variation in cell wall-degrading enzymes of papaya (Carica papaya L.) during fruit ripening. Plant Physiol. 72, 382–385 (1983).
26. Richardt, S., Lang, D., Reski, R., Frank, W. & Rensing, S. A. PlanTAPDB, a phylogeny-based resource of plant transcription-associated proteins. Plant Physiol. 143, 1452–1466 (2007).
27. Yu, Q. et al. Low X/Y divergence of four pairs of papaya sex-linked genes. Plant J. 53, 124–132 (2008).
28. Yu, Q. et al. Chromosomal location and gene paucity of the male specific region on papaya Y chromosome. Mol. Genet. Genomics 278, 177–185 (2007).
29. Sawasaki, T., Takahashi, M., Goshima, N. & Morikawa, H. Structures of transgene loci in transgenic Arabidopsis plants obtained by particle bombardment: junction regions can bind to nuclear matrices. Gene 218, 27–35 (1998).
30. Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004).
Supplementary Information is linked to the online version of the paper at
www.nature.com/nature.
Acknowledgements We thank X. Wan, J. Saito and A. Young at the University of
Hawaii for technical assistance; C. Detter at the DOE Joint Genome Institute;
F. MacKenzie, O. Veatch and T. Uhm at the Hawaii Agriculture Research Center;
L. Li, W. Teng, Y. Wu, Y. Yang, C. Zhou, N. Wang, P. Wang and D. Fei at the Tianjin
Biochip Corporation, Tianjin Economic-Technological Development Area, Tianjin;
and R. Herdes, L. Diebold, R. Kim, A. Hernandez, S. Ali and L. Bynum at the
University of Illinois at Urbana-Champaign. This papaya genome-sequencing
project was given support by the University of Hawaii and the US Department of
Defense grant number W81XWH0520013 to M.A., the Maui High Performance
Computing Center to M.A., the Hawaii Agriculture Research Center to R.M. and
Q.Y., and Nankai University, China, to L.W. Other support to the papaya genome
project included the United States Department of Agriculture T-STAR program; a
United States Department of Agriculture–Agricultural Research Service
cooperative agreement (CA 58-3020-8-134) with the Hawaii Agriculture
Research Center; the University of Illinois; the National Science Foundation Plant
Genome Research Program; and Tianjin Municipal Special Fund for Science and
Technology Innovation Grant 05FZZDSH00800. We thank P. Englert, former
chancellor of the University of Hawaii, for initial infrastructure support of the
research.
Author Information The papaya WGS sequence is deposited at DNA Data Bank of
Japan/European Molecular Biology Laboratory/GenBank under accession number
ABIM00000000. The version described in this paper is the first version,
ABIM01000000. The GenBank accession numbers of the papaya ESTs are
EX227656–EX303501. This paper is distributed under the terms of the Creative
Commons Attribution-Non-Commercial-Share Alike licence, and is freely available
to all readers at www.nature.com/nature. Reprints and permissions information is
available at www.nature.com/reprints. Correspondence and requests for
materials should be addressed to M.A. ([email protected]) or L.W.
([email protected]).
doi:10.1038/nature06856
METHODS
Genome assembly. The genome sequence was assembled by Arachne31. WGS
reads and BAC end reads were trimmed by LUCY and screened for organellar
sequences32. Two approaches were applied to screening and removing reads of
presumably organellar origin to alleviate the load in assembling highly repetitive
regions by WGS assembly software. The first approach was an iterative process,
in which reads were assembled, contigs matching organellar genomes were identified, their constituent reads were removed, and the process was repeated for two or three more
rounds. This approach produced the read sets for the released assemblies
Stripped3 and Stripped4. The second approach was to remove plasmid clones
and BAC clones of presumably organellar origin by identifying clones in which both
end reads matched organellar genomes over their entire length, with physical map
information used to supplement the identification of BAC clones. Two rounds
of iterative screening based on pairing information of assembled and unplaced
reads were added to the second approach to generate the read set for the released
Papaya1.0 assembly.
The sequence error rates were estimated by aligning assembled shotgun
sequences with two finished BACs (GenBank accession numbers EF661023
and EF661026). The error rate of the assembly at 3× coverage or deeper
(74.2% of assembled sequences) was less than 0.01% based on average quality
values of 20 or greater in trimmed sequence. The error rate at 2× coverage
(16.3%) was 0.37%. The error rate at 1× coverage (9.5%) was approximately
0.75%, because these sequences are at the ends of the contigs (and sequence
reads) where the sequence quality declined.
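As a rough illustration of how the per-coverage figures above combine, the following sketch computes a sequence-weighted error rate for the whole assembly. It is not part of the original analysis: the variable and function names are hypothetical, and the "less than 0.01%" and "approximately 0.75%" values are treated as exact, so the result is only indicative.

```python
# (fraction of assembled sequence, error rate in percent) per coverage class,
# taken from the text; 0.01 is an upper bound and 0.75 is approximate.
coverage_classes = {
    "3x_or_deeper": (0.742, 0.01),
    "2x": (0.163, 0.37),
    "1x": (0.095, 0.75),
}


def weighted_error_rate(classes):
    """Return the sequence-weighted mean error rate (percent)."""
    total_fraction = sum(frac for frac, _ in classes.values())
    weighted = sum(frac * rate for frac, rate in classes.values())
    return weighted / total_fraction


overall = weighted_error_rate(coverage_classes)
print(f"approximate overall error rate: {overall:.3f}%")
```

Under these assumptions the low-coverage classes dominate the estimate, although together they make up only about a quarter of the assembled sequence.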
Genome annotation. Gene annotation was conducted following the TIGR
Eukaryotic Annotation Pipeline. Repeat sequences were identified in the
assembled genome and masked by RepeatMasker, RepeatScout and
TransposonPSI, based on known repeat elements in RepBase databases and
TIGR Plant Repeat Databases, and the papaya novel repeat database constructed
in this study33,34. Program to Assemble Spliced Alignments (PASA)35 was used to
generate spliced alignments of papaya unigenes to the unmasked assembly, which
was then used in training ab initio gene prediction software Augustus,
GlimmerHMM and SNAP36–38. Ab initio gene prediction software Fgenesh,
Genscan and TWINSCAN were trained on Arabidopsis39–41. Spliced alignments
of proteins from the plant division of GenBank and transcripts from related
angiosperms (Arabidopsis thaliana, Glycine max, Gossypium hirsutum, Medicago
truncatula, Nicotiana tabacum, Oryza sativa, Zea mays) were generated by the
Analysis and Annotation Tool (AAT)42. Spliced alignments of proteins from the
Pfam database were generated using GeneWise43,44. Gene predictions generated by
Augustus, Fgenesh, Genscan, GlimmerHMM, SNAP and TWINSCAN were combined with spliced alignments of proteins and transcripts to produce a reference
gene set using the evidence-based combiner EVidenceModeler (EVM)45. Protein
domains were predicted using InterProScan against protein databases (PRINTS,
Pfam, ProDom, PROSITE, SMART)46–50.
Construction of papaya repeat database. We used a combination of homology-based and de novo methods to identify signatures of transposable elements in the
papaya genome. We used RepeatMasker (http://www.repeatmasker.org) in
combination with a custom-built library of plant repeat elements for our initial
classification of transposable elements. The customized library was generated by
combining plant repeats from Repbase and plant repeat databases from TIGR
(ftp://ftp.tigr.org/pub/data/TIGR_Plant_Repeats)33. Repeat elements identified
as ribosomal RNA sequences in the TIGR databases match a large fraction of the
papaya genome (about 3%). Ribosomal RNAs were identified separately, and
therefore were excluded from our repeat library, leaving a database of 76,924
repeat sequences that were used to search the papaya genome.
Homology-based methods are limited to finding elements that have not
diverged too greatly from known repeats. Because databases of known transposable elements are necessarily incomplete, we used additional de novo methods to
search for repeat elements in papaya contigs. For this, we applied two recently
developed repeat-finding tools, PILER and RepeatScout, to the complete set of
contigs from the papaya genome34,51. PILER found 428 repeat families,
whereas RepeatScout found 6,596 repeat sequences.
The repeat families obtained from PILER and RepeatScout were annotated
using a combination of manual curation (786 repeat families) and automated
analysis. For the automated annotation, the combined data set from PILER and
RepeatScout was made non-redundant (using CD-HIT at the 90% similarity
level), leaving behind 6,240 repeat families52. As a post-processing step, we
selected only those families that had at least ten good (E value < 1 × 10⁻²⁰)
BLAST matches to papaya contigs. The resulting data set contained 2,198 repeat
families in the papaya genome. BLAST searches against the non-redundant database and
PTREP (http://wheat.pw.usda.gov/ITMI/Repeats) were then used to identify
repeat families matching genes associated with transposons and retrotransposons. This procedure discovered an additional 103 repeat families that could be
annotated as being retrotransposons. The combined database of 889 annotated
papaya-specific transposable-element sequences was used in addition to the
database of known repeats to annotate the papaya genome. The remaining,
unannotated repeat families (1,455 sequences with no matches to known genes)
were then used to estimate the additional repeat content of the genome.
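The post-processing filter described above (keeping families with at least ten BLAST matches below the E-value threshold) can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes BLAST results in the standard 12-column tabular layout with the repeat family as the query, and the function name and parameters are hypothetical.

```python
from collections import Counter


def families_with_enough_hits(blast_lines, max_evalue=1e-20, min_hits=10):
    """Return query IDs with at least `min_hits` hits below `max_evalue`.

    Each line is expected in BLAST tabular layout (12 tab-separated columns:
    qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend,
    sstart, send, evalue, bitscore).
    """
    good_hits = Counter()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue < max_evalue:
            good_hits[query_id] += 1
    return {query for query, n in good_hits.items() if n >= min_hits}
```

Whether hits should be counted per alignment or per distinct contig is not stated in the text; this sketch counts every qualifying alignment line.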
31. Jaffe, D. B. et al. Whole-genome sequence assembly for mammalian genomes:
Arachne 2. Genome Res. 13, 91–96 (2003).
32. Chou, H. H. & Holmes, M. H. DNA sequence quality trimming and vector removal.
Bioinformatics 17, 1093–1104 (2001).
33. Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker (Release Open-3.1.3, 2006).
34. Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in
large genomes. Bioinformatics 21 (suppl.), i351–i358 (2005).
35. Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal
transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).
36. Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new
intron submodel. Bioinformatics 19 (suppl.), ii215–ii225 (2003).
37. Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two
open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879
(2004).
38. Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
39. Salamov, A. A. & Solovyev, V. V. Ab initio gene finding in Drosophila genomic DNA.
Genome Res. 10, 516–522 (2000).
40. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic
DNA. J. Mol. Biol. 268, 78–94 (1997).
41. Korf, I., Flicek, P., Duan, D. & Brent, M. R. Integrating genomic homology into gene
structure prediction. Bioinformatics 17 (suppl. 1), S140–S148 (2001).
42. Huang, X., Adams, M. D., Zhou, H. & Kerlavage, A. R. A tool for analyzing and
annotating genomic sequences. Genomics 46, 37–45 (1997).
43. Finn, R. D. et al. Pfam: clans, web tools and services. Nucleic Acids Res. 34
(Database issue), D247–D251 (2006).
44. Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res. 14,
988–995 (2004).
45. Haas, B. J. et al. Automated eukaryotic gene structure annotation using
EVidenceModeler and the Program to Assemble Spliced Alignments. Genome
Biol. 9, R7.1–R7.19 (2008).
46. Quevillon, E. et al. InterProScan: protein domains identifier. Nucleic Acids Res. 33,
W116–W120 (2005).
47. Attwood, T. K. et al. PRINTS and its automatic supplement, prePRINTs. Nucleic
Acids Res. 31, 400–402 (2003).
48. Bru, C. et al. The ProDom database of protein domain families: more emphasis on
3D. Nucleic Acids Res. 33 (Database issue), D212–D215 (2005).
49. Hulo, N. et al. The PROSITE database. Nucleic Acids Res. 34 (Database issue),
D227–D230 (2006).
50. Letunic, I. et al. SMART 5: domains in the context of genomes and networks.
Nucleic Acids Res. 34 (Database issue), D257–D260 (2006).
51. Edgar, R. C. & Myers, E. W. PILER: Identification and classification of genomic
repeats. Bioinformatics 21 (suppl.), i152–i158 (2005).
52. Li, W. & Godzik, A. CD-HIT: A fast program for clustering and comparing large
sets of protein or nucleotide sequences. Bioinformatics 22, i1658–i1659 (2006).
Vol 457 | 29 January 2009 | doi:10.1038/nature07723
ARTICLES
The Sorghum bicolor genome and the
diversification of grasses
Andrew H. Paterson1, John E. Bowers1, Rémy Bruggmann2, Inna Dubchak3, Jane Grimwood4, Heidrun Gundlach5,
Georg Haberer5, Uffe Hellsten3, Therese Mitros6, Alexander Poliakov3, Jeremy Schmutz4, Manuel Spannagl5,
Haibao Tang1, Xiyin Wang1,7, Thomas Wicker8, Arvind K. Bharti2, Jarrod Chapman3, F. Alex Feltus1,9, Udo Gowik10,
Igor V. Grigoriev3, Eric Lyons11, Christopher A. Maher12, Mihaela Martis5, Apurva Narechania12, Robert P. Otillar3,
Bryan W. Penning13, Asaf A. Salamov3, Yu Wang5, Lifang Zhang12, Nicholas C. Carpita14, Michael Freeling11,
Alan R. Gingle1, C. Thomas Hash15, Beat Keller8, Patricia Klein16, Stephen Kresovich17, Maureen C. McCann13,
Ray Ming18, Daniel G. Peterson1,19, Mehboob-ur-Rahman1,20, Doreen Ware12,21, Peter Westhoff10,
Klaus F. X. Mayer5, Joachim Messing2 & Daniel S. Rokhsar3,4
Sorghum, an African grass related to sugar cane and maize, is grown for food, feed, fibre and fuel. We present an initial
analysis of the ~730-megabase Sorghum bicolor (L.) Moench genome, placing ~98% of genes in their chromosomal context
using whole-genome shotgun sequence validated by genetic, physical and syntenic information. Genetic recombination is
largely confined to about one-third of the sorghum genome with gene order and density similar to those of rice.
Retrotransposon accumulation in recombinationally recalcitrant heterochromatin explains the ~75% larger genome size of
sorghum compared with rice. Although gene and repetitive DNA distributions have been preserved since
palaeopolyploidization ~70 million years ago, most duplicated gene sets lost one member before the sorghum–rice
divergence. Concerted evolution makes one duplicated chromosomal segment appear to be only a few million years old.
About 24% of genes are grass-specific and 7% are sorghum-specific. Recent gene and microRNA duplications may
contribute to sorghum’s drought tolerance.
The Saccharinae plants include some of the most efficient biomass
accumulators, providing food and fuel from starch (sorghum) and
sugar (sorghum and Saccharum, sugar cane), and have potential for
use as cellulosic biofuel crops (sorghum, sugar cane, Miscanthus). Of
singular importance to Saccharinae productivity is C4 photosynthesis, comprising biochemical and morphological specializations that increase net carbon assimilation at high temperatures1.
Despite their common photosynthetic strategy, the Saccharinae show
much morphological and genomic variation (Supplementary Fig. 1).
Its small genome (~730 Mb) makes sorghum an attractive model for
functional genomics of Saccharinae and other C4 grasses. Rice, the first
fully sequenced cereal genome, is more representative of C3 photosynthetic grasses. Drought tolerance makes sorghum especially important
in dry regions such as northeast Africa (its centre of diversity) and the
southern plains of the United States. Genetic variation in the partitioning
of carbon into sugar stores versus cell wall mass, and in perenniality and
associated features such as tillering and stalk reserve retention2, make
sorghum an attractive system for the study of traits important in perennial cellulosic biomass crops. Its high level of inbreeding makes it an attractive association genetics system3. Transgenic approaches to sorghum
improvement are constrained by high gene flow to weedy relatives4, making knowledge of its intrinsic genetic potential all the more important.
Reconstructing a repeat-rich genome from shotgun sequences
Preferred approaches to sequencing entire genomes are currently to
apply shotgun sequencing5 either to a minimum ‘tiling path’ of genomic clones, or to genomic DNA directly. The latter approach, whole-genome shotgun (WGS) sequencing, is widely used for mammalian
genomes, being fast, relatively economical and reducing cloning bias.
However, its applicability has been questioned for repetitive DNA-rich plant genomes6.
Despite a repeat content of ~61%, a high-quality genome sequence
was assembled from homozygous sorghum genotype BTx623 by using
WGS and incorporating the following: (1) ~8.5 genome equivalents of
paired-end reads7 from genomic libraries spanning a ~100-fold range
of insert sizes (Supplementary Table 1), resolving many repetitive
regions; and (2) high-quality read length averaging 723 bp, facilitating
assembly. Comparison with 27 finished bacterial artificial chromosomes (BACs) showed the WGS assembly to be >98.46% complete
and accurate to <1 error per 10 kb (Supplementary Note 2.5).
1Plant Genome Mapping Laboratory, University of Georgia, Athens, Georgia 30602, USA.
2Waksman Institute for Microbiology, Rutgers University, Piscataway, New Jersey 08854, USA.
3DOE Joint Genome Institute, Walnut Creek, California 94598, USA.
4Stanford Human Genome Center, Stanford University, Palo Alto, California 94304, USA.
5MIPS/IBIS, Helmholtz Zentrum München, Ingolstaedter Landstrasse 1, 85764 Neuherberg, Germany.
6Center for Integrative Genomics, University of California, Berkeley, California 94720, USA.
7College of Sciences, Hebei Polytechnic University, Tangshan, Hebei 063000, China.
8Institute of Plant Biology, University of Zurich, Zollikerstrasse 107, 8008 Zurich, Switzerland.
9Department of Genetics and Biochemistry, Clemson University, Clemson, South Carolina 29631, USA.
10Institut fur Entwicklungs und Molekularbiologie der Pflanzen, Heinrich-Heine-Universitat, Universitatsstrasse 1, D-40225 Dusseldorf, Germany.
11Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA.
12Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA.
13Department of Biological Sciences, 14Department of Botany and Plant Pathology, Purdue University, West Lafayette, Indiana 47907, USA.
15International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru 502 324, India.
16Department of Horticulture and Institute for Plant Genomics and Biotechnology, Texas A&M University, College Station, Texas 77843, USA.
17Institute for Genomic Diversity, Cornell University, Ithaca, New York 14853, USA.
18Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA.
19Mississippi Genome Exploration Laboratory, Mississippi State University, Starkville, Mississippi 39762, USA.
20National Institute for Biotechnology & Genetic Engineering (NIBGE), Faisalabad, Pakistan.
21USDA NAA Robert Holley Center for Agriculture and Health, Ithaca, New York 14853, USA.
©2009 Macmillan Publishers Limited. All rights reserved
Comparison with a high-density genetic map8, a ‘finger-print
contig’ (FPC)-based physical map9, and the rice sequence6 improved
the sorghum WGS assembly (Supplementary Notes 1 and 2). Among
the 201 largest scaffolds (spanning 678.9 Mb, 97.3% of the assembly),
28 showed discrepancies with two or more of these lines of evidence
(Supplementary Note 2.6), often near repetitive elements. After breaking the assembly at the points of discrepancy, the resulting 229 scaffolds have an N50 (number of scaffolds that collectively cover at least
50% of the assembly) of 35 and L50 (length of the shortest scaffold
among those that collectively cover 50% of the assembly) of 7.0 Mb. A
total of 38 (2%) of 1,869 FPC contigs9 were deemed erroneous, containing >5 BAC ends that fell into different sequence scaffolds.
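The two scaffold statistics defined above can be computed from a list of scaffold lengths as in the sketch below. It follows the definitions given in this paper (N50 as a count, L50 as a length); many assembly tools use the two names the other way round. The function name is hypothetical.

```python
def scaffold_n50_l50(lengths):
    """Return (count, length) for the 50% coverage point.

    count  = number of largest scaffolds needed to cover >= 50% of the
             assembly (called N50 in this paper)
    length = length of the shortest scaffold in that set (called L50 here)
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count, length
```

For example, scaffolds of lengths 10, 8, 6, 4 and 2 total 30, so the two largest (10 + 8 = 18) already cover half of the assembly.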
A total of 127 scaffolds containing 625.7 Mb (89.7%) of DNA and
1,476 FPC contigs could be assigned to chromosomal locations and
oriented. Fifteen out of twenty chromosome ends terminated in
telomeric repeats. The other 102 scaffolds were generally smaller
(53.2 Mb, 7.6%), with 85 (83%) containing far greater-than-average
abundance of the Cen38 (ref. 10) centromeric repeat, and with only
374 predicted genes. These 102 scaffolds merged only 193 FPC contigs, presumably due to the greater abundance of repeats that are
recalcitrant to clone-based physical mapping9 and may be omitted
in BAC-by-BAC approaches11.
Genome size evolution and its causes
The ~75% larger quantity of DNA in the genome of sorghum compared with rice is mostly heterochromatin. Alignment to genetic8 and
cytological maps12 suggests that sorghum and rice have similar quantities of euchromatin (252 and 309 Mb, respectively; Supplementary
Table 7), accounting for 97–98% of recombination (1,025.2 cM and
1,496.5 cM, respectively) and 75.4–94.2% of genes in the respective
cereals, with largely collinear gene order9. In contrast, sorghum
heterochromatin occupies at least 460 Mb (62%), far more than in rice
(63 Mb, 15%). The ~3× genome expansion in maize since its divergence from sorghum13 has been more dispersed: recombinogenic
DNA has grown 4.5× to ~1,382 Mb, much more than can be
explained by genome duplication14.
The net size expansion of the sorghum genome relative to rice
largely involved long terminal repeat (LTR) retrotransposons. The
sorghum genome contains 55% retrotransposons, intermediate
between the larger maize genome (79%) and smaller rice genome
(26%). Sorghum more closely resembles rice in having a higher ratio
of gypsy-like to copia-like elements (3.7 to 1 and 4.9 to 1) than maize
(1.6 to 1: Supplementary Table 10).
Although recent retroelement activity is widely distributed across
the sorghum genome, turnover is rapid (as in other cereals15) with
pericentromeric elements persisting longer. Young LTR retrotransposon insertions (<0.01 million years (Myr) ago) appear randomly
distributed along chromosomes, suggesting that they are preferentially eliminated from gene-rich regions9 but accumulate in gene-poor regions (Fig. 1; see also Supplementary Note 3.1). Insertion
times suggest a major wave of retrotransposition <1 Myr ago, after
a smaller wave 1–2 Myr ago (Supplementary Fig. 2).
CACTA-like elements, the predominant sorghum DNA transposons (4.7% of the genome), seem to relocate genes and gene fragments, as do rice ‘Pack-MULEs’16 and maize helitrons17. Many
sorghum CACTA elements are non-autonomous deletion derivatives
in which transposon genes have been replaced with non-transposon
DNA including exons from one or more cellular genes as exemplified
for family G118 (Fig. 2). Among 13,775 CACTA elements identified
(Supplementary Note 3.4), 200 encode no transposon proteins but
contain at least one cellular gene fragment.
In total, DNA transposons constitute 7.5% of the sorghum genome,
intermediate between maize (2.7%) and rice (13.7%; Supplementary
Table 10). Miniature inverted-repeat transposable elements, 1.7% of
the genome, are associated with genes (Fig. 1; see also Supplementary
Note 3) as in other cereals6. Helitrons, ~0.8% of the genome, nearly all
lack helicase in sorghum as in maize17, but carry fewer gene fragments
in sorghum than maize (Supplementary Note 3.5). Organellar DNA
insertion has contributed only 0.085% to the sorghum nuclear
genome, far less than the 0.53% of rice (Supplementary Note 2.7).
The gene complement of sorghum
Among 34,496 sorghum gene models, we found ~27,640 bona fide
protein-coding genes by combining homology-based and ab initio
gene prediction methods with expressed sequences from sorghum,
maize and sugar cane (Supplementary Note 4). Evidence for alternate
splicing is found in 1,491 loci.
Figure 1 | Genomic landscape of sorghum chromosomes 3 and 9. Area charts quantify retrotransposons (55%), genes (6% exons, 8% introns), DNA transposons (7%) and centromeric repeats (2%). Lines between chromosomes 3 and 9 connect collinear duplicated genes. Heat-map tracks detail the distribution of selected elements. Figures for all sorghum chromosomes are in Supplementary Note 3. Cen38, sorghum-specific centromeric repeat10; RTs, retrotransposons (class I); LTR-RTs, long terminal repeat retrotransposons; DNA-TEs, DNA transposons (class II).
Track labels: Cen38; retrotransposons; DNA transposons; genes (introns); genes (exons); young LTR-RTs; full-length LTR-RTs; LTR-RT/gypsy; LTR-RT/copia; DNA-TE/CACTA; CpG islands; DNA-TE/MITE; paralogues. Horizontal axis: position (Mb), 0–60.
Figure 2 | CACTA element deletion derivatives that carry gene fragments. CACTA family G118 has only one complete and presumably autonomous ‘mother’ element. Among 18 deletion derivatives, only the terminal 500–2,500 bp are conserved, with 8 carrying gene fragments internally. One relatively homogeneous subgroup (106, 111 and 112) presumably arose recently, whereas other derivatives are unique. The locations of the hits to known rice proteins are indicated as coloured boxes. The descriptions of the foreign gene fragments are indicated underneath the boxes. HP, hypothetical protein.
Elements shown: autonomous CACTA-G118 ‘mother’ element (9,043 bp; transposase, ORF2) and non-autonomous derivatives G118-101 (6,698 bp), G118-104 (3,609 bp), G118-110 (5,088 bp), G118-106 (6,537 bp; found in three copies), G118-114 (9,829 bp) and G118-116 (4,244 bp). Labelled gene fragments: chloroplast carbonic anhydrase, metal transporter Nramp6, plant-specific domain, conserved HP, importin β1 subunit, replication factor A, 40S ribosomal S15. Colour key: >30%, >40%, >50% and >60% identical with rice protein. Scale bar, 2 kb.
Another 5,197 gene models are usually shorter than the bona fide genes
(often <150 amino acids); have few exons (often one) and no expressed
sequence tag (EST) support (compared with 85% for bona fide genes);
are more diverged from rice genes; and are often found in large families
with ‘hypothetical’, ‘uncharacterized’ and/or retroelement-associated
annotations, despite repeat masking (Supplementary Note 4). A high
concentration in pericentromeric regions where bona fide genes are
scarce (Fig. 1) suggests that many of these low confidence gene models
are retroelement-derived. We also identified 727 processed pseudogenes
and 932 models containing domains known only from transposons.
The exon size distributions of orthologous sorghum and rice genes
agree closely, and intron position and phase show >98% concordance (Supplementary Note 5). Intron size has been conserved
between sorghum and rice, although it has increased in maize owing
to transpositions18.
Most paralogues in sorghum are proximally duplicated, including
5,303 genes in 1,947 families of ≥2 genes (Supplementary Note 4.3).
The longest tandem gene array is 15 cytochrome P450 genes. Other
sorghum-specific tandem gene expansions include haloacid dehalogenase-like hydrolases (PF00702), FNIP repeats (PF05725), and male
sterility proteins (PF03015).
We confirmed the genomic locations of 67 known sorghum
microRNAs (miRNAs) and identified 82 additional miRNAs
(Supplementary Note 4.4). Five clusters located within 500 bp of
each other represent putative polycistronic miRNAs, similar to those
in Arabidopsis and Oryza. Natural antisense miRNA precursors
(nat-miRNAs) of family miR444 (ref. 19) have been identified in
three copies.
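Grouping miRNA loci that lie within 500 bp of each other, as described above, could be done along the following lines. This is a sketch under stated assumptions: the text does not say how the distance was measured, so the gap is taken here as the distance from the end of one locus to the start of the next on the same chromosome, and the function name and tuple layout are hypothetical.

```python
from itertools import groupby


def cluster_loci(loci, max_gap=500):
    """Group loci on the same chromosome whose gaps are <= max_gap bp.

    `loci` is an iterable of (chromosome, start, end, name) tuples with
    start <= end. Returns only clusters with two or more members.
    """
    clusters = []
    ordered = sorted(loci, key=lambda locus: (locus[0], locus[1]))
    for _, group in groupby(ordered, key=lambda locus: locus[0]):
        current, current_end = [], None
        for _chrom, start, end, name in group:
            if current and start - current_end <= max_gap:
                # Close enough to the running cluster: extend it.
                current.append(name)
                current_end = max(current_end, end)
            else:
                # Too far away: close the running cluster, start a new one.
                if len(current) > 1:
                    clusters.append(current)
                current, current_end = [name], end
        if len(current) > 1:
            clusters.append(current)
    return clusters
```

Each returned cluster is a candidate polycistronic unit only; confirming co-transcription would need expression evidence.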
Comparative gene inventories of angiosperms
The number and sizes of sorghum gene families are similar to those of
Arabidopsis, rice and poplar (Fig. 3 and Supplementary Note 4.6). A
total of 9,503 (58%) sorghum gene families were shared among all
four species and 15,225 (93%) with at least one other species. Nearly
94% (25,875) of high-confidence sorghum genes have orthologues in
rice, Arabidopsis and/or poplar, and together these gene complements define 11,502 ancestral angiosperm gene families represented
in at least one contemporary grass and rosid genome. However, 3,983
(24%) gene families have members only in the grasses sorghum and
rice; 1,153 (7%) appear to be unique to sorghum.
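The percentages quoted above can be checked against the total of 16,378 sorghum gene families (clusters) reported in Figure 3. The short sketch below is only a consistency check on the published counts; the dictionary labels are paraphrases, not terms from the paper.

```python
sorghum_families = 16_378  # sorghum clusters reported in Figure 3

family_counts = {
    "shared by all four species": 9_503,
    "shared with at least one other species": 15_225,
    "grass-specific (sorghum and rice only)": 3_983,
    "sorghum-specific": 1_153,
}

# Express each count as a whole-number percentage of all sorghum families.
percentages = {
    label: round(100 * count / sorghum_families)
    for label, count in family_counts.items()
}

for label, pct in percentages.items():
    print(f"{label}: {pct}%")

# Families shared with another species plus sorghum-only families
# should account for every sorghum family.
assert 15_225 + 1_153 == sorghum_families
```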
Pfam domains that are over-represented, under-represented or
even absent in sorghum relative to rice, poplar and Arabidopsis,
may reflect biological peculiarities specific to the Sorghum lineage
(Supplementary Table 20). Domains over-represented in sorghum
are usually present in the other organisms, a notable exception being
the α-kafirin domain that accounts for most seed storage protein and
corresponds to maize zeins20 but which is absent from rice.
Nucleotide-binding-site–leucine-rich-repeat (NBS-LRR) containing
proteins associated with the plant immune system are only about half as
frequent in sorghum as in rice. A search with 12 NBS domains from
published rice, maize, wheat and Arabidopsis gene sequences revealed
211 NBS-LRR coding genes in sorghum, 410 in rice and 149 in
Arabidopsis21. Sorghum NBS-LRR genes mostly encode the CC type
of N-terminal domains. Only two sorghum genes (Sb02g005860 and
Sb02g036630) contain the TIR domain, and neither contains an NBS
domain. NBS-LRR genes are most abundant on sorghum chromosome
5 (62), and its rice homologue (chromosome 11, 106). Enrichment of
NBS-LRR genes in these corresponding genomic regions suggests conservation of R gene location, in contrast to a proposal that R gene
movement may be advantageous22.
Evolution of distinctive pathways and processes
The evolution of C4 photosynthesis in the Sorghum lineage involved
redirection of C3 progenitor genes as well as recruitment and functional
divergence of both ancient and recent gene duplicates. The sole sorghum
C4 pyruvate orthophosphate dikinase (ppdk) and the phosphoenolpyruvate carboxylase kinase (ppck) gene and its two isoforms (produced by
the whole genome duplication) have only single orthologues in rice.
Additional duplicates formed in maize after the sorghum–maize split
(Zmppck2 and Zmppck3). The C4 NADP-dependent malic enzyme (me)
gene has an adjacent isoform but each corresponds to a different maize
homologue, suggesting tandem duplication before the sorghum–maize
split. The C4 malate dehydrogenase (mdh) gene and its isoform are also
adjacent, but share 97% amino acid similarity and correspond to the
single known maize Mdh gene, suggesting tandem duplication in
sorghum after its split with maize. The rice Me and Mdh genes are single
copy, suggesting duplication and recruitment to the C4 pathway after
the Panicoideae–Oryzoideae divergence (Supplementary Note 9).
Figure 3 | Orthologous gene families between sorghum, Arabidopsis, rice and poplar. The numbers of gene families (clusters) and the total numbers of clustered genes are indicated for each species and species intersection. Species totals (clusters, genes): Arabidopsis 13,144, 22,813; sorghum 16,378, 28,375; poplar 15,288, 34,783; rice 15,148, 20,109.
The sorghum sequence reinforces inferences previously based only
on rice, about how different grass and dicotyledon gene inventories
relate to their respective types of cell walls23,24. In grasses, cellulose
microfibrils coated with mixed-linkage (1→3),(1→4)-β-D-glucans
are interlaced with glucuronoarabinoxylans and an extensive complex of phenylpropanoids25. The sorghum sequence largely corroborates differences between dicotyledons and rice in the distribution of
cell wall biogenesis genes (Supplementary Note 10). For example, the
CesA/Csl superfamily and callose synthases have either diverged to
form new subgroups or functionally non-essential subgroups were
selectively lost, such as CslB and CslG lost from the grasses, and CslF
and CslH lost from species with dicotyledon-like cell walls26. The
previously rice-unique CslF and CslH genes are present in sorghum.
Arabidopsis contains a single group F GT31 gene, whereas sorghum
and rice contain six and ten, respectively.
The characteristic adaptation of sorghum to drought may be partly
related to expansion of one miRNA and several gene families.
Rice miRNA 169g, upregulated during drought stress27, has five
sorghum homologues (sbi-MIR169c, sbi-MIR169d, sbi-MIR169.p2,
sbi-MIR169.p6 and sbi-MIR169.p7). The computationally predicted
target of the sbi-MIR169 subfamily comprises members of the plant
nuclear factor Y (NF-Y) B transcription factor family, linked to
improved performance under drought in Arabidopsis and maize28.
Cytochrome P450 domain-containing genes, often involved in scavenging toxins such as those accumulated in response to stress, are abundant in sorghum with 326 versus 228 in rice. Expansins, enzymes that
break hydrogen bonds and are responsible for a variety of growth responses that could be linked to the durability of sorghum, occur in 82
copies in sorghum versus 58 in rice and 40 each in Arabidopsis and
poplar.
Duplication and diversification of cereal genomes
Whole-genome duplication in a common ancestor of cereals is
reflected in sorghum and rice gene ‘quartets’ (Fig. 4). A total of
19,929 (57.8%) sorghum gene models were in blocks collinear with
rice (Supplementary Note 6). After the shared whole-genome duplication, only one copy was retained for 13,667 (68.6%) collinear genes
with 13,526 (99%) being orthologous in rice–sorghum, indicating
that most gene losses predate taxon divergence. Both sorghum and
rice retained both copies of 4,912 (14.2%) genes, whereas sorghum
lost one copy of 1,070 (3.1%) and rice lost one copy of 634 (1.8%).
These patterns are likely to be predictive of other grass genomes, as the
major grass lineages diverged from a common ancestor at about the
same time29 (see also Supplementary Note 7).
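The percentages in this paragraph use different denominators, which is easy to miss. The sketch below reproduces them from the counts in the text; the choice of denominator for each figure is inferred by matching the arithmetic and is not stated explicitly by the authors. Variable names are hypothetical.

```python
gene_models = 34_496         # all sorghum gene models
collinear = 19_929           # models in blocks collinear with rice
single_copy = 13_667         # collinear genes with one retained copy
orthologous_single = 13_526  # of those, orthologous between rice and sorghum
both_retained = 4_912        # both species kept both copies
lost_in_sorghum = 1_070
lost_in_rice = 634


def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


summary = {
    "collinear / all models": pct(collinear, gene_models),
    "single copy / collinear": pct(single_copy, collinear),
    "orthologous / single copy": pct(orthologous_single, single_copy),
    "both copies kept / all models": pct(both_retained, gene_models),
    "lost in sorghum / all models": pct(lost_in_sorghum, gene_models),
    "lost in rice / all models": pct(lost_in_rice, gene_models),
}

for label, value in summary.items():
    print(f"{label}: {value}%")
```

Read this way, the 68.6% figure is a share of collinear genes, whereas 14.2%, 3.1% and 1.8% are shares of all 34,496 gene models.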
Although most post-duplication gene loss happened in a common
cereal ancestor, some lineage-specific patterns occur. A total of 2 and
10 protein functional (Pfam) domains showed enrichment for duplicates and singletons (respectively) in sorghum but not rice
(Supplementary Note 6.1). Because the sorghum–rice divergence is
thought to have happened 20 Myr or more after genome duplication29, this suggests that even long-term gene loss differentially affects
gene functional groups.
One genomic region has been subject to a high level of concerted
evolution. It was previously suggested that rice chromosomes 11 and
12 share a ~5–7-Myr-old segmental duplication30–32. We found a
duplicated segment in the corresponding regions of sorghum chromosomes 5 and 8 (Fig. 5). Sorghum–sorghum and rice–rice paralogues from this region show rates of synonymous DNA substitution
(Ks) of 0.44 and 0.22, respectively, consistent with only 34 and 17 Myr
of divergence. However, the Ks value of sorghum–rice orthologues is
0.63, similar to the respective genome-wide averages (0.81, 0.87). We
hypothesize that the apparent segmental duplication actually resulted
from the pan-cereal whole-genome duplication and became differentiated from the remainder of the chromosome(s) owing to concerted
evolution acting independently in sorghum, rice and perhaps other
cereals. Gene conversion and illegitimate recombination are more
frequent in the rice 11–12 region than elsewhere in the genome33.
Physical and genetic maps suggest shared terminal segments of the
corresponding chromosomes in wheat (4, 5)34, foxtail millet (VII,
VIII) and pearl millet (linkage groups 1, 4)35.
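The conversion from Ks to divergence time used in this paragraph can be illustrated with the standard molecular-clock relation T = Ks / (2r), where the factor of two accounts for substitutions accumulating along both lineages. The sketch below assumes a synonymous substitution rate of 6.5 × 10^-9 per site per year, a value often used for grasses; this section does not state the rate, so it is an assumption, as is the function name.

```python
# Assumed synonymous substitution rate (substitutions per site per year).
# This value is not given in the section and is an illustrative assumption.
ASSUMED_SYNONYMOUS_RATE = 6.5e-9


def ks_to_myr(ks: float, rate: float = ASSUMED_SYNONYMOUS_RATE) -> float:
    """Convert a Ks value to divergence time in Myr using T = Ks / (2 * rate)."""
    return ks / (2.0 * rate) / 1e6


for label, ks in [("sorghum-sorghum paralogues", 0.44),
                  ("rice-rice paralogues", 0.22)]:
    print(f"{label}: Ks = {ks}, about {ks_to_myr(ks):.0f} Myr")
```

With this assumed rate, Ks values of 0.44 and 0.22 give roughly 34 and 17 Myr, matching the figures quoted above.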
Synthesis and implications
Comparison of the sorghum, rice and other genomes clarifies the grass
gene set. Pairs of orthologous sorghum and rice genes combined with
recent paralogous duplications define 19,542 conserved grass gene
families, each representing one gene in the sorghum–rice common
ancestor. Our sorghum gene count is similar to that in a manually
curated rice annotation (RAP2)36, but this similarity masks some differences. About 2,054 syntenic orthologues shared by our sorghum annotation and the TIGR5 (ref. 37) rice annotation are absent from RAP2.
Conversely, ~12,000 TIGR5 annotations may be transposable elements
or pseudogenes, comprising large families of hypothetical genes in both
sorghum and rice RAP2, often with short exons, few introns and limited
EST support. Phylogenetically incongruent cases of apparent gene loss
(for example, genes shared by Arabidopsis and sorghum but not rice:
Fig. 3) may also suggest sequence gaps or misannotations.
Grass genome architecture may reflect euchromatin-specific effects
of recombination and selection, superimposed on non-adaptive
processes of mutation and genetic drift that apply to all genomic
regions38. Patterns of gene and repetitive DNA organization remain
correlated in homologous chromosomes duplicated 70 Myr ago
(Fig. 1), despite extensive turnover of specific repetitive elements.
Synteny is highest and retroelement abundance lowest in distal
chromosomal regions. More rapid retroelement removal from gene-rich euchromatin that frequently recombines than from heterochromatin that rarely recombines supports the hypothesis that recombination may preserve gene structure, order and/or spacing by exposing
new insertions to selection9. Less euchromatin–heterochromatin
polarization in maize, where retrotransposon persistence in euchromatin seems more frequent, may reflect variation in grass genome
architecture or perhaps a lingering consequence of more recent genome duplication39.
Identification of conserved DNA sequences may help us to understand essential genes and binding sites that define grasses. Progress in
sequencing Brachypodium distachyon40 sets the stage for panicoid–
oryzoid–pooid phylogenetic triangulation of genomic changes, as well
as association of some such changes with phenotypes ranging from
molecular (gene expression patterns) to morphological. The divergence
between sorghum, rice and Brachypodium is sufficient to randomize
©2009 Macmillan Publishers Limited. All rights reserved
ARTICLES
NATURE | Vol 457 | 29 January 2009
[Figure 4 graphic. Region labels: Zea c4 155.3–156.2 Mb; Zea c5 199.2–2-200.2 Mb; Sorghum c4 61.0–60.8 Mb; Oryza c2 29.6–29.7 Mb; Oryza c4 30.5–30.7 Mb; Sorghum c6 56.2–56.3 Mb; Zea c10 123.7–124.3 Mb; Zea c2 12.4–11.6 Mb. Dot-plot axes: Oryza C1–C12 and Sorghum C1–C10. Scale bars of 10 kb and 80 kb are given for the Sorghum/Oryza and Zea scales. Legend: Zea BAC; Sorghum gene; Oryza gene; cereal duplication; Sorghum–Oryza orthologue; Sorghum–Zea orthologue; gene loss.]
Figure 4 | Alignment of sorghum, rice and maize. Dot plots show
intergenomic (gold) and intragenomic (black) alignments. One
sorghum–rice quartet showing both orthologous and paralogous
(duplicated) regions is magnified. Infrequent gene loss (red; see legend) after
sorghum–rice divergence causes ‘special cases’ in which there are paralogues
but no orthologues. Each sorghum region corresponds to two duplicated
maize regions39, with maize gene loss suggested where sorghum loci only
match one of the two. Because maize BACs are mostly unfinished, sorghum
loci are aligned to the centres. Note the different scale necessary for maize
physical distance. Larger dot plots are in Supplementary Note 6.
nonfunctional sequence, facilitating conserved noncoding sequence
(CNS) discovery41,42 (Supplementary Fig. 9). More distant comparisons
to the dicotyledon Arabidopsis show exon conservation but no CNS
(Supplementary Fig. 10). Chloridoid and arundinoid genome
sequences are needed to sample the remaining grass lineages, and an
outgroup such as Ananas (pineapple) or Musa (banana) would further
aid in identifying genes and sequences that define grasses.
The fact that the sorghum genome has not re-duplicated in
~70 Myr29 makes it a valuable outgroup for deducing fates of gene pairs
and CNS in grasses that have reduplicated. Single sorghum regions
correspond to two regions resulting from maize-specific genome
doubling39—gene fractionation is evident (Fig. 4), and subfunctionalization is probable (Supplementary Fig. 10). Sorghum may prove
especially valuable for unravelling genome evolution in the more closely
related Saccharum–Miscanthus clade: two genome duplications since
its divergence from sorghum 8–9 Myr ago43 complicate sugar cane
genetics44 yet Saccharum BACs show substantially conserved gene order
with sorghum (Supplementary Note 11).
Conservation of grass gene structure and order facilitates development of DNA markers to support crop improvement. We identified
~71,000 simple-sequence repeats (SSRs) in sorghum (Supplementary
List 1); among a sampling of 212, only 9 (4.2%) map to paralogues of
their source locus. Conserved-intron scanning primers (Supplementary
List 2) for 6,760 genes provide DNA markers useful across many monocotyledons, particularly valuable for ‘orphan cereals’45.
As the first sequenced plant genome of African origin, sorghum adds
new dimensions to ethnobotanical research. Of particular interest will
be the identification of alleles selected during the earliest stages of
sorghum cultivation, which are valuable towards testing the hypothesis
that convergent mutations in corresponding genes contributed to
independent domestications of divergent cereals46. Invigorated sorghum improvement would benefit regions such as the African
‘Sahel’ where drought tolerance makes sorghum a staple for human
populations that are increasing by 2.8% per year. Sorghum yield
improvement has lagged behind that of other grains, in Africa only
gaining a total of 37% (western) to 38% (eastern) from 1961–63 to
2005–07 (Supplementary Note 12).
[Figure 5 graphic. Chromosomes shown: Oryza chr 11 and chr 12; Sorghum chr 5 and chr 8, with long (L) and short (S) arms marked. Ks colour scale with ticks at 0, 0.2, 0.4 and 1.0.]
Figure 5 | Independent illegitimate recombination in corresponding
regions of sorghum and rice. Four homologous rice and sorghum
chromosomes (11 and 12 in rice; 5 and 8 in sorghum) are shown, with gene
densities plotted. ‘L’ and ‘S’ show long and short arms, respectively. Lines
show Ks between homologous gene pairs, and colours are used to show
different dates of conversion events.
METHODS SUMMARY
Genome sequencing. Approximately 8.5-fold redundant paired-end shotgun
sequencing was performed using standard Sanger methodologies from small (~2–3 kb)
and medium (5–8 kb) insert plasmid libraries, one fosmid library (~35 kb
inserts), and two BAC libraries (insert size 90 and 108 kb) (Supplementary Note 1).
Integration of shotgun assembly with genetic and physical maps. The largest
201 scaffolds, all exceeding 39 kb, excluding ‘N’s, and collectively representing
678,902,941 bp (97.3%) of nucleotides, were checked for possible chimaeras
suggested by the sorghum genetic map, sorghum physical map, abrupt changes
in gene or repeat density, rice gene order, and coverage by BAC or fosmid clones
(Supplementary Note 2).
Repeat analysis. De novo searches for LTR retrotransposons used LTR_STRUC.
De novo detection of CACTA-DNA transposons and MITEs used custom programs (Supplementary Note 3). Known repeats were identified by RepeatMasker
(Open-3-1-8) (http://www.repeatmasker.org) with mips-REdat_6.2_Poaceae, a
compilation of grass repeats including sorghum-specific LTR retrotransposons
(http://mips.gsf.de/proj/plant/webapp/recat/). The insertion age of full-length
LTR-retrotransposons was determined from the evolutionary distance between
5′ and 3′ soloLTR derived from a ClustalW alignment of the two soloLTRs.
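The dating of LTR retrotransposon insertions from the divergence of their two LTRs can be sketched as follows. The two LTRs of an element are identical at insertion, so their pairwise distance K gives an age T = K / (2r). The sketch assumes the two LTRs are already aligned (for example, by ClustalW), uses the Kimura two-parameter distance, and assumes a substitution rate of 1.3 × 10^-8 per site per year; the distance model, the rate and all function names are illustrative assumptions, because this summary does not specify them.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(ltr5: str, ltr3: str) -> float:
    """Kimura two-parameter distance between two aligned sequences.

    Columns containing gaps or ambiguous bases are skipped.
    """
    if len(ltr5) != len(ltr3):
        raise ValueError("aligned sequences must have equal length")
    sites = transitions = transversions = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites in alignment")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_age_myr(distance: float, rate: float = 1.3e-8) -> float:
    """Insertion age in Myr from T = K / (2 * rate); the rate is an assumption."""
    return distance / (2.0 * rate) / 1e6
```

For example, under the assumed rate a pairwise LTR distance of 0.026 corresponds to an insertion about 1 Myr ago.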
Protein-coding gene annotation. Putative protein-coding loci were identified
based on BLAST alignments of rice and Arabidopsis peptides and sorghum and
maize ESTs. GenomeScan47 was applied using maize-specific parameters.
Predicted coding structures were merged with EST data from maize and sorghum using PASA48.
Intergenomic and intragenomic alignments. Dot plots used ColinearScan49
and multi-alignments used MCScan50, applied to RAP236 (mapped representative models, 29,389 loci) and the sbi1.4 annotation set (34,496 loci). Pairwise
BLASTP (E < 1 × 10^-5, top five hits), both within each genome and between the
two genomes, was used to retrieve potential anchors. Zea BAC sequences and
FPC contig coordinates were downloaded (http://www.maizesequence.org,
release 7 January 2008). Zea BACs were searched for potential orthologues of
Sorghum coding sequences using translated BLAT with a minimum score of 100.
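The anchor-retrieval step described above (BLASTP hits with E < 1 × 10^-5, top five hits per query) can be sketched as a simple filter. The sketch assumes the BLASTP results are available in the standard 12-column tabular format, with the E-value in column 11 and the bit score in column 12, and it excludes self-hits for within-genome searches. The input format, the self-hit rule and the function name are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def select_anchors(lines, max_evalue=1e-5, top_n=5):
    """Keep up to top_n best hits per query with E-value below max_evalue.

    `lines` is an iterable of tab-separated BLAST tabular records:
    query, subject, %identity, length, mismatches, gap opens,
    q.start, q.end, s.start, s.end, E-value, bit score.
    """
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if query == subject:
            continue  # assumed rule: ignore self-hits in within-genome searches
        if evalue < max_evalue:
            hits[query].append((evalue, -bitscore, subject))
    # Rank by ascending E-value, breaking ties by descending bit score.
    return {q: [s for _, _, s in sorted(h)[:top_n]] for q, h in hits.items()}
```

The resulting query-to-subject lists would then serve as candidate anchors for collinearity detection by tools such as ColinearScan or MCScan.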
Received 20 August; accepted 9 December 2008.
1. Hatch, M. D. & Slack, C. R. Photosynthesis by sugar-cane leaves: a new
carboxylation reaction and pathway of sugar formation. Biochem. J. 101, 103 (1966).
2. Paterson, A. H. et al. The weediness of wild plants: molecular analysis of genes
influencing dispersal and persistence of johnsongrass, Sorghum halepense (L.) Pers.
Proc. Natl Acad. Sci. USA 92, 6127–6131 (1995).
3. Hamblin, M. T. et al. Equilibrium processes cannot explain high levels of short- and
medium-range linkage disequilibrium in the domesticated grass Sorghum bicolour.
Genetics 171, 1247–1256 (2005).
4. Morrell, P. L. et al. Crop-to-weed introgression has impacted allelic composition of
johnsongrass populations with and without recent exposure to cultivated
sorghum. Mol. Ecol. 14, 2143–2154 (2005).
5. Gardner, R. C. et al. The complete nucleotide sequence of an infectious clone of
cauliflower mosaic virus by M13mp7 shotgun sequencing. Nucleic Acids Res. 9,
2871–2888 (1981).
6. Matsumoto, T. et al. The map-based sequence of the rice genome. Nature 436,
793–800 (2005).
7. Vieira, J. & Messing, J. The pUC plasmids, an M13mp7-derived system for
insertion mutagenesis and sequencing with synthetic universal primers. Gene 19,
259–268 (1982).
8. Bowers, J. E. et al. A high-density genetic recombination map of sequence-tagged
sites for Sorghum, as a framework for comparative structural and evolutionary
genomics of tropical grains and grasses. Genetics 165, 367–386 (2003).
9. Bowers, J. E. et al. Comparative physical mapping links conservation of
microsynteny to chromosome structure and recombination in grasses. Proc. Natl
Acad. Sci. USA 102, 13206–13211 (2005).
10. Miller, J. T. et al. Cloning and characterization of a centromere-specific repetitive
DNA element from Sorghum bicolour. Theor. Appl. Genet. 96, 832–839 (1998).
11. Venter, J. C. et al. Shotgun sequencing of the human genome. Science 280,
1540–1542 (1998).
12. Kim, J. S. et al. Chromosome identification and nomenclature of Sorghum bicolour.
Genetics 169, 1169–1173 (2005).
13. Swigonova, Z. et al. Close split of sorghum and maize genome progenitors.
Genome Res. 14, 1916–1923 (2004).
14. Swigonova, Z. et al. On the tetraploid origin of the maize genome. Comp. Funct.
Genomics 5, 281–284 (2004).
15. Swigonova, Z., Bennetzen, J. L. & Messing, J. Structure and evolution of the r/b
chromosomal regions in rice, maize and sorghum. Genetics 169, 891–906 (2005).
16. Jiang, N. et al. Pack-mule transposable elements mediate gene evolution in plants.
Nature 431, 569–573 (2004).
17. Brunner, S. et al. Evolution of DNA sequence nonhomologies among maize
inbreds. Plant Cell 17, 343–360 (2005).
18. Haberer, G. et al. Structure and architecture of the maize genome. Plant Physiol.
139, 1612–1624 (2005).
19. Lu, C. et al. Genome-wide analysis for discovery of rice microRNAs reveals natural
antisense microRNAs (nat-miRNAs). Proc. Natl Acad. Sci. USA 105, 4951–4956 (2008).
20. Xu, J.-H. & Messing, J. Organization of the prolamin gene family provides insight
into the evolution of the maize genome and gene duplications in grass species.
Proc. Natl Acad. Sci. USA 105, 14330–14335 (2008).
21. Meyers, B. C. et al. Genome-wide analysis of NBS-LRR-encoding genes in
Arabidopsis. Plant Cell 15, 809–834 (2003).
22. Leister, D. Tandem and segmental gene duplication and recombination in the
evolution of plant disease resistance genes. Trends Genet. 20, 116–122 (2004).
23. Carpita, N. C. & Gibeaut, D. M. Structural models of primary cell walls in flowering
plants: consistency of molecular structure with the physical properties of the
walls during growth. Plant J. 3, 1–30 (1993).
24. McCann, M. C. & Roberts, K. in The Cytoskeletal Basis of Plant Growth and Form (ed.
Lloyd, C. W.) 109–129 (Academic Press, 1991).
25. Carpita, N. C. Structure and biogenesis of the cell walls of grasses. Annu. Rev. Plant
Physiol. Plant Mol. Biol. 47, 445–476 (1996).
26. Hazen, S. P. et al. Quantitative trait loci and comparative genomics of cereal cell
wall composition. Plant Physiol. 132, 263–271 (2003).
27. Zhao, B. T. et al. Identification of drought-induced microRNAs in rice. Biochem.
Biophys. Res. Commun. 354, 585–590 (2007).
28. Nelson, D. E. et al. Plant nuclear factor Y (NF-Y) B subunits confer drought
tolerance and lead to improved corn yields on water-limited acres. Proc. Natl Acad.
Sci. USA 104, 16450–16455 (2007).
29. Paterson, A. H., Bowers, J. E. & Chapman, B. A. Ancient polyploidization predating
divergence of the cereals, and its consequences for comparative genomics. Proc.
Natl Acad. Sci. USA 101, 9903–9908 (2004).
30. Wang, X. et al. Duplication and DNA segmental loss in rice genome and their
implications for diploidization. New Phytol. 165, 937–946 (2005).
31. Yu, J. et al. The genomes of Oryza sativa: A history of duplications. PLoS Biol. 3,
266–281 (2005).
32. The Rice Chromosomes 11 and 12 Sequencing Consortia. The sequence of rice
chromosomes 11 and 12, rich in disease resistance genes and recent gene
duplications. BMC Biol. 3, 20 (2005).
33. Wang, X. et al. Extensive concerted evolution of rice paralogs and the road to
regaining independence. Genetics 177, 1753–1763 (2007).
34. Singh, N. K. et al. Single-copy genes define a conserved order between rice and
wheat for understanding differences caused by duplication, deletion, and
transposition of genes. Funct. Integr. Genomics 7, 17–35 (2007).
35. Devos, K. M., Pittaway, T. S., Reynolds, A. & Gale, M. D. Comparative mapping
reveals a complex relationship between the pearl millet genome and those of
foxtail millet and rice. Theor. Appl. Genet. 100, 190–198 (2000).
36. Tanaka, T. et al. The rice annotation project database (RAP-DB): 2008 update.
Nucleic Acids Res. 36, D1028–D1033 (2008).
37. Ouyang, S. et al. The TIGR rice genome annotation resource: Improvements and
new features. Nucleic Acids Res. 35, D883–D887 (2007).
38. Lynch, M. & Conery, J. S. The origins of genome complexity. Science 302,
1401–1404 (2003).
39. Wei, F. et al. Physical and genetic structure of the maize genome reflects its
complex evolutionary history. PLoS Genet. 3, e123 (2007).
40. Huo, N. et al. The nuclear genome of Brachypodium distachyon: Analysis of BAC
end sequences. Funct. Integr. Genomics 8, 135–147 (2007).
41. Margulies, E. H. et al. An initial strategy for the systematic identification of
functional elements in the human genome by low-redundancy comparative
sequencing. Proc. Natl Acad. Sci. USA 102, 4795–4800 (2005).
42. Eddy, S. R. A model of the statistical power of comparative genome sequence
analysis. PLoS Biol. 3, 95–102 (2005).
43. Jannoo, N. et al. Orthologous comparison in a gene-rich region among grasses
reveals stability in the sugarcane polyploid genome. Plant J. 50, 574–585 (2007).
44. Ming, R. et al. Sugarcane improvement through breeding and biotechnology. Plant
Breed. Rev. 27, 15–118 (2005).
45. Lohithaswa, H. C. et al. Leveraging the rice genome sequence for comparative
genomics in monocots. Theor. Appl. Genet. 115, 237–243 (2007).
46. Paterson, A. H. et al. Convergent domestication of cereal crops by independent
mutations at corresponding genetic loci. Science 269, 1714–1718 (1995).
47. Yeh, R.-F., Lim, L. P. & Burge, C. Computational inference of homologous gene
structures in the human genome. Genome Res. 11, 803–816 (2001).
48. Haas, B. J. et al. Full-length messenger RNA sequences greatly improve genome
annotation. Genome Biol. 3, research0029.0021–0029.0012 (2002).
49. Wang, X. Y. et al. Statistical inference of chromosomal homology based on gene
colinearity and applications to Arabidopsis and rice. BMC Bioinform. 7, 447 (2006).
50. Tang, H. et al. Synteny and colinearity in plant genomes. Science 320, 486–488 (2008).
Supplementary Information is linked to the online version of the paper at
www.nature.com/nature.
Acknowledgements We thank the US Department of Energy Joint Genome
Institute Community Sequencing Program, J. Bristow, S. Lucas and the JGI
production sequencing team for sequencing sorghum; and L. Lin for contributions
to Fig. 1. We appreciate funding from the US National Science Foundation (NSF
DBI-9872649, 0115903; MCB-0450260), International Consortium for Sugarcane
Biotechnology, National Sorghum Producers, and a John Simon Guggenheim
Foundation fellowship to A.H.P.; US Department of Energy (DE-FG05-95ER20194)
to J.M.; German Federal Ministry of Education GABI initiative to MIPS (0313117 and
0314000C); NSF DBI-0321467 to A.N.; and US Department of
Agriculture-Agricultural Research Service to C.A.M., L.Z. and D.W.
Author Information Reprints and permissions information is available at
www.nature.com/reprints. Correspondence and requests for materials should be
addressed to A.H.P. ([email protected]).
REPORTS
Patrick S. Schnable,1,2,3,4* Doreen Ware,5,6* Robert S. Fulton,7† Joshua C. Stein,6† Fusheng Wei,8†
Shiran Pasternak,6 Chengzhi Liang,6 Jianwei Zhang,8 Lucinda Fulton,7 Tina A. Graves,7
Patrick Minx,7 Amy Denise Reily,7 Laura Courtney,7 Scott S. Kruchowski,7 Chad Tomlinson,7
Cindy Strong,7 Kim Delehaunty,7 Catrina Fronick,7 Bill Courtney,7 Susan M. Rock,7 Eddie Belter,7
Feiyu Du,7 Kyung Kim,7 Rachel M. Abbott,7 Marc Cotton,7 Andy Levy,7 Pamela Marchetto,7
Kerri Ochoa,7 Stephanie M. Jackson,7 Barbara Gillam,7 Weizu Chen,7 Le Yan,7 Jamey Higginbotham,7
Marco Cardenas,7 Jason Waligorski,7 Elizabeth Applebaum,7 Lindsey Phelps,7 Jason Falcone,7
Krishna Kanchi,7 Thynn Thane,7 Adam Scimone,7 Nay Thane,7 Jessica Henke,7 Tom Wang,7
Jessica Ruppert,7 Neha Shah,7 Kelsi Rotter,7 Jennifer Hodges,7 Elizabeth Ingenthron,7
Matt Cordes,7 Sara Kohlberg,7 Jennifer Sgro,7 Brandon Delgado,7 Kelly Mead,7 Asif Chinwalla,7
Shawn Leonard,7 Kevin Crouse,7 Kristi Collura,8 Dave Kudrna,8 Jennifer Currie,8 Ruifeng He,8
Angelina Angelova,8 Shanmugam Rajasekar,8 Teri Mueller,8 Rene Lomeli,8 Gabriel Scara,8 Ara Ko,8
Krista Delaney,8 Marina Wissotski,8 Georgina Lopez,8 David Campos,8 Michele Braidotti,8
Elizabeth Ashley,8 Wolfgang Golser,8 HyeRan Kim,8 Seunghee Lee,8 Jinke Lin,8 Zeljko Dujmic,8
Woojin Kim,8 Jayson Talag,8 Andrea Zuccolo,8 Chuanzhu Fan,8 Aswathy Sebastian,8 Melissa Kramer,6
Lori Spiegel,6 Lidia Nascimento,6 Theresa Zutavern,6 Beth Miller,6 Claude Ambroise,6
Stephanie Muller,6 Will Spooner,6 Apurva Narechania,6 Liya Ren,6 Sharon Wei,6 Sunita Kumari,6
Ben Faga,6 Michael J. Levy,6 Linda McMahan,6 Peter Van Buren,6 Matthew W. Vaughn,6 Kai Ying,3
Cheng-Ting Yeh,1,2 Scott J. Emrich,9,10 Yi Jia,3 Ananth Kalyanaraman,9,11 An-Ping Hsia,1,2
W. Brad Barbazuk,12 Regina S. Baucom,13 Thomas P. Brutnell,14 Nicholas C. Carpita,15
Cristian Chaparro,16 Jer-Ming Chia,6 Jean-Marc Deragon,16 James C. Estill,13,17 Yan Fu,2,4
Jeffrey A. Jeddeloh,18 Yujun Han,13,17 Hyeran Lee,19 Pinghua Li,14 Damon R. Lisch,20
Sanzhen Liu,3 Zhijie Liu,6 Dawn Holligan Nagel,13,17 Maureen C. McCann,21 Phillip SanMiguel,22
Alan M. Myers,23 Dan Nettleton,24 John Nguyen,25 Bryan W. Penning,15,21 Lalit Ponnala,26
Kevin L. Schneider,27 David C. Schwartz,28 Anupma Sharma,27 Carol Soderlund,29
Nathan M. Springer,30 Qi Sun,26 Hao Wang,13,17 Michael Waterman,25 Richard Westerman,22
Thomas K. Wolfgruber,27 Lixing Yang,13 Yeisoo Yu,29 Lifang Zhang,6 Shiguo Zhou,28 Qihui Zhu,13,17
Jeffrey L. Bennetzen,13 R. Kelly Dawe,13,17 Jiming Jiang,19 Ning Jiang,31 Gernot G. Presting,27
Susan R. Wessler,13,17 Srinivas Aluru,1,9,32 Robert A. Martienssen,6 Sandra W. Clifton,7
W. Richard McCombie,6 Rod A. Wing,8 Richard K. Wilson7,33‡
We report an improved draft nucleotide sequence of the 2.3-gigabase genome of maize, an
important crop plant and model for biological research. Over 32,000 genes were predicted, of
which 99.8% were placed on reference chromosomes. Nearly 85% of the genome is composed of
hundreds of families of transposable elements, dispersed nonuniformly across the genome. These
were responsible for the capture and amplification of numerous gene fragments and affect the
composition, sizes, and positions of centromeres. We also report on the correlation of methylation-poor regions with Mu transposon insertions and recombination, and copy number variants with
insertions and/or deletions, as well as how uneven gene losses between duplicated regions were
involved in returning an ancient allotetraploid to a genetically diploid state. These analyses inform
and set the stage for further investigations to improve our understanding of the domestication and
agricultural improvements of maize.
Maize (Zea mays ssp. mays L.) was domesticated over the past ~10,000 years
from the grass teosinte in Central
America (1) and has been subject to cultivation
and selection ever since. Maize is an important
model organism for fundamental research into the
inheritance and functions of genes, the physical
linkage of genes to chromosomes, the mechanistic
relation between cytological crossovers and recombination,
the origin of the nucleolus, the properties of telomeres,
epigenetic silencing, imprinting,
and transposition (2). Maize also is an important
crop, yielding in the USA alone 12 billion (B =
10^9) bushels of grain from ~86 million acres with
a value of $47 B [2008 data from (3)]. Over the
last century, breeders have increased grain yields
eightfold (4), in part by harnessing heterosis
(hybrid vigor), a universal, but poorly understood,
phenomenon that can increase yields of hybrids
by 15 to 60% relative to inbred parents (5).
The maize genome has undergone several
rounds of genome duplication, including that
of a paleopolyploid ancestor ~70 million years
ago (mya) (6) and an additional whole-genome
duplication event about 5 to 12 mya (7, 8),
which distinguishes maize from its close relative, Sorghum bicolor (9). The 10 chromosomes
of the maize genome are structurally diverse and
have undergone dynamic changes in chromatin
composition. The size of the maize genome has
expanded dramatically (to 2.3 gigabases) over
the last ~3 million years via a proliferation of
20 NOVEMBER 2009
VOL 326
SCIENCE
long terminal repeat retrotransposons (LTR retrotransposons) (10).
We sequenced the maize genome using a
minimum tiling path of bacterial artificial chromosomes (BACs) (n = 16,848) and fosmid (n =
63) clones derived from an integrated physical
and genetic map (11, 12), augmented by comparisons with an optical map (13). Clones were
shotgun sequenced (four- to sixfold coverage),
followed by automated and manual sequence improvement (14) of the unique regions only, which
resulted in the B73 reference genome version 1
(B73 RefGen_v1).
We identified the full complement of maize
transposable elements (TEs) accessible from B73
RefGen_v1, which includes active class II DNA
TEs and an abundance of class I RNA TEs (15).
1 Center for Plant Genomics, Iowa State University, Ames, IA 50011, USA.
2 Department of Agronomy, Iowa State University, Ames, IA 50011, USA.
3 Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA.
4 Center for Carbon Capturing Crops, Iowa State University, Ames, IA 50011, USA.
5 U.S. Department of Agriculture (USDA), North Atlantic Area, Robert Holley Center for Agriculture and Health, Ithaca, NY 14853, USA.
6 Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA.
7 The Genome Center at Washington University, St. Louis, MO 63108, USA.
8 Arizona Genomics Institute, School of Plant Sciences and Department of Ecology and Evolutionary Biology, BIO5 Institute for Collaborative Research, University of Arizona, Tucson, AZ 85721, USA.
9 Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA.
10 Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA.
11 School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99164, USA.
12 Department of Botany, University of Florida, Gainesville, FL 32611, USA.
13 Department of Genetics, University of Georgia, Athens, GA 30602, USA.
14 Boyce Thompson Institute, Cornell University, Ithaca, NY 14853, USA.
15 Department of Botany and Plant Pathology, Purdue University, West Lafayette, IN 47907, USA.
16 Université de Perpignan Via Domitia, CNRS, Perpignan, France.
17 Department of Plant Biology, University of Georgia, Athens, GA 30602, USA.
18 NimbleGen, Madison, WI 53711, USA.
19 Department of Horticulture, University of Wisconsin–Madison, Madison, WI 53706, USA.
20 Department of Plant Biology, University of California, Berkeley, CA 94720, USA.
21 Department of Biological Sciences, Purdue University, West Lafayette, IN 47907, USA.
22 Department of Horticulture and Landscape Architecture, Purdue University, West Lafayette, IN 47907, USA.
23 Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA 50011, USA.
24 Department of Statistics, Iowa State University, Ames, IA 50011, USA.
25 Departments of Mathematics, Biology, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA.
26 Cornell University Computational Biology Service Unit, Cornell University, Ithaca, NY 14850, USA.
27 Molecular Biosciences and Bioengineering, University of Hawaii, Honolulu, HI 96822, USA.
28 Laboratory for Molecular and Computational Genomics, Department of Chemistry, Laboratory of Genetics, University of Wisconsin–Madison, Madison, WI 53706, USA.
29 BIO5 Institute for Collaborative Research, University of Arizona, Tucson, AZ 85721, USA.
30 Department of Plant Biology, University of Minnesota, St. Paul, MN 55108, USA.
31 Department of Horticulture, Michigan State University, East Lansing, MI 48824, USA.
32 Indian Institute of Technology, Bombay, India.
33 Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA.
*These authors contributed equally to this work.
†These authors contributed equally to data production and
analysis.
‡To whom correspondence should be addressed. E-mail:
[email protected]
www.sciencemag.org
Downloaded from www.sciencemag.org on February 28, 2010
The B73 Maize Genome: Complexity,
Diversity, and Dynamics
Almost 85% of the B73 RefGen_v1 consists of
TEs (table S2). Indeed, the existence of TEs
(16), as well as the first members of the CACTA
(Spm/En), hAT (Ac), PIF/Harbinger and Mutator
superfamilies, and MITE family (Tourist), were
all initially discovered in maize (17). Further, both
the existence and unparalleled abundance of LTR
retrotransposons in plants were originally discovered in maize (18).
The B73 RefGen_v1 contains 855 families
of DNA TEs that make up 8.6% of the genome;
most of these (82%) were identified in this study
(table S2) (14). The most complex of these superfamilies is Mutator, with dramatic variation in
element sequence and size, including 262 Pack-MULEs (Mutator-like elements that contain gene
fragments) carrying fragments of 226 nuclear
genes. About 40,000 nonredundant Mu insertion sites were amplified from Mu-active lines,
sequenced, and mapped to B73 RefGen_v1. The
nonuniformly distributed Mu insertion sites colocalize with gene-rich regions of the genome
that have the highest rates of meiotic recombination per megabase (Fig. 1) (19). Like Mu, most
maize DNA TEs (but not the CACTA elements)
were enriched in the gene-rich, recombinationally
active chromosome ends (Fig. 1 and fig. S1).
Helitrons, a class of DNA elements believed
to transpose by a rolling-circle mechanism (20),
are present in plants, animals, and fungi, but are
particularly active, variable, and abundant in maize
(21). Maize contains eight families of Helitrons
Fig. 1. The maize B73 reference genome (B73 RefGen_v1): Concentric circles show aspects of the genome.
Chromosome structure (A). Reference chromosomes with physical fingerprint contigs (11) as alternating gray
and white bands. Presumed centromeric positions are indicated by red bands (31); enlarged for emphasis.
Genetic map (B). Genetic linkage across the genome, on the basis of 6363 genetically and physically
mapped markers (14, 19). Mu insertions (C). Genome mappings of nonredundant Mu insertion sites
(14, 19). Methyl-filtration reads (D). Enrichment and depletion of methyl filtration. For each
nonoverlapping 1-Mb window, read counts were divided by the total number of mapped reads. Repeats
(E). Sequence coverage of TEs with RepeatMasker with all identified intact elements in maize. Genes (F).
Density of genes in the filtered gene set across the genome, from a gene count per 1-Mb sliding window
at 200-kb intervals. Sorghum synteny (G) and rice synteny (H). Syntenic blocks between maize and related
cereals on the basis of 27,550 gene orthologs. Underlined blocks indicate alignment in the reverse strand.
Homoeology map (I). Oriented homoeologous sites of duplicated gene blocks within maize.
with a combined copy number of ~20,000, which
are particularly active in gene fragment acquisition (table S2). In maize, we observed that
Helitrons are located predominantly within gene-rich regions, whereas, in all previously studied
plant and animal genomes, they are enriched in
gene-poor regions (22, 23). LTR retrotransposons
compose >75% of the B73 RefGen_v1 and are
diverse. Most of the 406 families have fewer
than 10 copies. LTR retrotransposons exhibited
family-specific, nonuniform distributions along
chromosomes, e.g., Copia-like elements are overrepresented in gene-rich euchromatic regions,
whereas Gypsy-like elements are overrepresented
in gene-poor heterochromatic regions (fig. S1)
(24, 25). We observed more than 180 acquisitions of nuclear gene fragments inside LTR retrotransposons (table S2).
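The gene-density track described in the Fig. 1 legend (gene count per 1-Mb sliding window at 200-kb intervals) can be sketched as below. The sketch assigns each gene to windows by its start coordinate and treats windows as half-open intervals; both choices, and the function name, are illustrative assumptions, because the legend does not specify how genes spanning a window boundary were counted.

```python
import bisect


def sliding_gene_density(gene_starts, chrom_length,
                         window=1_000_000, step=200_000):
    """Count genes per sliding window along one chromosome.

    Returns a list of (window_start, gene_count) pairs for half-open
    windows [window_start, window_start + window).
    """
    starts = sorted(gene_starts)
    last_start = max(chrom_length - window, 0)
    densities = []
    for win_start in range(0, last_start + 1, step):
        lo = bisect.bisect_left(starts, win_start)
        hi = bisect.bisect_left(starts, win_start + window)
        densities.append((win_start, hi - lo))
    return densities


# Hypothetical gene start positions on a 1.4-Mb chromosome.
print(sliding_gene_density([100_000, 150_000, 900_000, 1_100_000], 1_400_000))
```

Sorting once and using binary search keeps the cost low when the same gene list is scanned by many overlapping windows.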
Protein-encoding and microRNA (miRNA)
(26) genes were predicted from assembled or
improved BAC contigs by a combination of
evidence-based (27) and ab initio approaches,
projected to B73 RefGen_v1, and subsequently
filtered to a set of 32,540 protein-encoding and
150 miRNA genes (14) (fig. S2). Exon sizes of
maize genes were similar to those of their
orthologous genes in rice and sorghum, but maize
genes contained more large introns because of
insertion of repetitive elements (11, 28) (figs. S3
and S4 and tables S5 and S6). A comparative
analysis with rice, sorghum, and Arabidopsis revealed similar numbers of gene families (14)
(Fig. 2), of which a core set of 8494 families is
shared among all four species, and of the 11,892
maize families, all but 465 are conserved with at
least one other species. Species- and lineage-specific families point out potential inconsistencies between annotation projects, but also reflect
genuine biological differences in gene inventories.
Because of the stringent criteria used for
including genes in the filtered gene set (14),
we expected to miss some genes. About 95%
of a collection of 63,851 full-length maize
cDNAs (fl-cDNAs) (29, 30) mapped to B73
RefGen_v1. On the basis of the ratio of fl-cDNA to supported genes in the filtered set,
we estimated that this set accounts for at least
85% of all genes in the B73 RefGen_v1 (14).
Downloaded from www.sciencemag.org on February 28, 2010
REPORTS
Fig. 2. Venn diagram showing unique and shared
gene families between and among the three sequenced grasses (maize, rice, and sorghum) and
the dicot, Arabidopsis.
20 NOVEMBER 2009
The maximum rate of false-positive gene annotations was estimated by aligning ~112 million
RNA-seq (transcriptome sequencing) reads from
various tissues to the filtered gene set (14) (figs.
S10 and S11). These experiments provided evidence for the transcription of ~91% of the genes
in the filtered gene set (29,541 out of 32,540).
Manual annotation of 200 randomly chosen genes
from the filtered gene set indicated that only two
are likely to be TE-derived. Additional manual
annotation of smaller sets of selected genetically
well-characterized genes (tables S8 to S10) indicated that the vast majority of genes and proteins predicted in the filtered gene set are mostly
correct.
Maize centromeres were found to contain
variable amounts of the tandem CentC satellite
repeat and centromeric retrotransposon elements
of maize (CRMs). On the basis of comparisons
to B73 whole-genome shotgun data, we initially
identified about half of the genome’s CentC content (table S13). We captured additional CentC
sequence by draft sequencing 101 centromeric
repeat–containing BACs and anchoring them to
the genetic and physical maps, thereby localizing all of the centromeres (31). We delineated
the functional centromeres on the basis of their
centromere-specific histone H3 (CENH3) (32)
by using chromatin immunoprecipitation (ChIP)
with an antibody against CENH3, followed by
pyrosequencing. The centromere regions delineated in this way, although mostly incomplete,
correlated with a high density of CentC and
CRM1/CRM2/CRM3 repeats, but a number of
these repeats also occurred outside of the functional centromeres (fig. S12). The CRM2 subfamily appears to be the centromeric repeat most
closely associated with CENH3 in maize, as it is
more enriched in the CENH3 chromatin fraction
than CentC, CRM1, or CRM3 (table S13).
We traversed two centromeres (2 and 5) in
their entirety and determined that they differ in
size and CENH3 density (31). Because CRM elements have generated recombinants with distinct periods of activity (33, 34), we were able to
demonstrate that the regional centromeres of
maize are dynamic loci and that the CENH3 domain shifts over time (31).
To protect genome integrity, TEs are usually
transcriptionally silenced (35) in part via the
RNA-directed DNA methylation (RdDM) pathway, which requires an RNA-dependent RNA
polymerase 2 (RDR2). When the maize homolog
of RDR2 (36) is mutated, it alters the accumulation of transcripts from many characterized
transposons, but unexpectedly, some TEs are
down-regulated by loss of RDR2 function (37).
In most plant genomes, genes are less densely
methylated than heterochromatic TEs and other
repeats. Consequently, ~2× coverage of the maize
genome by methylation-filtered (MF) reads includes portions of ~95% of maize genes (38).
Mapping MF reads (39) of maize and sorghum
onto their respective genomes revealed species-specific distributions of heterochromatic DNA
methylation along the reference chromosomes
(fig. S13, A and B). It is noteworthy that, in the
sorghum genome, hypomethylated genes are
largely excluded from the pericentromeric regions, whereas they are dispersed more widely
in maize. Visual comparisons between sorghum
and maize (14) revealed high levels of coalignment, including centromeres where centromeric
repeats are undermethylated relative to the surrounding heterochromatin (39, 40) (fig. S13C).
Thus, the B73 RefGen_v1 yields evidence that
heavily methylated regions are more condensed
during interphase.
Anchoring the B73 RefGen_v1 to a newly
developed genetic map (19) revealed that rates of
meiotic recombination per megabase are highest
at the ends of the reference chromosomes and
very low in the middle half of each chromosome
surrounding the centromeres (Fig. 1) (19, 41).
Although recombination occurs preferentially in
genes (2) and gene density shows a similar
distribution (Fig. 1), gene density does not fully
explain the nonrandom distribution of recombination events, because a pronounced nonuniform
distribution is still observed even when gene
density is taken into consideration (19). Instead,
epigenetic marks, including hypomethylation and
histone modifications, are implicated in guiding
both Mu insertion and meiotic recombination
(19). Epigenetic processes have also been invoked to explain the observation that genomic
imprinting contributes to the expression of
thousands of genes in maize hybrids (42).
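The rate of recombination per megabase referred to above follows from comparing genetic and physical positions of anchored markers. The sketch below is illustrative only: the marker coordinates are hypothetical and the helper name `recombination_rates` is an assumption, not part of the original analysis. For each pair of adjacent markers it divides the genetic distance (cM) by the physical distance (Mb).

```python
def recombination_rates(markers):
    """Return (start_bp, end_bp, cM_per_Mb) for each pair of adjacent markers.

    `markers` is a list of (physical_bp, genetic_cM) tuples sorted by
    physical position along one chromosome.
    """
    rates = []
    for (bp0, cm0), (bp1, cm1) in zip(markers, markers[1:]):
        span_mb = (bp1 - bp0) / 1e6
        if span_mb <= 0:
            # Skip markers that share a physical position.
            continue
        rates.append((bp0, bp1, (cm1 - cm0) / span_mb))
    return rates


# Hypothetical markers: high rates near the ends, low in the middle interval.
markers = [(0, 0.0), (1_000_000, 2.0), (5_000_000, 2.4), (6_000_000, 4.4)]
rates = recombination_rates(markers)
```

With these toy values the two terminal intervals give about 2 cM/Mb and the central 4-Mb interval about 0.1 cM/Mb, the kind of contrast described for chromosome ends versus pericentromeric regions.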
Maize exhibits extremely high levels of both
phenotypic and genetic diversity. This genomic
diversity was explored with both resequencing
(41) and array-based comparative genomic hybridization between the B73 and Mo17 inbred
lines (43). This revealed extensive structural variation, including hundreds of copy number variants (CNVs) and thousands of present-absent
variants (PAVs). Many of the PAVs, including an
~2-Mb region on chromosome 6, contain intact, expressed single-copy genes that are present
in one inbred genome but absent from the other.
These haplotype-specific sequences may contribute to heterosis and the substantial degree of
phenotypic variation among maize inbreds (43).
After a whole-genome duplication, the return to a genetically diploid state was associated
with numerous chromosomal breakages and
fusions, as shown by alignment to the genomes
of sorghum and the more distantly related rice
(Fig. 1 and fig. S14) (12). In contrast, sorghum
has experienced relatively few interchromosomal rearrangements since its lineage split with
rice (8); therefore, its chromosomal configuration closely resembles the ancestral state of maize’s
two subgenomes (12). Cosynteny of maize genes
to common reference genes in rice or sorghum
defined maize’s duplicate regions (fig. S15). Although syntenic blocks cover 1832 Mb (~89%
of the genome), individual gene losses were
common and resulted in retention of only ~8110
genes as duplicate homoeologs (~25% of total
genes; ~30% having orthologs in rice and/or
sorghum). On the basis of an analysis of GO
(gene ontology) terms (14, 44) (table S15), retention of genes as duplicates is not random, e.g.,
retained duplicates are significantly enriched
for transcription factors (>1.5-fold; P value =
7.6 × 10^−22) (table S15), as is also the case in
rice (44) and Arabidopsis (45). An example of
biased retention is the CesA family, in which all
10 ancestral sites were retained as duplicates
(fig. S16) (46). Using the sorghum genome to
project extant maize regions to ancestral chromosomes (14) revealed a strong bias for gene loss
(fractionation) between sister regions (table S16
and fig. S17). Fractionation bias has been observed in other plant lineages and species (47–50).
Sites containing proximately duplicated paralogs tend to exist as single copies, or not at all,
at corresponding homoeologous positions (table
S18). Of the 1454 proximately duplicated paralogs identified (making up 3614 genes), only
126 (~9%) could be found at homoeologous
positions (14). Of the remainder, 279 (19%) had
a single paralog at the corresponding homoeologous site, and 1049 (72%) had no homoeologs.
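The percentages quoted for the three classes of proximately duplicated paralogs can be checked directly from the counts. The following is a minimal sketch; the class labels and the helper name `class_percentages` are illustrative assumptions, and only the counts come from the text.

```python
def class_percentages(counts):
    """Return the total and the percentage share of each class."""
    total = sum(counts.values())
    return total, {label: 100.0 * n / total for label, n in counts.items()}


# Counts taken from the text; labels are illustrative.
paralog_classes = {
    "duplicated at homoeologous site": 126,
    "single paralog at homoeologous site": 279,
    "no homoeolog": 1049,
}
total, pct = class_percentages(paralog_classes)
```

The three counts sum to the 1454 paralog sets reported, and the shares round to 9%, 19%, and 72%.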
Nearly identical paralogs (NIPs) are genes
with pairwise alignments of ≥500 bp, ≥98% identity, and ≥95% coverage with other genes (51). Of
maize-filtered genes, 2.5% (828 out of 32,540)
were NIPs from 386 families, most of which
have only two members (n = 349); the largest has
nine members. Almost half (46%) of the NIP
pairs had both members physically linked within
200 kb of each other, whereas in most of the
remaining cases, the two members were distant
from each other or on different chromosomes
(fig. S18).
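The three NIP thresholds can be expressed as a simple filter over pairwise alignments. This is a sketch under stated assumptions: the `Alignment` record and `is_nip` function are hypothetical names, identity and coverage are taken as fractions between 0 and 1, and how coverage is measured (for example, relative to the shorter gene) is not specified in the text.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    aligned_bp: int    # length of the pairwise alignment in base pairs
    identity: float    # fraction of identical positions, 0..1
    coverage: float    # fraction of the gene covered by the alignment, 0..1


def is_nip(aln, min_len=500, min_identity=0.98, min_coverage=0.95):
    """Apply the >=500 bp, >=98% identity, >=95% coverage criteria."""
    return (
        aln.aligned_bp >= min_len
        and aln.identity >= min_identity
        and aln.coverage >= min_coverage
    )


# Hypothetical alignments: only the first passes all three thresholds.
candidates = [
    Alignment(620, 0.991, 0.97),
    Alignment(480, 0.995, 0.99),   # too short
    Alignment(900, 0.975, 0.99),   # identity too low
]
nip_flags = [is_nip(a) for a in candidates]
```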
Just as cytogenetic and genetic maps (52)
revolutionized research and crop improvement
over the last century, the B73 maize reference
sequence promises to advance basic research
and to facilitate efforts to meet the world’s
growing needs for food, feed, energy, and
industrial feed stocks in an era of global climate
change. Findings derived from this genome
sequence briefly summarized here are described
in more detail in a series of companion papers
(11, 13, 19, 22, 24–26, 30, 31, 37, 41–43, 46).
Annotation data and browser are available at www.maizegenome.org.
References and Notes
1. J. F. Doebley, B. S. Gaut, B. D. Smith, Cell 127, 1309
(2006).
2. J. L. Bennetzen, S. Hake, Handbook of Maize: Genetics
and Genomics (Springer, New York, 2009).
3. C. P. National Corn Growers Association, Table showing
corn harvested, yield, production, mya price, and value,
1991–2008; http://ncga.com/corn-production-trends.
4. A. F. Troyer, Crop Sci. 46, 528 (2006).
5. D. N. Duvick, Science 286, 418 (1999).
6. A. H. Paterson, J. E. Bowers, B. A. Chapman, Proc. Natl.
Acad. Sci. U.S.A. 101, 9903 (2004).
7. G. Blanc, K. H. Wolfe, Plant Cell 16, 1667 (2004).
8. Z. Swigonova et al., Genome Res. 14, 1916 (2004).
9. A. H. Paterson et al., Nature 457, 551 (2009).
10. P. SanMiguel, B. S. Gaut, A. Tikhonov, Y. Nakajima,
J. L. Bennetzen, Nat. Genet. 20, 43 (1998).
11. F. Wei et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000715).
12. F. Wei et al., PLoS Genet. 3, e123 (2007).
13. S. Zhou et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000711).
14. Materials and methods are available as supporting
material on Science Online.
15. P. SanMiguel et al., Science 274, 765 (1996).
16. B. McClintock, Cold Spring Harbor Symp. Quant. Biol.
16, 13 (1951).
17. C. Feschotte, N. Jiang, S. R. Wessler, Nat. Rev. Genet. 3,
329 (2002).
18. A. Kumar, J. L. Bennetzen, Annu. Rev. Genet. 33, 479
(1999).
19. S. Liu et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000733).
20. V. V. Kapitonov, J. Jurka, Proc. Natl. Acad. Sci. U.S.A. 98,
8714 (2001).
21. S. Lal, N. Georgelis, L. Hannah, in Handbook of Maize:
Genetics and Genomics, J. L. Bennetzen, S. Hake, Eds.
(Springer, New York, 2008), pp. 329–339.
22. L. Yang, J. L. Bennetzen, Proc. Natl. Acad. Sci. U.S.A., published online 19 November 2009 (10.1073/pnas.0908008106).
23. L. Yang, J. L. Bennetzen, Proc. Natl. Acad. Sci. U.S.A. 106, 12832 (2009).
24. R. S. Baucom et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000732).
25. F. Wei et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000728).
26. L. Zhang, PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000716).
27. H. Liang, W. H. Li, Mol. Biol. Evol. 26, 1195 (2009).
28. G. Haberer et al., Plant Physiol. 139, 1612 (2005).
29. N. N. Alexandrov et al., Plant Mol. Biol. 69, 179
(2009).
30. C. Soderlund et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000740).
31. T. K. Wolfgruber et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000743).
32. C. X. Zhong et al., Plant Cell 14, 2825 (2002).
33. A. Sharma, G. G. Presting, Mol. Genet. Genomics 279,
133 (2008).
34. A. Sharma, K. L. Schneider, G. G. Presting, Proc. Natl.
Acad. Sci. U.S.A. 105, 15470 (2008).
35. D. Lisch, Annu. Rev. Plant Biol. 60, 43 (2009).
36. M. Alleman et al., Nature 442, 295 (2006).
37. Y. Jia et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000737).
38. Y. Fu et al., Proc. Natl. Acad. Sci. U.S.A. 102, 12282
(2005).
39. L. E. Palmer et al., Science 302, 2115 (2003).
40. W. Zhang, H. R. Lee, D. H. Koo, J. Jiang, Plant Cell 20, 25
(2008).
41. M. A. Gore et al., Science 326, 1115 (2009).
42. R. A. Swanson-Wagner et al., Science 326, 1118 (2009).
43. N. M. Springer et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000734).
44. C. G. Tian et al., Yi Chuan Xue Bao 32, 519 (2005).
45. C. Seoighe, C. Gehring, Trends Genet. 20, 461 (2004).
46. B. W. Penning et al., Plant Physiol., published online 19 November 2009 (10.1104/pp.109.136804).
47. H. Shaked, K. Kashkush, H. Ozkan, M. Feldman,
A. A. Levy, Plant Cell 13, 1749 (2001).
48. K. Song, P. Lu, K. Tang, T. C. Osborn, Proc. Natl. Acad.
Sci. U.S.A. 92, 7719 (1995).
49. J. A. Tate, P. Joshi, K. A. Soltis, P. S. Soltis, D. E. Soltis,
BMC Plant Biol. 9, 80 (2009).
50. B. C. Thomas, B. Pedersen, M. Freeling, Genome Res. 16,
934 (2006).
51. S. J. Emrich et al., Genetics 175, 429 (2007).
52. B. McClintock, Science 69, 629 (1929).
53. The Maize Genome Sequencing Project supported by
NSF award DBI-0527192 (R.K.W., S.W.C., R.S.F.,
R.A.W., P.S.S., S.A., L.S., D.W., W.R.M., R.A.M.). The
Maize Transposable Element Consortium and the
A First-Generation Haplotype
Map of Maize
Michael A. Gore,1,2,3*† Jer-Ming Chia,4* Robert J. Elshire,3 Qi Sun,5 Elhan S. Ersoz,3
Bonnie L. Hurwitz,4‡ Jason A. Peiffer,2 Michael D. McMullen,1,6 George S. Grills,7
Jeffrey Ross-Ibarra,8 Doreen H. Ware,1,4§ Edward S. Buckler1,2,3§
Maize is an important crop species of high genetic diversity. We identified and genotyped several
million sequence polymorphisms among 27 diverse maize inbred lines and discovered that the
genome was characterized by highly divergent haplotypes and showed 10- to 30-fold variation in
recombination rates. Most chromosomes have pericentromeric regions with highly suppressed
recombination that appear to have influenced the effectiveness of selection during maize inbred
development and may be a major component of heterosis. We found hundreds of selective sweeps
and highly differentiated regions that probably contain loci that are key to geographic adaptation.
This survey of genetic diversity provides a foundation for uniting breeding efforts across the world
and for dissecting complex traits through genome-wide association studies.
Maize (Zea mays L.) is both a model genetic system and an important crop species. Maize is already a critical source of food,
fuel, feed, and fiber, and the addition of genomic information allows it to be further improved
through plant breeding that exploits its tremendous
genetic diversity (1–3). Genome-wide association
studies (GWAS) of diverse maize germplasm offer the potential to rapidly resolve complex traits
to gene-level resolution, but these studies require
a high density of genome-wide markers. To provide
such markers, we targeted the 20% of the maize genome
that is low-copy (4, 5) on a diverse panel of 27
inbred lines (representative of maize breeding efforts and worldwide diversity), the founders of the
maize nested association mapping (NAM) population (6), and used sequencing-by-synthesis
(SBS) technology with three complementary restriction enzyme–anchored genomic libraries (figs.
S1 and S2A) (7).
More than 1 billion SBS reads (>32 gigabases
of sequence) were generated, covering ~38% of
the total maize genome, albeit at mostly low-coverage levels. We focused on the ~93 million
Maize Centromere Consortium supported by NSF awards
DBI-0607123 (S.R.W., J.L.B., R.K.D., N.J., P.S.M.) and
DBI-0421671 (R.K.D., J.J., G.G.P.). Also supported by
NSF grants DBI-0321467 (D.W.), DBI-0321711
(P.S.S.), DBI-0333074 (D.W.), DBI-0501818 (D.C.S.),
DBI-0501857 (Y.Y.), DBI-0701736 (T.P.B., Q.S.),
DBI-0703273 (R.A.M.), and DBI-0703908 (D.W.), and
by USDA National Research Initiative Grants
2005-35301-15715 and 2007-35301-18372 from the
USDA Cooperative State Research, Education, and
Extension Service (P.S.S.) and from the USDA-ARS
(408934 and 413089) to D.W., and from the
Office of Science (Biological and Environmental
Research), U.S. Department of Energy, grant
DE-FG02-08ER64702 to N.C.C. and M.C.M.
Sequences of the reference chromosomes have been
deposited in GenBank as accession numbers
CM000777 to CM000786. RNA-sequence reads
have been deposited in the Gene Expression Omnibus
(GEO) database (www.ncbi.nlm.nih.gov/geo) as
accession numbers GSE16136, GSE16868, and
GSE16916. Centromeric sequences have been
deposited in the National Center for Biotechnology
Information, NIH, Trace Archive as accessions
1757396377 to 1757412600 and 2185189231 to
2185200942.
Supporting Online Material
www.sciencemag.org/cgi/content/full/326/5956/1112/DC1
Materials and Methods
SOM Text
Figs. S1 to S18
Tables S1 to S18
References
1 July 2009; accepted 13 October 2009
10.1126/science.1178534
base pairs (Mbp) of low-copy sequence present
in 13 or more lines in this study. Roughly 39%
of the sequenced low-copy fraction was derived
from introns and exons (5), covering 32% of the
total genic fraction in the genome. We identified
3.3 million single-nucleotide polymorphisms
(SNPs) and indels (table S1) and found that, overall, 1 in every 44 bp was polymorphic (π = 0.0066
per base pair). In a subset used for the population
genetics analyses, the error rate was 1/2570 or
17-fold lower than π (roughly half the errors are
paralogy issues). The absolute level of diversity
we examined, though high, may be slightly reduced because of difficulties aligning highly divergent sequences and our low power to call
1United States Department of Agriculture–Agriculture Research Service (USDA-ARS).
2Department of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853, USA.
3Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA.
4Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA.
5Computational Biology Service Unit, Cornell University, Ithaca, NY 14853, USA.
6Division of Plant Sciences, University of Missouri, Columbia, MO 65211, USA.
7Institute for Biotechnology and Life Science Technologies, Cornell University, Ithaca, NY 14853, USA.
8Department of Plant Sciences, University of California, Davis, CA 95616–5294, USA.
*These authors contributed equally to this work.
†Present address: United States Arid-Land Agricultural Research Center, Maricopa, AZ 85138, USA.
‡Present address: Department of Ecology and Evolutionary
Biology, University of Arizona, Tucson, AZ 85721, USA.
§To whom correspondence should be addressed. E-mail:
[email protected] (D.H.W.); [email protected] (E.S.B.)
Articles
© 2009 Nature America, Inc. All rights reserved.
The genome of the cucumber, Cucumis sativus L.
Sanwen Huang1,19, Ruiqiang Li2,3,19, Zhonghua Zhang1,19, Li Li2,19, Xingfang Gu1,19, Wei Fan2,19,
William J Lucas4,19, Xiaowu Wang1, Bingyan Xie1, Peixiang Ni2, Yuanyuan Ren2, Hongmei Zhu2, Jun Li2, Kui Lin5,
Weiwei Jin6, Zhangjun Fei7, Guangcun Li8, Jack Staub9, Andrzej Kilian10, Edwin A G van der Vossen11, Yang Wu5,
Jie Guo5, Jun He1, Zhiqi Jia1, Yi Ren1, Geng Tian2, Yao Lu2, Jue Ruan2,12, Wubin Qian2, Mingwei Wang2,
Quanfei Huang2, Bo Li2, Zhaoling Xuan2, Jianjun Cao2, Asan2, Zhigang Wu2, Juanbin Zhang2, Qingle Cai2,
Yinqi Bai2, Bowen Zhao13, Yonghua Han6, Ying Li1, Xuefeng Li1, Shenhao Wang1, Qiuxiang Shi1, Shiqiang Liu1,
Won Kyong Cho14, Jae-Yean Kim14, Yong Xu15, Katarzyna Heller-Uszynska10, Han Miao1, Zhouchao Cheng1,
Shengping Zhang1, Jian Wu1, Yuhong Yang1, Houxiang Kang1, Man Li1, Huiqing Liang2, Xiaoli Ren2,
Zhongbin Shi2, Ming Wen2, Min Jian2, Hailong Yang2, Guojie Zhang2,12, Zhentao Yang2, Rui Chen2, Shifang Liu2,
Jianwen Li2, Lijia Ma2,12, Hui Liu2, Yan Zhou2, Jing Zhao2, Xiaodong Fang2, Guoqing Li2, Lin Fang2,
Yingrui Li2,12, Dongyuan Liu2, Hongkun Zheng2,3, Yong Zhang2, Nan Qin2, Zhuo Li2, Guohua Yang2,
Shuang Yang2, Lars Bolund2,16, Karsten Kristiansen17, Hancheng Zheng2,18, Shaochuan Li2,18, Xiuqing Zhang2,
Huanming Yang2, Jian Wang2, Rifei Sun1, Baoxi Zhang1, Shuzhi Jiang1, Jun Wang2,17, Yongchen Du1 & Songgang Li2
Cucumber is an economically important crop as well as a
model system for sex determination studies and plant vascular
biology. Here we report the draft genome sequence of Cucumis
sativus var. sativus L., assembled using a novel combination of
traditional Sanger and next-generation Illumina GA sequencing
technologies to obtain 72.2-fold genome coverage. The absence
of recent whole-genome duplication, along with the presence
of few tandem duplications, explains the small number of
genes in the cucumber. Our study establishes that five of the
cucumber’s seven chromosomes arose from fusions of ten
ancestral chromosomes after divergence from Cucumis melo.
The sequenced cucumber genome affords insight into traits
such as its sex expression, disease resistance, biosynthesis of
cucurbitacin and ‘fresh green’ odor. We also identify 686 gene
clusters related to phloem function. The cucumber genome
provides a valuable resource for developing elite cultivars and for studying the evolution and function of the plant
vascular system.
The botanical family Cucurbitaceae, commonly known as cucurbits and gourds, includes several economically important cultivated
plants, such as cucumber (C. sativus L.), melon (C. melo L.), watermelon (Citrullus lanatus (Thunb.) Matsum. & Nakai) and squash and
pumpkin (Cucurbita spp.). Agricultural production of cucurbits uses
9 million hectares of land and yields 184 million tons of vegetables,
fruits and seeds annually (http://faostat.fao.org). The cucurbit family also displays a rich diversity of sex expression, and the cucumber
has served as a primary model system for sex determination studies1.
The cucurbits are also model plants for the study of vascular biology,
as both xylem and phloem sap can be readily collected for studies of
long-distance signaling events2,3.
Despite the agricultural and biological importance of cucurbits,
knowledge of their genetics and genome is currently very limited. We
have therefore sequenced and assembled the genome of the domestic
cucumber, C. sativus var. sativus L.
All previous plant genome sequences have been derived using
traditional Sanger technology4–9. The recent development of
1Key Laboratory of Horticultural Crops Genetic Improvement of Ministry of Agriculture, Sino-Dutch Joint Lab of Horticultural Genomics Technology, Institute
of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China. 2BGI-Shenzhen, Shenzhen, China. 3Department of Biochemistry and
Molecular Biology, University of Southern Denmark, Odense, Denmark. 4Department of Plant Biology, College of Biological Sciences, University of California, Davis,
California, USA. 5College of Life Sciences, Beijing Normal University, Beijing, China. 6National Maize Improvement Center of China, Key Laboratory of Crop Genetic
Improvement and Genome of Ministry of Agriculture, Beijing Key Laboratory of Crop Genetic Improvement, China Agricultural University, Beijing, China. 7Boyce
Thompson Institute and USDA Robert W. Holley Center for Agriculture and Health, Cornell University, Ithaca, New York, USA. 8High-Tech Research Center, Shandong
Academy of Agricultural Sciences, Jinan, China. 9US Department of Agriculture, Agricultural Research Service, Vegetable Crops Research Unit, Department of
Horticulture, University of Wisconsin, Madison, Wisconsin, USA. 10Diversity Arrays Technology, Canberra, Australia. 11Wageningen UR Plant Breeding, Wageningen,
The Netherlands. 12The Graduate University of Chinese Academy of Sciences, Beijing, China. 13High School Affiliated to Renmin University of China, Beijing,
China. 14Division of Applied Life Science (BK21 and WCU program), PMBBRC and EB-NCRC, Gyeongsang National University, Jinju, Republic of Korea. 15National
Engineering Research Center for Vegetables, Beijing, China. 16Institute of Human Genetics, University of Aarhus, Aarhus, Denmark. 17Department of Biology,
University of Copenhagen, Copenhagen, Denmark. 18South China University of Technology, Guangzhou, China. 19These authors contributed equally to this work.
Correspondence should be addressed to Y.D. ([email protected]), S.H. ([email protected]), Jun Wang ([email protected]) or Songgang Li
([email protected]).
Received 6 May; accepted 28 September; published online 1 November 2009; doi:10.1038/ng.475
Nature Genetics volume 41 | number 12 | december 2009
next-generation sequencing technologies has significantly improved sequencing throughput at
a markedly reduced cost10. However, an intrinsic characteristic of next-generation technologies is
their short read length (~50 bp), which prevents their direct application for de novo assembly of
large genomes. When using these new technologies, assembly is typically carried out by mapping
these short reads onto a known reference genome11,12. For the cucumber genome, we carried out a
novel combination de novo sequencing strategy, taking advantage of the long read and clone length
of Sanger technology and, for the first time, the high sequencing depth and low unit cost of
Illumina GA technology.

Table 1 Cucumber genome assembly statistics
Assembly              Contig N50a (kb)  Contig total (Mb)  Scaffold N50 (kb)  Scaffold total (Mb)  % sequence anchored on chromosome
Sanger                2.6               204                19                 238                  -
Illumina GA           12.5              190                172                200                  -
Sanger + Illumina GA  19.8              226.5              1,140              243.5                72.8%
aN50 refers to the size above which half of the total length of the sequence set can be found.

RESULTS
Sequencing and assembly
We selected the ‘Chinese long’ inbred line 9930, which is commonly used in modern cucumber
breeding13, for our genome sequencing project. We generated a total of 26.5 billion high-quality
base pairs, or 72.2-fold genome coverage, of which the Sanger reads provided 3.9-fold coverage and
the Illumina GA reads provided 68.3-fold coverage (Supplementary Table 1). The GA reads ranged in
length from 42 to 53 bp.
We compared the assemblies obtained by Sanger reads only, Illumina GA reads only and Sanger plus
Illumina reads. The ‘hybrid’ approach achieved markedly longer N50 (the size above which half of
the total length of the sequence set can be found) in both contigs and scaffolds, so we used this
assembly for further analyses (Table 1 and Supplementary Table 2). The total length of the
assembled genome was 243.5 Mb, about 30% smaller than the genome size estimated by flow cytometry
of isolated nuclei stained with propidium iodide (367 Mb)14 and by K-mer depth distribution of
sequenced reads (350 Mb; Supplementary Fig. 1). Several types of satellite sequences were present
in the data set, comprising 23.2% of all Sanger reads and 76.2% of unassembled reads
(Supplementary Table 3). FISH analysis indicated that these are primarily located in the
centromeric and telomeric regions15. The cucumber genome also contains a large number of rRNA
sequences, and about 3.3% of the Sanger reads matched 45S rRNA. These results indicated that the
majority of the remaining 30% of unassembled regions of the genome are likely to be
heterochromatic satellite or rRNA sequences.
The high coverage of the cucumber genome by this assembly was also confirmed using the available
EST, fosmid and BAC sequences. The assembly contains 96.8% of the 63,312 cucumber unigenes
assembled from ~350,000 Roche 454–sequenced ESTs, 99.3% of the 6,952 NCBI-deposited ESTs of
cucumber, 91.2% of the 50,441 NCBI-deposited ESTs of melon and 98.7% of the six finished fosmid
and BAC sequences (Supplementary Table 4).
A genetic map was developed using 77 recombinant inbred lines from the intersubspecific cross
between Gy14 (a North American processing market–type cucumber cultivar) and PI183967 (an
accession of C. sativus var. hardwickii originating from India). The map spans 581 cM and contains
1,885 markers, including 995 microsatellite markers16 and 890 Diversity Arrays Technology markers
(marker sequences can be accessed at http://cucumber.genomics.org.cn). Using this map, we were
able to anchor 72.8% of the assembled sequences onto the seven chromosomes. Among the 1,885
markers, 1,763 (93.5%) were uniquely aligned and used for constructing the pseudochromosomes. The
majority (98.7%) of the markers were collinear with the sequence assembly (Fig. 1a). Comparison of
the genetic and physical distances between markers revealed recombination suppression of two 10-Mb
regions at either end of chromosome 4, a 20-Mb region on chromosome 5 and an 8-Mb region on
chromosome 7. Using high-resolution FISH, we confirmed previously identified segmental inversion16
within the suppression region on chromosome 5 between Gy14 and PI183967 (Fig. 1b), which provides
an explanation for recombination suppression in these regions. These regions of recombination
suppression are additionally useful for studying cucumber evolution during domestication.
After excluding 16 markers whose genetic positions were ambiguous, we examined the six remaining
regions that had conflicts between the genetic map and our assembly. Upon inspection, we found
that clone mate-pair information supported our assembly in all of these regions (Supplementary
Fig. 2). We also identified no misassembly within the regions covered by the six finished fosmid
or BAC sequences (Supplementary Fig. 3). The conflicts may be a result of chromosomal
rearrangement that occurred between the sequenced genotype 9930 and the genotypes used to create
the mapping population; alternatively, these markers may have been placed incorrectly on the
genetic map. Sequencing depth distribution showed that we obtained more than 10× coverage on more
than 97.5% of the assembly (Supplementary Fig. 4).

Repetitive sequences and transposons
The cucumber genome contains a large number of transposable elements, but only a few have
previously been identified. We therefore constructed repeat libraries using multiple de novo
methods and then derived a combined repeat library that contained 1,566 sequences (Supplementary
Table 5), of which 469 (29.9%) were manually classified (Supplementary Table 6). We then used this
library for repeat annotation of the cucumber genome. We identified a total of 54.4 Mb, which
represents ~24% of the genome, as repeats. Among them, 51.5% could be classified based on known
repeats. The long terminal repeat (LTR) retrotransposons (gypsy and copia) made up the majority of
the transposable element classes and comprised 10.4% of the genome (Supplementary Table 7). The
repeats divergence rate (percentage of substitutions in the matching region compared with
consensus repeats in constructed libraries) distribution showed a peak at 20%. A fraction of LTR
retrotransposons, long interspersed nuclear elements and DNA transposons (composing 2.3%, 0.4% and
0.2% of the genome, respectively) are of relatively recent origin, having a sequence divergence
rate of less than 5% (Supplementary Fig. 5).

Gene annotation
We used three gene-prediction methods (cDNA-EST, homology based and ab initio) to identify
protein-coding genes and then built a consensus gene set by merging all of the results
(Supplementary Fig. 6). We predicted 26,682 genes, with a mean coding sequence size of 1,046 bp
and an average of 4.39 exons per gene (Supplementary Table 8). Under an 80% sequence overlap
threshold, we found that 26.7% of the genes were supported by models from all three gene
prediction methods, 25% had both ab initio prediction and homology-based evidence, and 7.4% had
ab initio prediction and cDNA-EST expression evidence; the remaining genes were primarily derived
from pure
ab initio prediction, but the majority of these were supported by multiple gene finders
(Supplementary Table 9). About 81% of the genes have homologs in the TrEMBL
protein database, and 66% can be classified by InterPro. In sum, 82% of the genes have
either known homologs or can be functionally classified (Supplementary Table 10). In
addition to protein-coding genes, we identified 292 rRNA fragments and 699 tRNA,
238 small nucleolar RNA, 192 small nuclear RNA and 171 miRNA genes in the cucumber
genome (Supplementary Table 11).

Figure 1 Integrated genetic and physical map of cucumber. (a) Genetic versus physical distance
map of the seven cucumber chromosomes. The genetic map was constructed using a
recombinant inbred line mapping population from the intersubspecific cross between Gy14
(domestic cucumber) and PI183967 (wild cucumber). (b) Segmental inversion between
Gy14 and PI183967 on cucumber chromosome 5 detected by high-resolution FISH (12-2
and 12-7 denote individual fosmid clones). A low-resolution FISH analysis was also recently
reported16. Scale bars represent 1 µm.
[Figure 1 plot labels: panels LG1 to LG7; genetic distance (cM) versus physical distance (Mb); centromeric regions estimated by FISH.]

volume 41 | number 12 | december 2009 Nature Genetics
© 2009 Nature America, Inc. All rights reserved.
On the basis of pairwise protein sequence similarities, we carried out a gene family
clustering analysis on all genes in sequenced plants, using rice as an outgroup. The
cucumber genes consist of 15,669 families. Of these, 4,362 are cucumber unique families,
among which 3,784 are single-gene families (Supplementary Table 12). The EST confirmation
rate of these unique single-copy genes was much lower than the average of all predicted
genes (33.4% vs. 72.3%, respectively). This category may therefore contain a number of
false-positive predictions. In papaya, there are 4,622 unique families, but the actual
number of genes is estimated to be 24,746, which is lower than the 28,629 predicted genes7.
Thus, the actual number in cucumber should be lower than 26,682 and similar to that in
papaya. The smaller average gene family size in cucumber (1.71) and papaya (1.77) supports
this conclusion (Fig. 2a).
The cucumber genome contains the smallest number of tandem gene duplications (479) among
all the plants we compared, whereas grapevine has the largest number (5,382; Fig. 2a).
This may contribute in part to the small number of genes in cucumber.

Absence of recent whole-genome duplication
Whole-genome duplication (WGD) is common in angiosperm plants and produces a tremendous
source of raw material for gene genesis. Previous research has revealed a paleohexaploidy
(γ) event in the common ancestor of Arabidopsis thaliana and grapevine after the
divergence of monocotyledons and dicotyledons6. Subsequently, two WGDs (α and β) occurred
in Arabidopsis17 and one (p) in poplar8, whereas no recent WGD occurred in grapevine and
papaya. Evidence indicates that rice underwent an ancient WGD18. We carried out a
collinear gene-order analysis on the cucumber genome and observed no recent WGD and only
a few segmental duplication events (Supplementary Fig. 7). We also used the
distance-transversion rate at fourfold degenerate sites (4DTv method) to analyze
paralogous gene pairs between syntenic blocks in Arabidopsis and cucumber, respectively.
Two peaks (~0.06 and ~0.25) in Arabidopsis support the two recent WGDs (Fig. 2b). In
cucumber, the analysis showed ancient duplication events (peak at ~0.60) but did not
reveal recent WGD. This lack of recurrent WGD in the small cucumber genome provides an
important complement to the grapevine and papaya genomes to study ancestral forms and
arrangements of plant genes.

Synteny with flowering plant genomes
Given the similar gene arrangements between cucumber and other plant genomes, we defined
syntenic blocks that contained 5,473, 6,525, 9,842, 8,439 and 3,992 cucumber genes
collinear to Arabidopsis, papaya, poplar, grapevine and rice, respectively (Supplementary
Table 13 and Supplementary Figs. 8–12). The numbers of collinear genes were consistent
with the phylogenetic distances of the other plants to cucumber. Within the syntenic
blocks, we observed the highest density of collinear genes between cucumber and grapevine
(90.5 genes per Mb), followed by papaya (76.1; the low contiguity of genome assembly may
have, in part, decreased this value), poplar (68.8), rice (55.6) and Arabidopsis (43.5;
Supplementary Table 13). This indicates that Arabidopsis has the most reshuffled or
rearranged genome, whereas the genomes of grapevine and papaya are more conserved,
probably because they have not undergone WGD since the ancestral paleohexaploidy.

Substantial fusion events involved in chromosomal evolution
Melon and cucumber belong to the same genus, although cucumber has seven chromosomes and
melon has 12. Watermelon, their common distant relative, has 11 chromosomes. To
investigate cucurbit chromosomal evolution, we compared the melon19 and watermelon
genetic maps to the cucumber genome (Fig. 3a). In total, 348 (66.7%) of the 522 melon
markers and 136 (58.6%) of the 232 watermelon markers were aligned on the cucumber
chromosomes
(Supplementary Table 14). The comparison revealed that there has been no substantial
rearrangement of cucumber chromosome 7, which corresponds to melon chromosome 1
and watermelon group 7.
Using watermelon as an outgroup, we found that cucumber chromosomes 1, 2, 3, 5
and 6 were collinear to melon chromosomes 2 and 12, 3 and 5, 4 and 6, 9 and 10, and 8
and 11, respectively, indicating that after speciation these cucumber chromosomes each
resulted from a fusion of two ancestral chromosomes. We also found that cucumber
chromosome 6 and melon chromosome 3 have a syntenic segment, indicating that
interchromosome rearrangement occurred in one of the two genomes after speciation. Cucumber
chromosome 4 largely corresponds to melon chromosome 7, although a segment of melon
chromosome 8 is syntenic with cucumber chromosome 4 (crossing the centromere). These data
indicate that the rearrangement is most likely to have occurred before the divergence
of cucumber and melon. In addition to chromosome fusion and interchromosome
rearrangements, the comparison revealed the occurrence of several intrachromosome
rearrangements (Fig. 3a).

Figure 2 Comparison of cucumber genome with other sequenced plant genomes. (a) Numbers of
predicted genes, numbers of tandem duplicated genes and gene family sizes of the six sequenced
plant genomes. (b) The 4DTv distribution of duplicate gene pairs in cucumber and Arabidopsis,
calculated based on alignment of codons with HKY substitution model.
[Figure 2 plot labels: (a) no. of all predicted genes (×1,000), no. of tandem duplicated genes (×1,000) and no. of genes per family for O. sativa, V. vinifera, C. sativus, P. trichocarpa, C. papaya and A. thaliana; (b) percentage of syntenic blocks versus 4DTv for C. sativus and A. thaliana, marking the two recent WGD duplications in Arabidopsis and the ancient duplications in cucumber.]

Cucumber-melon microsynteny
To estimate the sequence divergence rate, we compared the four sequenced melon BACs to
the cucumber genome (Fig. 3b and Supplementary Fig. 13). There are 56 genes on the melon
BACs, 52 of which are collinear with the cucumber genome. The mean sequence similarity
over coding regions is 95%. Although the gene region similarity is very high, the repeat
content between the two genomes is quite different. New transposable elements were
frequently inserted in the intergenic regions of both genomes. Hence, only 54% of the
BAC sequences could be aligned onto the cucumber genome, with an average of 88% sequence
identity. Nonetheless, the highly conserved gene content and order between the two
species make the cucumber genome useful for genetic analysis of melon.
Using the annotated genes in the four melon BACs, we obtained and manually curated eight
orthologous families among rice, cucumber, melon, Arabidopsis and papaya. Extrapolating
from the age of divergence between Arabidopsis and papaya (54–90 million years ago),
we estimated that cucumber and melon diverged about 4–7 million years ago, which is
consistent with a previous estimate of 9 ± 3 million years ago20.

Pathogen resistance genes
Only 61 nucleotide-binding site (NBS)-containing resistance (NBS-R) genes have been
identified in cucumber, similar to papaya (55)7 but only a fraction of what is found in
Arabidopsis (200), poplar (398) and rice (600)8. Distribution of NBS genes on chromosomes
is nonrandom, with only five genes located on chromosomes 1, 6 and 7
and 20 genes located on chromosome 2 (Supplementary Fig. 14).
Three-quarters of the NBS genes are located within 11 clusters, indicating that they evolved through tandem duplications, similar to other
known plant genomes.

Figure 3 Comparative genomic analysis of cucurbits. (a) Comparative analysis of the melon and watermelon genetic maps with the cucumber
sequence map. Cucumber, melon and watermelon have 7, 12 and 11 pairs of chromosomes, respectively. The current version of the watermelon
genetic map is organized into 18 genetic groups. (b) Syntenic blocks between the cucumber genome and a melon BAC sequence (GenBank
accession code EF188258.1). Genes are indicated by black arrows with the orientation indicated on the sequence. Rectangles, transposable
elements; red, retrotransposable elements; blue, DNA transposons; green, unclassified transposable elements. Orthologous sequence regions
between the two genomes are shown.

Figure 4 Lineage-specific expansion of the LOX gene family in the five
sequenced dicot genomes and rice genome. The LOX family is divided
into two groups, type I and type II. The two tandem duplicated gene
clusters are ordered and shown on chromosomes 2 and 4, as well as one
unmapped scaffold of the cucumber genome.
[Figure 4 labels: Chr. 2 (2,190 to 2,250 kb); Chr. 4 (9,525 to 9,625 kb); Scaffold 337; type I and type II; C. sativus, A. thaliana, C. papaya, P. trichocarpa, V. vinifera, O. sativa; scale bar 20 kb.]
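The 4DTv statistic used earlier to look for whole-genome duplications can be illustrated with a short Python sketch. This is not the pipeline used for the analysis: the function name is hypothetical, the input is assumed to be two paralogous coding sequences already aligned codon by codon, and the multiple-hit correction -0.5 ln(1 - 2p) is a common transversion-only approximation substituted here for the HKY model named in the Figure 2 legend.

```python
import math

# First two bases of the codon families whose third position is fourfold
# degenerate in the standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}


def fourfold_transversion_rate(cds_a, cds_b):
    """Return (raw, corrected) 4DTv for two aligned, in-frame coding sequences."""
    if len(cds_a) != len(cds_b) or len(cds_a) % 3:
        raise ValueError("sequences must be aligned codon by codon")
    sites = transversions = 0
    for i in range(0, len(cds_a), 3):
        codon_a, codon_b = cds_a[i:i + 3].upper(), cds_b[i:i + 3].upper()
        # Keep only codons where both copies share a fourfold degenerate prefix.
        if codon_a[:2] != codon_b[:2] or codon_a[:2] not in FOURFOLD_PREFIXES:
            continue
        base_a, base_b = codon_a[2], codon_b[2]
        if base_a not in "ACGT" or base_b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        sites += 1
        # A transversion exchanges a purine for a pyrimidine or vice versa.
        if (base_a in PURINES) != (base_b in PURINES):
            transversions += 1
    if sites == 0:
        raise ValueError("no shared fourfold degenerate sites")
    raw = transversions / sites
    # Assumed multiple-hit correction; it is undefined once raw reaches 0.5.
    corrected = -0.5 * math.log(1 - 2 * raw) if raw < 0.5 else float("inf")
    return raw, corrected


# Toy pair: four fourfold degenerate sites, one of which carries a transversion.
raw, corrected = fourfold_transversion_rate("GCTGGACCCACT", "GCAGGACCCACC")
```

Applying a function of this kind to every paralogous pair within syntenic blocks and plotting a histogram of the corrected values would give a distribution of the type shown in Fig. 2b.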
The lipoxygenase (LOX) pathway has an important role in developmentally and environmentally regulated processes in plants21 and
generates short-chain aldehydes and alcohols that are involved in plant
defense and pest resistance22. The LOX gene family has been notably
expanded in the cucumber genome (23 LOX genes in cucumber, 6 in
Arabidopsis, 15 in papaya, 21 in poplar, 18 in grapevine and 15 in rice).
Fourteen of the LOX genes are specific to the cucumber lineage. The
majority of cucumber LOX genes (19 of 23) are distributed in three
clusters, the largest of which contains 11 members that are arranged
in tandem (Fig. 4). The other sequenced plant genomes show no obvious LOX clustering, with the exception of grapevine, which has one
cluster harboring six copies.
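Clusters such as those described for the NBS and LOX genes can be located by a simple positional grouping. The Python sketch below shows one hypothetical way to do this; the gap threshold, the function name and the gene coordinates are illustrative assumptions, because the criteria actually used to define clusters are not stated here.

```python
from collections import defaultdict


def find_clusters(genes, max_gap=100_000, min_size=2):
    """Group family members lying close together on the same chromosome.

    genes: iterable of (gene_id, chromosome, start) tuples.
    A member joins the current cluster when its start lies within max_gap bp
    of the previous member; groups smaller than min_size are discarded.
    """
    by_chrom = defaultdict(list)
    for gene_id, chrom, start in genes:
        by_chrom[chrom].append((start, gene_id))

    clusters = []
    for chrom in sorted(by_chrom):
        members = sorted(by_chrom[chrom])  # order along the chromosome
        current = [members[0]]
        for prev, this in zip(members, members[1:]):
            if this[0] - prev[0] <= max_gap:
                current.append(this)
            else:
                if len(current) >= min_size:
                    clusters.append([gene for _, gene in current])
                current = [this]
        if len(current) >= min_size:
            clusters.append([gene for _, gene in current])
    return clusters


# Made-up coordinates, for illustration only.
example = [
    ("g1", "chr2", 100_000), ("g2", "chr2", 150_000), ("g3", "chr2", 900_000),
    ("g4", "chr4", 10_000), ("g5", "chr4", 60_000), ("g6", "chr4", 110_000),
]
clusters = find_clusters(example)
```

In this toy input, g3 lies too far from its neighbors to join a cluster, so only two groups are reported. A criterion based on the number of intervening genes rather than physical distance would be an equally reasonable choice.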
Given that the cucumber has only 61 NBS-R genes, the expanded
lipoxygenase pathway might be a complementary mechanism to cope
with biotic stress. In support of this hypothesis, Arabidopsis has more
NBS-R genes and fewer LOX genes than does papaya. The volatile
(E,Z)-2,6-nonadienal (NDE) gives cucumber its ‘fresh green’ flavor23
and confers resistance to some bacteria and fungi24. Lipoxygenase and
one type of hydroperoxide lyase, 9-HPL, synthesize NDE from linolenic acid precursors. Genes encoding enzymes with 9-HPL activity
are rarely found in other plants25. However, cucumber contains two
tandem HPL genes, one of which has been experimentally confirmed
as encoding an enzyme with 9-HPL activity25. The expansion of the
LOX gene family and the duplicated HPL genes may be related to the
high level of NDE synthesis in cucumber.
Eukaryotic translation initiation factors, particularly the eIF4E and
eIF4G families, confer recessive resistance to plant RNA virus infections. An EIF4E gene in melon was found to mediate recessive resistance against melon necrotic spot virus26. In the cucumber genome,
three EIF4E and three EIF4G genes have been identified, providing
candidates for known recessive resistance genes against RNA viruses
such as zucchini yellow mosaic virus and watermelon mosaic virus27.
In some wild melon genotypes, enhanced expression of two glyoxylate
aminotransferase genes (At1 and At2) controls the resistance to downy
mildew, a devastating foliar disease of cucurbits28. We identified two
At homologs in cucumber that could be candidate genes for downy
mildew resistance.
Novel biosynthetic pathways
Cucurbitacins are bitter cucurbit triterpenoid compounds that are
toxic to most organisms but can attract specialized insects29,30. The
presence of cucurbitacin in the cucumber is controlled by a mendelian gene, Bi30. Oxidosqualene cyclase (OSC) catalyzes the formation of
the triterpene carbon framework in plants31. An OSC gene, CPQ,
in squash (Cucurbita pepo L.) encodes the first committed enzyme in the
cucurbitacin biosynthesis pathway32. In cucumber, we identified
four OSC genes; the CPQ ortholog Csa008595 resides in a genetic
interval that defines the Bi gene (Supplementary Fig. 15). Notably,
Csa008595 forms a cluster that contains an acyltransferase-encoding
gene (Csa008594) and two cytochrome P450–encoding genes
(Csa008596 and Csa008597). Three of these (Csa008594, Csa008595
and Csa008597) are coexpressed strongly in cucumber leaf tissue
(Supplementary Fig. 16) in a pattern similar to that of the operon-like gene cluster involved in thalianol biosynthesis in Arabidopsis33.
This gene cluster may therefore catalyze the stepwise formation of
cucurbitacin in cucumber.
Cucumber is a model system for studying sex expression in plants1.
Ethylene stimulates femaleness and is considered the sex hormone
of cucumber34. We identified 137 cucumber genes that are related
to the biosynthetic and signaling pathways of ethylene35,36, but
we found no gene family expansion in these pathways compared
with other sequenced plant genomes (Supplementary Table 15).
Thus, the origin of monoecy in cucumber might involve other
evolutionary mechanisms.
The melon gene Cm-ACS7 (ref. 37) and its cucumber ortholog
Cs-ACS2 (ref. 38) encode 1-aminocyclopropane-1-carboxylate synthase (ACS), a key regulatory enzyme in the ethylene biosynthetic
pathway. Both genes are crucial to the inhibition of male organs and
development of the female flower. In situ mRNA hybridization experiments revealed that both Cm-ACS7 and Cs-ACS2 transcripts accumulate only in the pistil and ovule, whereas their Arabidopsis ortholog,
AT4G26200 (Supplementary Fig. 17), is expressed only in the roots39.
We also identified two ethylene-responsive elements (AWTTCAAA)
and one element responsive to the flower meristem identity gene LEAFY
(CCAATGT) within the Cs-ACS2 and Cm-ACS7 promoter sequences,
but these were absent from the promoter of AT4G26200. These findings indicate that the evolution of unisexual flowers in cucurbits may
have involved the acquisition of new cis elements of the ACS genes.
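A search for cis elements of this kind can be sketched in a few lines of Python. The example below is illustrative only: the function names and the short test sequence are made up, the two motifs are the ones quoted above, and W is read as the IUPAC code for A or T. It is not meant to reproduce the promoter analysis itself.

```python
import re

# Regular-expression classes for the IUPAC nucleotide codes used here.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "[AT]", "S": "[CG]",
         "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTWSRYKMN", "TGCAWSYRMKN")


def reverse_complement(seq):
    """Reverse-complement a sequence, including IUPAC ambiguity codes."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif(promoter, motif):
    """Return sorted (position, strand) hits of an IUPAC motif in a promoter.

    Positions are 0-based starts on the forward strand.
    """
    promoter = promoter.upper()
    hits = set()
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        # A lookahead makes overlapping occurrences visible to finditer.
        regex = re.compile("(?=" + "".join(IUPAC[base] for base in pattern) + ")")
        hits.update((match.start(), strand) for match in regex.finditer(promoter))
    return sorted(hits)


# Synthetic sequence: one AWTTCAAA site on the forward strand and one
# CCAATGT site on the reverse strand.
promoter = "GGAATTCAAACCACATTGG"
ethylene_hits = find_motif(promoter, "AWTTCAAA")
leafy_hits = find_motif(promoter, "CCAATGT")
```

Scanning both strands matters because a regulatory element can lie in either orientation relative to the gene; whether the original analysis did so is not stated.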
To better understand the mechanism of sex determination in
cucumber, we generated 359,105 EST sequences from near-isogenic
unisexual and bisexual flower buds using the 454 pyrosequencing
technology. Our analysis revealed that six auxin-related genes (auxin
can regulate sex expression by stimulating ethylene production40) and
three short-chain dehydrogenase or reductase genes (homologs of
the sex determination gene ts2 in maize41) are more highly expressed
in unisexual flowers (Supplementary Table 16). This analysis provides an important resource for further study of sex determination
in cucumber.
Novel developmental programs
The tendril is a specific climbing tool of vines, such as Vitaceae and all
Cucurbitaceae. Darwin considered tendrils a key innovation in plant
evolution42. In cucumber and grapevine, gibberellic acid regulates
tendril formation43,44. In most plants, the transition of GA12-aldehyde to GA12 is catalyzed by cytochrome P450 monooxygenase.
In cucurbits, it is also catalyzed by specific GA-7-oxidase genes, which
are absent from Arabidopsis45. Cucumber has two GA-7-oxidase genes
(Supplementary Table 17). GA-20-oxidase controls key steps leading
to bioactive GA1 and GA4, and our data show that the cucumber has
three lineage-specific clades (three copies; Supplementary Fig. 18).
These specific genes might be associated with the role of gibberellic
acid in the regulation of tendril formation. Tendril coiling involves
rapid cell wall modification46, and expansins are cell wall–loosening
proteins in plants47. We found that, in cucumber, the expansin subfamily EXLA has undergone marked expansion through tandem
duplication (eight genes in cucumber, compared with one to three
genes in other genomes; Supplementary Fig. 19); this event may have
contributed to the development of tendril coiling in cucumber.
Use in plant vascular biology studies
The evolution of the plant vascular system, comprising xylem and
phloem tissues, had a pivotal role in the emergence of land plants.
The sieve tube system of phloem, the equivalent of the animal arterial system, delivers nutrients and signaling molecules to developing
organs2. A BLASTP analysis of 1,209 protein fragments from pumpkin
phloem48 identified 800 phloem proteins in the cucumber genome
(Supplementary Table 18). Using these cucumber proteins, we conducted orthologous gene family (cluster) analysis (Supplementary
Table 19) with their homologs in other vascular plants as well as the
nonvascular moss Physcomitrella patens49. In total, we constructed
686 clusters (Table 2). About two-thirds (49 of 75) of the Arabidopsis
and half (57 of 120) of the rice phloem proteins identified in previous
studies50,51 were included in this data set, indicating the effectiveness
of these analyses and the value of this resource for vascular biology
studies in plants.
The vascular and nonvascular plants shared 596 clusters; between
monocots and eudicots, there are 648 clusters in common. Phloem protein 2 (PP2; cluster 2432) genes are present in angiosperms but absent from the
moss genome. PP2-like genes are also present in gymnosperms52, indicating their association with the advent of vascular plants. In cucurbits,
these genes can increase the size-exclusion limit of plasmodesmata and
facilitate cell-to-cell traffic of macromolecules52 and thus are likely to
have an essential role in vascular function. The sieve element occlusion
proteins (gene cluster 4754), present in all eudicots but absent from
mosses and monocots, represent a novel mechanism that evolved for
sealing the sieve tube system after wounding53.
The average number of genes in each cluster ranges from 2.9 to 5.1 in
the vascular plants, compared to 1.7 in moss (Table 2). The increase of
gene numbers per cluster may be associated with the evolution of the plant
vascular system. The 16-kDa PP16 cluster (cluster 2599) has an average
of 3.7 genes in the vascular plants compared to 2 in moss. The CmPP16
gene in pumpkin is involved in transport of mRNA into the phloem3. The
increase of the number of PP16 genes in vascular plants indicates these new
members may be involved in long-distance trafficking of mRNA.
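The per-cluster averages quoted from Table 2 are simply gene counts divided by cluster counts, which makes them easy to recheck. The Python sketch below does this with the values printed in Table 2; the variable and function names are hypothetical, and rounding to one decimal place is an assumption about how the printed averages were obtained.

```python
# (genes, gene clusters, printed average) per species, copied from Table 2.
TABLE2 = {
    "P. patens": (1072, 622, 1.7),
    "O. sativa": (2458, 676, 3.6),
    "S. bicolor": (2780, 679, 4.1),
    "A. thaliana": (2351, 682, 3.5),
    "C. papaya": (1944, 672, 2.9),
    "P. trichocarpa": (3454, 684, 5.1),
    "C. sativus": (1986, 686, 2.9),
    "V. vinifera": (2535, 668, 3.8),
}


def recompute_averages(table):
    """Return {species: genes / clusters, rounded to one decimal place}."""
    return {species: round(genes / clusters, 1)
            for species, (genes, clusters, _) in table.items()}


def rows_differing_from_print(table, tolerance=1e-9):
    """List species whose recomputed average differs from the printed value."""
    recomputed = recompute_averages(table)
    return [species for species, (_, _, printed) in table.items()
            if abs(recomputed[species] - printed) > tolerance]


averages = recompute_averages(TABLE2)
differing = rows_differing_from_print(TABLE2)
```

With the counts as printed, the recomputed values agree for six species. For A. thaliana and P. trichocarpa they come out at 3.4 and 5.0 rather than 3.5 and 5.1, a small difference that may reflect the rounding convention or underlying counts that differ slightly from those shown.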
To better understand xylem formation, we compared gene families
related to lignin and cellulose biosynthesis between woody and herbaceous plants. The perennial woody plants, poplar and grapevine,
have a large number of lignin biosynthesis–related genes (48 and 49,
respectively), whereas the semiwoody plant papaya has an intermediate number (39). In contrast, the herbaceous plants Arabidopsis
and cucumber have smaller numbers (28 and 26, respectively;
Supplementary Table 20). Among these gene families, the number
of genes in the cinnamyl alcohol dehydrogenase (CAD) family was consistent with
this trend. In poplar and grapevine, homologs for AT4G37980 and
AT4G37990 in Arabidopsis, which have low CAD enzymatic activity in vitro and may have only a minor role in lignin formation in this species54, were expanded markedly. In papaya, there is an
expansion of homologs for AT1G37970, which lack detectable CAD catalytic activities in vitro but are expressed predominantly in lignin-forming tissues54 (Supplementary Fig. 20). Thus, the
expansion of CAD genes may be associated with wood formation. It
is also notable that grapevine has the largest PAL gene family, with 15
members, and that poplar and papaya have the largest number of HCT
genes, with 7 members. Of the cellulose biosynthesis–related genes,
poplar has more CESA and COB genes (18 of each) than do any of
the other sequenced dicots (Supplementary Table 20).

Table 2 Summary of orthologous gene families (clusters)
established using cucumber genes homologous to pumpkin phloem
proteins

Species           Genes   Gene clusters   Average genes per cluster
P. patens (a)     1,072   622             1.7
O. sativa         2,458   676             3.6
S. bicolor        2,780   679             4.1
A. thaliana       2,351   682             3.5
C. papaya         1,944   672             2.9
P. trichocarpa    3,454   684             5.1
C. sativus (b)    1,986   686             2.9
V. vinifera       2,535   668             3.8

(a) Moss (P. patens) was used as the only outgroup.
(b) For each cluster, at least one cucumber phloem protein was included.
DISCUSSION
The sequence of the cucumber genome provides an invaluable new
resource for biological research and breeding of cucurbits. The high
collinearity between cucumber and melon genomes enables cucumber
to serve as a model system in the Cucurbitaceae family for comparative genomics studies in plants. The cucumber genome and related
transcriptome analysis can provide insights into the mechanisms
underlying sex determination, an important biological process that
has been well characterized in cucumber at the phenotypic level.
The genome can also advance our knowledge of the evolution and
function of the plant vascular system.
We have also shown that, in combination with traditional Sanger
sequencing, next-generation DNA sequencing technologies can be
used effectively for de novo sequencing of plant genomes, making
it possible to carry out rapid and low-cost sequencing for other
important plant species.
Methods
Methods and any associated references are available in the online version
of the paper at http://www.nature.com/naturegenetics/.
Accession codes. The cucumber genome sequence has been deposited
in GenBank with accession code ACHR00000000 (the version described
here is the first version, with accession code ACHR01000000).
Note: Supplementary information is available on the Nature Genetics website.
Acknowledgments
We thank L. Goodman for assistance in editing the manuscript and R. Quatrano,
L. Kochian, L. Comai, V. Sundaresan, S. Kamoun and S. Renner for critical readings
of the manuscript. This work was funded by the Chinese Ministry of Agriculture
(948 program), Ministry of Science and Technology (2006DFA32140,
2007CB815701, 2007CB815703 and 2007CB815705) and Ministry of Finance
(1251610601001); the National Natural Science Foundation of China (30871707
and 30725008); the Chinese Academy of Agricultural Sciences (seed grant to S.H.);
the Chinese Academy of Science (GJHZ0701-6 and KSCX2-YWN-023); the US
Department of Agriculture (National Research Initiative grant 2006-35304-17346
to W.J.L.); the National Science Foundation (grant IOS-07-15513 to W.J.L.); and
the Korea Science and Engineering Foundation–Ministry of Education, Science
and Technology (WCU R33-10002 and BK21 grants to J.-Y.K.). W.K.C. was partly
supported by grants from the Environmental Biotechnology National Core
Research Center (R15-2003-012-01003-0) and National Research Laboratory
(2009-0066339). This work was also supported by the Shenzhen Municipal and
Yantian District Governments and the Society of Entrepreneurs & Ecology.
D. Qu and Z. Fang of the Chinese Academy of Agricultural Sciences provided
management support for this work.
AUTHOR CONTRIBUTIONS
S.H., Y.D., Jun Wang and Songgang Li managed the project. S.H., Z.Z., W.J.L., X.G.
and R.L. designed the analyses. X.G., H.M., L.L., Yuanyuan Ren, G.T., Y. Lu, Z.X.,
J.C., A., Z.W., J. Zhang, H. Liang, X.R., M.J., Hailong Yang, R.C., Shifang Liu and
X.Z. conducted DNA preparation and sequencing. X.W., B.X., K.L., W.J., Guangcun
Li, Z.F., J.S., A.K., E.A.G.v.d.V. and Y.X. contributed new reagents and analytic tools.
S.H., Z.Z., W.J.L., X.G., R.L., X.W., B.X., K.L., W.J., J.H., Z.J., Yi Ren, Ying Li, X.L.,
S.W., Q.S., W.K.C., J.-Y.K., K.H.-U., H.M., Z.C., S.Z., J. Wu, Y.Y., H.K., Y.W., J.G.,
Y.H., M.L., B. Zhao, Shiqiang Liu, W.F., P.N., H. Zhu, Jun Li, J.R., W.Q., M. Wang,
Q.H., B.L., Q.C., Y.B., Z.S., M. Wen, G.Z., Z.Y., Jianwen Li, L.M., H. Liu, Y. Zhou,
J. Zhao, X.F., Guoqing Li, L.F., Yingrui Li, D.L., Hancheng Zheng and Shaochuan
Li conducted the data analyses. S.H., R.L., Z.Z. and W.J.L. wrote the paper. Y.D.,
R.S., B. Zhang, S.J., G.Y., S.Y., Hongkun Zheng, Y. Zhang, N.Q., Z.L., L.B., K.K.,
Huanming Yang and Jian Wang revised the paper.
Published online at http://www.nature.com/naturegenetics/.
Reprints and permissions information is available online at http://npg.nature.com/
reprintsandpermissions/.
1. Tanurdzic, M. & Banks, J.A. Sex-determining mechanisms in land plants. Plant Cell
16, S61–S71 (2004).
2. Lough, T.J. & Lucas, W.J. Integrative plant biology: role of phloem
long-distance macromolecular trafficking. Annu. Rev. Plant Biol. 57, 203–232
(2006).
3. Xoconostle-Cázares, B. et al. Plant paralog to viral movement protein that potentiates
transport of mRNA into the phloem. Science 283, 94–98 (1999).
4. Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering
plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
5. International Rice Genome Sequencing Project. The map-based sequence of the
rice genome. Nature 436, 793–800 (2005).
6. Jaillon, O. et al. The grapevine genome sequence suggests ancestral hexaploidization
in major angiosperm phyla. Nature 449, 463–467 (2007).
7. Ming, R. et al. The draft genome of the transgenic tropical fruit tree papaya
(Carica papaya Linnaeus). Nature 452, 991–996 (2008).
8. Tuskan, G.A. et al. The genome of black cottonwood, Populus trichocarpa (Torr. &
Gray). Science 313, 1596–1604 (2006).
9. Yu, J. et al. A draft sequence of the rice genome (Oryza sativa L. ssp. indica).
Science 296, 79–92 (2002).
10.Shendure, J., Mitra, R.D., Varma, C. & Church, G.M. Advanced sequencing
technologies: methods and goals. Nat. Rev. Genet. 5, 335–344 (2004).
11.Bentley, D.R. et al. Accurate whole human genome sequencing using reversible
terminator chemistry. Nature 456, 53–59 (2008).
12.Wang, J. et al. The diploid genome sequence of an Asian individual. Nature 456,
60–65 (2008).
13.Staub, J.E., Serquen, F.C., Horejsi, T. & Chen, J.-f. Genetic diversity in cucumber
(Cucumis sativus L.): IV. An evaluation of Chinese germplasm1. Genet. Resour.
Crop Evol. 46, 297–310 (1999).
14.Arumuganathan, K. & Earle, E. Nuclear DNA content of some important plant
species. Plant Mol. Biol. Rep. 9, 208–218 (1991).
15.Han, Y.H. et al. Distribution of the tandem repeat sequences and karyotyping in
cucumber (Cucumis sativus L.) by fluorescence in situ hybridization. Cytogenet.
Genome Res. 122, 80–88 (2008).
16.Ren, Y. et al. An integrated genetic and cytogenetic map of the cucumber genome.
PLoS One 4, e5795 (2009).
17.Bowers, J.E., Chapman, B.A., Rong, J. & Paterson, A.H. Unravelling angiosperm
genome evolution by phylogenetic analysis of chromosomal duplication events.
Nature 422, 433–438 (2003).
18.Yu, J. et al. The genomes of Oryza sativa: a history of duplications. PLoS Biol. 3,
e38 (2005).
19.Fernandez-Silva, I. et al. Bin mapping of genomic and EST-derived SSRs in melon
(Cucumis melo L.). Theor. Appl. Genet. 118, 139 (2008).
20.Schaefer, H., Heibl, C. & Renner, S.S. Gourds afloat: a dated phylogeny reveals an
Asian origin of the gourd family (Cucurbitaceae) and numerous oversea dispersal
events. Proc. Biol. Sci. 276, 843–851 (2009).
Nature Genetics volume 41 | number 12 | december 2009
21.Liavonchanka, A. & Feussner, I. Lipoxygenases: occurrence, functions and catalysis.
J. Plant Physiol. 163, 348–357 (2006).
22.Schwab, W., Davidovich-Rikanati, R. & Lewinsohn, E. Biosynthesis of plant-derived
flavor compounds. Plant J. 54, 712–732 (2008).
23.Buescher, R.H. & Buescher, R.W. Production and stability of (E, Z )-2, 6-nonadienal,
the major flavor volatile of cucumbers. J. Food Sci. 66, 357–361 (2001).
24.Cho, M.J., Buescher, R.W., Johnson, M. & Janes, M. Inactivation of pathogenic
bacteria by cucumber volatiles (E,Z )-2,6-nonadienal and (E )-2-nonenal. J. Food
Prot. 67, 1014–1016 (2004).
25.Matsui, K. et al. Fatty acid 9- and 13-hydroperoxide lyases from cucumber. FEBS
Lett. 481, 183–188 (2000).
26.Nieto, C. et al. An eIF4E allele confers resistance to an uncapped and nonpolyadenylated RNA virus in melon. Plant J. 48, 452–462 (2006).
27.Wai, T. & Grumet, R. Inheritance of resistance to watermelon mosaic virus in the
cucumber line TMG-1: tissue-specific expression and relationship to zucchini yellow
mosaic virus resistance. Theor. Appl. Genet. 91, 699–706 (1995).
28.Taler, D., Galperin, M., Benjamin, I., Cohen, Y. & Kenigsbuch, D. Plant eR genes
that encode photorespiratory enzymes confer resistance against disease. Plant Cell
16, 172–184 (2004).
29.Balkema-Boomstra, A.G. et al. Role of cucurbitacin C in resistance to spider mite
Tetranychus urticae in cucumber Cucumis sativus L. J. Chem. Ecol. 29, 225–235
(2003).
30.Da Costa, C.P. & Jones, C.M. Cucumber beetle resistance and mite susceptibility
controlled by the bitter gene in Cucumis sativus L. Science 172, 1145–1146
(1971).
31.Phillips, D.R., Rasbery, J.M., Bartel, B. & Matsuda, S.P. Biosynthetic diversity in
plant triterpene cyclization. Curr. Opin. Plant Biol. 9, 305–314 (2006).
32.Shibuya, M., Adachi, S. & Ebizuka, Y. Cucurbitadienol synthase, the first committed
enzyme for cucurbitacin biosynthesis, is a distinct enzyme from cycloartenol
synthase for phytosterol biosynthesis. Tetrahedron 60, 6995–7003 (2004).
33.Field, B. & Osbourn, A.E. Metabolic diversification–independent assembly of operonlike gene clusters in different plants. Science 320, 543–547 (2008).
34.Rudich, J., Halevy, A.H. & Kedar, N. Ethylene evolution from cucumber plants as
related to sex expression. Plant Physiol. 49, 998–999 (1972).
35.Pirrung, M.C. Ethylene biosynthesis from 1-aminocyclopropanecarboxylic acid.
Acc. Chem. Res. 32, 711–718 (1999).
36.Stepanova, A.N. & Alonso, J.M. Ethylene signaling pathway. Sci. STKE 2005, cm3
(2005).
37.Boualem, A. et al. A conserved mutation in an ethylene biosynthesis enzyme leads
to andromonoecy in melons. Science 321, 836–838 (2008).
38.Li, Z. et al. Molecular isolation of the M gene suggests that a conserved-residue
conversion induces the formation of bisexual flowers in cucumber plants. Genetics
182, 1381–1385 (2009).
39.Yamagami, T. et al. Biochemical diversity among the 1-amino-cyclopropane-1carboxylate synthase isozymes encoded by the Arabidopsis gene family. J. Biol.
Chem. 278, 49102–49112 (2003).
40.Takahashi, H. & Jaffe, M.J. Further studies of auxin and ACC induced feminization
in the cucumber plant using ethylene inhibitors. Phyton (Buenos Aires) 44, 81–86
(1984).
41.DeLong, A., Calderon-Urrea, A. & Dellaporta, S.L. Sex determination gene
TASSELSEED2 of maize encodes a short-chain alcohol dehydrogenase required for
stage-specific floral organ abortion. Cell 74, 757–768 (1993).
42.Darwin, C.R. The Movements and Habits of Climbing Plants (Murray, London,
1875).
43.Boss, P.K. & Thomas, M.R. Association of dwarfism and floral induction with a
grape ‘green revolution’ mutation. Nature 416, 847–850 (2002).
44.Galun, E. The cucumber tendril—a new test organ for gibberellic acid. Cell. Mol.
Life Sci. 15, 184–185 (1959).
45.Lange, T. Cloning gibberellin dioxygenase genes from pumpkin endosperm by
heterologous expression of enzyme activities in Escherichia coli. Proc. Natl. Acad.
Sci. USA 94, 6553–6558 (1997).
46.Braam, J. In touch: plant responses to mechanical stimuli. New Phytol. 165,
373–389 (2005).
47.Cosgrove, D.J. Loosening of plant cell walls by expansins. Nature 407, 321–326
(2000).
48.Lin, M.-K., Lee, Y.-J., Lough, T.J., Phinney, B.S. & Lucas, W.J. Analysis of the
pumpkin phloem proteome provides insights into angiosperm sieve tube function.
Mol. Cell. Proteomics 8, 343–356 (2009).
49.Rensing, S.A. et al. The Physcomitrella genome reveals evolutionary insights into
the conquest of land by plants. Science 319, 64–69 (2008).
50.Aki, T., Shigyo, M., Nakano, R., Yoneyama, T. & Yanagisawa, S. Nano scale
proteomics revealed the presence of regulatory proteins including three FT-Like
proteins in phloem and xylem saps from rice. Plant Cell Physiol. 49, 767–790
(2008).
51.Giavalisco, P., Kapitza, K., Kolasa, A., Buhtz, A. & Kehr, J. Towards the proteome
of Brassica napus phloem sap. Proteomics 6, 896–909 (2006).
52.Dinant, S. et al. Diversity of the superfamily of phloem lectins (phloem protein 2)
in angiosperms. Plant Physiol. 131, 114–128 (2003).
53.Pélissier, H.C., Peters, W.S., Collier, R., van Bel, A.J. & Knoblauch, M. GFP tagging
of sieve element occlusion (SEO) proteins results in green fluorescent forisomes.
Plant Cell Physiol. 49, 1699–1710 (2008).
54.Kim, S.J. et al. Expression of cinnamyl alcohol dehydrogenases and their putative
homologues during Arabidopsis thaliana growth and development: lessons for
database annotations. Phytochemistry 68, 1957–1974 (2007).
ONLINE METHODS
© 2009 Nature America, Inc. All rights reserved.
Removal of contamination for Sanger reads. Sanger reads were aligned against
mitochondrion (assembled by us based on the gene sequences of mitochondria of rice and Arabidopsis), chloroplast (GenBank accession code AJ970307)
and satellite (GenBank X03768, X03769, X03770, X69163, AY424361 and
AY424362) sequences. Reads with identity >95% were filtered.
De novo assembly of Solexa data. The De Bruijn graph method was used to
represent all possible sequences assembled by Solexa reads, with a K-mer as a
node and the (K − 1) base overlap between two K-mers as an edge. Some tips
and low-coverage K-mers in the graph were removed to reduce sequencing
errors and eliminate branches. The De Bruijn graph was then converted to a
contiging graph by turning a series of linearly connected K-mers into a precontig node. Dijkstra’s algorithm was implemented to detect bubbles, which
were then straightforwardly merged into a single path if sequences of the
branches were sufficiently similar. By this approach, the repeat regions could
be assembled into consensus sequences.
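The graph construction described above can be illustrated with a minimal Python sketch. It is not the assembler used in this study: the function names, the `min_cov` cutoff and the toy reads are hypothetical, reverse-complement strands are ignored, and only low-coverage K-mer removal and the collapsing of linear chains into pre-contigs are shown (explicit tip clipping and the Dijkstra-based bubble merging are omitted).

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return (successors, coverage): K-mers are nodes, (K-1)-base overlaps are edges."""
    coverage = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            coverage[read[i:i + k]] += 1
    successors = defaultdict(set)
    for kmer in coverage:
        for base in "ACGT":
            nxt = kmer[1:] + base  # shares K-1 bases with kmer
            if nxt in coverage:
                successors[kmer].add(nxt)
    return successors, coverage


def drop_low_coverage(successors, coverage, min_cov):
    """Remove K-mers seen fewer than min_cov times (hypothetical cutoff), with their edges."""
    kept = {kmer for kmer, c in coverage.items() if c >= min_cov}
    return {kmer: {n for n in successors.get(kmer, ()) if n in kept} for kmer in kept}


def linear_paths(graph):
    """Collapse each unbranched chain of K-mers into one pre-contig sequence."""
    preds = defaultdict(set)
    for kmer, nxts in graph.items():
        for n in nxts:
            preds[n].add(kmer)

    def is_chain_start(kmer):
        # A chain starts unless the node has exactly one predecessor
        # and that predecessor has exactly one successor.
        if len(preds[kmer]) != 1:
            return True
        (p,) = preds[kmer]
        return len(graph[p]) != 1

    contigs = []
    for kmer in sorted(graph):
        if not is_chain_start(kmer):
            continue
        seq, cur = kmer, kmer
        while len(graph[cur]) == 1:
            (nxt,) = graph[cur]
            if len(preds[nxt]) != 1:
                break
            seq += nxt[-1]
            cur = nxt
        contigs.append(seq)
    return contigs


# Toy data: two overlapping reads seen twice, plus one read carrying an error.
reads = ["ATGGCGT", "ATGGCGT", "GGCGTGCA", "GGCGTGCA", "GGCGTTCA"]
successors, coverage = build_de_bruijn(reads, 4)
graph = drop_low_coverage(successors, coverage, 2)
contigs = linear_paths(graph)
```

In this toy case the erroneous read contributes only single-copy K-mers, so removing them leaves one unbranched chain. Isolated cycles are not reported by `linear_paths`; a fuller implementation would need to handle them.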
Contigs were next connected by paired reads to form a scaffolding graph.
Edges in this graph were connections between contigs, and the edge length
was estimated from the insert size of the paired reads. The paired-end information was used step by step, from insert sizes around 200 bp and 500 bp to
2 kb. At each step, two procedures were applied: the repeat-masking method
masked the complicated connections around repeat contigs, and the subgraph
linearization turned the interleaving contigs into linear structure. This process
yielded the final set of Solexa contigs and scaffolds.
Combination of Sanger reads and Solexa scaffolds. RePS2 (ref. 55) software
was used to assemble the Solexa scaffolds and Sanger reads. We counted the
depth of each 17-mer in the 3.9× plasmid and fosmid ends to create the 17-mer
database, which contained all the depth information of the 17-mers. This
database was then used to check all the contigs to identify repeated ones. A
contig was defined as a repeat if over 80% of the 17-mers it contained were
with higher depth than the threshold. After removing the repeat contigs, the
scaffolds were divided into fake paired reads with read length of 600 bp and
insert size of 1,700 bp. All segments over 200 bp were put into the second
data set, which was then assembled as a unique region. In the same way as the
construction of Solexa scaffolds, the plasmid, fosmid and BAC ends were used,
step by step, to construct a ‘superscaffold’.
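The repeat test on contigs can be sketched as follows. This is an illustrative reading of the rule stated above (a contig is a repeat if over 80% of its 17-mers exceed a depth threshold), not the RePS2 implementation; the function names are hypothetical, the depth threshold is left as a parameter because its value is not given here, and reverse-complement 17-mers are not merged.

```python
from collections import Counter


def kmer_depths(reads, k=17):
    """Count how often each K-mer occurs across the reads (the K-mer depth database)."""
    depths = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            depths[read[i:i + k]] += 1
    return depths


def is_repeat_contig(contig, depths, depth_threshold, k=17, min_fraction=0.8):
    """Flag a contig when more than min_fraction of its K-mers exceed depth_threshold."""
    kmers = [contig[i:i + k] for i in range(len(contig) - k + 1)]
    if not kmers:
        return False  # contig shorter than K: nothing to evaluate
    high = sum(1 for km in kmers if depths[km] > depth_threshold)
    return high / len(kmers) > min_fraction


# Toy example with K = 3 so the numbers stay readable.
toy_reads = ["ACGTACGT"] * 5 + ["TTGCA"]
toy_depths = kmer_depths(toy_reads, k=3)
repeat_flag = is_repeat_contig("ACGTAC", toy_depths, depth_threshold=3, k=3)
unique_flag = is_repeat_contig("TTGCA", toy_depths, depth_threshold=3, k=3)
```

A `Counter` returns zero for unseen K-mers, so contigs containing sequence absent from the reads are simply scored as low depth.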
Misassembly checking and gap filling. In the final stage, we used the repeat
sequences to fill the gaps in the scaffolds using the following steps. First, we
mapped all of the reads that contained paired-end information (Solexa and
plasmid reads, as well as fosmid and BAC ends) to the scaffolds, and we used
the unique contigs to establish the paired-end relationship between the contigs. Second, we identified repeat contigs with paired ends that uniquely connected two other scaffolded contigs. If the length of the repeat contig and the
estimated size of the gap were similar, the gap was filled by this repeat. Any
remaining repeat contigs that were not used for gap filling were added into
the final set of scaffolds.
Chromosome anchoring along the cucumber genetic map. The marker
sequences in the cucumber genetic map were aligned against the scaffold
sequences using BLASTN at an E-value cutoff of 1 × 10⁻²⁰. Hits with coverage >30% and identity >90% were considered mapped markers. Based on
the mapped markers, the scaffold sequences were anchored on the cucumber
chromosomes. During this process, the scaffolds with mapped markers that
showed inconsistent genetic positions were manually checked by paired-end
relationships; the incorrect scaffold was then split.
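The marker filter can be sketched in Python as below. The sketch assumes BLASTN was run with the standard 12-column tabular output, that coverage means the aligned span of the marker divided by the marker length, and that only the highest-scoring passing hit per marker is kept; none of these details is stated in the text, and the function and variable names are hypothetical.

```python
def mapped_markers(blast_lines, marker_lengths,
                   max_evalue=1e-20, min_cov=0.30, min_ident=90.0):
    """Return {marker: (scaffold, scaffold_start)} for hits passing the cutoffs.

    Columns assumed: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    best = {}
    for line in blast_lines:
        f = line.rstrip("\n").split("\t")
        marker, scaffold = f[0], f[1]
        ident = float(f[2])
        qstart, qend = int(f[6]), int(f[7])
        sstart = int(f[8])
        evalue, bits = float(f[10]), float(f[11])
        cov = (abs(qend - qstart) + 1) / marker_lengths[marker]
        if evalue > max_evalue or cov <= min_cov or ident <= min_ident:
            continue
        # Keep the best-scoring passing hit per marker (an assumption).
        if marker not in best or bits > best[marker][2]:
            best[marker] = (scaffold, sstart, bits)
    return {m: (s, pos) for m, (s, pos, _) in best.items()}


example_hits = [
    "m1\tscaf1\t98.0\t200\t4\t0\t1\t200\t5000\t5199\t1e-90\t350",
    "m2\tscaf2\t85.0\t200\t30\t0\t1\t200\t100\t299\t1e-40\t150",
    "m3\tscaf3\t99.0\t50\t0\t0\t1\t50\t10\t59\t1e-22\t90",
]
example_lengths = {"m1": 250, "m2": 200, "m3": 400}
anchors = mapped_markers(example_hits, example_lengths)
```

Here `m2` fails the identity cutoff and `m3` fails the coverage cutoff, so only `m1` is retained as a mapped marker. Detecting scaffolds whose markers disagree on genetic position would be a separate step.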
FISH analysis. The FISH protocol was described in a previous study16. To
better visualize the segmental inversion, we chose chromosome spreads where
chromosome 5 appeared in a straight form. Instead of showing all chromosomes16, only chromosome 5 is shown in Figure 1b of this study. In addition,
the image was taken in a higher resolution. Scale bars represent 1 µm, as
compared to 3 µm previously16. Red and green signals were detected with
anti-digoxigenin antibody coupled to rhodamine (Roche) and by anti-avidin
antibody conjugated with FITC (Vector Laboratories), respectively.
Nature Genetics
Identification of repetitive elements in the cucumber genome. Four de novo
software packages, ReAS56, PILER-DF57, RepeatScout58 and LTR_Finder59,
were used to search for repeat sequences within the cucumber genome. All
repeat sequences with lengths >100 bp and gap ‘N’ <5% constituted the raw
transposable element library.
The repeat elements belonging to rRNA and satellite sequences were first
filtered using BLASTN (E value ≤ 1 × 10⁻¹⁰, identity ≥ 80%, coverage ≥ 50%
and minimal matching length ≥ 100 bp). All-versus-all BLASTN (E value ≤ 1
× 10⁻¹⁰) searches were then conducted iteratively, and the shorter sequences
were filtered when two repeats aligned with identity ≥ 80%, coverage ≥ 80%
and minimal matching length ≥ 100 bp; this yielded a nonredundant transposable element library. The nonredundant repeats were then searched against
the Swiss-Prot protein database to filter the protein-coding genes by BLASTX
(E value ≤ 1 × 10⁻⁴, identity ≥ 30%, coverage ≥ 30% and minimal matching
length ≥ 30 amino acids). After manual curation, a de novo transposable element library for cucumber was obtained.
Transposable elements in the cucumber genome assembly were identified
both at the DNA and protein level. RepeatMasker was applied for DNA-level
identification using a custom library (a combination of Repbase, plant repeat
database and our cucumber de novo transposable element library). At the
protein level, RepeatProteinMask was used to conduct WU-BLASTX searches
against the transposable element protein database. Overlapping transposable
elements belonging to the same type of repeats were integrated together,
whereas those with low scores were removed if they overlapped >80% and
belonged to different types.
Gene prediction. Our strategy for gene prediction was to conduct de novo predictions on the repeat-masked genome and then integrate them with spliced
alignments of proteins and transcripts to genome sequences using GLEAN60.
Cucumber genome sequences were masked by identified repeat sequences
with length >500 bp, except for miniature inverted-repeat transposable elements, which are usually found near genes or inside introns. The EST and
full-length cDNA sequences of cucumber were processed by PASA61 to train
gene prediction software BGF62, GlimmerHMM63 and SNAP64. Augustus65
and Genscan66 software used gene model parameters trained for Arabidopsis.
We aligned the protein sequences of five sequenced plants (Arabidopsis, papaya,
poplar, grapevine and rice) onto the cucumber genome using TBLASTN, at
an E-value cutoff of 1 × 10⁻⁵, and the homologous genome sequences were
aligned against the matching proteins using GeneWise67 for accurate spliced
alignments. The cDNA and EST sequences of cucumber and melon were
aligned against the cucumber genome using BLAT (identity ≥ 0.95, coverage ≥ 0.90) to generate spliced alignments. We also aligned TIGR unigenes68
from Cucurbitales, Fabales and Fagales to the cucumber genome by ATT_gap2
(ref. 69). All of these resources were combined by GLEAN60 to produce the
consensus gene sets.
Identification of noncoding RNA genes in the cucumber genome. The tRNA
genes were identified by tRNAscan-SE70 with default parameters. The C/D-box
small nucleolar RNAs were identified by Snoscan71 using yeast rRNA and yeast
methylation sites. Other noncoding RNAs, including miRNA, small nuclear
RNA and H/ACA-box small nucleolar RNA, were identified using INFERNAL
software by searching against the Rfam72 database with default parameters.
Construction of gene families. We adapted the Treefam73 method to construct
gene families for the genes in cucumber, Arabidopsis, papaya, poplar, grapevine
and rice (outgroup).
Construction of syntenic blocks. We identified syntenic blocks between two
species (A and B) by an automatic clustering algorithm on a dot plot graph,
which included five steps. First, markers (gene pairs) were generated between
A and B. All protein sequences of A were aligned to all proteins of B using
BLASTP (E value < 1 × 10⁻¹⁰ and identity > 20%). The fragmental alignments
were conjoined for each gene pair. Those gene pairs with aligned regions covering <50% were filtered. The remaining gene pairs were plotted on the dot
graph as markers (points). Second, the Euclidean distance was calculated for
each pair. Distances were calculated based on the gene order in each chromosome rather than the genomic position. Third, hierarchical clustering was
doi:10.1038/ng.475
determined for all of the points. If the distance between two points was less
than the distance cutoff, a link was assigned. The distance cutoff was adapted
in accordance with the selected species. Fourth, the quality was estimated for
each cluster by calculating the point number (N), average point distance (D)
and correlation coefficient (R). A score (S) was calculated to show the overall
quality, defined as S = N × sqrt(2)/D × R. Finally, problematic clusters were
filtered. Clusters with N < 8 or |R| < 0.5 were filtered out. The clusters caused
by tandem duplication were further filtered by determining the slope (L) of
the regression line within a range of 0.1 < |L| < 10. This algorithm can also be
used to study intraspecies synteny.
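The cluster scoring and filtering steps can be sketched as follows. The formula is read literally as S = (N × √2 / D) × R, D is taken to be the mean Euclidean distance between neighbouring points after sorting, and L is the least-squares slope; these readings, and all function names, are assumptions made for illustration rather than the published implementation.

```python
import math


def cluster_quality(points):
    """Return (N, D, R, L, S) for one cluster of (order_in_A, order_in_B) points."""
    pts = sorted(points)
    n = len(pts)
    if n < 2:
        return n, 0.0, 0.0, 0.0, 0.0
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    # D: mean distance between neighbouring points (an assumed reading of D).
    d = sum(math.dist(a, b) for a, b in zip(pts, pts[1:])) / (n - 1)
    if sxx == 0 or syy == 0 or d == 0:
        # Vertical or horizontal runs: the correlation is undefined, treat as 0.
        slope = math.inf if sxx == 0 else 0.0
        return n, d, 0.0, slope, 0.0
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    s = n * math.sqrt(2) / d * r
    return n, d, r, slope, s


def keep_cluster(points):
    """Apply the stated filters: N >= 8, |R| >= 0.5 and 0.1 < |L| < 10."""
    n, _, r, slope, _ = cluster_quality(points)
    return n >= 8 and abs(r) >= 0.5 and 0.1 < abs(slope) < 10


diagonal = [(i, i) for i in range(10)]      # perfectly collinear block
tandem = [(3, j) for j in range(12)]        # one gene in A hitting a tandem array in B
n, d, r, slope, s = cluster_quality(diagonal)
```

Under this reading, a perfectly collinear block with unit gene spacing has D = √2 and R = 1, so S equals N, which may explain the √2 factor. The tandem array yields a vertical run of points and is rejected.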
4DTv calculation. After the identification of syntenic blocks, the pairwise
protein alignments for each gene pair were first constructed with MUSCLE74.
The nucleotide alignment was then created according to the protein alignment.
4DTv was then calculated on concatenated nucleotide alignments with HKY
substitution models75.
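As an illustration of the quantity being measured, the following sketch computes the raw (uncorrected) 4DTv from a codon-aligned pair of coding sequences, assuming the standard genetic code. The HKY correction applied in the study is not reproduced here, and the function name and toy sequences are hypothetical.

```python
# Codon families in the standard genetic code whose third position is
# fourfold degenerate (Leu CTN, Val GTN, Ser TCN, Pro CCN, Thr ACN,
# Ala GCN, Arg CGN, Gly GGN), keyed by their first two bases.
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}


def raw_4dtv(cds_a, cds_b):
    """Fraction of shared fourfold-degenerate third positions that differ by a transversion."""
    if len(cds_a) != len(cds_b):
        raise ValueError("sequences must come from the same codon alignment")
    sites = transversions = 0
    for i in range(0, len(cds_a) - 2, 3):
        ca, cb = cds_a[i:i + 3].upper(), cds_b[i:i + 3].upper()
        # Both codons must belong to the same fourfold-degenerate family.
        if ca[:2] != cb[:2] or ca[:2] not in FOURFOLD_PREFIXES:
            continue
        if ca[2] not in "ACGT" or cb[2] not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        # A transversion swaps a purine for a pyrimidine or vice versa.
        if (ca[2] in PURINES) != (cb[2] in PURINES):
            transversions += 1
    return transversions / sites if sites else float("nan")


# Codons: GCT/GCA (transversion), GGA/GGT (transversion),
# ATG/ATG (not fourfold degenerate), CCC/CCT (transition).
value = raw_4dtv("GCTGGAATGCCC", "GCAGGTATGCCT")
```

For a syntenic block, the aligned coding sequences of all gene pairs would be concatenated before calling the function, in line with the concatenated alignments described above.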
Comparative analysis between cucumber and melon. Cucumber genome
sequences were aligned with melon BAC sequences using NUCmer, a program
in the MUMmer package76. The delta-filter program was then run with the −1
option to remove complex alignments. Orthologous gene pairs were identified
by the reciprocal best method.
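The reciprocal best method can be sketched as follows: two genes are paired only if each is the other's top-scoring hit. The input format (query, subject, score) and the function names are assumptions for illustration; the scoring measure used in the study is not specified here.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject from (query, subject, score) tuples."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_best(a_vs_b, b_vs_a):
    """Return sorted (gene_in_A, gene_in_B) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical gene identifiers and scores.
cucumber_vs_melon = [("Cs1", "Cm1", 500), ("Cs1", "Cm2", 300), ("Cs2", "Cm2", 250)]
melon_vs_cucumber = [("Cm1", "Cs1", 480), ("Cm2", "Cs1", 310), ("Cm2", "Cs2", 240)]
orthologs = reciprocal_best(cucumber_vs_melon, melon_vs_cucumber)
```

In the toy data, `Cs2` finds `Cm2` as its best hit, but `Cm2` prefers `Cs1`, so only the `Cs1`/`Cm1` pair is reported.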
The Bayesian relaxed molecular clock approach was used to estimate divergence time using the program MULTIDIVTIME, which was implemented
using the Thornian Time Traveler (T3) package. The calibration time (fossil
record time) interval (54–90 million years ago) of Capparales was obtained
from previous results77,78.
URLs. Arabidopsis thaliana (TIGR Release 5.0), ftp://ftp.tigr.org/pub/data/a_thaliana/ath1;
Carica papaya (assembly v1.0, EVidence Modeler genes), http://
www.life.uiuc.edu/ming; Populus trichocarpa (assembly release v1.0, annotation v1.1), http://genome.jgi-psf.org/Poptr1_1/Poptr1_1.download.ftp.html;
Vitis vinifera (published assembly, annotation v1), http://www.genoscope.
cns.fr/externe/GenomeBrowser/Vitis/; Oryza sativa (assembly International
Rice Genome Sequencing Project build 3), http://rgp.dna.affrc.go.jp/IRGSP/
download.html; Oryza sativa (GLEAN genes annotated by Beijing Genomics
Institute), ftp.genomics.org.cn/pub/ricedb/rice_update_data/GLEAN_genes/
IRGSP_japonica/; Physcomitrella patens (assembly release v1.0, annotation
v1.1), http://genome.jgi-psf.org/Phypa1_1/Phypa1_1.home.html; Sorghum
bicolor (assembly release v1.0, annotation v1.4), http://www.phytozome.
net/sorghum; UniGene sequences of Cucurbitales, Fabales and Fagales, http://
plantta.jcvi.org/; cucumber marker sequences, http://cucumber.genomics.org.
cn; UniProt (Swiss-Prot/TrEMBL) release 14.1, http://www.uniprot.org/downloads;
InterPro v18.0, http://www.ebi.ac.uk/interpro/; KEGG release 47, ftp://
ftp.genome.jp/pub/kegg/pathway/; Repbase release 13.07, http://www.girinst.
org/repbase/index.html; Plant Repeat Databases (TIGR), http://plantrepeats.
plantbiology.msu.edu/index.html; Rfam release 9.0, http://rfam.sanger.ac.uk/;
Thornian Time Traveler (T3) package, http://abacus.gene.ucl.ac.uk/software.
html; RepeatMasker, http://www.repeatmasker.org.
55.Wang, J. et al. RePS: a sequence assembler that masks exact repeats identified
from the shotgun data. Genome Res. 12, 824–831 (2002).
56.Li, R. et al. ReAS: Recovery of ancestral sequences for transposable elements from
the unassembled reads of a whole genome shotgun. PLOS Comput. Biol. 1, e43
(2005).
57.Edgar, R.C. & Myers, E.W. PILER: identification and classification of genomic
repeats. Bioinformatics 21 Suppl 1, i152–i158 (2005).
58.Price, A.L., Jones, N.C. & Pevzner, P.A. De novo identification of repeat families
in large genomes. Bioinformatics 21 Suppl 1, i351–i358 (2005).
59.Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length
LTR retrotransposons. Nucleic Acids Res. 35, W265–W268 (2007).
60.Elsik, C.G. et al. Creating a honey bee consensus gene set. Genome Biol. 8, R13
(2007).
61.Campbell, M.A., Haas, B.J., Hamilton, J.P., Mount, S.M. & Buell, C.R. Comprehensive
analysis of alternative splicing in rice and comparative analyses with Arabidopsis.
BMC Genomics 7, 327 (2006).
62.Li, H. et al. Test data sets and evaluation of gene prediction programs on the rice
genome. J. Comput. Sci. Technol. 20, 446–453 (2005).
63.Majoros, W.H., Pertea, M. & Salzberg, S.L. TigrScan and GlimmerHMM: two
open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879
(2004).
64.Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
65.Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new
intron submodel. Bioinformatics 19 Suppl 2, ii215–ii225 (2003).
66.Salamov, A.A. & Solovyev, V.V. Ab initio gene finding in Drosophila genomic DNA.
Genome Res. 10, 516–522 (2000).
67.Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res. 14,
988–995 (2004).
68.Childs, K.L. et al. The TIGR Plant Transcript Assemblies database. Nucleic Acids
Res. 35, D846–D851 (2007).
69.Huang, X., Adams, M.D., Zhou, H. & Kerlavage, A.R. A tool for analyzing and
annotating genomic sequences. Genomics 46, 37–45 (1997).
70.Lowe, T.M. & Eddy, S.R. tRNAscan-SE: a program for improved detection of transfer
RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).
71.Lowe, T.M. & Eddy, S.R. A computational screen for methylation guide snoRNAs
in yeast. Science 283, 1168–1171 (1999).
72.Griffiths-Jones, S. et al. Rfam: annotating non-coding RNAs in complete genomes.
Nucleic Acids Res. 33, D121–D124 (2005).
73.Li, H. et al. TreeFam: a curated database of phylogenetic trees of animal gene
families. Nucleic Acids Res. 34, D572–D580 (2006).
74.Edgar, R.C. MUSCLE: multiple sequence alignment with high accuracy and high
throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
75.Hasegawa, M., Kishino, H. & Yano, T. Dating of the human-ape splitting by a
molecular clock of mitochondrial DNA. J. Mol. Evol. 22, 160–174 (1985).
76.Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome
Biol. 5, R12 (2004).
77.Crepet, W.L., Nixon, K.C. & Gandolfo, M.A. Fossil evidence and phylogeny: the age
of major angiosperm clades based on mesofossil and macrofossil evidence from
Cretaceous deposits. Am. J. Botany 91, 1666–1682 (2004).
78.Wikström, N., Savolainen, V. & Chase, M.W. Evolution of the angiosperms: calibrating
the family tree. Proc. Biol. Sci. 268, 2211–2220 (2001).
Vol 463 | 14 January 2010 | doi:10.1038/nature08670
ARTICLES
Genome sequence of the palaeopolyploid
soybean
Jeremy Schmutz1,2, Steven B. Cannon3, Jessica Schlueter4,5, Jianxin Ma5, Therese Mitros6, William Nelson7,
David L. Hyten8, Qijian Song8,9, Jay J. Thelen10, Jianlin Cheng11, Dong Xu11, Uffe Hellsten2, Gregory D. May12,
Yeisoo Yu13, Tetsuya Sakurai14, Taishi Umezawa14, Madan K. Bhattacharyya15, Devinder Sandhu16,
Babu Valliyodan17, Erika Lindquist2, Myron Peto3, David Grant3, Shengqiang Shu2, David Goodstein2, Kerrie Barry2,
Montona Futrell-Griggs5, Brian Abernathy5, Jianchang Du5, Zhixi Tian5, Liucun Zhu5, Navdeep Gill5, Trupti Joshi11,
Marc Libault17, Anand Sethuraman1, Xue-Cheng Zhang17, Kazuo Shinozaki14, Henry T. Nguyen17, Rod A. Wing13,
Perry Cregan8, James Specht18, Jane Grimwood1,2, Dan Rokhsar2, Gary Stacey10,17, Randy C. Shoemaker3
& Scott A. Jackson5
Soybean (Glycine max) is one of the most important crop plants for seed protein and oil content, and for its capacity to fix
atmospheric nitrogen through symbioses with soil-borne microorganisms. We sequenced the 1.1-gigabase genome by a
whole-genome shotgun approach and integrated it with physical and high-density genetic maps to create a
chromosome-scale draft sequence assembly. We predict 46,430 protein-coding genes, 70% more than Arabidopsis and
similar to the poplar genome which, like soybean, is an ancient polyploid (palaeopolyploid). About 78% of the predicted
genes occur in chromosome ends, which comprise less than one-half of the genome but account for nearly all of the genetic
recombination. Genome duplications occurred at approximately 59 and 13 million years ago, resulting in a highly duplicated
genome with nearly 75% of the genes present in multiple copies. The two duplication events were followed by gene
diversification and loss, and numerous chromosome rearrangements. An accurate soybean genome sequence will facilitate
the identification of the genetic basis of many soybean traits, and accelerate the creation of improved soybean varieties.
Legumes are an important part of world agriculture as they fix atmospheric nitrogen by intimate symbioses with microorganisms. The
soybean in particular is important worldwide as a predominant plant
source of both animal feed protein and cooking oil. We report here a
soybean whole-genome shotgun sequence of Glycine max var.
Williams 82, comprising 950 megabases (Mb) of assembled and
anchored sequence (Fig. 1), representing about 85% of the predicted
1,115-Mb genome1 (Supplementary Table 3.1). Most of the genome
sequence (Fig. 1) is assembled into 20 chromosome-level pseudomolecules containing 397 sequence scaffolds with ordered positions within
the 20 soybean linkage groups. An additional 17.7 Mb is present in
1,148 unanchored sequence scaffolds that are mostly repetitive and
contain fewer than 450 predicted genes. Scaffold placements were
determined with extensive genetic maps, including 4,991 single nucleotide polymorphisms (SNPs) and 874 simple sequence repeats
(SSRs)2–5. All but 20 of the 397 sequence scaffolds are unambiguously
oriented on the chromosomes. Unoriented scaffolds are in repetitive
regions where there is a paucity of recombination and genetic markers
(see Supplementary Information for assembly details).
The soybean genome is the largest whole-genome shotgun-sequenced plant genome so far and compares favourably to all other
high-quality draft whole-genome shotgun-sequenced plant genomes
(Supplementary Table 4). A total of 8 of the 20 chromosomes have
telomeric repeats (TTTAGGG or CCCTAAA) on both of the distal
scaffolds and 11 other chromosomes have telomeric repeats on a
single arm, for a total of 27 out of 40 chromosome ends captured
in sequence scaffolds. Also, internal scaffolds in 19 of 20 chromosomes contain a large block of characteristic 91- or 92-base-pair
(bp) centromeric repeats6,7 (Fig. 1). Four chromosome assemblies
contain several 91/92-bp blocks; this may be the correct physical
placement of these sequences, or may reflect the difficulty in assembling
these highly repetitive regions.
Gene composition and repetitive DNA
A striking feature of the soybean genome is that 57% of the genomic
sequence occurs in repeat-rich, low-recombination heterochromatic
regions surrounding the centromeres. The average ratio of genetic-to-physical distance is 1 cM per 197 kb in euchromatic regions, and
1 cM per 3.5 Mb in heterochromatic regions (see Supplementary
Information section 1.8). For reference, these proportions are similar
to those in Sorghum, in which 62% of the sequence is heterochromatic, and different from those in rice, with 15% in heterochromatin8. In
1HudsonAlpha Genome Sequencing Center, 601 Genome Way, Huntsville, Alabama 35806, USA. 2Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, California 94598, USA.
3USDA-ARS Corn Insects and Crop Genetics Research Unit, Ames, Iowa 50011, USA. 4Department of Bioinformatics and Genomics, 9201 University City Blvd, University of North
Carolina at Charlotte, Charlotte, North Carolina 28223, USA. 5Department of Agronomy, Purdue University, 915 W. State Street, West Lafayette, Indiana 47906, USA. 6Center for
Integrative Genomics, University of California, Berkeley, California 94720, USA. 7Arizona Genomics Computational Laboratory, BIO5 Institute, 1657 E. Helen Street, The University of
Arizona, Tucson, Arizona 85721, USA. 8USDA, ARS, Soybean Genomics and Improvement Laboratory, B006, BARC-West, Beltsville, Maryland 20705, USA. 9Department Plant
Science and Landscape Architecture, University of Maryland, College Park, Maryland 20742, USA. 10Division of Biochemistry & Interdisciplinary Plant Group, 109 Christopher S. Bond
Life Sciences Center, University of Missouri, Columbia, Missouri 65211, USA. 11Department of Computer Science, University of Missouri, Columbia, Missouri 65211, USA. 12The
National Center for Genome Resources, 2935 Rodeo Park Drive East, Santa Fe, New Mexico 87505, USA. 13Arizona Genomics Institute, School of Plant Sciences, University of Arizona,
Tucson, Arizona 85721, USA. 14RIKEN Plant Science Center, Yokohama 230-0045, Japan. 15Department of Agronomy, Iowa State University, Ames, Iowa 50011, USA. 16Department of
Biology, University of Wisconsin-Stevens Point, Stevens Point, Wisconsin 54481, USA. 17National Center for Soybean Biotechnology, Division of Plant Sciences, University of Missouri,
Columbia, Missouri 65211, USA. 18Department of Agronomy and Horticulture, University of Nebraska, Lincoln, Nebraska 68583, USA.
©2010 Macmillan Publishers Limited. All rights reserved
[Figure 1 graphic: chromosomes Chr1-D1a, Chr2-D1b, Chr3-N, Chr4-C1, Chr5-A1, Chr6-C2, Chr7-M, Chr8-A2, Chr9-K, Chr10-O, Chr11-B1, Chr12-H, Chr13-F, Chr14-B2, Chr15-E, Chr16-J, Chr17-D2, Chr18-G, Chr19-L and Chr20-I drawn along a 0–60 Mb axis; legend: Genes, DNA transposons, Copia-like retrotransposons, Gypsy-like retrotransposons, Cen91/92.]
Figure 1 | Genomic landscape of the 20 assembled soybean chromosomes.
Major DNA components are categorized into genes (blue), DNA
transposons (green), Copia-like retrotransposons (yellow), Gypsy-like
retrotransposons (cyan) and Cent91/92 (a soybean-specific centromeric
repeat (pink)), with respective DNA contents of 18%, 17%, 13%, 30% and
1% of the genome sequence. Unclassified DNA content is coloured grey.
Categories were determined for 0.5-Mb windows with a 0.1-Mb shift.
general, these boundaries, determined on the basis of suppressed
recombination, correlate with transitions in gene density and transposon density. Ninety-three per cent of the recombination occurs in
the repeat-poor, gene-rich euchromatic genomic region that only
accounts for 43% of the genome. Nevertheless, 21.6% of the high-confidence genes are found in the repeat- and transposon-rich
regions in the chromosome centres.
We identified 46,430 high-confidence protein-coding loci in the
soybean genome, using a combination of full-length complementary
DNAs9, expressed sequence tags, homology and ab initio methods
(Supplementary Information section 2). Another ~20,000 loci were
predicted with lower confidence; this set is enriched for hypothetical,
partial and/or transposon-related sequences, and possesses shorter
coding sequences and fewer introns than the high-confidence set.
The exon–intron structure of genes shows high conservation among
soybean, poplar and grapevine, consistent with a high degree of position and phase conservation found more broadly across angiosperms10. Introns in soybean gene pairs retained in duplicate have a
strong tendency to persist. Of 19,775 introns shared by poplar and
grapevine (diverged more than 90 million years (Myr) ago11), and
hence by the last common ancestor of soybean and grapevine, 19,666
(99.45%) were preserved in both copies in soybean. Of the remaining
0.55%, 78% are absent in both recent soybean copies (that is, lost
before the ~13-Myr-ago duplication) and 22% are found only in one
paralogue (that is, other copy lost). We find a slower intron loss rate
in poplar (0.4%) than in soybean (0.6%) since the last common rosid
ancestor, which is consistent with the slower rate of sequence evolution in the poplar lineage thought to be associated with its perennial,
clonal habit, global distribution and wind pollination12. Intron size is
also highly conserved in recent soybean paralogues, indicating that
few insertions and deletions have accumulated within introns over
the past 13 Myr.
Of the 46,430 high-confidence loci, 34,073 (73%) are clearly orthologous with one or more sequences in other angiosperms, and can
be assigned to 12,253 gene families (Supplementary Table 5). Among
pan-angiosperm or pan-rosid gene families that also have members outside the legumes, soybean is particularly enriched (using a
Fisher’s exact test relative to Arabidopsis) in genes containing NB-ARC (nucleotide-binding-site-APAF1-R-Ced) and LRR (leucine-rich-repeat) domains. These genes are associated with the plant
immune system, and are known to be dynamic13. Tandem gene family
expansions are common in soybean and include NBS-LRR, F-box,
auxin-responsive protein, and other domains commonly found in
large gene families in plants. The ages of genes in these tandem families,
inferred from intrafamily sequence divergence, indicate that they originated at various times in the evolutionary history of soybean, rather
than in a discrete burst.
From protein families in the sequenced angiosperms (http://
www.phytozome.net) (Supplementary Table 4), we identified 283
putative legume-specific gene families containing 448 high-confidence
soybean genes (Supplementary Information section 2). These gene
families include soybean and Medicago representatives, but no representatives from grapevine, poplar, Arabidopsis, papaya, or grass
(Sorghum, rice, maize, Brachypodium). The top domains in this set
are the AP2 domain, protein kinase domain, cytochrome P450, and
PPR repeat. An additional 741 putatively soybean-specific gene families
(each consisting of two or more high-confidence soybean genes) may
also include legume-specific genes that have not yet been sequenced in
the ongoing Medicago sequencing project, or may represent bona fide
soybean-specific genes. The top domains in this list include protein
kinase and protein tyrosine kinase, AP2, LRR, MYB-like DNA binding
domain, cytochrome P450 (the same domains most common in the
entire soybean proteome) as well as GDSL-like lipase/acylhydrolase
and stress-upregulated Nod19.
A combination of structure-based analyses and homology-based
comparisons resulted in identification of 38,581 repetitive elements,
covering most types of plant transposable elements. These elements,
together with numerous truncated elements and other fragments,
make up ~59% of the soybean genome (Supplementary Table 6).
Long terminal repeat (LTR) retrotransposons are the most abundant
class of transposable elements. The soybean genome contains ~42%
LTR retrotransposons, fewer than Sorghum8 and maize14, but higher
than rice15. The intact element sizes range from 1 kb to 21 kb, with an
average size of 8.7 kb (Supplementary Fig. 2). Of the 510 families containing 14,106 intact elements, 69% are Gypsy-like and the remainder
Copia-like. However, most (~78%) of these families are present at low
copy numbers, typically fewer than 10 copies. The genome also contains an estimated 18,264 solo LTRs, probably caused by homologous
recombination between LTRs from a single element. Nested retrotransposons are common, with 4,552 nested insertion events identified. The
copy numbers within each block range from one to six.
The genome consists of ~17% transposable elements, divided into
Tc1/Mariner, haT, Mutator, PIF/Harbinger, Pong, CACTA superfamilies
and Helitrons. Of these superfamilies, those containing more than 65
complete copies, Tc1/Mariner and Pong, comprise ~0.1% of the
genome sequence, and seem to have not undergone recent amplification, indicating that they may be inactive and relatively old. Conversely,
other families seem to have amplified recently and may still be active,
indicated by the high similarity (>98%) of multiple elements.
Multiple whole-genome duplication events
Timing and phylogenetic position. A striking feature of the soybean
genome is the extent to which blocks of duplicated genes have been
retained. On the basis of previous studies that examined pairwise
synonymous distance (Ks values) of paralogues16,17, and targeted
sequencing of duplicated regions within the soybean genome18, we
expected that large homologous regions would be identified in the
genome. Using a pattern-matching search, gene families of sizes from
two to six were identified, and Ks values were calculated for these genes,
[Figure 2 graphic: panels a–c, circular plots of the 20 soybean chromosomes; histogram of Pairs (%) against Synonymous distance (0–1.5).]
Figure 2 | Homologous relationships between the 20 soybean
chromosomes. The bottom histogram plot shows pairwise Ks values for
gene family sizes 2 to 6. Top panels show the 20 chromosomes in a circle with
lines connecting homologous genes. Gene-rich regions (euchromatin) of
each chromosome are coded a different colour around the circle. Grey
represents Ks values of 0.06–0.39, 13-Myr genome duplication; black
represents Ks values of 0.40–0.80, 59-Myr genome duplication. These
correspond to the grey and black bars in the histogram. a, Chromosomes 1,
11, 17, 7, 10 and 3, which contain centromeric repeat Sb91. b, Chromosomes
19 and 6, which contain both Sb91 and Sb92 centromeric repeats.
c, Chromosomes 18, 5, 8, 2, 14, 12, 13, 9, 15, 16, 20 and 4, which contain
Sb92.
[Figure 3 graphic: y axis, Gene pairs on syntenic segments; series, Poplar–soybean, Grapevine–soybean, Arabidopsis–soybean, Rice–soybean, Medicago–soybean and Soybean–soybean.]
here displayed as a histogram plot (Fig. 2), which shows two distinct
peaks. Similarly, nucleotide diversity for the fourfold synonymous
third-codon transversion position, 4dTv, was calculated. Both metrics
give a measure of divergence between two genes, but the 4dTv uses a
subset of the sites (transitions/transversion) used in the computation of Ks. 31,264 high-confidence soybean genes have recent paralogues with Ks ≈ 0.13 synonymous substitutions per site and
4dTv ≈ 0.0566 synonymous transversions per site (Fig. 3), corresponding to a soybean-lineage-specific palaeotetraploidization. This was
probably an allotetraploidy event based on chromosomal evidence19.
Of the 46,430 high-confidence genes, 31,264 exist as paralogues and
15,166 have reverted to singletons. We infer that the pre-duplication
proto-soybean genome possessed ~30,000 genes: half of
(2 × 15,166 + 2 × 15,632) = 30,798. This number is comparable to
the modern Arabidopsis gene complement. A second paralogue peak
at Ks ≈ 0.59 (4dTv ≈ 0.26) corresponds to the early-legume duplication, which several lines of evidence suggest occurred near the origins of
the papilionoid lineage20. The papilionoid origin has been dated to
approximately 59 Myr ago21. A third highly diffuse peak is seen when
the plot is expanded past a Ks value of 1.5 (data not shown) and most
probably corresponds to the ‘gamma’ event22, shown to be a triplication
in Vitis23 and in other angiosperms24.
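The 4dTv metric used above can be illustrated with a minimal Python sketch. It assumes two coding sequences that are already aligned codon by codon without gaps, uses the eight fourfold-degenerate codon families of the standard genetic code, and applies the commonly used correction d = -1/2 ln(1 - 2p) for multiple substitutions. The text does not state which correction was applied, so that formula and the function name `four_dtv` are assumptions made for illustration only.

```python
import math

# First two bases of the eight fourfold-degenerate codon families
# in the standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}


def four_dtv(seq_a, seq_b):
    """Return (raw, corrected) 4dTv for two codon-aligned coding sequences.

    raw is the fraction of shared fourfold-degenerate third-codon sites
    that differ by a transversion; corrected applies -0.5 * ln(1 - 2 * raw)
    (an assumed correction) and is None when raw >= 0.5.
    """
    sites = 0
    transversions = 0
    usable = min(len(seq_a), len(seq_b))
    for i in range(0, usable - 2, 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        # Both codons must belong to the same fourfold-degenerate family.
        if codon_a[:2] != codon_b[:2] or codon_a[:2] not in FOURFOLD_PREFIXES:
            continue
        if codon_a[2] not in "ACGT" or codon_b[2] not in "ACGT":
            continue
        sites += 1
        # A transversion swaps a purine for a pyrimidine or vice versa.
        if (codon_a[2] in PURINES) != (codon_b[2] in PURINES):
            transversions += 1
    if sites == 0:
        raise ValueError("no shared fourfold-degenerate sites")
    raw = transversions / sites
    corrected = -0.5 * math.log(1 - 2 * raw) if raw < 0.5 else None
    return raw, corrected


# Toy example: four fourfold sites, one transversion (T versus A).
raw, corrected = four_dtv("GCTGGACCAACG", "GCAGGACCAACA")
print(raw, corrected)
```

In the toy example the last codon pair differs by a transition (G versus A), which is why it counts as a site but not as a transversion.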
Owing to the existence of macrofossils in the legumes and allies,
the timing of clade origins in the legumes is better established than
other plant families. A fossil-calibrated molecular clock for the
legumes places the origin of the legume stem clade and the oldest
papilionoid crown clade at 58 to 60 Myr ago21. If the early-legume
whole-genome duplication (WGD) occurred outside the papilionoid
lineage, as suggested by map evidence from Arachis (an early-diverging
[Figure 3 graphic: x axis, 4dTv distance (corrected for multiple substitutions).]
Figure 3 | Distribution of 4dTv distance between syntenically orthologous
genes. Segments were found by locating blocks of BLAST hits with
significance 1 × 10⁻¹⁸ or better with fewer than 10 intervening genes between
such hits. The 4dTv distance between orthologous genes on these segments
is reported.
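The segment definition in the Figure 3 legend can be sketched as a simple chaining step. The Python below is a simplified illustration, not the procedure actually used: it assumes hits between one pair of chromosomes, represented as hypothetical tuples of (gene index on segment A, gene index on segment B, BLAST E-value), and greedily extends a block while fewer than 10 genes intervene between consecutive hits on both sides.

```python
def syntenic_blocks(hits, max_evalue=1e-18, max_intervening=9, min_hits=2):
    """Group BLAST hits into blocks of nearby gene pairs.

    hits: iterable of (index_a, index_b, evalue), where the indices are
    gene order positions on the two chromosomes being compared.
    A block is extended while at most max_intervening genes lie between
    consecutive hits on both chromosomes (9 means "fewer than 10").
    """
    kept = sorted(h for h in hits if h[2] <= max_evalue)
    blocks = []
    current = []
    for hit in kept:
        if current:
            prev = current[-1]
            gap_a = hit[0] - prev[0] - 1
            gap_b = abs(hit[1] - prev[1]) - 1
            if gap_a > max_intervening or gap_b > max_intervening:
                # Close the current block and start a new one.
                if len(current) >= min_hits:
                    blocks.append(current)
                current = []
        current.append(hit)
    if len(current) >= min_hits:
        blocks.append(current)
    return blocks


# Toy data: the third hit fails the E-value cut-off, and the jump from
# gene 3 to gene 20 splits the remaining hits into two blocks.
example_hits = [
    (1, 101, 1e-30),
    (3, 104, 1e-20),
    (5, 103, 1e-5),
    (20, 120, 1e-40),
    (22, 121, 1e-25),
]
blocks = syntenic_blocks(example_hits)
print([len(block) for block in blocks])
```

A production tool would also handle inversions, multiple chromosome pairs and tandem duplicates; the greedy pass here only conveys the idea of the E-value and gap thresholds.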
genus in the papilionoid clade)20, then the duplication occurred within
the narrow window of time between the origin of the legumes and the
papilionoid radiation. If the older duplication is assumed to have
occurred around 58 Myr ago, then the calculated rate of silent mutations extending back to the duplication would be 5.17 × 10⁻³, similar
to previous estimates of 5.2 × 10⁻³ (ref. 21). The Glycine-specific
duplication is estimated to have occurred ~13 Myr ago, an age consistent with previous estimates16,17.
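The dating logic can be checked with a short calculation. The sketch below assumes the standard molecular-clock relation T = Ks / (2r), where r is the silent substitution rate per site per Myr. The text does not spell out the formula or the units of the rate, so both are assumptions, and the function name is illustrative.

```python
def duplication_age_myr(ks, rate_per_site_per_myr):
    """Age of a duplication from paralogue divergence, T = Ks / (2 * r).

    The factor of two reflects that both copies accumulate
    substitutions independently after the duplication.
    """
    return ks / (2.0 * rate_per_site_per_myr)


RATE = 5.17e-3  # silent substitutions per site per Myr (units assumed)

glycine_age = duplication_age_myr(0.13, RATE)       # recent Ks peak
early_legume_age = duplication_age_myr(0.59, RATE)  # older Ks peak

print(f"Glycine duplication: about {glycine_age:.1f} Myr ago")
print(f"Early-legume duplication: about {early_legume_age:.1f} Myr ago")
```

Under these assumptions the recent peak falls close to the ~13 Myr estimate, and the older peak lands near the 58 to 60 Myr window given for the papilionoid origin.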
Structural organization. We identified homologous blocks within
the genome using i-ADHoRe25. Using relatively stringent parameters,
442 multiplicons (that is, duplicated segments) were identified
within the soybean genome and visualized using Circos26 (Fig. 2).
Owing to the multiple rounds of duplication and diploidization in
the genome, as well as chromosomal rearrangements, multiplicons
(or blocks) between chromosomes can involve more than just two
chromosomes. On average, 61.4% of the homologous genes are
found in blocks involving only two chromosomes, only 5.63% spanning three chromosomes, and 21.53% traversing four chromosomes.
Two notable exceptions to this pattern are chromosome 14, which
has 11.8% of its genes retained across three chromosomes, and chromosome 20 with 7.08% of the homologues (gene pairs resulting from
genome duplication) retained across four chromosomes. Chromosome 14 seems to be a highly fragmented chromosome with block
matches to 14 other chromosomes, the highest number of all chromosomes. Conversely, chromosome 20 is highly homologous to the
long arm of chromosome 10, with few matches elsewhere in the
genome.
Retention of homologues across the genome is exceptionally high;
blocks retained in two or more chromosomes can be clearly observed
(Fig. 2 and Supplementary Figs 5 and 6). The number of homologues
(gene pairs) within a block averages 31, although any given block may
contain from 6 to 736 homologues. Given that not all genes within a
block are retained as homologues (owing to loss of duplicated genes
over time (fractionation)), the average number of genes in a block is
~75 and ranges from 8 to 1,377.
Repeated duplications in the soybean genome make it possible to
determine rates of gene loss following each round of polyploidy. In
homologous segments from the 13-Myr-old Glycine duplication,
43.4% of genes have matches in the corresponding region, in contrast
to 25.9% in blocks from the early legume duplication. Combining
these gene-loss rates with WGD dates of 13 Myr ago and 59 Myr ago,
the rate of gene loss has been 4.36% of genes per Myr following the
Glycine WGD and 1.28% of genes per Myr following the early-legume
WGD. This differential in gene-loss rates indicates an exponential
decay pattern of rapid gene loss after duplication, slowing over time.
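The reported loss rates are consistent with a simple average over the time since each duplication. The sketch below assumes the rate is computed as (100% minus the percentage of genes retained) divided by the age of the duplication. The text does not give the formula, so this is an illustrative reconstruction, and small differences from the published 4.36% and 1.28% may reflect rounding or the exact date used.

```python
def loss_rate_percent_per_myr(retained_percent, age_myr):
    """Average percentage of duplicated genes lost per Myr."""
    return (100.0 - retained_percent) / age_myr


glycine_rate = loss_rate_percent_per_myr(43.4, 13)       # Glycine WGD
early_legume_rate = loss_rate_percent_per_myr(25.9, 59)  # early-legume WGD

print(f"Glycine WGD: {glycine_rate:.2f}% of genes per Myr")
print(f"Early-legume WGD: {early_legume_rate:.2f}% of genes per Myr")
```

Because the averaged rate is several times higher for the younger event, a constant-rate model does not fit both duplications, which is the basis for the decay pattern described above.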
Table 1 | Putative acyl lipid genes in Arabidopsis and soybean
Columns: function category of acyl lipid genes; number in Arabidopsis; number in soybean
Nodulation and oil biosynthesis genes
A unique feature of legumes is their ability to establish nitrogen-fixing symbioses with soil bacteria of the family Rhizobiaceae.
Therefore, information on the nodulation functions of the soybean
genome is of particular interest. Sequence comparisons with previously identified nodulation genes identified 28 nodulin genes and 24
key regulatory genes, which probably represent true orthologues of
known nodulation genes in other legume species (Supplementary
section 3 and Supplementary Table 8). Among this list of 52 genes,
32 have at least one highly conserved homologous gene. We hypothesize
that these are homologous gene pairs arising from the Glycine WGD
(that is, ~13 Myr ago). Further analysis shows that seven soybean
nodulin genes produce transcript variants. The exceptional example
is nodulin-24 (Glyma14g05690), which seems to produce ten transcript variants (Supplementary Table 8). In total, 25% of the examined
nodulin genes produce transcript variants, which is slightly higher than
the incidence of alternative splicing in Arabidopsis (~21.8%) and rice
(~21.2%)27. However, none of the soybean regulatory nodulation
genes produces transcript variants (Supplementary Table 8).
Mining the soybean genome for genes governing metabolic steps
in triacylglycerol biosynthesis could prove beneficial in efforts to
modify soybean oil composition or content. Genomic analysis of acyl
lipid biosynthesis in Arabidopsis revealed 614 genes involved in pathways beginning with plastid acetyl-CoA production for de novo fatty
acid synthesis through cuticular wax deposition28. Comparison of
these sequences to the soybean genome identified 1,127 putative
orthologous and paralogous genes in soybean. This is probably a
low estimate owing to the high stringency conditions used for gene
mining. The distribution of these genes according to various functional classes of acyl lipid biosynthesis is shown in Table 1.
Comparing Arabidopsis to soybean, the number of genes involved
in storage lipid synthesis, fatty acid elongation and wax/cutin production was similar. For all other subclasses, the soybean genome
contained substantially higher numbers of genes. Interestingly, the
number of genes involved in lipid signalling, degradation of storage
lipids, and membrane lipid synthesis were two- to threefold higher in
soybean than Arabidopsis, indicating that these areas of acyl lipid
synthesis are more complex in soybean. The number of genes
involved in plastid de novo fatty acid synthesis was 63% higher in
soybean compared to Arabidopsis. Many single-gene activities in
Function category of acyl lipid genes | Number in Arabidopsis | Number in soybean
Synthesis of fatty acids in plastids | 46 | 75
Synthesis of membrane lipids in plastids | 20 | 33
Synthesis of membrane lipids in endomembrane system | 56 | 117
Metabolism of acyl lipids in mitochondria | 29 | 69
Synthesis and storage of oil | 19 | 22
Degradation of storage lipids and straight fatty acids | 43 | 155
Lipid signalling | 153 | 312
Fatty acid elongation and wax and cutin metabolism | 73 | 70
Miscellaneous | 175 | 274
Total | 614 | 1,127
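The comparison drawn from Table 1 can be reproduced with a short script. This is a minimal sketch that hard-codes the counts from the table and reports the soybean-to-Arabidopsis ratio for each category; the variable names are illustrative.

```python
# (Arabidopsis count, soybean count) for each Table 1 category.
TABLE_1 = {
    "Synthesis of fatty acids in plastids": (46, 75),
    "Synthesis of membrane lipids in plastids": (20, 33),
    "Synthesis of membrane lipids in endomembrane system": (56, 117),
    "Metabolism of acyl lipids in mitochondria": (29, 69),
    "Synthesis and storage of oil": (19, 22),
    "Degradation of storage lipids and straight fatty acids": (43, 155),
    "Lipid signalling": (153, 312),
    "Fatty acid elongation and wax and cutin metabolism": (73, 70),
    "Miscellaneous": (175, 274),
}

# Soybean count divided by Arabidopsis count for each category.
ratios = {name: soy / ara for name, (ara, soy) in TABLE_1.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{ratio:5.2f}x  {name}")

arabidopsis_total = sum(ara for ara, _ in TABLE_1.values())
soybean_total = sum(soy for _, soy in TABLE_1.values())
print(arabidopsis_total, soybean_total)
```

The column sums match the totals in the table (614 and 1,127), and the ratio for plastid fatty acid synthesis is about 1.63, in line with the 63% difference quoted in the text.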
[Figure 4a graphic labels (soybean): ABI3/VP1, 78; ZF-HD, 54; Other TFs, 561.]
Transcription factor diversity
We identified soybean transcription factor genes by sequence comparison to known transcription factor gene families, as well as by
searching for known DNA-binding domains. In total, 5,671 putative
soybean transcription factor genes, distributed in 63 families, were
identified (Fig. 4a and Supplementary Table 9). This number represents 12.2% of the 46,430 predicted soybean protein-coding loci.
A similar analysis performed on the Arabidopsis genome identified
2,315 putative Arabidopsis transcription factor genes, representing
7.1% of the 32,825 predicted Arabidopsis protein-coding loci
(Fig. 4b). Transcription factor genes are homogeneously distributed
across the chromosomes in both soybean and Arabidopsis, with an
average relative abundance of 8–10% transcription factor genes on
each chromosome. On rare occasions, regions were identified in both
genomes that had a relatively low (<5%) or high density (>12%) of
transcription factor genes. Among the transcription factor genes identified, 9.5% of soybean genes (538 transcription factor genes) and
8.2% of Arabidopsis genes (190 Arabidopsis transcription factor genes)
[Figure 4a graphic labels (soybean): AP2-EREBP, 381; AS2, 92; WRKY, 197.]
Arabidopsis are encoded by multigene families in soybean, including
ketoacyl-ACP synthase II (12 copies in soybean), malonyl-CoA:ACP
malonyltransferase (2 copies), enoyl-ACP reductase (5 copies), acyl-ACP thioesterase FatB (6 copies) and plastid homomeric acetyl-CoA
carboxylase (3 copies). Long-chain acyl-CoA synthetases, ER acyltransferases, mitochondrial glycerol-phosphate acyltransferases, and
lipoxygenases are all unusually large gene families in soybean, containing as many as 24, 21, 20 and 52 members, respectively. The
multigenic nature of these and many other activities involved in acyl
lipid metabolism suggests the potential for more complex transcriptional control in soybean compared to Arabidopsis.
[Figure 4a graphic labels (soybean): AUX-IAA-ARF, 129; TPR, 319; bHLH, 393; TCP, 65; Bromodomain, 57; SNF2, 69; BTB/POZ, 145; BZIP, 176; PHD, 222; C2C2 (Zn) CO-like, 72; C2C2 (Zn) Dof, 82; C2C2 (Zn) GATA, 62; NAC, 208; C2H2 (Zn), 395; MYB/HD-like, 726; C3H-type1 (Zn), 147; CCAAT, 106; MYB, 65; Jumonji, 77; MADS, 212; CCHC (Zn), 144; GRAS, 130; Homeodomain/Homeobox, 319.]
[Figure 4b graphic labels (Arabidopsis): ABI3/VP1, 71; AP2-EREBP, 146; ZF-HD, 17; Other TFs, 241; AS2, 43; WRKY, 73; AUX-IAA-ARF, 51; TPR, 65; TCP, 6; bHLH, 172; Bromodomain, 29; SNF2, 33; PHD, 55; BTB/POZ, 98; NAC, 114; BZIP, 78; C2C2 (Zn) CO-like, 34; C2C2 (Zn) Dof, 36; C2C2 (Zn) GATA, 29; MYB/HD-like, 279; C2H2 (Zn), 173.]
Figure 4 | Distribution of soybean (a) and Arabidopsis (b) transcription
factor genes in different transcription factor families. Only the distribution
of the most representative transcription families is detailed here. AUX-IAA-ARF, indole-3-acetic acid-auxin response factor; BTB/POZ, bric-à-brac
tramtrack broad complex/pox viruses and zinc fingers; BZIP, basic leucine
zipper; GRAS, (GAI, RGA, SCR); NAC, (NAM, ATAF1/2, CUC2); PHD,
plant homeodomain-finger transcription factor; TCP, (TB1, CYC, PCF);
TFs, transcription factors; TPR, tetratricopeptide repeat; WRKY, conserved
amino acid sequence WRKYGQK at its N-terminal end.
[Figure 4b graphic labels (Arabidopsis): C3H-type1 (Zn), 69; MYB, 24; CCAAT, 38; CCHC (Zn), 66; Homeodomain/Homeobox, 112; Jumonji, 21; MADS, 109; GRAS, 33.]
are tandemly duplicated. By way of example, only one region in
Arabidopsis has more than five duplicated transcription factor genes
in tandem (seven ABI3/VP1 genes (At4G31610 to At4G31660)),
whereas in soybean several such regions are present (for example, 13
C3H-type 1 (Zn) (Glyma15g19120 to Glyma15g19240); six MYB/
HD-like (Glyma06g45520 to Glyma06g45570); and five MADS
(Glyma20g27320 to Glyma20g27360); Supplementary Table 8). The
overall distribution of transcription factor genes among the
various known protein families is very similar between Arabidopsis
and soybean (Supplementary Fig. 10a, b). However, some families are
relatively sparser or more abundant in soybean, perhaps reflecting
differences in biological function. For example, members of the
ABI3/VP1 family are 2.2-times more abundant in Arabidopsis,
whereas members of the TCP family are 4.4-times more abundant
in soybean. In addition, those gene families with fewer members are
differentially represented between soybean and Arabidopsis. FHA,
HD-Zip (homeodomain/leucine zipper), PLATZ, SRS and TUB transcription factor genes are more abundant in soybean (2.7, 2.9, 4.1, 3,
and 4.9 times, respectively) and HTH-ARAC (helix–turn–helix araC/
xylS-type) genes were identified exclusively in soybean. In contrast,
HSF, HTH-FIS (helix–turn–helix-factor for inversion stimulation),
TAZ and U1-type (Zn) genes are present in relatively larger numbers
in Arabidopsis (5.4, 4.9, 24.5 and 2.9 times, respectively). Notably,
ABI3/VP1, TCP, SRS and Tubby transcription factor genes have all been
shown to have critical roles in plant development (for example, ABI3/
VP1 during seed development; TCP, SRS and Tubby affect overall
plant development29–33). The differences seen in relative transcription
factor gene abundance indicate that regulatory pathways in soybean
may differ from those described in Arabidopsis.
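The relative-abundance figures quoted above can be checked from the family counts. The sketch below assumes that "relative abundance" means a family's share of all transcription factor genes in that species, and it takes the ABI3/VP1 and TCP counts from the Figure 4 labels. Assigning the larger count of each pair to soybean is an assumption based on the figure residue, not something stated in the text.

```python
SOYBEAN_TF_TOTAL = 5671
ARABIDOPSIS_TF_TOTAL = 2315
SOYBEAN_LOCI = 46430
ARABIDOPSIS_LOCI = 32825

# Family counts read from the Figure 4 labels (panel assignment assumed).
FAMILY_COUNTS = {
    "ABI3/VP1": {"soybean": 78, "arabidopsis": 71},
    "TCP": {"soybean": 65, "arabidopsis": 6},
}


def share(count, total):
    """Fraction of a species' transcription factor genes in one family."""
    return count / total


def fold_enrichment_in_soybean(family):
    """Soybean share divided by Arabidopsis share for one family."""
    counts = FAMILY_COUNTS[family]
    soy = share(counts["soybean"], SOYBEAN_TF_TOTAL)
    ara = share(counts["arabidopsis"], ARABIDOPSIS_TF_TOTAL)
    return soy / ara


tcp_fold = fold_enrichment_in_soybean("TCP")
abi3_fold_in_arabidopsis = 1 / fold_enrichment_in_soybean("ABI3/VP1")
soybean_tf_percent = 100 * SOYBEAN_TF_TOTAL / SOYBEAN_LOCI
arabidopsis_tf_percent = 100 * ARABIDOPSIS_TF_TOTAL / ARABIDOPSIS_LOCI

print(f"TCP: {tcp_fold:.1f} times more abundant in soybean")
print(f"ABI3/VP1: {abi3_fold_in_arabidopsis:.1f} times more abundant in Arabidopsis")
print(f"TF genes: {soybean_tf_percent:.1f}% of soybean loci, "
      f"{arabidopsis_tf_percent:.1f}% of Arabidopsis loci")
```

Normalizing by each species' total matters here: soybean has more ABI3/VP1 genes in absolute terms (78 versus 71), yet the family makes up a smaller share of its transcription factor complement.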
Impact on agriculture
Hundreds of qualitatively inherited (single gene) traits have been
characterized in soybean and many genetically mapped. However,
most important crop production traits and those important to seed
quality for human health, animal nutrition and biofuel production
are quantitatively inherited. The regions of the genome containing
DNA sequence affecting these traits are called quantitative trait loci
(QTL). QTL mapping studies have been ongoing for more than 90
distinct traits of soybean including plant developmental and reproductive characters, disease resistance, seed quality and nutritional
traits. In most cases, the causal functional gene or transcription factor
underlying the QTL is unknown. However, the integration of the
whole genome sequence with the dense genetic marker map that
now exists in soybean2–5 (http://www.Soybase.org) will allow the
association of mapped phenotypic effectors with the causal DNA
sequence. There are already examples where the availability of the
soybean genomic sequence has accelerated these discovery efforts.
Having access to the sequence allowed cloning and identification of
the rsm1 (raffinose synthase) mutation that can be used to select for
low-stachyose-containing soybean lines that will improve the ability
of animals and humans to digest soybeans34. Using a comparative
genomics approach between soybean and maize, a single-base mutation was found that causes a reduction in phytate production in
soybean35. Phytate reduction could result in a reduction of a major
environmental runoff contaminant from swine and poultry waste.
Perhaps most exciting for the soybean community, the first resistance
gene for the devastating disease Asian soybean rust (ASR) has been
cloned with the aid of the soybean genomic sequence and confirmed
with viral-induced gene silencing36. In countries where ASR is well
established, soybean yield losses due to the disease can range from
10% to 80%36 and the development of soybean strains resistant to
ASR will greatly benefit world soybean production.
Soybean, one of the most important global sources of protein and
oil, is now the first legume species with a complete genome sequence. It
is, therefore, a key reference for the more than 20,000 legume species,
and for the remarkable evolutionary innovation of nitrogen-fixing
symbiosis. This genome, with a common ancestor only 20 million years
removed from many other domesticated bean species, will allow us to
knit together knowledge about traits observed and mapped in all of the
beans and relatives. The genome sequence is also an essential framework for vast new experimental information such as tissue-specific
expression and whole-genome association data. With knowledge of
this genome’s billion-plus nucleotides, we approach an understanding
of the plant’s capacity to turn carbon dioxide, water, sunlight and
elemental nitrogen and minerals into concentrated energy, protein
and nutrients for human and animal use. The genome sequence opens
the door to crop improvements that are needed for sustainable human
and animal food production, energy production and environmental
balance in agriculture worldwide.
METHODS SUMMARY
Seeds from cultivar Williams 82 were grown in a growth chamber for 2 weeks and
etiolated for 5 days before harvest. A standard phenol/chloroform leaf extraction
was performed. DNA was treated with RNase A and proteinase K and precipitated with ethanol.
All sequencing reads were collected with Sanger sequencing protocols on ABI
3730XL capillary sequencing machines, a majority at the Joint Genome Institute
in Walnut Creek, California.
A total of 15,332,163 sequence reads were assembled using Arachne
v.20071016 (ref. 37) to form 3,363 scaffolds covering 969.6 Mb of the soybean
genome. The resulting assembly was integrated with the genetic and physical
maps previously built for soybean and a newly constructed genetic map to
produce 20 chromosome-scale scaffolds covering 937.3 Mb and an additional
1,148 unmapped scaffolds that cover 17.7 Mb of the genome.
Genes were annotated using Fgenesh+ (ref. 38) and GenomeScan39 informed by EST
alignments and peptide matches to genome from Arabidopsis, rice and grapevine.
Models were reconciled with EST alignments and UTR added using PASA40. Models
were filtered for high confidence by penalizing genes which were transposable-element-related, had low sequence entropy, short introns, incomplete start or stop,
low C-score, no UniGene hit at 1 × 10⁻⁵, or the model was less than 30% the length
of its best hit.
LTR retrotransposons were identified by the program LTR_STRUC41, manually inspected to check structure features and classified into distinct families based
on the similarities to LTR sequences. DNA transposons were identified using
conserved protein domains as queries in TBLASTN42 searches of the genome.
Identified elements were used as a custom library for RepeatMasker (current
version: open 3.2.8; http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker)
to detect missed intact elements, truncated elements and fragments.
Virtual suffix trees with six-frame translation were generated using Vmatch43 and
then clustered into families. Pairwise alignments between gene family members
were performed using ClustalW44. Identification of homologous blocks was performed using i-ADHoRe v2.1 (ref. 25). Visualization of blocks was performed with
Circos26.
Received 19 August; accepted 12 November 2009.
1. Arumuganathan, K. & Earle, E. D. Nuclear DNA content of some important plant
species. Plant Mol. Biol. Rep. 9, 208–218 (1991).
2. Choi, I. Y. et al. A soybean transcript map: gene distribution, haplotype and single-nucleotide polymorphism analysis. Genetics 176, 685–696 (2007).
3. Hyten, D. L. et al. High-throughput SNP discovery through deep resequencing of a
reduced representation library to anchor and orient scaffolds in the soybean
whole genome sequence. BMC Genomics (in the press).
4. Hyten, D. L. et al. A high density integrated genetic linkage map of soybean and the
development of a 1,536 Universal Soy Linkage Panel for QTL mapping. Crop Sci.
(in the press).
5. Song, Q. J. et al. A new integrated genetic linkage map of the soybean. Theor. Appl.
Genet. 109, 122–128 (2004).
6. Lin, J. Y. et al. Pericentromeric regions of soybean (Glycine max L. Merr.)
chromosomes consist of retroelements and tandemly repeated DNA and are
structurally and evolutionarily labile. Genetics 170, 1221–1230 (2005).
7. Vahedian, M. et al. Genomic organization and evolution of the soybean SB92
satellite sequence. Plant Mol. Biol. 29, 857–862 (1995).
8. Paterson, A. H. et al. The Sorghum bicolor genome and the diversification of
grasses. Nature 457, 551–556 (2009).
9. Umezawa, T. et al. Sequencing and analysis of approximately 40,000 soybean
cDNA clones from a full-length-enriched cDNA library. DNA Res. 15, 333–346
(2008).
10. Roy, S. W. & Penny, D. Patterns of intron loss and gain in plants: intron loss-dominated evolution and genome-wide comparison of O. sativa and A. thaliana.
Mol. Biol. Evol. 24, 171–181 (2007).
11. Wang, H. et al. Rosid radiation and the rapid rise of angiosperm-dominated
forests. Proc. Natl Acad. Sci. USA 106, 3853–3858 (2009).
12. Tuskan, G. A. et al. The genome of black cottonwood, Populus trichocarpa (Torr. &
Gray). Science 313, 1596–1604 (2006).
13. Michelmore, R. & Meyers, B. C. Clusters of resistance genes in plants evolve by
divergent selection and a birth-and-death process. Genome Res. 8, 1113–1130
(1998).
14. Bruggmann, R. et al. Uneven chromosome contraction and expansion in the maize
genome. Genome Res. 16, 1241–1251 (2006).
15. Ma, J., Devos, K. M. & Bennetzen, J. L. Analyses of LTR-retrotransposon structures
reveal recent and rapid genomic DNA loss in rice. Genome Res. 14, 860–869
(2004).
16. Pfeil, B. E., Schlueter, J. A., Shoemaker, R. C. & Doyle, J. J. Placing paleopolyploidy
in relation to taxon divergence: a phylogenetic analysis in legumes using 39 gene
families. Syst. Biol. 54, 441–454 (2005).
17. Schlueter, J. A. et al. Mining EST databases to resolve evolutionary events in major
crop species. Genome 47, 868–876 (2004).
18. Schlueter, J. A., Scheffler, B. E., Jackson, S. & Shoemaker, R. C. Fractionation of
synteny in a genomic region containing tandemly duplicated genes across Glycine
max, Medicago truncatula, and Arabidopsis thaliana. J. Hered. 99, 390–395 (2008).
19. Gill, N. et al. Molecular and chromosomal evidence for allopolyploidy in soybean,
Glycine max (L.) Merr. Plant Physiol. 151, 1167–1174 (2009).
20. Bertioli, D. J. et al. An analysis of synteny of Arachis with Lotus and Medicago sheds
new light on the structure, stability and evolution of legume genomes. BMC
Genomics 10, 45 (2009).
21. Lavin, M., Herendeen, P. S. & Wojciechowski, M. F. Evolutionary rates analysis of
Leguminosae implicates a rapid diversification of lineages during the tertiary. Syst.
Biol. 54, 575–594 (2005).
22. Bowers, J. E., Chapman, B. A., Rong, J. & Paterson, A. H. Unravelling angiosperm
genome evolution by phylogenetic analysis of chromosomal duplication events.
Nature 422, 433–438 (2003).
23. Jaillon, O. et al. The grapevine genome sequence suggests ancestral
hexaploidization in major angiosperm phyla. Nature 449, 463–467 (2007).
24. Tang, H. et al. Unraveling ancient hexaploidy through multiply-aligned
angiosperm gene maps. Genome Res. 18, 1944–1954 (2008).
25. Simillion, C., Janssens, K., Sterck, L. & Van de Peer, Y. i-ADHoRe 2.0: an improved
tool to detect degenerated genomic homology using genomic profiles.
Bioinformatics 24, 127–128 (2008).
26. Krzywinski, M. et al. Circos: An information aesthetic for comparative genomics.
Genome Res 19, 1639–1645 (2009).
27. Wang, B. B. & Brendel, V. Genomewide comparative analysis of alternative
splicing in plants. Proc. Natl Acad. Sci. USA 103, 7175–7180 (2006).
28. Beisson, F. et al. Arabidopsis genes involved in acyl lipid metabolism. A 2003
census of the candidates, a study of the distribution of expressed sequence tags in
organs, and a web-based database. Plant Physiol. 132, 681–697 (2003).
29. Fridborg, I., Kuusk, S., Moritz, T. & Sundberg, E. The Arabidopsis dwarf mutant shi
exhibits reduced gibberellin responses conferred by overexpression of a new
putative zinc finger protein. Plant Cell 11, 1019–1032 (1999).
30. Barkoulas, M., Galinha, C., Grigg, S. P. & Tsiantis, M. From genes to shape:
regulatory interactions in leaf development. Curr. Opin. Plant. Biol. 10, 660–666
(2007).
31. Lai, C. P. et al. Molecular analyses of the Arabidopsis TUBBY-like protein gene
family. Plant Physiol. 134, 1586–1597 (2004).
32. Herve, C. et al. In vivo interference with AtTCP20 function induces severe plant
growth alterations and deregulates the expression of many genes important for
development. Plant Physiol. 149, 1462–1477 (2009).
33. Stone, S. L. et al. LEAFY COTYLEDON2 encodes a B3 domain transcription factor
that induces embryo development. Proc. Natl Acad. Sci. USA 98, 11806–11811
(2001).
34. Skoneczka, J., Saghai Maroof, M. A., Shang, C. & Buss, G. R. Identification of
candidate gene mutation associated with low stachyose phenotype in soybean
line PI 200508. Crop Sci. 49, 247–255 (2009).
35. Saghai Maroof, M. A., Glover, N. M., Biyashev, R. M., Buss, G. R. & Grabau, E. A.
Genetic basis of the low-phytate trait in the soybean line CX1834. Crop Sci. 49,
69–76 (2009).
36. Meyer, J. D. F. et al. Identification and analyses of candidate genes for Rpp4-mediated resistance to Asian soybean rust in soybean (Glycine max (L.) Merr.).
Plant Physiol. 150, 295–307 (2009).
37. Jaffe, D. B. et al. Whole-genome sequence assembly for mammalian genomes:
Arachne 2. Genome Res. 13, 91–96 (2003).
38. Salamov, A. A. & Solovyev, V. V. Ab initio gene finding in Drosophila genomic DNA.
Genome Res. 10, 516–522 (2000).
39. Yeh, R. F., Lim, L. P. & Burge, C. B. Computational inference of homologous gene
structures in the human genome. Genome Res. 11, 803–816 (2001).
40. Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal
transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).
41. McCarthy, E. M. & McDonald, J. F. LTR_STRUC: a novel search and identification
program for LTR retrotransposons. Bioinformatics 19, 362–367 (2003).
42. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
43. Beckstette, M., Homann, R., Giegerich, R. & Kurtz, S. Fast index based algorithms
and software for matching position specific scoring matrices. BMC Bioinformatics
7, 389 (2006).
44. Thompson, J. D., Higgins, D. G. & Gibson, T. J. CLUSTAL W: improving the
sensitivity of progressive multiple sequence alignment through sequence
weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids
Res. 22, 4673–4680 (1994).
Supplementary Information is linked to the online version of the paper at
www.nature.com/nature.
Acknowledgements We thank N. Weeks for informatics support and C. Gunter for
critical reading of the manuscript. We acknowledge funding from the National
Science Foundation (DBI-0421620 to G.S.; DBI-0501877 and 082225 to S.A.J.)
and the United Soybean Board.
Author Contributions Sequencing, assembly and integration: J. Schmutz, S.B.C., J.
Schlueter, W.N., U.H., E.L., M.P., D. Grant, S.S., D. Goodstein, K.B., A.S., J.G. and D.R.
Annotation: J.M., T.M., J.J.T., J.C., D.X., J.D., Z.T., L.Z., N.G., T.J., M.L., X.-C.Z. and
G.S. EST sequencing: G.D.M., T.S., T.U., M.B., D.S., B.V., K.S. and H.T.N. Physical
mapping: Y.Y., M.F.G., R.A.W. and R.C.S. Genetic mapping: D.H., J. Specht, Q.S. and
P.C. Writing/coordination: S.A.J.
Author Information This whole-genome shotgun project has been deposited at
DDBJ/EMBL/GenBank under the accession ACUP00000000. The version
described here is the first version, ACUP01000000. Full annotation is available at
http://www.phytozome.net. Reprints and permissions information is available at
www.nature.com/reprints. The authors declare no competing financial interests.
This paper is distributed under the terms of the Creative Commons
Attribution-Non-Commercial-Share Alike licence, and is freely available to all
readers at www.nature.com/nature. Correspondence and requests for materials
should be addressed to S.A.J. ([email protected]).
Vol 463 | 11 February 2010 | doi:10.1038/nature08747
ARTICLES
Genome sequencing and analysis of the
model grass Brachypodium distachyon
The International Brachypodium Initiative*
Three subfamilies of grasses, the Ehrhartoideae, Panicoideae and Pooideae, provide the bulk of human nutrition and are
poised to become major sources of renewable energy. Here we describe the genome sequence of the wild grass
Brachypodium distachyon (Brachypodium), which is, to our knowledge, the first member of the Pooideae subfamily to be
sequenced. Comparison of the Brachypodium, rice and sorghum genomes shows a precise history of genome evolution across
a broad diversity of the grasses, and establishes a template for analysis of the large genomes of economically important
pooid grasses such as wheat. The high-quality genome sequence, coupled with ease of cultivation and transformation, small
size and rapid life cycle, will help Brachypodium reach its potential as an important model system for developing new energy
and food crops.
Grasses provide the bulk of human nutrition, and highly productive
grasses are promising sources of sustainable energy1. The grass family
(Poaceae) comprises over 600 genera and more than 10,000 species
that dominate many ecological and agricultural systems2,3. So far,
genomic efforts have largely focused on two economically important
grass subfamilies, the Ehrhartoideae (rice) and the Panicoideae
(maize, sorghum, sugarcane and millets). The rice4 and sorghum5
genome sequences and a detailed physical map of maize6 showed
extensive conservation of gene order5,7 and both ancient and relatively recent polyploidization.
Most cool season cereal, forage and turf grasses belong to the
Pooideae subfamily, which is also the largest grass subfamily. The
genomes of many pooids are characterized by daunting size and
complexity. For example, the bread wheat genome is approximately
17,000 megabases (Mb) and contains three independent genomes8.
This has prohibited genome-scale comparisons spanning the three
most economically important grass subfamilies.
Brachypodium, a member of the Pooideae subfamily, is a wild
annual grass endemic to the Mediterranean and Middle East9 that
has promise as a model system. This has led to the development of
highly efficient transformation10,11, germplasm collections12–14, genetic
markers14, a genetic linkage map15, bacterial artificial chromosome
(BAC) libraries16,17, physical maps18 (M.F., unpublished observations),
mutant collections (http://brachypodium.pw.usda.gov, http://www.
brachytag.org), microarrays and databases (http://www.brachybase.
org, http://www.phytozome.net, http://www.modelcrop.org, http://
mips.helmholtz-muenchen.de/plant/index.jsp) that are facilitating
the use of Brachypodium by the research community. The genome
sequence described here will allow Brachypodium to act as a powerful
functional genomics resource for the grasses. It is also an important
advance in grass structural genomics, permitting, for the first time,
whole-genome comparisons between members of the three most economically important grass subfamilies.
Genome sequence assembly and annotation
The diploid inbred line Bd21 (ref. 19) was sequenced using whole-genome shotgun sequencing (Supplementary Table 1). The ten largest
scaffolds contained 99.6% of all sequenced nucleotides (Supplementary Table 2). Comparison of these ten scaffolds with a genetic map
(Supplementary Fig. 1) detected two false joins and created a further
seven joins to produce five pseudomolecules that spanned 272 Mb
(Supplementary Table 3), within the range measured by flow cytometry20,21. The assembly was confirmed by cytogenetic analysis (Supplementary Fig. 2) and alignment with two physical maps and
sequenced BACs (Supplementary Data). More than 98% of expressed
sequence tags (ESTs) mapped to the sequence assembly, consistent
with a near-complete genome (Supplementary Table 4 and Supplementary Fig. 3). Compared to other grasses, the Brachypodium
genome is very compact, with retrotransposons concentrated at the
centromeres and syntenic breakpoints (Fig. 1). DNA transposons and
derivatives are broadly distributed and primarily associated with gene-rich regions.
We analysed small RNA populations from inflorescence tissues
with deep Illumina sequencing, and mapped them onto the genome
sequence (Fig. 2a, Supplementary Fig. 4 and Supplementary Table 5).
Small RNA reads were most dense in regions of high repeat density,
similar to the distribution reported in Arabidopsis22. We identified
413 and 198 21- and 24-nucleotide phased short interfering RNA
(siRNA) loci, respectively. Using the same algorithm, the only phased
loci identified in Arabidopsis were five of the eight trans-acting siRNA
loci, and none was 24-nucleotide phased. The biological functions of
these clusters of Brachypodium phased siRNAs, which account for a
significant number of small RNAs that map outside repeat regions,
are not known at present.
A total of 25,532 protein-coding gene loci was predicted in the v1.0
annotation (Supplementary Information and Supplementary Table 6).
This is in the same range as rice (RAP2, 28,236)23 and sorghum (v1.4,
27,640)5, suggesting similar gene numbers across a broad diversity of
grasses. Gene models were evaluated using ~10.2 gigabases (Gb) of
Illumina RNA-seq data (Supplementary Fig. 5)24. Overall, 92.7%
of predicted coding sequences (CDS) were supported by Illumina data
(Fig. 2b), demonstrating the high accuracy of the Brachypodium
gene predictions. These gene models are available from several databases (such as http://www.brachybase.org, http://www.phytozome.net,
http://www.modelcrop.org and http://mips.org).
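The support figures quoted above rest on a per-feature coverage measure: the percentage of bases along a feature that are covered by at least one mapped read. The following is a minimal sketch of that calculation, not the HashMatch-based pipeline itself. The function names, the half-open coordinate convention and the example reads are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

# Half-open genomic interval [start, end); this convention is an assumption.
Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def percent_supported(feature: Interval, reads: Iterable[Interval]) -> float:
    """Percentage of bases in `feature` covered by at least one read."""
    f_start, f_end = feature
    length = f_end - f_start
    if length <= 0:
        raise ValueError("feature must have positive length")
    covered = 0
    for r_start, r_end in merge_intervals(reads):
        overlap = min(f_end, r_end) - max(f_start, r_start)
        if overlap > 0:
            covered += overlap
    return 100.0 * covered / length


# Hypothetical example: a 100-base exon and three 32-base reads.
exon = (100, 200)
reads = [(90, 122), (110, 142), (180, 212)]
coverage = percent_supported(exon, reads)
```

Merging the read intervals first keeps overlapping reads from inflating the covered-base count, which matters when read depth is high.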
Between 77 and 84% of gene families (defined according to Supplementary Fig. 6) are shared among the three grass subfamilies
represented by Brachypodium, rice and sorghum, reflecting a relatively
*A list of participants and their affiliations appears at the end of the paper.
©2010 Macmillan Publishers Limited. All rights reserved
ARTICLES
NATURE | Vol 463 | 11 February 2010
[Figure 1 graphic not reproduced. Recoverable labels: chromosome panels (Chr. 2, Chr. 5), each with tracks STA, cLTRs, sLTRs, DNA-TEs, MITEs and CDS; legend: retrotransposons, genes (introns), genes (CDS exons), DNA transposons, satellite tandem arrays.]
Figure 1 | Chromosomal distribution of the main Brachypodium genome
features. The abundance and distribution of the following genome elements
are shown: complete LTR retroelements (cLTRs); solo-LTRs (sLTRs);
potentially autonomous DNA transposons that are not miniature inverted-repeat transposable elements (MITEs) (DNA-TEs); MITEs; gene exons (CDS);
gene introns and satellite tandem arrays (STA). Graphs are from 0 to 100 per
cent base-pair (%bp) coverage of the respective window. The heat map tracks
have different ranges and different maximum (max) pseudocolour levels: STA
(0–55, scaled to max 10) %bp; cLTRs (0–36, scaled to max 20) %bp; sLTRs
(0–4) %bp; DNA-TEs (0–20) %bp; MITEs (0–22) %bp; CDS (exons)
(0–22.3) %bp. The triangles identify syntenic breakpoints.
recent common origin (Fig. 2c). Grass-specific genes include transmembrane receptor protein kinases, glycosyltransferases, peroxidases
and P450 proteins (Supplementary Table 7B). The Pooideae-specific
gene set contains only 265 gene families (Supplementary Table 7C)
comprising 811 genes (1,400 including singletons). Genes enriched in
grasses were significantly more likely to be contained in tandem arrays
than random genes, demonstrating a prominent role for tandem
gene expansion in the evolution of grass-specific genes (Supplementary Fig. 7 and Supplementary Table 8).
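The 77 to 84% range of shared gene families can be checked against the counts printed in the Venn diagram of Fig. 2c. The sketch below is a worked check rather than part of the original analysis. The assignment of the seven counts to Venn regions is inferred here, as the assignment that reproduces the three reported family totals, and should be treated as an assumption; the variable names are illustrative.

```python
# Venn region counts from Fig. 2c. The mapping of counts to regions is
# inferred (assumption): it is the one that reproduces the reported totals.
regions = {
    frozenset({"rice"}): 495,
    frozenset({"sorghum"}): 860,
    frozenset({"pooideae"}): 265,
    frozenset({"rice", "sorghum"}): 1479,
    frozenset({"rice", "pooideae"}): 681,
    frozenset({"sorghum", "pooideae"}): 1689,
    frozenset({"rice", "sorghum", "pooideae"}): 13580,
}

# Family totals printed beside each circle of the diagram.
reported_totals = {"rice": 16235, "sorghum": 17608, "pooideae": 16215}


def total_for(lineage: str) -> int:
    """Sum every Venn region that includes the given lineage."""
    return sum(n for members, n in regions.items() if lineage in members)


# Families present in all three subfamilies (the central region).
core = regions[frozenset(reported_totals)]

# Share of each lineage's families that belong to the common core.
percent_shared = {
    lineage: 100.0 * core / total_for(lineage) for lineage in reported_totals
}

for lineage, pct in percent_shared.items():
    print(f"{lineage}: {pct:.1f}% of families shared by all three")
```

Under this region assignment the Pooideae-only region holds 265 families, which agrees with the Pooideae-specific gene set described in the text.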
To validate and improve the v1.0 gene models, we manually annotated 2,755 gene models from 97 diverse gene families (Supplementary
Tables 9–11) relevant to bioenergy and food crop improvement. We
annotated 866 genes involved in cell wall biosynthesis/modification
and 948 transcription factors from 16 families25. Only 13% of the gene
[Figure 1, continued: chromosome panels Chr. 1, Chr. 3 and Chr. 4 with tracks STA, cLTRs, sLTRs, DNA-TEs, MITEs and CDS.]
[Figure 2 graphic not reproduced. Recoverable labels: a, tracks for total small RNA reads/loci, repeat-normalized 21-nt reads, repeat-normalized 24-nt reads, phased small RNA loci and repeat-normalized RNA-seq reads; b, coverage over feature length for 5′ UTR, 3′ UTR, introns, exons, cDNAs, CDS and SJS; c, Venn diagram: rice (Ehrhartoideae), 16,235 families, 20,559 genes in families; sorghum (Panicoideae), 17,608 families, 25,816 genes in families; Brachypodium and wheat/barley (Pooideae), 16,215 families, 20,562 genes in families; region counts as extracted: 495; 1,479; 860; 13,580; 681; 1,689; 265.]
Figure 2 | Transcript and gene identification and distribution among three
grass subfamilies. a, Genome-wide distribution of small RNA loci and
transcripts in the Brachypodium genome. Brachypodium chromosomes (1–5)
are shown at the top. Total small RNA reads (black lines) and total small RNA
loci (red lines) are shown on the top panel. Histograms plot 21-nucleotide (nt)
(blue) or 24-nucleotide (red) small RNA reads normalized for repeated matches
to the genome. The phased loci histograms plot the position and phase-score of
21-nucleotide (blue) and 24-nucleotide (red) phased small RNA loci. Repeat-normalized RNA-seq read histograms plot the abundance of reads matching
RNA transcripts (green), normalized for ambiguous matches to the genome.
b, Transcript coverage over gene features. Perfect match 32-base oligonucleotide
Illumina reads were mapped to the Brachypodium v1.0 annotation features
using HashMatch (http://mocklerlab-tools.cgrb.oregonstate.edu/). Plots of
Illumina coverage were calculated as the percentage of bases along the length of
the sequence feature supported by Illumina reads for the indicated gene model
features. The bottom and top of the box represent the 25th and 75th quartiles,
respectively. The white line is the median and the red diamonds denote the
mean. SJS, splice junction site. c, Venn diagram showing the distribution of
shared gene families between representatives of Ehrhartoideae (rice RAP2),
Panicoideae (sorghum v1.4) and Pooideae (Brachypodium v1.0, and Triticum
aestivum and Hordeum vulgare TCs (transcript consensus)/EST sequences).
Paralogous gene families were collapsed in these data sets.
models required modification and very few pseudogenes were identified, demonstrating the accuracy of the v1.0 annotation.
Phylogenetic trees for 62 gene families were constructed using genes
from rice, Arabidopsis, sorghum and poplar. In nearly all cases,
Brachypodium genes had a similar distribution to rice and sorghum,
demonstrating that Brachypodium is suitably generic for grass functional genomics research (Supplementary Figs 8 and 9). Analysis of the
predicted secretome identified substantial differences in the distribution of cell wall metabolism genes between dicots and grasses
(Supplementary Tables 12, 13 and Supplementary Fig. 10), consistent
with their different cell walls26. Signal peptide probability curves also
suggested that start codons were accurately predicted (Supplementary
Fig. 11).
Maintaining a small grass genome size
Exhaustive analysis of transposable elements (Supplementary
Information and Supplementary Table 14) showed retrotransposon
sequences comprise 21.4% of the genome, compared to 26% in rice,
54% in sorghum, and more than 80% in wheat27. Thirteen retroelement sets were younger than 20,000 years, showing a recent activation compared to rice28 (Supplementary Fig. 12), and a further 53
retroelement sets were less than 0.1 million years (Myr) old. A
minimum of 17.4 Mb has been lost by long terminal repeat (LTR)–
LTR recombination, demonstrating that retroelement expansion is
countered by removal through recombination. In contrast, retroelements persist for very long periods of time in the closely related
Triticeae28.
DNA transposons comprise 4.77% of the Brachypodium genome,
within the range found in other grass genomes5,29. Transcriptome data
and structural analysis suggest that many non-autonomous Mariner
DTT and Harbinger elements recruit transposases from other families.
Two CACTA DTC families (M and N) carried five non-element genes,
and the Harbinger U family has amplified a NBS-LRR gene family
(Supplementary Figs 13 and 14), adding it to the group of transposable
elements implicated in gene mobility30,31. Centromeric regions were
characterized by low gene density, characteristic repeats and retroelement clusters (Supplementary Fig. 15). Other repeat classes are
[Figure 3 graphic not reproduced. Recoverable labels: a, tree of Brachypodium, wheat, rice and sorghum with node ages 32–39, 40–54 and 45–60 Myr and WGD at 56–73 Myr; b, chromosomes Bd1–Bd5; c, rice and sorghum chromosomes against Brachypodium; d, barley and Aegilops tauschii chromosomes against Brachypodium; e, wheat chromosomes against Brachypodium.]
described in Supplementary Table 15. Conserved non-coding sequences
are described in Supplementary Fig. 16.
Whole-genome comparison of three diverse grass genomes
The evolutionary relationships between Brachypodium, sorghum,
rice and wheat were assessed by measuring the mean synonymous
substitution rates (Ks) of orthologous gene pairs (Supplementary
Information, Supplementary Fig. 17 and Supplementary Table 16),
from which divergence times of Brachypodium from wheat 32–39
Myr ago, rice 40–53 Myr ago, and sorghum 45–60 Myr ago (Fig. 3a)
were estimated. The Ks of orthologous gene pairs in the intragenomic
Brachypodium duplications (Fig. 3b) suggests duplication 56–72 Myr
ago, before the diversification of the grasses. This is consistent with
previous evolutionary histories inferred from a small number of
genes3,32–34.
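Converting a Ks value into a divergence time conventionally uses the molecular-clock relation T = Ks / (2r), where r is the synonymous substitution rate per site per year and the factor of two accounts for change along both lineages. The sketch below illustrates only that conversion. The rate of 6.5 × 10^-9, a value widely used for grasses, and the example Ks values are assumptions and are not taken from Supplementary Table 16.

```python
def divergence_time_myr(ks: float, rate_per_site_per_year: float = 6.5e-9) -> float:
    """Convert a synonymous substitution rate Ks into divergence time (Myr).

    Uses T = Ks / (2 * r). The default r is an assumed grass rate, not a
    value confirmed from the paper's supplementary methods.
    """
    if ks < 0:
        raise ValueError("Ks cannot be negative")
    if rate_per_site_per_year <= 0:
        raise ValueError("substitution rate must be positive")
    years = ks / (2.0 * rate_per_site_per_year)
    return years / 1e6


# Illustrative Ks values only (hypothetical, not the published estimates).
for label, ks in [("pair A", 0.45), ("pair B", 0.60), ("pair C", 0.80)]:
    print(f"{label}: Ks = {ks:.2f} -> {divergence_time_myr(ks):.1f} Myr")
```

Because the time scales inversely with r, a reported range such as 32–39 Myr can reflect differences in the estimation method or in the rate assumed, as well as uncertainty in Ks itself.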
Paralogous relationships among Brachypodium chromosomes
showed six major chromosomal duplications covering 92.1% of the
genome (Fig. 3b), representing ancestral whole-genome duplication35.
Using the rice and sorghum genome sequences, genetic maps of
barley36 and Aegilops tauschii (the D genome donor of hexaploid
wheat)37, and bin-mapped wheat ESTs38,39, 21,045 orthologous relationships between Brachypodium, rice, sorghum and Triticeae were
identified (Supplementary Information). These identified 59 blocks
of collinear genes covering 99.2% of the Brachypodium genome
(Fig. 3c–e). The orthologous relationships are consistent with an evolutionary model that shaped five Brachypodium chromosomes from a
five-chromosome ancestral genome by a 12-chromosome intermediate involving seven major chromosome fusions39 (Supplementary Fig. 18). These collinear blocks of orthologous genes provide a
robust and precise sequence framework for understanding grass
genome evolution and aiding the assembly of sequences from other
pooid grasses. We identified 14 major syntenic disruptions between
Brachypodium and rice/sorghum that can be explained by nested insertions of entire chromosomes into centromeric regions (Fig. 4a, b)2,37,40.
Similar nested insertions in sorghum37 and barley (Fig. 4c, d) were also
identified. Centromeric repeats and peaks in retroelements at the junctions of chromosome insertions are footprints of these insertion events
(Supplementary Fig. 15C and Fig. 1), as is higher gene density at the
former distal regions of the inserted chromosomes (Fig. 1). Notably,
the reduction in chromosome number in Brachypodium and wheat
occurred independently because none of the chromosome fusions are
shared by Brachypodium and the Triticeae37 (Supplementary Fig. 18).
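Collinear blocks of the kind described above are runs of orthologous gene pairs that keep the same order in both genomes. The following is a deliberately simplified sketch of how such runs can be chained from ordered orthologue pairs; it is not the procedure used for this analysis. The data layout, the `max_gap` and `min_size` thresholds and the example pairs are all hypothetical.

```python
from typing import List, Tuple

# (query gene rank, target chromosome, target gene rank); layout is assumed.
Pair = Tuple[int, str, int]


def _sign(x: int) -> int:
    return (x > 0) - (x < 0)


def collinear_blocks(
    pairs: List[Pair], max_gap: int = 5, min_size: int = 3
) -> List[List[Pair]]:
    """Chain orthologue pairs into blocks with consistent order and direction."""
    blocks: List[List[Pair]] = []
    current: List[Pair] = []
    for pair in sorted(pairs):
        if current:
            q_prev, chrom_prev, t_prev = current[-1]
            q, chrom, t = pair
            step = t - t_prev
            # Orientation is fixed by the first two pairs of the block.
            oriented_ok = (
                len(current) < 2
                or _sign(step) == _sign(current[1][2] - current[0][2])
            )
            same_run = (
                chrom == chrom_prev
                and 0 < q - q_prev <= max_gap
                and 0 < abs(step) <= max_gap
                and oriented_ok
            )
            if not same_run:
                if len(current) >= min_size:
                    blocks.append(current)
                current = []
        current.append(pair)
    if len(current) >= min_size:
        blocks.append(current)
    return blocks


# Hypothetical pairs: a forward run on Os1, an inverted run on Os5, a singleton.
example = [
    (1, "Os1", 10), (2, "Os1", 11), (3, "Os1", 13),
    (4, "Os5", 40), (5, "Os5", 39), (6, "Os5", 37),
    (7, "Os3", 5),
]
blocks = collinear_blocks(example)
```

Requiring a minimum block size discards isolated matches, and the gap threshold allows for a few lost or unannotated genes inside an otherwise conserved region.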
Figure 3 | Brachypodium genome evolution and synteny between grass
subfamilies. a, The distribution maxima of mean synonymous substitution
rates (Ks) of Brachypodium, rice, sorghum and wheat orthologous gene pairs
(Supplementary Table 16) were used to define the divergence times of these
species and the age of interchromosomal duplications in Brachypodium.
WGD, whole-genome duplication. The numbers refer to the predicted
divergence times measured as Myr ago by the NG or ML methods.
b, Diagram showing the six major interchromosomal Brachypodium
duplications, defined by 723 paralogous relationships, as coloured bands
linking the five chromosomes. c, Identification of chromosome relationships
between the Brachypodium, rice and sorghum genomes. Orthologous
relationships between the 25,532 protein-coding Brachypodium genes, 7,216
sorghum orthologues (12 syntenic blocks), and 8,533 rice orthologues (12
syntenic blocks) were defined. Sets of collinear orthologous relationships are
represented by a coloured band according to each Brachypodium
chromosome (blue, chromosome (chr.) 1; yellow, chr. 2; violet, chr. 3; red,
chr. 4; green, chr. 5). The white region in each Brachypodium chromosome
represents the centromeric region. d, Orthologous gene relationships
between Brachypodium and barley and Ae. tauschii were identified using
genetically mapped ESTs. 2,516 orthologous relationships defined 12
syntenic blocks. These are shown as coloured bands. e, Orthologous gene
relationships between Brachypodium and hexaploid bread wheat defined by
5,003 ESTs mapped to wheat deletion bins. Each set of orthologous
relationships is represented by a band that is evenly spread across each
deletion interval on the wheat chromosomes.
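[Figure 4 graphic not reproduced. Recoverable labels: a, Brachypodium chromosomes Bd1–Bd5 coloured by rice chromosomes Os1–Os12, scale bar 10 Mb; b, nested insertions involving Bd1–Bd5, rice chromosomes and ancestral chromosomes A4, A5, A7, A8 and A11; c, sorghum chromosomes Sb1 and Sb2, scale bar 10 Mb; d, barley chromosomes H1, H2 and H4.]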
Figure 4 | A recurring pattern of nested chromosome fusions in grasses.
a, The five Brachypodium chromosomes are coloured according to
homology with rice chromosomes (Os1–Os12). Chromosomes descended
from an ancestral chromosome (A4–A11) through whole-genome
duplication are shown in shades of the same colour. Gene density is
indicated as a red line above the chromosome maps. Major discontinuities in
gene density identify syntenic breakpoints, which are marked by a diamond.
White diamonds identify fusion points containing remnant centromeric
repeats. b, A pattern of nested insertions of whole chromosomes into
centromeric regions explains the observed syntenic breakpoints. Bd5 has
not undergone chromosome fusion. c, Examples of nested chromosome
insertions in sorghum (Sb) chromosomes 1 and 2. d, Examples of nested
chromosome insertions in barley (H chromosomes) inferred from genetic
maps. Nested insertions were not identified in other chromosomes, possibly
owing to the low resolution of genetic maps.
Comparisons of evolutionary rates between Brachypodium,
sorghum, rice and Ae. tauschii demonstrated a substantially higher
rate of genome change in Ae. tauschii (Supplementary Table 17).
This may be due to retroelement activity that increases syntenic
disruptions, as proposed for chromosome 5S later41. Among seven
relatively large gene families, four were highly syntenic and two
(NBS-LRR and F-box) were almost never found in syntenic order
when compared to rice and sorghum (Supplementary Table 18),
consistent with the rapid diversification of the NBS-LRR and
F-box gene families42.
The short arm of chromosome 5 (Bd5S) has a gene density roughly
half of the rest of the genome, high LTR retrotransposon density, the
youngest intact Gypsy elements and the lowest solo LTR density. Thus,
unlike the rest of the Brachypodium genome, Bd5S is gaining retrotransposons by replication and losing fewer by recombination.
Syntenic regions of rice (Os4S) and sorghum (Sb6S) demonstrate maintenance of this high repeat content for ~50–70 Myr
(Supplementary Fig. 19)43. Bd5S, Os4S and Sb6S also have the lowest
proportion of collinear genes (Fig. 4a and Supplementary Fig. 19). We
propose that the chromosome ancestral to Bd5S reached a tipping
point in which high retrotransposon density had deleterious effects
on genes.
Discussion
As the first genome sequence of a pooid grass, the Brachypodium
genome aids genome analysis and gene identification in the large
and complex genomes of wheat and barley, two other pooid grasses
that are among the world’s most important crops. The very high quality of the Brachypodium genome sequence, in combination with those
from two other grass subfamilies, enabled reconstruction of chromosome evolution across a broad diversity of grasses. This analysis
contributes to our understanding of grass diversification by explaining
how the varying chromosome numbers found in the major grass subfamilies derive from an ancestral set of five chromosomes by nested
insertions of whole chromosomes into centromeres. The relatively
small genome of Brachypodium contains many active retroelement
families, but recombination between these keeps genome expansion
in check. The short arm of chromosome 5 deviates from the rest of the
genome by exhibiting a trend towards genome expansion through
increased retroelement numbers and disruption of gene order more
typical of the larger genomes of closely related grasses.
Grass crop improvement for sustainable fuel44 and food45 production requires a substantial increase in research in species such as
Miscanthus, switchgrass, wheat and cool season forage grasses. These
considerations have led to the rapid adoption of Brachypodium as an
experimental system for grass research. The similarities in gene content
and gene family structure between Brachypodium, rice and sorghum
support the value of Brachypodium as a functional genomics model
for all grasses. The Brachypodium genome sequence analysis reported
here is therefore an important advance towards securing sustainable
supplies of food, feed and fuel from new generations of grass crops.
METHODS SUMMARY
Genome sequencing and assembly. Sanger sequencing was used to generate
paired-end reads from 3 kb, 8 kb, fosmid (35 kb) and BAC (100 kb) clones to
generate 9.4× coverage (Supplementary Table 1). The final assembly of 83 scaffolds covers 271.9 Mb (Supplementary Table 3). Sequence scaffolds were aligned
to a genetic map to create pseudomolecules covering each chromosome
(Supplementary Figs 1 and 2).
Protein-coding gene annotation. Gene models were derived from weighted
consensus prediction from several ab initio gene finders, optimal spliced alignments of ESTs and transcript assemblies, and protein homology. Illumina transcriptome sequence was aligned to predicted genome features to validate exons,
splice sites and alternatively spliced transcripts.
Repeats analysis. The MIPS ANGELA pipeline was used to integrate analyses
from expert groups. LTR_STRUC and LTRharvest46 were used for de novo
retroelement searches.
Received 29 August; accepted 9 December 2009.
1. Somerville, C. The billion-ton biofuels vision. Science 312, 1277 (2006).
2. Kellogg, E. A. Evolutionary history of the grasses. Plant Physiol. 125, 1198–1205
(2001).
3. Gaut, B. S. Evolutionary dynamics of grass genomes. New Phytol. 154, 15–28
(2002).
4. International Rice Genome Sequencing Project. The map-based sequence of the
rice genome. Nature 436, 793–800 (2005).
5. Paterson, A. H. et al. The Sorghum bicolor genome and the diversification of
grasses. Nature 457, 551–556 (2009).
6. Wei, F. et al. Physical and genetic structure of the maize genome reflects its
complex evolutionary history. PLoS Genet. 3, e123 (2007).
7. Moore, G., Devos, K. M., Wang, Z. & Gale, M. D. Cereal genome evolution.
Grasses, line up and form a circle. Curr. Biol. 5, 737–739 (1995).
8. Salamini, F., Ozkan, H., Brandolini, A., Schafer-Pregl, R. & Martin, W. Genetics and
geography of wild cereal domestication in the near east. Nature Rev. Genet. 3,
429–441 (2002).
9. Draper, J. et al. Brachypodium distachyon. A new model system for functional
genomics in grasses. Plant Physiol. 127, 1539–1555 (2001).
10. Vain, P. et al. Agrobacterium-mediated transformation of the temperate grass
Brachypodium distachyon (genotype Bd21) for T-DNA insertional mutagenesis.
Plant Biotechnol. J. 6, 236–245 (2008).
11. Vogel, J. & Hill, T. High-efficiency Agrobacterium-mediated transformation
of Brachypodium distachyon inbred line Bd21–3. Plant Cell Rep. 27, 471–478
(2008).
12. Vogel, J. P., Garvin, D. F., Leong, O. M. & Hayden, D. M. Agrobacterium-mediated
transformation and inbred line development in the model grass Brachypodium
distachyon. Plant Cell Tissue Organ Cult. 84, 100179–100191 (2006).
13. Filiz, E. et al. Molecular, morphological and cytological analysis of diverse
Brachypodium distachyon inbred lines. Genome 52, 876–890 (2009).
14. Vogel, J. P. et al. Development of SSR markers and analysis of diversity in Turkish
populations of Brachypodium distachyon. BMC Plant Biol. 9, 88 (2009).
15. Garvin, D. F. et al. An SSR-based genetic linkage map of the model grass
Brachypodium distachyon. Genome 53, 1–13 (2009).
16. Huo, N. et al. Construction and characterization of two BAC libraries from
Brachypodium distachyon, a new model for grass genomics. Genome 49,
1099–1108 (2006).
17. Huo, N. et al. The nuclear genome of Brachypodium distachyon: analysis of BAC end
sequences. Funct. Integr. Genomics 8, 135–147 (2008).
18. Gu, Y. Q. et al. A BAC-based physical map of Brachypodium distachyon and its
comparative analysis with rice and wheat. BMC Genomics 10, 496 (2009).
19. Garvin, D. F. et al. Development of genetic and genomic research resources for
Brachypodium distachyon, a new model system for grass crop research. Crop Sci.
48, S-69–S-84 (2008).
20. Bennett, M. D. & Leitch, I. J. Nuclear DNA amounts in angiosperms: progress,
problems and prospects. Ann. Bot. (Lond.) 95, 45–90 (2005).
21. Vogel, J. P. et al. EST sequencing and phylogenetic analysis of the model grass
Brachypodium distachyon. Theor. Appl. Genet. 113, 186–195 (2006).
22. Rajagopalan, R., Vaucheret, H., Trejo, J. & Bartel, D. P. A diverse and evolutionarily
fluid set of microRNAs in Arabidopsis thaliana. Genes Dev. 20, 3407–3425
(2006).
23. Tanaka, T. et al. The rice annotation project database (RAP-DB): 2008 update.
Nucleic Acids Res. 36, D1028–D1033 (2008).
24. Fox, S., Filichkin, S. & Mockler, T. Applications of ultra-high-throughput
sequencing. Methods Mol. Biol. 553, 79–108 (2009).
25. Gray, J. et al. A recommendation for naming transcription factor proteins in the
grasses. Plant Physiol. 149, 4–6 (2009).
26. Vogel, J. Unique aspects of the grass cell wall. Curr. Opin. Plant Biol. 11, 301–307
(2008).
27. Bennetzen, J. L. & Kellogg, E. A. Do plants have a one-way ticket to genomic
obesity? Plant Cell 9, 1509–1514 (1997).
28. Wicker, T. & Keller, B. Genome-wide comparative analysis of copia
retrotransposons in Triticeae, rice, and Arabidopsis reveals conserved ancient
evolutionary lineages and distinct dynamics of individual copia families. Genome
Res. 17, 1072–1081 (2007).
29. Wicker, T. et al. Analysis of intraspecies diversity in wheat and barley genomes
identifies breakpoints of ancient haplotypes and provides insight into the
structure of diploid and hexaploid triticeae gene pools. Plant Physiol. 149,
258–270 (2009).
30. Jiang, N., Bao, Z., Zhang, X., Eddy, S. R. & Wessler, S. R. Pack-MULE transposable
elements mediate gene evolution in plants. Nature 431, 569–573 (2004).
31. Morgante, M. et al. Gene duplication and exon shuffling by helitron-like
transposons generate intraspecies diversity in maize. Nature Genet. 37, 997–1002
(2005).
32. Grass Phylogeny Working Group. Phylogeny and subfamilial classification of the
grasses (Poaceae). Ann. Mo. Bot. Gard. 88, 373–457 (2001).
33. Bossolini, E., Wicker, T., Knobel, P. A. & Keller, B. Comparison of orthologous loci
from small grass genomes Brachypodium and rice: implications for wheat
genomics and grass genome annotation. Plant J. 49, 704–717 (2007).
34. Charles, M. et al. Sixty million years in evolution of soft grain trait in grasses:
emergence of the softness locus in the common ancestor of Pooideae and
Ehrhartoideae, after their divergence from Panicoideae. Mol. Biol. Evol. 26,
1651–1661 (2009).
35. Paterson, A. H., Bowers, J. E. & Chapman, B. A. Ancient polyploidization predating
divergence of the cereals, and its consequences for comparative genomics. Proc.
Natl Acad. Sci. USA 101, 9903–9908 (2004).
36. Stein, N. et al. A 1,000-loci transcript map of the barley genome: new anchoring
points for integrative grass genomics. Theor. Appl. Genet. 114, 823–839 (2007).
37. Luo, M. C. et al. Genome comparisons reveal a dominant mechanism of
chromosome number reduction in grasses and accelerated genome evolution in
Triticeae. Proc. Natl Acad. Sci. USA 106, 15780–15785 (2009).
38. Qi, L. L. et al. A chromosome bin map of 16,000 expressed sequence tag loci and
distribution of genes among the three genomes of polyploid wheat. Genetics 168,
701–712 (2004).
39. Salse, J. et al. Identification and characterization of shared duplications between
rice and wheat provide new insight into grass genome evolution. Plant Cell 20,
11–24 (2008).
40. Srinivasachary, Dida M. M., Gale, M. D. & Devos, K. M. Comparative analyses
reveal high levels of conserved colinearity between the finger millet and rice
genomes. Theor. Appl. Genet. 115, 489–499 (2007).
41. Vicient, C. M., Kalendar, R. & Schulman, A. H. Variability, recombination, and
mosaic evolution of the barley BARE-1 retrotransposon. J. Mol. Evol. 61, 275–291
(2005).
42. Meyers, B. C., Kozik, A., Griego, A., Kuang, H. & Michelmore, R. W. Genome-wide
analysis of NBS-LRR-encoding genes in Arabidopsis. Plant Cell 15, 809–834
(2003).
43. Ma, J. & Bennetzen, J. L. Rapid recent growth and divergence of rice nuclear
genomes. Proc. Natl Acad. Sci. USA 101, 12404–12410 (2004).
44. U.S. Department of Energy Office of Science. Breaking the Biological Barriers to
Cellulosic Ethanol: A Joint Research Agenda <http://genomicscience.energy.gov/biofuels/b2bworkshop.shtml> (2006).
45. Food and Agriculture Organization of the United Nations. World Agriculture:
Towards 2030/2050 Interim Report. <http://www.fao.org/ES/esd/AT2050web.pdf> (2006).
46. McCarthy, E. M. & McDonald, J. F. LTR_STRUC: a novel search and identification
program for LTR retrotransposons. Bioinformatics 19, 362–367 (2003).
Supplementary Information is linked to the online version of the paper at
www.nature.com/nature.
Acknowledgements We acknowledge the contributions of the late M. Gale, who
identified the importance of conserved gene order in grass genomes. This work was
mainly supported by the US Department of Energy Joint Genome Institute
Community Sequencing Program project with J.P.V., D.F.G., T.C.M. and M.W.B., a
BBSRC grant to M.W.B., an EU Contract Agronomics grant to M.W.B. and K.F.X.M.,
and GABI Barlex grant to K.F.X.M. Illumina transcriptome sequencing was
supported by a DOE Plant Feedstock Genomics for Bioenergy grant and an Oregon
State Agricultural Research Foundation grant to T.C.M.; small RNA research was
supported by the DOE Plant Feedstock Genomics for Bioenergy grants to P.J.G. and
T.C.M.; annotation was supported by a DOE Plant Feedstock Genomics for
Bioenergy grant to J.P.V. A full list of support and acknowledgements is in the
Supplementary Information.
Author Information The whole-genome shotgun sequence of Brachypodium
distachyon has been deposited at DDBJ/EMBL/GenBank under the accession
ADDN00000000. (The version described in this manuscript is the first version,
accession ADDN01000000). EST sequences have been deposited with dbEST
(accessions 67946317–68053959) and GenBank (accessions
GT758162–GT865804). The short read archive accession for RNA-seq data is
SRA010177. Reprints and permissions information is available at
www.nature.com/reprints. This paper is distributed under the terms of the
Creative Commons Attribution-Non-Commercial-Share Alike licence, and is freely
available to all readers at www.nature.com/nature. The authors declare no
competing financial interests. Correspondence and requests for materials should
be addressed to J.P.V. ([email protected]) or D.F.G.
([email protected]) or T.C.M. ([email protected]) or
M.W.B. ([email protected]).
Author Contributions See list of consortium authors below.
The International Brachypodium Initiative
Principal investigators John P. Vogel1, David F. Garvin2, Todd C. Mockler3, Jeremy
Schmutz4, Dan Rokhsar5,6, Michael W. Bevan7; DNA sequencing and assembly Kerrie
Barry5, Susan Lucas5, Miranda Harmon-Smith5, Kathleen Lail5, Hope Tice5, Jeremy
Schmutz4 (Leader), Jane Grimwood4, Neil McKenzie7, Michael W. Bevan7;
Pseudomolecule assembly and BAC end sequencing Naxin Huo1, Yong Q. Gu1, Gerard R.
Lazo1, Olin D. Anderson1, John P. Vogel1 (Leader), Frank M. You8, Ming-Cheng Luo8, Jan
Dvorak8, Jonathan Wright7, Melanie Febrer7, Michael W. Bevan7, Dominika Idziak9,
Robert Hasterok9, David F. Garvin2; Transcriptome sequencing and analysis Erika
Lindquist5, Mei Wang5, Samuel E. Fox3, Henry D. Priest3, Sergei A. Filichkin3, Scott A.
Givan3, Douglas W. Bryant3, Jeff H. Chang3, Todd C. Mockler3 (Leader), Haiyan Wu10,24,
Wei Wu10, An-Ping Hsia10, Patrick S. Schnable10,24, Anantharaman Kalyanaraman11,
Brad Barbazuk12, Todd P. Michael13, Samuel P. Hazen14, Jennifer N. Bragg1, Debbie
Laudencia-Chingcuanco1, John P. Vogel1, David F. Garvin2, Yiqun Weng15, Neil
McKenzie7, Michael W. Bevan7; Gene analysis and annotation Georg Haberer16,
Manuel Spannagl16, Klaus Mayer16 (Leader), Thomas Rattei17, Therese Mitros6, Dan
Rokhsar6, Sang-Jik Lee18, Jocelyn K. C. Rose18, Lukas A. Mueller19, Thomas L. York19;
Repeats analysis Thomas Wicker20 (Leader), Jan P. Buchmann20, Jaakko Tanskanen21,
Alan H. Schulman21 (Leader), Heidrun Gundlach16, Jonathan Wright7, Michael Bevan7,
Antonio Costa de Oliveira22, Luciano da C. Maia22, William Belknap1, Yong Q. Gu1, Ning
Jiang23, Jinsheng Lai24, Liucun Zhu25, Jianxin Ma25, Cheng Sun26, Ellen Pritham26;
Comparative genomics Jerome Salse27 (Leader), Florent Murat27, Michael Abrouk27,
Georg Haberer16, Manuel Spannagl16, Klaus Mayer16, Remy Bruggmann13, Joachim
Messing13, Frank M. You8, Ming-Cheng Luo8, Jan Dvorak8; Small RNA analysis Noah
Fahlgren3, Samuel E. Fox3, Christopher M. Sullivan3, Todd C. Mockler3, James C.
Carrington3, Elisabeth J. Chapman3,28, Greg D. May29, Jixian Zhai30, Matthias
Ganssmann30, Sai Guna Ranjan Gurazada30, Marcelo German30, Blake C. Meyers30,
Pamela J. Green30 (Leader); Manual annotation and gene family analysis Jennifer N.
Bragg1, Ludmila Tyler1,6, Jiajie Wu1,8, Yong Q. Gu1, Gerard R. Lazo1, Debbie
Laudencia-Chingcuanco1, James Thomson1, John P. Vogel1 (Leader), Samuel P. Hazen14,
Shan Chen14, Henrik V. Scheller31, Jesper Harholt32, Peter Ulvskov32, Samuel E. Fox3,
Sergei A. Filichkin3, Noah Fahlgren3, Jeffrey A. Kimbrel3, Jeff H. Chang3, Christopher M.
Sullivan3, Elisabeth J. Chapman3,27, James C. Carrington3, Todd C. Mockler3, Laura E.
Bartley8,31, Peijian Cao8,31, Ki-Hong Jung8,31†, Manoj K Sharma8,31, Miguel
Vega-Sanchez8,31, Pamela Ronald8,31, Christopher D. Dardick33, Stefanie De Bodt34, Wim
Verelst34, Dirk Inzé34, Maren Heese35, Arp Schnittger35, Xiaohan Yang36, Udaya C.
Kalluri36, Gerald A. Tuskan36, Zhihua Hua37, Richard D. Vierstra37, David F. Garvin2, Yu
Cui24, Shuhong Ouyang24, Qixin Sun24, Zhiyong Liu24, Alper Yilmaz38, Erich
Grotewold38, Richard Sibout39, Kian Hematy39, Gregory Mouille39, Herman Höfte39,
Todd Michael13, Jérome Pelloux40, Devin O’Connor41, James Schnable41, Scott Rowe41,
Frank Harmon41, Cynthia L. Cass42, John C. Sedbrook42, Mary E. Byrne7, Sean Walsh7,
Janet Higgins7, Michael Bevan7, Pinghua Li19, Thomas Brutnell19, Turgay Unver43, Hikmet
Budak43, Harry Belcram44, Mathieu Charles44, Boulos Chalhoub44, Ivan Baxter45
©2010 Macmillan Publishers Limited. All rights reserved
ARTICLES
NATURE | Vol 463 | 11 February 2010
1USDA-ARS Western Regional Research Center, Albany, California 94710, USA.
2USDA-ARS Plant Science Research Unit and University of Minnesota, St Paul, Minnesota 55108, USA.
3Oregon State University, Corvallis, Oregon 97331-4501, USA.
4HudsonAlpha Institute, Huntsville, Alabama 35806, USA.
5US DOE Joint Genome Institute, Walnut Creek, California 94598, USA.
6University of California Berkeley, Berkeley, California 94720, USA.
7John Innes Centre, Norwich NR4 7UJ, UK.
8University of California Davis, Davis, California 95616, USA.
9University of Silesia, 40-032 Katowice, Poland.
10Iowa State University, Ames, Iowa 50011, USA.
11Washington State University, Pullman, Washington 99163, USA.
12University of Florida, Gainesville, Florida 32611, USA.
13Rutgers University, Piscataway, New Jersey 08855-0759, USA.
14University of Massachusetts, Amherst, Massachusetts 01003-9292, USA.
15USDA-ARS Vegetable Crops Research Unit, Horticulture Department, University of Wisconsin, Madison, Wisconsin 53706, USA.
16Helmholtz Zentrum München, D-85764 Neuherberg, Germany.
17Technical University München, 80333 München, Germany.
18Cornell University, Ithaca, New York 14853, USA.
19Boyce Thompson Institute for Plant Research, Ithaca, New York 14853-1801, USA.
20University of Zurich, 8008 Zurich, Switzerland.
21MTT Agrifood Research and University of Helsinki, FIN-00014 Helsinki, Finland.
22Federal University of Pelotas, Pelotas, 96001-970, RS, Brazil.
23Michigan State University, East Lansing, Michigan 48824, USA.
24China Agricultural University, Beijing 10094, China.
25Purdue University, West Lafayette, Indiana 47907, USA.
26The University of Texas, Arlington, Arlington, Texas 76019, USA.
27Institut National de la Recherche Agronomique UMR 1095, 63100 Clermont-Ferrand, France.
28University of California San Diego, La Jolla, California 92093, USA.
29National Centre for Genome Resources, Santa Fe, New Mexico 87505, USA.
30University of Delaware, Newark, Delaware 19716, USA.
31Joint Bioenergy Institute, Emeryville, California 94720, USA.
32University of Copenhagen, Frederiksberg DK-1871, Denmark.
33USDA-ARS Appalachian Fruit Research Station, Kearneysville, West Virginia 25430, USA.
34VIB Department of Plant Systems Biology, VIB and Department of Plant Biotechnology and Genetics, Ghent University, Technologiepark 927, 9052 Gent, Belgium.
35Institut de Biologie Moléculaire des Plantes du CNRS, Strasbourg 67084, France.
36BioEnergy Science Center and Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831-6422, USA.
37University of Wisconsin-Madison, Madison, Wisconsin 53706, USA.
38The Ohio State University, Columbus, Ohio 43210, USA.
39Institut Jean-Pierre Bourgin, UMR1318, Institut National de la Recherche Agronomique, 78026 Versailles cedex, France.
40Université de Picardie, Amiens 80039, France.
41Plant Gene Expression Center, University of California Berkeley, Albany, California 94710, USA.
42Illinois State University and DOE Great Lakes Bioenergy Research Center, Normal, Illinois 61790, USA.
43Sabanci University, Istanbul 34956, Turkey.
44Unité de Recherche en Génomique Végétale: URGV (INRA-CNRS-UEVE), Evry 91057, France.
45USDA-ARS/Donald Danforth Plant Science Center, St Louis, Missouri 63130, USA.
†Present address: The School of Plant Molecular Systems Biotechnology, Kyung Hee University, Yongin 446-701, Korea.