articles

Analysis of the genome sequence of the flowering plant Arabidopsis thaliana

The Arabidopsis Genome Initiative*

* Authorship of this paper should be cited as 'The Arabidopsis Genome Initiative'. A full list of contributors appears at the end of this paper.

The flowering plant Arabidopsis thaliana is an important model system for identifying genes and determining their functions. Here we report the analysis of the genomic sequence of Arabidopsis. The sequenced regions cover 115.4 megabases of the 125-megabase genome and extend into centromeric regions. The evolution of Arabidopsis involved a whole-genome duplication, followed by subsequent gene loss and extensive local gene duplications, giving rise to a dynamic genome enriched by lateral gene transfer from a cyanobacterial-like ancestor of the plastid. The genome contains 25,498 genes encoding proteins from 11,000 families, similar to the functional diversity of Drosophila and Caenorhabditis elegans, the other sequenced multicellular eukaryotes. Arabidopsis has many families of new proteins but also lacks several common protein families, indicating that the sets of common proteins have undergone differential expansion and contraction in the three multicellular eukaryotes. This is the first complete genome sequence of a plant and provides the foundations for more comprehensive comparison of conserved processes in all eukaryotes, identifying a wide range of plant-specific gene functions and establishing rapid systematic ways to identify genes for crop improvement.

The plant and animal kingdoms evolved independently from unicellular eukaryotes and represent highly contrasting life forms. The genome sequences of C.
elegans (ref. 1) and Drosophila (ref. 2) reveal that metazoans share a great deal of genetic information required for developmental and physiological processes, but these genome sequences represent a limited survey of multicellular organisms. Flowering plants have unique organizational and physiological properties in addition to ancestral features conserved between plants and animals. The genome sequence of a plant provides a means for understanding the genetic basis of differences between plants and other eukaryotes, and provides the foundation for detailed functional characterization of plant genes.

Arabidopsis thaliana has many advantages for genome analysis, including a short generation time, small size, large number of offspring, and a relatively small nuclear genome. These advantages promoted the growth of a scientific community that has investigated the biological processes of Arabidopsis and has characterized many genes (ref. 3). To support these activities, an international collaboration (the Arabidopsis Genome Initiative, AGI) began sequencing the genome in 1996. The sequences of chromosomes 2 and 4 have been reported (refs 4, 5), and the accompanying Letters describe the sequences of chromosomes 1 (ref. 6), 3 (ref. 7) and 5 (ref. 8).

Here we report analysis of the completed Arabidopsis genome sequence, including annotation of predicted genes and assignment of functional categories. We also describe chromosome dynamics and architecture, the distribution of transposable elements and other repeats, the extent of lateral gene transfer from organelles, and the comparison of the genome sequence and structure to that of other Arabidopsis accessions (distinctive lines maintained by single-seed descent) and plant species. This report is the summation of work by experts interested in many biological processes selected to illuminate plant-specific functions including defence, photomorphogenesis, gene regulation, development, metabolism, transport and DNA repair.
The identification of many new members of receptor families, cellular components for plant-specific functions, genes of bacterial origin whose functions are now integrated with typical eukaryotic components, independent evolution of several families of transcription factors, and suggestions of as yet uncharacterized metabolic pathways are a few more highlights of this work. The implications of these discoveries are not only relevant for plant biologists, but will also affect agricultural science, evolutionary biology, bioinformatics, combinatorial chemistry, functional and comparative genomics, and molecular medicine.

Overview of sequencing strategy

We used large-insert bacterial artificial chromosome (BAC), phage (P1) and transformation-competent artificial chromosome (TAC) libraries (refs 9–12) as the primary substrates for sequencing. Early stages of genome sequencing used 79 cosmid clones. Physical maps of the genome of accession Columbia were assembled by restriction fragment 'fingerprint' analysis of BAC clones (ref. 13), by hybridization (ref. 14) or polymerase chain reaction (PCR; ref. 15) of sequence-tagged sites and by hybridization and Southern blotting (ref. 16). The resulting maps were integrated (http://nucleus/cshl.org/arabmaps/) with the genetic map and provided a foundation for assembling sets of contigs into sequence-ready tiling paths. End sequence (http://www.tigr.org/tdb/at/abe/bac_end_search.html) of 47,788 BAC clones was used to extend contigs from BACs anchored by marker content and to integrate contigs. Ten contigs representing the chromosome arms and centromeric heterochromatin were assembled from 1,569 BAC, TAC, cosmid and P1 clones (average insert size 100 kilobases (kb)). Twenty-two PCR products were amplified directly from genomic DNA and sequenced to link regions not covered by cloned DNA or to optimize the minimal tiling path.
Telomere sequence was obtained from specific yeast artificial chromosome (YAC) and phage clones, and from inverse polymerase chain reaction (IPCR) products derived from genomic DNA. Clone fingerprints, together with BAC end sequences, were generally adequate for selection of clones for sequencing over most of the genome. In the centromeric regions, these physical mapping methods were supplemented with genetic mapping to identify contig positions and orientation (ref. 17).

Selected clones were sequenced on both strands and assembled using standard techniques. Comparison of independently derived sequence of overlapping regions and independent reassembly of sequenced clones revealed accuracy rates between 99.99 and 99.999%. Over half of the sequence differences were between genomic and BAC clone sequence. All available sequenced genetic markers were integrated into sequence assemblies to verify sequence contigs (refs 4–8).

© 2000 Macmillan Magazines Ltd NATURE | VOL 408 | 14 DECEMBER 2000 | www.nature.com

The total length of sequenced regions, which extend from either the telomeres or ribosomal DNA repeats to the 180-base-pair (bp) centromeric repeats, is 115,409,949 bp (Table 1). Estimates of the unsequenced centromeric and rDNA repeat regions measure roughly 10 megabases (Mb), yielding a genome size of about 125 Mb, in the range of the 50–150 Mb haploid content estimated by different methods (ref. 18). In general, features such as gene density, expression levels and repeat distribution are very consistent across the five chromosomes (Fig. 1), and these are described in detail in reports on individual chromosomes (refs 4–8) and in the analysis of centromere, telomere and rDNA sequences.

We used tRNAscan-SE 1.21 (ref. 19) and manual inspection to identify 589 cytoplasmic transfer RNAs, 27 organelle-derived tRNAs and 13 pseudogenes, more than in any other genome sequenced to date. All 46 tRNA families needed to decode all possible 61 codons were found, defining the completeness of the functional set.
Several highly amplified families of tRNAs were found on the same strand (ref. 6); excluding these, each amino acid is decoded by 10–41 tRNAs.

The spliceosomal RNAs (U1, U2, U4, U5, U6) have all been experimentally identified in Arabidopsis. The previously identified sequences for all RNAs were found in the genome, except for U5, where the most similar counterpart was 92% identical. Between 10 and 16 copies of each small nuclear RNA (snRNA) were found across all chromosomes, dispersed as singletons or in small groups.

The small nucleolar RNAs (snoRNAs) consist of two subfamilies: the C/D box snoRNAs, which include 36 Arabidopsis genes, and the H/ACA box snoRNAs, for which no members have been identified in Arabidopsis. U3 is the most numerous of the C/D box snoRNAs, with eight copies found in the genome. We identified forty-five additional C/D box snoRNAs using software (www.rna.wustl.edu/snoRNAdb/) that detects snoRNAs that guide ribose methylation of ribosomal RNA.

A combination of algorithms, all optimized with parameters based on known Arabidopsis gene structures, was used to define gene structure. We used similarities to known protein and expressed sequence tag (EST) sequence to refine gene models. Eighty per cent of the gene structures predicted by the three centres involved were completely consistent, 93% of ESTs matched gene models, and less than 1% of ESTs matched predicted non-coding regions, indicating that most potential genes were identified. The sensitivity and selectivity of the gene prediction software used in this report have been comprehensively and independently assessed (ref. 20).

The 25,498 genes predicted (Table 1) is the largest gene set published to date: C. elegans (ref. 1) has 19,099 genes and Drosophila (ref. 2) has 13,601 genes. Arabidopsis and C. elegans have similar gene density, whereas Drosophila has a lower gene density; Arabidopsis also has a significantly greater extent of tandem gene duplications and segmental duplications, which may account for its larger gene set.

The rDNA repeat regions on chromosomes 2 and 4 were not sequenced because of their known repetitive structure and content. The centromeric regions are not completely sequenced owing to large blocks of monotonic repeats such as 5S rDNA and 180-bp repeats. The sequence continues to be extended further into centromeric and other regions of complex sequence.

Figure 1 Representation of the Arabidopsis chromosomes. [Image not reproduced. Panels: Chr. 1, 29.1 Mb; Chr. 2, 19.6 Mb; Chr. 3, 23.2 Mb; Chr. 4, 17.5 Mb; Chr. 5, 26.0 Mb. Tracks per chromosome: Genes, ESTs, TEs, MT/CP, RNAs. Scale bar, 100 kb. Pseudo-colour spectra run from high density to low density.] Each chromosome is represented as a coloured bar. Sequenced portions are red, telomeric and centromeric regions are light blue, heterochromatic knobs are shown black and the rDNA repeat regions are magenta. The unsequenced telomeres 2N and 4N are depicted with dashed lines. Telomeres are not drawn to scale. Images of DAPI-stained chromosomes were kindly supplied by P. Fransz. The frequency of features was given pseudo-colour assignments, from red (high density) to deep blue (low density). Gene density ('Genes') ranged from 38 per 100 kb to 1 gene per 100 kb; expressed sequence tag matches ('ESTs') ranged from more than 200 per 100 kb to 1 per 100 kb. Transposable element densities ('TEs') ranged from 33 per 100 kb to 1 per 100 kb. Mitochondrial and chloroplast insertions ('MT/CP') were assigned black and green tick marks, respectively. Transfer RNAs and small nucleolar RNAs ('RNAs') were assigned black and red tick marks, respectively.
Characterization of the coding regions

To assess the similarities and differences of the Arabidopsis gene complement compared with other sequenced eukaryotic genomes, we assigned functional categories to the complete set of Arabidopsis genes. For chromosome 4 genes and the yeast genome, predicted functions were previously manually assigned (refs 5, 21). All other predicted proteins were automatically assigned to these functional categories (ref. 22), assuming that conserved sequences reflect common functional relationships. The functions of 69% of the genes were classified according to sequence similarity to proteins of known function in all organisms; only 9% of the genes have been characterized experimentally (Fig. 2a). Generally similar proportions of gene products were predicted to be targeted to the secretory pathway and mitochondria in Arabidopsis and yeast, and up to 14% of the gene products are likely to be targeted to the chloroplast (Table 1). The significant proportion of genes with predicted functions involved in metabolism, gene regulation and defence is consistent with previous analyses (ref. 5). Roughly 30% of the 25,498 predicted gene products (Fig. 2a), comprising both plant-specific proteins and proteins with similarity to genes of unknown function from other organisms, could not be assigned to functional categories.

Table 1 Summary statistics of the Arabidopsis genome

(a) The DNA molecules

Feature                          Chr. 1       Chr. 2       Chr. 3       Chr. 4       Chr. 5       Σ
Length (bp)                      29,105,111   19,646,945   23,172,617   17,549,867   25,953,409   115,409,949
Top arm (bp)                     14,449,213   3,607,091    13,590,268   3,052,108    11,132,192
Bottom arm (bp)                  14,655,898   16,039,854   9,582,349    14,497,759   14,803,217
Base composition (%GC)
  Overall                        33.4         35.5         35.4         35.5         34.5
  Coding                         44.0         44.0         44.3         44.1         44.1
  Non-coding                     32.4         32.9         33.0         32.8         32.5
Number of genes                  6,543        4,036        5,220        3,825        5,874        25,498
Gene density (kb per gene)       4.0          4.9          4.5          4.6          4.4
Average gene length (bp)         2,078        1,949        1,925        2,138        1,974
Average peptide length (aa)      446          421          424          448          429
Exons
  Number                         35,482       19,631       26,570       20,073       31,226       132,982
  Total length (bp)              8,772,559    5,100,288    6,654,507    5,150,883    7,571,013    33,249,250
  Average per gene               5.4          4.9          5.1          5.2          5.3
  Average size (bp)              247          259          250          256          242
Introns
  Number                         28,939       15,595       21,350       16,248       25,352       107,484
  Total length (bp)              4,828,766    2,768,430    3,397,531    3,030,649    4,030,045    18,055,421
  Average size (bp)              168          177          159          186          159
Number of genes with ESTs (%)    60.8         56.9         59.8         61.4         61.4
Number of ESTs                   30,522       14,989       20,732       16,605       22,885       105,733

(b) The proteome

Classification/function                    Chr. 1          Chr. 2          Chr. 3          Chr. 4          Chr. 5          Σ
Total proteins                             6,543           4,036           5,220           3,825           5,874           25,498
With INTERPRO domains                      4,194 (64.1%)   1,205 (29.9%)   2,989 (57.8%)   1,545 (40.4%)   3,136 (53.4%)   13,069 (51.3%)
Genes containing at least one TM domain    2,334 (35.7%)   1,322 (32.8%)   1,615 (30.9%)   1,402 (36.7%)   1,940 (33.0%)   8,613 (33.8%)
Genes containing at least one SCOP domain  2,513 (38.4%)   1,424 (35.3%)   1,664 (31.9%)   1,304 (34.1%)   2,121 (36.1%)   9,026 (35.4%)
With putative signal peptides
  Secretory pathway                        1,242 (19.0%)   675 (16.7%)     877 (17.0%)     659 (17.2%)     1,014 (17.3%)   4,467 (17.6%)
    >0.95 specificity                      1,146 (17.5%)   632 (15.7%)     813 (15.7%)     632 (16.5%)     964 (16.4%)     4,167 (16.4%)
  Chloroplast                              866 (13.2%)     535 (13.2%)     754 (14.6%)     532 (13.9%)     887 (15.1%)     3,574 (14.0%)
    >0.95 specificity                      602 (9.2%)      290 (7.2%)      420 (8.1%)      298 (7.8%)      475 (8.1%)      2,085 (8.2%)
  Mitochondria                             901 (13.8%)     425 (10.5%)     554 (10.7%)     390 (10.2%)     627 (10.7%)     2,897 (11.4%)
    >0.95 specificity                      113 (1.7%)      49 (1.2%)       63 (1.2%)       59 (1.5%)       65 (1.1%)       349 (1.4%)
Functional classification
  Cellular metabolism                      1,188 (22.7%)   620 (23.3%)     745 (22.8%)     588 (22.9%)     868 (21.1%)     4,009 (22.5%)
  Transcription                            880 (16.8%)     474 (17.8%)     566 (17.3%)     335 (13.1%)     763 (18.6%)     3,018 (16.9%)
  Plant defence                            640 (12.2%)     276 (10.4%)     354 (10.8%)     295 (11.5%)     490 (11.9%)     2,055 (11.5%)
  Signalling                               573 (11.0%)     296 (11.1%)     356 (10.9%)     210 (8.2%)      420 (10.2%)     1,855 (10.4%)
  Growth                                   542 (10.4%)     263 (9.9%)      357 (10.9%)     448 (17.5%)     469 (11.4%)     2,079 (11.7%)
  Protein fate                             520 (9.9%)      273 (10.2%)     314 (9.6%)      264 (10.3%)     395 (9.6%)      1,766 (9.9%)
  Intracellular transport                  435 (8.3%)      214 (8.9%)      269 (8.2%)      220 (8.6%)      334 (8.1%)      1,472 (8.3%)
  Transport                                236 (4.5%)      139 (5.2%)      155 (4.7%)      113 (4.4%)      206 (5.0%)      849 (4.8%)
  Protein synthesis                        216 (4.1%)      111 (4.2%)      148 (4.5%)      90 (3.5%)       165 (4.0%)      730 (4.1%)
  Total                                    5,230           2,666           3,264           2,563           4,110           17,833

The features of Arabidopsis chromosomes 1–5 and the complete nuclear genome are listed. Specialized searches used the following programs and databases: INTERPRO (ref. 23); transmembrane (TM) domains by ALOM2 (unpublished); SCOP domain database (ref. 121); functional classification by the PEDANT analysis system (ref. 22). Signal peptide prediction (secretory pathway, targeted to chloroplast or mitochondria) was performed using TargetP (ref. 122) and http://www.cbs.dtu.dk/services/TargetP/.
* Default value.

To compare the functional categories in more detail, we compared data from the complete genomes of Escherichia coli (ref. 23), Synechocystis sp. (ref. 24), Saccharomyces cerevisiae (ref. 21), C. elegans (ref. 1) and Drosophila (ref. 2), and a non-redundant protein set of Homo sapiens, with the Arabidopsis genome data (Fig. 2b), using a stringent BLASTP threshold value of E < 10^-30. The proportion of Arabidopsis proteins having related counterparts in eukaryotic genomes varies by a factor of 2 to 3 depending on the functional category. Only 8–23% of Arabidopsis proteins involved in transcription have related genes in other eukaryotic genomes, reflecting the independent evolution of many plant transcription factors. In contrast, 48–60% of genes involved in protein synthesis have counterparts in the other eukaryotic genomes, reflecting highly conserved gene functions.
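The per-category comparison behind Fig. 2b can be pictured with a small calculation. The following is only a minimal sketch, assuming that each protein's best BLASTP E-value against one reference genome has already been parsed into a dictionary; the function name, the input layout and the toy values are hypothetical and are not taken from the analysis pipeline used in the paper. The cutoff is applied as "at or below 10^-30", following the Fig. 2 legend.

```python
from collections import defaultdict

E_THRESHOLD = 1e-30  # stringent BLASTP cutoff quoted in the text


def conserved_fraction(categories, best_hits, threshold=E_THRESHOLD):
    """Fraction of proteins per functional category with a qualifying hit.

    categories: dict mapping protein id -> functional category (hypothetical input)
    best_hits:  dict mapping protein id -> best E-value against one reference
                genome; proteins without any hit are simply absent
    """
    totals = defaultdict(int)
    matched = defaultdict(int)
    for protein, category in categories.items():
        totals[category] += 1
        e_value = best_hits.get(protein)
        if e_value is not None and e_value <= threshold:
            matched[category] += 1
    return {category: matched[category] / totals[category] for category in totals}


# Toy data, invented for illustration only
categories = {
    "p1": "transcription",
    "p2": "transcription",
    "p3": "protein synthesis",
    "p4": "protein synthesis",
}
best_hits = {"p1": 1e-45, "p2": 1e-5, "p3": 1e-80, "p4": 1e-31}
fractions = conserved_fraction(categories, best_hits)
print(fractions)
```

Running the same function once per reference genome would give one series per organism, which is the shape of the data plotted in Fig. 2b.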
The relatively high proportion of matches between Arabidopsis and bacterial proteins in the categories 'metabolism' and 'energy' reflects both the acquisition of bacterial genes from the ancestor of the plastid and high conservation of sequences across all species. Finally, a comparison between unicellular and multicellular eukaryotes indicates that Arabidopsis genes involved in cellular communication and signal transduction have more counterparts in multicellular eukaryotes than in yeast, reflecting the need for sets of genes for communication in multicellular organisms.

Pronounced redundancy in the Arabidopsis genome is evident in segmental duplications and tandem arrays, and many other genes with high levels of sequence conservation are also scattered over the genome. Sequence similarity exceeding a BLASTP value E < 10^-20 and extending over at least 80% of the protein length were used as parameters to identify protein families (Table 2). A total of 11,601 protein types were identified. Thirty-five per cent of the predicted proteins are unique in the genome, and the proportion of proteins belonging to families of more than five members is substantially higher in Arabidopsis (37.4%) than in Drosophila (12.1%) or C. elegans (24.0%).

Figure 2 Functional analysis of Arabidopsis genes. [Image not reproduced. Panel a categories: Cell growth, cell division and DNA synthesis; Metabolism; Transcription; Cell rescue, defence, cell death, ageing; Cellular communication/signal transduction; Protein destination; Intracellular transport; Unclassified; Cellular biogenesis; Transport facilitation; Energy; Protein synthesis; Ionic homeostasis. Panel b series: E. coli, Synechocystis, S. cerevisiae, C. elegans, Drosophila, Human; x axis, functional categories; y axis, 0 to 0.7.] a, Proportion of predicted Arabidopsis genes in different functional categories. b, Comparison of functional categories between organisms. Subsets of the Arabidopsis proteome containing all proteins that fall into a common functional class were assembled. Each subset was searched against the complete set of translations from Escherichia coli, Synechocystis sp. PCC6803, Saccharomyces cerevisiae, Drosophila, C. elegans and a Homo sapiens non-redundant protein database. The percentage of Arabidopsis proteins in a particular subset that had a BLASTP match with E ≤ 10^-30 to the respective reference genome is shown. This reflects the measure of sequence conservation of proteins within this particular functional category between Arabidopsis and the respective reference genome. y axis, 0.1 = 10%.

Table 2 Proportion of genes in different organisms present as either singletons or in paralogous families

                   No. of singletons and               Gene families containing
Organism           distinct gene families   Unique     2 members   3 members   4 members   5 members   >5 members
H. influenzae      1,587                    88.8%      6.8%        2.3%        0.7%        0.0%        1.4%
S. cerevisiae      5,105                    71.4%      13.8%       3.5%        2.2%        0.7%        8.4%
D. melanogaster    10,736                   72.5%      8.5%        3.4%        1.9%        1.6%        12.1%
C. elegans         14,177                   55.2%      12.0%       4.5%        2.7%        1.6%        24.0%
Arabidopsis        11,601                   35.0%      12.5%       7.0%        4.4%        3.6%        37.4%

The number of genes in the genomes of Haemophilus influenzae, S. cerevisiae, Drosophila, C. elegans and Arabidopsis that are present either as singletons or in gene families with two or more members are listed. To be grouped in a gene family, two genes had to show similarity exceeding a BLASTP value E < 10^-20 and a FASTA alignment over at least 80% of the protein length. In column 1, the number of genes that are unique plus the number of gene families are listed. Columns 2 to 6 give the percentage of genes present as singletons or in gene families of n members.
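The family criteria in the Table 2 legend (E < 10^-20 and alignment over at least 80% of the protein length) define which pairs of proteins are related, but the text does not say how pairs were merged into families. The sketch below assumes single-linkage grouping with a union-find structure, which is one common choice and not necessarily the method used for Table 2; the function name, the hit tuple layout and the toy data are hypothetical.

```python
def cluster_families(proteins, hits, max_e=1e-20, min_cover=0.8):
    """Group proteins into families by single linkage (an assumed strategy).

    proteins: iterable of protein ids
    hits:     iterable of (id_a, id_b, e_value, coverage) tuples, where
              coverage is the aligned fraction of the protein length
    """
    parent = {p: p for p in proteins}

    def find(x):
        # Walk to the root, shortening the path as we go (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, e_value, coverage in hits:
        if e_value < max_e and coverage >= min_cover:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    families = {}
    for p in parent:
        families.setdefault(find(p), []).append(p)
    return list(families.values())


# Toy data, invented for illustration only
proteins = ["A", "B", "C", "D", "E", "F"]
hits = [
    ("A", "B", 1e-50, 0.95),  # passes both criteria
    ("B", "C", 1e-25, 0.85),  # passes both criteria
    ("D", "E", 1e-40, 0.50),  # fails the 80% coverage criterion
    ("E", "F", 1e-10, 0.90),  # fails the E-value criterion
]
families = cluster_families(proteins, hits)
n_types = len(families)  # singletons plus distinct families, as in column 1
unique_share = sum(1 for f in families if len(f) == 1) / len(proteins)
print(n_types, unique_share)
```

With single linkage, two proteins that do not match each other directly can still end up in the same family through a shared relative, so stricter linkage rules would give smaller families and a larger count of types.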
The absolute number of Arabidopsis gene families and singletons (types) is in the same range as the other multicellular eukaryotes, indicating that a proteome of 11,000–15,000 types is sufficient for a wide diversity of multicellular life. The proportion of gene families with more than two members is considerably more pronounced in Arabidopsis than in other eukaryotes (Fig. 3). As segmental duplication is responsible for 6,303 gene duplications (see below), the extent of tandem gene duplications accounts for a significant proportion of the increased family size. These features of the Arabidopsis genome, and presumably of other plant genomes, may indicate more relaxed constraints on genome size in plants, or a more prominent role of unequal crossing over to generate new gene copies.

Conserved protein domains revealed more informative differences through INTERPRO (ref. 25) analysis of the predicted gene products from Arabidopsis, S. cerevisiae, C. elegans and Drosophila. Statistically over-represented domains, and those that are absent from the Arabidopsis genome, indicate domains that may have been gained or lost during the evolution of plants (Supplementary Information Table 1). Proteins containing the Pro-Pro-Arg repeat, which is involved in RNA stabilization and RNA processing, are over-represented as compared to yeast, fly and worm; 400 proteins containing this signature were detected in Arabidopsis compared with only 10 in total in yeast, Drosophila and C. elegans. Protein kinases and associated domains, 169 proteins containing a disease resistance protein signature, and the Toll/IL-1R (TIR) domain, a component of pathogen recognition molecules (ref. 26), are also relatively abundant. This suggests that pathways transducing signals in response to pathogens and diverse environmental cues are more abundant in plants than in other organisms.

The RING zinc finger domain is relatively over-represented in Arabidopsis compared with yeast, Drosophila and C.
elegans, whereas the F-box domain is over-represented as compared with yeast and Drosophila only. These domains are involved in targeting proteins to the proteasome (ref. 27) and ubiquitinylation (ref. 28) pathways of protein degradation, respectively. In plants many processes such as hormone and defence responses, light signalling, and circadian rhythms and pattern formation use F-box function to direct negative regulators to the ubiquitin degradation pathway. This mode of regulation appears to be more prevalent in plants and may account for a higher representation of the F box than in Drosophila and for the over-representation of the ubiquitin domain in the Arabidopsis genome. RING finger domain proteins in general have a role in ubiquitin protein ligases, indicating that proteasome-mediated degradation is a more widespread mode of regulation in plants than in other kingdoms.

Figure 3 Distribution of tandemly repeated gene arrays in the Arabidopsis genome. [Image not reproduced. Histogram: x axis, number of tandemly repeated genes per gene array; y axis, number of arrays (0 to 1,200). Bar values as printed: 1052, 249, 108, 57, 36, 20, 18, 17, 15, 6, 2 and 2 for array sizes 2, 3, 4, 5, 6, 7, 8, 9, 10, 11–15, 16–20 and 21–23.] Tandemly repeated gene arrays were identified using the BLASTP program with a threshold of E < 10^-20. One unrelated gene among cluster members was tolerated. The histogram gives the number of clusters in the genome containing 2 to n similar gene units in tandem.

Most functions identified by protein domains are conserved in similar proportions in the Arabidopsis, S. cerevisiae, Drosophila and C. elegans genomes, pointing to many ubiquitous eukaryotic pathways. These are illustrated by comparing the list of human disease genes (ref. 29) to the complete Arabidopsis gene set using BLASTP. Out of 289 human disease genes, 139 (48%) had hits in Arabidopsis using a BLASTP threshold E < 10^-10. Sixty-nine (24%) exceeded an E < 10^-40 threshold, and 26 (9.3%) had scores better than E < 10^-100 (Table 3).
There are at least 17 human disease genes more similar to Arabidopsis genes than yeast, Drosophila or C. elegans genes (Table 3).

This analysis shows that, although numerous families of proteins are shared between all eukaryotes, plants contain roughly 150 unique protein families. These include transcription factors, structural proteins, enzymes and proteins of unknown function. Members of the families of genes common to all eukaryotes have undergone substantial increases or decreases in their size in Arabidopsis. Finally, the transfer of a relatively small number of cyanobacteria-related genes from a putative endosymbiotic ancestor of the plastid has added to the diversity of protein structures found in plants.

Genome organization and duplication

The Arabidopsis genome sequence provides a complete view of chromosomal organization and clues to its evolutionary history. Gene families organized in tandem arrays of two or more units have been described in C. elegans (ref. 1) and Drosophila (ref. 2). Analysis of the Arabidopsis genome revealed 1,528 tandem arrays containing 4,140 individual genes, with arrays ranging up to 23 adjacent members (Fig. 3). Thus 17% of all genes of Arabidopsis are arranged in tandem arrays.

Large segmental duplications were identified either by directly aligning chromosomal sequences or by aligning proteins and searching for tracts of conserved gene order. All five chromosomes were aligned to each other in both orientations using MUMmer (ref. 30), and the results were filtered to identify all segments at least 1,000 bp in length with at least 50% identity (Supplementary Information Fig. 1). These revealed 24 large duplicated segments of 100 kb or larger, comprising 65.6 Mb or 58% of the genome. The only duplicated segment in the centromeric regions was a 375-kb segment on chromosome 4. Many duplications appear to have undergone further shuffling, such as local inversions after the duplication event.
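The tandem-array definition used for Fig. 3 (similar genes adjacent on a chromosome, with one unrelated gene tolerated) can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the paper: it assumes genes are given in chromosomal order, that BLASTP pairs passing E < 10^-20 have been precomputed into a set, and that "one unrelated gene tolerated" means at most one intervening gene between successive array members. The function name and toy data are hypothetical.

```python
def tandem_arrays(gene_order, similar_pairs, max_unrelated=1):
    """Return tandem arrays as lists of gene ids in chromosomal order.

    gene_order:    gene ids in their order along one chromosome
    similar_pairs: set of frozenset({a, b}) for pairs passing the cutoff
    max_unrelated: intervening genes tolerated between array members
                   (interpretation assumed for this sketch)
    """
    n = len(gene_order)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link each gene to similar genes at most max_unrelated + 1 positions ahead.
    for i in range(n):
        for j in range(i + 1, min(i + max_unrelated + 2, n)):
            if frozenset((gene_order[i], gene_order[j])) in similar_pairs:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    groups = {}
    for i, gene in enumerate(gene_order):
        groups.setdefault(find(i), []).append(gene)
    return [members for members in groups.values() if len(members) >= 2]


# Toy chromosome, invented for illustration only
gene_order = ["g1", "g2", "g3", "x1", "g4", "y1", "y2", "g5"]
similar_pairs = {
    frozenset(("g1", "g2")),
    frozenset(("g2", "g3")),
    frozenset(("g3", "g4")),  # separated by one unrelated gene: still linked
    frozenset(("g4", "g5")),  # separated by two unrelated genes: not linked
}
arrays = tandem_arrays(gene_order, similar_pairs)
print(arrays)
```

Counting the arrays and their members over all five chromosomes would give figures of the kind quoted above (number of arrays, genes in arrays, largest array), although the exact totals depend on how the tolerance rule is interpreted.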
We used TBLASTX (ref. 5) to identify collinear clusters of genes residing in large duplicated chromosomal segments. The duplicated regions encompass 67.9 Mb, 60% of the genome, slightly more than was found in the DNA-based alignment (Fig. 4), and these data extend earlier findings (refs 4, 5, 31). The extent of sequence conservation of the duplicated genes varies greatly, with 6,303 (37%) of the 17,193 genes in the segments classified as highly conserved (E < 10^-30) and a further 1,705 (10%) showing less significant similarity up to E < 10^-5. The proportion of homologous genes in each duplicated segment also varies widely, between 20% and 47% for the highly conserved class of genes. In many cases, the number of copies of a gene and its counterpart differ (for example, one copy on one chromosome and multiple copies on the other; see Supplementary Information Fig. 2); this could be due to either tandem duplication or gene loss after the segmental duplication.

What does the duplication in the Arabidopsis genome tell us about the ancestry of the species? Polyploidy occurs widely in plants and is proposed to be a key factor in plant evolution (ref. 32). As the majority of the Arabidopsis genome is represented in duplicated (but not triplicated) segments, it appears most likely that Arabidopsis, like maize, had a tetraploid ancestor (ref. 33). A comparative sequence analysis of Arabidopsis and tomato estimated that a duplication occurred ~112 Myr ago to form a tetraploid (ref. 34). The degrees of conservation of the duplicated segments might be due to divergence from an ancestral autotetraploid form, or might reflect differences present in an allotetraploid ancestor. It is also possible, however, that several independent segmental duplication events took place instead of tetraploid formation and stabilization.
The diploid genetics of Arabidopsis and the extensive divergence of the duplicated segments have masked its evolutionary history. The determination of Arabidopsis gene functions must therefore be pursued with the potential for functional redundancy taken into account. The long period of time over which genome stabilization has occurred has, however, provided ample opportunity for the divergence of the functions of genes that arose from duplications.

Comparative analysis of Arabidopsis accessions

Comparing the multiple accessions of Arabidopsis allows us to identify commonly occurring changes in genome microstructure. It also enables the development of new molecular markers for genetic mapping. High rates of polymorphism between Arabidopsis accessions, including both DNA sequence and copy number of tandem arrays, are prevalent at loci involved in disease resistance (ref. 35). This has been observed for other plant species, and such loci are thought to serve as templates for illegitimate recombination to create new pathogen response specificities (ref. 36).

Figure 4 Segmentally duplicated regions in the Arabidopsis genome. Individual chromosomes are depicted as horizontal grey bars (with chromosome 1 at the top); centromeres are marked black. Coloured bands connect corresponding duplicated segments.

We carried out a comparative analysis between 82 Mb of the genome sequence of Arabidopsis accession Columbia (Col-0) and 92.1 Mb of nonredundant low-pass (twofold redundant) sequence data of the genomic DNA of accession Landsberg erecta (Ler). We identified two classes of differences between the sequences: single nucleotide polymorphisms (SNPs), and insertion–deletions (InDels). As we used high stringency criteria, our results represent a minimum estimate of numbers of polymorphisms between the two genomes. In total, we detected 25,274 SNPs, representing an average density of 1 SNP per 3.3 kb. Transitions (A/T–G/C) represented 52.1% of the SNPs, and transversions accounted for the remainder: 17.3% for A/T–T/A, 22.7% for A/T–C/G and 7.9% for C/G–G/C. In total, we detected 14,570 InDels at an average spacing of 6.1 kb. They ranged from 2 bp to over 38 kilobase-pairs, although 95% were smaller than 50 bp. Only 10% of the InDels were co-located with simple sequence repeats identified with the program Sputnik. An analysis of 416 relative insertions greater than 250 bp in Col-0 showed that 30% matched transposon-related proteins, indicating that a substantial proportion of the large InDels are the result of transposon insertion or excision. Many InDels contained entire active genes not related to transposons. Half of such genes absent from corresponding positions in the Col-0 sequence were found elsewhere on the genome of Ler. This indicates that genes have been transferred to new genomic locations.

Gene structures are often affected by small InDels and SNPs. The positions of SNPs and InDels were mapped relative to 87,427 exons and 70,379 introns annotated in the Col-0 sequence. SNPs were found in exons, introns and intergenic regions at frequencies of 1 SNP per 3.1, 2.2 and 3.5 kb, respectively. The frequencies for InDels were 1 per 9.3, 3.1 and 4.3 kb, respectively. Polymorphisms were detected in 7% of exons, and alter the spliced sequences of 25% of the predicted genes. For InDels in exons, insertion lengths divisible by three are prevalent for small insertions (< 50 bp), indicating that many proteins can withstand small insertions or deletions of amino acids without loss of function.

Our analyses show that sequence polymorphisms between accessions of Arabidopsis are common, and that they occur in both coding and non-coding regions. We found evidence for the relocation of genes in the genome, and for changes in the complement of transposable elements. The data presented here are available at http://www.arabidopsis.org/cereon/.
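The polymorphism categories used in this section can be made concrete with a short sketch. It is illustrative only and is not the method used for the Col-0/Ler comparison: the function names are assumptions, and the code shows just the standard definitions (a transition swaps purine for purine or pyrimidine for pyrimidine; an InDel preserves the reading frame when its length is divisible by three), plus the sequence span implied by an event count and an average spacing.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def snp_class(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref!r} -> {alt!r}")
    same_ring_type = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_ring_type else "transversion"


def is_in_frame(indel_length_bp: int) -> bool:
    """True if an exonic InDel of this length keeps the reading frame."""
    return indel_length_bp % 3 == 0


def implied_span_mb(n_events: int, spacing_kb: float) -> float:
    """Sequence length (Mb) implied by an event count and average spacing."""
    return n_events * spacing_kb / 1000.0


if __name__ == "__main__":
    print(snp_class("A", "G"), snp_class("A", "T"))
    print(is_in_frame(6), is_in_frame(4))
    # Counts and spacings reported in the text.
    print(round(implied_span_mb(25274, 3.3), 1))  # SNPs
    print(round(implied_span_mb(14570, 6.1), 1))  # InDels
```

The spans implied by the reported counts and spacings are of the same order as the 82 Mb of Col-0 sequence compared, and the four reported SNP percentages (52.1, 17.3, 22.7 and 7.9) sum to 100.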
Figure 4 (caption, continued) Similarities between the rDNA repeats are excluded. Duplicated segments in reversed orientation are connected with twisted coloured bands. The scale is in megabases.

Comparison of Arabidopsis and other plant genera

Comparative genetic mapping can reveal extensive conservation of genome organization between closely related species (refs 37, 38). The comparative analysis of plant genome microstructure reveals much about the evolution of plant genomes and provides unprecedented opportunities for crop improvement by establishing the detailed structures of, and relationships between, the genomes of crops and Arabidopsis.

The lineages leading to Arabidopsis and Capsella rubella (shepherd's purse) diverged between 6.2 and 9.8 Myr ago, and the gene content and genome organization of C. rubella are very similar to those of Arabidopsis (ref. 39), including the large-scale duplications. Alignment of Arabidopsis complementary DNA and EST sequences with genomic DNA sequences of Arabidopsis and C. rubella showed conservation of exon length and intron positions. Coding sequences predicted from these alignments differed from the annotated Arabidopsis gene sequences in two out of five cases.

The ancestral lineages of Arabidopsis and the Brassica (cabbage and mustard) genera diverged 12.2–19.2 Myr ago (ref. 40). Brassica genes show a high level of nucleotide conservation with their Arabidopsis orthologues, typically more than 85% in coding regions (ref. 40). The structure of Brassica genomes resembles that of Arabidopsis, but with extensive triplication and rearrangement (ref. 41), and extensive divergence of microstructure (Supplementary Information Fig. 3). The divergence between the genomes of Arabidopsis and Brassica oleracea is in striking contrast to that observed between Arabidopsis and C. rubella, although the time since divergence is only twofold greater.
This accelerated rate of change in triplicated segments of the genome of B. oleracea indicates that polyploidy fosters rapid chromosomal evolution.

The Arabidopsis and tomato lineages diverged roughly 150 Myr ago, and comparative sequence analysis of segments of their genomes has revealed complex relationships (ref. 34). Four regions of the Arabidopsis genome are related to each other and to one region in the tomato genome, suggesting that two rounds of duplication may have occurred in the Arabidopsis lineage. The extensive duplication described here supports the proposal that the more recent of these duplications, estimated to have occurred ~112 Myr ago, was the result of a polyploidization event.

The lineages of Arabidopsis and rice diverged ~200 Myr ago (ref. 42). Three regions of the genome of Arabidopsis were related to each other and to one region in the rice genome, providing further evidence for multiple duplication events (refs 43, 44). The frequent occurrence of tandem gene duplications and the apparent deletion of single genes, or small groups of adjacent genes, from duplicated regions suggests that unequal crossing over may be a key mechanism affecting the evolution of plant genome microstructure. However, the segmental inversions and gene translocations in the genomes of both rice and B. oleracea that are not found in Arabidopsis indicate that additional mechanisms may be involved (ref. 40).

Integration of the three genomes in the plant cell

The three genomes in the plant cell, those of the nucleus, the plastids (chloroplasts) and the mitochondria, differ markedly in gene number, organization and stability. Plastid genes are densely packed in an order highly conserved in all plants (ref. 45), whereas mitochondrial genes (ref. 46) are widely dispersed and subjected to extensive recombination. Organellar genomes are remnants of independent organisms: plastids are derived from the cyanobacterial lineage and mitochondria from the α-Proteobacteria.
The remaining genes in plastids include those that encode subunits of the photosystem and the electron transport chain, whereas the genes in mitochondria encode essential subunits of the respiratory chain. Both organelles contain sets of specific membrane proteins that, together with housekeeping proteins, account for 61% of the genes in the chloroplast and 88% in the mitochondrion (Table 4). The remainder are involved in transcription and translation. The number of proteins encoded in the nucleus likely to be found in organelles was predicted using default settings on TargetP (Table 1). Many nuclear gene products that are targeted to either (or both) organelles were originally encoded in the organelle genomes and were transferred to the nuclear genome during evolutionary history. A large number also appear to be of eukaryotic origin, with functions such as protein import components, which were probably not required by the free-living ancestors of the endosymbionts.

Table 3 Arabidopsis genes with similarities to human disease genes

Human disease gene | E value | Gene code | Arabidopsis hit
Darier–White, SERCA | 5.9 × 10^-272 | T27I1_16 | Putative calcium ATPase
Xeroderma Pigmentosum, D-XPD | 7.2 × 10^-228 | F15K9_19 | Putative DNA repair protein
Xeroderma pigment, B-ERCC3 | 9.6 × 10^-214 | AT5g41360 | DNA excision repair cross-complementing protein
Hyperinsulinism, ABCC8 | 7.1 × 10^-188 | F20D22_11 | Multidrug resistance protein
Renal tubul. acidosis, ATP6B1 | 1.0 × 10^-182 | AT4g38510 | Probable H+-transporting ATPase
HDL deficiency 1, ABCA1 | 2.4 × 10^-181 | At2g41700 | Putative ABC transporter
Wilson, ATP7B | 7.6 × 10^-181 | AT5g44790 | ATP-dependent copper transporter
Immunodeficiency, DNA Ligase 1 | 8.2 × 10^-172 | T6D22_10 | DNA ligase
Stargardt's, ABCA4 | 2.8 × 10^-168 | At2g41700 | Putative ABC transporter
Ataxia telangiectasia, ATM | 3.1 × 10^-168 | AT3g48190 | Ataxia telangiectasia mutated protein AtATM
Niemann–Pick, NPC1 | 1.2 × 10^-166 | F7F22_1 | Niemann–Pick C disease protein-like protein
Menkes, ATP7A | 1.1 × 10^-153 | F2K11_17 | ATP-dependent copper transporter, putative
HNPCC*, MLH1 | 1.5 × 10^-150 | AT4g09140 | MLH1 protein
Deafness, hereditary, MYO15 | 2.7 × 10^-150 | At2g31900 | Putative unconventional myosin
Fam, cardiac myopathy, MYH7 | 6.5 × 10^-147 | T1G11_14 | Putative myosin heavy chain
Xeroderma Pigmentosum, F-XPF | 1.4 × 10^-146 | AT5g41150 | Repair endonuclease (gb|AAF01274.1)
G6PD deficiency, G6PD | 7.6 × 10^-137 | AT5g40760 | Glucose-6-phosphate dehydrogenase
Cystic fibrosis, ABCC7 | 2.3 × 10^-135 | AT3g62700 | ABC transporter-like protein
Glycerol kinase defic, GK | 7.9 × 10^-135 | T21F11_21 | Putative glycerol kinase
HNPCC, MSH3 | 6.6 × 10^-134 | AT4g25540 | Putative DNA mismatch repair protein
HNPCC, PMS2 | 5.1 × 10^-128 | AT4g02460 | No title
Zellweger, PEX1 | 4.1 × 10^-125 | AT5g08470 | Putative protein
HNPCC, MSH6 | 9.6 × 10^-122 | AT4g02070 | G/T DNA mismatch repair enzyme
Bloom, BLM | 4.4 × 10^-109 | T19D16_15 | DNA helicase isolog
Finnish amyloidosis, GSN | 2.2 × 10^-107 | AT5g57320 | Villin
Chediak–Higashi, CHS1 | 5.8 × 10^-99 | F10O3_11 | Putative transport protein
Xeroderma Pigmentosum, G-XPG | 7.1 × 10^-89 | AT3g28030 | Hypothetical protein
Bare lymphocyte, ABCB3 | 1.3 × 10^-84 | AT5g39040 | ABC transporter-like protein
Citrullinemia, type I, ASS | 3.2 × 10^-83 | AT4g24830 | Argininosuccinate synthase-like protein
Coffin–Lowry, RPS6KA3 | 5.2 × 10^-81 | AT3g08720 | Putative ribosomal-protein S6 kinase (ATPK19)
Keratoderma, KRT9 | 8.5 × 10^-81 | AT3g17050 | Unknown protein
Myotonic dystrophy, DM1 | 1.4 × 10^-76 | At2g20470 | Putative protein kinase
Bartter's, SLC12A1 | 1.6 × 10^-75 | F26G16_9 | Cation-chloride co-transporter, putative
Dents, CLCN5 | 3.3 × 10^-74 | AT5g26240 | CLC-d chloride channel protein
Diaphanous 1, DAPH1 | 1.9 × 10^-73 | 68069_m00158 | Hypothetical protein
AKT2 | 6.9 × 10^-72 | AT3g08730 | Putative ribosomal-protein S6 kinase (ATPK6)

To identify nuclear genes of possible organellar ancestry, we compared all predicted Arabidopsis proteins to all proteins from completed genomes including those from plastids and mitochondria (Supplementary Information Table 2). This search identified proteins encoded by the Arabidopsis nuclear genome that are most similar to proteins encoded by other species' organelle genomes (14 mitochondrial and 44 plastid). These represent organelle-to-nuclear gene transfers that have occurred sometime after the divergence of the organelle-containing lineages (ref. 47). There is a great excess of nuclear encoded proteins most similar to proteins from the cyanobacterium Synechocystis (Supplementary Information Fig. 4; 806 Arabidopsis predicted proteins matching 404 different Synechocystis proteins, providing further evidence of a genome duplication). These 806 Arabidopsis predicted proteins, and many others of greatly diverse function, are possibly of plastid descent. Through searches against proteins from other cyanobacteria (with incompletely sequenced genomes), we identified 69 additional genes of possibly plastid descent. Only 25% of these putatively plastid-derived proteins displayed a target peptide predicted by TargetP, indicating potential cytoplasmic functions for most of these genes.
The difference between predicted plastid-targeted and predicted plastid-derived genes indicates that there is a probable overestimation by ab initio targeting prediction methods and a lack of resolution with respect to destination organelles, the possible extensive divergence of some endosymbiont-derived genes in the nuclear genome, the co-opting of nuclear genes for targeting to organelles, and cytoplasmic functions for cyanobacteria-derived proteins. Clearly, more refined tools and extensive experimentation are required to catalogue plastid proteins.

The transfer of genes between genomes still continues (Supplementary Information Table 3). Plastid DNA insertions in the nucleus (17 insertions totalling 11 kb) contain full-length genes encoding proteins or tRNAs, fragments of genes and an intron as well as intergenic regions. Subsequent reshuffling in the nucleus is illustrated by the atpH gene, which was originally transferred completely, but is now in two pieces separated by 2 kb. The 13 small mitochondrial DNA insertions total 7 kb in addition to the large insertion close to the centromere of chromosome 2 (ref. 3). The high level of recombination in the mitochondrial genome may account for these events.

Table 4 General features of genes encoded by the three genomes in Arabidopsis

Feature | Nucleus/cytoplasm | Plastid | Mitochondria
Genome size | 125 Mb | 154 kb | 367 kb
Genome equivalent/cell | 2 | 560 | 26
Duplication | 60% | 17% | 10%
Number of protein genes | 25,498 | 79 | 58
Gene order | Variable, but syntenic | Conserved | Variable
Density (kb per protein gene) | 4.5 | 1.2 | 6.25
Average coding length | 1,900 nt | 900 nt | 860 nt
Genes with introns | 79% | 18.4% | 12%
Genes/pseudogenes | 1/0.03 | 1/0 | 1/0.2–0.5
Transposons (% of total genome size) | 14% | 0% | 4%

Transposable elements

Transposons, which were originally identified in maize by Barbara McClintock, have been found in all eukaryotes and prokaryotes. A subset of transposons replicate through an RNA intermediate (class I), whereas others move directly through a DNA form (class II). Transposons are further classified by similarity either between their mobility genes or between their terminal and/or internal motifs, as well as by the size and sequence of their target site. Internally deleted elements can often be mobilized in trans by fully functional elements.

Transposons in Arabidopsis account for at least 10% of the genome, or about one-fifth of the intergenic DNA. The Arabidopsis genome has a wealth of class I (2,109) and II (2,203) elements, including several new groups (1,209 elements; Supplementary Information Table 4). Mobile histories for many elements were obtained by identifying regions of the genome with significant similarity to 'empty' target sites (RESites), thus providing high-resolution information concerning the termini and target site duplications (refs 48, 49). These regions were readily detected because of the propensity of transposons to integrate into repeats and because of duplications in the genome sequence. In several cases, genes appear to have been included as 'passengers' in transposable units (ref. 48). In some cases, shared sequence similarity, coding capacity and RESites attest to recent activity of transposable elements in the Arabidopsis genome. Only about 4% of the complete elements identified correspond to an EST, however, suggesting that most are not transcribed.
Transposable elements found in many other plant genomes are well represented in Arabidopsis, including copia- and gypsy-like long terminal repeat (LTR) retrotransposons, long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), hobo/Activator/Tam3 (hAT)-like elements, CACTA-like elements and miniature inverted-repeat transposable elements (MITEs). Although usually small in size, some larger Tourist-like MITEs contain open reading frames (ORFs) with similarity to the transposases of bacterial insertion sequences (ref. 48). Basho and many Mutator-like elements (MULEs), first discovered in the Arabidopsis sequence, represent structurally unique transposons (refs 48–50). Basho elements have a target site preference for mononucleotide 'A' and wide distribution among plants (refs 48, 51). MULEs exhibit a high level of sequence diversity and members of most groups lack long terminal inverted repeats (TIRs). Phylogenetic analysis of the Arabidopsis MURA-like transposases suggests that TIR-containing MULEs are more closely related to one another than to MULEs lacking TIRs (refs 49, 52).

For many plants with large genomes, class I retrotransposons contribute most of the nucleotide content (ref. 53). In the small Arabidopsis genome, class I elements are less abundant and primarily occupy the centromere. In contrast, Basho elements and class II transposons such as MITEs and MULEs predominate on the periphery of pericentromeric domains (Fig. 5). In class II transposons, MULEs and CACTA elements are clustered near centromeres and heterochromatic knobs, whereas MITEs and hAT elements have a less pronounced bias. The distribution pattern of transposable elements observed in Arabidopsis may reflect different types of pericentromeric heterochromatin regions and may be similar to those found in animals.

Numerous centromeric satellite repeats are located between each chromosome arm and have not yet been sequenced, but are represented in part by unanchored BAC contigs (R. Martienssen and M.
Marra, unpublished data). End sequence suggests that these domains contain many more class I than class II elements, consistent with the distribution reported here (K. Lemcke and R. Martienssen, unpublished data). We do not know the significance of the apparent paucity of elements in telomeric regions and in the region flanking the rDNA repeats on chromosome 4 (but not on chromosome 2). Overall, transposon-rich regions are relatively gene-poor and have lower rates of recombination and EST matches, indicating a correlation between low gene expression, high transposon density and low recombination (ref. 51). The role of transposons in genome organization and chromosome structure can now be addressed in a model organism known to undergo DNA methylation and other forms of chromatin modification thought to regulate transposition (ref. 52).

rDNA, telomeres and centromeres

Nucleolar organizers (NORs) contain arrays of unit repeats encoding the 18S, 5.8S and 25S ribosomal RNA genes and are transcribed by RNA polymerase I. Together with 5S RNA, which is transcribed by RNA polymerase III, these rRNAs form the structural and catalytic cores of cytoplasmic ribosomes. In Arabidopsis, the NORs juxtapose the telomeres of chromosomes 2 and 4, and comprise uninterrupted 18S, 5.8S and 25S units all orientated on the chromosomes in the same direction (ref. 54). In contrast, the 5S rRNA genes are localized to heterogeneous arrays in the centromeric regions of chromosomes 3, 4 and 5 (ref. 55; and Fig. 6). Both NORs are roughly 3.5–4.0 megabase-pairs and comprise ~350–400 highly methylated rRNA gene units, each ~10 kb (ref. 54). The sequence between the euchromatic arms and NORs has been determined. Elsewhere in the genome, only one other 18S, 5.8S, 25S rRNA gene unit was identified in centromere 3.
Although minor variations in sequence length and composition occur in the NOR repeats, these variants are highly clustered, supporting a model of sequence maintenance through concerted evolution (ref. 55).

Arabidopsis telomeres are composed of CCCTAAA repeats and average ~2–3 kb (ref. 56). For TEL4N (telomere 4 North), consensus repeats are adjacent to the NOR; the remaining telomeres are typically separated from coding sequences by repetitive subtelomeric regions measuring less than 4 kb. Imperfect telomere-like arrays of up to 24 kb are found elsewhere in the genome, particularly near centromeres. These arrays might affect the expression of nearby genes and may have resulted from ancient rearrangements, such as inversions of the chromosome arms.

Figure 5 Distribution of class I, II and Basho transposons in Arabidopsis chromosomes. The frequency of class I retroelements (green), class II DNA transposons (blue) and Basho elements (purple) is shown at 100-kb intervals along the five chromosomes (a–e) of Arabidopsis. Panels are labelled Chr. 1 to Chr. 5; the axes are frequency and position (Mb).

Centromere DNA mediates chromosome attachment to the meiotic and mitotic spindles and often forms dense heterochromatin. Genetic mapping of the regions that confer centromere function provided the markers necessary to precisely place BAC clones at individual centromeres (ref. 17); 69 clones were targeted for sequencing, resulting in over 5 Mb of DNA sequence from the centromeric regions. The unsequenced regions of centromeres are composed primarily of long, homogeneous arrays that were characterized previously with physical (ref. 57) and genetic mapping (ref. 17) and contain over 3 Mb of repetitive arrays, including the 180-bp repeats and 5S rDNA (ref. 51; Fig. 6).
Arabidopsis centromeres, like those of many higher eukaryotes, contain numerous repetitive elements including retroelements, transposons, microsatellites and middle repetitive DNA (ref. 17). These repeats are rare in the euchromatic arms and often most abundant in pericentromeric DNA. The repeats, affinity for DNA-binding dyes, dense methylation patterns and inhibition of homologous recombination indicate that the centromeric regions are highly heterochromatic, and such regions are generally viewed as very poor environments for gene expression. Unexpectedly, we found at least 47 expressed genes encoded in the genetically defined centromeres of Arabidopsis (http://preuss.bsd.uchicago.edu/arabidopsis.genome.html). In several cases, these genes reside on islands of unique sequence flanked by repetitive arrays, such as 180-bp or 5S rDNA repeats. Among the genes encoded in the centromeres are members of 11 of the 16 functional categories that comprise the proteome. The centromeres are not subject to recombination; consequently, genes residing in these regions probably exhibit unique patterns of molecular evolution.

The function of higher eukaryotic centromeres may be specified by proteins that bind to centromere DNA, by epigenetic modifications, or by secondary or higher order structures. A pairwise comparison of the non-repetitive portions of all five centromeres showed they share limited (1–7%) sequence similarity. Forty-one families of small, conserved centromere sequences (AtCCS, see http://preuss.bsd.uchicago.edu/arabidopsis.genome.html) are enriched in the centromeric and pericentromeric regions and differ from sequences found in the centromeres of other eukaryotes. Molecular and genetic assays will be required to determine whether these conserved motifs nucleate Arabidopsis centromere activity. Apart from the AtCCS sequences, most centromere DNA is not shared between chromosomes, complicating efforts to derive clear evolutionary relationships.
In contrast, genetic and cytological assays indicate that homologous centromeres are highly conserved among Arabidopsis accessions, albeit subject to rearrangements such as inversions to form knobs (refs 5, 58, 59) and insertions (ref. 4). Further investigation of centromere DNA promises to yield information on the evolutionary forces that act in regions of limited recombination, as well as an improved understanding of the role of DNA sequence patterns in chromosome segregation.

Membrane transport

Transporters in the plasma and intracellular membranes of Arabidopsis are responsible for the acquisition, redistribution and compartmentalization of organic nutrients and inorganic ions, as well as for the efflux of toxic compounds and metabolic end products, energy and signal transduction, and turgor generation. Previous genomic analyses of membrane transport systems in S. cerevisiae and C. elegans led to the identification of over 100 distinct families of membrane transporters (refs 60, 61). We compared membrane transport processes between Arabidopsis, animals, fungi and prokaryotes, and identified over 600 predicted membrane transport systems in Arabidopsis (http://www-biology.ucsd.edu/~ipaulsen/transport/), a similar number to that of C. elegans (~700 transporters) and over twofold greater than either S. cerevisiae or E. coli (~300 transporters).

We compared the transporter complement of Arabidopsis, C. elegans and S. cerevisiae in terms of energy coupling mechanisms (Fig. 7a). Unlike animals, which use a sodium ion P-type ATPase pump to generate an electrochemical gradient across the plasma membrane, plants and fungi use a proton P-type ATPase pump to form a large membrane potential (−250 mV) (ref. 62). Consequently, plant secondary transporters are typically coupled to protons rather than to sodium (ref. 63). Compared with C.
elegans, Arabidopsis has a surprisingly high percentage of primary ATP-dependent transporters (21% of its transporters, against 12% in C. elegans), reflecting increased numbers of P-type ATPases involved in metal ion transport and ABC ATPases proposed to be involved in sequestering unusual metabolites and drugs in the vacuole or in other intracellular compartments. These processes may be necessary for pathogen defence and nutrient storage.

About 15% of the transporters in Arabidopsis are channel proteins, five times more than in any single-celled organism but half the number in C. elegans (Fig. 7b). Almost half of the Arabidopsis channel proteins are aquaporins, and Arabidopsis has 10-fold more major intrinsic protein (MIP) family water channels than any other sequenced organism. This abundance emphasizes the importance of hydraulics in a wide range of plant processes, including sugar and nutrient transport into and out of the vasculature, opening of stomatal apertures, cell elongation and epinastic movements of leaves and stems. Although Arabidopsis has a diverse range of metal cation transporters, C. elegans has more, many of which function in cell–cell signalling and nerve signal transduction. Arabidopsis also possesses transporters for inorganic anions such as phosphate, sulphate, nitrate and chloride, as well as for metal cation channels that serve in signal transduction or cell homeostasis. Compared with other sequenced organisms, Arabidopsis has 10-fold more predicted peptide transporters, primarily of the proton-dependent oligopeptide transport (POT) family, emphasizing the importance of peptide transport or indicating that there is broader substrate specificity than previously realized. There are nearly 1,000 Arabidopsis genes encoding Ser/Thr protein kinases, suggesting that peptides may have an important role in plant signalling (ref. 64). Virtually no transporters for carboxylates, such as lactate and pyruvate, were identified in the Arabidopsis genome.
About 12% of the transporters were predicted to be sugar transporters, mostly consisting of paralogues of the MFS family of hexose transporters. Notably, S. cerevisiae, C. elegans and most prokaryotes use APC family transporters as their principal means of amino-acid transport, but Arabidopsis appears to rely primarily on the AAAP family of amino-acid and auxin transporters. More than 10% of the transporters in Arabidopsis are homologous to drug efflux pumps; these probably represent transporters involved in the sequestration into vacuoles of xenobiotics, secondary metabolites, and breakdown products of chlorophyll. Surprisingly, Arabidopsis has close homologues of the human ABC TAP transporters of antigenic peptides for presentation to the major histocompatibility complex (MHC). In Arabidopsis, these transporters may be involved in peptide efflux, or more speculatively, in some form of cell-recognition response. Arabidopsis also has 10-fold more members of the multi-drug and toxin extrusion (MATE) family than any other sequenced organism; in bacteria, these transporters function as drug efflux pumps. Curiously, Arabidopsis has several homologues of the Drosophila RND transporter family Patched protein, which functions in segment polarity, and more than ten homologues of the Drosophila ABC family eye pigment transporters. In plants, these are presumably involved in intracellular sequestration of secondary metabolites.

DNA repair and recombination

DNA repair and recombination pathways have many functions in different species such as maintaining genomic integrity, regulating mutation rates, chromosome segregation and recombination, genetic exchange within and between populations, and immune system development. Comparing the Arabidopsis genome with other species (ref. 65) indicates that Arabidopsis has a similar set of DNA repair and recombination (RAR) genes to most other eukaryotes.
The pathways represented include photoreactivation, DNA ligation, non-homologous end joining, base excision repair, mismatch excision repair, nucleotide excision repair and many aspects of DNA recombination (Supplementary Information Table 5). The Arabidopsis RAR genes include homologues of many DNA repair genes that are defective in different human diseases (for example, hereditary breast cancer and non-polyposis colon cancer, xeroderma pigmentosum and Cockayne's syndrome).

One feature that sets Arabidopsis apart from other eukaryotes is the presence of additional homologues of many RAR genes. This is seen for almost every major class of DNA repair, including recombination (four RecA), DNA ligation (four DNA ligase I), photoreactivation (one class II photolyase and five class I photolyase homologues) and nucleotide excision repair (six RPA1, two RPA2, two Rad25, three TFB1 and four Rad23). This is most striking for genes with probable roles in base excision repair. Arabidopsis encodes 16 homologues of DNA base glycosylases (enzymes that recognize abnormal DNA bases and cleave them from the sugar-phosphate backbone), more than any other species known. This includes several homologues of each of three families of alkylation damage base glycosylases: two of the S. cerevisiae MPG; six of the E. coli TagI; and two of the E. coli AlkA. Arabidopsis also encodes three homologues of the apurinic-apyrimidinic (AP) endonuclease Xth. AP endonucleases continue the base excision repair started by glycosylases by cleaving the DNA backbone at abasic sites.

Figure 6 Predicted centromere composition. Genetically defined centromere boundaries are indicated by filled circles; fully and partially assembled BAC sequences are represented by solid and dashed black lines, respectively. Estimates of repeat sizes within the centromeres were derived from consideration of repeat copy number, physical mapping and cytogenetic assays. Panels are labelled CEN1 to CEN5; key entries are 180 bp, 160 bp, mitochondrial and 5S rDNA; the scale bar is 100 kb.

Evolutionary analysis indicates that some of the extra copies of RAR genes in Arabidopsis originated through relatively recent gene duplications, because many of the sets of genes are more closely related to each other than to their homologues in any other species. As duplication is frequently accompanied by functional divergence, the duplicate (paralogous) genes may have different repair specificities or may have evolved functions that are outside RAR functions (as is the case for two of the five class I photolyase homologues, which function as blue-light receptors). In most cases, it is not known whether the paralogous gene copies have different functions. The presence of multiple paralogues might also allow functional redundancy or a greater repair or recombination capacity.

The multiplicity of RAR genes in Arabidopsis is also partly due to the transfer of genes from the organellar genomes to the nucleus. Repair gene homologues that appear to be of chloroplast origin (Supplementary Information Tables 2 and 5) include the recombination proteins RecA, RecG and SMS, two class I photolyase homologues, Fpg, two MutS2 proteins, and the transcription-repair coupling factor Mfd. Two of these (RecA and Fpg) are involved in RAR functions in the plastid, suggesting that the others may be as well. The finding of an Mfd orthologue of cyanobacterial descent is surprising. In E.
coli, Mfd couples nucleotide excision repair carried out by UvrABC to transcription, leading to the rapid repair of DNA damage on the transcribed strand of transcribed genes66. The absence of orthologues of UvrABC in Arabidopsis renders the function of Mfd difficult to predict. The presence of Mfd but not UvrABC has been reported for only one other species, a bacterial endosymbiont of the pea aphid. Other nuclear-encoded Arabidopsis DNA repair gene homologues are evolutionarily related to genes from α-Proteobacteria, and thus may be of mitochondrial descent. In particular, the six homologues of the alkyl-base glycosylase TagI appear to be the result of a large expansion in plants after transfer from the mitochondrial genome. Whether any of these TagI homologues function in the repair and maintenance of mitochondrial DNA has not been determined. More detailed phylogenetic analysis may reveal additional Arabidopsis RAR genes to be of organellar ancestry. There are some notable absences of proteins important for RAR in other species, including alkyltransferases, MSH4, RPA3 and many components of TFIIH (TFB2, TFB3, TFB4, CCL1, Kin28). Nevertheless, Arabidopsis shows many similarities to the set of DNA repair genes found in other eukaryotes, and therefore offers an experimental system for determining the functions of many of these proteins, in part through characterization of mutants defective in DNA repair67.

Gene regulation

Eukaryotic gene expression involves many nuclear proteins that modulate chromatin structure, contribute to the basal transcription machinery, or mediate gene regulation in response to developmental, environmental or metabolic cues. As predicted by sequence similarity, more than 3,000 such proteins may be encoded by the Arabidopsis genome, suggesting that it has a comparable complexity of gene regulation to other eukaryotes.
Arabidopsis has an additional level of gene regulation, however, with DNA methylation potentially mediating gene silencing and parental imprinting. Plants have evolved several variations on chromatin remodelling proteins, such as the family of HD2 histone deacetylases68. Although Arabidopsis possesses the usual number of SNF2-type chromatin remodelling ATPases, which regulate the expression of nearly all genes, there are significant structural differences between yeast and metazoan SNF2-type genes and their orthologues in Arabidopsis. DDM1, a member of the SNF2 superfamily, and MOM1, a gene with similarity to the SNF2 family, are involved in transcriptional gene silencing in Arabidopsis. MOM1 has no clear orthologue in fungal or metazoan genomes. Consistent with its methylated DNA, Arabidopsis possesses eight DNA methyltransferases (DMTs). Two of the three types are orthologous to mammalian DMT69 whereas one, chromomethyltransferase70, is unique to plants. No DMTs are found in yeast or C. elegans, although two DMT-like genes are found in Drosophila71. Arabidopsis also encodes eight proteins with methyl-DNA-binding domains (MBDs). Despite lacking methylated DNA, Drosophila encodes four MBD proteins and C. elegans has two. These differences in chromatin components are likely to reflect important differences in chromatin-based regulatory control of gene expression in eukaryotes (Supplementary Information Table 6; http://Ag.Arizona.Edu/chromatin/chromatin.html). The Arabidopsis genome encodes transcription machinery for the three nuclear DNA-dependent RNA polymerase systems typical of eukaryotes (Supplementary Information Table 6). Transcription by RNA polymerases II and III appears to involve the same machinery as is used in other eukaryotes; however, most transcription factors for RNA polymerase I are not readily identified.
Only two polymerase I regulators (other than polymerase subunits and TATA-binding protein) are apparent in Arabidopsis, namely homologues of yeast RRN3 and mouse TTF-1. All eukaryotes examined to date have distinct genes for the largest and second largest subunits of polymerase I, II and III. Unexpectedly, Arabidopsis has two genes encoding a fourth class of largest subunit and second-largest subunit (Supplementary Information Fig. 5). It will be interesting to determine whether the atypical subunits comprise a polymerase that has a plant-specific function. Four genes encoding single-subunit plastid or mitochondrial RNA polymerases have been identified in Arabidopsis (Supplementary Information Table 6). Genes for the bacterial β-, β′- and α-subunits of RNA polymerase are also present, as are homologues of various σ-factors, and these proteins may regulate chloroplast gene expression. Mutations in the Sde-1 gene, encoding RNA-dependent RNA polymerase (RdRp), lead to defective post-transcriptional gene silencing72. We also identified five more closely related RdRp genes. Our analysis, using both similarity searches and domain matches, has identified 1,709 proteins with significant similarity to known classes of plant transcription factors classified by conserved DNA-binding domains. This analysis used a consistent conservative threshold that probably underestimates the size of families of diverse sequence. This class of protein is the least conserved among all classes of known proteins, showing only 8–23% similarity to transcription factors in other eukaryotes (Fig. 2b). This reduced similarity is due to the absence of certain classes of transcription factors in Arabidopsis and large numbers of plant-specific transcription factors.
We did not detect any members of several widespread families of transcription factors, such as the REL (Rel-like DNA-binding domain) homology region proteins, nuclear steroid receptors and forkhead-winged helix and POU (Pit-1, Oct and Unc-86) domain families of developmental regulators. Conversely, of 29 classes of Arabidopsis transcription factors, 16 appear to be unique to plants (Supplementary Information Table 6). Several of these, such as the AP2/EREBP-RAV, NAC and ARF-AUX/IAA families, contain unique DNA-binding domains, whereas others contain plant-specific variants of more widespread domains, such as the DOF and WRKY zinc-finger families and the two-repeat MYB family. Functional redundancy among members of large families of closely related transcription factors in Arabidopsis is a significant potential barrier to their characterization73. For example, in the SHATTERPROOF and SEPALLATA families of MADS box transcription factors, all genes must be defective to produce visible mutant phenotypes74,75. These functionally redundant genes are found on the segmental duplications described above. Our analyses, together with the significant sequence similarity found in large families of transcription factors such as the R2R3-repeat MYB and WRKY families, suggest that strategies involving overexpression will be important in determining the functions of members of transcription factor families. Arabidopsis has two times or over three times more transcription factors than have been identified in Drosophila29 or C. elegans1, respectively. The significantly greater extent of segmental chromosomal and local tandem duplications in the Arabidopsis genome generates larger gene families, including transcription factors.
The partly overlapping functions defined for a few transcription factors are also likely to be much more widespread, implicating many sequence-related transcription factors in the same cellular processes. Finally, the expanded number of genes involved in metabolism, defence and environmental interaction in Arabidopsis (Fig. 2a), which have few counterparts in Drosophila and C. elegans, all require additional numbers and classes of transcription factors to integrate gene function in response to a vast range of developmental and environmental cues.

Cellular organization

Plant cells differ from animal cells in many features such as plastids, vacuoles, Golgi organization, cytoskeletal arrays, plasmodesmata linking cytoplasms of neighbouring cells, and a rigid polysaccharide-rich extracellular matrix, the cell wall. Because the cell wall maintains the position of a cell relative to its neighbours, both changes in cell shape and organized cell divisions, involving cytoskeleton reorganization and membrane vesicle targeting, have major roles in plant development. Plant cytokinesis is also unique in that the partitioning membrane is formed de novo by vesicle fusion. We compared the Arabidopsis genome with those of C. elegans, Drosophila and yeast to glimpse the genetic basis of plant-cell-specific features.

Figure 7 Comparison of the transport capabilities of Arabidopsis, C. elegans and S. cerevisiae. Pie charts show the percentage of transporters in each organism according to bioenergetics (a) and substrate specificity (b). Organisms shown: A. thaliana, C. elegans, S. cerevisiae. Categories in a: Channels; Secondary transport; Primary transport; Uncharacterized. Categories in b: Cations (inorganic); Amines, amides and polyamines; Anions (inorganic); Peptides; Water; Sugars and derivatives; Carboxylates; Amino acids; Bases and derivatives; Vitamins and cofactors; Drugs and toxins; Macromolecules; Unknown.
The principal components of the plant cytoskeleton are microtubules (MTs) and actin filaments (AFs); intermediate filaments (IFs) have not been described in plants. Arabidopsis appears to lack genes for cytokeratin or vimentin, the main components of animal IFs, but has several variants of actin, α- and β-tubulin. The Arabidopsis genome also encodes homologues of chaperones that mediate the folding of tubulin and actin polypeptides in yeast and animal cells, such as the prefoldin and cytosolic chaperonin complexes and tubulin-folding cofactors. The dynamic stability of MTs and AFs is influenced by MT-associated proteins and actin-binding proteins, respectively, several of which are encoded by Arabidopsis genes. These include the MT-severing ATPase katanin, AF-crosslinking/bundling proteins, such as fimbrins and villins, and AF-disassembling proteins, such as profilin and actin-depolymerizing factor/cofilin. The Arabidopsis proteome appears to lack homologues of proteins that, in animal cells, link the actin cytoskeleton across the plasma membrane to the extracellular matrix, such as integrin, talin, spectrin, α-actinin, vitronectin or vinculin. This apparent lack of 'anchorage' proteins is consistent with the different composition of the cell wall and with a prominence of cortical MTs at the expense of cortical AFs in plant cells. Plant-specific cytoskeletal arrays include interphase cortical MTs mediating cell shape, the preprophase band marking the cortical site of cell division, and the phragmoplast assisting in cytokinesis76. Although plant cells lack structural counterparts of the yeast spindle pole body and the animal centrosome, Arabidopsis has homologues of core components of the MT-nucleating γ-tubulin ring complex, such as γ-tubulin, Spc97/hGCP2 and Spc98/hGCP3.
Arabidopsis has numerous motor molecules, both kinesins and dyneins with associated dynactin complex proteins, which are presumably involved in the dynamic organization of MTs and in transporting cargo along MT tracks. There are also myosin motors that may be involved in AF-supported organelle trafficking. Essential features of the eukaryotic cytoskeleton appear to be conserved in Arabidopsis. The Arabidopsis genome encodes homologues of proteins involved in vesicle budding, including several ARFs and ARF-related small G-proteins, large but not small ARF GEFs (guanine nucleotide exchange factors for ADP-ribosylation factor), adapter proteins, and coat proteins of the COP and non-COP types. Arabidopsis also has homologues of proteins involved in vesicle docking and fusion, including SNAP receptors (SNAREs), N-ethylmaleimide-sensitive factor (NSF) and Cdc48-related ATPases, accessory proteins such as Sec1 and soluble NSF attachment protein (SNAP), and Rab-type GTPases. The large number of Arabidopsis SNAREs can be grouped by sequence similarity to yeast and animal counterparts involved in specific trafficking pathways, and some have been localized to the trans-Golgi and the pre-vacuolar pathway77. Arabidopsis also has a receptor for retention of proteins in the endoplasmic reticulum, a cargo receptor for transport to the vacuole and several phragmoplastins related to animal dynamin GTPases. Thus, plant cells appear to use the same basic machinery for vesicle trafficking as yeast and animal cells. Animal cells possess many functionally diverse small G-proteins of the Ras superfamily involved in signal transduction, AF reorganization, vesicle fusion and other processes. Surprisingly, Arabidopsis appears to lack genes for G-proteins of the Ras, Rho, Rac and Cdc42 subfamilies but has many Rab-type G-proteins involved in vesicle fusion and several Rop-type G-proteins, one of which has a role in actin organization of the tip-growing pollen tube78.
The significance of this divergent amplification of different subfamilies of small G-proteins in plants and animals remains to be determined. Arabidopsis possesses cyclin-dependent kinases (CDKs), including a plant-specific Cdc2b kinase expressed in a cell-cycle-dependent manner, several cyclin subtypes, including a D-type cyclin that mediates cytokinin-stimulated cell-cycle progression79, a retinoblastoma-related protein and components of the ubiquitin-dependent proteolytic pathway of cyclin degradation. In yeast and animal cells, chromosome condensation is mediated by condensins, sister chromatids are held together by cohesins such as Scc1, and metaphase–anaphase transition is triggered by separin/Esp1 endopeptidase proteolysis of Scc1 on APC-mediated degradation of its inhibitor, securin/Pds1. Related proteins are encoded by the Arabidopsis genome. Thus, the basic machinery of cell-cycle progression, genome duplication and segregation appears to be conserved in plants. By contrast, entry into M phase, M-phase progression and cytokinesis seem to be modified in plant cells. Arabidopsis does not appear to have homologues of Cdc25 phosphatase, which activates Cdc2 kinase at the onset of mitosis, or of polo kinase, which regulates M-phase progression in yeast and animals. Conversely, plant-specific mitogen-activated protein (MAP) kinases appear to be involved in cytokinesis. Cytokinesis partitions the cytoplasm of the dividing cell. Yeast and animal cells expand the membrane from the surface towards the centre in a cleavage process supported by septins and a contractile ring of actin and type II myosin. By contrast, plant cytokinesis starts in the centre of the division plane and progresses laterally. A transient membrane compartment, the cell plate, is formed de novo by fusion of Golgi-derived vesicles trafficking along the phragmoplast MTs80.
Consistent with the unique mode of plant cytokinesis, Arabidopsis appears to lack genes for septins and type II myosin. Conversely, cell-plate formation requires a cytokinesis-specific syntaxin that has no close homologue in yeast and animals. Although syntaxin-mediated membrane fusion occurs in animal cytokinesis and cellularization, the vesicles are delivered to the base of the cleavage furrow. Thus, the plant-specific mechanism of cell division is linked to conserved eukaryotic cell-cycle machinery. Two main conclusions are suggested by this comparative analysis. First, Arabidopsis and other eukaryotic cells have common features related to intracellular activities, such as vesicle trafficking, cytoskeleton and cell cycle. Second, evolutionarily divergent features, such as organization of the cytoskeleton and cytokinesis, appear to relate to the plant cell wall.

Development

The regulation of development in Arabidopsis, as in animals, involves cell–cell communication, hierarchies of transcription factors, and the regulation of chromatin state; however, there is no reason to suppose that the complex multicellular states of plant and animal development have evolved by elaborating the same general processes during the 1.6 billion years since the last common unicellular ancestor of plants and animals81,82. Our genome analyses reflect the long, independent evolution of many processes contributing to development in the two kingdoms. Plants and animals have converged on similar processes of pattern formation, but have used and expanded different transcription factor families as key causal regulators. For example, segmentation in insects and differentiation along the anterior–posterior and limb axes in mammals both involve the spatially specific activation of a series of homeobox gene family members. The pattern of activation is causal in the later differentiation of body and limb axis regions.
In plants the pattern of floral whorls (sepals, petals, stamens, carpels) is also established by the spatially specific activation of members of a family of transcription factors, but in this instance the family is the MADS box family. Plants also have homeobox genes and animals have MADS box genes, implying that each lineage invented separately its mechanism of spatial pattern formation, while converging on actions and interactions of transcription factors as the mechanism. Other examples show even greater divergence of plant and animal developmental control. Examples are the AP2/EREBP and NAC families of transcription factors, which have important roles in flower and meristem development; both families are so far found only in plants (Supplementary Information Table 6). A similar story can be told for cell–cell communication. Plants do not seem to have receptor tyrosine kinases, but the Arabidopsis genome has at least 340 genes for receptor Ser/Thr kinases, belonging to many different families, defined by their putative extracellular domains (Supplementary Information Table 7). Several families have members with known functions in cell–cell communication, such as the CLV1 receptor involved in meristem cell signalling, the S-glycoprotein homologues involved in signalling from pollen to stigma in self-incompatible Brassica species, and the BRI1 receptor necessary for brassinosteroid signalling83. Animals also have receptor Ser/Thr kinases, such as the transforming growth factor-β (TGF-β) receptors, but these act through SMAD proteins that are absent from Arabidopsis. The leucine-rich repeat (LRR) family of Arabidopsis receptor kinases shares its extracellular domain with many animal and fungal proteins that do not have associated kinase domains, and there are at least 122 Arabidopsis genes that code for LRR proteins without a kinase domain. Other Arabidopsis receptor kinase families have extracellular domains that are unfamiliar in animals.
Thus, evolution is modular, and the plant and animal lineages have expanded different families of receptor kinases for a similar set of developmental processes. Several Arabidopsis genes of developmental importance appear to be derived from a cyanobacteria-like genome (Supplementary Information Table 2), with no close relationship to any animal or fungal protein. One salient example is the family of ethylene receptors; another gene family of apparent chloroplast origin is the phytochromes, light receptors involved in many developmental decisions (see below). Whereas the land plant phytochromes show clear homology to the cyanobacterial light receptors, which are typical prokaryotic histidine kinases, the plant phytochromes are histidine kinase paralogues with Ser/Thr specificity84. Similarly to the ethylene receptors, the proteins that act downstream of plant phytochrome signalling are not found in cyanobacteria, and thus it appears that a bacterial light receptor entered the plant genome through horizontal transfer, altered its enzymatic activity, and became linked to a eukaryotic signal transduction pathway. This infusion of genes from a cyanobacterial endosymbiont shows that plants have a richer heritage of ancestral genes than animals, and unique developmental processes that derive from horizontal gene transfer.

Signal transduction

Being generally sessile organisms, plants have to respond to local environmental conditions by changing their physiology or redirecting their growth. Signals from the environment include light and pathogen attack, temperature, water, nutrients, touch and gravity. In addition to local cellular responses, some stimuli are communicated across the plant body, with plant hormones and peptides acting as secondary messengers. Some hormones, such as auxin, are taken up into the cell, whereas others, such as ethylene and brassinosteroids, and the peptide CLV3, act as ligands for receptor kinases on the plasma membrane.
No matter where the signal is perceived by the cell, it is transduced to the nucleus, resulting in altered patterns of gene expression. Comparative genome analysis between Arabidopsis, C. elegans and Drosophila supports the idea that plants have evolved their own pathways of signal transduction85. None of the components of the widely adopted signalling pathways found in vertebrates, flies or worms, such as Wingless/Wnt, Hedgehog, Notch/lin12, JAK/STAT, TGF-β/SMADs, receptor tyrosine kinase/Ras or the nuclear steroid hormone receptors, is found in Arabidopsis. By contrast, brassinosteroids are ligands of the BRI1 Ser/Thr kinase, a member of the largest recognizable class of transmembrane sensors encoded by 340 receptor-like kinase (RLK) genes in the Arabidopsis genome (Supplementary Information Table 7). With a few notable exceptions, such as CLV1, the types of ligands sensed by RLKs are completely unknown, providing an enormous future challenge for plant biologists. G-protein-coupled receptors (GPCRs)/seven-transmembrane proteins are an abundant class of proteins in mammalian genomes, instrumental in signal transduction. INTERPRO detected 27 GPCR-related domains in Arabidopsis (Supplementary Information Table 1), although there is no direct experimental evidence for these. Arabidopsis contains a family of 18 seven-transmembrane proteins of the mildew resistance (Mlo) class, several of which are involved in defence responses. Notably, only single Gα (GPA1) and Gβ (AGB1) subunits are found in Arabidopsis, both previously known86. Although cyclic GMP has been proposed to be involved in signal transduction in Arabidopsis87, a protein containing a guanylate cyclase domain was not identified in our analyses. Nevertheless, cyclic nucleotide-binding domains were detected in various proteins, indicating that cNMPs may have a role in plant signal transduction.
Thus, although cNMP-binding domains appear to have been conserved during evolution, cNMP synthesis in Arabidopsis may have evolved independently. We were unable to identify a protein with significant similarity to known Gγ subunits, but recent biochemical studies suggest that a protein with this functional capacity is likely to be present in plant cells (H. Ma, personal communication). Therefore, there is potential for the formation of only a single heterotrimeric G-protein complex; however, its functional interaction with any of the potential GPCR-related proteins remains to be determined. Modules of cellular signal pathways from bacteria and animals have been combined and new cascades have been innovated in plants. A pertinent example is the response to the gaseous plant hormone ethylene88. Ethylene is perceived and its signal transmitted by a family of receptors related to bacterial-type two-component histidine kinases (HKs). In bacteria, yeast and plants, these proteins sense many extracellular signals and function in a His-to-Asp phosphorelay network89. In turn, these proteins physically interact with the genetically downstream protein CTR1, a Raf/MAPKKK-related kinase, revealing the juxtaposition of bacterial-type two-component receptors and animal-type MAP kinase cascades. Unlike animals, however, Arabidopsis does not seem to have a Ras protein to activate the MAP kinase cascade. MAP kinases are found in abundance in Arabidopsis: we identified ~20, a higher number than in any other eukaryote. As potentially counteracting components, we found ~70 putative PP2C protein phosphatases. Although this group is largely uncharacterized functionally, several members are related to ABI1/ABI2, key negative regulators in the signalling pathway for the plant hormone abscisic acid.
Additional components of the His-to-Asp phosphorelay system were also found in Arabidopsis, including authentic response regulators (ARRs), pseudoresponse regulators (PRRs) and phosphotransfer intermediate protein (HPt)90. We found 11 HKs in the proteome (3 new), 16 RRs (2 new) and 8 PRRs (2 new). The biological roles of most ARRs, PRRs and HPts are largely unknown, but several have been found to have diverse functions in plants, including transcriptional activation in response to the plant hormone cytokinin91, and as components of the circadian clock92. Plants seem to have evolved unique signalling pathways by combining a conserved MAP kinase cascade module with new receptor types. In many cases, however, the ligands are unknown. Conversely, some known signalling molecules, such as auxin, are still in search of a receptor. Auxin signalling may represent yet another plant-specific mode of signalling, with protein degradation through the ubiquitin-proteasome pathway preceding altered gene expression. With many Arabidopsis genes encoding components of the ubiquitin-proteasome pathway, elimination of negative regulators may be a more widespread phenomenon in plant signalling.

Recognizing and responding to pathogens

Plants are constantly exposed to pests, parasites and pathogens and have evolved many defences. In mammals, polymorphism for parasite recognition encoded in the MHC genes contributes to resistance. In plants, disease resistance (R) genes that confer parasite recognition are also extremely polymorphic. This polymorphism has been proposed to restrict parasites, and its absence may explain the breakdown of resistance in crop monocultures93. In contrast to MHC genes, plant resistance genes are found at several loci, and the complete genome sequence enables analysis of their complement and structure.
Parasite recognition by resistance genes triggers defence mechanisms through various signalling molecules, such as protein kinases and adapter proteins, ion fluxes, reactive oxygen intermediates and nitric oxide. These halt pathogen colonization through transcriptional activation of defence genes and a form of programmed cell death called the hypersensitive response94. The Arabidopsis genome contains diverse resistance genes distributed at many loci, along with components of signalling pathways, and many other genes whose role in disease resistance has been inferred from mutant phenotypes. Most resistance genes encode intracellular proteins with a nucleotide-binding (NB) site typical of small G proteins, and carboxy-terminal LRRs95. Their amino termini either carry a TIR domain, or a putative coiled coil (CC). There are 85 TIR–NB–LRR resistance genes at 64 loci, and 36 CC–NB–LRR resistance genes at 30 loci. Some NB–LRR resistance genes express neither obvious TIR nor CC domains at their N termini. This potential class is present seven times, at six loci. There are 15 truncated TIR–NB genes that lack an LRR at 10 loci, often adjacent to full TIR–NB–LRR genes. There are also six CC–NB genes, at five loci. These truncated products may function in resistance. Intriguingly, two TIR–NB–LRR genes carry a WRKY domain, found in transcription factors that are implicated in plant defence, and one of these also encodes a protein kinase domain. Resistance gene evolution may involve duplication and divergence of linked gene families36; however, most (46) resistance genes are singletons; 50 are in pairs, 21 are in 7 clusters of 3 family members, with single clusters of 4, 5, 7, 8 and 9 members, respectively. Of the non-singletons, ~60% of pairs are in direct repeats, and ~40% are in inverted repeats. Resistance genes are unevenly distributed between chromosomes, with 49 on chromosome 1; 2 on chromosome 2; 16 on chromosome 3; 28 on chromosome 4; and 55 on chromosome 5.
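
The resistance gene counts quoted above can be cross-checked with simple arithmetic. The following is a minimal Python sketch that sums them three ways: by domain class, by cluster size and by chromosome. The variable names are illustrative only, and treating the five domain classes as a single partition of the resistance genes is an assumption; the text does not state which classes enter the cluster and chromosome counts.

```python
# Counts transcribed from the paragraph above. Grouping them into these
# three tallies is an assumption made for illustration.

# Genes per domain class (TIR/CC N terminus, NB site, LRR)
by_class = {
    "TIR-NB-LRR": 85,
    "CC-NB-LRR": 36,
    "NB-LRR (no obvious TIR or CC)": 7,
    "TIR-NB (truncated, no LRR)": 15,
    "CC-NB (truncated, no LRR)": 6,
}

# Cluster size -> number of clusters of that size.
# "50 are in pairs" is read as 25 pairs; "21 are in 7 clusters of 3".
clusters = {1: 46, 2: 25, 3: 7, 4: 1, 5: 1, 7: 1, 8: 1, 9: 1}

# Genes per chromosome
by_chromosome = {1: 49, 2: 2, 3: 16, 4: 28, 5: 55}

total_by_class = sum(by_class.values())
total_by_cluster = sum(size * count for size, count in clusters.items())
total_by_chromosome = sum(by_chromosome.values())

print("by class:     ", total_by_class)
print("by cluster:   ", total_by_cluster)
print("by chromosome:", total_by_chromosome)
```

Under these assumptions the cluster-size and chromosome tallies both give 150 genes, whereas the five domain classes sum to 149. The one-gene difference is not explained in the text, so it should not be read as an error without checking the supplementary data.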
In other plant species, resistance genes encode both transmembrane receptors for secreted pathogen products and protein kinases, and some other classes are also found. The Cf genes in tomato encode extracellular LRRs with a transmembrane domain and short cytoplasmic domain. Mutation in an Arabidopsis homologue, CLAVATA2, results in enlarged meristems, but to date no resistance function has been assigned to the 30 Arabidopsis CLV2 homologues. CLAVATA1, a transmembrane LRR kinase, is also required for meristem function. Xa21, a rice LRR-kinase, confers Xanthomonas resistance, and the Arabidopsis FLS2 LRR kinase confers recognition of flagellin. It has been proposed that CLV1 and CLV2 function as a heterodimer; perhaps this is also true for Xa21, FLS2 and Cf proteins. There are 174 LRR transmembrane kinases in Arabidopsis, with only FLS2 assigned a role in resistance. A unique resistance gene, beet Hs1pro-1, which confers nematode resistance, has two Arabidopsis homologues. The tomato Pto Ser/Thr kinase acts as a resistance protein in conjunction with an NB–LRR protein, so similar kinases might do the same for Arabidopsis NB–LRR proteins. There are 860 Ser/Thr kinases in the Arabidopsis sequence. Fifteen of these share 50% identity over the Pto-aligned region. The Toll pathway in Drosophila and mammals regulates innate immune responses through LRR/TIR domain receptors that recognize bacterial lipopolysaccharides96. Pto is highly homologous to Drosophila PELLE and mammalian IRAK protein kinases that mediate the TIR pathway. Additional genes have been defined that are required for resistance by our analysis of the genome sequence. The ndr1 mutation defines a gene required by the CC–NB–LRR genes RPS2 and RPM1. NDR1 is 1 of 28 Arabidopsis genes that are similar both to each other and to the tobacco HIN1 gene that is transcriptionally induced early during the hypersensitive response.
EDS1 is a gene required for TIR–NB–LRR function, and like PAD4, encodes a protein with a putative lipase motif. EDS1, PAD4 and a third gene comprise the EDS1/PAD4 family. The NPR1/NIM1/SAI1 gene is required for systemic acquired resistance, and we found five additional NPR1 homologues. Recessive mutations at both the barley Mlo and Arabidopsis LSD1 loci confer broad-spectrum resistance and derepress a cell-death program. There are at least 18 Mlo family members that resemble heterotrimeric GPCRs in Arabidopsis, and only two LSD1 homologues. One of the earliest responses to pathogen recognition is the production of reactive oxygen intermediates. This involves a specialized respiratory burst oxidase protein that transfers an electron across the plasma membrane to make superoxide. Arabidopsis encodes eight apparently functional gp91 homologues, called Atrboh genes. Unlike gp91, they all carry an ~300 amino-acid N-terminal extension carrying an EF-hand Ca2+-binding domain. In mammals, activation of the respiratory oxidative burst complex in the neutrophil, which includes gp91, requires the action of Rac proteins. As no Rac or Ras proteins are found in Arabidopsis, members of the large rop family of G proteins may carry this out. Similarly, we did not detect any Arabidopsis homologues of other mammalian respiratory burst oxidase components (p22, p47, p67, p40). There are no clear homologues of many mammalian defence and cell-death control genes. Although nitric oxide production is involved in plant defence, there is no obvious homologue of nitric oxide synthase. Also absent are apparent homologues of the REL domain transcription factors involved in innate immunity in both Drosophila and mammals. We found no similarity to proteins involved in regulating apoptosis in animal cells, such as classical caspases, bcl2/ced9 and baculovirus p35. There are, however, 36 cysteine proteases.
There are also eight homologues of a newly defined metacaspase family⁹⁷, two of which, along with LSD1, have a clear GATA-type zinc-finger.
Photomorphogenesis and photosynthesis
Because nearly all plants are sessile and most depend on photosynthesis, they have evolved unique ways of responding to light. Light serves as an energy source, as well as a trigger and modulator of complex developmental pathways, including those regulated by the circadian clock. Light is especially important during seedling emergence, where it stimulates chlorophyll production, leaf development, cotyledon expansion, chloroplast biogenesis and the coordinated induction of many nuclear- and chloroplast-encoded genes, while at the same time inhibiting stem growth. The goal of this process, called photomorphogenesis, is the establishment of a body plan that allows the plant to be an efficient photosynthetic machine under varying light conditions⁹⁸.
The signal transduction cascade leading to light-induced responses begins with the activation of photoreceptors. Next, the light signal is transduced via positively and negatively acting nuclear and cytoplasmic proteins, causing activation or derepression of nuclear and chloroplast-encoded photosynthetic genes and enabling the plant to establish optimal photoautotrophic growth. Although genetic and biochemical studies have defined many of the components in this process, the genome sequence provides an opportunity to identify comprehensively Arabidopsis genes involved in photomorphogenesis and the establishment of photoautotrophic growth. We identified at least 100 candidate genes involved in light perception and signalling, and 139 nuclear-encoded genes that potentially function in photosynthesis.
The roles of only 35 of the 100 candidate photomorphogenic genes have been described (Supplementary Information Table 8).
All of the light photoreceptors had been discovered previously, including five red/far-red absorbing phytochromes (PHYA–E), two blue/ultraviolet-A absorbing cryptochromes (CRY1 and CRY2), one blue-absorbing phototropin (NPH1) and one NPH1-like protein (NPL1). In contrast, we uncovered many new proteins similar to the photomorphogenesis regulators COP/DET/FUS, PKS1, PIF3, NDPK2, SPA1, FAR1, GIGANTEA, FIN219, HY5, CCA1, ATHB-2, ZEITLUPE, FKF1, LKP1, NPH3 and RPT2.
Both the phytochromes and NPH1 contain chromophores for light sensing coupled to kinase domains for signal transmission. Phytochromes have an N-terminal chromophore-binding domain, two PAS domains, and a C-terminal Ser/Thr kinase domain⁹⁹, whereas NPH1 has two LOV domains (members of the PAS domain superfamily) for flavin mononucleotide binding and a C-terminal Ser/Thr kinase domain¹⁰⁰. PAS domains potentially sense changes in light, redox potential and oxygen energy levels, as well as mediating protein–protein interactions⁹⁹,¹⁰⁰. We searched for uncharacterized proteins with the combination of a kinase domain and either a phytochrome chromophore-binding site or PAS domains. Although we found no new phytochrome-like genes, we did identify four predicted proteins that contain PAS and kinase domains (Supplementary Information Fig. 6). These proteins share 80% amino-acid identity, but, unlike NPH1 and NPL1, have only one PAS domain. The combination of potential signal sensing and transmitting domains makes it tempting to speculate that these proteins may be receptors for light or other signals.
Our screen included searches for components of photosynthetic reaction centres and light-harvesting complexes, enzymes involved in CO₂ fixation and enzymes in pigment biosynthesis. We identified 11 core proteins of photosystem I, including the eukaryotic-specific components PsaG and PsaH¹⁰¹, and 8 photosystem II proteins, including a single member (psbW) of the photosystem II core.
We also found 26 proteins similar to the chlorophyll a/b-binding proteins (8 Lhca and 18 Lhcb). Of the seven subunits of the cytochrome b₆f complex (PetA–D, PetG, PetL, PetM), only one (PetC) was found in the nuclear genome, whereas the remainder are probably encoded in the chloroplast. Similarly, of the nine subunits of the chloroplast ATP synthase complex, three are encoded in the nucleus, including the II-, γ- and δ-subunits; the remaining subunits (I, III, IV, α, β, ε) are encoded in the chloroplast¹⁰². Ten genes were related to the soluble components of the electron transfer chain, including two plastocyanins, five ferredoxins and three ferredoxin/NADP oxidoreductases. Forty genes are predicted to have a role in CO₂ fixation, including all of the enzymes in the Calvin–Benson cycle. For pigment biosynthesis, 16 genes in chlorophyll biosynthesis and 31 genes in carotenoid biosynthesis were found (Supplementary Information Table 8). Our analyses have identified several potential components of the light perception pathway, and have revealed the complex distribution of components of the photosynthetic apparatus between nuclear and plastid genomes.
Metabolism
Arabidopsis is an autotrophic organism that needs only minerals, light, water and air to grow. Consequently, a large proportion of the genome encodes enzymes that support metabolic processes, such as photosynthesis, respiration, intermediary metabolism, mineral acquisition, and the synthesis of lipids, fatty acids, amino acids, nucleotides and cofactors¹⁰³. With respect to these processes, Arabidopsis appears to contain a complement of genes similar to those in the photoautotrophic cyanobacterium Synechocystis⁴⁵, but, whereas Synechocystis generally has a single gene encoding an enzyme, Arabidopsis frequently has many.
For example, Arabidopsis has at least seven genes for the glycolytic enzyme pyruvate kinase, with an additional five for pyruvate kinase-like proteins. Whatever the reason for this high level of redundancy, it varies from gene to gene in the same pathway; the 11 enzymes of glycolysis are encoded by up to 51 genes that are present in as few as one or as many as eight copies. Similarly, of the 59 genes encoding proteins involved in glycerolipid metabolism, 39 are represented by more than one gene¹⁰⁴. Genome duplication and expansion of gene families by tandem duplication have contributed to this diversity. This high degree of apparent structural redundancy does not necessarily imply functional redundancy. For instance, although there are seven genes for serine hydroxymethyltransferase, a mutation in the gene for the mitochondrial form completely blocks the photorespiratory pathway¹⁰⁵. Although there are 12 genes for cellulose synthase, mutations in at least 2 of the 12 confer distinct phenotypes because of tissue-specific gene expression¹⁰⁶.
The metabolome of Arabidopsis differs from that of cyanobacteria, or of any other organism sequenced to date, by the presence of many genes encoding enzymes for pathways that are unique to vascular plants. In particular, although relatively little is known about the enzymology of cell-wall metabolism, more than 420 genes could be assigned probable roles in pathways responsible for the synthesis and modification of cell-wall polymers. Twelve genes encode cellulose synthase, and 29 other genes encode 6 families of structurally related enzymes thought to synthesize other major polysaccharides¹⁰⁶. Roughly 52 genes encode polygalacturonases, 20 encode pectate lyases and 79 encode pectin esterases, indicating a massive investment in modifying pectin.
Similarly, the presence of 39 β-1,3-glucanases, 20 endoxyloglucan transglycosylases, 50 cellulases and other hydrolases, and 23 expansins reflects the importance of wall remodelling during growth of plant cells. Excluding ascorbate and glutathione peroxidases, there are 69 genes with significant similarity to known peroxidases and 15 laccases (diphenol oxidases). Their presence in such abundance indicates the importance of oxidative processes in the synthesis of lignin, suberin and other cell-wall polymers. The high degree of apparent redundancy in the genes for cell-wall metabolism might reflect differences in substrate specificity by some of the enzymes. It is already known that cell types have different wall compositions, which may require that the relevant enzymes be subject to cell-type-specific transcriptional regulation. Of the 40 or so cell types that plants make, almost all can be identified by unique features of their cell wall¹⁰⁷.
A large number of genes involved in wall metabolism have yet to be defined. Although more than 60 genes for glycosyltransferases can be found in the genome sequence, most of these are probably involved in protein glycosylation or metabolite catabolism and do not seem to be adequate to account for the polysaccharide complexity of the wall. For instance, at least 21 enzymes are required just to produce the linkages of the pectic polysaccharide RGII, and none of these enzymes has been identified at present. Thus, if these and related enzymes involved in the synthesis of other cell-wall polymers are also represented by multiple genes, a substantial number of the genes of currently unknown function may be involved in cell-wall metabolism.
Higher plants collectively synthesize more than 100,000 secondary metabolites.
Because flowering plants are thought to have similar numbers of genes, it is apparent that a great deal of enzyme creation took place during the evolution of higher plants. An important factor in the rapid evolution of metabolic complexity is the large family of cytochrome P450s that is evident in Arabidopsis (Supplementary Information Table 1). These enzymes represent a superfamily of haem-containing proteins, most of which catalyse NADPH- and O₂-dependent hydroxylation reactions. Plant P450s participate in myriad biochemical pathways including those devoted to the synthesis of plant products, such as phenylpropanoids, alkaloids, terpenoids, lipids, cyanogenic glycosides and glucosinolates, and plant growth regulators, such as gibberellins, jasmonic acid and brassinosteroids. Whereas Arabidopsis has ~286 P450 genes, Drosophila has 94, C. elegans has 73 and yeast has only 3. This low number in yeast indicates that there are few reactions of basic metabolism that are catalysed by P450s. It seems likely that many animal P450s are involved in detoxification of compounds from food plant sources. The role of endogenous enzymes is poorly understood; only a few dozen P450 enzymes from plants have been characterized to any extent. The discrepancy between the number of known P450-catalysed reactions and the number of genes suggests that Arabidopsis produces a relatively large number of metabolites that have yet to be identified.
In addition to the large number of cytochrome P450s, Arabidopsis has many other genes that suggest the existence of pathways or processes that are not currently known. For instance, the presence of 19 genes with similarity to anthranilate N-hydroxycinnamoyl/benzoyl transferase is currently inexplicable. This enzyme is involved in the synthesis of dianthramide phytoalexins in Caryophyllaceae and Gramineae. No phytoalexins of this class have been described in Arabidopsis as yet.
Similarly, the presence of 12 genes with sequence similarity to the berberine bridge enzyme ((S)-reticuline:oxygen oxidoreductase (methylene-bridge-forming); EC 1.5.3.9), and 13 genes with similarity to tropinone reductase, suggests that Arabidopsis may have the ability to produce alkaloids. In other plants, the berberine bridge enzyme transforms reticuline into scoulerine, a biosynthetic precursor to a multitude of species-specific protopine, protoberberine and benzophenanthridine alkaloids. The discovery of these and many other intriguing genes in the Arabidopsis genome has created a wealth of new opportunities to understand the metabolic and structural diversity of higher plants.
Concluding remarks
The twentieth century began with the rediscovery of Mendel's rules of inheritance in pea¹⁰⁸, and it ends with the elucidation of the complete genetic complement of a model plant, Arabidopsis. The analysis of the completed sequence of a flowering plant reported here provides insights into the genetic basis of the similarities and differences of diverse multicellular organisms. It also creates the potential for direct and efficient access to a much deeper understanding of plant development and environmental responses, and permits the structure and dynamics of plant genomes to be assessed and understood.
Arabidopsis, C. elegans and Drosophila have a similar range of 11,000–15,000 different types of proteins, suggesting this is the minimal complexity required by extremely diverse multicellular eukaryotes to execute development and respond to their environment. We account for the larger number of gene copies in Arabidopsis compared with these other sequenced eukaryotes with two possible explanations. First, independent amplification of individual genes has generated tandem and dispersed gene families to a greater extent in Arabidopsis, and unequal crossing over may be the predominant mechanism involved.
Second, ancestral duplication of the entire genome and subsequent rearrangements have resulted in segmental duplications. The pattern of these duplications suggests an ancient polyploidy event, and mutant analysis indicates that at least some of the many duplicate genes are functionally redundant. Their occurrence in a functionally diploid genetic model came as a surprise, and is reminiscent of the situation in maize, an ancient segmental allotetraploid. The remarkable degree of genome plasticity revealed in the large-scale duplications may be needed to provide new functions, as alternative promoters and alternative splicing appear to be less widely used in plants than they are in animals.
Apart from duplicated segments, the overall chromosome structure of Arabidopsis closely resembles that of Drosophila; transposons and other repetitive sequences are concentrated in the heterochromatic regions surrounding the centromere, whereas the euchromatic arms are largely devoid of repetitive sequences. Conversely, most protein-coding genes reside in the euchromatin, although a number of expressed genes have been identified in centromeric regions. Finally, Arabidopsis is the first methylated eukaryotic genome to be sequenced, and will be invaluable in the study of epigenetic inheritance and gene regulation.
Unlike most animals, plants generally do not move, they can perpetuate indefinitely, they reproduce through an extended haploid phase, and they synthesize all their metabolites. Our comparison of Arabidopsis, bacterial, fungal and animal genomes starts to define the genetic basis for these differences between plants and other life forms. Basic intracellular processes, such as translation or vesicle trafficking, appear to be conserved across kingdoms, reflecting a common eukaryotic heritage. More elaborate intercellular processes, including physiology and development, use different sets of components.
For example, membrane channels, transporters and signalling components are very different in plants and animals, and the large number of transcription factors unique to plants contrasts with the conservation of many chromatin proteins across the three eukaryotic kingdoms. Unexpected differences between seemingly similar processes include the absence of intracellular regulators of cell division (Cdc25) and apoptosis (Bcl-2). On the other hand, DNA repair appears more highly conserved between plants and mammals than within the animal kingdom, perhaps reflecting common factors such as DNA methylation. Our analysis also shows that many genes of the endosymbiotic ancestor of the plastid have been transferred to the nucleus, and the products of this rich prokaryotic heritage contribute to diverse functions such as photoautotrophic growth and signalling.
The sequence reported here changes the fundamental nature of plant genetic analysis. Forward genetics is greatly simplified as mutations are more conveniently isolated molecularly, but at the same time extensive gene duplications mean that functional redundancy must be taken into account. At a biochemical level, the specificity conferred by nucleotide sequence and the completeness of the survey allow complex mixtures of RNA and protein to be resolved into their individual components using micro-arrays and mass spectrometry. This specificity can also be used in the parallel analysis of genome-wide polymorphisms and quantitative traits in natural populations¹⁰⁹. Looking ahead, the challenge of determining the function of the large set of predicted genes, many of which are plant-specific, is now a clear priority, and multinational programs have been initiated to accomplish this goal using site-selected mutagenesis among the necessary tools¹¹⁰. Finally, productive paths of crop improvement, based on enhanced knowledge of Arabidopsis gene function, will help meet the challenge of sustaining our food supply in the coming years.
Note added in proof: at the time of publication 17 centromeric BACs and 5 sequence gaps in chromosome arms are being sequenced.
Methods
The three centres used similar annotation approaches involving in silico gene-finding methods, comparison to EST and protein databases, and manual reconciliation of that data. Gene finding involved three steps: (1) analysis of BAC sequences using a computational gene finder; (2) alignment of the sequence to the protein and EST databases; (3) assignment of functions to each of the genes. Genscan¹¹¹, GeneMark.HMM¹¹², Xgrail¹¹³, Genefinder (P. Green, unpublished software) and GlimmerA¹¹⁴ were used to analyse BAC sequences. All of these systems were specially trained for Arabidopsis genes. Splice sites were predicted using NetGene2¹¹⁵, Splice Predictor¹¹⁶ and GeneSplicer (M. Pertea and S. Salzberg, unpublished software). For the second step, BACs were aligned to ESTs and to the Arabidopsis gene index¹¹⁷ using programs such as DDS/GAP2¹¹⁸ or BLASTN¹¹⁹. Segmental duplications were analysed and displayed using a modified version of DIALIGN2 (ref. 120).
Received 20 October; accepted 15 November 2000.
1. The C. elegans Sequencing Consortium. Sequence and analysis of the genome of C. elegans. Science 282, 2012–2018 (1998).
2. Adams, M. D. The genome sequence of Drosophila melanogaster. Science 287, 2185–2195 (2000).
3. Meinke, D. W., Cherry, J. M., Dean, C., Rounsley, S. D. & Koornneef, M. Arabidopsis thaliana: a model plant for genome analysis. Science 282, 662–665 (1998).
4. Lin, X. et al. Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature 402, 761–768 (1999).
5. Mayer, K. et al. Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana. Nature 402, 769–777 (1999).
6. Theologis, A. et al.
Sequence and analysis of chromosome 1 of the plant Arabidopsis thaliana. Nature 408, 816–820 (2000).
7. Salanoubat, M. et al. Sequence and analysis of chromosome 3 of the plant Arabidopsis thaliana. Nature 408, 820–822 (2000).
8. Tabata, S. et al. Sequence and analysis of chromosome 5 of the plant Arabidopsis thaliana. Nature 408, 820–822 (2000).
9. Choi, S. D., Creelman, R., Mullet, J. & Wing, R. A. Construction and characterisation of a bacterial artificial chromosome library from Arabidopsis thaliana. Weeds World 2, 17–20 (1995).
10. Mozo, T., Fischer, S., Shizuya, H. & Altmann, T. Construction and characterization of the IGF Arabidopsis BAC library. Mol. Gen. Genet. 258, 562–570 (1998).
11. Lui, Y.-G., Mitsukawa, N., Vazquez-Tello, A. & Whittier, R. F. Generation of a high-quality P1 library of Arabidopsis suitable for chromosome walking. Plant J. 7, 351–358 (1995).
12. Lui, Y.-G. et al. Complementation of plant mutants with large genomic DNA fragments by a transformation-competent artificial chromosome vector accelerates positional cloning. Proc. Natl Acad. Sci. USA 96, 6535–6540 (1999).
13. Marra, M. et al. A map for sequence analysis of the Arabidopsis thaliana genome. Nature Genet. 22, 265–270 (1999).
14. Mozo, T. et al. A complete BAC-based physical map of the Arabidopsis thaliana genome. Nature Genet. 22, 271–275 (1999).
15. Sato, S. et al. Structural analysis of Arabidopsis thaliana chromosome 5. I. Sequence features of the 1.6 Mb regions covered by twenty physically assigned P1 clones. DNA Res. 4, 215–230 (1997).
16. Bent, E., Johnson, S. & Bancroft, I. BAC representation of two low-copy regions of the genome of Arabidopsis thaliana. Plant J. 13, 849–855 (1998).
17. Copenhaver, G. P. et al. Genetic definition and sequence analysis of Arabidopsis centromeres. Science 286, 2468–2474 (1999).
18. Meyerowitz, E. M. & Somerville, C. R. Arabidopsis (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York, 1994).
19. Lowe, T. M. & Eddy, S. R.
tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).
20. Pavy, N. et al. Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. BioInformatics 15, 887–900 (1999).
21. Mewes, H. W. et al. Overview of the yeast genome. Nature 387 (Suppl.), 7–65 (1997).
22. Frishman, D. et al. Functional and structural genomics using PEDANT. BioInformatics (in the press).
23. Blattner, F. R. et al. The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1462 (1997).
24. Kotani, H. & Tabata, S. Lessons from the sequencing of the genome of a unicellular cyanobacterium, Synechocystis sp. PCC6803. Annu. Rev. Plant Physiol. Plant Mol. Biol. 49, 151–171 (1998).
25. Apweiler, R. et al. INTERPRO (http://www.ebi.ac.uk/interpro/). Collaborative Computer Project 11 Newsletter no. 10 (Cambridge, 2000).
26. Bent, A. F. et al. RPS2 of Arabidopsis thaliana: a leucine-rich repeat class of plant disease resistance genes. Science 265, 1856–1860 (1994).
27. Skowyra, D. et al. F box proteins are receptors that recruit phosphorylated substrates to the SCF ubiquitin-ligase complex. Cell 91, 209–219 (1997).
28. Joazeiro, C. A. P. & Weissman, A. M. RING finger proteins: mediators of ubiquitin ligase activity. Cell 102, 549–552 (2000).
29. Rubin, G. M. et al. Comparative genomics of the eukaryotes. Science 287, 2204–2215 (2000).
30. Delcher, A. L. et al. Alignment of whole genomes. Nucleic Acids Res. 27, 2369–2376 (1999).
31. Blanc, G. et al. Extensive duplication and reshuffling in the Arabidopsis genome. Plant Cell 12, 1093–1102 (2000).
32. Wendel, J. F. Genome evolution in polyploids. Plant Mol. Biol. 42, 225–249 (2000).
33. Gaut, B. S. & Doebley, J. F. DNA sequence evidence for the segmental allotetraploid origin of maize. Proc. Natl Acad. Sci. USA 94, 6809–6814 (1997).
34. Ku, H.-M., Vision, T., Liu, J. & Tanksley, S. D.
Comparing sequenced segments of the tomato and Arabidopsis genomes: Large-scale duplication followed by selective gene loss creates a network of synteny. Proc. Natl Acad. Sci. USA 97, 9121–9126 (2000).
35. Noel, L. et al. Pronounced intraspecific haplotype divergence at the RPP5 complex disease resistance locus of Arabidopsis. Plant Cell 11, 2099–2111 (1999).
36. Ellis, J., Dodds, P. & Pryor, T. Structure, function, and evolution of plant disease resistance genes. Trends Plant Sci. 3, 278–284 (2000).
37. Tanksley, S. D. et al. High density molecular linkage maps of the tomato and potato genomes. Genetics 132, 1141–1160 (1992).
38. Moore, G., Devos, K. M., Wang, Z. & Gale, M. D. Grasses, line up and form a circle. Curr. Biol. 5, 737–739 (1995).
39. Acarkan, A., Rossberg, M., Koch, M. & Schmidt, R. Comparative genome analysis reveals extensive conservation of genome organisation for Arabidopsis thaliana and Capsella rubella. Plant J. 23, 55–62 (2000).
40. Cavell, A., Lydiate, D., Parkin, I., Dean, C. & Trick, M. A 30 centimorgan segment of Arabidopsis thaliana chromosome 4 has six collinear homologues within the Brassica napus genome. Genome 41, 62–69 (1998).
41. O'Neill, C. & Bancroft, I. Comparative physical mapping of segments of the genome of Brassica oleracea var alboglabra that are homologous to sequenced regions of the chromosomes 4 and 5 of Arabidopsis thaliana. Plant J. 23, 233–243 (2000).
42. Wolfe, K. H., Gouy, M., Yang, Y.-W., Sharp, P. M. & Li, W.-H. Date of the monocot-dicot divergence estimated from the chloroplast DNA sequence data. Proc. Natl Acad. Sci. USA 86, 6201–6205 (1989).
43. van Dodeweerd, A.-M. et al. Identification and analysis of homologous segments of the genomes of rice and Arabidopsis thaliana. Genome 42, 887–892 (1999).
44. Mayer, K. Sequence level analysis of homologous segments of the genomes of rice and Arabidopsis thaliana. Genome Res. (submitted).
45. Sato, S. Complete structure of the chloroplast genome of Arabidopsis thaliana. DNA Research 6, 283–290 (1999).
46. Unseld, M., Marienfeld, J., Brandt, P. & Brennicke, A. The mitochondrial genome in Arabidopsis thaliana contains 57 genes in 366,924 nucleotides. Nature Genet. 15, 57–61 (1997).
47. Palmer, J. D. et al. Dynamic evolution of plant mitochondrial genomes: mobile genes and introns and highly variable mutation rates. Proc. Natl Acad. Sci. USA 97, 6960–6966 (2000).
48. Le, Q.-H. et al. Transposon diversity in Arabidopsis thaliana. Proc. Natl Acad. Sci. USA 97, 7376–7381 (2000).
49. Yu, Z., Wright, S. & Bureau, T. Mutator-like elements (MULEs) in Arabidopsis thaliana: Structure, diversity and evolution. Genetics (in the press).
50. Feschotte, C. & Mouches, C. Evidence that a family of miniature inverted-repeat transposable elements (MITEs) from the Arabidopsis thaliana genome has arisen from a pogo-like DNA transposon. Mol. Biol. Evol. 17, 730–737 (2000).
51. Martienssen, R. Transposons, DNA methylation and gene control. Trends Genet. 14, 263–264 (1998).
52. Singer, T., Yordan, C. & Martienssen, R. Robertson's Mutator transposons in Arabidopsis are regulated by the chromatin-remodeling gene Decrease in DNA Methylation (DDM1). Genes Dev. (in the press).
53. SanMiguel, P. et al. Nested retrotransposons in the intergenic regions of the maize genome. Science 274, 765–768 (1996).
54. Copenhaver, G. P. & Pikaard, C. S. Two-dimensional RFLP analyses reveal megabase-sized clusters of rRNA gene variants in Arabidopsis thaliana, suggesting local spreading of variants as the mode for gene homogenization during concerted evolution. Plant J. 9, 273–282 (1996).
55. Fransz, P. et al. Cytogenetics for the model system Arabidopsis thaliana. Plant J. 13, 867–876 (1998).
56. Richards, E. J. & Ausubel, F. M. Isolation of a higher eukaryotic telomere from Arabidopsis thaliana. Cell 53, 127–136 (1988).
57. Round, E. K., Flowers, S. K. & Richards, E. J.
Arabidopsis thaliana centromere regions: genetic map positions and repetitive DNA structure. Genome Res. 7, 1045–1053 (1997).
58. The CSHL/WUGSC/PEB Arabidopsis Sequencing Consortium. The complete sequence of a heterochromatic island from a higher eukaryote. Cell 100, 377–386 (2000).
59. Fransz, P. F. et al. Integrated cytogenetic map of chromosome arm 4S of A. thaliana: Structural organization of heterochromatic knob and centromere region. Cell 100, 367–376 (2000).
60. Paulsen, I. T., Nguyen, L., Sliwinski, M. K., Rabus, R. & Saier, M. H. Jr Microbial genome analyses: comparative transport capabilities in eighteen prokaryotes. J. Mol. Biol. 301, 75–101 (2000).
61. Paulsen, I. T., Sliwinski, M. K., Nelissen, B., Goffeau, A. & Saier, M. H. Jr Unified inventory of established and putative transporters encoded within the complete genome of Saccharomyces cerevisiae. FEBS Lett. 430, 116–125 (1998).
62. Hirsch, R. E., Lewis, B. D., Spalding, E. P. & Sussman, M. R. A role for the AKT1 potassium channel in plant nutrition. Science 280, 918–921 (1998).
63. Slayman, C. L. & Slayman, C. W. Depolarization of the plasma membrane of Neurospora during active transport of glucose: evidence for a proton-dependent cotransport system. Proc. Natl Acad. Sci. USA 71, 1935–1939 (1974).
64. Ryan, C. A. & Pearce, G. Systemin: a polypeptide signal for plant defensive genes. Annu. Rev. Cell. Dev. Biol. 14, 1–17 (1998).
65. Eisen, J. A. & Hanawalt, P. C. A phylogenomic study of DNA repair genes, proteins, and processes. Mutat. Res. 435, 171–213 (1999).
66. Selby, C. P. & Sancar, A. Structure and function of transcription-repair coupling factor. Structural domains and binding properties. J. Biol. Chem. 270, 4882–4889 (1995).
67. Britt, A. B. Molecular genetics of DNA repair in higher plants. Trends Plant Sci. 4, 20–25 (1999).
68. Dangl, M. Response to Aravind, L. & Koonin, E. V. Second Family of Histone Deacetylases. Science 280, 1167 (1998).
69. Cao, X. et al.
Conserved plant genes with similarity to mammalian de novo DNA methyltransferases. Proc. Natl Acad. Sci. USA 97, 4979–4984 (2000).
70. Henikoff, S. & Comai, L. A DNA methyltransferase homologue with a chromodomain exists in multiple polymorphic forms in Arabidopsis. Genetics 149, 307–318 (1998).
71. Hung, M.-S. et al. Drosophila proteins related to vertebrate DNA (5-cytosine) methyltransferases. Proc. Natl Acad. Sci. USA 96, 11940–11945 (1999).
72. Dalmay, T., Hamilton, A. J., Rudd, S., Angell, S. & Baulcombe, D. C. An RNA-dependent-RNA polymerase in Arabidopsis is required for post transcriptional gene silencing mediated by a transgene but not by a virus – the truth. Cell 101, 543–553 (2000).
73. Riechmann, J. L. & Ratcliffe, O. J. A genomic perspective on plant transcription factors. Curr. Opin. Plant Biol. 3, 423–434 (2000).
74. Liljegren, S. J. et al. SHATTERPROOF MADS-box genes control seed dispersal in Arabidopsis. Nature 404, 766–770 (2000).
75. Pelaz, S. et al. B and C floral organ identity functions require SEPALLATA MADS-box genes. Nature 405, 200–203 (2000).
76. Canaday, J., Stoppin-Mellet, V., Mutterer, J., Lambert, A. M. & Schmit, A. C. Higher plant cells: gamma-tubulin and microtubule nucleation in the absence of centrosomes. Microsc. Res. Technol. 49, 487–495 (2000).
77. Bassham, D. C. & Raikhel, N. V. Unique features of the plant vacuolar sorting machinery. Curr. Opin. Cell Biol. 12, 491–495 (2000).
78. Zheng, Z. L. & Yang, Z. The Rop GTPase switch turns on polar growth in pollen. Trends Plant Sci. 5, 298–303 (2000).
79. den Boer, B. G. & Murray, J. A. Triggering the cell cycle in plants. Trends Cell Biol. 10, 245–250 (2000).
80. Heese, M., Mayer, U. & Jurgens, G. Cytokinesis in flowering plants: cellular process and developmental integration. Curr. Opin. Plant Biol. 1, 486–491 (1998).
81. Meyerowitz, E. M. Plants, animals, and the logic of development. Trends Genet. 15, M65–M68 (1999).
82. Wang, D. Y. C. et al.
Divergence time estimates for the early history of animal phyla and the origin of plants, animals and fungi. Proc. R. Soc. Lond. B Bio. 266, 163–171 (1999).
83. Torii, K. Receptor kinase activation and signal transduction in plants: an emerging picture. Curr. Opin. Plant Biol. 3, 362–367 (2000).
84. Yeh, K. C. & Lagarias, J. C. Eukaryotic phytochromes: Light-regulated serine/threonine protein kinases with histidine kinase ancestry. Proc. Natl Acad. Sci. USA 95, 13976–13981 (1998).
85. McCarty, D. R. & Chory, J. Conservation and innovation in plant signaling pathways. Cell 103, 201–211 (2000).
86. Weiss, C. A., Garnaat, C., Mukai, K., Hu, Y. & Ma, H. Molecular cloning of cDNAs from maize and Arabidopsis encoding a G protein beta subunit. Proc. Natl Acad. Sci. USA 91, 9554–9558 (1994).
87. Bowler, C. et al. Cyclic GMP and calcium mediate phytochrome phototransduction. Cell 77, 73–81 (1994).
88. Stepanova, A. & Ecker, J. R. Ethylene signaling: from mutants to molecules. Curr. Opin. Plant Biol. 3, 353–360 (2000).
89. Urao, T., Yamaguchi-Shinozaki, K. & Shinozaki, K. Two-component systems in plant signal transduction. Trends Plant Sci. 5, 67–74 (2000).
90. Makino, S. et al. Genes encoding pseudo-response regulators: Insight into His-to-Asp phosphorelay and circadian rhythm in Arabidopsis thaliana. Plant Cell Physiol. 41, 791–803 (2000).
91. D'Agostino, I. B. & Kieber, J. J. Phosphorelay signal transduction: the emerging family of plant response regulators. Trends Biochem. Sci. 24, 452–456 (1999).
92. Strayer, C. et al. Cloning of the Arabidopsis clock gene TOC1, an autoregulatory response regulator homologue. Science 289, 768–771 (2000).
93. Stahl, E. A. & Bishop, J. G. Plant-Pathogen arms races at the molecular level. Curr. Opin. Plant Biol. 3, 299–304 (2000).
94. McDowell, J. M. & Dangl, J. L. Signal transduction in the plant innate immune response. Trends Biochem. Sci. 25, 79–82 (2000).
95. Van der Biezen, E. A.
& Jones, J. D. Plant disease-resistance proteins and the gene-for-gene concept. Trends Biochem Sci. 23, 454±456 (1998). 96. Belvin, M. P. & Anderson, K. V. A conserved signaling pathway: the Drosophila toll-dorsal pathway. Annu. Rev. Cell. Dev. Biol. 12, 393±416 (1996). 97. Uren, A. G. et al. Identi®cation of paracaspases and metacaspases: Two ancient families of caspaselike proteins, one of which plays a key role in MALT lymphoma. Mol. Cell 6, 961±967 (2000). 98. Fankhauser, C. & Chory, J. Light control of plant development. Annu. Rev. Cell. Dev. Biol. 13, 203± 229 (1997). 99. Briggs, W. R. & Huala, E. Blue-light photoreceptors in higher plants. Annu. Rev. Cell. Dev. Biol. 15, 33±62 (1999). 100. Christie, J. M., Salomon, M., Nozue, K., Wada, M. & Briggs, W. R. LOV (light, oxygen, or voltage) domains of the blue-light photoreceptor phototropin (nph1): binding sites for the chromophore ¯avin mononucleotide. Proc. Natl Acad. Sci. USA 96, 8779±8783 (1999). 101. Golbeck, J. H. Structure and function of photosystem I. Annu. Rev. Plant Physiol. Plant Mol. Biol. 43, 293±324 (1992). 102. Maier, R. M., Neckermann, K., Igloi, G. L. & Kossel, H. Complete sequence of the maize chloroplast genome: gene content, hotspots of divergence and ®ne tuning of genetic information by transcript editing. J. Mol. Biol. 251, 614±28 (1995). 103. Buchanan, B. B., Gruissem, W. & Jones, R. L. in Biochemistry and Molecular Biology of Plants 1367 (Am. Soc. Plant Physiol., Rockville, Maryland, 2000). 104. Mekhedov, S., MartõÂnez de IlaÂrduya, O. & Ohlrogge, J. Toward a functional catalog of the plant genome. A survey of genes for lipid biosynthesis. Plant Physiol. 122, 389±401 (2000). 105. Somerville, C. R., & Ogren, W. L. Photorespiration de®cient mutants of Arabidopsis thaliana lacking mitochrondrial serine transhydroxymethylase activity. Plant Physiol. 67, 666±671 (1981). 106. Richmond, T., & Somerville, C. R. The cellulose synthase superfamily. Plant Physiol 124, 495±499 (1999). 107. 
Carpita, N. Vergara C: A recipe for cellulose. Science 279, 672±673 (1998). 108. De Vries, H. Sur la loi de disjonction des hybrides. C. R. Acad. Sci. Paris 130, 845±847 (1900). 109. Alonso-Blanco, C. & Koornneef, M. Naturally occurring variation in Arabidopsis: an underexploited resource for plant genetics. Trends Plant Sci. 5, 1360±1385 (1999). 110. Chory, J. Functional genomics and the virtual plant. A blueprint for understanding how plants are built and how to improve them. Plant Physiology 123, 423±425 (2000). 111. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78±94 (1997). 112. Lukashin, A. V. & Borodovsky, M. GeneMark.hmm: new solutions for gene ®nding. Nucleic Acids Res. 26, 1107±1115 (1998). 113. Uberbacher, E. C. & Mural, R. J. Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc. Natl Acad. Sci. USA 88, 11261±11265 (1991). 114. Salzberg, S. L., Pertea, M., Delcher, A. L., Gardner, M. J. & Tettelin, H. Interpolated Markov models for eukaryotic gene ®nding. Genomics 59, 24±31 (1999). 115. Hebsgaard, S. M. et al. Splice site prediction in Arabidopsis thaliana DNA by combining local and global sequence information. Nucleic Acids Res. 24, 3439±3452 (1996). 116. Brendel, V. & Kleffe, J. Prediction of locally optimal splice sites in plant pre-mRNA with applications to gene identi®cation in Arabidopsis thaliana genomic DNA. Nucleic Acids Res. 26, 4748±4757 (1998). 117. Quackenbush, J., Liang, F., Holt, I., Pertea, G. & Upton, J. The TIGR gene indices: reconstruction and representation of expressed gene sequences. Nucleic Acids Res. 28, 141±145 (2000). 118. Huang, X., Adams, M. D., Zhou, H. & Kerlavage, A. R. A tool for analyzing and annotating genomic sequences. Genomics 46, 37±45 (1997). 119. Altschul, S. F. et al. Basic local alignment search tool. J. Mol. Biol. 215, 403±410 (1990). 120. Morgenstern, B. 
DIALIGN2: improvement of the segment-to-segment approach to multiple sequence alignment. BioInformatics 15, 211±218 (1999). 121. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. SCOP: a structural classi®cation of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536±540 (1995). 122. Emanuelsson, O., Nielsen, H., Brunak, S. & von Heijne, G. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. 300, 1005±1016 (2000). Supplementary information is available on Nature's World-Wide Web site (http://www.nature.com) or as paper copy from the London editorial of®ce of Nature. Acknowledgements This work was supported by the National Science Foundation (NSF) Cooperative Agreements (funded by the NSF, the US Department of Agriculture (USDA) and the US Department of Energy (DOE)), the Kazusa DNA Research Institute Foundation, and by the European Commission. Additional support from the USDA, MinisteÁre de la Recherche, GSF-Forschungszentrum f. Umwelt u. Gesundheit, BMBF (Bundesministerium f. Bildung, Forschung und Technologie), the BBSRC (Biotechnology and Biological © 2000 Macmillan Magazines Ltd 813 articles Sciences Research Council) and the Plant Research International, Wageningen, is also gratefully acknowledged. The authors wish to thank E. Magnien, D. Nasser and J. D. Watson for their continual support and encouragement. Correspondence and requests for materials should be addressed to The Arabidopsis Genome Initiative (e-mail: [email protected] or [email protected]). The Arabidopsis Genome Initiative Three groups contributed to the work reported here. The Genome Sequencing groups, arranged here in order of sequence contribution, sequenced and annotated assigned chromosomal regions. The Genome Analysis group carried out the analyses described. 
The Contributing Authors interpreted the genome analyses, incorporating other data and analyses, with respect to selected biological topics.

Genome Sequencing Groups

Samir Kaul, Hean L. Koo, Jennifer Jenkins, Michael Rizzo, Timothy Rooney, Luke J. Tallon, Tamara Feldblyum, William Nierman, Maria-Ines Benito, Xiaoying Lin, Christopher D. Town, J. Craig Venter & Claire M. Fraser
The Institute for Genomic Research, 9712 Medical Centre Drive, Rockville, Maryland 20850, USA

Satoshi Tabata, Yasukazu Nakamura, Takakazu Kaneko, Shusei Sato, Erika Asamizu, Tomohiko Kato, Hirokazu Kotani & Shigemi Sasamoto
Kazusa DNA Research Institute, 1532-3 Yana, Kisarazu, Chiba 292, Japan

Joseph R. Ecker1*†, Athanasios Theologis2*, Nancy A. Federspiel3*†, Curtis J. Palm3, Brian I. Osborne2, Paul Shinn1, Aaron B. Conway3, Valentina S. Vysotskaia2, Ken Dewar1, Lane Conn3, Catherine A. Lenz2, Christopher J. Kim1, Nancy F. Hansen3, Shirley X. Liu2, Eugen Buehler1, Hootan Altafi3, Hitomi Sakano2, Patrick Dunn1, Bao Lam3, Paul K. Pham2, Qimin Chao1, Michelle Nguyen3, Guixia Yu2, Huaming Chen1, Audrey Southwick3, Jeong Mi Lee2, Molly Miranda3, Mitsue J. Toriumi2 & Ronald W. Davis3
1, Plant Science Institute, Department of Biology, University of Pennsylvania, Philadelphia, Pennsylvania 19104 USA; 2, Plant Gene Expression Center/USDA-U.C.Berkeley, 800 Buchanan Street, Albany, California 94710, USA; 3, Stanford Genome Technology Center, 855 California Avenue, Palo Alto, California 94304, USA.
* These authors contributed equally to this work.
† Present addresses: The Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, California 92037, USA (J.R.E.); Exelixis, Inc., 170 Harborway, P.O. Box 511, South San Francisco, California 94083-0511, USA (N.A.F.)

European Union Chromosome 4 and 5 Sequencing Consortium: R. Wambutt1, G. Murphy2, A. Düsterhöft3, W. Stiekema4, T. Pohl5, K.-D. Entian6, N. Terryn7 & G.
Volckaert8
1, AGOWA GmbH, Glienicker Weg 185, D-12489 Berlin, Germany; 2, John Innes Centre, Colney Lane, Norwich NR4 7UH, UK; 3, QIAGEN GmbH, Max-Volmer-Str. 4, D-40724 Hilden, Germany; 4, Greenomics, Plant Research International, Droevendaalsesteeg 1, NL 6700 AA Wageningen, The Netherlands; 5, GATC GmbH, Fritz-Arnold-Strasse 23, D-78467 Konstanz, Germany; 6, SRD GmbH, Oberurseler Str. 43, Oberursel 61440, Germany; 7, Department for Plant Genetics, (VIB), University of Gent, K.L. Ledeganckstraat 35, B-9000 Gent, Belgium; 8, Katholieke Universiteit Leuven, Laboratory of Gene Technology, Kardinaal Mercierlaan 92, B-3001 Leuven, Belgium

European Union Chromosome 3 Sequencing Consortium: M. Salanoubat1, N. Choisne1, M. Rieger2, W. Ansorge3, M. Unseld4, B. Fartmann5, G. Valle6, F. Artiguenave1, J. Weissenbach1 & F. Quetier1
1, Genoscope and CNRS FRE2231, 2 rue G. Crémieux, 91057 Evry Cedex, France; 2, Genotype GmbH Angelhofweg 39, D-69259 Wilhelmsfeld, Germany; 3, European Molecular Biology Laboratory, Biochemical Instrumentation Program, Meyerhofstr. 1, D-69117 Heidelberg, Germany; 4, LION Bioscience AG, Im Neuenheimer Feld 515-517, 69120 Heidelberg, Germany; 5, MWG-Biotech AG, Anzinger Strasse 7a, 85560 Ebersberg, Germany; 6, CRIBI, Università di Padova, via G. Colombo 3, Padova 35131, Italy

The Cold Spring Harbor and Washington University Genome Sequencing Center Consortium: Richard K. Wilson1, Melissa de la Bastide2, M. Sekhon1, Emily Huang2, Lori Spiegel2, Lidia Gnoj2, K. Pepin1, J. Murray1, D. Johnson1, Kristina Habermann2, Neilay Dedhia2, Larry Parnell2, Raymond Preston2, L. Hillier1, Ellson Chen3, M. Marra2, Robert Martienssen4 & W. Richard McCombie2
1, Washington University Genome Sequencing Center, Washington University in St Louis School of Medicine, 4444 Forest Park Blvd., St.
Louis, Missouri 63108 USA; 2, Lita Annenberg Hazen Genome Center, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA; 3, Celera Genomics, 850 Lincoln Center Drive, Foster City, California 94494, USA; 4, Plant Biology Group, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA

Genome Analysis Group

Klaus Mayer1*, Owen White2*, Michael Bevan3, Kai Lemcke1, Todd H. Creasy2, Cord Bielke2, Brian Haas1, Dirk Haase1, Rama Maiti2, Stephen Rudd1, Jeremy Peterson2, Heiko Schoof1, Dimitrij Frishman1, Burkhard Morgenstern1, Paulo Zaccaria1, Maria Ermolaeva2, Mihaela Pertea2, John Quackenbush2, Natalia Volfovsky2, Dongying Wu2, Todd M. Lowe4, Steven L. Salzberg2 & Hans-Werner Mewes1
1, GSF-Forschungszentrum f. Umwelt u. Gesundheit, Munich Information Center for Protein Sequences, am Max-Planck-Institut f. Biochemie, Am Klopferspitz 18a, D-82152, Germany; 2, The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA; 3, Molecular Genetics Department, John Innes Centre, Colney Lane, Norwich NR4 7UH, UK; 4, Dept Genetics, Stanford University Medical School, Stanford, California 94305-5120, USA.
* These authors contributed equally to this work

Contributing Authors

Comparative analysis of the genomes of A. thaliana accessions. S. Rounsley, D. Bush, S. Subramaniam, I. Levin & S. Norris
Cereon Genomics LLC, 45 Sidney St, Cambridge, Massachusetts 02139, USA

Comparative analysis of the genomes of A. thaliana and other genera. R. Schmidt1, A. Acarkan1 & I. Bancroft2
1, Max-Delbrück-Laboratorium in der Max-Planck-Gesellschaft, Carl-von-Linné-Weg 10, 50829 Cologne, Germany; 2, Brassicas and Oilseeds Research Department, John Innes Centre, Norwich NR4 7UJ, UK

Integration of the three genomes in the plant cell: the extent of protein and nucleic acid traffic between nucleus, plastids and mitochondria. F. Quetier1, A.
Brennicke2 & J. A. Eisen3.
1, Genoscope, Centre Nationale de Sequencage, 2 rue Gaston Cremieux, CP 5706, 91057 Evry Cedex, France; 2, Molekulare Botanik, Universität Ulm, 89069 Ulm, Germany; 3, The Institute for Genomic Research, 9712 Medical Centre Drive, Rockville, Maryland 20850, USA

Transposable elements. T. Bureau1, B.-A. Legault1, Q.-H. Le1, N. Agrawal1, Z. Yu1 & R. Martienssen2
1, McGill University, Dept of Biology, 1205 rue Dr Penfield, Montreal, Quebec, H3A 1B1, Canada; 2, Plant Biology Group, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA

rDNA, telomeres and centromeres. G. P. Copenhaver1, S. Luo1, C. S. Pikaard2 & D. Preuss1
1, Howard Hughes Medical Institute, The University of Chicago, 1103 East 57th Street, Chicago, Illinois, USA; 2, Biology Department, Washington University in St Louis, St Louis, Missouri 63130, USA

Membrane transport. I. T. Paulsen1 & M. Sussman2
1, The Institute for Genomic Research, 9712 Medical Centre Drive, Rockville, Maryland 20850, USA; 2, University of Wisconsin Biotechnology Center, 425 Henry Mall, Madison, Wisconsin 53706, USA

DNA repair and recombination. A. B. Britt1 & J. A. Eisen2
1, Section of Plant Biology, University of California, Davis, California 95616, USA; 2, The Institute for Genomic Research, 9712 Medical Centre Drive, Rockville, Maryland 20850, USA

Gene regulation. D. A. Selinger1, R. Pandey1, D. W. Mount2, V. L. Chandler1, R. A. Jorgensen1 & C. Pikaard3
1, Department of Plant Sciences, University of Arizona, 303 Forbes Hall; and 2, Department of Molecular and Cellular Biology, University of Arizona, Tucson, Arizona 85721, USA; 3, Biology Department, Washington University in St Louis, St Louis, Missouri 63130, USA

Cellular organization. G. Juergens
Entwicklungsgenetik, ZMBP-Centre for Plant Molecular Biology, auf der Morgenstelle 1, Tuebingen D-72076, Germany

Development. E. M. Meyerowitz.
Division of Biology, California Institute of Technology, Pasadena, California 91125, USA

Signal transduction. J. R. Ecker1 & A. Theologis2.
1, The Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, California 92037, USA; 2, Plant Gene Expression Center/USDA-UC Berkeley, 800 Buchanan Street, Albany, California 94710, USA

Recognition of and response to pathogens. J. Dangl1 & J. D. G. Jones2
1, Biology Department, Coker Hall, University of North Carolina, Chapel Hill, North Carolina 27599, USA; 2, Sainsbury Laboratory, John Innes Centre, Colney Lane, Norwich NR4 7UJ, UK

Photomorphogenesis and photosynthesis. M. Chen & J. Chory
Howard Hughes Medical Institute and Plant Biology Laboratory, The Salk Institute, 10010 North Torrey Pines Road, La Jolla, California 92037, USA

Metabolism. C. Somerville
Carnegie Institution, 260 Panama Street, Stanford, California 94305, USA

Vol 436|11 August 2005|doi:10.1038/nature03895

ARTICLES

The map-based sequence of the rice genome
International Rice Genome Sequencing Project*

Rice, one of the world’s most important food plants, has important syntenic relationships with the other cereal species and is a model plant for the grasses. Here we present a map-based, finished quality sequence that covers 95% of the 389 Mb genome, including virtually all of the euchromatin and two complete centromeres. A total of 37,544 non-transposable-element-related protein-coding genes were identified, of which 71% had a putative homologue in Arabidopsis. In a reciprocal analysis, 90% of the Arabidopsis proteins had a putative homologue in the predicted rice proteome. Twenty-nine per cent of the 37,544 predicted genes appear in clustered gene families. The number and classes of transposable elements found in the rice genome are consistent with the expansion of syntenic regions in the maize and sorghum genomes.
We find evidence for widespread and recurrent gene transfer from the organelles to the nuclear chromosomes. The map-based sequence has proven useful for the identification of genes underlying agronomic traits. The additional single-nucleotide polymorphisms and simple sequence repeats identified in our study should accelerate improvements in rice production.

Rice (Oryza sativa L.) is the most important food crop in the world and feeds over half of the global population. As the first step in a systematic and complete functional characterization of the rice genome, the International Rice Genome Sequencing Project (IRGSP) has generated and analysed a highly accurate finished sequence of the rice genome that is anchored to the genetic map. Our analysis has revealed several salient features of the rice genome:

• We provide evidence for a genome size of 389 Mb. This size estimation is ~260 Mb larger than the fully sequenced dicot plant model Arabidopsis thaliana. We generated 370 Mb of finished sequence, representing 95% coverage of the genome and virtually all of the euchromatic regions.
• A total of 37,544 non-transposable-element-related protein-coding sequences were detected, compared with ~28,000–29,000 in Arabidopsis, with a lower gene density of one gene per 9.9 kb in rice. A total of 2,859 genes seem to be unique to rice and the other cereals, some of which might differentiate monocot and dicot lineages.
• Gene knockouts are useful tools for determining gene function and relating genes to phenotypes. We identified 11,487 Tos17 retrotransposon insertion sites, of which 3,243 are in genes.
• Between 0.38 and 0.43% of the nuclear genome contains organellar DNA fragments, representing repeated and ongoing transfer of organellar DNA to the nuclear genome.
• The transposon content of rice is at least 35% and is populated by representatives from all known transposon superfamilies.
• We have identified 80,127 polymorphic sites that distinguish between two cultivated rice subspecies, japonica and indica, resulting in a high-resolution genetic map for rice. Single-nucleotide polymorphism (SNP) frequency varies from 0.53 to 0.78%, which is 20 times the frequency observed between the Columbia and Landsberg erecta ecotypes of Arabidopsis.
• A comparison between the IRGSP genome sequence and the 6.3× indica and 6× japonica whole-genome shotgun sequence assemblies revealed that the draft sequences provided coverage of 69% by indica and 78% by japonica relative to the map-based sequence.

Rice has played a central role in human nutrition and culture for the past 10,000 years. It has been estimated that world rice production must increase by 30% over the next 20 years to meet projected demands from population increase and economic development1. Rice grown on the most productive irrigated land has achieved nearly maximum production with current strains1. Environmental degradation, including pollution, increase in night-time temperature due to global warming2, reductions in suitable arable land, water, labour and energy-dependent fertilizer provide additional constraints. These factors make steps to maximize rice productivity particularly important. Increasing yield potential and yield stability will come from a combination of biotechnology and improved conventional breeding. Both will be dependent on a high-quality rice genome sequence. Rice benefits from having the smallest genome of the major cereals, dense genetic maps and relative ease of genetic transformation3. The discovery of extensive genome colinearity among the Poaceae4 has established rice as the model organism for the cereal grasses. These properties, along with the finished sequence and other tools under development, set the stage for a complete functional characterization of the rice genome.
The International Rice Genome Sequencing Project

The IRGSP, formally established in 1998, pooled the resources of sequencing groups in ten nations to obtain a complete finished quality sequence of the rice genome (Oryza sativa L. ssp. japonica cv. Nipponbare). Finished quality sequence is defined as containing less than one error in 10,000 nucleotides, having resolved ambiguities, and having made all state-of-the-art attempts to close gaps. The IRGSP released a high-quality map-based draft sequence in December 2002. Three completely sequenced chromosomes have been published5–7, as well as two completely sequenced centromeres8–10. As the IRGSP subscribed to an immediate-release policy, high-quality map-based sequence has been public for some time. This has permitted rice geneticists to identify several genes underlying traits, and revealed very large and previously unknown segmental duplications that comprise 60% of the genome11–13. The public sequence has also revealed new details about the syntenic relationships and gene mobility between rice, maize and sorghum13–15.

*Lists of participants and affiliations appear at the end of the paper.

Physical maps, sequencing and coverage

The IRGSP sequenced the genome of a single inbred cultivar, Oryza sativa ssp. japonica cv. Nipponbare, and adopted a hierarchical clone-by-clone method using bacterial and P1 artificial chromosome clones (BACs and PACs, respectively). This strategy used a high-density genetic map16, expressed-sequence tags (ESTs)17, yeast artificial chromosome (YAC)- and BAC-based physical maps18–20, BAC-end sequences21 and two draft sequences22,23. A total of 3,401 BAC/PAC clones (Table 1) were sequenced to approximately tenfold sequence coverage, assembled, ordered and finished to a sequence quality of less than one error per 10,000 bases.
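The finished-quality threshold quoted above (fewer than one error per 10,000 bases) can be restated as an accuracy percentage or as a Phred-scaled quality score. The sketch below is illustrative only: the Phred convention (Q = -10 log10 of the error probability) is a standard sequencing measure that the text itself does not invoke, and the function and constant names are hypothetical.

```python
import math

# Threshold taken from the text: fewer than one error per 10,000 bases.
FINISHED_MAX_ERROR_RATE = 1 / 10_000


def phred_quality(error_rate: float) -> float:
    """Phred-scaled quality, Q = -10 * log10(error probability)."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in (0, 1]")
    return -10.0 * math.log10(error_rate)


def accuracy_percent(error_rate: float) -> float:
    """Per-base accuracy expressed as a percentage."""
    return 100.0 * (1.0 - error_rate)


q = phred_quality(FINISHED_MAX_ERROR_RATE)
acc = accuracy_percent(FINISHED_MAX_ERROR_RATE)
print(f"Phred quality at threshold: Q{q:.0f}; accuracy: {acc:.2f}%")
```

Under these assumptions the threshold corresponds to roughly Q40, which is the same statement as the 99.99% accuracy figure reported for the overlapping-sequence comparison.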
A majority of physical gaps in the BAC/PAC tiling path were bridged using a variety of substrates, including PCR fragments, 10-kb plasmids and 40-kb fosmid clones. A total of 62 unsequenced physical gaps, including nine centromere and 17 telomere gaps, remain on the 12 chromosomes (Table 2). Chromosome arm and telomere gaps were measured, and the nine centromere gaps were estimated on the basis of CentO satellite DNA content. The remaining gaps are estimated to total 18.1 Mb. Ninety-seven percent of the BAC/PACs and gap sequences (3,360) have been submitted as finished quality in the PLN division of GenBank/DDBJ/EMBL. These and the remaining draft-sequenced clones were used to construct pseudomolecules representing the 12 chromosomes of rice (Fig. 1). The total nucleotide sequence of the 12 pseudomolecules is 370,733,456 bp, with an N-average continuous sequence length of 6.9 Mb (see Table 1 for a definition of N-average length). Sequence quality was assessed by comparing 1.2 Mb of overlapping sequence produced by different laboratories. The overall accuracy was calculated as 99.99% (Supplementary Table 2). The statistics of sequenced PAC/BAC clones and pseudomolecules for each chromosome are shown in Table 1. The genome size of rice (O. sativa ssp. japonica cv. Nipponbare) was reported to have a haploid nuclear DNA content of 394 Mb on the basis of flow cytometry24, and 403 Mb on the basis of lengths of anchored BAC contigs and estimates of gap sizes20. Table 2 shows the calculated size for each chromosome and the estimated coverage. Adding the estimated length of the gaps to the sum of the nonoverlapping sequence, the total length of the rice nuclear genome was calculated to be 388.8 Mb. Therefore, the pseudomolecules are expected to cover 95.3% of the entire genome and an estimated 98.9% of the euchromatin. 
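The genome-size and coverage figures above follow from simple arithmetic, sketched here as a check. The sketch assumes that the "sum of the nonoverlapping sequence" equals the pseudomolecule total of 370,733,456 bp and that the gaps total exactly 18.1 Mb; both constants come from the text, while the function name is hypothetical.

```python
# Figures quoted in the text.
PSEUDOMOLECULE_BP = 370_733_456   # total length of the 12 pseudomolecules
ESTIMATED_GAP_BP = 18_100_000     # estimated total of the remaining gaps (18.1 Mb)


def genome_size_and_coverage(sequenced_bp: int, gap_bp: int) -> tuple[int, float]:
    """Return (estimated genome size in bp, percent of the genome sequenced)."""
    total_bp = sequenced_bp + gap_bp
    return total_bp, 100.0 * sequenced_bp / total_bp


total_bp, coverage_pct = genome_size_and_coverage(PSEUDOMOLECULE_BP, ESTIMATED_GAP_BP)
print(f"Estimated genome size: {total_bp / 1e6:.1f} Mb")
print(f"Pseudomolecule coverage: {coverage_pct:.1f}%")
```

With these inputs the calculation reproduces the reported 388.8 Mb and 95.3%. The 98.9% euchromatin figure depends on a separate estimate of heterochromatic gap content that is not given here, so it is not recomputed.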
An independent measure of genome coverage represented by the pseudomolecules was obtained by searching for unique EST markers19; of 8,440 ESTs, 8,391 (99.4%) were identified in the pseudomolecules.

Centromere location

Typical eukaryotic centromeres contain repetitive sequences, including satellite DNA at the centre and retrotransposons and transposons in the flanking regions. All rice centromeres contain the highly repetitive 155–165 bp CentO satellite DNA, together with centromere-specific retrotransposons25,26. The CentO satellites are located within the functional domain of the rice centromere10,26. Complete sequencing of the centromeres of rice chromosomes 4 and 8 revealed that they consist of 59 kb and 69 kb of clustered CentO repeats (respectively)8–10, tandemly arrayed head-to-tail within the clusters. Numerous retrotransposons, including the centromere-specific RIRE7, are found between and around the CentO repeats. CentO clusters show differences in length and orientation for the two centromeres. BLASTN analysis of the pseudomolecules indicated that about 0.9 Mb of CentO repeats (corresponding to more than 5,800 copies of the satellite) were sequenced and found to be associated with centromere-specific retroelements. Locations of all CentO sequences correspond to genetically identified centromere regions (Supplementary Table 3). Our pseudomolecules cover the centromere regions on chromosomes 4, 5 and 8, and portions of the centromeres on the remaining chromosomes (Fig. 1).

Gene content, expression and distribution

We masked the pseudomolecules for repetitive sequences and used the ab initio gene finder FGENESH to identify only non-transposable-element-related genes. A total of 37,544 non-transposable-element protein-coding sequences were predicted, resulting in a density of one gene per 9.9 kb (Supplementary Tables 4 and 5).
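Two of the figures in this passage, the EST-marker coverage and the gene density, can be checked with a few lines of arithmetic. The sketch assumes the density was computed over the total pseudomolecule length (370,733,456 bp, as reported earlier) rather than over the estimated whole-genome size; the helper names are hypothetical.

```python
# Figures quoted in the text.
PSEUDOMOLECULE_BP = 370_733_456   # total length of the 12 pseudomolecules
PREDICTED_GENES = 37_544          # non-transposable-element-related gene models
EST_MARKERS_TOTAL = 8_440         # unique EST markers searched for
EST_MARKERS_FOUND = 8_391         # markers located in the pseudomolecules


def kb_per_gene(sequence_bp: int, gene_count: int) -> float:
    """Average spacing between genes, in kilobases per gene."""
    return sequence_bp / gene_count / 1_000


def percent(part: int, whole: int) -> float:
    """Share of `whole` represented by `part`, as a percentage."""
    return 100.0 * part / whole


density_kb = kb_per_gene(PSEUDOMOLECULE_BP, PREDICTED_GENES)
est_coverage = percent(EST_MARKERS_FOUND, EST_MARKERS_TOTAL)
print(f"Gene density: one gene per {density_kb:.1f} kb")
print(f"EST markers recovered: {est_coverage:.1f}%")
```

Both results agree with the values stated in the text, which supports the assumption that the density refers to sequenced rather than total genome length.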
As the ability to identify unannotated and transposable-element-related genes improves, the true protein-coding gene number in rice will doubtless be revised.

Full-length complementary DNA sequences are available for rice27, and provide a powerful resource for improving gene model structure derived from ab initio gene finders28. Of the 37,544 non-transposable-element-related FGENESH models, 17,016 could be supported by a total of 25,636 full-length cDNAs (Supplementary Table 6). A total of 22,840 (61%) genes had a high identity match with a rice EST or full-length cDNA. On average, about 10.7 EST sequences were present for each expressed rice gene. A total of 2,927 genes aligned well with ESTs from other cereal species, and 330 of these genes matched only with a non-rice cereal EST (Supplementary Fig. 1). Except for the short arms of chromosomes 4, 9 and 10, which are known to be highly heterochromatic, the density of expressed genes is greater on the distal portions of the chromosome arms compared with the regions around the centromeres (Supplementary Fig. 2).

A total of 19,675 proteins had matches with entries in the SwissProt database; of these, 4,500 had no expression support. Domain searches revealed a minimum of one motif or domain present in 63% of the predicted proteins, with a total of 3,328 different domains present in the predicted rice proteome. The five most abundant domains were associated with protein kinases (Supplementary Table 7). Fifty-one per cent of the predicted proteins could be associated with a biological process (Supplementary Fig. 3a), with metabolism (29.1%) and cellular physiological processes (11.9%) representing the two most abundant classes.

Approximately 71% (26,837) of the predicted rice proteins have a homologue in the Arabidopsis proteome (Supplementary Fig. 4). In a reciprocal search, 89.8% (26,004) of the proteins from the Arabidopsis genome have a homologue in the rice proteome.
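The two homologue percentages differ because each is computed against a different proteome. The short check below recomputes the rice-side fractions from the counts given and back-calculates the Arabidopsis proteome size implied by "89.8% (26,004)". That implied size is an inference from the quoted numbers, not a figure stated in the text, and the variable names are hypothetical.

```python
# Counts and percentages quoted in the text.
RICE_PREDICTED = 37_544                    # predicted rice proteins
RICE_WITH_EXPRESSION_MATCH = 22_840        # rice EST or full-length cDNA match
RICE_WITH_ARABIDOPSIS_HOMOLOGUE = 26_837
ARABIDOPSIS_WITH_RICE_HOMOLOGUE = 26_004
ARABIDOPSIS_FRACTION = 0.898               # 89.8% of Arabidopsis proteins

# Fractions of the rice gene set.
expressed_pct = 100.0 * RICE_WITH_EXPRESSION_MATCH / RICE_PREDICTED
rice_pct = 100.0 * RICE_WITH_ARABIDOPSIS_HOMOLOGUE / RICE_PREDICTED

# Inferred denominator of the reciprocal search (an assumption, see lead-in).
implied_arabidopsis_proteins = ARABIDOPSIS_WITH_RICE_HOMOLOGUE / ARABIDOPSIS_FRACTION

print(f"Rice genes with expression support: {expressed_pct:.1f}%")
print(f"Rice proteins with an Arabidopsis homologue: {rice_pct:.1f}%")
print(f"Implied Arabidopsis proteome size: ~{implied_arabidopsis_proteins:,.0f}")
```

The implied denominator falls within the 28,000–29,000 Arabidopsis gene count quoted in the summary of salient features, so the two percentages are mutually consistent.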
Of the 23,170 rice genes with rice EST, cereal EST, or full-length cDNA support, 20,311 (88%) have a homologue in Arabidopsis. Fewer putative homologues were found in other model species: 38.1% in Drosophila, 40.8% in human, 36.5% in Caenorhabditis elegans, 30.2% in yeast, 17.6% in Synechocystis and 10.2% in Escherichia coli.

There are profound differences in plant architecture and biochemistry between monocotyledonous and dicotyledonous angiosperms. Only 2,859 rice genes with evidence of transcription lack homologues in the Arabidopsis genome. We investigated these to learn what functions they encoded. The vast majority had no matches, or most closely matched unknown or hypothetical proteins. The grasses have a class of seed storage proteins called prolamins that is not found in dicots. There are also families of hormone response proteins and defence proteins, such as proteinase inhibitors, chitinases, pathogenesis-related proteins and seed allergens, many of which are tandemly repeated (Supplementary Table 8). Nevertheless, with a large number of proteins of unknown function, the most interesting differences between the genome content of these two groups of angiosperms remain to be discovered.

Tos17 is an endogenous copia-like retrotransposon in rice that is inactive under normal growth conditions. In tissue culture, it becomes activated, transposes and is stably inherited when the plant is regenerated29. There are only two copies of Tos17 in the rice cultivar Nipponbare. These features, together with its preferential insertion into gene-rich regions, make Tos17 uniquely suitable for the functional analysis of rice genes by gene disruption. About 50,000 Tos17-insertion lines carrying 500,000 insertions have been produced30. A total of 11,487 target loci were mapped on the 12 pseudomolecules (Supplementary Fig. 5), with at least one insertion detected in 3,243 genes.
The density of Tos17 insertions is higher in euchromatic regions of the genome30, in contrast to the distribution of high-copy retrotransposons, which are more frequently found in pericentromeric regions. A similar target site preference has been reported for T-DNA insertions in Arabidopsis31.

Tandem gene families

One surprising outcome of the Arabidopsis genome analysis was the large percentage (17%) of genes arranged in tandem repeats32. When performing a similar analysis with rice, the percentage was comparable (14%). However, manual curation on rice chromosome 10 showed one gene family encoding a glycine-rich protein with 27 copies and one encoding a TRAF/BTB domain protein with 48 copies33. These tandemly repeated families are interrupted with other genes and are not included in strictly defined tandem repeats. We therefore screened for all tandemly arranged genes in 5-Mb intervals. Using these criteria, 29% of the genes (10,837) are amplified at least once in tandem, and 153 rice gene arrays contained 10–134 members (Supplementary Fig. 6). Sixty-five per cent of the tandem arrays with over 27 members, and 33% of all the arrays with over 10 members, contain protein kinase domains (Supplementary Table 9).

Non-coding RNA genes

The nucleolar organizer, consisting of 17S–5.8S–25S ribosomal DNA coding units, is found at the telomeric end of the short arm of chromosome 9 (ref. 34) in O. sativa ssp. japonica, and is estimated to comprise 7 Mb (ref. 35). A second 17S–5.8S–25S rDNA locus is found at the end of the short arm of chromosome 10 in O. sativa ssp. indica34. A single 5S cluster is present on the short arm of chromosome 11 in the vicinity of the centromere36, and encompasses 0.25 Mb. A total of 763 transfer RNA genes, including 14 tRNA pseudogenes were detected in the 12 pseudomolecules. In comparison, a total of 611 tRNA genes were detected in Arabidopsis32. Supplementary Fig. 7 shows the distribution of these tRNA genes in each chromosome.
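The relaxed tandem-array screen described under "Tandem gene families" (same-family genes within 5-Mb intervals, allowing unrelated genes in between) can be sketched as follows. This is one possible reading of the criterion, not the procedure the authors used: the chaining rule, the `Gene` record, the family labels and the toy coordinates are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int    # start coordinate in bp
    family: str   # hypothetical gene-family label


def tandem_arrays(genes: list[Gene], window_bp: int = 5_000_000) -> list[list[Gene]]:
    """Group same-family genes on one chromosome into arrays.

    Consecutive family members (sorted by position) are chained into the same
    array when they start within `window_bp` of each other; unrelated genes
    lying between them are ignored. Arrays of a single gene are dropped.
    """
    by_family = defaultdict(list)
    for gene in genes:
        by_family[(gene.chrom, gene.family)].append(gene)

    arrays = []
    for members in by_family.values():
        members.sort(key=lambda g: g.start)
        current = [members[0]]
        for gene in members[1:]:
            if gene.start - current[-1].start <= window_bp:
                current.append(gene)
            else:
                if len(current) > 1:
                    arrays.append(current)
                current = [gene]
        if len(current) > 1:
            arrays.append(current)
    return arrays


# Toy data: three kinase genes close together, interrupted by an unrelated gene,
# plus two distant kinase genes that should not join the array.
toy_genes = [
    Gene("a1", "chr10", 1_000_000, "kinase"),
    Gene("x1", "chr10", 1_100_000, "other"),
    Gene("a2", "chr10", 1_200_000, "kinase"),
    Gene("a3", "chr10", 4_000_000, "kinase"),
    Gene("a4", "chr10", 20_000_000, "kinase"),
    Gene("b1", "chr1", 500_000, "kinase"),
]
arrays = tandem_arrays(toy_genes)
for array in arrays:
    print([g.name for g in array])
```

Chaining neighbours rather than using fixed, non-overlapping 5-Mb bins is a design choice of this sketch; a fixed-bin screen could split an array that straddles a bin boundary, which may or may not match the published counts.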
Chromosome 4 has a single tRNA cluster6, and chromosome 10 has two large clusters derived from inserted chloroplast DNA7. Except for regions of intermediate density on chromosomes 1, 2, 8 and 12, there seem to be no other large clusters. MicroRNAs (miRNAs), a class of eukaryotic non-coding RNAs, are believed to regulate gene expression by interacting with the target messenger RNA37. miRNAs have been predicted from Arabidopsis38 and rice39, and we mapped 158 miRNAs onto the rice pseudomolecules (Supplementary Table 10). Among other non-coding RNAs, we identified 215 small nucleolar RNA (snoRNA) and 93 spliceosomal RNA genes, both showing biased chromosomal distributions, in the rice genome (Supplementary Table 11).

Organellar insertions in the nuclear genome

Mitochondria and chloroplasts originated from alpha-proteobacteria and cyanobacteria endosymbionts. A continuous transfer of organellar DNA to the nucleus has resulted in the presence of chloroplast and mitochondrial DNA inserted in the nuclear chromosomes. Although the endosymbionts probably contained genomes of several Mb at the time they were internalized, the organellar genomes diminished so that the present size of the mitochondrial genome is less than 600 kb, and that of the chloroplast is only 150 kb. Homology searches detected 421–453 chloroplast insertions and 909–1,191 mitochondrial insertions, depending upon the stringency adopted (Supplementary Fig. 8 and Supplementary Table 12). Thus, chloroplast and mitochondrial insertions contribute 0.20–0.24% and 0.18–0.19% of the nuclear genome of rice, respectively, and correspond to 5.3 chloroplast and 1.3 mitochondrial genome equivalents. The distribution of chloroplast and mitochondrial insertions over the 12 chromosomes indicates that mitochondrial and chloroplast transfers occurred independently. Two chromosomes harbour more insertions than the others (Supplementary Fig.
8 and Supplementary Table 12), with chromosome 12 containing nearly 1% mitochondrial DNA and chromosome 10 containing approximately 0.8% chloroplast DNA. It is clear that several successive transfer events have occurred, as insertions of less than 10 kb have heterogeneous identities. The longest insertions, however, systematically show >98.5% identity to organellar DNA (Supplementary Table 13), indicating recent insertions for both chloroplast and mitochondrial genomes.

© 2005 Nature Publishing Group. ARTICLES, NATURE | Vol 436 | 11 August 2005

Table 1 | Classification and distribution of sequenced PAC and BAC clones* on the 12 rice chromosomes

Chr | Sequencing laboratory† | PAC | BAC | OSJNBa/b | OJ | OSJNO | Others‡ | Total§ | Pseudomolecule (bp) | N-average length‖ (bp) | Accession no.
1 | RGP, KRGRP | 251 | 77 | 42 | 23 | 4 | 0 | 397 | 43,260,640 | 9,688,259 | AP008207
2 | RGP, JIC | 117 | 16 | 80 | 142 | 4 | 0 | 359 | 35,954,074 | 7,793,366 | AP008208
3 | ACWW, TIGR | 1 | 8 | 263 | 47 | 1 | 10 | 330 | 36,189,985 | 5,196,992 | AP008209
4 | NCGR | 2 | 7 | 275 | 7 | 0 | 0 | 291 | 35,489,479 | 1,427,419 | AP008210
5 | ASPGC | 67 | 11 | 113 | 87 | 0 | 0 | 278 | 29,733,216 | 3,086,418 | AP008211
6 | RGP | 169 | 20 | 78 | 14 | 0 | 0 | 281 | 30,731,386 | 8,669,608 | AP008212
7 | RGP | 102 | 19 | 68 | 97 | 0 | 0 | 286 | 29,643,843 | 14,923,781 | AP008213
8 | RGP | 113 | 23 | 56 | 83 | 2 | 0 | 277 | 28,434,680 | 14,872,702 | AP008214
9 | RGP, KRGRP, BIOTEC, BRIGI | 72 | 24 | 72 | 50 | 5 | 0 | 223 | 22,692,709 | 5,219,517 | AP008215
10 | ACWW, TIGR, PGIR | 1 | 5 | 172 | 6 | 0 | 21 | 205 | 22,683,701 | 2,124,647 | AP008216
11 | ACWW, TIGR, IIRGS, PGIR, Genoscope | 10 | 6 | 236 | 3 | 2 | 1 | 258 | 28,357,783 | 1,087,274 | AP008217
12 | Genoscope | 2 | 6 | 179 | 79 | 0 | 2 | 268 | 27,561,960 | 7,600,514 | AP008218
Total | | 907 | 222 | 1634 | 638 | 18 | 34 | 3453 | 370,733,456 | 6,928,182 |

Chr, chromosome.
* PAC, Rice Genome Research Program PAC; BAC, Rice Genome Research Program BAC; OSJNBa/b, Clemson University Genomics Institute BAC; OJ, Monsanto BAC; OSJNO, Arizona Genomics Institute fosmid (http://www.genome.arizona.edu/orders/direct.html?library=OSJNOa); Others, artificial gap-filling clones designated as OSJNA and OJA.
† ACWW (Arizona Genomics Institute, Cold Spring Harbor Laboratory, Washington University Genome Sequencing Center, University of Wisconsin) Rice Genome Sequencing Consortium; ASPGC, Academia Sinica Plant Genome Center; BIOTEC, National Center for Genetic Engineering and Biotechnology; BRIGI, Brazilian Rice Genome Initiative; IIRGS, Indian Initiative for Rice Genome Sequencing; JIC, John Innes Centre; KRGRP, Korea Rice Genome Research Program; NCGR, National Center for Gene Research; PGIR, Plant Genome Initiative at Rutgers; RGP, Rice Genome Research Program; TIGR, The Institute for Genomic Research.
‡ Constructs derived by joining (mostly from the clone gap regions) sequence from PCR fragments, Monsanto or Syngenta sequences and the neighbouring clone sequences.
§ A total of 2,494 BAC and 907 PAC clones were used for draft and finished sequencing. Monsanto draft-sequenced BACs underlie 638 finished clones. The Syngenta draft sequence contributed to the assemblies of 140 IRGSP clone sequences. Thirty-four sequence submissions are artificial constructs derived by joining a regional sequence (mostly from the clone gap regions) from PCR fragments, Monsanto or Syngenta sequences with the neighbouring clone sequences. This also includes 93 clones submitted as phase 1 or phase 2 to the HTG section of GenBank.
‖ N-average length: the average length of a contiguous segment (without sequence or physical gaps) containing a randomly chosen nucleotide.
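The N-average length defined in the last footnote of Table 1 can be read as a length-weighted mean: a randomly chosen nucleotide falls in a segment of length L_i with probability L_i / ΣL, so the expected segment length is ΣL_i² / ΣL_i. The following is a minimal Python sketch of that reading; the function name and the example segment lengths are hypothetical and are not taken from the IRGSP data.

```python
def n_average_length(segment_lengths):
    """Expected length of the gap-free segment that contains a randomly
    chosen nucleotide: sum(L_i ** 2) / sum(L_i)."""
    total = sum(segment_lengths)
    if total == 0:
        raise ValueError("no sequenced bases")
    return sum(length * length for length in segment_lengths) / total


# Hypothetical chromosome split into three contiguous segments (bp).
segments = [6_000_000, 3_000_000, 1_000_000]
print(n_average_length(segments))
```

Because long segments are weighted by their own length, the statistic is pulled towards the largest contiguous blocks, which is why a chromosome with a few gaps can still have an N-average length of several Mb.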
Transposable elements

The rice genome is populated by representatives from all known transposon superfamilies, including elements that cannot be easily classified into either class I or II (ref. 40). Previous estimates of the transposon content in the rice genome range from 10 to 25% (refs 21, 40). However, the increased availability of transposon query sequences and the use of profile hidden Markov models allow the identification of more divergent elements (ref. 41) and indicate that the transposon content of the O. sativa ssp. japonica genome is at least 35% (Table 3). Chromosomes 8 and 12 have the highest transposon content (38.0% and 38.3%, respectively), and chromosomes 1 (31.0%), 2 (29.8%) and 3 (29.0%) have the lowest proportion of transposons. Conversely, elements belonging to the IS5/Tourist and IS630/Tc1/mariner superfamilies, which are generally correlated with gene density, are prevalent on the first three chromosomes and least frequent on chromosomes 4 and 12.

Class II elements, characterized by terminal inverted-repeats and including the hAT, CACTA, IS256/Mutator, IS5/Tourist, and IS630/Tc1/mariner superfamilies, outnumber class I elements, which include long terminal-repeat (LTR) retrotransposons (Ty1/copia, Ty3/gypsy and TRIM) and non-LTR retrotransposons (LINEs and SINEs, or long and short interspersed nuclear elements, respectively), by more than twofold (Table 3). However, the nucleotide contribution of class I is greater than that of class II, due mostly to the large size of LTR retrotransposons and the small size of IS5/Tourist and IS630/Tc1/mariner elements. The inverse is the case for maize, for which class I elements outnumber class II elements (ref. 42). Given their larger sizes, differential amplification of LTR elements in maize compared with rice is consistent with the genomic expansion found between orthologous regions of rice and maize (refs 15, 33).
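As a rough illustration of why class I elements contribute more nucleotides despite being outnumbered, the class totals reported in Table 3 can be converted into mean annotated element lengths. The sketch below uses only the "Total class I" and "Total class II" rows of Table 3; the function and variable names are ours, and the simple division ignores overlapping and fragmented elements, so the results should be treated as approximate.

```python
# Class totals from Table 3: copy number (x10^3) and coverage (kb).
TABLE3_TOTALS = {
    "class I": {"copies_thousand": 61.9, "coverage_kb": 71734.4},
    "class II": {"copies_thousand": 163.8, "coverage_kb": 48066.6},
}


def mean_element_length_bp(copies_thousand, coverage_kb):
    """Mean annotated length per element copy, in bp."""
    return (coverage_kb * 1_000) / (copies_thousand * 1_000)


mean_lengths = {
    name: mean_element_length_bp(**row) for name, row in TABLE3_TOTALS.items()
}
copy_ratio = (
    TABLE3_TOTALS["class II"]["copies_thousand"]
    / TABLE3_TOTALS["class I"]["copies_thousand"]
)

for name, length in mean_lengths.items():
    print(f"{name}: about {length:.0f} bp per annotated copy")
print(f"class II / class I copy ratio: {copy_ratio:.2f}")
```

Under these assumptions the class II copies are more than twice as numerous but, on average, roughly a quarter of the length of class I copies, which is consistent with the larger nucleotide contribution of class I.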
Most class I elements are concentrated in gene-poor, heterochromatic regions such as the centromeric and pericentromeric regions (Supplementary Table 14). In contrast, members of some transposon superfamilies, including IS5/Tourist, IS630/Tc1/mariner and LINEs, have a significant positive correlation with both recombination rate and gene density. There is an effect of average element length associated with these patterns: short elements generally show a positive correlation with recombination rate and gene density, and are under-represented in the centromere regions, whereas larger elements have higher centromeric and pericentromeric abundance.

Intraspecific sequence polymorphism

Map-based cloning to identify genes that are associated with agronomic traits is dependent on having a high frequency of polymorphic markers to order recombination events. In rice, most of the segregating populations are generated from crosses between the two major subspecies of cultivated rice, Oryza sativa ssp. japonica and O. sativa ssp. indica. Although several studies on the polymorphisms detected between japonica and indica subspecies have been reported (refs 6, 43, 44), the analysis reported here uses an approach that ensures comparison of orthologous sequences. O. sativa ssp. indica cv. Kasalath and O. sativa ssp. japonica cv. Nipponbare are the parents of the most densely mapped rice population (ref. 16). BAC-end sequences were obtained from a Kasalath BAC library of 47,194 clones. Only high quality, single-copy sequences were mapped to the Nipponbare pseudomolecules, and only paired inverted sequences that mapped within 200 kb were considered. A total of 26,632 paired Kasalath BAC-end sequences were mapped to the 12 rice pseudomolecules (Supplementary Table 15). Kasalath BAC clones spanned 308 Mb or 79% of the Nipponbare genome. Sequence alignments with a PHRED quality value of 30 covered 12,319,100 bp (3%) of the total rice genome.
A total of 80,127 sites differed in the corresponding regions in Nipponbare and Kasalath. The frequency of SNPs varied between chromosomes (0.53–0.78%). Insertions and deletions were also detected. The ratio of small insertion/deletion site nucleotides (1–14 bases) against the alignment length (0.20–0.27%) was similar among the different chromosomes, and there was no preference for the direction of insertions or deletions. The main patterns of base substitutions observed between Nipponbare and Kasalath are shown in Supplementary Table 16. Transitions (70%) were the most prominent substitutions; this is a substantially higher fraction than found between Arabidopsis ecotypes Columbia and Landsberg erecta (ref. 32).

Class 1 simple sequence repeats in the rice genome

Class 1 simple sequence repeats (SSRs) are perfect repeats >20 nucleotides in length (ref. 45) that behave as hypervariable loci, providing a rich source of markers for use in genetics and breeding. A total of 18,828 Class 1 di-, tri- and tetra-nucleotide SSRs, representing 47 distinctive motif families, were identified and annotated on the rice genome (Supplementary Fig. 9). Supplementary Table 17 provides information about the physical positions of all Class 1 SSRs in relation to widely used restriction-fragment length polymorphisms (RFLPs) (refs 16, 46) and previously published SSRs (ref. 45). There was an average of 51 hypervariable SSRs per Mb, with the highest density of markers occurring on chromosome 3 (55.8 SSR Mb⁻¹) and the lowest occurring on chromosome 4 (41.0 SSR Mb⁻¹). A summary of information about the Class 1 SSRs identified in the rice pseudomolecules appears in Supplementary Table 18.

Figure 1 | Maps of the twelve rice chromosomes. For each chromosome (Chr 1–12), the genetic map is shown on the left and the PAC/BAC contigs on the right. The position of markers flanking the PAC/BAC contigs (green) is indicated on the genetic map. Physical gaps are shown in white and the nucleolar organizer on chromosome 9 is represented with a dotted green line. Constrictions in the genetic maps and arrowheads to the right of physical maps represent the chromosomal positions of centromeres for which rice CentO satellites are sequenced. The maps are scaled to genetic distances in centimorgans (cM) and the physical maps are depicted in relative physical lengths. Please refer to Table 2 for estimated lengths of the chromosomes.

Table 2 | Size of each chromosome based on sequence data and estimated gaps

Chr | Sequenced bases (bp) | Gaps on arm regions, No. | Gaps on arm regions, length (Mb) | Telomeric gaps* (Mb) | Centromeric gap† (Mb) | rDNA‡ (Mb) | Total (Mb) | Coverage§ (%) | Coverage‖ (%)
1 | 43,260,640 | 5 | 0.33 | 0.06 | 1.40 | - | 45.05 | 99.1 | 96.0
2 | 35,954,074 | 3 | 0.10 | 0.01 | 0.72 | - | 36.78 | 99.7 | 97.7
3 | 36,189,985 | 4 | 0.96 | 0.04 | 0.18 | - | 37.37 | 97.3 | 96.8
4 | 35,489,479 | 3 | 0.46 | 0.20 | - | - | 36.15 | 98.7 | 98.2
5 | 29,733,216 | 6 | 0.22 | 0.05 | - | - | 30.00 | 99.3 | 99.1
6 | 30,731,386 | 1 | 0.02 | 0.03 | 0.82 | - | 31.60 | 99.8 | 97.2
7 | 29,643,843 | 1 | 0.31 | 0.01 | 0.32 | - | 30.28 | 98.9 | 97.9
8 | 28,434,680 | 1 | 0.09 | 0.05 | - | - | 28.57 | 99.7 | 99.5
9 | 22,692,709 | 4 | 0.13 | 0.14 | 0.62 | 6.95 | 30.53 | 98.8 | 74.3
10 | 22,683,701 | 4 | 0.68 | 0.13 | 0.47 | - | 23.96 | 96.6 | 94.7
11 | 28,357,783 | 4 | 0.21 | 0.04 | 1.90 | 0.25 | 30.76 | 99.1 | 92.2
12 | 27,561,960 | 0 | 0.00 | 0.05 | 0.16 | - | 27.77 | 99.8 | 99.2
All | 370,733,456 | 36 | 3.51 | 0.81 | 6.59 | 7.20 | 388.82 | 98.9 | 95.3

* Estimated length including the telomeres, calculated with the average value of 3.2 kb for each chromosome (ref. 24).
† Estimated length of centromere-specific CentO repeats on each chromosome (ref. 26).
‡ Represents the estimated length of the 17S–5.8S–25S rDNA cluster on Chr 9 (ref. 35) and the 5S cluster on Chr 11 (ref. 24).
§ Coverage of the pseudomolecules for the euchromatic regions in each chromosome.
‖ Coverage of the pseudomolecules over the full length of each chromosome.
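The two coverage columns of Table 2 can be reproduced from the other columns. The sketch below assumes, as our reading of the footnotes rather than a stated formula, that the euchromatic length is the sequenced bases plus the arm and telomeric gaps, and that the full length additionally includes the centromeric gap and the rDNA. The function name is hypothetical; the input values are the Table 2 entries for chromosomes 1 and 9.

```python
def coverage_percent(sequenced_bp, arm_gaps_mb, telomeric_gaps_mb,
                     centromeric_gap_mb=0.0, rdna_mb=0.0):
    """Return (euchromatic coverage %, full-length coverage %), rounded to
    one decimal place as in Table 2."""
    sequenced_mb = sequenced_bp / 1e6
    # Assumption: euchromatic length = sequenced bases + arm and telomeric gaps.
    euchromatic_mb = sequenced_mb + arm_gaps_mb + telomeric_gaps_mb
    # Assumption: full length also counts the centromeric gap and rDNA arrays.
    total_mb = euchromatic_mb + centromeric_gap_mb + rdna_mb
    return (round(100 * sequenced_mb / euchromatic_mb, 1),
            round(100 * sequenced_mb / total_mb, 1))


# Chromosome 1 and chromosome 9 rows of Table 2.
chr1 = coverage_percent(43_260_640, 0.33, 0.06, centromeric_gap_mb=1.40)
chr9 = coverage_percent(22_692_709, 0.13, 0.14, centromeric_gap_mb=0.62,
                        rdna_mb=6.95)
print(chr1, chr9)
```

Under these assumptions the large drop in full-length coverage for chromosome 9 comes almost entirely from the unsequenced nucleolar organizer rather than from gaps in the chromosome arms.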
Several thousand of these SSRs have already been shown to amplify well and be polymorphic in a panel of diverse cultivars (ref. 45), and thus are of immediate use for genetic analysis.

Genome-wide comparison of draft versus finished sequences

Two whole-genome shotgun assemblies of draft-quality rice sequence have been published (refs 23, 47), and reassemblies of both have just appeared (ref. 48). One of these is an assembly of 6.28× coverage of O. sativa ssp. indica cv. 93-11. The second sequence is a ~6× coverage of O. sativa ssp. japonica cv. Nipponbare (refs 23, 48). These assemblies predict genome sizes of 433 Mb for japonica and 466 Mb for indica, which differ from our estimation of a 389 Mb japonica genome. Contigs from the whole-genome shotgun assembly of 93-11 and Nipponbare (ref. 48) were aligned with the IRGSP pseudomolecules. Nonredundant coverage of the pseudomolecules by the indica assembly varied from 78% for chromosome 3 to 59% for chromosome 12, with an overall coverage of 69% (Supplementary Table 19). When genes supported by full-length cDNA coverage were aligned to the covered regions, we found that 68.3% were completely covered by the indica sequences. The average size of the indica contigs is 8.2 kb, so it is not surprising that many did not completely cover the gene models defined here. The coverage of the Nipponbare whole-genome shotgun assembly varied from 68 to 82%, with an overall coverage of 78% of the genome, and the assembly completely covered 75.3% of the full-length cDNA-supported gene models. We undertook a detailed comparison of the first Mb of these assemblies on 1S (the short arm of chromosome 1) with the IRGSP chromosome 1 (Supplementary Fig. 10 and Supplementary Table 20). The numbers from this comparison agree with the whole-genome comparison described above. In addition, we observed that a substantial portion of the contigs from each assembly were non-homologous, misaligned or provided duplicate coverage. Indeed, the whole-genome shotgun assembly differed by 0.05% base-pair mismatches for the two aligned regions from the same Nipponbare cultivar. The two assemblies were further examined for the presence of the CentO sequence (Supplementary Table 21). Sixty-eight per cent of the copies observed in the 93-11 assembly and 32% of the CentO-containing contigs in the whole-genome shotgun Nipponbare assembly were found outside the centromeric regions. In contrast, the CentO repeats were restricted to the centromeric regions in the IRGSP pseudomolecules. It is unlikely that there are dispersed centromeres in indica rice; misassembly of the whole-genome shotgun sequences is a more likely explanation for dispersed CentO repeats. These observations indicate that the draft sequences, although providing a useful preliminary survey of the genome, might not be adequate for gene annotation, functional genomics or the identification of genes underlying agronomic traits.

Concluding remarks

The attainment of a complete and accurate map-based sequence for rice is compelling. We now have a blueprint for all of the rice chromosomes. We know, with a high level of confidence, the distribution and location of all the main components: the genes, repetitive sequences and centromeres. Substantial portions of the map-based sequence have been in public databases for some time, and the availability of provisional rice pseudomolecules based on this sequence has provided the scientific community with numerous opportunities to evaluate the genome, as indicated by the number of publications in rice biology and genetics over the past few years. Furthermore, the wealth of SNP and SSR information provided here and elsewhere will accelerate marker-assisted breeding and positional cloning, facilitating advances in rice improvement. The syntenic relationships between rice and the cereal grasses have long been recognized (ref. 4).
Comparing genome organization, genes and intergenic regions between cereal species will permit identification of regions that are highly conserved or rapidly evolving. Such regions are expected to yield crucial insights into genome evolution, speciation and domestication.

METHODS

Physical map and sequencing. Nine genomic libraries from Oryza sativa ssp. japonica cultivar Nipponbare were used to establish the physical map of rice chromosomes by polymerase chain reaction (PCR) screening (ref. 19), fingerprinting (ref. 20) and end-sequencing (ref. 21). The PAC, BAC and fosmid clones on the physical map were subjected to random shearing and shotgun sequencing to tenfold redundancy, using both universal primers and the dye-terminator or dye-primer methods. The sequences were assembled using PHRED (http://www.genome.washington.edu/UWGC/analysistools/Phred.cfm) and PHRAP (http://www.genome.washington.edu/UWGC/analysistools/Phrap.cfm) software packages or using the TIGR Assembler (http://www.tigr.org/software/assembler/). Sequence gaps were resolved by full sequencing of gap-bridge clones, PCR fragments or direct sequencing of BACs. Sequence ambiguities (indicated by PHRAP scores less than 30) were resolved by confirming the sequence data using alternative chemistries or different polymerases. We empirically determined that a PHRAP score of 30 or above exceeds the standard of less than one error in 10,000 bp. BAC and PAC assemblies were tested for accuracy by comparing computationally derived fingerprint patterns with experimentally determined patterns of restriction enzyme digests. Sequence quality was also evaluated by comparing independently obtained overlapping sequences. Small physical gaps were filled by long-range PCR. Remaining physical gaps were measured using fluorescence in situ hybridization analysis. We used the length of CentO arrays (ref. 26) to estimate the size of each of the remaining centromere gaps.

Annotation and bioinformatics.
Gene models were predicted using FGENESH (http://www.softberry.com/berry.phtml?topic=fgenesh) using the monocot trained matrix on the native and repeat-masked pseudomolecules. Gene models with incomplete open reading frames, those encoding proteins of less than 50 amino acids, or those corresponding to organellar DNA were omitted from the final set. The coordinates of transposable elements, excluding MITEs (miniature inverted-repeat transposable elements), were used to mask the pseudomolecules. Conserved domain/motif searches and association with gene ontologies were performed using InterproScan (http://www.ebi.ac.uk/InterProScan/) in combination with the Interpro2Go program. For biological processes, the number of detected domains was re-calculated as number of non-redundant proteins. The predicted rice proteome was searched using BLASTP against the proteomes of several model species for which a complete genome sequence and deduced protein set was available. Each rice chromosome was searched against the TIGR rice gene index (http://www.tigr.org/tdb/tgi/ogi/) and against gene index entries that aligned to gene models corresponding to expressed genes. In addition, five cereal gene indices (http://www.tigr.org/tdb/tgi/) were searched against the rice chromosomes, and gene index matches were recorded.

Table 3 | Transposons in the rice genome

Element | Copy no. (×10³) | Coverage (kb) | Fraction of genome (%)
Class I | | |
LINEs | 9.6 | 4161.3 | 1.12
SINEs | 1.8 | 209.9 | 0.06
Ty1/copia | 11.6 | 14266.7 | 3.85
Ty3/gypsy | 23.5 | 40363.3 | 10.90
Other class I | 15.4 | 12733.3 | 3.43
Total class I | 61.9 | 71734.4 | 19.35
Class II | | |
hAT | 1.1 | 1405.9 | 0.38
CACTA | 10.8 | 9987.3 | 2.69
IS630/Tc1/mariner | 67.0 | 8388.3 | 2.26
IS256/Mutator | 8.8 | 13485.7 | 3.64
IS5/Tourist | 57.9 | 12095.8 | 3.26
Other class II | 18.2 | 2703.6 | 0.73
Total class II | 163.8 | 48066.6 | 12.96
Other TEs | 23.6 | 6797.7 | 1.80
Total TEs | 249.3 | 129019.3* | 34.79

TE, transposable element.
* Total length; corrected for 2420.7 kb in overlaps of multiple, non-nested elements.
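The exclusion criteria applied to the predicted gene models in the annotation text above (incomplete open reading frame, protein shorter than 50 amino acids, or correspondence to organellar DNA) amount to a simple filter. The sketch below is one way to express it in Python; the `GeneModel` record, its field names and the example entries are hypothetical and do not reflect the actual IRGSP annotation pipeline or FGENESH output format.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    """Hypothetical minimal record for a predicted gene model."""
    name: str
    protein_length_aa: int
    has_complete_orf: bool
    is_organellar: bool


def keep_gene_model(model, min_protein_aa=50):
    """Keep a model only if it passes all three criteria stated in the text."""
    return (model.has_complete_orf
            and model.protein_length_aa >= min_protein_aa
            and not model.is_organellar)


# Invented examples, one per exclusion rule plus one model that passes.
models = [
    GeneModel("model_ok", 320, True, False),
    GeneModel("model_short", 42, True, False),
    GeneModel("model_partial", 500, False, False),
    GeneModel("model_chloroplast", 210, True, True),
]
final_set = [m.name for m in models if keep_gene_model(m)]
print(final_set)
```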
We searched the Oryza sativa ssp. japonica cv. Nipponbare collection of full-length cDNAs (ftp://cdna01.dna.affrc.go.jp/pub/data/), after first removing the transposable-element-related sequences, against the FGENESH models. Gene models with rice full-length cDNA, EST or cereal EST matches but without identifiable homologues in the Arabidopsis genome were searched for conserved domains/motifs using InterproScan, and for homologues in the Swiss-Prot database (http://us.expasy.org/sprot/) using BLASTP. All proteins with positive BLAST matches were further compared with the nr database (http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html#protein_databases), using BLASTP to eliminate truncated proteins and those with matches to other dicots.

Tandem gene families. The rice genome was subjected to a BLASTP search as previously described (ref. 32). The search was also performed by permitting more than one unrelated gene within the arrays, and the limit of the search was set to 5-Mb intervals to exclude large chromosomal duplications.

Non-coding RNAs. Transfer-RNA genes were detected by the program tRNAscan-SE (http://www.genetics.wustl.edu/eddy/tRNAscan-SE/). The miRNA registry in the Rfam database (http://www.sanger.ac.uk/Software/Rfam/) was used as a reference database for miRNAs. In addition, experimentally validated miRNAs of other species, excluding Arabidopsis miRNAs, were used for BLASTN queries against the pseudomolecules. Spliceosomal and snoRNAs were retrieved from the Rfam database and used for queries. BLASTN was used to find the location of snoRNAs and spliceosomal RNAs in the pseudomolecules.

Organellar insertions. Oryza sativa ssp. japonica Nipponbare chloroplast (GenBank NC_001320) and mitochondrial (GenBank BA000029) sequences were aligned with the pseudomolecules using BLASTN and MUMmer (ref. 49).

Transposable elements.
The TIGR Oryza Repeat Database, together with other published and unpublished rice transposable element sequences, was used to create RTEdb (a rice transposable element database) (ref. 50) and determine transposable element coordinates on the rice pseudomolecules. In the case of hAT, IS256/Mutator, IS5/Tourist and IS630/Tc1/mariner elements, family-specific profile hidden Markov models were applied using HMMER (ref. 41) (http://hmmer.wustl.edu/). The remaining superfamilies were annotated using RepeatMasker (http://www.repeatmasker.org/).

Tos17 insertions. Flanking sequences of transposed copies of 6,278 Tos17 insertion lines were isolated by modified thermal asymmetric interlaced (TAIL)-PCR and suppression PCR, and screened against the pseudomolecule sequences.

SNP discovery. BAC clones from an O. sativa ssp. indica var. Kasalath BAC library were end-sequenced. Sequence reads were omitted if they contained more than 50% nucleotides of low quality or high similarity to known repeats. The remaining sequences were subjected to BLASTN analysis against the pseudomolecules. Gaps within the alignments were classified as small insertions/deletions.

SSR loci. The Simple Sequence Repeat Identification Tool (http://www.gramene.org/) was used to identify simple sequence repeat motifs, and the physical position of all Class 1 SSRs was recorded. The copy number of SSR markers was estimated using electronic (e)-PCR to determine the number of independent hits of primer pairs on the pseudomolecules.

Whole-genome shotgun assembly analysis. Contigs from the BGI 6.28× whole genome assembly of O. sativa ssp. indica 93-11 (GenBank/DDBJ/EMBL accession number AAAA02000001–AAAA02050231) and the Syngenta 6× whole genome assembly of O. sativa ssp. japonica cv. Nipponbare (AACV01000001–AACV01035047; ref. 48) were aligned with the pseudomolecules using MUMmer (ref. 49). The number of IRGSP Nipponbare full-length cDNA-supported gene models completely covered by the aligned contigs was tabulated.
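The tabulation of gene models that are completely covered by aligned contigs can be sketched as an interval-containment test. The Python example below is illustrative only: it assumes half-open coordinates on a single pseudomolecule and, as a simplifying assumption that may differ from the procedure actually used, treats overlapping or abutting contig alignments as one covered block. All names and coordinates are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or directly adjacent half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(pair) for pair in merged]


def completely_covered(gene, merged):
    """True if the gene interval lies entirely inside one merged covered block."""
    gene_start, gene_end = gene
    return any(start <= gene_start and gene_end <= end for start, end in merged)


def tabulate_coverage(genes, alignments):
    """Return (number of genes completely covered, total number of genes)."""
    merged = merge_intervals(alignments)
    covered = sum(completely_covered(gene, merged) for gene in genes.values())
    return covered, len(genes)


# Hypothetical contig alignments and gene models (not real IRGSP data).
alignments = [(0, 100), (100, 250), (400, 500)]
genes = {"geneA": (50, 200), "geneB": (200, 450), "geneC": (410, 480)}
covered, total = tabulate_coverage(genes, alignments)
print(f"{covered}/{total} gene models completely covered")
```

A stricter variant would skip the merging step and require a single contig to span the whole gene model; with short draft contigs that choice would lower the covered fraction.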
The 155-bp CentO consensus sequence was used for BLAST analysis against the 93-11 and Nipponbare whole-genome shotgun contigs, and the coordinates of the positive hits recorded. Locations of centromeres for each indica chromosome were obtained with the CentO sequence positions on the IRGSP pseudomolecule of the corresponding chromosome. A detailed comparison of the BGI-assembled and -mapped Syngenta contigs (AACV01000001–AACV01000070) and the 93-11 contigs (AAAA02000001–AAAA02000093) was obtained by BLAST analysis against the IRGSP chromosome 1 pseudomolecule. Detailed procedures for the analyses described above can be found in the Supplementary Information.

Received 29 December 2004; accepted 25 May 2005.

1. Peng, S., Cassman, K. G., Virmani, S. S., Sheehy, J. & Khush, G. S. Yield potential trends of tropical rice since the release of IR8 and the challenge of increasing rice yield potential. Crop Sci. 39, 1552–1559 (1999).
2. Peng, S. et al. Rice yields decline with higher night temperature from global warming. Proc. Natl Acad. Sci. USA 101, 9971–9975 (2004).
3. Sasaki, T. & Burr, B. International Rice Genome Sequencing Project: the effort to completely sequence the rice genome. Curr. Opin. Plant Biol. 3, 138–141 (2000).
4. Moore, G., Devos, K. M., Wang, Z. & Gale, M. D. Cereal genome evolution: Grasses, line up and form a circle. Curr. Biol. 5, 737–739 (1995).
5. Sasaki, T. et al. The genome sequence and structure of rice chromosome 1. Nature 420, 312–316 (2002).
6. Feng, Q. et al. Sequence and analysis of rice chromosome 4. Nature 420, 316–320 (2002).
7. Rice Chromosome 10 Sequencing Consortium. In-depth view of structure, activity, and evolution of rice chromosome 10. Science 300, 1566–1569 (2003).
8. Wu, J. et al. Composition and structure of the centromeric region of rice chromosome 8. Plant Cell 16, 967–976 (2004).
9. Zhang, Y. et al. Structural features of the rice chromosome 4 centromere. Nucleic Acids Res. 32, 2023–2030 (2004).
10. Nagaki, K. et al. Sequencing of a rice centromere uncovers active genes. Nature Genet. 36, 138–145 (2004).
11. Guyot, R. & Keller, B. Ancestral genome duplication in rice. Genome 47, 610–614 (2004).
12. Simillion, C., Vandepoele, K., Saeys, Y. & Van de Peer, Y. Building genomic profiles for uncovering segmental homology in the twilight zone. Genome Res. 14, 1095–1106 (2004).
13. Paterson, A. H., Bowers, J. E. & Chapman, B. A. Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc. Natl Acad. Sci. USA 101, 9903–9908 (2004).
14. Salse, J., Piegu, B., Cooke, R. & Delseny, M. New in silico insight into the synteny between rice (Oryza sativa L.) and maize (Zea mays L.) highlights reshuffling and identifies new duplications in the rice genome. Plant J. 38, 396–409 (2004).
15. Lai, J. et al. Gene loss and movement in the maize genome. Genome Res. 14, 1924–1931 (2004).
16. Harushima, Y. et al. A high-density rice genetic linkage map with 2275 markers using a single F2 population. Genetics 148, 479–494 (1998).
17. Yamamoto, K. & Sasaki, T. Large-scale EST sequencing in rice. Plant Mol. Biol. 35, 135–144 (1997).
18. Saji, S. et al. A physical map with yeast artificial chromosome (YAC) clones covering 63% of the 12 rice chromosomes. Genome 44, 32–37 (2001).
19. Wu, J. et al. A comprehensive rice transcript map containing 6591 expressed sequence tag sites. Plant Cell 14, 525–535 (2002).
20. Chen, M. et al. An integrated physical and genetic map of the rice genome. Plant Cell 14, 537–545 (2002).
21. Mao, L. et al. Rice transposable elements: a survey of 73,000 sequence-tagged-connectors. Genome Res. 10, 982–990 (2000).
22. Barry, G. F. The use of the Monsanto draft rice genome sequence in research. Plant Physiol. 125, 1164–1165 (2001).
23. Goff, S. A. et al. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296, 92–100 (2002).
24. Ohmido, N., Kijima, K., Akiyama, Y., de Jong, J. H. & Fukui, K. Quantification of total genomic DNA and selected repetitive sequences reveals concurrent changes in different DNA families in indica and japonica rice. Mol. Gen. Genet. 263, 388–394 (2000).
25. Dong, F. et al. Rice (Oryza sativa) centromeric regions consist of complex DNA. Proc. Natl Acad. Sci. USA 95, 8135–8140 (1998).
26. Cheng, Z. et al. Functional rice centromeres are marked by a satellite repeat and a centromere-specific retrotransposon. Plant Cell 14, 1691–1704 (2002).
27. Kikuchi, S. et al. Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice. Science 301, 376–379 (2003).
28. Castelli, V. et al. Whole genome sequence comparisons and "full-length" cDNA sequences: a combined approach to evaluate and improve Arabidopsis genome annotation. Genome Res. 14, 406–413 (2004).
29. Hirochika, H., Sugimoto, K., Otsuki, Y., Tsugawa, H. & Kanda, M. Retrotransposons of rice involved in mutations induced by tissue culture. Proc. Natl Acad. Sci. USA 93, 7783–7788 (1996).
30. Miyao, A. et al. Target site specificity of the Tos17 retrotransposon shows a preference for insertion within genes and against insertion in retrotransposon-rich regions of the genome. Plant Cell 15, 1771–1780 (2003).
31. Alonso, J. M. et al. Genome-wide insertional mutagenesis of Arabidopsis thaliana. Science 301, 653–657 (2003).
32. Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
33. Song, R., Llaca, V. & Messing, J. Mosaic organization of orthologous sequences in grass genomes. Genome Res. 12, 1549–1555 (2002).
34. Shishido, R., Sano, Y. & Fukui, K. Ribosomal DNAs: an exception to the conservation of gene order in rice genomes. Mol. Gen. Genet. 263, 586–591 (2000).
35. Oono, K. & Sugiura, M. Heterogeneity of the ribosomal RNA gene clusters in rice. Chromosoma 76, 85–89 (1980).
36. Kamisugi, Y. et al. Physical mapping of the 5S ribosomal RNA genes on rice chromosome 11. Mol. Gen. Genet. 245, 133–138 (1994).
37. Bartel, D. P. MicroRNAs: Genomics, biogenesis, mechanism, and function. Cell 116, 281–297 (2004).
38. Wang, X. J., Reyes, J. L., Chua, N. H. & Gaasterland, T. Prediction and identification of Arabidopsis thaliana microRNAs and their mRNA targets. Genome Biol. 5, R65 (2004).
39. Wang, J. F., Zhou, H., Chen, Y. Q., Luo, Q. J. & Qu, L. H. Identification of 20 microRNAs from Oryza sativa. Nucleic Acids Res. 32, 1688–1695 (2004).
40. Turcotte, K., Srinivasan, S. & Bureau, T. Survey of transposable elements from rice genomic sequences. Plant J. 25, 169–179 (2001).
41. Eddy, S. R. Profile hidden Markov models. Bioinformatics 14, 755–763 (1998).
42. Messing, J. et al. Sequence composition and genome organization of maize. Proc. Natl Acad. Sci. USA 101, 14349–14354 (2004).
43. Shen, Y. J. et al. Development of genome-wide DNA polymorphism database for map-based cloning of rice genes. Plant Physiol. 135, 1198–1205 (2004).
44. Feltus, F. A. et al. An SNP resource for rice genetics and breeding based on subspecies indica and japonica genome alignments. Genome Res. 14, 1812–1819 (2004).
45. McCouch, S. R. et al. Development and mapping of 2240 new SSR markers for rice (Oryza sativa L.). DNA Res. 9, 257–279 (2002).
46. Causse, M. A. et al. Saturated molecular map of the rice genome based on an interspecific backcross population. Genetics 138, 1251–1274 (1994).
47. Yu, J. et al. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296, 79–92 (2002).
48. Yu, J. et al. The genomes of Oryza sativa: A history of duplications. PLoS Biol. 3, e38 (2005).
49. Delcher, A. L. et al. Alignment of whole genomes. Nucleic Acids Res. 27, 2369–2376 (1999).
50. Juretic, N., Bureau, T. E. & Bruskiewich, R. M. Transposable element annotation of the rice genome. Bioinformatics 20, 155–160 (2004).

Supplementary Information is linked to the online version of the paper at www.nature.com/nature.

Acknowledgements Work at the RGP was supported by the Ministry of Agriculture, Forestry and Fisheries of Japan. Work at TIGR was supported by grants to C.R.B. from the USDA Cooperative State Research, Education and Extension Service–National Research Initiative, the National Science Foundation and the US Department of Energy. Work at the NCGR was supported by the Chinese Ministry of Science and Technology, the Chinese Academy of Sciences, the Shanghai Municipal Commission of Science and Technology, and the National Natural Science Foundation of China. Work at Genoscope was supported by le Ministère de la Recherche, France. Funding for the work at the AGI and AGCoL was provided by grants to R.A.W. and C.S. from the USDA Cooperative State Research, Education and Extension Service–National Research Initiative, the National Science Foundation, the US Department of Energy and the Rockefeller Foundation. Work at CSHL was supported by grants from the USDA Cooperative State Research, Education and Extension Service–National Research Initiative and from the National Science Foundation. Work at the ASPGC was supported by Academia Sinica, National Science Council, Council of Agriculture, and Institute of Botany, Academia Sinica. The IIRGS acknowledges the Department of Biotechnology, Government of India, for financial assistance and the Indian Council of Agricultural Research, New Delhi, for support. Work at Rice Gene Discovery was supported by BIOTECH and the Princess Sirindhorn's Plant Germplasm Conservation Initiative Program. Work at PGIR was supported by Rutgers University.
The BRIGI was supported by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Financiadora de Estudos e Projetos - Ministério de Ciência e Tecnologia (FINEP-MCT), Fundação de Amparo a Pesquisa do Rio Grande do Sul (FAPERGS) and Universidade Federal de Pelotas (UFPel). Work at McGill and York Universities was supported by the National Science and Engineering Research Council of Canada and the Canadian International Development Agency. Funding for H.H. at the National Institute of Agrobiological Sciences was from the Ministry of Agriculture, Forestry, and Fisheries of Japan, and the Program for Promotion of Basic Research Activities for Innovative Biosciences. Funding at Brookhaven National Laboratory was from The Rockefeller Foundation and the Office of Basic Energy Science of the United States Department of Energy. We would like to thank G. Barry and S. Goff for their help in negotiating agreements that permitted the sharing of materials and sequence with the IRGSP. We also acknowledge the work of G. Barry, S. Goff and their colleagues in facilitating the transfer of sequence information and supporting data.

Author Information The genomic sequence is available under accession numbers AP008207–AP008218 in international databases (DDBJ, GenBank and EMBL). Reprints and permissions information is available at npg.nature.com/reprintsandpermissions. The authors declare no competing financial interests. Correspondence and requests for materials should be addressed to Takuji Sasaki ([email protected]).

© 2005 Nature Publishing Group. NATURE | Vol 436 | 11 August 2005

International Rice Genome Sequencing Project (Participants are arranged by area of contribution and then by institution.)
Physical Maps and Sequencing: Rice Genome Research Program (RGP) Takashi Matsumoto1, Jianzhong Wu1, Hiroyuki Kanamori1, Yuichi Katayose1, Masaki Fujisawa1, Nobukazu Namiki1, Hiroshi Mizuno1, Kimiko Yamamoto1, Baltazar A. Antonio1, Tomoya Baba1, Katsumi Sakata1, Yoshiaki Nagamura1, Hiroyoshi Aoki1, Koji Arikawa1, Kohei Arita1, Takahito Bito1, Yoshino Chiden1, Nahoko Fujitsuka1, Rie Fukunaka1, Masao Hamada1, Chizuko Harada1, Akiko Hayashi1, Saori Hijishita1, Mikiko Honda1, Satomi Hosokawa1, Yoko Ichikawa1, Atsuko Idonuma1, Masumi Iijima1, Michiko Ikeda1, Maiko Ikeno1, Kazue Ito1, Sachie Ito1, Tomoko Ito1, Yuichi Ito1, Yukiyo Ito1, Aki Iwabuchi1, Kozue Kamiya1, Wataru Karasawa1, Kanako Kurita1, Satoshi Katagiri1, Ari Kikuta1, Harumi Kobayashi1, Noriko Kobayashi1, Kayo Machita1, Tomoko Maehara1, Masatoshi Masukawa1, Tatsumi Mizubayashi1, Yoshiyuki Mukai1, Hideki Nagasaki1, Yuko Nagata1, Shinji Naito1, Marina Nakashima1, Yuko Nakama1, Yumi Nakamichi1, Mari Nakamura1, Ayano Meguro1, Manami Negishi1, Isamu Ohta1, Tomoya Ohta1, Masako Okamoto1, Nozomi Ono1, Shoko Saji1, Miyuki Sakaguchi1, Kumiko Sakai1, Michie Shibata1, Takanori Shimokawa1, Jianyu Song1, Yuka Takazaki1, Kimihiro Terasawa1, Mika Tsugane1, Kumiko Tsuji1, Shigenori Ueda1, Kazunori Waki1, Harumi Yamagata1, Mayu Yamamoto1, Shinichi Yamamoto1, Hiroko Yamane1, Shoji Yoshiki1, Rie Yoshihara1, Kazuko Yukawa1, Huisun Zhong1, Masahiro Yano1, Takuji Sasaki (Principal Investigator)1 ; The Institute for Genomic Research (TIGR) Qiaoping Yuan2, Shu Ouyang2, Jia Liu2, Kristine M. Jones2, Kristen Gansberger2, Kelly Moffat2, Jessica Hill2, Jayati Bera2, Douglas Fadrosh2, Shaohua Jin2, Shivani Johri2, Mary Kim2, Larry Overton2, Matthew Reardon2, Tamara Tsitrin2, Hue Vuong2, Bruce Weaver2, Anne Ciecko2, Luke Tallon2, Jacqueline Jackson2, Grace Pai2, Susan Van Aken2, Terry Utterback2, Steve Reidmuller2, Tamara Feldblyum2, Joseph Hsiao2, Victoria Zismann2, Stacey Iobst2, Aymeric R. de Vazeille2, C. 
Robin Buell (Principal Investigator)2; National Center for Gene Research Chinese Academy of Sciences (NCGR) Kai Ying3, Ying Li3, Tingting Lu3, Yuchen Huang3, Qiang Zhao3, Qi Feng3, Lei Zhang3, Jingjie Zhu3, Qijun Weng3, Jie Mu3, Yiqi Lu3, Danlin Fan3, Yilei Liu3, Jianping Guan3, Yujun Zhang3, Shuliang Yu3, Xiaohui Liu3, Yu Zhang3, Guofan Hong3, Bin Han (Principal Investigator)3; Genoscope Nathalie Choisne4, Nadia Demange4, Gisela Orjeda4, Sylvie Samain4, Laurence Cattolico4, Eric Pelletier4, Arnaud Couloux4, Beatrice Segurens4, Patrick Wincker4, Angelique D’Hont5, Claude Scarpelli4, Jean Weissenbach4, Marcel Salanoubat4, Francis Quetier (Principal Investigator)4; Arizona Genomics Institute (AGI) and Arizona Genomics Computational Laboratory (AGCol) Yeisoo Yu6, Hye Ran Kim6, Teri Rambo6, Jennifer Currie6, Kristi Collura6, Meizhong Luo6, Tae-Jin Yang6, Jetty S. S. Ammiraju6, Friedrich Engler6, Carol Soderlund6, Rod A. Wing (Principal Investigator)6; Cold Spring Harbor Laboratory (CSHL) Lance E. Palmer7, Melissa de la Bastide7, Lori Spiegel7, Lidia Nascimento7, Theresa Zutavern7, Andrew O’Shaughnessy7, Sujit Dike7, Neilay Dedhia7, Raymond Preston7, Vivekanand Balija7, W. Richard McCombie (Principal Investigator)7; Academia Sinica Plant Genome Center (ASPGC) Teh-Yuan Chow8, Hong-Hwa Chen9, Mei-Chu Chung8, Ching-San Chen8, Jei-Fu Shaw8, Hong-Pang Wu8, Kwang-Jen Hsiao10, Ya-Ting Chao8, Mu-kuei Chu8, Chia-Hsiung Cheng8, Ai-Ling Hour8, Pei-Fang Lee8, Shu-Jen Lin8, Yao-Cheng Lin8, John-Yu Liou8, Shu-Mei Liu8, Yue-Ie Hsing (Principal Investigator)8; Indian Initiative for Rice Genome Sequencing (IIRGS), University of Delhi South Campus (UDSC) S. Raghuvanshi11, A. Mohanty11, A. K. Bharti11,13, A. Gaur11, V. Gupta11, D. Kumar11, V. Ravi11, S. Vij11, A. Kapur11, Parul Khurana11, Paramjit Khurana11, J. P. Khurana11, A. K. Tyagi (Principal Investigator)11; Indian Initiative for Rice Genome Sequencing (IIRGS), Indian Agricultural Research Institute (IARI) K. Gaikwad12, A. 
Singh12, V. Dalal12, S. Srivastava12, A. Dixit12, A. K. Pal12, I. A. Ghazi12, M. Yadav12, A. Pandit12, A. Bhargava12, K. Sureshbabu12, K. Batra12, T. R. Sharma12, T. Mohapatra12, N. K. Singh (Principal Investigator)12; Plant Genome Initiative at Rutgers (PGIR) Joachim Messing (Principal Investigator)13, Amy Bronzino Nelson13, Galina Fuks13, Steve Kavchok13, Gladys Keizer13, Eric Linton Victor Llaca13, Rentao Song13, Bahattin Tanyolac13, Steve Young13; Korea Rice Genome Research Program (KRGRP) Kim Ho-Il14, Jang Ho Hahn (Principal Investigator)14; National Center for Genetic Engineering and Biotechnology (BIOTEC) G. Sangsakoo15, A. Vanavichit (Principal Investigator)15; Brazilian Rice Genome Initiative (BRIGI) Luiz Anderson Teixeira de Mattos16, Paulo Dejalma Zimmer16, Gaspar Malone16, Odir Dellagostin16, Antonio Costa de Oliveira (Principal Investigator)16; John Innes Centre (JIC) Michael Bevan17, Ian Bancroft17; Washington University School of Medicine Genome Sequencing Center Pat Minx18, Holly Cordum18, Richard Wilson18; University of Wisconsin–Madison Zhukuan Cheng19, Weiwei Jin19, Jiming Jiang19, Sally Ann Leong20 Annotation and Analysis: Hisakazu Iwama21, Takashi Gojobori21,22, Takeshi Itoh22,23, Yoshihito Niimura24, Yasuyuki Fujii25, Takuya Habara25, Hiroaki Sakai23,25, Yoshiharu Sato22, Greg Wilson26, Kiran Kumar27, Susan McCouch26, Nikoleta Juretic28, Douglas Hoen28, Stephen Wright29, Richard Bruskiewich30, Thomas Bureau28, Akio Miyao23, Hirohiko Hirochika23, Tomotaro Nishikawa23, Koh-ichi Kadowaki23 & Masahiro Sugiura31 Coordination: Benjamin Burr32 Affiliations for participants: 1National Institute of Agrobiological Sciences/Institute of the Society for Techno-innovation of Agriculture, Forestry and Fisheries, 2-1-2 Kannondai, Tsukuba, Ibaraki 305-8602, Japan. 2The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA. 
3Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences (CAS), 500 Caobao Road, Shanghai 200233, China. 4Centre National de Séquençage, INRA-URGV, and CNRS UMR-8030, 2, rue Gaston Crémieux, CP 5706, 91057 EVRY Cedex, France. 5UMR PIA, Cirad-Amis, TA40-03 avenue Agropolis, 34398 Montpellier Cedex 05, France. 6Department of Plant Sciences, BIO5 Institute, The University of Arizona, Tucson, Arizona 85721, USA. 7Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11723, USA. 8Institute of Botany, Academia Sinica, 128, Sec. 2, Yen-Chiu-Yuan Rd, Nankang, Taipei 11529, Taiwan. 9National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan 701, Taiwan. 10National Yang-Ming University, 155, Sec. 2, Li-Nong St, Peitou, Taipei 112, Taiwan. 11Department of Plant Molecular Biology, University of Delhi South Campus, New Delhi 110021, India. 12National Research Centre on Plant Biotechnology, Indian Agricultural Research Institute, New Delhi 110012, India. 13Waksman Institute, Rutgers University, Piscataway, New Jersey 08854, USA. 14National Institute of Agricultural Science and Technology, RDA, Suwon, 441-707 Republic of Korea. 15Rice Gene Discovery Unit, Kasetsart University, Nakron Pathom 73140, Thailand. 16Centro de Genomica e Fitomelhoramento, UFPel, Pelotas, RS, l 96001-970, Brazil. 17John Innes Centre, Norwich Research Park, Colney, Norwich NR4 7UH, UK. 18Washington University Genome Sequencing Center, 3333 Forest Park Boulevard, St. Louis, Missouri 63108, USA. 19University of Wisconsin, Department of Horticulture, Madison, Wisconsin 53706, USA. 20University of Wisconsin, Department of Plant Pathology, Madison, Wisconsin 53706, USA. 21Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Mishima 411-8540, Japan. 22Biological Information Research Center, National Institute of Advanced Industrial Science and Technology, Koto-ku, Tokyo 135-0064, Japan. 
23National Institute of Agrobiological Sciences, Tsukuba, Ibaraki 305-8602, Japan. 24Medical Research Institute, Tokyo Medical and Dental University, Bunkyo-ku, Tokyo 113-8510, Japan. 25Japan Biological Information Research Center, Japan Biological Informatics Consortium, Koto-ku, Tokyo 1350064, Japan. 26Plant Breeding Dept, Cornell University, Ithaca, New York 14850-1901, USA. 27Cold Spring Harbor Laboratory, PO Box 100, 1 Bungtown Road, Cold Spring Harbor, New York 11724, USA. 28Department of Biology, McGill University, 1205 Dr Penfield Avenue, Montreal, Quebec H3A 1B1, Canada. 29Department of Biology, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, Canada. 30Biometrics and Bioinformatics Unit, International Rice Research Institute, DAPO Box 7777, Metro Manila, Philippines. 31 Graduate School of Natural Sciences, Nagoya City University, Nagoya 467-8501, Japan. 32Biology Department, Brookhaven National Laboratory, Upton, New York 11973, USA. 800 © 2005 Nature Publishing Group RESEARCH ARTICLES G. A. Tuskan,1,3* S. DiFazio,1,4† S. Jansson,5† J. Bohlmann,6† I. Grigoriev,9† U. Hellsten,9† N. Putnam,9† S. Ralph,6† S. Rombauts,10† A. Salamov,9† J. Schein,11† L. Sterck,10† A. Aerts,9 R. R. Bhalerao,5 R. P. Bhalerao,12 D. Blaudez,13 W. Boerjan,10 A. Brun,13 A. Brunner,14 V. Busov,15 M. Campbell,16 J. Carlson,17 M. Chalot,13 J. Chapman,9 G.-L. Chen,2 D. Cooper,6 P. M. Coutinho,19 J. Couturier,13 S. Covert,20 Q. Cronk,7 R. Cunningham,1 J. Davis,22 S. Degroeve,10 A. Déjardin,23 C. dePamphilis,18 J. Detter,9 B. Dirks,24 I. Dubchak,9,25 S. Duplessis,13 J. Ehlting,7 B. Ellis,6 K. Gendler,26 D. Goodstein,9 M. Gribskov,27 J. Grimwood,28 A. Groover,29 L. Gunter,1 B. Hamberger,7 B. Heinze,30 Y. Helariutta,12,31,33 B. Henrissat,19 D. Holligan,21 R. Holt,11 W. Huang,9 N. Islam-Faridi,34 S. Jones,11 M. Jones-Rhoades,35 R. Jorgensen,26 C. Joshi,15 J. Kangasjärvi,32 J. Karlsson,5 C. Kelleher,6 R. Kirkpatrick,11 M. Kirst,22 A. Kohler,13 U. Kalluri,1 F. Larimer,2 J. 
Leebens-Mack,21 J.-C. Leplé,23 P. Locascio,2 Y. Lou,9 S. Lucas,9 F. Martin,13 B. Montanini,13 C. Napoli,26 D. R. Nelson,36 C. Nelson,37 K. Nieminen,31 O. Nilsson,12 V. Pereda,13 G. Peter,22 R. Philippe,6 G. Pilate,23 A. Poliakov,25 J. Razumovskaya,2 P. Richardson,9 C. Rinaldi,13 K. Ritland,8 P. Rouzé,10 D. Ryaboy,25 J. Schmutz,28 J. Schrader,38 B. Segerman,5 H. Shin,11 A. Siddiqui,11 F. Sterky,39 A. Terry,9 C.-J. Tsai,15 E. Uberbacher,2 P. Unneberg,39 J. Vahala,32 K. Wall,18 S. Wessler,21 G. Yang,21 T. Yin,1 C. Douglas,7‡ M. Marra,11‡ G. Sandberg,12‡ Y. Van de Peer,10‡ D. Rokhsar9,24‡

We report the draft genome of the black cottonwood tree, Populus trichocarpa. Integration of shotgun sequence assembly with genetic mapping enabled chromosome-scale reconstruction of the genome. More than 45,000 putative protein-coding genes were identified. Analysis of the assembled genome revealed a whole-genome duplication event; about 8000 pairs of duplicated genes from that event survived in the Populus genome. A second, older duplication event is indistinguishably coincident with the divergence of the Populus and Arabidopsis lineages. Nucleotide substitution, tandem gene duplication, and gross chromosomal rearrangement appear to proceed substantially more slowly in Populus than in Arabidopsis. Populus has more protein-coding genes than Arabidopsis, ranging on average from 1.4 to 1.6 putative Populus homologs for each Arabidopsis gene. However, the relative frequency of protein domains in the two genomes is similar. Overrepresented exceptions in Populus include genes associated with lignocellulosic wall biosynthesis, meristem development, disease resistance, and metabolite transport.

Forests cover 30% (about 3.8 billion ha) of Earth's terrestrial surface, harbor substantial biodiversity, and provide humanity with benefits such as clean air and water, lumber, fiber, and fuels. Worldwide, one-quarter of all industrial feedstocks have their origins in forest-based resources (1).
Large and long-lived forest trees grow in extensive wild populations across continents, and they have evolved under selective pressures unlike those of annual herbaceous plants. Their growth and development involve extensive secondary growth, coordinated signaling and distribution of water and nutrients over great distances, and strategic storage and redistribution of metabolites in concordance with interannual climatic cycles. Their need to survive and thrive in fixed locations over centuries under continually changing physical and biotic stresses also sets them apart from short-lived plants. Many of the features that distinguish trees from other organisms, especially their large sizes and long generation times, present challenges to the study of the cellular and molecular mechanisms that underlie their unique biology. To enable and facilitate such investigations in a relatively well-studied model tree, we describe here the draft genome of black cottonwood, Populus trichocarpa (Torr. & Gray), and compare it to other sequenced plant genomes. P. trichocarpa was selected as the model forest species for genome sequencing not only because of its modest genome size but also because of its rapid growth, relative ease of experimental manipulation, and range of available genetic tools (2, 3). The genus is phenotypically diverse, and interspecific hybrids facilitate the genetic mapping of economically important traits related to growth rate, stature, wood properties, and paper quality. Dozens of quantitative trait loci have already been mapped (4), and methods of genetic transformation have been developed (5). Under appropriate conditions, Populus can reach reproductive maturity in as few as 4 to 6 years, permitting selective breeding for large-scale sustainable plantation forestry.
Finally, rapid growth of trees coupled with thermochemical or biochemical conversion of the lignocellulosic portion of the plant has the potential to provide a renewable energy resource with a concomitant reduction of greenhouse gases (6–8). 15 SEPTEMBER 2006 VOL 313 SCIENCE 1 Environmental Sciences Division, 2Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA. 3Plant Sciences Department, University of Tennessee, TN 37996, USA. 4 Department of Biology, West Virginia University, Morgantown, WV 26506, USA. 5Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, SE-901 87, Umeå, Sweden. 6 Michael Smith Laboratories, 7Department of Botany, 8Department of Forest Sciences, University of British Columbia, Vancouver, BC V6T 1Z4, Canada. 9U.S. Department of Energy, Joint Genome Institute, Walnut Creek, CA 94598, USA. 10 Department of Plant Systems Biology, Flanders Interuniversity Institute for Biotechnology (VIB), Ghent University, B-9052 Ghent, Belgium. 11Genome Sciences Centre, 100-570 West 7th Avenue, Vancouver, BC V5Z 4S6, Canada. 12Umeå Plant Science Centre, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, SE-901 83 Umeå, Sweden. 13Tree-Microbe Interactions Unit, Institut National de la Recherche Agronomique (INRA)–Université Henri Poincaré, INRA-Nancy, 54280 Champenoux, France. 14Department of Forestry, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA. 15Biotechnology Research Center, School of Forest Resources and Environmental Science, Michigan Technological University, Houghton, MI 49931, USA. 16Department of Cell and Systems Biology, University of Toronto, 25 Willcocks Street, Toronto, Ontario, M5S 3B2 Canada. 
17School of Forest Resources and Huck Institutes of the Life Sciences, 18Department of Biology, Institute of Molecular Evolutionary Genetics, and Huck Institutes of Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA. 19Architecture et Fonction des Macromolécules Biologiques, UMR6098, CNRS and Universities of Aix-Marseille I and II, case 932, 163 avenue de Luminy, 13288 Marseille, France. 20Warnell School of Forest Resources, 21Department of Plant Biology, University of Georgia, Athens, GA 30602, USA. 22 School of Forest Resources and Conservation, Genetics Institute, and Plant Molecular and Cellular Biology Program, University of Florida, Gainesville, FL 32611, USA. 23INRAOrléans, Unit of Forest Improvement, Genetics and Physiology, 45166 Olivet Cedex, France. 24Center for Integrative Genomics, University of California, Berkeley, CA 94720, USA. 25Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA. 26Department of Plant Sciences, University of Arizona, Tucson, AZ 85721, USA. 27Department of Biological Sciences, Purdue University, West Lafayette, IN 47907, USA. 28 The Stanford Human Genome Center and the Department of Genetics, Stanford University School of Medicine, Palo Alto, CA 94305, USA. 29Institute of Forest Genetics, United States Department of Agriculture, Forest Service, Davis, CA 95616, USA. 30Federal Research Centre for Forests, Hauptstrasse 7, A-1140 Vienna, Austria. 31Plant Molecular Biology Laboratory, Institute of Biotechnology, 32Department of Biological and Environmental Sciences, University of Helsinki, FI-00014 Helsinki, Finland. 33Department of Biology, 200014, University of Turku, FI-20014 Turku, Finland. 34Southern Institute of Forest Genetics, United States Department of Agriculture, Forest Service and Department of Forest Science, Texas A&M University, College Station, TX 77843, USA. 
35Whitehead Institute for Biomedical Research and Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02142, USA. 36Department of Molecular Sciences and Center of Excellence in Genomics and Bioinformatics, University of Tennessee, Memphis, TN 38163, USA. 37Southern Institute of Forest Genetics, United States Department of Agriculture, Forest Service, Saucier, MS 39574, USA. 38Developmental Genetics, University of Tübingen, D-72076 Tübingen, Germany. 39Department of Biotechnology, KTH, AlbaNova University Center, SE-106 91 Stockholm, Sweden.
*To whom correspondence should be addressed. E-mail: [email protected]
†These authors contributed equally to this work as second authors.
‡These authors contributed equally to this work as senior authors.

Downloaded from www.sciencemag.org on February 28, 2010

The Genome of Black Cottonwood, Populus trichocarpa (Torr. & Gray)

Sequencing and Assembly

A single female genotype, “Nisqually 1,” was selected and used in a whole-genome shotgun

Table 1. Characterization of polymorphisms according to their positions relative to predicted coding sequences, introns, and untranslated regions (UTRs). Rate shows the percentage of potential sites of each class that were polymorphic. Most indels within exons resulted in frame shifts, but we could not quantify this due to difficulties with assembly and sequencing of regions containing indels. Nonsense mutations created stop codons within predicted exons.

Source            Number of loci   Rate (%)
Noncoding         1,027,322        0.32
Intron            141,199          0.25
3′ UTR            6,731            0.25
5′ UTR            3,306            0.24
Exon              62,656           0.14
Within exons:
  Indels          2,722            0.01
  Nonsense        926              0.02
  Nonsynonymous   32,207           0.10

ing to chloroplast (fig. S5) and mitochondrial genomes were assembled into circular genomes of 157 and 803 kb, respectively (9). We anchored the 410 Mb of assembled scaffolds to a sequence-tagged genetic map (fig. S3).
In total, 356 microsatellite markers were used to assign 155 scaffolds (335 Mb of sequence) to the 19 P. trichocarpa chromosome-scale linkage groups (13). The vast majority (91%) of the mapped microsatellite markers were colinear with the sequence assembly. At the extremes, the smallest chromosome, LGIX [79 centimorgans (cM)], is covered by two scaffolds containing 12.5 Mb of assembled sequence, whereas the largest chromosome, LGI (265 cM), contains 21 scaffolds representing 35.5 Mb (fig. S3). We also generated a physical map based on bacterial artificial chromosome (BAC) fingerprint contigs using a Nisqually-1 BAC library representing an estimated 9.5-fold genome coverage (fig. S2). Paired BAC-end sequences from most of the physical map were linked to the large-scale assembly, permitting 2460 of the physical map contigs to be positioned on the genome assembly. Combining the genetic and physical maps, nearly 385 Mb of the 410 Mb of assembled sequence are placed on a linkage group.

Unlike Arabidopsis, where predominantly self-fertilizing ecotypes maintain low levels of allelic polymorphism, Populus species are predominantly dioecious, which results in obligate outcrossing. This compulsory outcrossing, along with wind pollination and wind-dispersed plumose seeds, results in high levels of gene flow and high levels of heterozygosity (that is, within-individual genetic polymorphisms). Within the heterozygous Nisqually-1 genome, we identified 1,241,251 single-nucleotide polymorphisms (SNPs) or small insertion/deletion polymorphisms (indels) for an overall rate of approximately 2.6 polymorphisms per kilobase. Of these polymorphisms, the overwhelming majority (83%) occurred in noncoding portions of the genome (Table 1). Short indels and SNPs within exons resulted in frameshifts and nonsense stop codons within predicted exons, respectively, suggesting that null alleles of these genes exist in one of the haplotypes.
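The polymorphism figures quoted above can be cross-checked with simple arithmetic. The sketch below uses only the counts given in the text and in Table 1. The choice of denominator for the per-kilobase rate is an assumption: the text does not say whether the rate was taken over the 410 Mb assembly or over the roughly 485 Mb estimated genome size, so both are computed.

```python
# Counts taken from the text and from Table 1.
TOTAL_POLYMORPHISMS = 1_241_251
NONCODING_POLYMORPHISMS = 1_027_322

# Candidate denominators in megabases (which one the authors used is an assumption).
ASSEMBLED_MB = 410
ESTIMATED_GENOME_MB = 485


def per_kilobase(count, megabases):
    """Return the number of events per kilobase for a region given in Mb."""
    return count / (megabases * 1000)


noncoding_share = NONCODING_POLYMORPHISMS / TOTAL_POLYMORPHISMS
rate_over_assembly = per_kilobase(TOTAL_POLYMORPHISMS, ASSEMBLED_MB)
rate_over_genome = per_kilobase(TOTAL_POLYMORPHISMS, ESTIMATED_GENOME_MB)

print(f"noncoding share: {noncoding_share:.1%}")
print(f"rate over assembly: {rate_over_assembly:.2f} per kb")
print(f"rate over estimated genome: {rate_over_genome:.2f} per kb")
```

Under these assumptions the noncoding share rounds to the quoted 83%, and the quoted rate of about 2.6 per kilobase is reproduced with the 485 Mb denominator, whereas the 410 Mb denominator gives about 3.0.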
Some of the polymorphisms may be artifacts from the assembly process, although these errors were minimized by using stringent criteria for SNP identification (9).

Gene Annotation

We tentatively identified a first-draft reference set of 45,555 protein-coding gene loci in the Populus nuclear genome (www.jgi.doe.gov/poplar) using a variety of ab initio, homology-based, and expressed sequence tag (EST)–based methods (14–17) (table S5). Similarly, 101 and 52 genes were annotated in the chloroplast and mitochondrial genomes, respectively (9). To aid the annotation process, 4664 full-length sequences, from full-length enriched cDNA libraries from Nisqually 1, were generated and used in training the gene-calling algorithms. Before gene prediction, repetitive sequences were characterized (fig. S15 and table S14) and masked; additional putative transposable elements were identified and subsequently removed from the reference gene set (9). Given the current draft nature of the genome, we expect that the gene set in Populus will continue to be refined. About 89% of the predicted gene models had homology [expectation (E) value ≤ 1 × 10^-8] to the nonredundant (NR) set of proteins from the National Center for Biotechnology Information, including 60% with extensive homology that spans 75% of both model and NR protein lengths. Nearly 12% (5248) of the predicted Populus genes had no detectable similarity to Arabidopsis genes (E value ≤ 1 × 10^-3); conversely, in the more refined Arabidopsis set, only 9% (2321) of the predicted genes had no similarity to the Populus reference set. Of the 5248 Populus genes without Arabidopsis similarity, 1883 have expression evidence from the manually curated Populus EST data set, and of these, 274 have no hits (E value ≥ 1 × 10^-3) to the NR database (9). Whole-genome oligonucleotide microarray analysis provided evidence of tissue-based expression for 53% of the reference gene models (Fig. 1).
In addition, a signal was detected from 20% of genes that were initially annotated and excluded from the reference set, suggesting that as many as 4000 additional genes (or gene fragments) may be present. Within the reference gene set, we identified 13,019 pairs of orthologs between

sequence and assembly strategy (9). Roughly 7.6 million end-reads representing 4.2 billion high-quality (i.e., Q20 or higher) base pairs were assembled into 2447 major scaffolds containing an estimated 410 megabases (Mb) of genomic DNA (tables S1 and S2). On the basis of the depth of coverage of major scaffolds (~7.5× depth) and the total amount of nonorganellar shotgun sequence that was generated, the Populus genome size was estimated to be 485 ± 10 Mb (±SD), in rough agreement with previous cytogenetic estimates of about 550 Mb (10). The near completeness of the shotgun assembly in protein-coding regions is supported by the identification of more than 95% of known Populus cDNA in the assembly. The ~75 Mb of unassembled genomic sequence is consistent with cytogenetic evidence that ~30% of the genome is heterochromatic (9). The amount of euchromatin contained within the Populus genome was estimated in parallel by subtraction on the basis of direct measurements of 4′,6′-diamidino-2-phenylindole–stained prophase and metaphase chromosomes (fig. S4). On average, 69.5 ± 0.3% of the genome consisted of euchromatin, with a significantly lower proportion of euchromatin in linkage group I (LGI) (66.4 ± 1.1%) compared with the other 18 chromosomes (69.7 ± 0.03%, P ≤ 0.05). In contrast, Arabidopsis chromosomes contain roughly 93% euchromatin (11).
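The genome-size estimate described above rests on a simple relationship: genome size is approximately the total number of sequenced bases divided by the average depth of coverage. The sketch below is a minimal illustration of that relationship, not the authors' procedure (which is described in their supporting material). The function names are hypothetical, and because the text does not give the exact amount of nonorganellar sequence, the code works backward from the quoted 485 Mb and 7.5-fold depth.

```python
def genome_size_mb(total_bases, depth):
    """Estimate genome size in Mb as total sequenced bases / average depth."""
    return total_bases / depth / 1e6


def implied_total_bases(genome_mb, depth):
    """Bases of shotgun sequence implied by a genome size (Mb) and a depth."""
    return genome_mb * 1e6 * depth


ASSEMBLED_MB = 410          # quoted size of the major scaffolds
ESTIMATED_GENOME_MB = 485   # quoted genome-size estimate
DEPTH = 7.5                 # quoted approximate depth of coverage

# Nonorganellar sequence implied by the quoted estimate (about 3.6 billion bases).
nonorganellar_bases = implied_total_bases(ESTIMATED_GENOME_MB, DEPTH)

# Upper bound if all 4.2 billion high-quality bases were nuclear (an assumption
# made only for comparison; some reads are organellar or contaminant).
upper_bound_mb = genome_size_mb(4.2e9, DEPTH)

unassembled_mb = ESTIMATED_GENOME_MB - ASSEMBLED_MB

print(f"implied nonorganellar sequence: {nonorganellar_bases / 1e9:.2f} Gb")
print(f"upper bound on genome size: {upper_bound_mb:.0f} Mb")
print(f"unassembled sequence: {unassembled_mb} Mb")
```

The difference between the estimated genome size and the assembled scaffolds reproduces the roughly 75 Mb of unassembled sequence mentioned in the text.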
The unassembled shotgun sequences were derived from variants of organellar DNA, including recent nuclear translocations; highly repetitive genomic DNA; haplotypic segments that were redundant with short subsegments of the major scaffolds (separated as a result of extensive sequence polymorphism and allelic variants); and contaminants of the template DNA, such as endophytic microbes inhabiting the leaf and root tissues used for template preparation (12) (fig. S1 and table S3). The end-reads correspond-

Fig. 1. Whole-genome oligonucleotide microarray expression data for all predicted gene models in P. trichocarpa. Values represent the proportion of genes expressed above negative controls at a 5% false discovery rate. The x axis represents the subsets of predicted genes that were analyzed for the annotated and promoted P. trichocarpa gene set (42,373 genes), chloroplast gene set (49 genes), mitochondria gene set (49 genes), annotated, nonpromoted gene set (10,875 genes), and microRNAs (48 miRNAs).

genes in Populus and Arabidopsis using the best bidirectional Basic Local Alignment Search Tool (BLAST) hits, with average mutual coverage of these alignments equal to 93%; 11,654 pairs of orthologs had greater than 90% alignment of gene lengths, whereas only 156 genes had less than 50% coverage. As of 1 June 2006, ~10% (4378) of the gene models have been manually validated and curated.

Genome Organization

Genome duplication in the Salicaceae. Populus and Arabidopsis lineages diverged about 100 to 120 million years ago (Ma). Analysis of the Populus genome provided evidence of a more recent duplication event that affected roughly 92% of the Populus genome. Nearly 8000 pairs of paralogous genes of similar age (excluding tandem or local duplications) were identified (Fig. 2).
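The "best bidirectional BLAST hit" criterion used above for ortholog identification can be illustrated with a short sketch. Two genes are paired only if each is the other's top-scoring hit. The gene identifiers and scores below are invented for illustration, and the input format (a dictionary keyed by query and subject) is an assumption rather than the pipeline the authors ran.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores: Pt* stand for Populus genes, At* for Arabidopsis genes.
populus_vs_arabidopsis = {
    ("Pt1", "At1"): 500, ("Pt1", "At2"): 300,
    ("Pt2", "At1"): 450,
    ("Pt3", "At3"): 200,
}
arabidopsis_vs_populus = {
    ("At1", "Pt1"): 500, ("At1", "Pt2"): 450,
    ("At2", "Pt1"): 300,
    ("At3", "Pt3"): 200,
}

pairs = reciprocal_best_hits(populus_vs_arabidopsis, arabidopsis_vs_populus)
print(pairs)
```

In this toy data Pt2's best hit is At1, but At1's best hit is Pt1, so Pt2 is left unpaired. This is the typical outcome when one species retains two duplicates of a gene, which is relevant here given the extra gene copies in Populus.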
The relative age of the duplicate genes was estimated by the accumulated nucleotide divergence at fourfold synonymous third-codon transversion position (4DTV) values. A sharp peak in 4DTV values, corrected for multiple substitutions, representing a burst of gene duplication, is evident at 0.0916 ± 0.0004 (Fig. 3A). Comparison of 1825 Populus and Salix orthologous genes derived from Salix ESTs suggests that both genera share this whole-genome duplication event (Fig. 3B). Moreover, the parallel karyotypes and collinear genetic maps (18) of Salix and Populus also support the conclusion that both lineages share the same large-scale genome history. If we naively calibrated the molecular clock using synonymous rates observed in the Brassicaceae (19) or derived from the Arabidopsis-Oryza divergence (20), we would conclude that the genome duplication in Populus is very recent [8 to 13 Ma, as reported by Sterck (21)]. Yet the fossil record shows that the Populus and Salix lineages diverged 60 to 65 Ma (22–25).

Fig. 2. Chromosome-level reorganization of the most recent genome-wide duplication event in Populus. Common colors refer to homologous genome blocks, presumed to have arisen from the salicoid-specific genome duplication 65 Ma, shared by two chromosomes. Chromosomes are indicated by their linkage group number (I to XIX). The diagram to the left uses the same color coding and further illustrates the chimeric nature of most linkage groups.

Fig. 3. (A) The 4DTV metrics for paralogous gene pairs in Populus-Populus and Populus-Arabidopsis. Three separate genome-wide duplication events are detectable, with the most recent event contained within the Salicaceae and the middle event apparently shared among the Eurosids. (B) Percent identity distributions for mutual best EST hit to Populus trichocarpa CDS.

Thus, the molecular clock in Populus must be ticking at only
one-sixth the estimated rate for Arabidopsis (that is, 8 to 13 Ma divided by 60 to 65 Ma). Qualitatively similar slowing of the molecular clock is found in the Populus chloroplast and mitochondrial genomes (9). Because Populus is a long-lived vegetatively propagated species, it has the potential to successfully contribute gametes to multiple generations. A single Populus genotype can persist as a clone on the landscape for millennia (26), and we propose that recurrent contributions of “ancient gametes” from very old individuals could account for the markedly reduced rate of sequence evolution. As a result of the slowing of the molecular clock, the Populus genome most likely resembles the ancestral eurosid genome.

To test whether the burst of gene creation 60 to 65 Ma was due to a single whole-genome event or to independent but near-synchronous gene duplication events, we used a variant of the algorithm of Hokamp et al. (27) to identify segments of conserved synteny within the Populus genome. The longest conserved syntenic block from the 4DTV ~0.09 epoch spanned 765 pairs of paralogous genes. In total, 32,577 genes were contained within syntenic blocks from the salicoid epoch; half of these genes were contained in segments longer than 142 paralogous pairs. The same algorithm, when applied to randomly shuffled genes, typically yields duplicate blocks with fewer than 8 to 9 genes, indicating that the Populus gene duplications occurred as a single genome-wide event. We refer to this duplication event as the “salicoid” duplication event. Nearly every mapped segment of the Populus genome had a parallel “paralogous” segment elsewhere in the genome as a result of the salicoid event (Fig. 2).
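The one-sixth figure for the relative clock rate follows directly from the two date ranges quoted in the text. The sketch below is a worked check of that arithmetic only; treating the ratio of the range midpoints as the representative value is an assumption made for illustration.

```python
# Date ranges quoted in the text, in millions of years (Ma).
CLOCK_BASED_MA = (8, 13)     # duplication age under a naive Arabidopsis-calibrated clock
FOSSIL_BASED_MA = (60, 65)   # Populus-Salix divergence from the fossil record

# Extreme ratios: youngest clock date over oldest fossil date, and the reverse.
lowest_ratio = CLOCK_BASED_MA[0] / FOSSIL_BASED_MA[1]
highest_ratio = CLOCK_BASED_MA[1] / FOSSIL_BASED_MA[0]

# Ratio of the midpoints of the two ranges.
midpoint_ratio = (sum(CLOCK_BASED_MA) / 2) / (sum(FOSSIL_BASED_MA) / 2)

print(f"ratio range: {lowest_ratio:.3f} to {highest_ratio:.3f}")
print(f"midpoint ratio: {midpoint_ratio:.3f} (one-sixth is {1 / 6:.3f})")
```

The ratio spans roughly one-eighth to one-fifth, and the midpoint ratio of 0.168 is very close to one-sixth.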
The pinwheel patterns can be understood as a whole-genome duplication followed by a series of reciprocal tandem terminal fusions between two separate sets of four chromosomes each: the first involving LGII, V, VII, and XIV and the second involving LGI, XI, IV, and IX. In addition, several chromosomes appear to have experienced minor reorganizational exchanges. Furthermore, LGI appears to be the result of multiple rearrangements involving three major tandem fusions. These results suggest that the progenitor of Populus had a base chromosome number of 10. After the whole-genome duplication event, this base chromosome number experienced a genome-wide reorganization and diploidization of the duplicated chromosomes into four pairs of complete paralogous chromosomes (LGVI, VIII, X, XII, XIII, XV, XVI, XVIII, and XIX); two sets of four chromosomes, each containing a terminal translocation (LGI, II, IV, V, VII, IX, and XI); and one chromosome containing three terminally joined chromosomes (LGIII with I or XVII with VII). The colinearity of genetic maps among multiple Populus species suggests that the genome reorganization occurred before the evolution of the modern taxa of Populus.

Genome duplication in a common ancestor of Populus and Arabidopsis. The distribution of 4DTV values for paralogous pairs of genes also shows that a large fraction of the Populus genome falls in a set of duplicated segments anchored by gene pairs with 4DTV at 0.364 ± 0.001, representing the residue of a more ancient, large-scale, apparently synchronous duplication event (Fig. 3A). This relatively older duplication event covers about 59% of the Populus genome, with 16% of genes in these segments present in two copies. Because this duplication preceded and is therefore superimposed upon the salicoid event, each genomic region is potentially covered by four such segments.
Similarly, the Arabidopsis genome experienced an older "beta" duplication that preceded the Brassicaceae-specific "alpha" event (28–32). We next asked whether the Arabidopsis "beta" (30, 32) and Populus 4DTV ~0.36 duplication events were (i) independent genome-wide duplications that occurred after the split from the last common eurosid ancestor (H1) or (ii) a single shared duplication event that occurred in an ancestral lineage (i.e., before the divergence of eurosid lineages I and II) (H2). These two hypotheses have very different implications for the interpretation of homology between Populus and Arabidopsis. Under H1, each genomic segment in one species is homologous to four segments in the other; whereas under H2, each segment is homologous to only two segments in the other species. These hypotheses were tested by comparing the relative distances between gene pairs sampled within and between Populus and Arabidopsis. H2 was generally supported (9), but we could not reject H1. We can only conclude that the Populus genome duplication occurred very close to the time of divergence of the eurosid I and II lineages (9), with slight support for a shared duplication. This coincident timing raises the possibility of a causal link between this duplication and rapid diversification early in eurosid (and perhaps core eudicot) history. We refer to this older Populus/Arabidopsis duplication event as the "eurosid" duplication event. We note that the salicoid duplication occurred independently of the eurosid duplication observed in the Arabidopsis genome.

Gene Content

Although Populus has substantially more protein-coding genes than Arabidopsis, the relative frequency of domains represented in protein databases (Prints, Prosite, Pfam, ProDom, and SMART) in the two genomes is similar (9). However, the most common domains occur in Populus compared with Arabidopsis in a ratio ranging from 1.4:1 to 1.8:1.
Noteworthy outliers in Populus include genes and gene domains associated with disease and insect resistance (such as, in Populus versus Arabidopsis, respectively: leucine-rich repeats, 1271 versus 527; NB-ARC domain, 302 versus 141; and thaumatin, 55 versus 24), meristem development (such as NAC transcription factors, 157 versus 100, respectively), and metabolite and nutrient transport [such as oligopeptide transporters of the proton-dependent oligopeptide transporter (POT) and oligopeptide transporter (OPT) families, 129 versus 61, and potassium transporter, 30 versus 13, respectively]. Some domains were underrepresented in Populus compared with Arabidopsis. For example, the F-box domain was twice as prevalent in Arabidopsis as in Populus (624 versus 303, respectively). The F-box domain is involved in diverse and complex interactions involving protein degradation through the ubiquitin-26S proteasome pathway (33). Many of the ubiquitin-associated domains are underrepresented in Populus compared with Arabidopsis (for example, the Ulp1 protease family and the C-terminal catalytic domain, 10 versus 63, respectively). Moreover, the RING-finger domains are nearly equally present in both genomes (503 versus 407, respectively), suggesting that protein degradation pathways in the two organisms are metabolically divergent.

The common eurosid gene set. The Populus and Arabidopsis gene sets were compared to infer the conserved gene complement of their common eurosid ancestor, integrating information from nucleotide divergence, synteny, and mutual best BLAST-hit analysis (9). The ancestral eurosid genome contained at least 11,666 protein-coding genes, along with an undetermined number that were either lost in one or both of the lineages or whose homology could not be detected.
These ancestral genes were the progenitors of gene families of typically one to four descendants in each of the complete plant genomes and account for 28,257 Populus and 17,521 Arabidopsis genes. Gene family lists are accessible at www.phytozome.net. The gene predictions in these two genomes that could not be accounted for in the eurosid clusters were often fragmentary or difficult to categorize, and we could not confidently assign orthology to them. They may include previously unidentified or rapidly evolving genes in the Populus and/or Arabidopsis lineages, as well as poorly predicted genes.

Noncoding RNAs. Based on a series of publicly available RNA detection algorithms (34), including tRNAScan-SE, INFERNAL, and snoScan, we identified 817 putative tRNAs; 22 U1, 26 U2, 6 U4, 23 U5, and 11 U6 spliceosomal small nuclear RNAs (snRNAs); 339 putative C/D small nucleolar RNAs (snoRNAs); and 88 predicted H/ACA snoRNAs in the Populus assembly. All 57 possible anticodon tRNAs were found. One selenocysteine tRNA was detected, and two possible suppressor tRNAs (anticodons that bind stop codons) were also identified. Populus has nearly 1.3 times as many tRNA genes as Arabidopsis. In contrast to Arabidopsis (fig. S7A), the copy number of tRNA in Populus was significantly and positively correlated with amino acid occurrence in predicted gene models (fig. S7B). The ratio of the number of snRNAs in Populus compared with the number in Arabidopsis is 1.3 to 1.0, yet U1, U2, and U5 are overrepresented in Populus, whereas U4 is underrepresented. Furthermore, U14 was not detected in Arabidopsis. The snRNAs and snoRNAs have not been experimentally verified in Populus. There are 169 identified microRNA (miRNA) genes representing 21 families in Populus (table S7).
In Arabidopsis, these 21 families contain 91 miRNA genes, representing a 1.9X expansion in Populus, primarily in miR169 and miR159/319. All 21 miRNA families have regulatory targets that appear to be conserved among Arabidopsis and Populus (table S8). Similar to the miRNA genes themselves, the number of predicted targets for these miRNA is expanded in Populus (147) compared with Arabidopsis (89). Similarly, the genes that mediate RNA interference (RNAi) are also overrepresented in Populus (21) compared to Arabidopsis (11) [e.g., AGO1 class, 7 versus 3; RNA helicase, 2 versus 1; HEN, 2 versus 1; HYL1-like (double-stranded RNA binding proteins), 9 versus 5, respectively].

Tandem duplications. In Populus there were 1518 tandemly duplicated arrays of two or more genes based on a Smith-Waterman alignment E value ≤ 10⁻²⁵ and a 100-kb window. The total number of genes in such arrays was 4839 and the total length of tandemly duplicated segments in Populus was 47.9 Mb, or 15.6% of the genome (fig. S8). By the same criteria, there are 1366 tandemly duplicated segments in Arabidopsis, covering 32.4 Mb, or 27% of the genome. By far the most common number of genes within a single array was two, with 958 such arrays in Populus and 805 in Arabidopsis. Arabidopsis had a larger number of arrays containing six or more genes than did Populus. Tandem duplications thus appear to be relatively more common in Arabidopsis than in Populus. This may in part be due to difficulties in assembling tandem repeats from a whole-genome shotgun sequencing approach, particularly when tandemly duplicated genes are highly conserved. Alternatively, the Populus genome may be undergoing rearrangements at a slower rate than the Arabidopsis genome, which is consistent with our observations of reduced chromosomal rearrangements and slower nucleotide substitution rates in Populus.
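The two criteria stated above for a tandem array, an alignment E value at or below 10⁻²⁵ and a 100-kb window, can be turned into a small grouping routine. The following is a minimal Python sketch rather than the procedure actually used: it assumes that gene start coordinates and pairwise alignment E values are already available, and it assumes single-linkage grouping (any two qualifying genes on the same linkage group are merged into one array), which the text does not specify. The function name and input layout are hypothetical.

```python
from collections import defaultdict


def tandem_arrays(genes, hits, window=100_000, max_evalue=1e-25):
    """Group genes into tandem arrays.

    genes: dict mapping gene_id -> (linkage_group, start_bp)
    hits:  iterable of (gene_a, gene_b, evalue) pairwise alignments
    Returns a sorted list of arrays, each a list of gene ids ordered
    by start coordinate; only arrays of two or more genes are kept.
    """
    parent = {g: g for g in genes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, evalue in hits:
        if a == b or a not in genes or b not in genes:
            continue
        (group_a, start_a), (group_b, start_b) = genes[a], genes[b]
        similar = evalue <= max_evalue
        nearby = group_a == group_b and abs(start_a - start_b) <= window
        if similar and nearby:
            parent[find(a)] = find(b)  # single-linkage merge (assumption)

    arrays = defaultdict(list)
    for g in genes:
        arrays[find(g)].append(g)
    return sorted(
        sorted(members, key=lambda g: genes[g][1])
        for members in arrays.values()
        if len(members) >= 2
    )
```

Under single linkage, an array can span more than 100 kb in total as long as each gene is within the window of at least one similar neighbor; a stricter definition would cap the span of the whole array instead.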
In some cases, genes were highly duplicated in both species, and some tandem duplications predated the Populus-Arabidopsis split (9). The largest single array in Populus comprised 24 tandem repeats and contained genes with high homology to S locus–specific glycoproteins. Genes of this class also occur as tandem repeats in Arabidopsis, with the largest segments containing 14 tandem duplicates on chromosome 1. One of the InterPro domains in this protein, IPR008271, a serine/threonine protein kinase active site, was the most frequent domain in tandemly repeated genes in both species (fig. S8). Other common domains in both species were the leucine-rich repeat (IPR007090, primarily from tandem repeats of disease resistance genes), the pentatricopeptide repeat RNA-binding proteins (IPR002885), and the uridine diphosphate (UDP)-glucuronosyl/UDP-glucosyltransferase domain (IPR002213) (table S9). In contrast, some genes were highly expanded in tandem duplicates in one genome and not in the other (fig. S8). For example, one of the most frequent classes of tandemly duplicated genes in Arabidopsis was F-box genes, with a total of 342 involved in tandem duplications, the largest segment of which contained 24 F-box genes. Populus contains only 37 F-box genes in tandem duplications, with the largest segment containing only 3 genes.

Postduplication Gene Fate

Functional expression divergence. In Populus, 20 of the 66 salicoid-event duplicate gene pairs contained in 19 Populus EST libraries (2.3% of the total) showed differential expression (9) [displaying significant deviation in EST frequencies per library (Fig. 4)]. Out of 18 eurosid-event duplicate gene pairs (2.7% of the total), 11 also displayed significant deviation in EST frequencies per library.
Many of the duplicate gene pairs that displayed significant overrepresentation in one or more of the 19 sampled libraries were involved in protein-protein interactions (such as annexin) or protein folding (such as cyclophilins). In the eurosid set, there was a greater divergence in the best BLAST hit among pairwise sets of genes. These results support the premise of functional expression divergence among some duplicated gene pairs in Populus. To further test for variation in gene expression among duplicated genes, we examined whole-genome oligonucleotide microarray data containing the 45,555 promoted genes (9). There was significantly lower differential expression in the salicoid duplicated pairs of genes (mean = 5%) relative to eurosid duplications (mean = 11%), again suggesting that differential expression patterns for retained paralogous gene pairs is an ongoing process that has had more time to occur in eurosid pairs (Fig. 5). This difference could also be due to absolute expression level, which may vary systematically between the two duplication events. Moreover, differential expression was more evident in the wood-forming organs. Almost 14 and 13% (2632 pairs of genes) of eurosid duplicated genes in the nodes and internodes, respectively, displayed differential expression, compared with 8% or less in roots and young leaves (Fig. 5).

Single-nucleotide polymorphisms. Populus is a highly polymorphic taxon and substantial numbers of SNPs are present even within a single individual (Table 1). The ratio of nonsynonymous to synonymous substitution rate (ω = dN/dS) was calculated as an index of selective constraints for alleles of individual genes (9). The overall average dN across all genes was 0.0014, whereas the dS value was 0.0035, for a total ω of 0.40, suggesting that the majority of coding regions in the Populus genome are subject to purifying selection.
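The ω value quoted above follows directly from the two reported averages. As a minimal Python sketch (the function names are hypothetical, and the conventional reading of ω below, near, or above 1 as purifying, neutral, or positive selection is general background rather than a result of this study):

```python
def omega(dn, ds):
    """Ratio of nonsynonymous to synonymous substitution rates (dN/dS)."""
    if ds <= 0:
        raise ValueError("dS must be positive to form the ratio")
    return dn / ds


def interpret(w, tolerance=0.05):
    """Conventional qualitative reading of omega; the tolerance band
    around 1 is an arbitrary choice made for this illustration."""
    if w < 1 - tolerance:
        return "purifying selection"
    if w > 1 + tolerance:
        return "positive selection"
    return "approximately neutral"


# Genome-wide averages reported in the text: dN = 0.0014, dS = 0.0035.
genome_wide = omega(0.0014, 0.0035)
```

With the reported averages, `genome_wide` evaluates to approximately 0.40, which `interpret` classifies as purifying selection.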
There was a significant, negative correlation between ω and the 4DTV distance to the most closely related paralog (r = –0.034, P = 0.028), which is consistent with the expectation of higher levels of nonsynonymous polymorphism in recently duplicated genes as a result of functional redundancy (20, 35). Similarly, genes with recent tandem duplicates (4DTV ≤ 0.2) had significantly higher ω than did genes with no recent tandem duplicates (Wilcoxon rank sum Z = 8.65, P ≤ 0.0001) (table S10). The results for tandemly duplicated genes were consistent with expectations for accelerated evolution of duplicated genes (20). However, this expectation was not upheld for paralogous pairs of genes from the whole-genome duplication events. Relative rates of nonsynonymous substitution were actually lower for genes with paralogs from the salicoid and eurosid whole-genome duplication events than for genes with no paralogs (table S11). One possible explanation for this discrepancy is that the apparent single-copy genes have a corresponding overrepresentation of rapidly evolving pseudogenes. However, this does not appear to be the case, as demonstrated by an analysis of gene size, synonymous substitution rate, and minimum genetic distance to the closest paralog as covariates in an analysis of variance with ω as the response variable (table S11). Therefore, genes with no paralogs from the salicoid and eurosid duplication events seem to be under lower selective constraints, and purifying selection is apparently stronger for genes with paralogs retained from the whole-genome duplications. Chapman et al. (36) have recently proposed the concept of functional buffering to account for similar reduction in detected mutations in paralogs from whole-genome duplications in Arabidopsis and Oryza. The vegetative propagation habit of Populus may also favor the conservation of nucleotide sequences among duplicated genes, in that complementation among duplicate pairs of genes would minimize loss of gene function associated with the accumulation of deleterious somatic mutations.

Fig. 4. Kolmogorov-Smirnov (K-S) test for differential expression for 5-methyltetrahydropteroyltriglutamate-homocysteine S-methyltransferase genes [for descriptions of the EST data set, see Sterky et al. (79)]. Results suggest that the duplicated genes in Populus are differentially expressed in alternate tissues. Tissue types include: cambial zone (1), young leaves (2), flower buds (3), tension wood (4), senescing leaves (5), apical shoot (6), dormant cambium (7), active cambium (8), cold stressed leaves (9), roots (10), bark (11), shoot meristem (12), male catkins (13), dormant buds (14), female catkins (15), petioles (16), wood cell death (17), imbibed seeds (18), and infected leaves (19).

Gene family evolution. The expansion of several gene families has contributed to the evolution of Populus biology.

Lignocellulosic wall formation. Among the processes unique to tree biology, one of the most obvious is the yearly development of secondary xylem from the vascular cambium. We identified Populus orthologs of the approximately 20 Arabidopsis genes and gene families involved in or associated with cellulose biosynthesis. The Populus genome has 93 cellulose synthesis–related genes compared with 78 in Arabidopsis. The Arabidopsis genome encodes 10 CesA genes belonging to six classes known to participate in cellulose microfibril biosynthesis (37). Populus has 18 CesA genes (38), including duplicate copies of CesA7 and CesA8 homologs. Populus homologs of Arabidopsis CesA4, CesA7, and CesA8 are coexpressed during xylem development and tension wood formation (39). Furthermore, one pair of CesA genes appears unique to Populus, with no homologs found in Arabidopsis (40).
Many other types of genes associated with cellulose biosynthesis, such as KOR, SuSY, COBRA, and FRA2, occur in duplicate pairs in Populus relative to single-copy Arabidopsis genes (39). For example, COBRA, a regulator of cellulose biogenesis (41), is a single-copy gene in Arabidopsis, but in Populus there are four copies. The repertoire of acknowledged hemicellulose biosynthetic genes in Populus is generally similar to that in Arabidopsis. However, Populus has more genes encoding α-L-fucosidases and fewer genes encoding α-L-fucosyltransferases than does Arabidopsis, which is consistent with the lower xyloglucan fucose content (42) in Populus relative to Arabidopsis.

Lignin, the second most abundant cell wall polymer after cellulose, is a complex polymer of monolignols (hydroxycinnamyl alcohols) that encrusts and interacts with the cellulose/hemicellulose matrix of the secondary cell wall (43). The full set of 34 Populus phenylpropanoid and lignin biosynthetic genes (table S13) was identified by sequence alignment to the known Arabidopsis phenylpropanoid and lignin genes (44, 45). The size of the Populus gene families that encode these enzymes is generally larger than in Arabidopsis (34 versus 18, respectively). The only exception is cinnamyl alcohol dehydrogenase (CAD), which is encoded by a single gene in Populus and two genes in Arabidopsis (Fig. 6C); CAD is also encoded by only a single gene in Pinus taeda (46, 47). Two lignin-related Populus C4H genes are strongly coexpressed in tissues related to wood formation, whereas the three Populus C3H genes show reciprocally exclusive expression patterns (48) (Fig. 6, A and B).

Fig. 5. Proportion of eurosid and salicoid duplicated gene sets differentially expressed in stems (nodes and internodes), leaves (young and mature), and whole roots. Samples from four biological replicates collected from the reference genotype Nisqually 1 were individually hybridized to whole-genome oligonucleotide microarrays containing three 60-oligomer oligonucleotide probes for each gene. Differential expression between duplicated genes was evaluated in t tests and declared significant at a 5% false discovery rate (9).

Secondary metabolism. Populus trees produce a broad array of nonstructural, carbon-rich secondary metabolites that exhibit wide variation in abundance, stress inducibility, and effects on tree growth and host-pest interactions (49–53). Shikimate-phenylpropanoid–derived phenolic esters, phenolic glycosides, and condensed tannins and their flavonoid precursors comprise the largest classes of these metabolites. Phenolic glycosides and condensed tannins alone can constitute up to 35% leaf dry weight and are abundant in buds, bark, and roots of Populus (50, 54, 55). The flavonoid biosynthetic genes are well annotated in Arabidopsis (56) and almost all (with the exception of flavonol synthase) are encoded by single-copy genes. In contrast, all but three such enzymes (chalcone isomerase, flavonoid 3′-hydroxylase, and flavanone 3-hydroxylase) are encoded by multiple genes in Populus (53). For example, the chalcone synthase, controlling the committed step to flavonoid biosynthesis, has expanded to at least six genes in Populus. In addition, Populus contains two genes each for flavone synthase II (cytochrome accession number CYP98B) and flavonoid 3′,5′-hydroxylase (CYP75A12 and CYP75A13), both of which are absent in Arabidopsis. Furthermore, three Populus genes encode leucoanthocyanidin reductase, required for the synthesis of condensed tannin precursor 2,3-trans-flavan-3-ols, a stereochemical configuration also lacking in Arabidopsis (57).
In contrast to the 32 terpenoid synthase (TPS) genes of secondary metabolism identified in the Arabidopsis genome (58), the Populus genome contains at least 47 TPS genes, suggesting a wide-ranging capacity for the formation of terpenoid secondary metabolites. A number of phenylpropanoid-like enzymes have been annotated in the Arabidopsis genome (44, 45, 59–61). One example is the family encoding CAD. In addition to the single Populus CAD gene involved in lignin biosynthesis, several other clades of CAD-like (CADL) genes are present, most of which fall within larger subfamilies containing enzymes related to multifunctional alcohol dehydrogenases (Fig. 6). This comparative analysis makes it clear that there has been selective expansion and retention of Populus CADL gene families. For example, Populus contains seven CADL genes (PoptrCADL1 to PoptrCADL7; Fig. 6C) encoding enzymes related to the Arabidopsis BAD1 and BAD2 enzymes with apparent benzyl alcohol dehydrogenase activities (62). BAD1 and BAD2 are known to be pathogen inducible, suggesting that this group of Populus genes, including the Populus SAD gene, previously characterized as encoding a sinapaldehyde-specific CAD enzyme (63), may be involved in chemical defense. Disease resistance. The likelihood that a perennial plant will encounter a pathogen or herbivore before reproduction is near unity. The long-generation intervals for trees make it difficult for such plants to match the evolutionary rates of a microbial or insect pest. Aside from the formation of thickened cell walls and the synthesis of secondary metabolites that constitute a first line of defense against microbial and insect pests, plants use a variety of disease-resistance (R) genes. 
The largest class of characterized R genes encodes intracellular proteins that contain a 15 SEPTEMBER 2006 Downloaded from www.sciencemag.org on February 28, 2010 RESEARCH ARTICLES 1601 RESEARCH ARTICLES 1602 families, coding for adenosine 5¶-triphosphate– binding cassette proteins (ABC transporters, 226 gene models), major facilitator superfamily proteins (187 genes), drug/metabolite transporters (108 genes), amino acid/auxin permeases (95 genes), and POT transporters (90 genes), accounted for more than 40% of the total number of transporter gene models (fig. S14). Some large families such as those encoding POT (4.3X relative to Arabidopsis), glutamate-gated ion channels (3.7X), potassium uptake permeases (2.3X), and ABC transporters (1.9X) are expanded in Populus. We identified a subfamily of five putative aquaporins, lacking in the Arabidopsis. Populus also harbors seven transmembrane re- ceptor genes that have previously only been found in fungi, and two genes, identified as mycorrhizal-specific phosphate transporters, that confirm that the mycorrhizal symbiosis may have an impact on the mineral nutrition of this longlived species. This expanded inventory of transporters could conceivably play a role in adaptation to nutrient-limited forest soils, long-distance transport and storage of water and metabolites, secretion and movement of secondary metabolites, and/or mediation of resistance to pathogenproduced secondary metabolites or other toxic compounds. Phytohormones. Both physiological and molecular studies have indicated the importance of Fig. 6. Phylogenetic analysis of gene families in Populus, Arabidopsis, and Oryza encoding selected lignin biosynthetic and related enzymes. (A) Cinnamate-4-hydroxylase (C4H) gene family. (B) 4-coumaroyl-shikimate/quinate-3-hydroxlase (C3H) gene family. (C) Cinnamyl alcohol dehydrogenase (CAD) and related multifunctional alcohol dehydrogenase gene family. Arabidopsis gene names are the same as those in Ehlting et al. (80). 
Populus and Oryza gene names were arbitrarily assigned; corresponding gene models are listed in table S13. Genes encoding enzymes for which biochemical data are available are highlighted with a green flash. Yellow circles indicate monospecific clusters of gene family members. Table 2. Numbers of genes that encode domains similar to plant R proteins in Populus, Arabidopsis (81), and Oryza (82). *, BED finger and/or DUF1544 domain; CC, coiled coil; –, not detected. Predicted protein domains Letter code Populus Arabidopsis Oryza TN TNL TNLT TNLN NLT TCNL CN CNL BN NB BNL NL N – 10 64 13 1 1 2 19 119 5 1 24 90 49 – 398 21 83 – – – – 4 51 – – – 6 1 41 207 – – – – – – 7 159 – – – 40 45 284 535 TIR-NBS TIR-NBS-LRR TIR-NBS-LRR-TIR TIR-NBS-LRR-NBS NBS-LRR-TIR TIR-CC-NBS-LRR CC-NBS CC-NBS-LRR BED/DUF1544*-NBS NBS-BED/DUF1544* BED/DUF1544*-NBS-LRR NBS-LRR NBS Others Total NBS genes 15 SEPTEMBER 2006 VOL 313 SCIENCE www.sciencemag.org Downloaded from www.sciencemag.org on February 28, 2010 nucleotide-binding site (NBS) and carboxyterminal leucine-rich-repeats (LRR) (64). The NBS-coding R gene family is one of the largest in Populus, with 399 members, approximately twice as high as in Arabidopsis. The NBS family can be divided into multiple subfamilies with distinct domain organizations, including 64 TIRNBS-LRR genes, 10 truncated TIR-NBS that lack an LRR, 233 non–TIR-NBS-LRR genes, and 17 unusual TIR-NBS–containing genes that have not been identified previously in Arabidopsis (TNLT, TNLN, or TCNL domains) (Table 2). Five gene models coding for TNL proteins contained a predicted N-terminal nuclear localization signal (65). The number of non–TIR-NBS-LRR genes in Populus is also much higher than that in Arabidopsis (209 versus 57, respectively). Notably, 40 non–TIR-NBS genes, not found in Arabidopsis, carry an Nterminal BED DNA-binding zinc finger domain that was also found in the Oryza Xa1 gene. These findings suggest that domain cooption occurred in Populus. 
Most NBS-LRR (about 65%) in Populus occur as singletons or in tandem duplications, and the distribution of pairwise genetic distances among these genes suggests a recent expansion of this family. That is, only 10% of the NBS-LRR genes are associated with the eurosid and salicoid duplication events, compared with 55% of the extracellular LRR receptor-like kinase genes (for example, fig. S10). Several conserved signaling components such as RAR1, EDS1, PAD4, and NPR1, known to be recruited by R genes, also contain multiple homologs in Populus. For example, two copies of the PAD4 gene, which functions upstream of salicylic acid accumulation, and five copies of the NPR1 gene, an important regulator of responses downstream of salicylic acid, are found in Populus. Nearly all genes known to control disease resistance signaling in Arabidopsis have putative orthologs in Populus. Populus has a larger number of b-1,3-glucanase and chitinase genes than does Arabidopsis (131 versus 73, respectively). In summary, the structural and genetic diversity that exists among R genes and their signaling components in Populus is remarkable and suggests that unlike the rest of the genome, contemporary diversifying selection has played an important role in the evolution of disease resistance genes in Populus. Such diversification suggests that enhanced ability to detect and respond to biotic challenges through R gene–mediated signaling may be critical over a decades-long life span of this genus. Membrane transporters. Attributes of Populus biology such as massive interannual, seasonal, and diurnal metabolic shifts and redeployment of carbon and nitrogen may require an elaborate array of transporters. Investigation of gene families coding for transporter proteins (http:// plantst.genomics.purdue.edu/) in the Populus genome revealed a general expansion relative to Arabidopsis (1722 versus 959, in Populus versus Arabidopsis, respectively) (table S12). 
Five gene RESEARCH ARTICLES gene families have been implicated in the negative regulation of cytokinin signaling (67, 77), which is consistent with the idea of increased complexity in regulation of cytokinin signal transduction in Populus. Populus and Arabidopsis genomes contain almost identical numbers of genes for the three enzymes of ethylene biosynthesis, whereas the number of genes for proteins involved in ethylene perception and signaling is higher in Populus. For example, Populus has seven predicted genes for ethylene receptor proteins and Arabidopsis has five; the constitutive triple response kinase that acts just downstream of the receptor is encoded by four genes in Populus and only one in Arabidopsis (78). The number of ethylene-responsive element binding factor (ERF) proteins (a subfamily of the AP2/ERF family) is higher in Populus than in Arabidopsis (172 versus 122, respectively). The increased variation in the number of ERF transcription factors may be involved in the ethylene-dependent processes specific to trees, such as tension wood formation (68) and the establishment of dormancy (71). Conclusions Our initial analyses provide a flavor of the opportunities for comparative plant genomics made possible by the generation of the Populus genome sequence. A complex history of whole-genome duplications, chromosomal rearrangements and tandem duplications has shaped the genome that we observe today. The differences in gene content between Populus and Arabidopsis have provided some tantalizing insights into the possible molecular bases of their strongly contrasting life histories, although factors unrelated to gene content (such as regulatory elements, miRNAs, posttranslational modification, or epigenetic modifications) may ultimately be of equal or greater importance. 
With the sequence of Populus, researchers can now go beyond what could be learned from Arabidopsis alone and explore hypotheses to linking genome sequence features to wood development, nutrient and water movement, crown development, and disease resistance in perennial plants. The availability of the Populus genome sequence will enable continuing comparative genomics studies among species that will shed new light on genome reorganization and gene family evolution. Furthermore, the genetics and population biology of Populus make it an immense source of allelic variation. Because Populus is an obligate outcrossing species, recessive alleles tend to be maintained in a heterozygous state. Informatics tools enabled by the sequence, assembly, and annotation of the Populus genome will facilitate the characterization of allelic variation in wild Populus populations adapted to a wide range of environmental conditions and gradients over large portions of the Northern Hemisphere. Such variants represent a rich reservoir of molecular resources useful in biotechnological applications, www.sciencemag.org SCIENCE VOL 313 development of alternative energy sources, and mitigation of anthropogenic environmental problems. Finally, the keystone role of Populus in many ecosystems provides the first opportunity for the application of genomics approaches to questions with ecosystem-scale implications. References and Notes 1. Food and Agricultural Organization of the United Nations, State of the World’s Forests 2003 (FAO, Rome, 2003). 2. R. F. Stettler, H. D. Bradshaw Jr., in Biology of Populus and Its Implications for Management and Conservation, R. F. Stettler, H. D. Bradshaw Jr., P. E. Heilman, T. M. Hinckley, Eds. (NRC Research Press, Ottawa, 1996), pp. 1–7. 3. G. A. Tuskan, S. P. DiFazio, T. Teichmann, Plant Biol. 6, 2 (2004). 4. T. M. Yin, S. P. DiFazio, L. E. Gunter, D. Riemenschneider, G. A. Tuskan, Theor. Appl. Genet. 109, 451 (2004). 5. R. Meilan, C. 
Ma, in Agrobacterium Protocols, vol. 344 of Methods in Molecular Biology, K. Wang, Ed. (Humana Press, Totowa, NJ, 2006), pp. 143–151. 6. G. A. Tuskan, Biomass Bioenerg. 14, 307 (1998). 7. G. A. Tuskan, M. Walsh, For. Chron. 77, 259 (2001). 8. S. Wullschleger et al., Can. J. For. Res. 35, 1779 (2005). 9. Materials and methods are available as supporting material on Science Online. 10. H. D. Bradshaw, R. F. Stettler, Theor. Appl. Genet. 86, 301 (1993). 11. M. Koornneef, P. Fransz, H. de Jong, Chromosome Res. 11, 183 (2003). 12. O. Santamaria, J. J. Diez, For. Pathol. 35, 95 (2005). 13. G. A. Tuskan et al., Can. J. For. Res. 34, 85 (2004). 14. A. A. Salamov, V. V. Solovyev, Genome Res. 10, 516 (2000). 15. E. Birney, R. Durbin, Genome Res. 10, 547 (2000). 16. T. Schiex, A. Moisan, P. Rouzé, in Computational Biology: Selected Papers from JOBIM’2000, number 2066 in LNCS (Springer-Verlag, Heidelberg, Germany, 2001), pp. 118–133. 17. Y. Xu, E. C. Uberbacher, J. Comput. Biol. 4, 325 (1997). 18. S. J. Hanley, M. D. Mallott, A. Karp, Tree Genet. Genomes, in press. 19. M. A. Koch, B. Haubold, T. Mitchell-Olds, Mol. Biol. Evol. 17, 1483 (2000). 20. M. Lynch, J. S. Conery, Science 290, 1151 (2000). 21. L. Sterck et al., New Phytol. 167, 165 (2005). 22. L. A. Dode, Bull. Soc. Hist. Nat. Autun 18, 161 (1905). 23. R. Regnier, in Revue des Sociétés Savantes de Normandie (Rouen, France, 1956), vol. 1, pp. 1–36. 24. M. E. Collinson, Proc. R. Soc. Edinburgh B Bio. Sci. 98, 155 (1992). 25. J. E. Eckenwalder, in Biology of Populus and Its Implications for Management and Conservation, R. F. Stettler, H. D. Bradshaw Jr., P. E. Heilman, T. M. Hinckley, Eds. (NRC Research Press, Ottawa, 1996), chap. 1. 26. J. B. Mitton, M. C. Grant, Bioscience 46, 25 (1996). 27. K. Hokamp, A. McLysaght, K. H. Wolfe, J. Struct. Funct. Genomics 3, 95 (2003). 28. J. E. Bowers, B. A. Chapman, J. K. Rong, A. H. Paterson, Nature 422, 433 (2003). 29. L. M. Zahn, J. Leebens-Mack, C. W. dePamphilis, H. Ma, G. 
Theissen, J. Hered. 96, 225 (2005). 30. S. De Bodt, S. Maere, Y. Van de Peer, Trends Ecol. Evol. 20, 591 (2005). 31. K. L. Adams, J. F. Wendel, Trends Genet. 21, 539 (2005). 32. G. Blanc, K. Hokamp, K. H. Wolfe, Genome Res. 13, 137 (2003). 33. B. A. Schulman et al., Nature 408, 381 (2000). 34. S. Griffiths-Jones et al., Nucleic Acids Res. 33, D121 (2005). 35. S. Lockton, B. S. Gaut, Trends Genet. 21, 60 (2005). 36. B. A. Chapman, J. E. Bowers, F. A. Feltus, A. H. Paterson, Proc. Natl. Acad. Sci. U.S.A. 103, 2730 (2006). 37. T. A. Richmond, C. R. Somerville, Plant Physiol. 124, 495 (2000). 38. S. Djerbi, M. Lindskog, L. Arvestad, F. Sterky, T. T. Teeri, Planta 221, 739 (2005).

hormonal regulation underlying plant development. Auxin, gibberellin, cytokinin, and ethylene responses are of particular interest in tree biology. Many auxin responses (66–71) are controlled by auxin response factor (ARF) transcription factors, which work together with cognate AUX/IAA repressor proteins to regulate auxin-responsive target genes (72, 73). A phylogenetic analysis using the known and predicted ARF protein sequences showed that Populus and Arabidopsis ARF gene families have expanded independently since they diverged from their common ancestor. Six duplicate ARF genes in Populus encode paralogs of ARF genes that are single-copy Arabidopsis genes, including ARF5 (MONOPTEROS), an important gene required for auxin-mediated signal transduction and xylem development. Furthermore, five Arabidopsis ARF genes have four or more predicted Populus ARF gene paralogs. In contrast to ARF genes, Populus does not contain a notably expanded repertoire of AUX/IAA genes relative to Arabidopsis (35 versus 29, respectively) (74). Interestingly, there is a group of four Arabidopsis AUX/IAA genes with no apparent Populus orthologs, suggesting Arabidopsis-specific functions. 
Gibberellins (GAs) are thought to regulate multiple processes during wood and root development, including xylem fiber length (75). Among all gibberellin biosynthesis and signaling genes, the Populus GA20-oxidase gene family is the only family with approximately two times the number of genes relative to Arabidopsis, indicating that most of the duplicated genes that arose from the salicoid duplication event have been lost. GA20-oxidase appears to control flux in the biosynthetic pathway leading to the bioactive gibberellins GA1 and GA4. The higher complement of GA20-oxidase genes may have biological importance in Populus with respect to secondary xylem and fiber cell development. Cytokinins are thought to control the identity and proliferation of cell types relevant for wood formation as well as general cell division (67). The total number of members in gene families encoding cytokinin homeostasis related isopentenyl transferases (IPT) and cytokinin oxidases is roughly similar between Populus and Arabidopsis, although there appears to be lineage-specific expansion of IPT subfamilies. The cytokinin signal transduction pathway represents a two-component phosphorelay system, in which a two-component hybrid receptor initiates a phosphotransfer by means of histidine-containing phosphotransmitters (HPt) to phosphoaccepting response regulators (RR). One family of genes, encoding the two-component receptors (such as CKI1), is notably expanded in Populus (four versus one in Populus and Arabidopsis, respectively) (76). Gene families coding for recently identified pseudo-HPt and atypical RR are overrepresented in Populus relative to Arabidopsis (2.5- and 4.0-fold increase in Populus, respectively). Both of these

39. C. P. Joshi et al., New Phytol. 164, 53 (2004). 40. A. Samuga, C. P. Joshi, Gene 334, 73 (2004). 41. F. Roudier et al., Plant Cell 17, 1749 (2005). 42. R. M. Perrin et al., Science 284, 1976 (1999). 43. R. W. Whetten, J. J. Mackay, R. R. Sederoff, Annu. Rev. Plant Physiol. Plant Mol. Biol. 49, 585 (1998). 44. J. Ehlting et al., Plant J. 42, 618 (2005). 45. J. Raes, A. Rohde, J. H. Christensen, Y. Van de Peer, W. Boerjan, Plant Physiol. 133, 1051 (2003). 46. D. M. O’Malley, S. Porter, R. R. Sederoff, Plant Physiol. 98, 1364 (1992). 47. J. J. Mackay, W. W. Liu, R. Whetten, R. R. Sederoff, D. M. Omalley, Mol. Gen. Genet. 247, 537 (1995). 48. J. Schrader et al., Plant Cell 16, 2278 (2004). 49. S. Whitham, S. McCormick, B. Baker, Proc. Natl. Acad. Sci. U.S.A. 93, 8776 (1996). 50. G. M. Gebre, T. J. Tschaplinski, G. A. Tuskan, D. E. Todd, Tree Physiol. 18, 645 (1998). 51. G. Arimura, D. P. W. Huber, J. Bohlmann, Plant J. 37, 603 (2004). 52. D. J. Peters, C. P. Constabel, Plant J. 32, 701 (2002). 53. C.-J. Tsai, S. A. Harding, T. J. Tschaplinski, R. L. Lindroth, Y. Yuan, New Phytol. 172, 47 (2006). 54. M. M. De Sá, R. Subramaniam, F. E. Williams, C. J. Douglas, Plant Physiol. 98, 728 (1992). 55. R. L. Lindroth, S. Y. Hwang, Biochem. Syst. Ecol. 24, 357 (1996). 56. B. Winkel-Shirley, Curr. Opin. Plant Biol. 5, 218 (2002). 57. G. J. Tanner et al., J. Biol. Chem. 278, 31647 (2003). 58. S. Aubourg, A. Lecharny, J. Bohlmann, Mol. Genet. Genomics 267, 730 (2002). 59. M. A. Costa et al., Phytochemistry 64, 1097 (2003). 60. D. Cukovic, J. Ehlting, J. A. VanZiffle, C. J. Douglas, Biol. Chem. 382, 645 (2001). 61. J. M. Shockey, M. S. Fulda, J. Browse, Plant Physiol. 132, 1065 (2003). 62. I. E. Somssich, P. Wernert, S. Kiedrowski, K. Hahlbrock, Proc. Natl. Acad. Sci. U.S.A. 93, 14199 (1996). 63. L. Li et al., Plant Cell 13, 1567 (2001). 64. B. C. Meyers, S. Kaushik, R. S. Nandety, Curr. Opin. Plant Biol. 8, 129 (2005). 65. L. Deslandes et al., Proc. Natl. Acad. Sci. U.S.A. 99, 2404 (2002). 66. E. J. Mellerowicz, M. Baucher, B. Sundberg, W. Boerjan, Plant Mol. Biol. 47, 239 (2001). 67. A. P. Mähönen et al., Science 311, 94 (2006). 68. S. Andersson-Gunneras et al., Plant J. 34, 339 (2003). 69. J. M. 
Hellgren, K. Olofsson, B. Sundberg, Plant Physiol. 135, 212 (2004). 70. M. G. Cline, K. Dong-Il, Ann. Bot. (London) 90, 417 (2002). 71. R. Ruonala, P. Rinne, M. Baghour, H. Tuominen, J. Kangasjärvi, Plant J., in press. 72. R. Moyle et al., Plant J. 31, 675 (2002). 73. D. Weijers et al., EMBO J. 24, 1874 (2005). 74. G. Hagen, T. Guilfoyle, Plant Mol. Biol. 49, 373 (2002). 75. M. E. Eriksson, M. Israelsson, O. Olsson, T. Moritz, Nat. Biotechnol. 18, 784 (2000). 76. T. Kakimoto, Science 274, 982 (1996). 77. T. Kiba, K. Aoki, H. Sakakibara, T. Mizuno, Plant Cell Physiol. 45, 1063 (2004). 78. T. Nakano, K. Suzuki, T. Fujimura, H. Shinshi, Plant Physiol. 140, 411 (2006). 79. F. Sterky et al., Proc. Natl. Acad. Sci. U.S.A. 101, 13951 (2004). 80. J. Ehlting et al., Plant J. 42, 618 (2005).

Opposing Activities Protect Against Age-Onset Proteotoxicity
Ehud Cohen,1* Jan Bieschke,2* Rhonda M. Perciavalle,1 Jeffery W. Kelly,2 Andrew Dillin1†

Aberrant protein aggregation is a common feature of late-onset neurodegenerative diseases, including Alzheimer’s disease, which is associated with the misassembly of the Aβ1-42 peptide. Aggregation-mediated Aβ1-42 toxicity was reduced in Caenorhabditis elegans when aging was slowed by decreased insulin/insulin growth factor–1–like signaling (IIS). The downstream transcription factors, heat shock factor 1 and DAF-16, regulate opposing disaggregation and aggregation activities to promote cellular survival in response to constitutive toxic protein aggregation. Because the IIS pathway is central to the regulation of longevity and youthfulness in worms, flies, and mammals, these results suggest a mechanistic link between the aging process and aggregation-mediated proteotoxicity.

Late-onset human neurodegenerative diseases including Alzheimer’s (AD), Huntington’s, and Parkinson’s diseases are genetically and pathologically linked to aberrant protein aggregation (1, 2). 
In AD, formation of aggregation-prone peptides, particularly Aβ1-42, by endoproteolysis of the amyloid precursor protein (APP) is associated with the disease through an unknown mechanism (3, 4). Whether intracellular accumulation or extracellular deposition of Aβ1-42 initiates the pathological process is a key unanswered question (5). Typically, individuals who carry AD-linked mutations present with clinical symptoms during their fifth or sixth decade, whereas sporadic cases appear after the seventh decade. Why aggregation-mediated toxicity emerges late in life and whether it is mechanistically linked to the aging process remain unclear.

Perhaps the most prominent pathway that regulates life span and youthfulness in worms, flies, and mammals is the insulin/insulin growth factor (IGF)–1–like signaling (IIS) pathway (6). In the nematode Caenorhabditis elegans, the sole insulin/IGF-1 receptor, DAF-2 (7), initiates the transduction of a signal that causes the phosphorylation of the FOXO transcription factor, DAF-16 (8, 9), preventing its translocation to the nucleus (10). This negative regulation of DAF-16 compromises expression of its target genes, decreases stress resistance, and shortens the worm’s life span. Thus, inhibition of daf-2 expression creates long-lived, youthful, stress-resistant worms (11). Similarly, suppression of the mouse DAF-2 ortholog, IGF1-R, creates long-lived mice (12). Recent studies indicate that, in worms, life-span extension due to reduced daf-2 activity is also dependent upon heat shock factor 1 (HSF-1). Moreover, increased expression of hsf-1 extends worm life span in a daf-16–dependent manner (13). That the DAF-16 and HSF-1 transcriptomes result in the expression of numerous chaperones (13, 14) suggests that the integrity of protein folding could play a key role in life-span determination and the amelioration of aggregation-associated proteotoxicity. Indeed, amelioration of Huntington-associated proteotoxicity by slowing the aging process in worms has been reported (13, 15, 16).

15 SEPTEMBER 2006 VOL 313 SCIENCE
81. B. C. Meyers, S. Kaushik, R. S. Nandety, Curr. Opin. Plant Biol. 8, 129 (2005). 82. M. W. Jones-Rhoades, D. P. Bartel, Mol. Cell 14, 787 (2004). 83. We thank the U.S. Department of Energy, Office of Science for supporting the sequencing and assembly portion of this study; Genome Canada and the Province of British Columbia for providing support for the BAC end, BAC genotyping, and full-length cDNA portions of this study; the Umeå University and the Royal Technological Institute (KTH) in Stockholm for supporting the EST assembly and annotation portion of this study; the membership of the International Populus Genome Consortium for supplying genetic and genomics resources used in the assembly and annotation of the genome; the NSF Plant Genome Program for supporting the development of Web-based tools; T. H. D. Bradshaw and R. Stettler for input and reviews on draft copies of the manuscript; J. M. Tuskan for guidance and input during the analysis and writing of the manuscript; and the anonymous reviewers who provided critical input and recommendations on the manuscript. GenBank Accession Number: AARH00000000.
Supporting Online Material
www.sciencemag.org/cgi/content/full/313/5793/1596/DC1
Materials and Methods
Figs. S1 to S15
Tables S1 to S14
References
13 April 2006; accepted 9 August 2006
10.1126/science.1128691

Reduced IIS activity lowers Aβ1-42 toxicity. One hypothesis to explain late-onset aggregation-associated toxicity posits that the deposition of toxic aggregates is a stochastic process, governed by a nucleated polymerization and requiring many years to initiate disease. Alternatively, aging could enable constitutive aggregation to become toxic as a result of declining detoxification activities. 
To distinguish between these two possibilities, we asked what role the aging process plays in Aβ1-42 aggregation-mediated toxicity in a C. elegans model featuring intracellular Aβ1-42 expression (17). If Aβ1-42 toxicity results from a non-age-related nucleated polymerization, animals that express Aβ1-42 and whose life span has been extended would be expected to succumb to Aβ1-42 toxicity at the same rate as those with a natural life span. However, if the aging process plays a role in detoxifying an ongoing protein aggregation process, alteration of the aging program

1Molecular and Cell Biology Laboratory, Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, CA 92037, USA. 2Department of Chemistry and Skaggs Institute of Chemical Biology, Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, CA 92037, USA.
*These authors contributed equally to this work.
†To whom correspondence should be addressed. E-mail: [email protected]

Vol 449 | 27 September 2007 | doi:10.1038/nature06148
LETTERS
The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla
The French–Italian Public Consortium for Grapevine Genome Characterization*

The analysis of the first plant genomes provided unexpected evidence for genome duplication events in species that had previously been considered as true diploids on the basis of their genetics1–3. These polyploidization events may have had important consequences in plant evolution, in particular for species radiation and adaptation and for the modulation of functional capacities4–10. Here we report a high-quality draft of the genome sequence of grapevine (Vitis vinifera) obtained from a highly homozygous genotype. 
The draft sequence of the grapevine genome is the fourth one produced so far for flowering plants, the second for a woody species and the first for a fruit crop (cultivated for both fruit and beverage). Grapevine was selected because of its important place in the cultural heritage of humanity beginning during the Neolithic period11. Several large expansions of gene families with roles in aromatic features are observed. The grapevine genome has not undergone recent genome duplication, thus enabling the discovery of ancestral traits and features of the genetic organization of flowering plants. This analysis reveals the contribution of three ancestral genomes to the grapevine haploid content. This ancestral arrangement is common to many dicotyledonous plants but is absent from the genome of rice, which is a monocotyledon. Furthermore, we explain the chronology of previously described whole-genome duplication events in the evolution of flowering plants. All grapevine varieties are highly heterozygous; preliminary data showed that there was as much as 13% sequence divergence between alleles, which would hinder reliable contig assembly when a wholegenome shotgun strategy was used for sequencing. Our consortium therefore selected the grapevine PN40024 genotype for sequencing. This line, originally derived from Pinot Noir, has been bred close to full homozygosity (estimated at about 93%) by successive selfings, permitting a high-quality whole-genome shotgun assembly. A total of 6.2 million end-reads were produced by our consortium, representing an 8.4-fold coverage of the genome. Within the assembly, performed with Arachne12, 316 supercontigs represent putative allelic haplotypes that constitute 11.6 million bases (Mb). These values are in good fit with the 7% residual heterozygosity of PN40024 assessed by using genetic markers. 
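As a rough consistency check on the read count and fold coverage quoted above, the usual relation coverage = (number of reads × mean read length) / genome size can be rearranged to give the mean read length those figures imply. The sketch below is only back-of-the-envelope arithmetic: the function name is hypothetical, and using the roughly 487 Mb assembly size reported for this genome as the genome length is an assumption, not a value the authors state for this calculation.

```python
def implied_mean_read_length(fold_coverage, genome_size_bp, n_reads):
    """Mean read length (bp) implied by coverage = n_reads * read_length / genome_size."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return fold_coverage * genome_size_bp / n_reads


# 8.4-fold coverage from 6.2 million end-reads; 487 Mb is an assumed genome length.
read_len = implied_mean_read_length(8.4, 487e6, 6.2e6)
print(f"implied mean read length: {read_len:.0f} bp")
```

Under these assumptions the implied mean read length is on the order of 660 bp; a different genome-size estimate shifts the result proportionally.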
When considering only one of the haplotypes in each heterozygous region, the assembly (Table 1a) consists of 19,577 contigs (N50 = 65.9 kilobases (kb), where N50 corresponds to the size of the shorter supercontig or contig in a subset representing half of the assembly size) and 3,514 supercontigs (N50 = 2.07 Mb) totalling 487 Mb. This value is close to the 475 Mb previously reported for the grapevine genome size13. Using a set of 409 molecular markers from the reference grapevine map14, 69% of the assembled 487 Mb, arranged into 45 ultracontigs and 51 single supercontigs, were anchored along the 19 linkage groups.

Table 1 | Global statistics on the genome of Vitis vinifera

(a) Assembly
                Status                                  Number    N50 (kb)   Longest (kb)   Size (Mb)   Percentage of the assembly
Contigs         All                                     19,577    65.9       557            467.5       –
Supercontigs    All                                     3,514     2,065      12,675         487.1       100
Supercontigs    Anchored on chromosomes                 191       3,189      12,675         335.6       68.9
Supercontigs    Anchored on chromosomes and oriented    143       3,827      12,675         296.9       60.9

(b) Annotation
                Number     Median size (bp)   Total length (Mb)   Percentage of the genome   %GC
Gene            30,434     3,399              225.6               46.3                       36.2
Exons CDS       149,351    130                33.6                6.9                        44.5
Introns CDS     118,917    213                178.6               36.7                       34.7
Intergenic      30,453     3,544              261.5               34.7                       33.0
tRNA*           600        73                 0.04                NS                         43.0
miRNA†          164        103.5              0.002               NS                         35.9

(c) Orthology
                              Number of orthologous proteins   Mean identity (%)
P. trichocarpa                12,996                           72.7
A. thaliana                   11,404                           65.5
O. sativa                     9,731                            59.8
Common to eudicotyledons‡     10,547                           –
Common to Magnoliophyta§      8,121                            –

* Transfer RNA (tRNA) values were computed on exons.
† Micro RNAs (miRNAs) are members of known conserved miRNA families.
‡ Eudicotyledons are represented by P. trichocarpa and A. thaliana.
§ Magnoliophyta (most flowering plants) are represented by P. trichocarpa, A. thaliana and O. sativa.

*A list of participants and their affiliations appears at the end of the paper.
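The N50 statistic defined above can be illustrated with a short sketch. It follows the definition given in the text (the smallest contig or supercontig in the set of largest sequences that together cover half of the assembly); the function name and the toy lengths are hypothetical and are not taken from the grapevine assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or supercontig lengths.

    Sequences are taken from longest to shortest until they cover at least
    half of the total assembly size; the length of the last one taken is
    the N50.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example with made-up lengths (kb): total 300, half 150.
# 100 alone covers less than half; 100 + 80 = 180 covers it, so N50 is 80.
print(n50([100, 80, 60, 40, 20]))
```

A higher N50 for the same total size indicates that more of the assembly sits in long sequences, which is why the statistic is reported separately for contigs and supercontigs.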
Thirty-seven ultracontigs and 22 single supercontigs were oriented, representing 61% of the genome assembly (Supplementary Tables 2 and 3). This assembly has been annotated by using a combination of evidence. The major features of the genome annotation are presented in Table 1b. The 8.4-fold draft sequence of the grapevine genome contains a set of 30,434 protein-coding genes (an average of 372 codons and 5 exons per gene). This value is considerably lower than the 45,555 protein-coding genes reported for the poplar (Populus trichocarpa) genome, which has a similar size, at 485 Mb (ref. 1), and even lower than the 37,544 protein-coding genes identified in the 389 Mb of the rice genome2. Three different approaches revealed that 41.4% (average value) of the grapevine genome is composed of repetitive/transposable elements (TEs), a slightly higher proportion than that identified in the rice genome, which has a somewhat smaller size2. The distribution of repeats and TEs along the chromosomes is quite uneven (see below). All classes and superfamilies of TEs are represented in the grapevine genome, with a large prevalence of class I elements over class II and helitrons (rolling-circle transposons) (Supplementary Table 7). An analysis of the distribution of the repetitive elements in the different fractions of the grapevine genome based on the current annotation shows that introns are quite rich in repeats and TEs (data not shown). In addition, 12.4% of the intron sequence contains transposons as determined using our set of manually annotated elements, most of which (75%) correspond to LINE (long interspersed element) retrotransposons, which therefore seem to have contributed specifically to the intron size observed in grapevine (Supplementary Table 8). In eukaryotes with large genomes, the coding and repeated elements are distributed over the chromosomes and may be more or less interlaced, hence defining gene-poor and gene-rich regions. 
It has previously been noticed that the distribution of the genes along the chromosomes of rice and Arabidopsis thaliana is fairly homogeneous2,3. In contrast, we observe large regions that alternate between high and low gene density in V. vinifera (Supplementary Figs 2 and 3). As expected, the density of TEs reflects a pattern substantially complementary to gene density. We observe a similar characteristic in the genome sequence of poplar, therefore indicating a dynamic for the invasion of TEs that is shared with the grapevine (Supplementary Fig. 3). A striking feature of the grapevine proteome lies in the existence of large families related to wine characteristics, which have a higher gene copy number than in the other sequenced plants. Stilbene synthases (STSs) drive the synthesis of resveratrol, the grapevine phytoalexin that has been associated with the health benefits associated with moderate consumption of red wine15,16. The family of genes encoding STSs has a noticeable expansion: 43 genes have been identified. Of these, 20 have previously been shown to be expressed after infection by Plasmopara viticola, thus confirming that they are likely to be functional. The terpene synthases (TPSs) drive the synthesis of terpenoids; these secondary metabolites are major components of resins, essential oils and aromas (their relative abundance is directly correlated with the aromatic features of wines17) and are involved in plant–environment interactions. In comparison with the 30–40 genes of this family in Arabidopsis, rice and poplar, the grapevine TPS family is more than twice as large, with 89 functional genes and 27 pseudogenes. Classification based on known plant homologues reveals that the subclass of putative monoterpene synthases represents only 15% of the Arabidopsis TPS family18 whereas this subclass represents 40% of the grapevine TPS family. 
This result suggests a high diversification of grapevine monoterpene synthases that specifically produce C10 terpenoids present in aroma (such as geraniol, linalool, cineole and α-terpineol). Furthermore, the grapevine genome annotation has also revealed genes encoding homologues to the two forms of geranyl diphosphate synthases (GPPSs), the enzymes that produce the substrate for monoterpene synthases: both the homodimeric GPPS and the heterodimeric form are present; the latter is present only in plants such as Mentha piperita and Clarkia breweri, which produce large quantities of monoterpenes19. Most of the STS and TPS genes occur as 20 clusters, including up to 33 paralogous genes located in a 680-kb stretch. Because global duplication events seem to be a frequent event in plant evolution20, we searched the genome of V. vinifera for paralogous regions by using protein sequence similarity. Paralogous regions are defined as chromosome fragments in which homologous genes are present in clusters. Statistical analysis21 of these clusters reveals that 94.5% have high probability of being paralogous (P < 10⁻⁴; Supplementary Table 11). Most Vitis gene regions have two different paralogous regions, which we have grouped together as triplets (Supplementary Fig. 5; coverage details in Supplementary Table 10). We conclude that the present-day grapevine haploid genome originated from the contribution of three ancestral genomes. It is yet to be demonstrated whether this content came from a true hexaploidization event or through successive genome duplications. The resulting plant had a diploid content that corresponds to the three full diploid contents of the three ancestors; it may therefore be described as a ‘palaeo-hexaploid’ organism. A number of rearrangements have affected the original three complements after the formation of the palaeo-hexaploid state. However, the gene order has been sufficiently conserved to permit the alignment of most regions with their two siblings. 
We explored the time of formation of the palaeo-hexaploid arrangement by comparing grapevine gene regions with those of other completely sequenced plant genomes. If the palaeo-hexaploid complement is present in another species, it should result in a one-for-one pairing of gene regions between the two species considered. In contrast, if another species’s genome evolved before palaeo-hexaploid formation, it should result in a one-to-three relationship between the other species and the grapevine genome. The available genome sequences were those of poplar1, Arabidopsis3 and rice (Oryza sativa2), of which poplar is considered to be most closely related to grapevine. All clusters constructed between the orthologues in the three comparisons have P < 10⁻⁴ (Table 1c). When the gene order in poplar is compared with that in grapevine, there are two clear distributions. First, the grapevine regions align with two poplar segments, as would be expected from a recent whole-genome duplication (WGD) in the poplar lineage1. Second, each of the three grapevine regions that form a homologous triplet recognizes different pairs of poplar segments (Fig. 1a and Supplementary Fig. 6). This shows that the palaeo-hexaploidy observed in grapevine was already present in its common ancestor with poplar. Poplar belongs to the Eurosid I clade. The sister clade to Eurosid I is that of Eurosid II, which contains the model species Arabidopsis. Its gene order was compared with that in the grapevine genome. Two distributions appear: first, most grapevine regions correspond to four Arabidopsis segments (Supplementary Fig. 7); second, each component of a triplicated group in grapevine recognizes four different regions in Arabidopsis (Fig. 1b). This shows that the grapevine palaeo-hexaploidy was present in the common ancestor to Arabidopsis and grapevine, and therefore that it is a trait common to all Eurosids. 
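The reasoning above can be restated as a small decision rule. The sketch below is a toy illustration, not the statistical procedure used in the paper: it assumes that, for one grapevine triplet, the orthologous regions found in another species have already been collected into sets, and it only asks whether the three members of the triplet point to different regions (triplication shared with the other species) or to the same regions (a one-to-three relationship). The function name, the return labels and the region identifiers are all hypothetical.

```python
from itertools import combinations


def triplet_pattern(targets):
    """Classify how one grapevine triplet maps onto another genome.

    `targets` maps each of the three paralogous grapevine regions to the
    set of orthologous regions identified in the other species.
    """
    sets = [set(regions) for regions in targets.values()]
    if len(sets) != 3:
        raise ValueError("expected exactly three grapevine regions")
    if any(not s for s in sets):
        return "ambiguous"  # a member with no orthologous region is uninformative
    pairs = list(combinations(sets, 2))
    if all(a.isdisjoint(b) for a, b in pairs):
        # Each member has its own counterpart(s): triplication is shared.
        return "shared triplication"
    if all(a & b for a, b in pairs):
        # All members point to the same region(s): one-to-three relationship.
        return "one-to-three"
    return "ambiguous"


# Hypothetical region labels, for illustration only.
poplar_like = {"chr6": {"P1", "P2"}, "chr8": {"P3", "P4"}, "chr13": {"P5", "P6"}}
rice_like = {"chr6": {"R1", "R2"}, "chr8": {"R1", "R2"}, "chr13": {"R1"}}
print(triplet_pattern(poplar_like))
print(triplet_pattern(rice_like))
```

Note that later whole-genome duplications in the other lineage do not change the classification: they enlarge each set (two poplar segments or four Arabidopsis segments per grapevine region) without making the sets overlap.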
This is confirmed by the homology level distribution between paralogues of the grapevine, indicating a lower conservation than between Vitis/Arabidopsis orthologues (Supplementary Fig. 4). The Eurosid group contains many economically important flowering plants such as legumes, cotton and Brassicaceae. Our present results establish these species as having a palaeo-hexaploid common ancestor. The grapevine/Arabidopsis comparison also reveals that the Arabidopsis lineage underwent two WGDs after its separation from the Eurosid I clade21–24. This contradicts some models based on more indirect evidence that placed the most ancient of these two duplications at the base of the Eurosid group, or even earlier4,20–22. Some studies had also suggested a possible third duplication event in the distant past of the Arabidopsis lineage, potentially at the base of the angiosperm radiation. The controversy about this third event is now resolved by the Vitis genome comparisons: this event corresponds to the palaeo-hexaploidy formation that remains evident in the grapevine genome but has been difficult to characterize in Arabidopsis and poplar because of the more recent WGDs. In particular, the Arabidopsis genome lineage has undergone many rearrangements and chromosome fusions such that the ancestral gene order is particularly difficult to deduce from this species (Fig. 2). Grapevines, like Arabidopsis and poplar, are dicotyledonous plants that diverged from monocotyledons about 130–240 Myr ago25,26. Because rice is a monocotyledon, we assessed the presence or absence of palaeo-hexaploidy in its genome sequence. The observed pattern is the opposite of that seen for Arabidopsis and poplar: constituents of a grapevine triplet are generally orthologous to the same group of rice regions (Fig. 1c and Supplementary Fig. 11). 
Because rice and grapevine are phylogenetically distant, it is more difficult to detect relations of orthology across the two whole genomes: rearrangements, duplication and gene loss have affected the gene orders differently in the two lineages (Supplementary Fig. 10). Even with this limitation, we observed numerous cases of one-to-three relationships between rice and grapevine (Supplementary Figs 8, 9 and 11); 23% of orthologous blocks include the paralogous regions that originate from the grapevine palaeo-hexaploidy. For Arabidopsis, this number is as low as 1.4% (this difference is significant at 5%: χ² = 8.9; Supplementary Table 12), despite the fact that the Arabidopsis genome has suffered many gene losses since its two WGDs. These gene losses would be expected to obscure the orthologous relations with the grapevine genome, but they are clearly insufficient to explain the high number of one-to-three relationships observed in the rice–grapevine comparison. The most probable explanation for this excess is that the rice ancestor did not exhibit the palaeo-hexaploidy observed in the grapevine, poplar and Arabidopsis.

Figure 1 | Comparison between three paralogous Vitis genomic regions and their orthologues in P. trichocarpa, A. thaliana and O. sativa. Orthologous gene pairs are joined with a different colour for each of the three paralogous grapevine chromosomes 6 (green), 8 (blue) and 13 (red). a, Orthologous regions in the poplar genome are different for each of the three Vitis chromosomes, showing that the triplication predates the poplar/Vitis separation. One Vitis region recognizes two poplar segments because of a WGD in the poplar lineage after the separation. b, Orthologous regions with Arabidopsis are different for each of the three Vitis chromosomes. This shows that the Arabidopsis/Vitis ancestor had the same palaeo-hexaploid content. One Vitis region corresponds to four Arabidopsis segments, indicating the presence of two WGDs in the Arabidopsis lineage after separation from the Vitis lineage. c, Orthologous regions in rice are the same for the three paralogous chromosomes. This indicates that the triplication was not present in the common ancestor of monocotyledons and dicotyledons. The presence in rice of different homologous blocks is due to global duplications in the rice lineage after divergence from dicotyledons.

Figure 2 | Schematic representation of paralogous regions derived from the three ancestral genomes in the karyotypes of V. vinifera, P. trichocarpa and A. thaliana. Each colour corresponds to a syntenic region between the three ancestral genomes that were defined by their occurrence as linked clusters in grapevine, independently of intrachromosomal rearrangements. The V. vinifera genome (a) is by far the closest to the ancestral arrangement, whereas that of Arabidopsis (c) is thoroughly rearranged, and P. trichocarpa (b) presents an intermediate situation. The seven colours probably correspond to linkage groups at the time of the palaeo-hexaploid ancestor.

These findings are summarized in Fig. 3: the triplicated arrangement is apparent after the separation of the monocotyledons and dicotyledons and before the spread of the Eurosid clade. Future genome sequencing projects for other clades of dicotyledons, such as Solanaceae or basal eudicots, will help in situating the triplication event more precisely, and eventually in establishing its precise nature (hexaploidization or genome duplications at distant times). Public access to the grapevine genome sequence will help in the identification of genes underlying the agricultural characteristics of this species, including domestication traits. 
A selective amplification of genes belonging to the metabolic pathways of terpenes and tannins has occurred in the grapevine genome, in contrast with other plant genomes. This suggests that it may become possible to trace the diversity of wine flavours down to the genome level. Grapevine is also a crop that is highly susceptible to a large diversity of pathogens including powdery mildew, oidium and Pierce disease. Other Vitis species such as V. riparia or V. cinerea, which are known to be resistant to several of these pathogens, are interfertile with V. vinifera and can be used for the introduction of resistance traits by advanced backcrosses27 or by gene transfer. Access to the Vitis sequence and the exploitation of synteny will speed up this process of introgression of pathogen resistance traits. As a consequence of this, it is hoped that it will also prompt a strong decrease in pesticide use.

The high quality of the assembly, due mainly to the highly homozygous nature of the PN40024 line, enables the discovery of three ancestral genomes constituting the diploid content of grapevine. The Greek historian Thucydides wrote that Mediterranean people began to emerge from ignorance when they learnt to cultivate olives and grapes. This first characterization of the grapevine genome, with its indication of a palaeo-hexaploid ancestral genome for many dicotyledonous plants, addresses fundamental questions related to the origin and importance of this event in the history of flowering plants. Future work may help in correlating the differential fates of the three gene complements with phenotypic traits of dicotyledonous species.

[Figure 3 labels: Monocotyledons, Dicotyledons, Eurosids I, Eurosids II; O. sativa, P. trichocarpa, V. vinifera, A. thaliana]

METHODS SUMMARY

Gene annotation. Protein-coding genes were predicted by combining ab initio models, V. vinifera complementary DNA alignments, and alignments of proteins and genomic DNA from other species. 
The integration of the data was performed with GAZE28. Details are given in Supplementary Information.

Paralogous and orthologous gene sets. Statistical testing of homologous regions was performed as described in ref. 21.

Full Methods and any associated references are available in the online version of the paper at www.nature.com/nature.

Received 5 April; accepted 7 August 2007. Published online 26 August 2007.

Figure 3 | Positions of the polyploidization events in the evolution of plants with a sequenced genome. Each star indicates a WGD (tetraploidization) event on that branch. The question mark indicates that ancient events are visible in the rice genome that would require other monocotyledon genome sequences to be resolved. The formation of the palaeo-hexaploid ancestral genome occurred after divergence from monocotyledons and before the radiation of the Eurosids.

1. Tuskan, G. A. et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313, 1596–1604 (2006).
2. International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature 436, 793–800 (2005).
3. Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
4. De Bodt, S., Maere, S. & Van de Peer, Y. Genome duplication and the origin of angiosperms. Trends Ecol. Evol. 20, 591–597 (2005).
5. Scannell, D. R., Byrne, K. P., Gordon, J. L., Wong, S. & Wolfe, K. H. Multiple rounds of speciation associated with reciprocal gene loss in polyploid yeasts. Nature 440, 341–345 (2006).
6. Jaillon, O. et al. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature 431, 946–957 (2004).
7. Aury, J. M. et al. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444, 171–178 (2006).
8. Maere, S. et al. Modeling gene and genome duplications in eukaryotes. Proc. Natl Acad. Sci. USA 102, 5454–5459 (2005).
9. Blanc, G. & Wolfe, K. H. Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell 16, 1679–1691 (2004).
10. Seoighe, C. & Gehring, C. Genome duplication led to highly selective expansion of the Arabidopsis thaliana proteome. Trends Genet. 20, 461–464 (2004).
11. McGovern, P. E., Hartung, U., Badler, V., Glusker, D. L. & Exner, L. J. The beginnings of wine making and viniculture in the ancient Near East and Egypt. Expedition 39, 3–21 (1997).
12. Jaffe, D. B. et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13, 91–96 (2003).
13. Lodhi, M. A., Daly, M. J., Ye, G. N., Weeden, N. F. & Reisch, B. I. A molecular marker based linkage map of Vitis. Genome 38, 786–794 (1995).
14. Doligez, A. et al. An integrated SSR map of grapevine based on five mapping populations. Theor. Appl. Genet. 113, 369–382 (2006).
15. Baur, J. A. et al. Resveratrol improves health and survival of mice on a high-calorie diet. Nature 444, 337–342 (2006).
16. Baur, J. A. & Sinclair, D. A. Therapeutic potential of resveratrol: the in vivo evidence. Nature Rev. Drug Discov. 5, 493–506 (2006).
17. Mateo, J. J. & Jimenez, M. Monoterpenes in grape juice and wines. J. Chromatogr. A 881, 557–567 (2000).
18. Aubourg, S., Lecharny, A. & Bohlmann, J. Genomic analysis of the terpenoid synthase (AtTPS) gene family of Arabidopsis thaliana. Mol. Genet. Genomics 267, 730–745 (2002).
19. Tholl, D. et al. Formation of monoterpenes in Antirrhinum majus and Clarkia breweri flowers involves heterodimeric geranyl diphosphate synthases. Plant Cell 16, 977–992 (2004).
20. Adams, K. L. & Wendel, J. F. Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol. 8, 135–141 (2005).
21. Simillion, C., Vandepoele, K., Van Montagu, M. C., Zabeau, M. & Van de Peer, Y. The hidden duplication past of Arabidopsis thaliana. Proc. Natl Acad. Sci. USA 99, 13627–13632 (2002).
22. Bowers, J. E., Chapman, B. A., Rong, J. & Paterson, A. H. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422, 433–438 (2003).
23. Vision, T. J., Brown, D. G. & Tanksley, S. D. The origins of genomic duplications in Arabidopsis. Science 290, 2114–2117 (2000).
24. Blanc, G., Hokamp, K. & Wolfe, K. H. A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res. 13, 137–144 (2003).
25. Wolfe, K. H., Gouy, M., Yang, Y. W., Sharp, P. M. & Li, W. H. Date of the monocot–dicot divergence estimated from chloroplast DNA sequence data. Proc. Natl Acad. Sci. USA 86, 6201–6205 (1989).
26. Crane, P. R., Friis, E. M. & Pedersen, K. R. The origin and early diversification of angiosperms. Nature 374, 27–33 (1995).
27. Eshed, Y. & Zamir, D. An introgression line population of Lycopersicon pennellii in the cultivated tomato enables the identification and fine mapping of yield-associated QTL. Genetics 141, 1147–1162 (1995).
28. Howe, K. L., Chothia, T. & Durbin, R. GAZE: a generic framework for the integration of gene-prediction data by dynamic programming. Genome Res. 12, 1418–1427 (2002).

Supplementary Information is linked to the online version of the paper at www.nature.com/nature.

Acknowledgements The sequencing of the grapevine genome was launched and carried out after a scientific cooperation agreement between the Ministry of Agriculture in France and the Ministry of Agriculture in Italy, involving l’Institut National de la Recherche Agronomique (INRA), Consiglio per la Ricerca e Sperimentazione in Agricoltura (CRA) and Friuli Venezia Giulia Region. 
This work was financially supported by Consortium National de Recherche en Génomique, Agence Nationale de la Recherche, INRA, and by MiPAF (VIGNA-CRA), Friuli Innovazione, Università di Udine, Federazione BCC, Fondazione CRUP, Fondazione Carigo, Fondazione CRT, Vivai Cooperativi Rauscedo, Eurotech, Livio Felluga, Marco Felluga, Venica e Venica, Le Vigne di Zamò (IGA). We thank S. Cure for correcting the manuscript; F. Câmara and R. Guigo for the calibration of the GeneID gene prediction software, and the Centre Informatique National de l’Enseignement Supérieur for computing resources.

Author Information The final assembly and annotation are deposited in the EMBL/Genbank/DDBJ databases under accession numbers CU459218–CU462737 (for all scaffolds) and CU462738–CU462772 (for chromosome reconstitutions and unanchored scaffolds). An annotation browser and further information on the project are available from http://www.genoscope.cns.fr/vitis, http://www.vitisgenome.it/ and http://www.appliedgenomics.org/. Reprints and permissions information is available at www.nature.com/reprints. The authors declare no competing financial interests. Correspondence and requests for materials should be addressed to P.W. ([email protected]).

The French-Italian Public Consortium for Grapevine Genome Characterization Olivier Jaillon1*, Jean-Marc Aury1*, Benjamin Noel1, Alberto Policriti2,3, Christian Clepet4, Alberto Casagrande2,5, Nathalie Choisne1,4, Sébastien Aubourg4, Nicola Vitulo6,15, Claire Jubin1, Alessandro Vezzi6,15, Fabrice Legeai7, Philippe Hugueney8, Corinne Dasilva1, David Horner9,15, Erica Mica9,15, Delphine Jublot4, Julie Poulain1, Clémence Bruyère4, Alain Billault1, Béatrice Segurens1, Michel Gouyvenoux1, Edgardo Ugarte1, Federica Cattonaro2, Véronique Anthouard1, Virginie Vico1, Cristian Del Fabbro2,3, Michaël Alaux7, Gabriele Di Gaspero2,5,Vincent Dumas8, Nicoletta Felice2,5, Sophie Paillard4, Irena Juman2,5, Marco Moroldo4, Simone Scalabrin2,3, Aurélie Canaguier4, Isabelle Le Clainche4, Giorgio Malacrida6,15, Eléonore Durand7, Graziano Pesole10,11,15, Valérie Laucou12, Philippe Chatelet13, Didier Merdinoglu8, Massimo Delledonne14,15, Mario Pezzotti15,16, Alain Lecharny4, Claude Scarpelli1, François Artiguenave1, M. Enrico Pè9,15, Giorgio Valle6,15, Michele Morgante2,5, Michel Caboche4, Anne-Françoise Adam-Blondon4, Jean Weissenbach1, Francis Quétier1 & Patrick Wincker1 *These authors contributed equally to this work. Affiliations for participants: 1Genoscope (CEA) and UMR 8030 CNRS-Genoscope-Université d’Evry, 2 rue Gaston Crémieux, BP5706, 91057 Evry, France. 2Istituto di Genomica Applicata, Parco Scientifico e Tecnologico di Udine, Via Linussio 51, 33100 Udine, Italy. 3Dipartimento di Matematica ed Informatica, Università degli Studi di Udine, via delle Scienze 208, 33100 Udine, Italy. 4URGV, UMR INRA 1165, CNRS-Université d’Evry Genomique Végétale, 2 rue Gaston Crémieux, BP5708, 91057 Evry cedex, France. 5Dipartimento di Scienze Agrarie ed Ambientali, Università degli Studi di Udine, via delle Scienze 208, 33100 Udine, Italy. 6CRIBI, Università degli Studi di Padova, viale G. Colombo 3, 35121 Padova, Italy. 
7URGI, UR1164 Génomique Info, 523, Place des Terrasses, 91034 Evry Cedex, France. 8UMR INRA 1131, Université de Strasbourg, Santé de la Vigne et Qualité du Vin, 28 rue de Herrlisheim, BP20507, 68021 Colmar, France. 9Dipartimento di Scienze Biomolecolari e Biotecnologie, Università degli Studi di Milano, via Celoria 26, 20133 Milano, Italy. 10Dipartimento di Biochimica e Biologia Molecolare, Università degli Studi di Bari, via Orabona 4, 70125 Bari, Italy. 11Istituto Tecnologie Biomediche, Consiglio Nazionale delle Ricerche, via Amendola 122/D, 70125 Bari, Italy. 12UMR INRA 1097, IRD-Montpellier SupAgro-Univ. Montpellier II, Diversité et Adaptation des Plantes Cultivées, 2 Place Pierre Viala, 34060 Montpellier Cedex 1, France. 13UMR INRA 1098, IRD-Montpellier SupAgro-CIRAD, Développement et Amélioration des Plantes, 2 Place Pierre Viala, 34060 Montpellier Cedex 1, France. 14Dipartimento Scientifico e Tecnologico, Università degli Studi di Verona, Strada Le Grazie 15 – Ca’ Vignal, 37134 Verona, Italy. 15Dipartimento di Scienze, Tecnologie e Mercati della Vite e del Vino, Università degli Studi di Verona, via della Pieve, 70 37029 S. Floriano (VR), Italy. 16VIGNA-CRA Initiative; Consorzio Interuniversitario Nazionale per la Biologia Molecolare delle Piante, c/o Università degli Studi di Siena, via Banchi di Sotto 55, 53100 Siena, Italy.

doi:10.1038/nature06148

METHODS

Genome sequencing. The V. vinifera PN40024 genome was sequenced with the use of a whole-genome shotgun strategy. All data were generated by paired-end sequencing of cloned inserts using Sanger technology on ABI3730xl sequencers. Supplementary Table 2 gives the number of reads obtained per library.

Genome assembly and chromosome anchoring. All reads were assembled with Arachne12. We obtained 20,784 contigs that were linked into 3,830 supercontigs of more than 2 kb. The contig N50 was 64 kb, and the supercontig N50 was 1.9 Mb. 
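The N50 figures quoted above can be made concrete with a small sketch. It assumes the usual definition of N50 (the length L such that sequences of length L or longer together hold at least half of the assembled bases), which the text does not spell out. The function name `n50` and the toy lengths are illustrative only and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Assumed definition: sort lengths from longest to shortest and return
    the length at which the running total first reaches half of the sum.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in kb (not real assembly data).
toy_contigs_kb = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs_kb)
print(f"toy N50 = {toy_n50} kb")
```

The same function would apply to contigs and supercontigs alike; only the input list of lengths changes.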
The total supercontig size was 498 Mb, remarkably close to the expected size of 475 Mb. This indicates that the PN40024 has retained few heterozygous regions. Remaining heterozygosity was assessed by aligning all supercontigs with each other. We first selected the supercontigs more than 30 kb in size that were covered over more than 40% of their length by another supercontig with more than 95% identity. After visual inspection of the alignments, we added to this list the supercontigs more than 10 kb in size that aligned at more than 40% of their length with supercontigs identified previously. All potential cases were then inspected visually to discard potential heterozygous regions (aligning relatively homogeneously across their complete length) and retained repeated regions (with more heterogeneous alignments). This treatment identified 11 Mb of potentially allelic supercontigs. We confirmed that in most cases their coverage was about half the average of the homozygous supercontigs. Only one supercontig of each allelic pair was therefore conserved in the final assembly, which consists of 3,514 supercontigs (N50 = 2 Mb) containing 19,577 contigs (N50 = 66 kb), totalling 487 Mb. If the haploid genome size of 475 Mb is considered correct, then our final assembly contains only about 12 Mb of remaining heterozygosity, or 2.6%.

A set of 30,151 bacterial artificial chromosome (BAC) fingerprints of the BAC clones of a Cabernet–Sauvignon library29 were assembled into 1,763 contigs with FPC30, v. 8. In parallel, 1,981 markers were anchored on a subset of BAC clones31, among which 388 markers mapped onto the genetic map, and 77,237 BAC end sequences were obtained31. Blat32 alignments (90% identity on 80% of the length, fewer than five hits) were performed with BAC end sequences on the 3,830 supercontigs of sequences with lengths over 2 kb. 
The results were then filtered with in-house Perl scripts to keep only the occurrences in which two paired ends matched at a distance of less than 300 kb and with a consistent orientation. Two supercontigs were considered linked to each other if two BAC links could be found or one BAC link and a BAC contig link. A total of 111 ultracontigs were constructed with this procedure.

Genome annotation. Several resources were used to build V. vinifera gene models automatically with GAZE28. We used predictions of repetitive regions by repeatscout33, conserved coding regions predicted by the exofish method34,35, genewise36 alignments of proteins from Uniprot37, Geneid38 and Snap39 ab initio gene predictions, and alignments of several cDNA resources (Supplementary Information). A weight was assigned to each resource to further reflect its reliability and accuracy in predicting gene models. This weight acts as a multiplier for the score of each information source, before being processed by GAZE. When applied to the entire assembled sequence, GAZE predicted 30,434 gene models.

Paralogous and orthologous gene sets. We identified orthologous genes in six pairs of genomes from four species: A. thaliana, O. sativa, P. trichocarpa and V. vinifera. Each pair of predicted gene sets was aligned with the Smith–Waterman algorithm, and alignments with a score higher than 300 (BLOSUM62; gapo = 10, gape = 1) were retained. Two genes, A from genome GA and B from genome GB, were considered orthologues if B was the best match for gene A in GB and A was the best match for B in GA. For each orthologous gene set with V. vinifera, clusters of orthologous genes were generated. A single linkage clustering with a euclidean distance was used to group genes. The distances were calculated with the gene index in each chromosome rather than the genomic position. The minimal distance between two orthologous genes was adapted in accordance with the selected genomes. 
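The two steps just described (reciprocal best matches, then single-linkage clustering on gene indices) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the toy scores, the tie-breaking rule (first pair seen wins) and the `max_dist` threshold are all hypothetical, since the text gives only the score cut-off of 300 and says the distance was adapted per genome pair.

```python
import math
from collections import defaultdict


def reciprocal_best_hits(scores, min_score=300):
    """scores maps (gene_in_GA, gene_in_GB) -> alignment score.

    Keeps pairs scoring above min_score where each gene is the other's
    best match. Ties are broken by first occurrence (an assumption).
    """
    best_for_a, best_for_b = {}, {}
    for (a, b), score in scores.items():
        if score <= min_score:
            continue  # the text retains only scores higher than 300
        if a not in best_for_a or score > best_for_a[a][1]:
            best_for_a[a] = (b, score)
        if b not in best_for_b or score > best_for_b[b][1]:
            best_for_b[b] = (a, score)
    return sorted((a, b) for a, (b, _) in best_for_a.items()
                  if best_for_b[b][0] == a)


def single_linkage_clusters(points, max_dist):
    """Group (index_in_GA, index_in_GB) points by single linkage.

    Two points join the same cluster when their Euclidean distance,
    measured in gene-index units, is at most max_dist (hypothetical value).
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= max_dist:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, point in enumerate(points):
        groups[find(i)].append(point)
    return sorted(sorted(group) for group in groups.values())


# Toy data (hypothetical gene names, scores and gene indices).
toy_scores = {("A1", "B1"): 500, ("A1", "B2"): 450,
              ("A2", "B2"): 400, ("A2", "B1"): 350, ("A3", "B3"): 250}
toy_orthologues = reciprocal_best_hits(toy_scores)
toy_clusters = single_linkage_clusters(
    [(1, 1), (2, 3), (3, 2), (10, 10), (11, 12)], max_dist=3)
print(toy_orthologues)
print(toy_clusters)
```

In the toy data, A2 prefers B2 but B2 prefers A1, so that pair is not reciprocal and is dropped. Working in gene-index space rather than base pairs makes the clustering insensitive to differences in gene density between genomes.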
Finally, we retained only clusters that were composed of at least six genes for Arabidopsis and O. sativa, and eight genes for P. trichocarpa (Supplementary Table 10). To validate the clustering quality we used a method described previously21. For each cluster we computed the probability of finding this cluster in the gene homology matrix (Supplementary Table 11). This matrix was constructed from two compared chromosomes with genes numbered according to their position on each chromosome, with no reference to physical distances. Paralogous genes were identified by an all-against-all comparison of V. vinifera proteins using blastp; alignments with an expected value of less than 0.1 were retained and realigned with the Smith–Waterman algorithm40. Two genes A and B were considered paralogues if B was the best match for gene A and A was the best match for B. Moreover, clusters of paralogous genes were constructed in the same fashion as orthologous clusters (Supplementary Table 10).

29. Adam-Blondon, A. F. et al. Construction and characterization of BAC libraries from major grapevine cultivars. Theor. Appl. Genet. 110, 1363–1371 (2005).
30. Soderlund, C., Humphray, S., Dunham, A. & French, L. Contigs built with fingerprints, markers, and FPC V4.7. Genome Res. 10, 1772–1787 (2000).
31. Lamoureux, D. et al. Anchoring of a large set of markers onto a BAC library for the development of a draft physical map of the grapevine genome. Theor. Appl. Genet. 113, 344–356 (2006).
32. Kent, W. J. BLAT—the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).
33. Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21 (Suppl. 1), i351–i358 (2005).
34. Roest Crollius, H. et al. Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nature Genet. 25, 235–238 (2000).
35. Jaillon, O. et al. Genome-wide analyses based on comparative genomics. Cold Spring Harb. Symp. Quant. Biol. 68, 275–282 (2003).
36. Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res. 14, 988–995 (2004).
37. Bairoch, A. et al. The Universal Protein Resource (UniProt). Nucleic Acids Res. 33, D154–D159 (2005).
38. Parra, G., Blanco, E. & Guigo, R. GeneID in Drosophila. Genome Res. 10, 511–515 (2000).
39. Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
40. Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981).

Vol 452 | 24 April 2008 | doi:10.1038/nature06856

LETTERS

The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus)

Ray Ming1,2*, Shaobin Hou3*, Yun Feng4,5*, Qingyi Yu1*, Alexandre Dionne-Laporte3, Jimmy H. Saw3, Pavel Senin3, Wei Wang4,6, Benjamin V. Ly3, Kanako L. T. Lewis3, Steven L. Salzberg7, Lu Feng4,5,6, Meghan R. Jones1, Rachel L. Skelton1, Jan E. Murray1,2, Cuixia Chen2, Wubin Qian4, Junguo Shen5, Peng Du5, Moriah Eustice1,8, Eric Tong1, Haibao Tang9, Eric Lyons10, Robert E. Paull11, Todd P. Michael12, Kerr Wall13, Danny W. Rice14, Henrik Albert15, Ming-Li Wang1, Yun J. Zhu1, Michael Schatz7, Niranjan Nagarajan7, Ricelle A. Acob1,8, Peizhu Guan1,8, Andrea Blas1,8, Ching Man Wai1,11, Christine M. Ackerman1, Yan Ren4, Chao Liu4, Jianmei Wang4, Jianping Wang2, Jong-Kuk Na2, Eugene V. Shakirov16, Brian Haas17, Jyothi Thimmapuram18, David Nelson19, Xiyin Wang9, John E. Bowers9, Andrea R. Gschwend2, Arthur L. Delcher7, Ratnesh Singh1,8, Jon Y. Suzuki15, Savarni Tripathi15, Kabi Neupane20, Hairong Wei21, Beth Irikura11, Maya Paidi1,8, Ning Jiang22, Wenli Zhang23, Gernot Presting8, Aaron Windsor24, Rafael Navajas-Pérez9, Manuel J. Torres9, F. Alex Feltus9, Brad Porter8, Yingjun Li2, A. Max Burroughs7, Ming-Cheng Luo25, Lei Liu18, David A. Christopher8, Stephen M. Mount7,26, Paul H. Moore15, Tak Sugimura27, Jiming Jiang23, Mary A. Schuler28, Vikki Friedman29, Thomas Mitchell-Olds24, Dorothy E. Shippen16, Claude W. dePamphilis13, Jeffrey D. Palmer14, Michael Freeling10, Andrew H. Paterson9, Dennis Gonsalves15, Lei Wang4,5,6 & Maqsudul Alam3,30

Papaya, a fruit crop cultivated in tropical and subtropical regions, is known for its nutritional benefits and medicinal applications. Here we report a 3× draft genome sequence of ‘SunUp’ papaya, the first commercial virus-resistant transgenic fruit tree1 to be sequenced. The papaya genome is three times the size of the Arabidopsis genome, but contains fewer genes, including significantly fewer disease-resistance gene analogues. Comparison of the five sequenced genomes suggests a minimal angiosperm gene set of 13,311. A lack of recent genome duplication, atypical of other angiosperm genomes sequenced so far2–5, may account for the smaller papaya gene number in most functional groups. Nonetheless, striking amplifications in gene number within particular functional groups suggest roles in the evolution of tree-like habit, deposition and remobilization of starch reserves, attraction of seed dispersal agents, and adaptation to tropical daylengths. Transgenesis at three locations is closely associated with chloroplast insertions into the nuclear genome, and with topoisomerase I recognition sites. Papaya offers numerous advantages as a system for fruit-tree functional genomics, and this draft genome sequence provides the foundation for revealing the basis of Carica’s distinguishing morpho-physiological, medicinal and nutritional properties.

Papaya is an exceptionally promising system for the exploration of tropical-tree genomes and fruit-tree genomics. It has a relatively small genome of 372 megabases (Mb)6, diploid inheritance with nine pairs of chromosomes, a well-established transformation system7, a short generation time (9–15 months), continuous flowering throughout the year and a primitive sex-chromosome system8. 
It is a member of the Brassicales, sharing a common ancestor with Arabidopsis about 72 million years ago9. Papaya is ranked first on nutritional scores among 38 common fruits, based on the percentage of the United States Recommended Daily Allowance for vitamin A, vitamin C, potassium, folate, niacin, thiamine, riboflavin, iron and calcium, plus fibre. Consumption of its fruit is recommended for preventing vitamin A deficiency, a cause of childhood blindness in tropical and subtropical developing countries. The fruit, stems, leaves and roots of papaya are used in a wide range of medical applications, including production of papain, a valuable proteolytic enzyme. 1 Hawaii Agriculture Research Center, Aiea, Hawaii 96701, USA. 2Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA. 3Advanced Studies in Genomics, Proteomics and Bioinformatics, University of Hawaii, Honolulu, Hawaii 96822, USA. 4TEDA School of Biological Sciences and Biotechnology, Nankai University, Tianjin Economic-Technological Development Area, Tianjin 300457, China. 5Tianjin Research Center for Functional Genomics and Biochip, Tianjin Economic-Technological Development Area, Tianjin 300457, China. 6Key Laboratory of Molecular Microbiology and Technology of the Ministry of Education, College of Life Sciences, Nankai University, Tianjin 300071, China. 7Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland 20742, USA. 8Department of Molecular Bioscience and Bioengineering, University of Hawaii, Honolulu, Hawaii 96822, USA. 9Plant Genome Mapping Laboratory, University of Georgia, Athens, Georgia 30602, USA. 10Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA. 11Department of Tropical Plant and Soil Sciences, University of Hawaii, Honolulu, Hawaii 96822, USA. 
12Waksman Institute of Microbiology and Department of Plant Biology and Pathology, Rutgers, The State University of New Jersey, Piscataway, New Jersey 08854, USA. 13Department of Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA. 14Department of Biology, Indiana University, Bloomington, Indiana 47405, USA. 15USDA-ARS, Pacific Basin Agricultural Research Center, Hilo, Hawaii 96720, USA. 16Department of Biochemistry and Biophysics, 2128 TAMU, Texas A&M University, College Station, Texas 77843, USA. 17The Institute for Genomic Research, Rockville, Maryland 20850, USA. 18W.M. Keck Center for Comparative and Functional Genomics, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA. 19Department of Molecular Sciences, University of Tennessee, Memphis, Tennessee 38163, USA. 20Leeward Community College, University of Hawaii, Pearl City, Hawaii 96782, USA. 21Wicell Research Institute, Madison, Wisconsin 53707, USA. 22Department of Horticulture, Michigan State University, East Lansing, Michigan 48824, USA. 23Department of Horticulture, University of Wisconsin, Madison, Wisconsin 53706, USA. 24Department of Biology, Duke University, Durham, North Carolina 27708, USA. 25Department of Plant Sciences, University of California, Davis, California 95616, USA. 26Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, Maryland 20742, USA. 27Maui High Performance Computing Center, Kihei, Hawaii 96753, USA. 28Departments of Cell and Developmental Biology, Biochemistry and Plant Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA. 29Applied Biosystems, 850 Lincoln Centre Drive, Foster City, California 94404, USA. 30Department of Microbiology, University of Hawaii, Honolulu, Hawaii 96822, USA. *These authors contributed equally to this work. 
©2008 Nature Publishing Group. LETTERS, NATURE | Vol 452 | 24 April 2008

A total of 2.8 million whole-genome shotgun (WGS) sequencing reads were generated from a female plant of transgenic cultivar SunUp, which was developed through transformation of Sunset that had undergone more than 25 generations of inbreeding10. The estimated residual heterozygosity of SunUp is 0.06% (Supplementary Note 1). After excluding low-quality and organellar reads, 1.6 million high-quality reads were assembled into contigs containing 271 Mb and scaffolds spanning 370 Mb including embedded gaps (Supplementary Tables 1 and 2). Of 16,362 unigenes derived from expressed sequence tags (ESTs), 15,064 (92.1%) matched this assembly. Paired-end reads from 34,065 bacterial artificial chromosome (BAC) clones provided alignment to a fingerprinted contig (FPC)-based physical map (Supplementary Note 2). Among 706 BAC end and WGS sequence-derived simple sequence repeats on the genetic map, 652 (92.4%) could be used to anchor 167 Mb of contigs or 235 Mb of scaffolds to the 12 papaya linkage groups in the current genetic map (Supplementary Fig. 1).

Papaya chromosomes at the pachytene stage of meiosis are generally stained lightly by 4′,6-diamidino-2-phenylindole (DAPI), revealing that the papaya genome is largely euchromatic. However, highly condensed heterochromatin knobs were observed on most chromosomes (Supplementary Fig. 2), concentrated in the centromeric and pericentromeric regions. The lengths of the pachytene bivalents that are heavily stained only account for approximately 17% of the genome. However, these cytologically distinct and highly condensed heterochromatic regions could represent 30–35% of the genomic DNA11. A large portion of the heterochromatic DNA was probably not covered by the WGS sequence. 
The 271 Mb of contig sequence should represent about 75% of the papaya genome and more than 90% of the euchromatic regions, which is similar to the 92.1% of the EST and 92.4% of genetic markers covered by the assembled genome and the theoretical 95% coverage by 3× WGS sequence12.

Gene annotation was carried out using the TIGR Eukaryotic Annotation Pipeline. The assembled genome was masked based on similarity to known repeat elements in RepBase and the TIGR Plant Repeat Database, plus a de novo papaya repeat database (see Methods). Ab initio gene predictions were combined with spliced alignments of proteins and transcripts to produce a reference gene set of 28,629 gene models (Supplementary Table 3). A total of 21,784 (76.1%) of the predicted papaya genes with average length of 1,057 base pairs (bp) have similarity to proteins in the non-redundant database from the National Center for Biotechnology Information, with 9,760 (44.8%) of these supported by papaya unigenes. Among 6,845 genes with average length 309 bp that had no hits to the non-redundant proteins, only 515 (7.5%) were supported by papaya unigenes, implying that the number of predicted papaya-specific genes was inflated. If the 515 genes with unigene support represent 44.8% of the total, then 1,150 predicted papaya-specific genes may be real, and the number of predicted genes in the assembled papaya genome would be 22,934. 
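The gene-number estimate in this paragraph is a short chain of arithmetic, reproduced below as a worked sketch. All input counts come from the text; the variable names are illustrative, and the rounding of the support rate to 44.8% before dividing is an assumption made to match the quoted figure of 1,150. The last step applies the 7.9% coverage correction that the text makes next.

```python
# Counts quoted in the text.
genes_with_protein_hit = 21_784     # similar to non-redundant proteins
hit_genes_with_unigene = 9_760      # of those, supported by papaya unigenes
genes_without_hit = 6_845           # no hit to non-redundant proteins
no_hit_genes_with_unigene = 515     # of those, supported by papaya unigenes

# Sanity check: the two classes add up to the reference gene set.
reference_gene_set = genes_with_protein_hit + genes_without_hit

# Unigene support rate among genes with a protein hit, rounded to the
# 44.8% quoted in the text (rounding here is an assumption).
unigene_support_rate = round(hit_genes_with_unigene / genes_with_protein_hit, 3)

# If real papaya-specific genes have unigene support at the same rate,
# the 515 supported ones imply this many real papaya-specific genes.
estimated_specific_genes = round(no_hit_genes_with_unigene / unigene_support_rate)

# Predicted genes in the assembled genome.
genes_in_assembly = genes_with_protein_hit + estimated_specific_genes

# Correction for the part of the genome missing from the assembly (7.9% more).
genes_in_genome = round(genes_in_assembly * 1.079)

print(reference_gene_set, estimated_specific_genes,
      genes_in_assembly, genes_in_genome)
```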
Considering the assembled genome covers 92.1% of the unigenes and 92.4% of the mapped genetic markers, the number of predicted genes in the papaya genome could be 7.9% higher, or 24,746, about 11–20% less than Arabidopsis (based on either the Table 1 | Statistics of sequenced plant genomes Carica Arabidopsis Populus Oryza sativa Vitis papaya thaliana trichocarpa (japonica) vinifera Size (Mbp) Number of chromosomes G 1 C content total (%) Gene number Average gene length (bp per gene) Average intron length (bp) Transposons (%) 372 125 9 5 35.3 35.0 24,746 31,114* 2,373 2,232 479 165 51.9 14 485 19 33.3 45,555 2,300 379 42 389 487 12 19 43.0 36.2 37,544 30,434 2,821 3,399 412 213 34.8 41.4 * The gene number of Arabidopsis is based on the 27,873 protein-coding and RNA genes from The Arabidopsis Information Resource website (http://www.arabidopsis.org/portals/ genAnnotation/genome_snapshot.jsp) and recently published 3,241 novel genes6. 27,873 protein coding and RNA genes, or including the 3,241 novel genes)2,13, 34% less than rice3, 46% less than poplar4 and 19% less than grape5 (Table 1). Comparison of the papaya genome with that of Arabidopsis sheds new light on angiosperm evolutionary history in several ways. Considering only the 200 longest papaya scaffolds, we found 121 colinear blocks. The papaya blocks range in size from 1.36 Mb containing 181 genes to 0.16 Mb containing 19 genes (a statistical, rather than a biological, lower limit); the corresponding Arabidopsis regions range from 0.69 Mb containing 163 genes to 60 kilobases (kb) containing 18 genes. Across the 121 papaya segments for which colinearity can be detected, 26 show primary correspondence (that is, excluding the effects of ancient triplication detailed below) to only one Arabidopsis segment, 41 to two, 21 to three, 30 to four, and only 3 to more than four. The fact that many papaya segments show co-linearity with two to four Arabidopsis segments (Fig. 
1, and Supplementary Figs 3 and 4) is most parsimoniously explained if either one or two genome duplications have affected the Arabidopsis lineage since its divergence from papaya. Although it was suspected that the most recent Arabidopsis genome duplication, α14, might affect only a subset of the Brassicales15, previous phylogenetic dating of these events15 had suggested that the more ancient β-duplication occurred early in the eudicot radiation, well before the Arabidopsis–Carica divergence. This incongruity is under investigation. In contrast, individual Arabidopsis genome segments correspond to only one papaya segment, indicating that no genome duplication has occurred in the papaya lineage since its divergence from Arabidopsis about 72 million years ago5. The lack of relatively recent papaya genome doubling is further supported by an L-shaped distribution of intra-EST correspondence for papaya (not shown). However, multiple genome/subgenome alignments (see Supplementary Methods) reveal evidence in papaya of the ancient ‘γ’ genome duplication shared with Arabidopsis and poplar that is postulated to have occurred near the origin of angiosperms14. Indeed, both papaya (with no subsequent duplication) and poplar (with a relatively low rate of duplicate gene loss) suggest that γ was not a duplication but a triplication (Fig. 1), with triplicated patterns evident for about 25% of the 247 Mb comprising the 200 largest papaya scaffolds. This is most probably an underestimate that will increase as papaya contiguity is improved. Triplication in papaya and poplar corresponds closely to the triplication suggested by an independent analysis of the grape genome5.

Figure 1 | Alignment of co-linear regions from Arabidopsis (green), papaya (magenta), poplar (blue) and grape (red). ‘Vv chr16r’ is an unordered ultracontig that has been assigned to grape chromosome 16. Triangles represent individual genes with transcriptional orientations. Several Arabidopsis regions belong to previously identified duplication segments (α3, α11, α20, β6, γ7, shown to the right)23. The whole syntenic alignment supports four distinct whole-genome duplication events: α, β within the Arabidopsis lineage, an independent duplication in poplar, and γ which is shared by all four eudicot genomes. Co-linear regions can be grouped into three γ sub-genomes based on Camin–Sokal parsimony criteria.
[Figure 1 region labels, in the order extracted: Cp sc29 0.4–0.1 Mb; Pt sc1 8.6–8.3 Mb; Pt sc3 11.3–11.6 Mb; Vv chr2 4.4–3.9 Mb; At chr1 4.3–4.3 Mb; At chr1 23.3–23.4 Mb; At chr4 7.2–7.1 Mb; At chr4 12.0–12.1 Mb; Cp sc18 1.4–1.6 Mb; Pt sc2 11.3–11.6 Mb; Pt sc14 1.7–1.9 Mb; Vv chr15 6.3–5.7 Mb; At chr2 18.8–18.8 Mb; At chr3 22.6–22.6 Mb; Cp sc4 3.2–3.9 Mb; Pt sc12 13.5–13.3 Mb; Pt sc15 9.7–9.9 Mb; Vv chr16r 2.6–3.0 Mb; At chr5 21.1–21.1 Mb. Duplication-segment labels: α20, β6, α3, γ7, α11.]

992 ©2008 Nature Publishing Group LETTERS NATURE | Vol 452 | 24 April 2008

[Figure 2 plot: bar chart of number of genes (y axis, 0–300) in Arabidopsis and papaya for transcription-factor tribes; the bar labels are not recoverable from the extracted text.]

A few hundred papaya chromosomal segments were aligned using BLASTZ to their one to four syntenic regions in Arabidopsis, and the results examined visually using the Genome Evolution (GEvo) viewer16.
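The gene-number comparisons and the co-linearity tallies reported above are simple arithmetic, and a short sketch can make them easy to re-check. The following Python is illustrative only: the function and variable names are hypothetical, and the inputs are the figures quoted in the text and in Table 1.

```python
# Number of papaya co-linear blocks matching 1, 2, 3, 4 or more than 4
# Arabidopsis segments, as quoted in the text.
SEGMENT_COUNTS = {"1": 26, "2": 41, "3": 21, "4": 30, ">4": 3}

# Gene numbers quoted in the text and Table 1.
PAPAYA_GENES = 24746
OTHER_GENOMES = {
    "Arabidopsis (27,873)": 27873,
    "Arabidopsis (31,114)": 31114,
    "rice": 37544,
    "poplar": 45555,
    "grape": 30434,
}


def summarize_depth(counts):
    """Return (total blocks, blocks matching 2-4 segments, that fraction)."""
    total = sum(counts.values())
    two_to_four = sum(counts[key] for key in ("2", "3", "4"))
    return total, two_to_four, two_to_four / total


def percent_fewer(papaya, other):
    """Percentage by which the papaya gene number falls below another genome's."""
    return 100.0 * (other - papaya) / other


if __name__ == "__main__":
    total, two_to_four, fraction = summarize_depth(SEGMENT_COUNTS)
    print(f"{total} blocks, {two_to_four} ({fraction:.0%}) match 2-4 segments")
    for name, genes in OTHER_GENOMES.items():
        print(f"{name}: papaya has {percent_fewer(PAPAYA_GENES, genes):.1f}% fewer genes")
```

Under these inputs the block counts sum to the 121 co-linear blocks stated in the text, and the rounded percentages agree with the 11–20%, 34%, 46% and 19% figures quoted for Arabidopsis, rice, poplar and grape.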
The orthologous region of grape was also included5, making the alignment a six-way comparison. One example is given in Supplementary Fig. 5: a 500 kb segment of papaya, its four 60 kb syntenic, orthologous Arabidopsis segments and the 400 kb orthologous segment of grape. For the homologous Arabidopsis segments that are discernibly colinear (by MC-SCANNER) to the 200 longest papaya scaffolds, 34.8% of Arabidopsis genes in any one segment correspond to a papaya gene, whereas only 24.8% of papaya genes in any one segment correspond to an Arabidopsis gene. Moreover, the Arabidopsis homologous segments contain fewer genes, on average only about 57.9% of the number in their papaya counterparts. Papaya provides a useful outgroup necessary to detect subfunctionalization. Supplementary Fig. 6 is a GEvo screenshot of a blastn alignment illustrating subfunctionalization of conserved non-coding sequences (CNSs)17 upstream of two syntenic, duplicate Arabidopsis genes and their single papaya orthologous gene. The α-duplicated genomes within Arabidopsis are perfect for CNS discovery18. Comparative analysis of the papaya and Arabidopsis 5′ untranslated regions showed that only 14% of orthologous promoter pairs exhibit significantly higher levels of sequence identity than random comparisons (Supplementary Figs 7 and 8). Although some highly conserved promoters show substantial conservation across much of their length, sequence similarity for most orthologous papaya promoters is indistinguishable from background. Global analysis of all inferred protein models from papaya, Arabidopsis, poplar, grape and rice clusters the 208,901 non-redundant protein sequences into 39,706 similarity groups, or ‘tribes’19, 11,851 of which contain two or more genes (see Supplementary Methods). Tribes with multiple genes in a species typically correspond to families or subfamilies of genes; however, tribes may also contain just one gene (‘singleton tribes’).
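The distinction between multi-gene tribes and singleton tribes can be illustrated with a small sketch. It assumes that the clustering has already been done and is available as a gene-to-tribe mapping; the gene and tribe identifiers below are invented for illustration and are not taken from the PlantTribes data.

```python
from collections import Counter


def tribe_sizes(gene_to_tribe):
    """Count how many genes fall into each tribe."""
    return Counter(gene_to_tribe.values())


def split_tribes(gene_to_tribe):
    """Return (singleton tribes, tribes with two or more genes) as sets."""
    sizes = tribe_sizes(gene_to_tribe)
    singletons = {tribe for tribe, n in sizes.items() if n == 1}
    multi_gene = {tribe for tribe, n in sizes.items() if n >= 2}
    return singletons, multi_gene


# Hypothetical toy assignment of genes to tribes.
EXAMPLE = {
    "cp_gene_001": "tribe_A",
    "at_gene_101": "tribe_A",
    "pt_gene_201": "tribe_A",
    "cp_gene_002": "tribe_B",
    "at_gene_102": "tribe_C",
    "os_gene_301": "tribe_C",
}

if __name__ == "__main__":
    singletons, multi_gene = split_tribes(EXAMPLE)
    print("singleton tribes:", sorted(singletons))
    print("multi-gene tribes:", sorted(multi_gene))
```

Applied to the full clustering, the size of the second set would correspond to the 11,851 tribes with two or more genes reported in the text.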
In papaya, 25,312 gene models were classified into 12,958 tribes, 5,669 of which were specific to papaya (Supplementary Table 4). Of the papaya-specific tribes, 5,314 were singleton tribes. EST support was markedly lower for genes in papaya-specific tribes (below 14%) than in tribes that included genes from at least one other taxon (72.4%). To investigate the smaller number of genes in papaya, we compared tribe membership from each of the five sequenced angiosperm species (Supplementary Table 5). Among the 6,726 tribes that contain genes from both Arabidopsis and papaya, 3,595 contain equal numbers of genes from both species. However, tribes with more Arabidopsis genes outnumber those with more papaya genes by more than 2:1 (2,153:979). The trend of smaller number of papaya genes is widespread across tribes of all sizes and major functional categories (Supplementary Table 6 and Supplementary Fig. 9). We then examined membership in the 815 tribes with members identified as being likely transcription factors in the Arabidopsis transcription factor database (http://arabidopsis.med.ohio-state.edu/AtTFDB/). This set includes 2,897 genes in Arabidopsis and 2,438 in papaya (a ratio of 1.19:1). The details of tribe membership are illustrated for 25 exemplar families and superfamilies (Fig. 2), where most transcription-factor tribes have fewer genes in papaya than Arabidopsis. Some transcription-factor tribes had more genes in papaya, specifically RWP-RK, MADS-box, Scarecrow, TCP and Jumonji gene families. Interestingly, the difference in MADS protein family size appears to be due to expanded numbers for half of the 36 MADS tribes. The other 18 MADS tribes had fewer papaya genes, including 14 that were not found in papaya.

Figure 2 | Comparison of gene numbers in transcription-factor tribe or related tribes from Arabidopsis and papaya. Most transcription factors are represented by fewer genes in papaya than Arabidopsis. Transcription-factor names are given, with values after the names corresponding to: number of tribes with genes assigned to transcription factor group, number of tribes with smaller counts in papaya than Arabidopsis, number of tribes with equal counts in papaya and Arabidopsis, number of tribes with larger counts in papaya, and number of tribes with zero members in papaya. Supporting data are provided in Supplementary Table 8.

Assuming that a generalized angiosperm could potentially require only the types and minimal numbers of genes that are shared among divergent plant species, we examined each of the tribes shared among the five angiosperms with sequenced genomes. The number of genes required in a minimal flowering plant is based on the observed minimum number of genes across each of the shared tribes (Table 2). When the smallest observed number is taken for each evolutionarily conserved tribe, a minimal angiosperm genome of 13,311 genes is estimated. Papaya has the smallest number of genes for more tribes than any other sequenced taxon (4,515, or 76% of 5,925 shared tribes), reinforcing the notion that papaya has fewer genes than any angiosperm sequenced so far. Only 55 nucleotide-binding site (NBS)-containing R genes were identified in papaya; about 28% of the 200 NBS genes in Arabidopsis20 and less than 10% of the 600 NBS genes in rice21. Resistance proteins also have a carboxy-terminal leucine-rich repeat (LRR) domain. These NBS-containing R-gene families can be subdivided into three classes: NBS–LRR, toll interleukin receptor (TIR)–NBS–LRR, and coiled-coil (CC)–NBS–LRR on the basis of their amino-terminal region. Papaya NBS–LRR outnumbered both TIR–NBS–LRR and CC–NBS–LRR genes, in contrast to both poplar (with more CC–NBS–LRR genes4) and Arabidopsis (with more TIR–NBS–LRR). More than 50% of the NBS-type R genes were clustered in about eight scaffolds, indicating that resistance gene evolution may involve duplication and divergence of linked gene families.
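The minimal-gene-number estimate described above (take the smallest observed gene count in each tribe shared by all species, then sum) can be sketched as follows. This is a minimal illustration under stated assumptions: the per-tribe counts are invented toy values, a tribe is treated as shared when every species has at least one gene in it, and ties for the minimum are credited to every tied species.

```python
from collections import Counter

SPECIES = ("papaya", "arabidopsis", "poplar", "rice", "grape")

# Hypothetical gene counts per tribe and species (toy values only).
TRIBE_COUNTS = {
    "T1": {"papaya": 1, "arabidopsis": 3, "poplar": 2, "rice": 2, "grape": 1},
    "T2": {"papaya": 2, "arabidopsis": 2, "poplar": 4, "rice": 3, "grape": 3},
    "T3": {"papaya": 1, "arabidopsis": 1, "poplar": 0, "rice": 1, "grape": 1},
}


def shared_tribes(tribe_counts, species=SPECIES):
    """Tribes in which every species has at least one gene."""
    return {
        tribe: counts
        for tribe, counts in tribe_counts.items()
        if all(counts.get(sp, 0) >= 1 for sp in species)
    }


def minimal_gene_number(tribe_counts, species=SPECIES):
    """Sum, over shared tribes, of the smallest per-species gene count."""
    shared = shared_tribes(tribe_counts, species)
    return sum(min(counts[sp] for sp in species) for counts in shared.values())


def minimum_holders(tribe_counts, species=SPECIES):
    """For each species, the number of shared tribes in which it has the minimum."""
    tally = Counter()
    for counts in shared_tribes(tribe_counts, species).values():
        smallest = min(counts[sp] for sp in species)
        for sp in species:
            if counts[sp] == smallest:
                tally[sp] += 1
    return tally


if __name__ == "__main__":
    print("shared tribes:", sorted(shared_tribes(TRIBE_COUNTS)))
    print("minimal gene number:", minimal_gene_number(TRIBE_COUNTS))
    print("minimum holders:", dict(minimum_holders(TRIBE_COUNTS)))
```

Because ties are credited to each tied species in this sketch, the per-species tallies can sum to more than the number of shared tribes.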
Table 2 | Deduced potential minimal angiosperm gene number based on species with smallest number of genes for each tribe

                            Shared tribes   Number of       Number of conserved tribes
                            with minimum    unique tribes   lost or missing from each species
Carica papaya               4,515           5,708           405
Arabidopsis thaliana        3,597           2,950           113
Populus trichocarpa         1,548           6,338           28
Oryza sativa (japonica)     3,657           13,003          429
Vitis vinifera              3,597           3,567           175

Shared tribes: 5,925
Minimal gene number: 13,331

Homologues for genes involved in cellulose biosynthesis are present in papaya and Arabidopsis, with more cellulose synthase genes in poplar, perhaps associated with wood formation. Papaya has at least 32 putative β-glucosyl transferase (GT1) genes compared with 121 in Arabidopsis identified using sequence alignment. A total of 38 and 40 cellulose synthase-related genes (GT2) were identified in papaya using the 48 poplar and 31 Arabidopsis genes as queries, respectively. These genes include 11 cellulose synthase (CesA) genes, the same number as in Arabidopsis but 7 fewer than in poplar. Putative cellulose orientation genes (COBRA) were more abundant in Arabidopsis (12) than in papaya (8). Papaya also has a similar complement of genes for cell-wall synthesis, though fewer of them than Arabidopsis. Papaya and Arabidopsis, respectively, have 6 and 12 callose synthase genes (GT2); 15 and 15 xyloglucan α-1,2-fucosyl transferases (GT37); 5 and 7 β-glucuronic acid transferases in families GT43 and GT47; and 27 and 42 in GT8 that includes galacturonosyl transferases, associated with pectin synthesis. The cell wall of plants is capable of both plastic and elastic extension, and controls the rate and direction of cell expansion22. Despite fewer whole-genome duplications, papaya has a similar number of putative expansin A genes (24) as Arabidopsis (26) and poplar (27), and more expansin B genes (10) than Arabidopsis (6) and poplar (3).
In contrast to expansion-related genes, papaya has on average about 25% fewer cell-wall degradation genes than Arabidopsis, in some cases far fewer. For example, papaya and Arabidopsis, respectively, have 4 and 12 endoxylanase-like genes in glycoside hydrolase family 10 (GH10); 29 and 67 pectin methyl esterases (carbohydrate esterase family 8); 28 and 69 polygalacturonases (GH28); 15 and 49 xyloglucan endotransglycosylase/hydrolases (GH16); 18 and 25 β-1,4-endoglucanases (GH9); 42 and 91 β-1,3-glucanases (GH17); and 15 and 27 pectin lyases (PL1). A semi-woody giant herb that accumulates lignin in the cell wall at an intermediate level between Arabidopsis and poplar, papaya generally has intermediate numbers of lignin synthetic genes, fewer than poplar but more than Arabidopsis despite fewer opportunities for duplication in papaya. Poplar, papaya and Arabidopsis have 37, 30 and 18 candidate genes for the lignin synthesis pathway, respectively4,23, with papaya having an intermediate number of genes for the PAL, C4H, 4CL and HCT gene families, and only one COMT and two C3H genes. In contrast, poplar has three C3H genes, which are presumed to convert p-coumaroyl quinic acid to caffeoyl shikimic acid, whereas there are two in papaya and one in Arabidopsis. Papaya, Arabidopsis and poplar each have two genes in the family CCoAOMT, which are presumed to convert caffeic acid to ferulic acid4. Compared with these other plants, papaya has the fewest genes in the CCR gene family (1 gene) and the most in the F5H (4 genes) and CAD gene families (18 genes), which all mediate later steps of the lignin biosynthesis pathway. More starch-associated genes in papaya, a perennial, may be due to a greater need for storage in leaves, stem and developing fruit than in Arabidopsis, an ephemeral that stores oil in the seed.
Papaya and Arabidopsis, respectively, have 13 and 6 putative starch synthase (GT5) genes; 8 and 3 starch branching genes; 6 and 3 isoamylases (GH13); and 12 and 9 β-amylases (GH14). Early unloading of fruit sugar in papaya is probably symplastic24, with five genes for sucrose synthase/sucrose phosphate synthase (GT4); seven are reported for Arabidopsis. Five acid invertase (GH32) sequences were found in papaya whereas 11 have been reported in Arabidopsis. Papaya has at least seven putative neutral invertase (GH32) genes; Arabidopsis has six. Wall-associated kinases (WAK) are thought to be involved in the regulation of vacuolar invertases, with 17 in Arabidopsis and only 10 in papaya. Arabidopsis and papaya have 14 and 7 hexose transporters, respectively. The greater number of genes for sugar accumulation in Arabidopsis may reflect recent genome duplications. Papaya has undergone particularly striking amplification of genes involved in volatile development. Papaya and Arabidopsis, respectively, have 18 and 8 genes for cinnamyl alcohol dehydrogenase; 2 and 1 genes for cinnamate-4-hydroxylase; 9 and 3 genes for phenylalanine ammonia lyase; and 24 and 3 limonene cyclase genes. Papaya ripening is climacteric, with the rise in ethylene production occurring at the same time as the respiratory increase25. Papaya and Arabidopsis, respectively, have similar numbers of genes involved in ethylene synthesis, with four each for S-adenosyl methionine synthase (SAM synthase); 8 and 13 for aminocyclopropane carboxylic acid (ACC) synthase (ACS); 8 and 12 for ACC oxidase (ACO); and 42 and 64 for ethylene-responsive binding factors (AP2/ERF). Because papaya grows in tropical climates where daily light/dark cycles do not change much over the year, we can ask if more or fewer light/circadian genes are required to synchronize with the environment. In fact, there are fewer light/clock genes in the papaya genome (49% and 34% of poplar and Arabidopsis, respectively; Supplementary Table 7).
However, among the core circadian clock genes, the pseudo-response regulators (PRRs; Supplementary Fig. 10) have expanded in poplar compared with Arabidopsis, and the papaya PRR7 cluster has seemingly duplicated with the recent poplar salicoid-specific genome duplication4 (Supplementary Fig. 11). Against the backdrop of fewer overall genes, the parallel expansion of the PRRs is consistent with circadian timing being important in papaya. The PAS–FBOX–KELCH genes control light signalling and flowering time; however, the only papaya orthologue (ZTL) lacks an obvious KELCH domain compared with Arabidopsis and poplar, which have five and one KELCH domains, respectively (Supplementary Fig. 10). In fact, the papaya genome contains fewer KELCH domains (37 compared with 130 and 74 in Arabidopsis and poplar, respectively). In contrast, there are three constitutive photomorphogenic 1 (COP1) paralogues in the papaya genome compared with only one in Arabidopsis (Supplementary Tables 7 and 8). A similar expansion has been noted in moss (Physcomitrella patens), which has nine COP1 paralogues that are hypothesized to aid in tolerance to ultraviolet light (Supplementary Fig. 12)26. Both KELCH domains and the WD-40 of the COP1 family form β-propellers and play a role in light-mediated ubiquitination. There is not a general expansion of WD-40 genes in papaya (173 compared with 227 in Arabidopsis). Perhaps papaya has developed an alternative way of integrating light or timing information specific to day-neutral plants, such as a strict adherence to the diel light/dark cycle that is better served by the COP-mediated system. Sex determination in papaya is controlled by a pair of primitive sex chromosomes, with a small male-specific region of the Y chromosome (MSY)8. The physical map of the MSY is currently estimated by chromosome walking to span about 8 Mb (ref. 27).
Two scaffolds in the current female-genome sequence align to the X chromosome physical map based on BAC end sequences, spanning 4.5 Mb and including 254 predicted protein-encoding genes, of which 75 (29.5%) have EST support (Supplementary Table 9 and Supplementary Fig. 13). If adjusted for the percentage of unigene validation for other genes (48.0%), the estimated number of genes in the X-specific region would be 156. The average gene density would be one gene per 19.5 kb, lower than the estimated genome average of one gene per 14.3 kb. By contrast, among seven completely sequenced MSY BACs totalling 1.2 Mb, a total of four expressed genes were found on two of the BACs14,28. The somewhat lower-than-average gene density in the X-specific scaffolds is accompanied by more repetitive DNA (58.3%) than the genome-wide average, perhaps because this region is near the centromere28. Re-analysis of the repetitive DNA content of the MSY BACs, to include the new papaya-specific repeat families identified herein, increased the average repeat sequence to 85.6%, with 54.1% Gypsy and 1.9% Copia retro-elements (Supplementary Table 10). This compares with an earlier estimate of 17.9% using the Arabidopsis repeat database alone28.

The SunUp genome has presented an opportunity to analyse transgene insertion sites critically. Southern blot analysis was key in the initial identification of transgenic insertion fragments and was performed with probes spanning the entire 19,567-bp transformation vector used for bombardment (Supplementary Fig. 14). Among the identified inserts were the functional coat-protein transgene conferring resistance to papaya ringspot virus, which was found in an intact 9,789-bp fragment of the transformation plasmid, and a 1,533-bp fragment composed of a truncated, non-functional tetA gene and flanking vector backbone sequence.
The structures of the coat-protein transgene and tetA region insertion sites were determined from cloned sequences. Southern analysis also confirmed a 290-bp non-functional fragment of the nptII gene originally identified by WGS sequence analysis (Supplementary Fig. 15). Five of the six flanking sequences of the three insertions are nuclear DNA copies of papaya chloroplast DNA fragments. The integration of the transgenes into chloroplast DNA-like sequences may be related to the observation that transgenes produced either by Agrobacterium-mediated or biolistic transformation are often inserted in AT-rich DNA29, as is the chloroplast DNA of papaya and other land plants. Four of the six insert junctions have sequences that match topoisomerase I recognition sites, which are associated with breakpoints in genomic DNA transgene insertion sites and transgene rearrangements29. The presence of these inserts was confirmed by high-throughput MUMmer30 analysis for each region of the transformation vector. Evidence for the presence of other transgene inserts is not conclusive (Supplementary Note 3). Its lower overall gene number notwithstanding, striking variations in gene number within particular functional groups, superimposed on the average approximate 20% reduction in papaya gene number relative to Arabidopsis, may be related to key features of papaya morphological evolution. Despite a closer evolutionary relationship to Arabidopsis, papaya shares with poplar an increased number of genes associated with cell expansion, consistent with larger plant size; and lignin biosynthesis, consistent with the convergent evolution of tree-like habit. Amplification of starch-synthesis genes in papaya relative to Arabidopsis is consistent with a greater need for storage in leaves, stem and developing fruit of this perennial.
Tremendous amplification in papaya of genes related to volatile development implies strong natural selection for enhanced attractants that may be key to fruit (seed) dispersal by animals and which may also have attracted the attention of aboriginal peoples. This also foreshadows what we might expect to discover in the genomes of other fragrant-fruited trees, as well as plants with striking fragrance of leaves (herbs), flowers or other organs. Arguably, the sequencing of the genome of SunUp papaya makes it the best-characterized commercial transgenic crop. Because papaya ringspot virus is widespread in nearly all papaya-growing regions, SunUp could serve as a transgenic germplasm source that could be used to breed suitable cultivars resistant to the virus in various parts of the world. The characterization of the precise transgenic modifications in SunUp papaya should also serve to lower regulatory barriers currently in place in some countries.

METHODS SUMMARY
Gene annotation. Papaya unigenes from complementary DNA were aligned to the unmasked genome assembly, which was then used in training ab initio gene prediction software. Spliced alignments of proteins from the plant division of GenBank, and transcripts from related angiosperms, were generated. Gene predictions were combined with spliced alignments of proteins and transcripts to produce a reference gene set. Detailed descriptions are given in Methods.

Full Methods and any associated references are available in the online version of the paper at www.nature.com/nature.

Received 6 September 2007; accepted 22 February 2008.

1. Gonsalves, D. Control of papaya ringspot virus in papaya: a case study. Annu. Rev. Phytopathol. 36, 415–437 (1998).
2. The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
3. International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature 436, 793–800 (2005).
4. Tuskan, G. A. et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313, 1596–1604 (2006).
5. Jaillon, C. O. et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463–467 (2007).
6. Arumuganathan, K. & Earle, E. D. Nuclear DNA content of some important plant species. Plant Mol. Biol. Rep. 9, 208–218 (1991).
7. Fitch, M. M. M., Manshardt, R. M., Gonsalves, D., Slightom, J. L. & Sanford, J. C. Virus resistant papaya plants derived from tissues bombarded with the coat protein gene of papaya ringspot virus. Bio/technology 10, 1466–1472 (1992).
8. Liu, Z. et al. A primitive Y chromosome in papaya marks incipient sex chromosome evolution. Nature 427, 348–352 (2004).
9. Wikström, N., Savolainen, V. & Chase, M. W. Evolution of the angiosperms: calibrating the family tree. Proc. R. Soc. Lond. B 268, 2211–2220 (2001).
10. Storey, W. B. Papaya. in Outlines of Perennial Crop Breeding in the Tropics (eds Ferwerda, F. P. and Wit, F.) 389–408 (H. Veenman & Zonen, Wageningen, 1969).
11. Li, L. et al. Genome-wide transcription analyses in rice using tiling microarrays. Nature Genet. 38, 124–129 (2006).
12. Lander, E. S. & Waterman, M. S. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2, 231–239 (1988).
13. Hanada, K., Zhang, X., Borevitz, J. O., Li, W.-H. & Shiu, S.-H. A large number of novel coding small open reading frames in the intergenic regions of the Arabidopsis thaliana genome are transcribed and/or under purifying selection. Genome Res. 17, 632–640 (2007).
14. Bowers, J. E., Chapman, B. A., Rong, J. & Paterson, A. H. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422, 433–438 (2003).
15. Schranz, M. E. & Mitchell-Olds, T. Independent ancient polyploidy events in the sister families Brassicaceae and Cleomaceae. Plant Cell 18, 1152–1165 (2006).
16. Lyons, E. & Freeling, M. How to usefully compare homologous plant genes and chromosomes as DNA sequence. Plant J. 53, 661–673 (2008).
17. Inada, D. C. et al. Conserved noncoding sequences in the grasses. Genome Res. 13, 2030–2041 (2003).
18. Thomas, B. C., Rapaka, L., Lyons, E., Pedersen, B. & Freeling, M. Arabidopsis intragenomic conserved noncoding sequence. Proc. Natl Acad. Sci. USA 104, 3348–3353 (2007).
19. Wall, P. K. et al. PlantTribes: a gene and gene family resource for comparative genomics in plants. Nucleic Acids Res. 36, D970–D976 (2008).
20. Meyers, B. C., Morgante, M. & Michelmore, R. W. TIR-X and TIR-NBS proteins: two new families related to disease resistance TIR-NBS-LRR proteins encoded in Arabidopsis and other plant genomes. Plant J. 32, 77–92 (2002).
21. Zhou, T. et al. Genome-wide identification of NBS genes in japonica rice reveals significant expansion of divergent non-TIR NBS-LRR genes. Mol. Genet. Genomics 271, 402–415 (2004).
22. Fry, S. C. Primary cell wall metabolism: tracking the careers of wall polymers in living plant cells. New Phytol. 161, 641–675 (2004).
23. Ehlting, J. et al. Global transcript profiling of primary stems from Arabidopsis thaliana identifies candidate genes for missing links in lignin biosynthesis and transcriptional regulators of fiber differentiation. Plant J. 42, 618–640 (2005).
24. Zhou, L. L. & Paull, R. E. Sucrose metabolism during papaya (Carica papaya) fruit growth and ripening. J. Am. Soc. Hortic. Sci. 126, 351–357 (2001).
25. Paull, R. E. & Chen, N. J. Postharvest variation in cell wall-degrading enzymes of papaya (Carica papaya L.) during fruit ripening. Plant Physiol. 72, 382–385 (1983).
26. Richardt, S., Lang, D., Reski, R., Frank, W. & Rensing, S. A. PlanTAPDB, a phylogeny-based resource of plant transcription-associated proteins. Plant Physiol. 143, 1452–1466 (2007).
27. Yu, Q. et al. Low X/Y divergence of four pairs of papaya sex-linked genes. Plant J. 53, 124–132 (2008).
28. Yu, Q. et al. Chromosomal location and gene paucity of the male specific region on papaya Y chromosome. Mol. Genet. Genomics 278, 177–185 (2007).
29. Sawasaki, T., Takahashi, M., Goshima, N. & Morikawa, H. Structures of transgene loci in transgenic Arabidopsis plants obtained by particle bombardment: junction regions can bind to nuclear matrices. Gene 218, 27–35 (1998).
30. Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004).

Supplementary Information is linked to the online version of the paper at www.nature.com/nature.

Acknowledgements We thank X. Wan, J. Saito and A. Young at the University of Hawaii for technical assistance; C. Detter at the DOE Joint Genome Institute; F. MacKenzie, O. Veatch and T. Uhm at the Hawaii Agriculture Research Center; L. Li, W. Teng, Y. Wu, Y. Yang, C. Zhou, N. Wang, P. Wang and D. Fei at the Tianjin Biochip Corporation, Tianjin Economic-Technological Development Area, Tianjin; and R. Herdes, L. Diebold, R. Kim, A. Hernandez, S. Ali and L. Bynum at the University of Illinois at Urbana-Champaign. This papaya genome-sequencing project was given support by the University of Hawaii and the US Department of Defense grant number W81XWH0520013 to M.A., the Maui High Performance Computing Center to M.A., the Hawaii Agriculture Research Center to R.M. and Q.Y., and Nankai University, China, to L.W. Other support to the papaya genome project included the United States Department of Agriculture T-STAR program; a United States Department of Agriculture–Agricultural Research Service cooperative agreement (CA 58-3020-8-134) with the Hawaii Agriculture Research Center; the University of Illinois; the National Science Foundation Plant Genome Research Program; and Tianjin Municipal Special Fund for Science and Technology Innovation Grant 05FZZDSH00800. We thank P.
Englert, former chancellor of the University of Hawaii, for initial infrastructure support of the research.

Author Information The papaya WGS sequence is deposited at DNA Data Bank of Japan/European Molecular Biology Laboratory/GenBank under accession number ABIM00000000. The version described in this paper is the first version, ABIM01000000. The GenBank accession numbers of the papaya ESTs are EX227656–EX303501. This paper is distributed under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike licence, and is freely available to all readers at www.nature.com/nature. Reprints and permissions information is available at www.nature.com/reprints. Correspondence and requests for materials should be addressed to M.A. ([email protected]) or L.W. ([email protected]).

doi:10.1038/nature06856

METHODS
Genome assembly. The genome sequence was assembled by Arachne31. WGS reads and BAC end reads were trimmed by LUCY and screened for organellar sequences32. Two approaches were applied to screening and removing reads of presumably organellar origin to alleviate the load in assembling highly repetitive regions by WGS assembly software. The first approach was an iterative process, in which reads were assembled, contigs matching organellar genomes were identified, their constituent reads were removed, and the process was repeated for two or three more rounds. This approach produced the read sets for the released assemblies Stripped3 and Stripped4. The second approach was to remove plasmid clones and BAC clones of presumably organellar origin by identifying clones with both end reads matching entirely with organellar genomes, with physical map information used to supplement the identification of BAC clones. Two rounds of iterative screening based on pairing information of assembled and unplaced reads were added to the second approach to generate the read set for the released Papaya1.0 assembly.
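The second screening approach described above (dropping clones whose two end reads both match organellar genomes) can be sketched as a simple filter. This is only an illustration of the stated rule, not the pipeline used for the assembly: the clone and read identifiers are hypothetical, and the set of organellar-matching reads is assumed to have been produced beforehand by an alignment step that is not shown.

```python
def flag_organellar_clones(clone_ends, organellar_reads):
    """Return the clones whose forward and reverse end reads both match organellar genomes.

    clone_ends: dict mapping clone id -> (forward read id, reverse read id)
    organellar_reads: set of read ids that matched entirely with organellar genomes
    """
    flagged = set()
    for clone_id, (forward, reverse) in clone_ends.items():
        if forward in organellar_reads and reverse in organellar_reads:
            flagged.add(clone_id)
    return flagged


def remaining_reads(clone_ends, flagged):
    """Return the sorted read ids that survive after removing flagged clones."""
    return sorted(
        read
        for clone_id, pair in clone_ends.items()
        if clone_id not in flagged
        for read in pair
    )


# Hypothetical toy input: only clone c1 has both end reads in the organellar set.
CLONE_ENDS = {
    "c1": ("c1.f", "c1.r"),
    "c2": ("c2.f", "c2.r"),
    "c3": ("c3.f", "c3.r"),
}
ORGANELLAR_READS = {"c1.f", "c1.r", "c2.f"}

if __name__ == "__main__":
    flagged = flag_organellar_clones(CLONE_ENDS, ORGANELLAR_READS)
    print("flagged clones:", sorted(flagged))
    print("reads kept:", remaining_reads(CLONE_ENDS, flagged))
```

Requiring both ends to match keeps clones such as c2 above, where only one end read matches organellar sequence and the clone may span a nuclear insertion of organellar DNA.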
The sequence error rates were estimated by aligning assembled shotgun sequences with two finished BACs (GenBank accession numbers EF661023 and EF661026). The error rate of the assembly at 3× coverage or deeper (74.2% of assembled sequences) was less than 0.01% based on average quality values of 20 or greater in trimmed sequence. The error rate at 2× coverage (16.3%) was 0.37%. The error rate at 1× coverage (9.5%) was approximately 0.75%, because these sequences are at the ends of the contigs (and sequence reads) where the sequence quality declined.
Genome annotation. Gene annotation was conducted following the TIGR Eukaryotic Annotation Pipeline. Repeat sequences were identified in the assembled genome and masked by RepeatMasker, RepeatScout and TransposonPSI, based on known repeat elements in RepBase databases and TIGR Plant Repeat Databases, and the papaya novel repeat database constructed in this study33,34. Program to Assemble Spliced Alignments (PASA)35 was used to generate spliced alignments of papaya unigenes to the unmasked assembly, which was then used in training ab initio gene prediction software Augustus, GlimmerHMM and SNAP36–38. Ab initio gene prediction software Fgenesh, Genscan and TWINSCAN were trained on Arabidopsis39–41. Spliced alignments of proteins from the plant division of GenBank and transcripts from related angiosperms (Arabidopsis thaliana, Glycine max, Gossypium hirsutum, Medicago truncatula, Nicotiana tabacum, Oryza sativa, Zea mays) were generated by the Analysis and Annotation Tool (AAT)42. Spliced alignments of proteins from the Pfam database were generated using GeneWise43,44. Gene predictions generated by Augustus, Fgenesh, Genscan, GlimmerHMM, SNAP and TWINSCAN were combined with spliced alignments of proteins and transcripts to produce a reference gene set using the evidence-based combiner EVidenceModeler (EVM)45.
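The per-coverage error rates reported for the assembly above can be combined into a rough assembly-wide figure by weighting each rate by the fraction of sequence in its coverage class. The sketch below is an illustration only: the weighted value is not a figure from the paper, the "less than 0.01%" class is treated as exactly 0.01% (so the result is an upper bound for that class), and the "approximately 0.75%" value is taken at face value.

```python
# (fraction of assembled sequence, error rate in percent) per coverage class,
# using the values quoted in the text.
COVERAGE_CLASSES = [
    (0.742, 0.01),  # 3x coverage or deeper: less than 0.01% (upper bound used)
    (0.163, 0.37),  # 2x coverage
    (0.095, 0.75),  # 1x coverage: approximately 0.75%
]


def weighted_error_rate(classes):
    """Fraction-weighted mean error rate (percent) across coverage classes."""
    total_fraction = sum(fraction for fraction, _ in classes)
    weighted_sum = sum(fraction * rate for fraction, rate in classes)
    return weighted_sum / total_fraction


if __name__ == "__main__":
    print(f"weighted error rate: about {weighted_error_rate(COVERAGE_CLASSES):.2f}%")
```

Under these assumptions most of the combined error comes from the 1× and 2× classes, even though together they make up only about a quarter of the assembled sequence.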
Protein domains were predicted using InterProScan against protein databases (PRINTS, Pfam, ProDom, PROSITE, SMART)46–50.
Construction of papaya repeat database. We used a combination of homology-based and de novo methods to identify signatures of transposable elements in the papaya genome. We used RepeatMasker (http://www.repeatmasker.org) in combination with a custom-built library of plant repeat elements for our initial classification of transposable elements. The customized library was generated by combining plant repeats from Repbase and plant repeat databases from TIGR (ftp://ftp.tigr.org/pub/data/TIGR_Plant_Repeats)33. Repeat elements identified as ribosomal RNA sequences in the TIGR databases match a large fraction of the papaya genome (about 3%). Ribosomal RNAs were identified separately, and therefore were excluded from our repeat library, leaving a database of 76,924 repeat sequences that were used to search the papaya genome. Homology-based methods are limited to finding elements that have not diverged too greatly from known repeats. Because databases of known transposable elements are necessarily incomplete, we used additional de novo methods to search for repeat elements in papaya contigs. For this, we applied two recently developed repeat-finding tools, PILER and RepeatScout, to the complete set of contigs from the papaya genome34,51. PILER was able to find 428 repeat families whereas RepeatScout found 6,596 repeat sequences. The repeat families obtained from PILER and RepeatScout were annotated using a combination of manual curation (786 repeat families) and automated analysis. For the automated annotation, the combined data set from PILER and RepeatScout was made non-redundant (using CD-HIT at the 90% similarity level), leaving behind 6,240 repeat families52. As a post-processing step, we selected only those families that had at least ten good (E value < 1 × 10^-20) BLAST matches to papaya contigs.
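The post-processing filter described above (keep a repeat family only if it has at least ten good BLAST matches) can be sketched as follows. The sketch reads the extracted threshold as an E value below 1 × 10^-20, which is an assumption about the garbled original, and it takes pre-parsed (family, E value) pairs as input rather than real BLAST output; the family names are hypothetical.

```python
from collections import Counter


def select_supported_families(hits, min_hits=10, max_evalue=1e-20):
    """Return the repeat families with at least `min_hits` matches below `max_evalue`.

    hits: iterable of (family_id, evalue) pairs, one per BLAST match to a contig.
    """
    good_hits = Counter(family for family, evalue in hits if evalue < max_evalue)
    return {family for family, count in good_hits.items() if count >= min_hits}


# Hypothetical toy input: famA has ten good matches; famB has nine good
# matches plus five weak ones, so it falls short of the threshold.
EXAMPLE_HITS = (
    [("famA", 1e-30)] * 10
    + [("famB", 1e-30)] * 9
    + [("famB", 1e-5)] * 5
)

if __name__ == "__main__":
    print(sorted(select_supported_families(EXAMPLE_HITS)))
```

Counting only matches that pass the E-value cut-off, rather than all matches, is what distinguishes famA from famB in this toy input.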
The resulting data set contained 2,198 repeat families in the papaya genome. BLAST searches against non-redundant and PTREP (http://wheat.pw.usda.gov/ITMI/Repeats) were then used to identify repeat families matching genes associated with transposons and retrotransposons. This procedure discovered an additional 103 repeat families that could be annotated as being retrotransposons. The combined database of 889 annotated papaya-specific transposable-element sequences was used in addition to the database of known repeats to annotate the papaya genome. The remaining, unannotated repeat families (1,455 sequences with no matches to known genes) were then used to estimate the additional repeat content of the genome. 31. Jaffe, D. B. et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13, 91–96 (2003). 32. Chou, H. H. & Holmes, M. H. DNA sequence quality trimming and vector removal. Bioinformatics 17, 1093–1104 (2001). 33. Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker (Release Open-3.1.3, 2006). 34. Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21 (suppl.), i351–i358 (2005). 35. Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003). 36. Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19 (suppl.), ii215–ii225 (2003). 37. Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 (2004). 38. Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004). 39. Salamov, A. A. & Solovyev, V. V. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516–522 (2000). 40. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997). 41. 
Korf, I., Flicek, P., Duan, D. & Brent, M. R. Integrating genomic homology into gene structure prediction. Bioinformatics 17 (suppl. 1), S140–S148 (2001). 42. Huang, X., Adams, M. D., Zhou, H. & Kerlavage, A. R. A tool for analyzing and annotating genomic sequences. Genomics 46, 37–45 (1997). 43. Finn, R. D. et al. Pfam: clans, web tools and services. Nucleic Acids Res. 34 (Database issue), D247–D251 (2006). 44. Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res. 14, 988–995 (2004). 45. Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, R7.1–R7.19 (2008). 46. Quevillon, E. et al. InterProScan: protein domains identifier. Nucleic Acids Res. 33, W116–W120 (2005). 47. Attwood, T. K. et al. PRINTS and its automatic supplement, prePRINTs. Nucleic Acids Res. 31, 400–402 (2003). 48. Bru, C. et al. The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res. 33 (Database issue), D212–D215 (2005). 49. Hulo, N. et al. The PROSITE database. Nucleic Acids Res. 34 (Database issue), D227–D230 (2006). 50. Letunic, I. et al. SMART 5: domains in the context of genomes and networks. Nucleic Acids Res. 34 (Database issue), D257–D260 (2006). 51. Edgar, R. C. & Myers, E. W. PILER: Identification and classification of genomic repeats. Bioinformatics 21 (suppl.), i152–i158 (2005). 52. Li, W. & Godzik, A. CD-HIT: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, i1658–i1659 (2006). ©2008 Nature Publishing Group Vol 457 | 29 January 2009 | doi:10.1038/nature07723 ARTICLES The Sorghum bicolor genome and the diversification of grasses Andrew H. Paterson1, John E. 
Bowers1, Rémy Bruggmann2, Inna Dubchak3, Jane Grimwood4, Heidrun Gundlach5, Georg Haberer5, Uffe Hellsten3, Therese Mitros6, Alexander Poliakov3, Jeremy Schmutz4, Manuel Spannagl5, Haibao Tang1, Xiyin Wang1,7, Thomas Wicker8, Arvind K. Bharti2, Jarrod Chapman3, F. Alex Feltus1,9, Udo Gowik10, Igor V. Grigoriev3, Eric Lyons11, Christopher A. Maher12, Mihaela Martis5, Apurva Narechania12, Robert P. Otillar3, Bryan W. Penning13, Asaf A. Salamov3, Yu Wang5, Lifang Zhang12, Nicholas C. Carpita14, Michael Freeling11, Alan R. Gingle1, C. Thomas Hash15, Beat Keller8, Patricia Klein16, Stephen Kresovich17, Maureen C. McCann13, Ray Ming18, Daniel G. Peterson1,19, Mehboob-ur-Rahman1,20, Doreen Ware12,21, Peter Westhoff10, Klaus F. X. Mayer5, Joachim Messing2 & Daniel S. Rokhsar3,4

Sorghum, an African grass related to sugar cane and maize, is grown for food, feed, fibre and fuel. We present an initial analysis of the ~730-megabase Sorghum bicolor (L.) Moench genome, placing ~98% of genes in their chromosomal context using whole-genome shotgun sequence validated by genetic, physical and syntenic information. Genetic recombination is largely confined to about one-third of the sorghum genome with gene order and density similar to those of rice. Retrotransposon accumulation in recombinationally recalcitrant heterochromatin explains the ~75% larger genome size of sorghum compared with rice. Although gene and repetitive DNA distributions have been preserved since palaeopolyploidization ~70 million years ago, most duplicated gene sets lost one member before the sorghum–rice divergence. Concerted evolution makes one duplicated chromosomal segment appear to be only a few million years old. About 24% of genes are grass-specific and 7% are sorghum-specific. Recent gene and microRNA duplications may contribute to sorghum’s drought tolerance.
The Saccharinae plants include some of the most efficient biomass accumulators, providing food and fuel from starch (sorghum) and sugar (sorghum and Saccharum, sugar cane), and have potential for use as cellulosic biofuel crops (sorghum, sugar cane, Miscanthus). Of singular importance to Saccharinae productivity is C4 photosynthesis, comprising biochemical and morphological specializations that increase net carbon assimilation at high temperatures1. Despite their common photosynthetic strategy, the Saccharinae show much morphological and genomic variation (Supplementary Fig. 1).

Its small genome (~730 Mb) makes sorghum an attractive model for functional genomics of Saccharinae and other C4 grasses. Rice, the first fully sequenced cereal genome, is more representative of C3 photosynthetic grasses. Drought tolerance makes sorghum especially important in dry regions such as northeast Africa (its centre of diversity) and the southern plains of the United States. Genetic variation in the partitioning of carbon into sugar stores versus cell wall mass, and in perenniality and associated features such as tillering and stalk reserve retention2, make sorghum an attractive system for the study of traits important in perennial cellulosic biomass crops. Its high level of inbreeding makes it an attractive association genetics system3. Transgenic approaches to sorghum improvement are constrained by high gene flow to weedy relatives4, making knowledge of its intrinsic genetic potential all the more important.

Reconstructing a repeat-rich genome from shotgun sequences

Preferred approaches to sequencing entire genomes are currently to apply shotgun sequencing5 either to a minimum ‘tiling path’ of genomic clones, or to genomic DNA directly. The latter approach, whole-genome shotgun (WGS) sequencing, is widely used for mammalian genomes, being fast, relatively economical and reducing cloning bias. However, its applicability has been questioned for repetitive DNA-rich plant genomes6.
Despite a repeat content of ~61%, a high-quality genome sequence was assembled from homozygous sorghum genotype BTx623 by using WGS and incorporating the following: (1) ~8.5 genome equivalents of paired-end reads7 from genomic libraries spanning a ~100-fold range of insert sizes (Supplementary Table 1), resolving many repetitive regions; and (2) high-quality read length averaging 723 bp, facilitating assembly. Comparison with 27 finished bacterial artificial chromosomes (BACs) showed the WGS assembly to be >98.46% complete and accurate to <1 error per 10 kb (Supplementary Note 2.5).

1Plant Genome Mapping Laboratory, University of Georgia, Athens, Georgia 30602, USA. 2Waksman Institute for Microbiology, Rutgers University, Piscataway, New Jersey 08854, USA. 3DOE Joint Genome Institute, Walnut Creek, California 94598, USA. 4Stanford Human Genome Center, Stanford University, Palo Alto, California 94304, USA. 5MIPS/IBIS, Helmholtz Zentrum München, Ingolstaedter Landstrasse 1, 85764 Neuherberg, Germany. 6Center for Integrative Genomics, University of California, Berkeley, California 94720, USA. 7College of Sciences, Hebei Polytechnic University, Tangshan, Hebei 063000, China. 8Institute of Plant Biology, University of Zurich, Zollikerstrasse 107, 8008 Zurich, Switzerland. 9Department of Genetics and Biochemistry, Clemson University, Clemson, South Carolina 29631, USA. 10Institut für Entwicklungs- und Molekularbiologie der Pflanzen, Heinrich-Heine-Universität, Universitätsstrasse 1, D-40225 Düsseldorf, Germany. 11Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA. 12Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA. 13Department of Biological Sciences, 14Department of Botany and Plant Pathology, Purdue University, West Lafayette, Indiana 47907, USA. 15International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru 502 324, India.
16Department of Horticulture and Institute for Plant Genomics and Biotechnology, Texas A&M University, College Station, Texas 77843, USA. 17Institute for Genomic Diversity, Cornell University, Ithaca, New York 14853, USA. 18Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA. 19Mississippi Genome Exploration Laboratory, Mississippi State University, Starkville, Mississippi 39762, USA. 20National Institute for Biotechnology & Genetic Engineering (NIBGE), Faisalabad, Pakistan. 21USDA NAA Robert Holley Center for Agriculture and Health, Ithaca, New York 14853, USA.

©2009 Macmillan Publishers Limited. All rights reserved

Comparison with a high-density genetic map8, a ‘finger-print contig’ (FPC)-based physical map9, and the rice sequence6 improved the sorghum WGS assembly (Supplementary Notes 1 and 2). Among the 201 largest scaffolds (spanning 678.9 Mb, 97.3% of the assembly), 28 showed discrepancies with two or more of these lines of evidence (Supplementary Note 2.6), often near repetitive elements. After breaking the assembly at the points of discrepancy, the resulting 229 scaffolds have an N50 (number of scaffolds that collectively cover at least 50% of the assembly) of 35 and L50 (length of the shortest scaffold among those that collectively cover 50% of the assembly) of 7.0 Mb. A total of 38 (2%) of 1,869 FPC contigs9 were deemed erroneous, containing >5 BAC ends that fell into different sequence scaffolds. A total of 127 scaffolds containing 625.7 Mb (89.7%) of DNA and 1,476 FPC contigs could be assigned to chromosomal locations and oriented. Fifteen out of twenty chromosome ends terminated in telomeric repeats. The other 102 scaffolds were generally smaller (53.2 Mb, 7.6%), with 85 (83%) containing far greater-than-average abundance of the Cen38 (ref. 10) centromeric repeat, and with only 374 predicted genes.
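The N50 and L50 statistics defined above can be computed from a list of scaffold lengths as in the sketch below. It follows the naming used in this paper (N50 is a scaffold count, L50 is a length); many other tools swap the two names. The function name `n50_l50` is hypothetical and the code is an illustration, not the assembly pipeline's own implementation.

```python
def n50_l50(lengths):
    """Return (N50, L50) using this paper's naming.

    N50: number of the largest scaffolds needed to cover at least half of
         the total assembly length.
    L50: length of the shortest scaffold in that set.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return count, length
    raise ValueError("lengths must contain at least one scaffold")
```

Sorting from largest to smallest means the loop stops at the smallest set of scaffolds that reaches the halfway point, which is what both statistics describe.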
These 102 scaffolds merged only 193 FPC contigs, presumably due to the greater abundance of repeats that are recalcitrant to clone-based physical mapping9 and may be omitted in BAC-by-BAC approaches11.

Genome size evolution and its causes

The ~75% larger quantity of DNA in the genome of sorghum compared with rice is mostly heterochromatin. Alignment to genetic8 and cytological maps12 suggests that sorghum and rice have similar quantities of euchromatin (252 and 309 Mb, respectively; Supplementary Table 7), accounting for 97–98% of recombination (1,025.2 cM and 1,496.5 cM, respectively) and 75.4–94.2% of genes in the respective cereals, with largely collinear gene order9. In contrast, sorghum heterochromatin occupies at least 460 Mb (62%), far more than in rice (63 Mb, 15%). The ~3× genome expansion in maize since its divergence from sorghum13 has been more dispersed: recombinogenic DNA has grown 4.5× to ~1,382 Mb, much more than can be explained by genome duplication14.

The net size expansion of the sorghum genome relative to rice largely involved long terminal repeat (LTR) retrotransposons. The sorghum genome contains 55% retrotransposons, intermediate between the larger maize genome (79%) and smaller rice genome (26%). Sorghum more closely resembles rice in having a higher ratio of gypsy-like to copia-like elements (3.7 to 1 and 4.9 to 1) than maize (1.6 to 1: Supplementary Table 10).

Although recent retroelement activity is widely distributed across the sorghum genome, turnover is rapid (as in other cereals15) with pericentromeric elements persisting longer. Young LTR retrotransposon insertions (<0.01 million years (Myr) ago) appear randomly distributed along chromosomes, suggesting that they are preferentially eliminated from gene-rich regions9 but accumulate in gene-poor regions (Fig. 1; see also Supplementary Note 3.1). Insertion times suggest a major wave of retrotransposition <1 Myr ago, after a smaller wave 1–2 Myr ago (Supplementary Fig. 2).
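Insertion times of this kind are conventionally derived from the divergence between the two LTRs of one element, which are identical at the moment of insertion. The sketch below shows that calculation under stated assumptions: the Jukes-Cantor correction and the substitution rate of 1.3 × 10^-8 per site per year are illustrative choices that are not given in this paper, and both function names are hypothetical. The inputs are assumed to be the two rows of a pairwise alignment of equal length.

```python
import math


def jukes_cantor_distance(ltr5, ltr3):
    """Jukes-Cantor distance between two aligned LTR sequences.

    Columns containing gaps or ambiguous bases are ignored.
    """
    pairs = [(a, b) for a, b in zip(ltr5.upper(), ltr3.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable aligned positions")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        raise ValueError("sequences too divergent for Jukes-Cantor")
    return -0.75 * math.log(1 - 4 * p / 3)


def insertion_age_years(distance, rate=1.3e-8):
    """Age = K / (2r); `rate` (substitutions per site per year) is assumed."""
    return distance / (2 * rate)
```

The factor of two reflects that both LTRs accumulate substitutions independently after insertion.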
CACTA-like elements, the predominant sorghum DNA transposons (4.7% of the genome), seem to relocate genes and gene fragments, as do rice ‘Pack-MULEs’16 and maize helitrons17. Many sorghum CACTA elements are non-autonomous deletion derivatives in which transposon genes have been replaced with non-transposon DNA including exons from one or more cellular genes as exemplified for family G118 (Fig. 2). Among 13,775 CACTA elements identified (Supplementary Note 3.4), 200 encode no transposon proteins but contain at least one cellular gene fragment.

In total, DNA transposons constitute 7.5% of the sorghum genome, intermediate between maize (2.7%) and rice (13.7%; Supplementary Table 10). Miniature inverted-repeat transposable elements, 1.7% of the genome, are associated with genes (Fig. 1; see also Supplementary Note 3) as in other cereals6. Helitrons, ~0.8% of the genome, nearly all lack helicase in sorghum as in maize17, but carry fewer gene fragments in sorghum than maize (Supplementary Note 3.5). Organellar DNA insertion has contributed only 0.085% to the sorghum nuclear genome, far less than the 0.53% of rice (Supplementary Note 2.7).

Figure 1 | Genomic landscape of sorghum chromosomes 3 and 9. Area charts quantify retrotransposons (55%), genes (6% exons, 8% introns), DNA transposons (7%) and centromeric repeats (2%). Lines between chromosomes 3 and 9 connect collinear duplicated genes. Heat-map tracks detail the distribution of selected elements. Figures for all sorghum chromosomes are in Supplementary Note 3. Cen38, sorghum-specific centromeric repeat10; RTs, retrotransposons (class I); LTR-RTs, long terminal repeat retrotransposons; DNA-TEs, DNA transposons (class II). [Track labels: Cen38, retrotransposons, DNA transposons, genes (introns), genes (exons), young LTR-RTs, full-length LTR-RTs, LTR-RT/gypsy, LTR-RT/copia, DNA-TE/CACTA, CpG islands, DNA-TE/MITE, paralogues; horizontal axis, 0–60 Mb.]

Figure 2 | CACTA element deletion derivatives that carry gene fragments. CACTA family G118 has only one complete and presumably autonomous ‘mother’ element. Among 18 deletion derivatives, only the terminal 500–2,500 bp are conserved, with 8 carrying gene fragments internally. One relatively homogeneous subgroup (106, 111 and 112) presumably arose recently, whereas other derivatives are unique. The locations of the hits to known rice proteins are indicated as coloured boxes. The descriptions of the foreign gene fragments are indicated underneath the boxes. HP, hypothetical protein. [Diagram labels: autonomous CACTA-G118 ‘mother’ element (9,043 bp; transposase, ORF2); non-autonomous derivatives G118-101 (6,698 bp), G118-104 (3,609 bp), G118-106 (6,537 bp; found in three copies), G118-110 (5,088 bp), G118-114 (9,829 bp) and G118-116 (4,244 bp); conserved TIR region; foreign gene fragments labelled chloroplast carbonic anhydrase, metal transporter Nramp6, importin β1 subunit, replication factor A, 40S ribosomal S15, plant-specific domain, conserved HP and HP; colour key, >30%, >40%, >50% and >60% identical with rice protein; scale bar, 2 kb.]

The gene complement of sorghum

Among 34,496 sorghum gene models, we found ~27,640 bona fide protein-coding genes by combining homology-based and ab initio gene prediction methods with expressed sequences from sorghum, maize and sugar cane (Supplementary Note 4). Evidence for alternate splicing is found in 1,491 loci.
Another 5,197 gene models are usually shorter than the bona fide genes (often <150 amino acids); have few exons (often one) and no expressed sequence tag (EST) support (compared with 85% for bona fide genes); are more diverged from rice genes; and are often found in large families with ‘hypothetical’, ‘uncharacterized’ and/or retroelement-associated annotations, despite repeat masking (Supplementary Note 4). A high concentration in pericentromeric regions where bona fide genes are scarce (Fig. 1) suggests that many of these low confidence gene models are retroelement-derived. We also identified 727 processed pseudogenes and 932 models containing domains known only from transposons.

The exon size distributions of orthologous sorghum and rice genes agree closely, and intron position and phase show >98% concordance (Supplementary Note 5). Intron size has been conserved between sorghum and rice, although it has increased in maize owing to transpositions18.

Most paralogues in sorghum are proximally duplicated, including 5,303 genes in 1,947 families of ≥2 genes (Supplementary Note 4.3). The longest tandem gene array is 15 cytochrome P450 genes. Other sorghum-specific tandem gene expansions include haloacid dehalogenase-like hydrolases (PF00702), FNIP repeats (PF05725), and male sterility proteins (PF03015).

We confirmed the genomic locations of 67 known sorghum microRNAs (miRNAs) and identified 82 additional miRNAs (Supplementary Note 4.4). Five clusters located within 500 bp of each other represent putative polycistronic miRNAs, similar to those in Arabidopsis and Oryza. Natural antisense miRNA precursors (nat-miRNAs) of family miR444 (ref. 19) have been identified in three copies.

Comparative gene inventories of angiosperms

The number and sizes of sorghum gene families are similar to those of Arabidopsis, rice and poplar (Fig. 3 and Supplementary Note 4.6).
A total of 9,503 (58%) sorghum gene families were shared among all four species and 15,225 (93%) with at least one other species. Nearly 94% (25,875) of high-confidence sorghum genes have orthologues in rice, Arabidopsis and/or poplar, and together these gene complements define 11,502 ancestral angiosperm gene families represented in at least one contemporary grass and rosid genome. However, 3,983 (24%) gene families have members only in the grasses sorghum and rice; 1,153 (7%) appear to be unique to sorghum.

Pfam domains that are over-represented, under-represented or even absent in sorghum relative to rice, poplar and Arabidopsis, may reflect biological peculiarities specific to the Sorghum lineage (Supplementary Table 20). Domains over-represented in sorghum are usually present in the other organisms, a notable exception being the α-kafirin domain that accounts for most seed storage protein and corresponds to maize zeins20 but which is absent from rice.

Nucleotide-binding-site–leucine-rich-repeat (NBS-LRR) containing proteins associated with the plant immune system are only about half as frequent in sorghum as in rice. A search with 12 NBS domains from published rice, maize, wheat and Arabidopsis gene sequences revealed 211 NBS-LRR coding genes in sorghum, 410 in rice and 149 in Arabidopsis21. Sorghum NBS-LRR genes mostly encode the CC type of N-terminal domains. Only two sorghum genes (Sb02g005860 and Sb02g036630) contain the TIR domain, and neither contains an NBS domain. NBS-LRR genes are most abundant on sorghum chromosome 5 (62), and its rice homologue (chromosome 11, 106). Enrichment of NBS-LRR genes in these corresponding genomic regions suggests conservation of R gene location, in contrast to a proposal that R gene movement may be advantageous22.
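The gene-family percentages quoted earlier in this section follow from the 16,378 sorghum clusters given in Figure 3. The short sketch below reproduces that arithmetic as a consistency check; the variable and function names are illustrative only.

```python
SORGHUM_FAMILIES = 16_378  # sorghum clusters reported in Figure 3

family_counts = {
    "shared by all four species": 9_503,
    "shared with at least one other species": 15_225,
    "grass-specific (sorghum and rice only)": 3_983,
    "sorghum-specific": 1_153,
}


def family_percentages(counts, total=SORGHUM_FAMILIES):
    """Express each family count as a whole-number percentage of `total`."""
    return {label: round(100 * n / total) for label, n in counts.items()}


percentages = family_percentages(family_counts)
```

The families shared with at least one other species and the sorghum-specific families together account for all 16,378 clusters, as expected for complementary categories.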
Evolution of distinctive pathways and processes

The evolution of C4 photosynthesis in the Sorghum lineage involved redirection of C3 progenitor genes as well as recruitment and functional divergence of both ancient and recent gene duplicates. The sole sorghum C4 pyruvate orthophosphate dikinase (ppdk) and the phosphoenolpyruvate carboxylase kinase (ppck) gene and its two isoforms (produced by the whole genome duplication) have only single orthologues in rice. Additional duplicates formed in maize after the sorghum–maize split (Zmppck2 and Zmppck3). The C4 NADP-dependent malic enzyme (me) gene has an adjacent isoform but each corresponds to a different maize homologue, suggesting tandem duplication before the sorghum–maize split. The C4 malate dehydrogenase (mdh) gene and its isoform are also adjacent, but share 97% amino acid similarity and correspond to the single known maize Mdh gene, suggesting tandem duplication in sorghum after its split with maize. The rice Me and Mdh genes are single copy, suggesting duplication and recruitment to the C4 pathway after the Panicoideae–Oryzoideae divergence (Supplementary Note 9).

Figure 3 | Orthologous gene families between sorghum, Arabidopsis, rice and poplar. The numbers of gene families (clusters) and the total numbers of clustered genes are indicated for each species and species intersection. [Venn diagram labels: Arabidopsis (13,144; 22,813), sorghum (16,378 clusters; 28,375 genes), poplar (15,288; 34,783), rice (15,148; 20,109). Values shown in the diagram regions: 879, 1,153, 49, 3,983, 1,686, 634, 229, 2,403, 542, 196, 9,503, 631, 25, 139, 96.]

The sorghum sequence reinforces inferences previously based only on rice, about how different grass and dicotyledon gene inventories relate to their respective types of cell walls23,24. In grasses, cellulose microfibrils coated with mixed-linkage (1→3),(1→4)-β-D-glucans are interlaced with glucuronoarabinoxylans and an extensive complex of phenylpropanoids25.
The sorghum sequence largely corroborates differences between dicotyledons and rice in the distribution of cell wall biogenesis genes (Supplementary Note 10). For example, the CesA/Csl superfamily and callose synthases have either diverged to form new subgroups or functionally non-essential subgroups were selectively lost, such as CslB and CslG lost from the grasses, and CslF and CslH lost from species with dicotyledon-like cell walls26. The previously rice-unique CslF and CslH genes are present in sorghum. Arabidopsis contains a single group F GT31 gene, whereas sorghum and rice contain six and ten, respectively.

The characteristic adaptation of sorghum to drought may be partly related to expansion of one miRNA and several gene families. Rice miRNA 169g, upregulated during drought stress27, has five sorghum homologues (sbi-MIR169c, sbi-MIR169d, sbi-MIR169.p2, sbi-MIR169.p6 and sbi-MIR169.p7). The computationally predicted target of the sbi-MIR169 subfamily comprises members of the plant nuclear factor Y (NF-Y) B transcription factor family, linked to improved performance under drought by Arabidopsis and maize28. Cytochrome P450 domain-containing genes, often involved in scavenging toxins such as those accumulated in response to stress, are abundant in sorghum with 326 versus 228 in rice. Expansins, enzymes that break hydrogen bonds and are responsible for a variety of growth responses that could be linked to the durability of sorghum, occur in 82 copies in sorghum versus 58 in rice and 40 each in Arabidopsis and poplar.

Duplication and diversification of cereal genomes

Whole-genome duplication in a common ancestor of cereals is reflected in sorghum and rice gene ‘quartets’ (Fig. 4). A total of 19,929 (57.8%) sorghum gene models were in blocks collinear with rice (Supplementary Note 6).
After the shared whole-genome duplication, only one copy was retained for 13,667 (68.6%) collinear genes with 13,526 (99%) being orthologous in rice–sorghum, indicating that most gene losses predate taxon divergence. Both sorghum and rice retained both copies of 4,912 (14.2%) genes, whereas sorghum lost one copy of 1,070 (3.1%) and rice lost one copy of 634 (1.8%). These patterns are likely to be predictive of other grass genomes, as the major grass lineages diverged from a common ancestor at about the same time29 (see also Supplementary Note 7).

Although most post-duplication gene loss happened in a common cereal ancestor, some lineage-specific patterns occur. A total of 2 and 10 protein functional (Pfam) domains showed enrichment for duplicates and singletons (respectively) in sorghum but not rice (Supplementary Note 6.1). Because the sorghum–rice divergence is thought to have happened 20 Myr or more after genome duplication29, this suggests that even long-term gene loss differentially affects gene functional groups.

One genomic region has been subject to a high level of concerted evolution. It was previously suggested that rice chromosomes 11 and 12 share a ~5–7-Myr-old segmental duplication30–32. We found a duplicated segment in the corresponding regions of sorghum chromosomes 5 and 8 (Fig. 5). Sorghum–sorghum and rice–rice paralogues from this region show rates of synonymous DNA substitution (Ks) of 0.44 and 0.22, respectively, consistent with only 34 and 17 Myr of divergence. However, the Ks value of sorghum–rice orthologues is 0.63, similar to the respective genome-wide averages (0.81, 0.87). We hypothesize that the apparent segmental duplication actually resulted from the pan-cereal whole-genome duplication and became differentiated from the remainder of the chromosome(s) owing to concerted evolution acting independently in sorghum, rice and perhaps other cereals.
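The conversion from Ks to divergence time used for these paralogues can be illustrated with the standard relation T = Ks / (2λ). The synonymous substitution rate λ is not stated in this section; the value of 6.5 × 10^-9 substitutions per synonymous site per year used below is an assumption, chosen because it reproduces the quoted pairs (0.44 and 34 Myr, 0.22 and 17 Myr). The function name `ks_to_myr` is hypothetical.

```python
def ks_to_myr(ks, rate_per_site_per_year=6.5e-9):
    """Convert a synonymous substitution rate Ks into divergence time in Myr.

    T = Ks / (2 * rate): both gene copies accumulate substitutions, hence
    the factor of two. The default rate is an assumed value, not one taken
    from this paper.
    """
    return ks / (2 * rate_per_site_per_year) / 1e6
```

Under this assumed rate, a Ks of 0.44 rounds to 34 Myr and a Ks of 0.22 rounds to 17 Myr, matching the figures in the text.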
Gene conversion and illegitimate recombination are more frequent in the rice 11–12 region than elsewhere in the genome33. Physical and genetic maps suggest shared terminal segments of the corresponding chromosomes in wheat (4, 5)34, foxtail millet (VII, VIII) and pearl millet (linkage groups 1, 4)35.

Synthesis and implications

Comparison of the sorghum, rice and other genomes clarifies the grass gene set. Pairs of orthologous sorghum and rice genes combined with recent paralogous duplications define 19,542 conserved grass gene families, each representing one gene in the sorghum–rice common ancestor. Our sorghum gene count is similar to that in a manually curated rice annotation (RAP2)36, but this similarity masks some differences. About 2,054 syntenic orthologues shared by our sorghum annotation and the TIGR5 (ref. 37) rice annotation are absent from RAP2. Conversely, ~12,000 TIGR5 annotations may be transposable elements or pseudogenes, comprising large families of hypothetical genes in both sorghum and rice RAP2, often with short exons, few introns and limited EST support. Phylogenetically incongruent cases of apparent gene loss (for example, genes shared by Arabidopsis and sorghum but not rice: Fig. 3) may also suggest sequence gaps or misannotations.

Grass genome architecture may reflect euchromatin-specific effects of recombination and selection, superimposed on non-adaptive processes of mutation and genetic drift that apply to all genomic regions38. Patterns of gene and repetitive DNA organization remain correlated in homologous chromosomes duplicated 70 Myr ago (Fig. 1), despite extensive turnover of specific repetitive elements. Synteny is highest and retroelement abundance lowest in distal chromosomal regions.
More rapid retroelement removal from gene-rich euchromatin that frequently recombines than from heterochromatin that rarely recombines supports the hypothesis that recombination may preserve gene structure, order and/or spacing by exposing new insertions to selection9. Less euchromatin–heterochromatin polarization in maize, where retrotransposon persistence in euchromatin seems more frequent, may reflect variation in grass genome architecture or perhaps a lingering consequence of more recent genome duplication39.

Identification of conserved DNA sequences may help us to understand essential genes and binding sites that define grasses. Progress in sequencing Brachypodium distachyon40 sets the stage for panicoid–oryzoid–pooid phylogenetic triangulation of genomic changes, as well as association of some such changes with phenotypes ranging from molecular (gene expression patterns) to morphological. The divergence between sorghum, rice and Brachypodium is sufficient to randomize nonfunctional sequence, facilitating conserved noncoding sequence (CNS) discovery41,42 (Supplementary Fig. 9). More distant comparisons to the dicotyledon Arabidopsis show exon conservation but no CNS (Supplementary Fig. 10). Chloridoid and arundinoid genome sequences are needed to sample the remaining grass lineages, and an outgroup such as Ananas (pineapple) or Musa (banana) would further aid in identifying genes and sequences that define grasses.

The fact that the sorghum genome has not re-duplicated in ~70 Myr29 makes it a valuable outgroup for deducing fates of gene pairs and CNS in grasses that have reduplicated. Single sorghum regions correspond to two regions resulting from maize-specific genome doubling39; gene fractionation is evident (Fig. 4), and subfunctionalization is probable (Supplementary Fig. 10). Sorghum may prove especially valuable for unravelling genome evolution in the more closely related Saccharum–Miscanthus clade: two genome duplications since its divergence from sorghum 8–9 Myr ago43 complicate sugar cane genetics44 yet Saccharum BACs show substantially conserved gene order with sorghum (Supplementary Note 11).

Figure 4 | Alignment of sorghum, rice and maize. Dot plots show intergenomic (gold) and intragenomic (black) alignments. One sorghum–rice quartet showing both orthologous and paralogous (duplicated) regions is magnified. Infrequent gene loss (red; see legend) after sorghum–rice divergence causes ‘special cases’ in which there are paralogues but no orthologues. Each sorghum region corresponds to two duplicated maize regions39, with maize gene loss suggested where sorghum loci only match one of the two. Because maize BACs are mostly unfinished, sorghum loci are aligned to the centres. Note the different scale necessary for maize physical distance. Larger dot plots are in Supplementary Note 6. [Magnified regions: Zea c4, 155.3–156.2 Mb; Zea c5, 199.2–200.2 Mb; Sorghum c4, 61.0–60.8 Mb; Oryza c2, 29.6–29.7 Mb; Oryza c4, 30.5–30.7 Mb; Sorghum c6, 56.2–56.3 Mb; Zea c10, 123.7–124.3 Mb; Zea c2, 12.4–11.6 Mb. Dot-plot axes: Oryza C1–C12 and Sorghum C1–C10. Key: Zea BAC, sorghum gene, Oryza gene, cereal duplication, sorghum–Oryza orthologue, sorghum–Zea orthologue, gene loss. Scale bars: Sorghum/Oryza, 10 kb; Zea, 80 kb.]

Conservation of grass gene structure and order facilitates development of DNA markers to support crop improvement. We identified ~71,000 simple-sequence repeats (SSRs) in sorghum (Supplementary List 1); among a sampling of 212, only 9 (4.2%) map to paralogues of their source locus. Conserved-intron scanning primers (Supplementary List 2) for 6,760 genes provide DNA markers useful across many monocotyledons, particularly valuable for ‘orphan cereals’45.
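As an illustration of how simple-sequence repeats can be located in a sequence, the sketch below uses a regular expression with a back-reference to find perfect tandem repeats. The criteria are assumptions for illustration only (motifs of 2–6 bp, at least five tandem copies, homopolymer runs ignored); the parameters actually used to compile Supplementary List 1 are not given in this section, and the function name `find_ssrs` is hypothetical.

```python
import re


def find_ssrs(sequence, min_repeats=5):
    """Find perfect tandem repeats of 2-6 bp motifs.

    Returns a list of (start, motif, copies) tuples with 0-based starts.
    The lazy quantifier makes the regex report the shortest motif that
    explains a repeat tract (for example "AT" rather than "ATAT").
    """
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAAAA
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits
```

A production SSR survey would normally also handle imperfect repeats and compound motifs, which this sketch deliberately leaves out.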
As the first sequenced plant genome of African origin, sorghum adds new dimensions to ethnobotanical research. Of particular interest will be the identification of alleles selected during the earliest stages of sorghum cultivation, which are valuable for testing the hypothesis that convergent mutations in corresponding genes contributed to independent domestications of divergent cereals46. Invigorated sorghum improvement would benefit regions such as the African ‘Sahel’ where drought tolerance makes sorghum a staple for human populations that are increasing by 2.8% per year. Sorghum yield improvement has lagged behind that of other grains, in Africa only gaining a total of 37% (western) to 38% (eastern) from 1961–63 to 2005–07 (Supplementary Note 12).

[Figure 5 image not reproduced. Panels: Oryza chr 11, Oryza chr 12, Sorghum chr 5, Sorghum chr 8. Ks colour scale: 0, 0.2, 0.4, 1.0.]

Figure 5 | Independent illegitimate recombination in corresponding regions of sorghum and rice. Four homologous rice and sorghum chromosomes (11 and 12 in rice; 5 and 8 in sorghum) are shown, with gene densities plotted. ‘L’ and ‘S’ show long and short arms, respectively. Lines show Ks between homologous gene pairs, and colours are used to show different dates of conversion events.

METHODS SUMMARY

Genome sequencing. Approximately 8.5-fold redundant paired-end shotgun sequencing was performed using standard Sanger methodologies from small (~2–3 kb) and medium (5–8 kb) insert plasmid libraries, one fosmid library (~35 kb inserts), and two BAC libraries (insert size 90 and 108 kb). (Supplementary Note 1.)

Integration of shotgun assembly with genetic and physical maps. The largest 201 scaffolds, all exceeding 39 kb, excluding ‘N’s, and collectively representing 678,902,941 bp (97.3%) of nucleotides, were checked for possible chimaeras suggested by the sorghum genetic map, sorghum physical map, abrupt changes in gene or repeat density, rice gene order, and coverage by BAC or fosmid clones (Supplementary Note 2).

Repeat analysis. De novo searches for LTR retrotransposons used LTR_STRUC. De novo detection of CACTA-DNA transposons and MITEs used custom programs (Supplementary Note 3). Known repeats were identified by RepeatMasker (Open-3-1-8) (http://www.repeatmasker.org) with mips-REdat_6.2_Poaceae, a compilation of grass repeats including sorghum-specific LTR retrotransposons (http://mips.gsf.de/proj/plant/webapp/recat/). The insertion age of full-length LTR-retrotransposons was determined from the evolutionary distance between 5′ and 3′ soloLTR derived from a ClustalW alignment of the two soloLTRs.

Protein-coding gene annotation. Putative protein-coding loci were identified based on BLAST alignments of rice and Arabidopsis peptides and sorghum and maize ESTs. GenomeScan47 was applied using maize-specific parameters. Predicted coding structures were merged with EST data from maize and sorghum using PASA48.

Intergenomic and intragenomic alignments. Dot plots used ColinearScan49 and multi-alignments used MCScan50, applied to RAP2 (ref. 36; mapped representative models, 29,389 loci) and the sbi1.4 annotation set (34,496 loci). Pairwise BLASTP (E < 1 × 10^-5, top five hits), both within each genome and between the two genomes, was used to retrieve potential anchors. Zea BAC sequences and FPC contig coordinates were downloaded (http://www.maizesequence.org, release 7 January 2008). Zea BACs were searched for potential orthologues of Sorghum coding sequences using translated BLAT with a minimum score of 100.

Received 20 August; accepted 9 December 2008.

1. Hatch, M. D. & Slack, C. R. Photosynthesis by sugar-cane leaves: a new carboxylation reaction and pathway of sugar formation. Biochem. J. 101, 103 (1966).
2. Paterson, A. H. et al. The weediness of wild plants: molecular analysis of genes influencing dispersal and persistence of johnsongrass, Sorghum halepense (L.) Pers. Proc. Natl Acad. Sci. USA 92, 6127–6131 (1995).
3. Hamblin, M. T. et al. Equilibrium processes cannot explain high levels of short- and medium-range linkage disequilibrium in the domesticated grass Sorghum bicolor. Genetics 171, 1247–1256 (2005).
4. Morrell, P. L. et al. Crop-to-weed introgression has impacted allelic composition of johnsongrass populations with and without recent exposure to cultivated sorghum. Mol. Ecol. 14, 2143–2154 (2005).
5. Gardner, R. C. et al. The complete nucleotide sequence of an infectious clone of cauliflower mosaic virus by M13mp7 shotgun sequencing. Nucleic Acids Res. 9, 2871–2888 (1981).
6. Matsumoto, T. et al. The map-based sequence of the rice genome. Nature 436, 793–800 (2005).
7. Vieira, J. & Messing, J. The pUC plasmids, an M13mp7-derived system for insertion mutagenesis and sequencing with synthetic universal primers. Gene 19, 259–268 (1982).
8. Bowers, J. E. et al. A high-density genetic recombination map of sequence-tagged sites for Sorghum, as a framework for comparative structural and evolutionary genomics of tropical grains and grasses. Genetics 165, 367–386 (2003).
9. Bowers, J. E. et al. Comparative physical mapping links conservation of microsynteny to chromosome structure and recombination in grasses. Proc. Natl Acad. Sci. USA 102, 13206–13211 (2005).
10. Miller, J. T. et al. Cloning and characterization of a centromere-specific repetitive DNA element from Sorghum bicolor. Theor. Appl. Genet. 96, 832–839 (1998).
11. Venter, J. C. et al. Shotgun sequencing of the human genome. Science 280, 1540–1542 (1998).
12. Kim, J. S. et al. Chromosome identification and nomenclature of Sorghum bicolor. Genetics 169, 1169–1173 (2005).
13. Swigonova, Z. et al. Close split of sorghum and maize genome progenitors. Genome Res. 14, 1916–1923 (2004).
14. Swigonova, Z. et al. On the tetraploid origin of the maize genome. Comp. Funct. Genomics 5, 281–284 (2004).
15. Swigonova, Z., Bennetzen, J. L. & Messing, J. Structure and evolution of the r/b chromosomal regions in rice, maize and sorghum. Genetics 169, 891–906 (2005).
16. Jiang, N. et al. Pack-mule transposable elements mediate gene evolution in plants. Nature 431, 569–573 (2004).
17. Brunner, S. et al. Evolution of DNA sequence nonhomologies among maize inbreds. Plant Cell 17, 343–360 (2005).
18. Haberer, G. et al. Structure and architecture of the maize genome. Plant Physiol. 139, 1612–1624 (2005).
19. Lu, C. et al. Genome-wide analysis for discovery of rice microRNAs reveals natural antisense microRNAs (nat-miRNAs). Proc. Natl Acad. Sci. USA 105, 4951–4956 (2008).
20. Xu, J.-H. & Messing, J. Organization of the prolamin gene family provides insight into the evolution of the maize genome and gene duplications in grass species. Proc. Natl Acad. Sci. USA 105, 14330–14335 (2008).
21. Meyers, B. C. et al. Genome-wide analysis of NBS-LRR-encoding genes in Arabidopsis. Plant Cell 15, 809–834 (2003).
22. Leister, D. Tandem and segmental gene duplication and recombination in the evolution of plant disease resistance genes. Trends Genet. 20, 116–122 (2004).
23. Carpita, N. C. & Gibeaut, D. M. Structural models of primary cell walls in flowering plants: consistency of molecular structure with the physical properties of the walls during growth. Plant J. 3, 1–30 (1993).
24. McCann, M. C. & Roberts, K. in The Cytoskeletal Basis of Plant Growth and Form (ed. Lloyd, C. W.) 109–129 (Academic Press, 1991).
25. Carpita, N. C. Structure and biogenesis of the cell walls of grasses. Annu. Rev. Plant Physiol. Plant Mol. Biol. 47, 445–476 (1996).
26. Hazen, S. P. et al. Quantitative trait loci and comparative genomics of cereal cell wall composition. Plant Physiol. 132, 263–271 (2003).
27. Zhao, B. T. et al. Identification of drought-induced microRNAs in rice. Biochem. Biophys. Res. Commun. 354, 585–590 (2007).
28. Nelson, D. E. et al. Plant nuclear factor Y (NF-Y) B subunits confer drought tolerance and lead to improved corn yields on water-limited acres. Proc. Natl Acad. Sci. USA 104, 16450–16455 (2007).
29. Paterson, A. H., Bowers, J. E. & Chapman, B. A. Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc. Natl Acad. Sci. USA 101, 9903–9908 (2004).
30. Wang, X. et al. Duplication and DNA segmental loss in rice genome and their implications for diploidization. New Phytol. 165, 937–946 (2005).
31. Yu, J. et al. The genomes of Oryza sativa: A history of duplications. PLoS Biol. 3, 266–281 (2005).
32. The Rice Chromosomes 11 and 12 Sequencing Consortia. The sequence of rice chromosomes 11 and 12, rich in disease resistance genes and recent gene duplications. BMC Biol. 3, 20 (2005).
33. Wang, X. et al. Extensive concerted evolution of rice paralogs and the road to regaining independence. Genetics 177, 1753–1763 (2007).
34. Singh, N. K. et al. Single-copy genes define a conserved order between rice and wheat for understanding differences caused by duplication, deletion, and transposition of genes. Funct. Integr. Genomics 7, 17–35 (2007).
35. Devos, K. M., Pittaway, T. S., Reynolds, A. & Gale, M. D. Comparative mapping reveals a complex relationship between the pearl millet genome and those of foxtail millet and rice. Theor. Appl. Genet. 100, 190–198 (2000).
36. Tanaka, T. et al. The rice annotation project database (RAP-DB): 2008 update. Nucleic Acids Res. 36, D1028–D1033 (2008).
37. Ouyang, S. et al. The TIGR rice genome annotation resource: Improvements and new features. Nucleic Acids Res. 35, D883–D887 (2007).
38. Lynch, M. & Conery, J. S. The origins of genome complexity. Science 302, 1401–1404 (2003).
39. Wei, F. et al. Physical and genetic structure of the maize genome reflects its complex evolutionary history. PLoS Genet. 3, e123 (2007).
40. Huo, N. et al. The nuclear genome of Brachypodium distachyon: Analysis of BAC end sequences. Funct. Integr. Genomics 8, 135–147 (2007).
41. Margulies, E. H. et al. An initial strategy for the systematic identification of functional elements in the human genome by low-redundancy comparative sequencing. Proc. Natl Acad. Sci. USA 102, 4795–4800 (2005).
42. Eddy, S. R. A model of the statistical power of comparative genome sequence analysis. PLoS Biol. 3, 95–102 (2005).
43. Jannoo, N. et al. Orthologous comparison in a gene-rich region among grasses reveals stability in the sugarcane polyploid genome. Plant J. 50, 574–585 (2007).
44. Ming, R. et al. Sugarcane improvement through breeding and biotechnology. Plant Breed. Rev. 27, 15–118 (2005).
45. Lohithaswa, H. C. et al. Leveraging the rice genome sequence for comparative genomics in monocots. Theor. Appl. Genet. 115, 237–243 (2007).
46. Paterson, A. H. et al. Convergent domestication of cereal crops by independent mutations at corresponding genetic loci. Science 269, 1714–1718 (1995).
47. Yeh, R.-F., Lim, L. P. & Burge, C. Computational inference of homologous gene structures in the human genome. Genome Res. 11, 803–816 (2001).
48. Haas, B. J. et al. Full-length messenger RNA sequences greatly improve genome annotation. Genome Biol. 3, research0029.0021–0029.0012 (2002).
49. Wang, X. Y. et al. Statistical inference of chromosomal homology based on gene colinearity and applications to Arabidopsis and rice. BMC Bioinform. 7, 447 (2006).
50. Tang, H. et al. Synteny and colinearity in plant genomes. Science 320, 486–488 (2008).

Supplementary Information is linked to the online version of the paper at www.nature.com/nature.

Acknowledgements We thank the US Department of Energy Joint Genome Institute Community Sequencing Program, J. Bristow, S.
Lucas and the JGI production sequencing team for sequencing sorghum; and L. Lin for contributions to Fig. 1. We appreciate funding from the US National Science Foundation (NSF DBI-9872649, 0115903; MCB-0450260), International Consortium for Sugarcane Biotechnology, National Sorghum Producers, and a John Simon Guggenheim Foundation fellowship to A.H.P.; US Department of Energy (DE-FG05-95ER20194) to J.M.; German Federal Ministry of Education GABI initiative to MIPS (0313117 and 0314000C); NSF DBI-0321467 to A.N.; and US Department of Agriculture-Agricultural Research Service to C.A.M., L.Z. and D.W.

Author Information Reprints and permissions information is available at www.nature.com/reprints. Correspondence and requests for materials should be addressed to A.H.P. ([email protected]).

REPORTS

Patrick S. Schnable,1,2,3,4* Doreen Ware,5,6* Robert S. Fulton,7† Joshua C. Stein,6† Fusheng Wei,8† Shiran Pasternak,6 Chengzhi Liang,6 Jianwei Zhang,8 Lucinda Fulton,7 Tina A. Graves,7 Patrick Minx,7 Amy Denise Reily,7 Laura Courtney,7 Scott S. Kruchowski,7 Chad Tomlinson,7 Cindy Strong,7 Kim Delehaunty,7 Catrina Fronick,7 Bill Courtney,7 Susan M. Rock,7 Eddie Belter,7 Feiyu Du,7 Kyung Kim,7 Rachel M. Abbott,7 Marc Cotton,7 Andy Levy,7 Pamela Marchetto,7 Kerri Ochoa,7 Stephanie M.
Jackson,7 Barbara Gillam,7 Weizu Chen,7 Le Yan,7 Jamey Higginbotham,7 Marco Cardenas,7 Jason Waligorski,7 Elizabeth Applebaum,7 Lindsey Phelps,7 Jason Falcone,7 Krishna Kanchi,7 Thynn Thane,7 Adam Scimone,7 Nay Thane,7 Jessica Henke,7 Tom Wang,7 Jessica Ruppert,7 Neha Shah,7 Kelsi Rotter,7 Jennifer Hodges,7 Elizabeth Ingenthron,7 Matt Cordes,7 Sara Kohlberg,7 Jennifer Sgro,7 Brandon Delgado,7 Kelly Mead,7 Asif Chinwalla,7 Shawn Leonard,7 Kevin Crouse,7 Kristi Collura,8 Dave Kudrna,8 Jennifer Currie,8 Ruifeng He,8 Angelina Angelova,8 Shanmugam Rajasekar,8 Teri Mueller,8 Rene Lomeli,8 Gabriel Scara,8 Ara Ko,8 Krista Delaney,8 Marina Wissotski,8 Georgina Lopez,8 David Campos,8 Michele Braidotti,8 Elizabeth Ashley,8 Wolfgang Golser,8 HyeRan Kim,8 Seunghee Lee,8 Jinke Lin,8 Zeljko Dujmic,8 Woojin Kim,8 Jayson Talag,8 Andrea Zuccolo,8 Chuanzhu Fan,8 Aswathy Sebastian,8 Melissa Kramer,6 Lori Spiegel,6 Lidia Nascimento,6 Theresa Zutavern,6 Beth Miller,6 Claude Ambroise,6 Stephanie Muller,6 Will Spooner,6 Apurva Narechania,6 Liya Ren,6 Sharon Wei,6 Sunita Kumari,6 Ben Faga,6 Michael J. Levy,6 Linda McMahan,6 Peter Van Buren,6 Matthew W. Vaughn,6 Kai Ying,3 Cheng-Ting Yeh,1,2 Scott J. Emrich,9,10 Yi Jia,3 Ananth Kalyanaraman,9,11 An-Ping Hsia,1,2 W. Brad Barbazuk,12 Regina S. Baucom,13 Thomas P. Brutnell,14 Nicholas C. Carpita,15 Cristian Chaparro,16 Jer-Ming Chia,6 Jean-Marc Deragon,16 James C. Estill,13,17 Yan Fu,2,4 Jeffrey A. Jeddeloh,18 Yujun Han,13,17 Hyeran Lee,19 Pinghua Li,14 Damon R. Lisch,20 Sanzhen Liu,3 Zhijie Liu,6 Dawn Holligan Nagel,13,17 Maureen C. McCann,21 Phillip SanMiguel,22 Alan M. Myers,23 Dan Nettleton,24 John Nguyen,25 Bryan W. Penning,15,21 Lalit Ponnala,26 Kevin L. Schneider,27 David C. Schwartz,28 Anupma Sharma,27 Carol Soderlund,29 Nathan M. Springer,30 Qi Sun,26 Hao Wang,13,17 Michael Waterman,25 Richard Westerman,22 Thomas K. Wolfgruber,27 Lixing Yang,13 Yeisoo Yu,29 Lifang Zhang,6 Shiguo Zhou,28 Qihui Zhu,13,17 Jeffrey L. Bennetzen,13 R. 
Kelly Dawe,13,17 Jiming Jiang,19 Ning Jiang,31 Gernot G. Presting,27 Susan R. Wessler,13,17 Srinivas Aluru,1,9,32 Robert A. Martienssen,6 Sandra W. Clifton,7 W. Richard McCombie,6 Rod A. Wing,8 Richard K. Wilson7,33‡

We report an improved draft nucleotide sequence of the 2.3-gigabase genome of maize, an important crop plant and model for biological research. Over 32,000 genes were predicted, of which 99.8% were placed on reference chromosomes. Nearly 85% of the genome is composed of hundreds of families of transposable elements, dispersed nonuniformly across the genome. These were responsible for the capture and amplification of numerous gene fragments and affect the composition, sizes, and positions of centromeres. We also report on the correlation of methylation-poor regions with Mu transposon insertions and recombination, and copy number variants with insertions and/or deletions, as well as how uneven gene losses between duplicated regions were involved in returning an ancient allotetraploid to a genetically diploid state. These analyses inform and set the stage for further investigations to improve our understanding of the domestication and agricultural improvements of maize.

Maize (Zea mays ssp. mays L.) was domesticated over the past ~10,000 years from the grass teosinte in Central America (1) and has been subject to cultivation and selection ever since. Maize is an important model organism for fundamental research into the inheritance and functions of genes, the physical linkage of genes to chromosomes, the mechanistic relation between cytological crossovers and recombination, the origin of the nucleolus, the properties of telomeres, epigenetic silencing, imprinting, and transposition (2). Maize also is an important crop, yielding in the USA alone 12 billion (B = 10^9) bushels of grain from ~86 million acres with a value of $47 B [2008 data from (3)]. Over the last century, breeders have increased grain yields eightfold (4), in part by harnessing heterosis (hybrid vigor), a universal, but poorly understood, phenomenon that can increase yields of hybrids by 15 to 60% relative to inbred parents (5).

The maize genome has undergone several rounds of genome duplication, including that of a paleopolyploid ancestor ~70 million years ago (mya) (6) and an additional whole-genome duplication event about 5 to 12 mya (7, 8), which distinguishes maize from its close relative, Sorghum bicolor (9). The 10 chromosomes of the maize genome are structurally diverse and have undergone dynamic changes in chromatin composition. The size of the maize genome has expanded dramatically (to 2.3 gigabases) over the last ~3 million years via a proliferation of long terminal repeat retrotransposons (LTR retrotransposons) (10).

We sequenced the maize genome using a minimum tiling path of bacterial artificial chromosomes (BACs) (n = 16,848) and fosmid (n = 63) clones derived from an integrated physical and genetic map (11, 12), augmented by comparisons with an optical map (13). Clones were shotgun sequenced (four- to sixfold coverage), followed by automated and manual sequence improvement (14) of the unique regions only, which resulted in the B73 reference genome version 1 (B73 RefGen_v1). We identified the full complement of maize transposable elements (TEs) accessible from B73 RefGen_v1, which includes active class II DNA TEs and an abundance of class I RNA TEs (15).

1Center for Plant Genomics, Iowa State University, Ames, IA 50011, USA. 2Department of Agronomy, Iowa State University, Ames, IA 50011, USA. 3Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, USA. 4Center for Carbon Capturing Crops, Iowa State University, Ames, IA 50011, USA. 5U.S. Department of Agriculture (USDA), North Atlantic Area, Robert Holley Center for Agriculture and Health, Ithaca, NY 14853, USA. 6Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA. 7The Genome Center at Washington University, St. Louis, MO 63108, USA. 8Arizona Genomics Institute, School of Plant Sciences and Department of Ecology and Evolutionary Biology, BIO5 Institute for Collaborative Research, University of Arizona, Tucson, AZ 85721, USA. 9Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA. 10Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA. 11School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA 99164, USA. 12Department of Botany, University of Florida, Gainesville, FL 32611, USA. 13Department of Genetics, University of Georgia, Athens, GA 30602, USA. 14Boyce Thompson Institute, Cornell University, Ithaca, NY 14853, USA. 15Department of Botany and Plant Pathology, Purdue University, West Lafayette, IN 47907, USA. 16Université de Perpignan Via Domitia, CNRS, Perpignan, France. 17Department of Plant Biology, University of Georgia, Athens, GA 30602, USA. 18NimbleGen, Madison, WI 53711, USA. 19Department of Horticulture, University of Wisconsin–Madison, Madison, WI 53706, USA. 20Department of Plant Biology, University of California, Berkeley, CA, 94720, USA. 21Department of Biological Sciences, Purdue University, West Lafayette, IN 47907, USA. 22Department of Horticulture and Landscape Architecture, Purdue University, West Lafayette, IN 47907, USA. 23Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, 50011, USA. 24Department of Statistics, Iowa State University, Ames, IA 50011, USA. 25Departments of Mathematics, Biology, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA. 26Cornell University Computational Biology Service Unit, Cornell University, Ithaca, NY 14850, USA. 27Molecular Biosciences and Bioengineering, University of Hawaii, Honolulu, HI 96822, USA. 28Laboratory for Molecular and Computational Genomics, Department of Chemistry, Laboratory of Genetics, University of Wisconsin–Madison, Madison, WI 53706, USA. 29BIO5 Institute for Collaborative Research, University of Arizona, Tucson, AZ 85721, USA. 30Department of Plant Biology, University of Minnesota, St. Paul, MN 55108, USA. 31Department of Horticulture, Michigan State University, East Lansing, MI 48824, USA. 32Indian Institute of Technology, Bombay, India. 33Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA.

*These authors contributed equally to this work. †These authors contributed equally to data production and analysis. ‡To whom correspondence should be addressed. E-mail: [email protected]

20 NOVEMBER 2009 VOL 326 SCIENCE www.sciencemag.org. Downloaded from www.sciencemag.org on February 28, 2010.

The B73 Maize Genome: Complexity, Diversity, and Dynamics

Almost 85% of the B73 RefGen_v1 consists of TEs (table S2). Indeed, the existence of TEs (16), as well as the first members of the CACTA (Spm/En), hAT (Ac), PIF/Harbinger and Mutator superfamilies, and MITE family (Tourist), were all initially discovered in maize (17). Further, both the existence and unparalleled abundance of LTR retrotransposons in plants were originally discovered in maize (18).

The B73 RefGen_v1 contains 855 families of DNA TEs that make up 8.6% of the genome; most of these (82%) were identified in this study (table S2) (14). The most complex of these superfamilies is Mutator, with dramatic variation in element sequence and size, including 262 PackMULEs (Mutator-like elements that contain gene fragments) carrying fragments of 226 nuclear genes. About 40,000 nonredundant Mu insertion sites were amplified from Mu-active lines, sequenced, and mapped to B73 RefGen_v1.
The nonuniformly distributed Mu insertion sites colocalize with gene-rich regions of the genome that have the highest rates of meiotic recombination per megabase (Fig. 1) (19). Like Mu, most maize DNA TEs (but not the CACTA elements) were enriched in the gene-rich, recombinationally active chromosome ends (Fig. 1 and fig. S1).

Helitrons, a class of DNA elements believed to transpose by a rolling-circle mechanism (20), are present in plants, animals, and fungi, but are particularly active, variable, and abundant in maize (21). Maize contains eight families of Helitrons with a combined copy number of ~20,000, which are particularly active in gene fragment acquisition (table S2).

Fig. 1. The maize B73 reference genome (B73 RefGen_v1): Concentric circles show aspects of the genome. Chromosome structure (A). Reference chromosomes with physical fingerprint contigs (11) as alternating gray and white bands. Presumed centromeric positions are indicated by red bands (31); enlarged for emphasis. Genetic map (B). Genetic linkage across the genome, on the basis of 6363 genetically and physically mapped markers (14, 19). Mu insertions (C). Genome mappings of nonredundant Mu insertion sites (14, 19). Methyl-filtration reads (D). Enrichment and depletion of methyl filtration. For each nonoverlapping 1-Mb window, read counts were divided by the total number of mapped reads. Repeats (E). Sequence coverage of TEs with RepeatMasker with all identified intact elements in maize. Genes (F). Density of genes in the filtered gene set across the genome, from a gene count per 1-Mb sliding window at 200-kb intervals. Sorghum synteny (G) and rice synteny (H). Syntenic blocks between maize and related cereals on the basis of 27,550 gene orthologs. Underlined blocks indicate alignment in the reverse strand. Homoeology map (I). Oriented homoeologous sites of duplicated gene blocks within maize.
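The gene-density track in Fig. 1 (panel F) is described as a gene count per 1-Mb sliding window at 200-kb intervals. A minimal Python sketch of such a calculation follows. Assigning each gene to a window by its start coordinate, using half-open windows, truncating the last windows at the chromosome end, and the name `gene_density` are all assumptions, since these details are not given in this section.

```python
import bisect


def gene_density(gene_starts, chrom_length, window=1_000_000, step=200_000):
    """Count genes per sliding window along one chromosome.

    gene_starts: iterable of 0-based gene start coordinates (assumed input).
    Returns a list of (window_start, window_end, gene_count) tuples, where
    each window is the half-open interval [window_start, window_end).
    """
    starts = sorted(gene_starts)
    densities = []
    pos = 0
    while pos < chrom_length:
        end = min(pos + window, chrom_length)  # truncate at chromosome end
        # Binary search gives the number of starts falling in [pos, end).
        count = bisect.bisect_left(starts, end) - bisect.bisect_left(starts, pos)
        densities.append((pos, end, count))
        pos += step
    return densities


# Toy chromosome of 1.4 Mb with four hypothetical gene starts.
toy_genes = [100_000, 250_000, 900_000, 1_150_000]
for start, end, count in gene_density(toy_genes, chrom_length=1_400_000):
    print(start, end, count)
```

Because the windows overlap by 800 kb, each gene contributes to up to five consecutive windows, which smooths the track relative to nonoverlapping bins such as those used for the methyl-filtration reads in panel D.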
In maize, we observed that Helitrons are located predominantly within gene-rich regions, whereas, in all previously studied plant and animal genomes, they are enriched in gene-poor regions (22, 23).

LTR retrotransposons compose >75% of the B73 RefGen_v1 and are diverse. Most of the 406 families have fewer than 10 copies. LTR retrotransposons exhibited family-specific, nonuniform distributions along chromosomes, e.g., Copia-like elements are overrepresented in gene-rich euchromatic regions, whereas Gypsy-like elements are overrepresented in gene-poor heterochromatic regions (fig. S1) (24, 25). We observed more than 180 acquisitions of nuclear gene fragments inside LTR retrotransposons (table S2).

Protein-encoding and microRNA (miRNA) (26) genes were predicted from assembled or improved BAC contigs by a combination of evidence-based (27) and ab initio approaches, projected to B73 RefGen_v1, and subsequently filtered to a set of 32,540 protein-encoding and 150 miRNA genes (14) (fig. S2). Exon sizes of maize genes were similar to those of their orthologous genes in rice and sorghum, but maize genes contained more large introns because of insertion of repetitive elements (11, 28) (figs. S3 and S4 and tables S5 and S6).

A comparative analysis with rice, sorghum, and Arabidopsis revealed similar numbers of gene families (14) (Fig. 2), of which a core set of 8494 families is shared among all four species; of the 11,892 maize families, all but 465 are conserved with at least one other species. Species- and lineage-specific families point out potential inconsistencies between annotation projects, but also reflect genuine biological differences in gene inventories.

Because of the stringent criteria used for including genes in the filtered gene set (14), we expected to miss some genes. About 95% of a collection of 63,851 full-length maize cDNAs (fl-cDNAs) (29, 30) mapped to B73 RefGen_v1.
On the basis of the ratio of fl-cDNA to supported genes in the filtered set, we estimated that this set accounts for at least 85% of all genes in the B73 RefGen_v1 (14).

Fig. 2. Venn diagram showing unique and shared gene families between and among the three sequenced grasses (maize, rice, and sorghum) and the dicot, Arabidopsis.

The maximum rate of false-positive gene annotations was estimated by aligning ~112 million RNA-seq (transcriptome sequencing) reads from various tissues to the filtered gene set (14) (figs. S10 and S11). These experiments provided evidence for the transcription of ~91% of the genes in the filtered gene set (29,541 out of 32,540). Manual annotation of 200 randomly chosen genes from the filtered gene set indicated that only two are likely to be TE-derived. Additional manual annotation of smaller sets of selected genetically well-characterized genes (tables S8 to S10) indicated that the vast majority of genes and proteins predicted in the filtered gene set are mostly correct.

Maize centromeres were found to contain variable amounts of the tandem CentC satellite repeat and centromeric retrotransposon elements of maize (CRMs). On the basis of comparisons to B73 whole-genome shotgun data, we initially identified about half of the genome’s CentC content (table S13). We captured additional CentC sequence by draft sequencing 101 centromeric repeat–containing BACs and anchoring them to the genetic and physical maps, thereby localizing all of the centromeres (31). We delineated the functional centromeres on the basis of their centromere-specific histone H3 (CENH3) (32) by using chromatin immunoprecipitation (ChIP) with an antibody against CENH3, followed by pyrosequencing.
The centromere regions delineated in this way, although mostly incomplete, correlated with a high density of CentC and CRM1/CRM2/CRM3 repeats, but a number of these repeats also occurred outside of the functional centromeres (fig. S12). The CRM2 subfamily appears to be the centromeric repeat most closely associated with CENH3 in maize, as it is more enriched in the CENH3 chromatin fraction than CentC, CRM1, or CRM3 (table S13). We traversed two centromeres (2 and 5) in their entirety and determined that they differ in size and CENH3 density (31). Because CRM elements have generated recombinants with distinct periods of activity (33, 34), we were able to demonstrate that the regional centromeres of maize are dynamic loci and that the CENH3 domain shifts over time (31).

To protect genome integrity, TEs are usually transcriptionally silenced (35) in part via the RNA-directed DNA methylation (RdDM) pathway, which requires an RNA-dependent RNA polymerase 2 (RDR2). Mutation of the maize homolog of RDR2 (36) alters the accumulation of transcripts from many characterized transposons; unexpectedly, some TEs are down-regulated by loss of RDR2 function (37).

In most plant genomes, genes are less densely methylated than heterochromatic TEs and other repeats. Consequently, ~2× coverage of the maize genome by methylation-filtered (MF) reads includes portions of ~95% of maize genes (38). Mapping MF reads (39) of maize and sorghum onto their respective genomes revealed species-specific distributions of heterochromatic DNA methylation along the reference chromosomes (fig. S13, A and B). It is noteworthy that, in the sorghum genome, hypomethylated genes are largely excluded from the pericentromeric regions, whereas they are dispersed more widely in maize. Visual comparisons between sorghum and maize (14) revealed high levels of coalignment, including centromeres where centromeric repeats are undermethylated relative to the surrounding heterochromatin (39, 40) (fig. S13C). Thus, the B73 RefGen_v1 yields evidence that heavily methylated regions are more condensed during interphase.

Anchoring the B73 RefGen_v1 to a newly developed genetic map (19) revealed that rates of meiotic recombination per megabase are highest at the ends of the reference chromosomes and very low in the middle half of each chromosome surrounding the centromeres (Fig. 1) (19, 41). Although recombination occurs preferentially in genes (2) and gene density shows a similar distribution (Fig. 1), gene density does not fully explain the nonrandom distribution of recombination events, because a pronounced nonuniform distribution is still observed even when gene density is taken into consideration (19). Instead, epigenetic marks, including hypomethylation and histone modifications, are implicated in guiding both Mu insertion and meiotic recombination (19). Epigenetic processes have also been invoked to explain the observation that genomic imprinting contributes to the expression of thousands of genes in maize hybrids (42).

Maize exhibits extremely high levels of both phenotypic and genetic diversity. This genomic diversity was explored with both resequencing (41) and array-based comparative genomic hybridization between the B73 and Mo17 inbred lines (43). This revealed extensive structural variation, including hundreds of copy number variants (CNVs) and thousands of present-absent variants (PAVs). Many of the PAVs, including an ~2-Mb region on chromosome 6, contain intact, expressed single-copy genes that are present in one inbred genome but absent from the other. These haplotype-specific sequences may contribute to heterosis and the substantial degree of phenotypic variation among maize inbreds (43).

After a whole-genome duplication, the return to a genetically diploid state was associated with numerous chromosomal breakages and fusions, as shown by alignment to the genomes of sorghum and the more distantly related rice (Fig. 1 and fig. S14) (12).
In contrast, sorghum has experienced relatively few interchromosomal rearrangements since its lineage split with rice (8); therefore, its chromosomal configuration closely resembles the ancestral state of maize’s two subgenomes (12). Cosynteny of maize genes to common reference genes in rice or sorghum defined maize’s duplicate regions (fig. S15). Although syntenic blocks cover 1832 Mb (~89% of the genome), individual gene losses were common and resulted in retention of only ~8110 genes as duplicate homoeologs (~25% of total genes; ~30% having orthologs in rice and/or sorghum). On the basis of an analysis of GO (gene ontology) terms (14, 44) (table S15), retention of genes as duplicates is not random, e.g., retained duplicates are significantly enriched for transcription factors (>1.5-fold; P value = 7.6 × 10⁻²²) (table S15), as is also the case in rice (44) and Arabidopsis (45). An example of biased retention is the CesA family, in which all 10 ancestral sites were retained as duplicates (fig. S16) (46).

Using the sorghum genome to project extant maize regions to ancestral chromosomes (14) revealed a strong bias for gene loss (fractionation) between sister regions (table S16 and fig. S17). Fractionation bias has been observed in other plant lineages and species (47–50). Sites containing proximately duplicated paralogs tend to exist as single copies, or not at all, at corresponding homoeologous positions (table S18). Of the 1454 proximately duplicated paralogs identified (making up 3614 genes), only 126 (~9%) could be found at homoeologous positions (14). Of the remainder, 279 (19%) had a single paralog at the corresponding homoeologous site, and 1049 (72%) had no homoeologs.

Nearly identical paralogs (NIPs) are genes with pairwise alignments of ≥500 bp, ≥98% identity, and ≥95% coverage with other genes (51).
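The three NIP thresholds just stated can be read as a simple filter over pairwise gene alignments. The sketch below is illustrative only: the `Alignment` record, its field names, and the example values are hypothetical, and coverage is assumed to mean the aligned fraction of a gene, which the text does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical pairwise gene alignment record (field names are assumptions)."""
    aligned_bp: int   # length of the aligned region in base pairs
    identity: float   # fraction of identical bases within the aligned region (0-1)
    coverage: float   # aligned fraction of the gene (0-1); this definition is assumed


def is_nip(aln, min_bp=500, min_identity=0.98, min_coverage=0.95):
    """Return True when an alignment meets all three NIP thresholds from the text."""
    return (aln.aligned_bp >= min_bp
            and aln.identity >= min_identity
            and aln.coverage >= min_coverage)


# Made-up alignments for illustration; these are not data from the paper.
examples = [
    Alignment(aligned_bp=1200, identity=0.991, coverage=0.97),  # meets all thresholds
    Alignment(aligned_bp=450, identity=0.995, coverage=0.99),   # alignment too short
    Alignment(aligned_bp=900, identity=0.960, coverage=0.98),   # identity too low
]
nip_flags = [is_nip(a) for a in examples]
print(nip_flags)
```

All three conditions must hold at once, so a short but perfect alignment is rejected just as a long but divergent one is.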
Of maize-filtered genes, 2.5% (828 out of 32,540) were NIPs from 386 families, most of which have only two members (n = 349); the largest has nine members. Almost half (46%) of the NIP pairs had both members physically linked within 200 kb of each other, whereas in most of the remaining cases, the two members were distant from each other or on different chromosomes (fig. S18).

Just as cytogenetic and genetic maps (52) revolutionized research and crop improvement over the last century, the B73 maize reference sequence promises to advance basic research and to facilitate efforts to meet the world’s growing needs for food, feed, energy, and industrial feed stocks in an era of global climate change. Findings derived from this genome sequence briefly summarized here are described in more detail in a series of companion papers (11, 13, 19, 22, 24–26, 30, 31, 37, 41–43, 46). Annotation data and browser are available at www.maizegenome.org.

References and Notes
1. J. F. Doebley, B. S. Gaut, B. D. Smith, Cell 127, 1309 (2006).
2. J. L. Bennetzen, S. Hake, Handbook of Maize: Genetics and Genomics (Springer, New York, 2009).
3. C. P. National Corn Growers Association, Table showing corn harvested, yield, production, mya price, and value, 1991–2008; http://ncga.com/corn-production-trends.
4. A. F. Troyer, Crop Sci. 46, 528 (2006).
5. D. N. Duvick, Science 286, 418 (1999).
6. A. H. Paterson, J. E. Bowers, B. A. Chapman, Proc. Natl. Acad. Sci. U.S.A. 101, 9903 (2004).
7. G. Blanc, K. H. Wolfe, Plant Cell 16, 1667 (2004).
8. Z. Swigonova et al., Genome Res. 14, 1916 (2004).
9. A. H. Paterson et al., Nature 457, 551 (2009).
10. P. SanMiguel, B. S. Gaut, A. Tikhonov, Y. Nakajima, J. L. Bennetzen, Nat. Genet. 20, 43 (1998).
11. F. Wei et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000715).
12. F. Wei et al., PLoS Genet. 3, e123 (2007).
13. S.
Zhou et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000711).
14. Materials and methods are available as supporting material on Science Online.
15. P. SanMiguel et al., Science 274, 765 (1996).
16. B. McClintock, Cold Spring Harbor Symp. Quant. Biol. 16, 13 (1951).
17. C. Feschotte, N. Jiang, S. R. Wessler, Nat. Rev. Genet. 3, 329 (2002).
18. A. Kumar, J. L. Bennetzen, Annu. Rev. Genet. 33, 479 (1999).
19. S. Liu et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000733).
20. V. V. Kapitonov, J. Jurka, Proc. Natl. Acad. Sci. U.S.A. 98, 8714 (2001).
21. S. Lal, N. Georgelis, L. Hannah, in Handbook of Maize: Genetics and Genomics, J. L. Bennetzen, S. Hake, Eds. (Springer, New York, 2008), pp. 329–339.
22. L. Yang, J. L. Bennetzen, Proc. Natl. Acad. Sci. U.S.A., published online 19 November 2009 (10.1073/pnas.0908008106).
23. L. Yang, J. L. Bennetzen, Proc. Natl. Acad. Sci. U.S.A. 106, 12832 (2009).
24. R. S. Baucom et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000732).
25. F. Wei et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000728).
26. L. Zhang, PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000716).
27. H. Liang, W. H. Li, Mol. Biol. Evol. 26, 1195 (2009).
28. G. Haberer et al., Plant Physiol. 139, 1612 (2005).
29. N. N. Alexandrov et al., Plant Mol. Biol. 69, 179 (2009).
30. C. Soderlund et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000740).
31. T. K. Wolfgruber et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000743).
32. C. X. Zhong et al., Plant Cell 14, 2825 (2002).
33. A. Sharma, G. G. Presting, Mol. Genet. Genomics 279, 133 (2008).
34. A. Sharma, K. L. Schneider, G. G. Presting, Proc. Natl. Acad. Sci. U.S.A. 105, 15470 (2008).
35. D. Lisch, Annu. Rev. Plant Biol. 60, 43 (2009).
36. M. Alleman et al., Nature 442, 295 (2006).
37. Y. Jia et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000737).
38. Y. Fu et al., Proc. Natl. Acad. Sci. U.S.A. 102, 12282 (2005).
39. L. E. Palmer et al., Science 302, 2115 (2003).
40. W. Zhang, H. R. Lee, D. H. Koo, J. Jiang, Plant Cell 20, 25 (2008).
41. M. A. Gore et al., Science 326, 1115 (2009).
42. R. A. Swanson-Wagner et al., Science 326, 1118 (2009).
43. N. M. Springer et al., PLoS Genet., 19 November 2009 (10.1371/journal.pgen.1000734).
44. C. G. Tian et al., Yi Chuan Xue Bao 32, 519 (2005).
45. C. Seoighe, C. Gehring, Trends Genet. 20, 461 (2004).
46. B. W. Penning et al., Plant Physiol., published online 19 November 2009 (10.1104/pp.109.136804).
47. H. Shaked, K. Kashkush, H. Ozkan, M. Feldman, A. A. Levy, Plant Cell 13, 1749 (2001).
48. K. Song, P. Lu, K. Tang, T. C. Osborn, Proc. Natl. Acad. Sci. U.S.A. 92, 7719 (1995).
49. J. A. Tate, P. Joshi, K. A. Soltis, P. S. Soltis, D. E. Soltis, BMC Plant Biol. 9, 80 (2009).
50. B. C. Thomas, B. Pedersen, M. Freeling, Genome Res. 16, 934 (2006).
51. S. J. Emrich et al., Genetics 175, 429 (2007).
52. B. McClintock, Science 69, 629 (1929).
53. The Maize Genome Sequencing Project supported by NSF award DBI-0527192 (R.K.W., S.W.C., R.S.F., R.A.W., P.S.S., S.A., L.S., D.W., W.R.M., R.A.M.). The Maize Transposable Element Consortium and the Maize Centromere Consortium supported by NSF awards DBI-0607123 (S.R.W., J.L.B., R.K.D., N.J., P.S.M.) and DBI-0421671 (R.K.D., J.J., G.G.P.). Also supported by NSF grants DBI-0321467 (D.W.), DBI-0321711 (P.S.S.), DBI-0333074 (D.W.), DBI-0501818 (D.C.S.), DBI-0501857 (Y.Y.), DBI-0701736 (T.P.B., Q.S.), DBI-0703273 (R.A.M.), and DBI-0703908 (D.W.), and by USDA National Research Initiative Grants 2005-35301-15715 and 2007-35301-18372 from the USDA Cooperative State Research, Education, and Extension Service (P.S.S.) and from the USDA-ARS (408934 and 413089) to D.W., and from the Office of Science (Biological and Environmental Research), U.S. Department of Energy, grant DE-FG02-08ER64702 to N.C.C. and M.C.M. Sequences of the reference chromosomes have been deposited in GenBank as accession numbers CM000777 to CM000786. RNA-sequence reads have been deposited in the Gene Expression Omnibus (GEO) database (www.ncbi.nlm.nih.gov/geo) as accession numbers GSE16136, GSE16868, and GSE16916. Centromeric sequences have been deposited in the National Center for Biotechnology Information, NIH, Trace Archive as accessions 1757396377 to 1757412600 and 2185189231 to 2185200942.

Supporting Online Material
www.sciencemag.org/cgi/content/full/326/5956/1112/DC1
Materials and Methods
SOM Text
Figs. S1 to S18
Tables S1 to S18
References

1 July 2009; accepted 13 October 2009
10.1126/science.1178534

A First-Generation Haplotype Map of Maize

Michael A. Gore,1,2,3*† Jer-Ming Chia,4* Robert J. Elshire,3 Qi Sun,5 Elhan S. Ersoz,3 Bonnie L. Hurwitz,4‡ Jason A. Peiffer,2 Michael D. McMullen,1,6 George S. Grills,7 Jeffrey Ross-Ibarra,8 Doreen H. Ware,1,4§ Edward S. Buckler1,2,3§

Maize is an important crop species of high genetic diversity. We identified and genotyped several million sequence polymorphisms among 27 diverse maize inbred lines and discovered that the genome was characterized by highly divergent haplotypes and showed 10- to 30-fold variation in recombination rates. Most chromosomes have pericentromeric regions with highly suppressed recombination that appear to have influenced the effectiveness of selection during maize inbred development and may be a major component of heterosis. We found hundreds of selective sweeps and highly differentiated regions that probably contain loci that are key to geographic adaptation. This survey of genetic diversity provides a foundation for uniting breeding efforts across the world and for dissecting complex traits through genome-wide association studies.

Maize (Zea mays L.) is both a model genetic system and an important crop species. Already a critical source of food, fuel, feed, and fiber, the addition of genomic information allows maize to be further improved through plant breeding that exploits its tremendous genetic diversity (1–3). Genome-wide association studies (GWAS) of diverse maize germplasm offer the potential to rapidly resolve complex traits to gene-level resolution, but these studies require a high density of genome-wide markers. To do this, we targeted the 20% of the maize genome that is low-copy (4, 5) on a diverse panel of 27 inbred lines (representative of maize breeding efforts and worldwide diversity), the founders of the maize nested association mapping (NAM) population (6), and used sequencing-by-synthesis (SBS) technology with three complementary restriction enzyme–anchored genomic libraries (figs. S1 and S2A) (7).

More than 1 billion SBS reads (>32 gigabases of sequence) were generated, covering ~38% of the total maize genome, albeit at mostly low-coverage levels. We focused on the ~93 million base pairs (Mbp) of low-copy sequence present in 13 or more lines in this study. Roughly 39% of the sequenced low-copy fraction was derived from introns and exons (5), covering 32% of the total genic fraction in the genome. We identified 3.3 million single-nucleotide polymorphisms (SNPs) and indels (table S1) and found that, overall, 1 in every 44 bp was polymorphic (π = 0.0066 per base pair). In a subset used for the population genetics analyses, the error rate was 1/2570 or 17-fold lower than π (roughly half the errors are paralogy issues). The absolute level of diversity we examined, though high, may be slightly reduced because of difficulties aligning highly divergent sequences and our low power to call

1United States Department of Agriculture–Agriculture Research Service (USDA-ARS). 2Department of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853, USA. 3Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853, USA. 4Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA.
5Computational Biology Service Unit, Cornell University, Ithaca, NY 14853, USA. 6Division of Plant Sciences, University of Missouri, Columbia, MO 65211, USA. 7Institute for Biotechnology and Life Science Technologies, Cornell University, Ithaca, NY 14853, USA. 8Department of Plant Sciences, University of California, Davis, CA 95616–5294, USA.
*These authors contributed equally to this work.
†Present address: United States Arid-Land Agricultural Research Center, Maricopa, AZ 85138, USA.
‡Present address: Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721, USA.
§To whom correspondence should be addressed. E-mail: [email protected] (D.H.W.); [email protected] (E.S.B.)

20 NOVEMBER 2009 1115

Articles
© 2009 Nature America, Inc. All rights reserved.

The genome of the cucumber, Cucumis sativus L.

Sanwen Huang1,19, Ruiqiang Li2,3,19, Zhonghua Zhang1,19, Li Li2,19, Xingfang Gu1,19, Wei Fan2,19, William J Lucas4,19, Xiaowu Wang1, Bingyan Xie1, Peixiang Ni2, Yuanyuan Ren2, Hongmei Zhu2, Jun Li2, Kui Lin5, Weiwei Jin6, Zhangjun Fei7, Guangcun Li8, Jack Staub9, Andrzej Kilian10, Edwin A G van der Vossen11, Yang Wu5, Jie Guo5, Jun He1, Zhiqi Jia1, Yi Ren1, Geng Tian2, Yao Lu2, Jue Ruan2,12, Wubin Qian2, Mingwei Wang2, Quanfei Huang2, Bo Li2, Zhaoling Xuan2, Jianjun Cao2, Asan2, Zhigang Wu2, Juanbin Zhang2, Qingle Cai2, Yinqi Bai2, Bowen Zhao13, Yonghua Han6, Ying Li1, Xuefeng Li1, Shenhao Wang1, Qiuxiang Shi1, Shiqiang Liu1, Won Kyong Cho14, Jae-Yean Kim14, Yong Xu15, Katarzyna Heller-Uszynska10, Han Miao1, Zhouchao Cheng1, Shengping Zhang1, Jian Wu1, Yuhong Yang1, Houxiang Kang1, Man Li1, Huiqing Liang2, Xiaoli Ren2, Zhongbin Shi2, Ming Wen2, Min Jian2, Hailong Yang2, Guojie Zhang2,12, Zhentao Yang2, Rui Chen2, Shifang Liu2, Jianwen Li2, Lijia Ma2,12, Hui Liu2, Yan Zhou2, Jing Zhao2, Xiaodong Fang2, Guoqing Li2, Lin Fang2, Yingrui Li2,12, Dongyuan Liu2, Hongkun Zheng2,3, Yong Zhang2, Nan Qin2, Zhuo Li2, Guohua Yang2, Shuang Yang2, Lars Bolund2,16, Karsten Kristiansen17, Hancheng Zheng2,18, Shaochuan Li2,18, Xiuqing Zhang2, Huanming Yang2, Jian Wang2, Rifei Sun1, Baoxi Zhang1, Shuzhi Jiang1, Jun Wang2,17, Yongchen Du1 & Songgang Li2

Cucumber is an economically important crop as well as a model system for sex determination studies and plant vascular biology. Here we report the draft genome sequence of Cucumis sativus var. sativus L., assembled using a novel combination of traditional Sanger and next-generation Illumina GA sequencing technologies to obtain 72.2-fold genome coverage. The absence of recent whole-genome duplication, along with the presence of few tandem duplications, explains the small number of genes in the cucumber. Our study establishes that five of the cucumber’s seven chromosomes arose from fusions of ten ancestral chromosomes after divergence from Cucumis melo. The sequenced cucumber genome affords insight into traits such as its sex expression, disease resistance, biosynthesis of cucurbitacin and ‘fresh green’ odor. We also identify 686 gene clusters related to phloem function. The cucumber genome provides a valuable resource for developing elite cultivars and for studying the evolution and function of the plant vascular system.

The botanical family Cucurbitaceae, commonly known as cucurbits and gourds, includes several economically important cultivated plants, such as cucumber (C. sativus L.), melon (C. melo L.), watermelon (Citrullus lanatus (Thunb.) Matsum. & Nakai) and squash and pumpkin (Cucurbita spp.). Agricultural production of cucurbits uses 9 million hectares of land and yields 184 million tons of vegetables, fruits and seeds annually (http://faostat.fao.org). The cucurbit family also displays a rich diversity of sex expression, and the cucumber has served as a primary model system for sex determination studies1.
The cucurbits are also model plants for the study of vascular biology, as both xylem and phloem sap can be readily collected for studies of long-distance signaling events2,3. Despite the agricultural and biological importance of cucurbits, knowledge of their genetics and genome is currently very limited. We have therefore sequenced and assembled the genome of the domestic cucumber, C. sativus var. sativus L.

All previous plant genome sequences have been derived using traditional Sanger technology4–9. The recent development of next-generation sequencing technologies has significantly improved sequencing throughput at a markedly reduced cost10. However, an intrinsic characteristic of next-generation technologies is their short read length (~50 bp), which prevents their direct application for de novo assembly of large genomes. When using these new technologies, assembly is typically carried out by mapping these short reads onto a known reference genome11,12. For the cucumber genome, we carried out a novel combination de novo sequencing strategy, taking advantage of the long read and clone length of Sanger technology and, for the first time, the high sequencing depth and low unit cost of Illumina GA technology.

1Key Laboratory of Horticultural Crops Genetic Improvement of Ministry of Agriculture, Sino-Dutch Joint Lab of Horticultural Genomics Technology, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, China. 2BGI-Shenzhen, Shenzhen, China. 3Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense, Denmark. 4Department of Plant Biology, College of Biological Sciences, University of California, Davis, California, USA. 5College of Life Sciences, Beijing Normal University, Beijing, China. 6National Maize Improvement Center of China, Key Laboratory of Crop Genetic Improvement and Genome of Ministry of Agriculture, Beijing Key Laboratory of Crop Genetic Improvement, China Agricultural University, Beijing, China. 7Boyce Thompson Institute and USDA Robert W. Holley Center for Agriculture and Health, Cornell University, Ithaca, New York, USA. 8High-Tech Research Center, Shandong Academy of Agricultural Sciences, Jinan, China. 9US Department of Agriculture, Agricultural Research Service, Vegetable Crops Research Unit, Department of Horticulture, University of Wisconsin, Madison, Wisconsin, USA. 10Diversity Arrays Technology, Canberra, Australia. 11Wageningen UR Plant Breeding, Wageningen, The Netherlands. 12The Graduate University of Chinese Academy of Sciences, Beijing, China. 13High School Affiliated to Renmin University of China, Beijing, China. 14Division of Applied Life Science (BK21 and WCU program), PMBBRC and EB-NCRC, Gyeongsang National University, Jinju, Republic of Korea. 15National Engineering Research Center for Vegetables, Beijing, China. 16Institute of Human Genetics, University of Aarhus, Aarhus, Denmark. 17Department of Biology, University of Copenhagen, Copenhagen, Denmark. 18South China University of Technology, Guangzhou, China. 19These authors contributed equally to this work. Correspondence should be addressed to Y.D. ([email protected]), S.H. ([email protected]), Jun Wang ([email protected]) or Songgang Li ([email protected]).

Received 6 May; accepted 28 September; published online 1 November 2009; doi:10.1038/ng.475

Nature Genetics volume 41 | number 12 | december 2009 1275

Table 1 Cucumber genome assembly statistics

Assembly               Contig N50a (kb)   Contig total (Mb)   Scaffold N50 (kb)   Scaffold total (Mb)   % sequence anchored on chromosome
Sanger                 2.6                204                 19                  238                   -
Illumina GA            12.5               190                 172                 200                   -
Sanger + Illumina GA   19.8               226.5               1,140               243.5                 72.8%

aN50 refers to the size above which half of the total length of the sequence set can be found.

RESULTS

Sequencing and assembly

We selected the ‘Chinese long’ inbred line 9930, which is commonly used in modern cucumber breeding13, for our genome sequencing project. We generated a total of 26.5 billion high-quality base pairs, or 72.2-fold genome coverage, of which the Sanger reads provided 3.9-fold coverage and the Illumina GA reads provided 68.3-fold coverage (Supplementary Table 1). The GA reads ranged in length from 42 to 53 bp.

We compared the assemblies obtained by Sanger reads only, Illumina GA reads only and Sanger plus Illumina reads. The ‘hybrid’ approach achieved markedly longer N50 (the size above which half of the total length of the sequence set can be found) in both contigs and scaffolds, so we used this assembly for further analyses (Table 1 and Supplementary Table 2). The total length of the assembled genome was 243.5 Mb, about 30% smaller than the genome size estimated by flow cytometry of isolated nuclei stained with propidium iodide (367 Mb)14 and by K-mer depth distribution of sequenced reads (350 Mb; Supplementary Fig. 1). Several types of satellite sequences were present in the data set, comprising 23.2% of all Sanger reads and 76.2% of unassembled reads (Supplementary Table 3). FISH analysis indicated that these are primarily located in the centromeric and telomeric regions15. The cucumber genome also contains a large number of rRNA sequences, and about 3.3% of the Sanger reads matched 45S rRNA. These results indicated that the majority of the remaining 30% of unassembled regions of the genome are likely to be heterochromatic satellite or rRNA sequences.

The high coverage of the cucumber genome by this assembly was also confirmed using the available EST, fosmid and BAC sequences. The assembly contains 96.8% of the 63,312 cucumber unigenes assembled from ~350,000 Roche 454–sequenced ESTs, 99.3% of the 6,952 NCBI-deposited ESTs of cucumber, 91.2% of the 50,441 NCBI-deposited ESTs of melon and 98.7% of the six finished fosmid and BAC sequences (Supplementary Table 4).

A genetic map was developed using 77 recombinant inbred lines from the intersubspecific cross between Gy14 (a North American processing market–type cucumber cultivar) and PI183967 (an accession of C. sativus var. hardwickii originating from India). The map spans 581 cM and contains 1,885 markers, including 995 microsatellite markers16 and 890 Diversity Arrays Technology markers (marker sequences can be accessed at http://cucumber.genomics.org.cn). Using this map, we were able to anchor 72.8% of the assembled sequences onto the seven chromosomes. Among the 1,885 markers, 1,763 (93.5%) were uniquely aligned and used for constructing the pseudochromosomes. The majority (98.7%) of the markers were collinear with the sequence assembly (Fig. 1a). Comparison of the genetic and physical distances between markers revealed recombination suppression of two 10-Mb regions at either end of chromosome 4, a 20-Mb region on chromosome 5 and an 8-Mb region on chromosome 7. Using high-resolution FISH, we confirmed previously identified segmental inversion16 within the suppression region on chromosome 5 between Gy14 and PI183967 (Fig. 1b), which provides an explanation for recombination suppression in these regions. These regions of recombination suppression are additionally useful for studying cucumber evolution during domestication.

After excluding 16 markers whose genetic positions were ambiguous, we examined the six remaining regions that had conflicts between the genetic map and our assembly. Upon inspection, we found that clone mate-pair information supported our assembly in all of these regions (Supplementary Fig. 2). We also identified no misassembly within the regions covered by the six finished fosmid or BAC sequences (Supplementary Fig. 3). The conflicts may be a result of chromosomal rearrangement that occurred between the sequenced genotype 9930 and the genotypes used to create the mapping population; alternatively, these markers may have been placed incorrectly on the genetic map. Sequencing depth distribution showed that we obtained more than 10× coverage on more than 97.5% of the assembly (Supplementary Fig. 4).

Figure 1 Integrated genetic and physical map of cucumber. (a) Genetic versus physical distance map of the seven cucumber chromosomes. The genetic map was constructed using a recombinant inbred line mapping population from the intersubspecific cross between Gy14 (domestic cucumber) and PI183967 (wild cucumber). (b) Segmental inversion between Gy14 and PI183967 on cucumber chromosome 5 detected by high-resolution FISH (12-2 and 12-7 denote individual fosmid clones). A low-resolution FISH analysis was also recently reported16. Scale bars represent 1 µm. [Panel a plots genetic distance (cM) against physical distance (Mb) for linkage groups LG1 to LG7.]

Repetitive sequences and transposons

The cucumber genome contains a large number of transposable elements, but only a few have previously been identified. We therefore constructed repeat libraries using multiple de novo methods and then derived a combined repeat library that contained 1,566 sequences (Supplementary Table 5), of which 469 (29.9%) were manually classified (Supplementary Table 6). We then used this library for repeat annotation of the cucumber genome. We identified a total of 54.4 Mb, which represents ~24% of the genome, as repeats. Among them, 51.5% could be classified based on known repeats. The long terminal repeat (LTR) retrotransposons (gypsy and copia) made up the majority of the transposable element classes and comprised 10.4% of the genome (Supplementary Table 7). The repeats divergence rate (percentage of substitutions in the matching region compared with consensus repeats in constructed libraries) distribution showed a peak at 20%. A fraction of LTR retrotransposons, long interspersed nuclear elements and DNA transposons (composing 2.3%, 0.4% and 0.2% of the genome, respectively) are of relatively recent origin, having a sequence divergence rate of less than 5% (Supplementary Fig. 5).

Gene annotation

We used three gene-prediction methods (cDNA-EST, homology based and ab initio) to identify protein-coding genes and then built a consensus gene set by merging all of the results (Supplementary Fig. 6). We predicted 26,682 genes, with a mean coding sequence size of 1,046 bp and an average of 4.39 exons per gene (Supplementary Table 8). Under an 80% sequence overlap threshold, we found that 26.7% of the genes were supported by models from all three gene prediction methods, 25% had both ab initio prediction and homology-based evidence, and 7.4% had ab initio prediction and cDNA-EST expression evidence; the remaining genes were primarily derived from pure ab initio prediction, but the majority of these were supported by multiple gene finders (Supplementary Table 9). About 81% of the genes have homologs in the TrEMBL protein database, and 66% can be classified by InterPro. In sum, 82% of the genes have either known homologs or can be functionally classified (Supplementary Table 10).
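The 80% overlap criterion used to count supporting evidence can be illustrated with a small interval check. This is a minimal sketch under stated assumptions, not the authors' pipeline: each gene model is treated as a single (start, end) interval rather than a set of exons, overlap is measured relative to the shorter model, and the function names, evidence labels and coordinates are all hypothetical.

```python
def overlap_fraction(a, b):
    """Overlap length divided by the length of the shorter interval.

    Intervals are (start, end) tuples with start < end (half-open coordinates).
    Measuring against the shorter interval is an assumption of this sketch.
    """
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    shorter = min(a[1] - a[0], b[1] - b[0])
    return overlap / shorter


def supporting_methods(gene, evidence, threshold=0.80):
    """Return the sorted names of evidence types with a model overlapping `gene`."""
    return sorted(
        name
        for name, models in evidence.items()
        if any(overlap_fraction(gene, model) >= threshold for model in models)
    )


# Hypothetical coordinates for one consensus gene and its candidate evidence.
gene = (1000, 2000)
evidence = {
    "ab_initio": [(1000, 2000)],  # identical interval, fraction 1.0
    "homology": [(1100, 2050)],   # overlap 900 over shorter length 950
    "cdna_est": [(1800, 2600)],   # overlap 200 over shorter length 800
}
support = supporting_methods(gene, evidence)
print(support)
```

Counting how many genes fall into each combination of supporting methods would then give percentages of the kind reported above.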
In addition to protein-coding genes, we identified 292 rRNA fragments and 699 tRNA, 238 small nucleolar RNA, 192 small nuclear RNA and 171 miRNA genes in the cucumber genome (Supplementary Table 11).

[Figure 1a key: centromeric regions estimated by FISH.]

On the basis of pairwise protein sequence similarities, we carried out a gene family clustering analysis on all genes in sequenced plants, using rice as an outgroup. The cucumber genes consist of 15,669 families. Of these, 4,362 are cucumber unique families, among which 3,784 are single-gene families (Supplementary Table 12). The EST confirmation rate of these unique single-copy genes was much lower than the average of all predicted genes (33.4% vs. 72.3%, respectively). This category may therefore contain a number of false-positive predictions. In papaya, there are 4,622 unique families, but the actual number of genes is estimated to be 24,746, which is lower than the 28,629 predicted genes7. Thus, the actual number in cucumber should be lower than 26,682 and similar to that in papaya. The smaller average gene family size in cucumber (1.71) and papaya (1.77) supports this conclusion (Fig. 2a).

The cucumber genome contains the smallest number of tandem gene duplications (479) among all the plants we compared, whereas grapevine has the largest number (5,382; Fig. 2a). This may contribute in part to the small number of genes in cucumber.

Absence of recent whole-genome duplication

Whole-genome duplication (WGD) is common in angiosperm plants and produces a tremendous source of raw material for gene genesis. Previous research has revealed a paleohexaploidy (γ) event in the common ancestor of Arabidopsis thaliana and grapevine after the divergence of monocotyledons and dicotyledons6. Subsequently, two WGDs (α and β) occurred in Arabidopsis17 and one (p) in poplar8, whereas no recent WGD occurred in grapevine and papaya. Evidence indicates that rice underwent an ancient WGD18. We carried out a collinear gene-order analysis on the cucumber genome and observed no recent WGD and only a few segmental duplication events (Supplementary Fig. 7). We also used the distance-transversion rate at fourfold degenerate sites (4DTv method) to analyze paralogous gene pairs between syntenic blocks in Arabidopsis and cucumber, respectively. Two peaks (~0.06 and ~0.25) in Arabidopsis support the two recent WGDs (Fig. 2b). In cucumber, the analysis showed ancient duplication events (peak at ~0.60) but did not reveal recent WGD. This lack of recurrent WGD in the small cucumber genome provides an important complement to the grapevine and papaya genomes to study ancestral forms and arrangements of plant genes.

Figure 2 Comparison of cucumber genome with other sequenced plant genomes. (a) Numbers of predicted genes, numbers of tandem duplicated genes and gene family sizes of the six sequenced plant genomes. (b) The 4DTv distribution of duplicate gene pairs in cucumber and Arabidopsis, calculated based on alignment of codons with HKY substitution model. [Panel a compares A. thaliana, C. papaya, P. trichocarpa, C. sativus, V. vinifera and O. sativa; panel b plots percentage of syntenic blocks against 4DTv and marks the two recent WGDs in Arabidopsis and the ancient duplications in cucumber.]

Synteny with flowering plant genomes

Given the similar gene arrangements between cucumber and other plant genomes, we defined syntenic blocks that contained 5,473, 6,525, 9,842, 8,439 and 3,992 cucumber genes collinear to Arabidopsis, papaya, poplar, grapevine and rice, respectively (Supplementary Table 13 and Supplementary Figs. 8–12). The numbers of collinear genes were consistent with the phylogenetic distances of the other plants to cucumber. Within the syntenic blocks, we observed the highest density of collinear genes between cucumber and grapevine (90.5 genes per Mb), followed by papaya (76.1; the low contiguity of genome assembly may have, in part, decreased this value), poplar (68.8), rice (55.6) and Arabidopsis (43.5; Supplementary Table 13). This indicates that Arabidopsis has the most reshuffled or rearranged genome, whereas the genomes of grapevine and papaya are more conserved, probably because they have not undergone WGD since the ancestral paleohexaploidy.

Substantial fusion events involved in chromosomal evolution

Melon and cucumber belong to the same genus, although cucumber has seven chromosomes and melon has 12. Watermelon, their common distant relative, has 11 chromosomes. To investigate cucurbit chromosomal evolution, we compared the melon19 and watermelon genetic maps to the cucumber genome (Fig. 3a). In total, 348 (66.7%) of the 522 melon markers and 136 (58.6%) of the 232 watermelon markers were aligned on the cucumber chromosomes (Supplementary Table 14). The comparison revealed that there has been no substantial rearrangement of cucumber chromosome 7, which corresponds to melon chromosome 1 and watermelon group 7.

Using watermelon as an outgroup, we found that cucumber chromosomes 1, 2, 3, 5 and 6 were collinear to melon chromosomes 2 and 12, 3 and 5, 4 and 6, 9 and 10, and 8 and 11, respectively, indicating that after speciation these cucumber chromosomes each resulted from a fusion of two ancestral chromosomes. We also found that cucumber chromosome 6 and melon chromosome 3 have a syntenic segment, indicating that interchromosome rearrangement occurred in one of the two genomes after speciation. Cucumber chromosome 4 largely corresponds to melon chromosome 7, although a segment of melon chromosome 8 is syntenic with cucumber chromosome 4 (crossing the centromere). These data indicate that the rearrangement is most likely to have occurred before the divergence of cucumber and melon. In addition to chromosome fusion and interchromosome rearrangements, the comparison revealed the occurrence of several intrachromosome rearrangements (Fig. 3a).

Figure 3 Comparative genomic analysis of cucurbits. (a) Comparative analysis of the melon and watermelon genetic maps with the cucumber sequence map. Cucumber, melon and watermelon have 7, 12 and 11 pairs of chromosomes, respectively. The current version of the watermelon genetic map is organized into 18 genetic groups. (b) Syntenic blocks between the cucumber genome and a melon BAC sequence (GenBank accession code EF188258.1). Genes are indicated by black arrows with the orientation indicated on the sequence. Rectangles, transposable elements; red, retrotransposable elements; blue, DNA transposons; green, unclassified transposable elements. Orthologous sequence regions between the two genomes are shown.

Cucumber-melon microsynteny

To estimate the sequence divergence rate, we compared the four sequenced melon BACs to the cucumber genome (Fig. 3b and Supplementary Fig. 13). There are 56 genes on the melon BACs, 52 of which are collinear with the cucumber genome. The mean sequence similarity over coding regions is 95%. Although the gene region similarity is very high, the repeat content between the two genomes is quite different. New transposable elements were frequently inserted in the intergenic regions of both genomes. Hence, only 54% of the BAC sequences could be aligned onto the cucumber genome, with an average of 88% sequence identity. Nonetheless, the highly conserved gene content and order between the two species make the cucumber genome useful for genetic analysis of melon.

Using the annotated genes in the four melon BACs, we obtained and manually curated eight orthologous families among rice, cucumber, melon, Arabidopsis and papaya. Extrapolating from the age of divergence between Arabidopsis and papaya (54–90 million years ago), we estimated that cucumber and melon diverged about 4–7 million years ago, which is consistent with a previous estimate of 9 ± 3 million years ago20.

Figure 4 Lineage-specific expansion of the LOX gene family in the five sequenced dicot genomes and rice genome. The LOX family is divided into two groups, type I and type II. The two tandem duplicated gene clusters are ordered and shown on chromosomes 2 and 4, as well as one unmapped scaffold of the cucumber genome.

Pathogen resistance genes

Only 61 nucleotide-binding site (NBS)-containing resistance (NBS-R) genes have been identified in cucumber, similar to papaya (55)7 but only a fraction of what is found in Arabidopsis (200), poplar (398) and rice (600)8. Distribution of NBS genes on chromosomes is nonrandom, with only five genes located on chromosomes 1, 6 and 7 and 20 genes located on chromosome 2 (Supplementary Fig. 14).
Three-quarters of the NBS genes are located within 11 clusters, indicating that they evolved through tandem duplications, similar to other known plant genomes. The lipoxygenase (LOX) pathway has an important role in developmentally and environmentally regulated processes in plants21 and generates short-chain aldehydes and alcohols that are involved in plant defense and pest resistance22. The LOX gene family has been notably expanded in the cucumber genome (23 LOX genes in cucumber, 6 in Arabidopsis, 15 in papaya, 21 in poplar, 18 in grapevine and 15 in rice). Fourteen of the LOX genes are specific to the cucumber lineage. The majority of cucumber LOX genes (19 of 23) are distributed in three clusters, the largest of which contains 11 members that are arranged in tandem (Fig. 4). The other sequenced plant genomes show no obvious LOX clustering, with the exception of grapevine, which has one cluster harboring six copies. Given that the cucumber has only 61 NBS-R genes, the expanded lipoxygenase pathway might be a complementary mechanism to cope with biotic stress. In support of this hypothesis, Arabidopsis has more NBS-R genes and fewer LOX genes than does papaya. The volatile (E,Z)-2,6-nonadienal (NDE) gives cucumber its ‘fresh green’ flavor23 and confers resistance to some bacteria and fungi24. Lipoxygenase and one type of hydroperoxide lyase, 9-HPL, synthesize NDE from linolenic acid precursors. Genes encoding enzymes with 9-HPL activity are rarely found in other plants25. However, cucumber contains two tandem HPL genes, one of which has been experimentally confirmed as encoding an enzyme with 9-HPL activity25. The expansion of the LOX gene family and the duplicated HPL genes may be related to the high level of NDE synthesis in cucumber. Eukaryotic translation initiation factors, particularly the eIF4E and eIF4G families, confer recessive resistance to plant RNA virus infections. 
An EIF4E gene in melon was found to mediate recessive resistance against melon necrotic spot virus26. In the cucumber genome, three EIF4E and three EIF4G genes have been identified, providing candidates for known recessive resistance genes against RNA viruses such as zucchini yellow mosaic virus and watermelon mosaic virus27. In some wild melon genotypes, enhanced expression of two glyoxylate aminotransferase genes (At1 and At2) controls the resistance to downy mildew, a devastating foliar disease of cucurbits28. We identified two At homologs in cucumber that could be candidate genes for downy mildew resistance.

Novel biosynthetic pathways
Cucurbitacins are bitter cucurbit triterpenoid compounds that are toxic to most organisms but can attract specialized insects29,30. The presence of cucurbitacin in the cucumber is controlled by a mendelian gene, Bi30. Oxidosqualene cyclase (OSC) catalyzes the formation of the triterpene carbon framework in plants31. An OSC gene, CPQ, in squash (Cucurbita pepo L.) encodes the first committed enzyme in the cucurbitacin biosynthesis pathway32. In cucumber, we identified four OSC genes; the CPQ ortholog Csa008595 resides in a genetic interval that defines the Bi gene (Supplementary Fig. 15). Notably, Csa008595 forms a cluster that contains an acyltransferase-encoding gene (Csa008594) and two cytochrome P450–encoding genes (Csa008596 and Csa008597). Three of these (Csa008594, Csa008595 and Csa008597) are coexpressed strongly in cucumber leaf tissue (Supplementary Fig. 16) in a pattern similar to that of the operon-like gene cluster involved in thalianol biosynthesis in Arabidopsis33. This gene cluster may therefore catalyze the stepwise formation of cucurbitacin in cucumber.

Cucumber is a model system for studying sex expression in plants1. Ethylene stimulates femaleness and is considered the sex hormone of cucumber34.
We identified 137 cucumber genes that are related to the biosynthetic and signaling pathways of ethylene35,36, but we found no gene family expansion in these pathways compared with other sequenced plant genomes (Supplementary Table 15). Thus, the origin of monoecy in cucumber might involve other evolutionary mechanisms. The melon gene Cm-ACS7 (ref. 37) and its cucumber ortholog Cs-ACS2 (ref. 38) encode 1-aminocyclopropane-1-carboxylate synthase (ACS), a key regulatory enzyme in the ethylene biosynthetic pathway. Both genes are crucial to the inhibition of male organs and development of the female flower. In situ mRNA hybridization experiments revealed that both Cm-ACS7 and Cs-ACS2 transcripts accumulate only in the pistil and ovule, whereas their Arabidopsis ortholog, AT4G26200 (Supplementary Fig. 17), is expressed only in the roots39. We also identified two ethylene-responsive elements (AWTTCAAA) and one flower meristem identity gene LEAFY-responsive element (CCAATGT) within the Cs-ACS2 and Cm-ACS7 promoter sequences, but these were absent from the promoter of AT4G26200. These findings indicate that the evolution of unisexual flowers in cucurbits may have involved the acquisition of new cis elements of the ACS genes. To better understand the mechanism of sex determination in cucumber, we sequenced 359,105 EST sequences from near-isogenic unisexual and bisexual flower buds using the 454 pyrosequencing technology. Our analysis revealed that six auxin-related genes (auxin can regulate sex expression by stimulating ethylene production40) and three short-chain dehydrogenase or reductase genes (homologs to the sex determination gene ts2 in maize41) are more highly expressed in unisexual flowers (Supplementary Table 16). This analysis provides an important resource for further study of sex determination in cucumber. Novel developmental programs The tendril is a specific climbing tool of vines, such as Vitaceae and all Cucurbitaceae. 
Darwin considered tendrils a key innovation in plant evolution42. In cucumber and grapevine, gibberellic acid regulates tendril formation43,44. In most plants, the transition of GA12-aldehyde to GA12 is catalyzed by cytochrome P450 monooxygenase. In cucurbits, it is also catalyzed by enzymes encoded by specific GA-7-oxidase genes, which are absent from Arabidopsis45. Cucumber has two GA-7-oxidase genes (Supplementary Table 17). GA-20-oxidase controls key steps leading to bioactive GA1 and GA4, and our data show that the cucumber has three lineage-specific clades (three copies; Supplementary Fig. 18). These specific genes might be associated with the role of gibberellic acid in the regulation of tendril formation. Tendril coiling involves rapid cell wall modification46, and expansins are cell wall–loosening proteins in plants47. We found that, in cucumber, the expansin subfamily EXLA has undergone marked expansion through tandem duplication (eight genes in cucumber, compared with one to three genes in other genomes; Supplementary Fig. 19); this event may have contributed to the development of tendril coiling in cucumber.

Use in plant vascular biology studies
The evolution of the plant vascular system, comprising xylem and phloem tissues, had a pivotal role in the emergence of land plants. The sieve tube system of phloem, the equivalent of the animal arterial system, delivers nutrients and signaling molecules to developing organs2. A BLASTP analysis of 1,209 protein fragments from pumpkin phloem48 identified 800 phloem proteins in the cucumber genome (Supplementary Table 18). Using these cucumber proteins, we conducted orthologous gene family (cluster) analysis (Supplementary Table 19) with their homologs in other vascular plants as well as the nonvascular moss Physcomitrella patens49. In total, we constructed 686 clusters (Table 2).
About two-thirds (49 of 75) of the Arabidopsis and half (57 of 120) of the rice phloem proteins identified in previous studies50,51 were included in this data set, indicating the effectiveness of these analyses and the value of this resource for vascular biology studies in plants. The vascular and nonvascular plants shared 596 clusters; between monocots and eudicots, there are 648 clusters in common. Phloem protein II (PP2; cluster 2432) genes are present in angiosperms but absent from the moss genome. PP2-like genes are also present in gymnosperms52, indicating their association with the advent of vascular plants. In cucurbits, these genes can increase the size-exclusion limit of plasmodesmata and facilitate cell-to-cell traffic of macromolecules52 and thus are likely to have an essential role in vascular function. The sieve element occlusion proteins (gene cluster 4754), present in all eudicots but absent from mosses and monocots, represent a novel mechanism that evolved for sealing the sieve tube system after wounding53.

The average number of genes in each cluster ranges from 2.9 to 5.1 in the vascular plants, compared to 1.7 in moss (Table 2). The increase of gene numbers per cluster may be associated with the evolution of the plant vascular system. The 16-kDa PP16 cluster (cluster 2599) has an average of 3.7 genes in the vascular plants compared to 2 in moss. The CmPP16 gene in pumpkin is involved in transport of mRNA into the phloem3. The increase of the number of PP16 genes in vascular plants indicates that these new members may be involved in long-distance trafficking of mRNA.

To better understand xylem formation, we compared gene families related to lignin and cellulose biosynthesis between woody and herbaceous plants. The perennial woody plants, poplar and grapevine, have a large number of lignin biosynthesis–related genes (48 and 49, respectively), whereas the semiwoody plant papaya has an intermediate number (39).
In contrast, the herbaceous plants Arabidopsis and cucumber have smaller numbers (28 and 26, respectively; Supplementary Table 20). Among these gene families, the number of genes in the cinnamyl alcohol dehydrogenase (CAD) family was consistent with this trend. In poplar and grapevine, homologs for AT4G37980 and AT4G37990 in Arabidopsis, which have low CAD enzymatic activity in vitro and may have only a minor role in lignin formation in this species54, were expanded markedly. In papaya, there is an expansion of homologs for AT1G37970, which lack detectable CAD catalytic activities in vitro but are expressed predominantly in lignin-forming tissues54 (Supplementary Fig. 20). Thus, the expansion of CAD genes may be associated with wood formation. It is also notable that grapevine has the largest PAL gene family, with 15 members, and that poplar and papaya have the largest number of HCT genes, with 7 members. Of the cellulose biosynthesis–related genes, poplar has more CESA and COB genes (18 of each) than do any of the other sequenced dicots (Supplementary Table 20).

Table 2 Summary of orthologous gene families (clusters) established using cucumber genes homologous to pumpkin phloem proteins

Species           Genes    Gene clusters    Average genes per cluster
P. patens (a)     1,072    622              1.7
O. sativa         2,458    676              3.6
S. bicolor        2,780    679              4.1
A. thaliana       2,351    682              3.5
C. papaya         1,944    672              2.9
P. trichocarpa    3,454    684              5.1
C. sativus (b)    1,986    686              2.9
V. vinifera       2,535    668              3.8

(a) Moss (P. patens) was used as the only outgroup. (b) For each cluster, at least one cucumber phloem protein was included.

DISCUSSION
The sequence of the cucumber genome provides an invaluable new resource for biological research and breeding of cucurbits. The high collinearity between cucumber and melon genomes enables cucumber to serve as a model system in the Cucurbitaceae family for comparative genomics studies in plants.
The cucumber genome and related transcriptome analysis can provide insights into the mechanisms underlying sex determination, an important biological process that has been well characterized in cucumber at the phenotypic level. The genome can also advance our knowledge of the evolution and function of the plant vascular system. We have also shown that, in combination with traditional Sanger sequencing, next-generation DNA sequencing technologies can be used effectively for de novo sequencing of plant genomes, making it possible to carry out rapid and low-cost sequencing for other important plant species.

Methods
Methods and any associated references are available in the online version of the paper at http://www.nature.com/naturegenetics/.

Accession codes. The cucumber genome sequence has been deposited in GenBank with accession code ACHR00000000 (the version described here is the first version, with accession code ACHR01000000).

Note: Supplementary information is available on the Nature Genetics website.

Acknowledgments
We thank L. Goodman for assistance in editing the manuscript and R. Quatrano, L. Kochian, L. Comai, V. Sundaresan, S. Kamoun and S. Renner for critical readings of the manuscript. This work was funded by the Chinese Ministry of Agriculture (948 program), Ministry of Science and Technology (2006DFA32140, 2007CB815701, 2007CB815703 and 2007CB815705) and Ministry of Finance
(1251610601001); the National Natural Science Foundation of China (30871707 and 30725008); the Chinese Academy of Agricultural Sciences (seed grant to S.H.); the Chinese Academy of Science (GJHZ0701-6 and KSCX2-YWN-023); the US Department of Agriculture (National Research Initiative grant 2006-35304-17346 to W.J.L.); the National Science Foundation (grant IOS-07-15513 to W.J.L.); and the Korea Science and Engineering Foundation–Ministry of Education, Science and Technology (WCU R33-10002 and BK21 grants to J.-Y.K.). WKC was partly supported by grants from the Environmental Biotechnology National Core Research Center (R15-2003-012-01003-0) and National Research Laboratory (2009-0066339). This work was also supported by the Shenzhen Municipal and Yantian District Governments and the Society of Entrepreneurs & Ecology. D. Qu and Z. Fang of the Chinese Academy of Agricultural Sciences provided management support for this work. AUTHOR CONTRIBUTIONS S.H., Y.D., Jun Wang and Songgang Li managed the project. S.H., Z.Z., W.J.L., X.G. and R.L. designed the analyses. X.G., H.M., L.L., Yuanyuan Ren, G.T., Y. Lu, Z.X., J.C., A., Z.W., J. Zhang, H. Liang, X.R., M.J., Hailong Yang, R.C., Shifang Liu and X.Z. conducted DNA preparation and sequencing. X.W., B.X., K.L., W.J., Guangcun Li, Z.F., J.S., A.K., E.A.G.v.d.V. and Y.X. contributed new reagents and analytic tools. S.H., Z.Z., W.J.L., X.G., R.L., X.W., B.X., K.L., W.J., J.H., Z.J., Yi Ren, Ying Li, X.L., S.W., Q.S., W.K.C., J.-Y.K., K.H.-U., H.M., Z.C., S.Z., J. Wu, Y.Y., H.K., Y.W., J.G., Y.H., M.L., B. Zhao, Shiqiang Liu, W.F., P.N., H. Zhu, Jun Li, J.R., W.Q., M. Wang, Q.H., B.L., Q.C., Y.B., Z.S., M. Wen, G.Z., Z.Y., Jianwen Li, L.M., H. Liu., Y. Zhou, J. Zhao, X.F., Guoqing Li, L.F., Yingrui Li, D.L., Hancheng Zheng and Shaochuan Li conducted the data analyses. S.H., R.L., Z.Z. and W.J.L. wrote the paper. Y.D., R.S., B. Zhang., S.J., G.Y., S.Y., Hongkun Zheng, Y. 
Zhang, N.Q., Z.L., L.B., K.K., Huanming Yang and Jian Wang revised the paper.

Published online at http://www.nature.com/naturegenetics/.
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/.

1. Tanurdzic, M. & Banks, J.A. Sex-determining mechanisms in land plants. Plant Cell 16, S61–S71 (2004).
2. Lough, T.J. & Lucas, W.J. Integrative plant biology: role of phloem long-distance macromolecular trafficking. Annu. Rev. Plant Biol. 57, 203–232 (2006).
3. Xoconostle-Cázares, B. et al. Plant paralog to viral movement protein that potentiates transport of mRNA into the phloem. Science 283, 94–98 (1999).
4. Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
5. International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature 436, 793–800 (2005).
6. Jaillon, O. et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463–467 (2007).
7. Ming, R. et al. The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature 452, 991–996 (2008).
8. Tuskan, G.A. et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313, 1596–1604 (2006).
9. Yu, J. et al. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296, 79–92 (2002).
10. Shendure, J., Mitra, R.D., Varma, C. & Church, G.M. Advanced sequencing technologies: methods and goals. Nat. Rev. Genet. 5, 335–344 (2004).
11. Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
12. Wang, J. et al. The diploid genome sequence of an Asian individual. Nature 456, 60–65 (2008).
13. Staub, J.E., Serquen, F.C., Horejsi, T. & Chen, J.-f. Genetic diversity in cucumber (Cucumis sativus L.): IV. An evaluation of Chinese germplasm. Genet. Resour. Crop Evol. 46, 297–310 (1999).
14. Arumuganathan, K. & Earle, E. Nuclear DNA content of some important plant species. Plant Mol. Biol. Rep. 9, 208–218 (1991).
15. Han, Y.H. et al. Distribution of the tandem repeat sequences and karyotyping in cucumber (Cucumis sativus L.) by fluorescence in situ hybridization. Cytogenet. Genome Res. 122, 80–88 (2008).
16. Ren, Y. et al. An integrated genetic and cytogenetic map of the cucumber genome. PLoS One 4, e5795 (2009).
17. Bowers, J.E., Chapman, B.A., Rong, J. & Paterson, A.H. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422, 433–438 (2003).
18. Yu, J. et al. The genomes of Oryza sativa: a history of duplications. PLoS Biol. 3, e38 (2005).
19. Fernandez-Silva, I. et al. Bin mapping of genomic and EST-derived SSRs in melon (Cucumis melo L.). Theor. Appl. Genet. 118, 139 (2008).
20. Schaefer, H., Heibl, C. & Renner, S.S. Gourds afloat: a dated phylogeny reveals an Asian origin of the gourd family (Cucurbitaceae) and numerous oversea dispersal events. Proc. Biol. Sci. 276, 843–851 (2009).
21. Liavonchanka, A. & Feussner, I. Lipoxygenases: occurrence, functions and catalysis. J. Plant Physiol. 163, 348–357 (2006).
22. Schwab, W., Davidovich-Rikanati, R. & Lewinsohn, E. Biosynthesis of plant-derived flavor compounds. Plant J. 54, 712–732 (2008).
23. Buescher, R.H. & Buescher, R.W. Production and stability of (E,Z)-2,6-nonadienal, the major flavor volatile of cucumbers. J. Food Sci. 66, 357–361 (2001).
24. Cho, M.J., Buescher, R.W., Johnson, M. & Janes, M. Inactivation of pathogenic bacteria by cucumber volatiles (E,Z)-2,6-nonadienal and (E)-2-nonenal. J. Food Prot. 67, 1014–1016 (2004).
25. Matsui, K. et al. Fatty acid 9- and 13-hydroperoxide lyases from cucumber. FEBS Lett. 481, 183–188 (2000).
26. Nieto, C. et al. An eIF4E allele confers resistance to an uncapped and nonpolyadenylated RNA virus in melon. Plant J. 48, 452–462 (2006).
27. Wai, T. & Grumet, R. Inheritance of resistance to watermelon mosaic virus in the cucumber line TMG-1: tissue-specific expression and relationship to zucchini yellow mosaic virus resistance. Theor. Appl. Genet. 91, 699–706 (1995).
28. Taler, D., Galperin, M., Benjamin, I., Cohen, Y. & Kenigsbuch, D. Plant eR genes that encode photorespiratory enzymes confer resistance against disease. Plant Cell 16, 172–184 (2004).
29. Balkema-Boomstra, A.G. et al. Role of cucurbitacin C in resistance to spider mite Tetranychus urticae in cucumber Cucumis sativus L. J. Chem. Ecol. 29, 225–235 (2003).
30. Da Costa, C.P. & Jones, C.M. Cucumber beetle resistance and mite susceptibility controlled by the bitter gene in Cucumis sativus L. Science 172, 1145–1146 (1971).
31. Phillips, D.R., Rasbery, J.M., Bartel, B. & Matsuda, S.P. Biosynthetic diversity in plant triterpene cyclization. Curr. Opin. Plant Biol. 9, 305–314 (2006).
32. Shibuya, M., Adachi, S. & Ebizuka, Y. Cucurbitadienol synthase, the first committed enzyme for cucurbitacin biosynthesis, is a distinct enzyme from cycloartenol synthase for phytosterol biosynthesis. Tetrahedron 60, 6995–7003 (2004).
33. Field, B. & Osbourn, A.E. Metabolic diversification–independent assembly of operonlike gene clusters in different plants. Science 320, 543–547 (2008).
34. Rudich, J., Halevy, A.H. & Kedar, N. Ethylene evolution from cucumber plants as related to sex expression. Plant Physiol. 49, 998–999 (1972).
35. Pirrung, M.C. Ethylene biosynthesis from 1-aminocyclopropanecarboxylic acid. Acc. Chem. Res. 32, 711–718 (1999).
36. Stepanova, A.N. & Alonso, J.M. Ethylene signaling pathway. Sci. STKE 2005, cm3 (2005).
37. Boualem, A. et al. A conserved mutation in an ethylene biosynthesis enzyme leads to andromonoecy in melons. Science 321, 836–838 (2008).
38. Li, Z. et al. Molecular isolation of the M gene suggests that a conserved-residue conversion induces the formation of bisexual flowers in cucumber plants. Genetics 182, 1381–1385 (2009).
39. Yamagami, T. et al. Biochemical diversity among the 1-amino-cyclopropane-1-carboxylate synthase isozymes encoded by the Arabidopsis gene family. J. Biol. Chem. 278, 49102–49112 (2003).
40. Takahashi, H. & Jaffe, M.J. Further studies of auxin and ACC induced feminization in the cucumber plant using ethylene inhibitors. Phyton (Buenos Aires) 44, 81–86 (1984).
41. DeLong, A., Calderon-Urrea, A. & Dellaporta, S.L. Sex determination gene TASSELSEED2 of maize encodes a short-chain alcohol dehydrogenase required for stage-specific floral organ abortion. Cell 74, 757–768 (1993).
42. Darwin, C.R. The Movements and Habits of Climbing Plants (Murray, London, 1875).
43. Boss, P.K. & Thomas, M.R. Association of dwarfism and floral induction with a grape ‘green revolution’ mutation. Nature 416, 847–850 (2002).
44. Galun, E. The cucumber tendril—a new test organ for gibberellic acid. Cell. Mol. Life Sci. 15, 184–185 (1959).
45. Lange, T. Cloning gibberellin dioxygenase genes from pumpkin endosperm by heterologous expression of enzyme activities in Escherichia coli. Proc. Natl. Acad. Sci. USA 94, 6553–6558 (1997).
46. Braam, J. In touch: plant responses to mechanical stimuli. New Phytol. 165, 373–389 (2005).
47. Cosgrove, D.J. Loosening of plant cell walls by expansins. Nature 407, 321–326 (2000).
48. Lin, M.-K., Lee, Y.-J., Lough, T.J., Phinney, B.S. & Lucas, W.J. Analysis of the pumpkin phloem proteome provides insights into angiosperm sieve tube function. Mol. Cell. Proteomics 8, 343–356 (2009).
49. Rensing, S.A. et al. The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants. Science 319, 64–69 (2008).
50. Aki, T., Shigyo, M., Nakano, R., Yoneyama, T. & Yanagisawa, S. Nano scale proteomics revealed the presence of regulatory proteins including three FT-Like proteins in phloem and xylem saps from rice. Plant Cell Physiol. 49, 767–790 (2008).
51. Giavalisco, P., Kapitza, K., Kolasa, A., Buhtz, A. & Kehr, J. Towards the proteome of Brassica napus phloem sap. Proteomics 6, 896–909 (2006).
52. Dinant, S. et al. Diversity of the superfamily of phloem lectins (phloem protein 2) in angiosperms. Plant Physiol. 131, 114–128 (2003).
53. Pélissier, H.C., Peters, W.S., Collier, R., van Bel, A.J. & Knoblauch, M. GFP tagging of sieve element occlusion (SEO) proteins results in green fluorescent forisomes. Plant Cell Physiol. 49, 1699–1710 (2008).
54. Kim, S.J. et al. Expression of cinnamyl alcohol dehydrogenases and their putative homologues during Arabidopsis thaliana growth and development: lessons for database annotations. Phytochemistry 68, 1957–1974 (2007).

ONLINE METHODS

Removal of contamination for Sanger reads. Sanger reads were aligned against mitochondrion (assembled by us based on the gene sequences of mitochondria of rice and Arabidopsis), chloroplast (GenBank accession code AJ970307) and satellite (GenBank X03768, X03769, X03770, X69163, AY424361 and AY424362) sequences. Reads with identity >95% were filtered.

De novo assembly of Solexa data. The De Bruijn graph method was used to represent all possible sequences assembled by Solexa reads, with a K-mer as a node and the (K − 1) base overlap between two K-mers as an edge. Some tips and low-coverage K-mers in the graph were removed to reduce sequencing errors and eliminate branches. The De Bruijn graph was then converted to a contiging graph by turning a series of linearly connected K-mers into a precontig node. Dijkstra’s algorithm was implemented to detect bubbles, which were then straightforwardly merged into a single path if sequences of the branches were sufficiently similar. By this approach, the repeat regions could be assembled into consensus sequences. Contigs were next connected by paired reads to form a scaffolding graph. Edges in this graph were connections between contigs, and the edge length was estimated from the insert size of the paired reads.
The paired-end information was used step by step, from insert sizes around 200 bp and 500 bp to 2 kb. At each step, two procedures were applied: the repeat-masking method masked the complicated connections around repeat contigs, and the subgraph linearization turned the interleaving contigs into linear structure. This process yielded the final set of Solexa contigs and scaffolds. Combination of Sanger reads and Solexa scaffolds. RePS2 (ref. 55) software was used to assemble the Solexa scaffolds and Sanger reads. We counted the depth of each 17-mer in the 3.9× plasmid and fosmid ends to create the 17-mer database, which contained all the depth information of the 17-mers. This database was then used to check all the contigs to identify repeated ones. A contig was defined as a repeat if over 80% of the 17-mers it contained were with higher depth than the threshold. After removing the repeat contigs, the scaffolds were divided into fake paired reads with read length of 600 bp and insert size of 1,700 bp. All segments over 200 bp were put into the second data set, which was then assembled as a unique region. In the same way as the construction of Solexa scaffolds, the plasmid, fosmid and BAC ends were used, step by step, to construct a ‘superscaffold’. Misassembly checking and gap filling. In the final stage, we used the repeat sequences to fill the gaps in the scaffolds using the following steps. First, we mapped all of the reads that contained paired-end information (Solexa and plasmid reads, as well as fosmid and BAC ends) to the scaffolds, and we used the unique contigs to establish the paired-end relationship between the contigs. Second, we identified repeat contigs with paired ends that uniquely connected two other scaffolded contigs. If the length of the repeat contig and the estimated size of the gap were similar, the gap was filled by this repeat. Any remaining repeat contigs that were not used for gap filling were added into the final set of scaffolds. 
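The 17-mer depth test described above under "Combination of Sanger reads and Solexa scaffolds" can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the RePS2 implementation: the function names (`kmers`, `build_depth_table`, `is_repeat_contig`) are hypothetical, the depth `threshold` is a free parameter because the text does not give its value, and reverse-complement k-mers are ignored for brevity.

```python
from collections import Counter


def kmers(seq, k=17):
    """Yield every overlapping k-mer of seq (forward strand only; an assumption)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_depth_table(reads, k=17):
    """Count how often each k-mer occurs across all reads (the 'k-mer database')."""
    depth = Counter()
    for read in reads:
        depth.update(kmers(read, k))
    return depth


def is_repeat_contig(contig, depth, threshold, k=17, min_fraction=0.80):
    """Call a contig a repeat if over min_fraction of its k-mers exceed the depth threshold."""
    contig_kmers = list(kmers(contig, k))
    if not contig_kmers:
        return False  # contig shorter than k: nothing to test
    high_depth = sum(1 for km in contig_kmers if depth[km] > threshold)
    return high_depth / len(contig_kmers) > min_fraction


# Toy data with k=3 so the example stays readable; the paper uses k=17.
toy_reads = ["ACGACGACG"] * 5 + ["TTTGGGCCA"]
toy_depth = build_depth_table(toy_reads, k=3)
print(is_repeat_contig("ACGACG", toy_depth, threshold=4, k=3))
print(is_repeat_contig("TTTGGG", toy_depth, threshold=4, k=3))
```

In the toy data the first contig is built entirely from k-mers seen many times and the second from k-mers seen once, so only the first is flagged. A production version would also need canonical (strand-independent) k-mers and a threshold derived from the observed depth distribution.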
Chromosome anchoring along the cucumber genetic map. The marker sequences in the cucumber genetic map were aligned against the scaffold sequences using BLASTN at an E-value cutoff of 1 × 10^−20. Hits with coverage >30% and identity >90% were considered mapped markers. Based on the mapped markers, the scaffold sequences were anchored on the cucumber chromosomes. During this process, the scaffolds with mapped markers that showed inconsistent genetic positions were manually checked by paired-end relationships; the incorrect scaffold was then split.

FISH analysis. The FISH protocol was described in a previous study16. To better visualize the segmental inversion, we chose chromosome spreads where chromosome 5 appeared in a straight form. Instead of showing all chromosomes16, only chromosome 5 is shown in Figure 1b of this study. In addition, the image was taken at a higher resolution. Scale bars represent 1 µm, as compared to 3 µm previously16. Red and green signals were detected with anti-digoxigenin antibody coupled to rhodamine (Roche) and by anti-avidin antibody conjugated with FITC (Vector Laboratories), respectively.

Identification of repetitive elements in the cucumber genome. Four de novo software packages, ReAS56, PILER-DF57, RepeatScout58 and LTR_Finder59, were used to search for repeat sequences within the cucumber genome. All repeat sequences with lengths >100 bp and gap ‘N’ <5% constituted the raw transposable element library. The repeat elements belonging to rRNA and satellite sequences were first filtered using BLASTN (E value ≤ 1 × 10^−10, identity ≥ 80%, coverage ≥ 50% and minimal matching length ≥ 100 bp). All-versus-all BLASTN (E value ≤ 1 × 10^−10) searches were then conducted iteratively, and the shorter sequences were filtered when two repeats aligned with identity ≥ 80%, coverage ≥ 80% and minimal matching length ≥ 100 bp; this yielded a nonredundant transposable element library.
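The redundancy filter just described (drop the shorter of two repeats that align with identity ≥ 80%, coverage ≥ 80% and matching length ≥ 100 bp) can be illustrated with a short Python sketch. Several things here are assumptions rather than details from the paper: the `Hit` record and `drop_redundant` function are hypothetical names, coverage is taken relative to the shorter sequence, and a single pass over a precomputed hit table stands in for the iterative BLASTN searches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise alignment between two repeat sequences (hypothetical record)."""
    query: str
    subject: str
    identity: float  # percent identity over the aligned region
    match_len: int   # aligned length in bp


def drop_redundant(lengths, hits, min_identity=80.0, min_coverage=0.80,
                   min_match_len=100):
    """Return the names kept after removing the shorter member of each redundant pair.

    lengths maps sequence name -> length in bp. Coverage is computed against the
    shorter sequence of the pair, which is an assumption of this sketch.
    """
    kept = set(lengths)
    for hit in hits:
        if hit.query == hit.subject:
            continue  # ignore self-hits from the all-versus-all search
        if hit.query not in kept or hit.subject not in kept:
            continue  # one member was already removed
        # Ties in length are broken by name so the result is deterministic.
        shorter = min(hit.query, hit.subject, key=lambda name: (lengths[name], name))
        coverage = hit.match_len / lengths[shorter]
        if (hit.identity >= min_identity
                and coverage >= min_coverage
                and hit.match_len >= min_match_len):
            kept.discard(shorter)
    return kept


toy_lengths = {"rep_a": 1000, "rep_b": 400, "rep_c": 500}
toy_hits = [
    Hit("rep_b", "rep_a", identity=92.0, match_len=380),    # redundant: rep_b dropped
    Hit("rep_c", "rep_a", identity=75.0, match_len=450),    # identity too low: kept
    Hit("rep_a", "rep_a", identity=100.0, match_len=1000),  # self-hit: ignored
]
print(sorted(drop_redundant(toy_lengths, toy_hits)))
```

Because the result depends on the order in which hits are processed, a real pipeline would re-run the search on the reduced library until no qualifying pair remains, which is presumably what "iteratively" refers to in the text.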
The nonredundant repeats were then searched against the Swiss-Prot protein database to filter the protein-coding genes by BLASTX (E value ≤ 1 × 10^−4, identity ≥ 30%, coverage ≥ 30% and minimal matching length ≥ 30 amino acids). After manual curation, a de novo transposable element library for cucumber was obtained. Transposable elements in the cucumber genome assembly were identified at both the DNA and protein levels. RepeatMasker was applied for DNA-level identification using a custom library (a combination of Repbase, plant repeat database and our cucumber de novo transposable element library). At the protein level, RepeatProteinMask was used to conduct WU-BLASTX searches against the transposable element protein database. Overlapping transposable elements belonging to the same type of repeats were integrated together, whereas those with low scores were removed if they overlapped >80% and belonged to different types.

Gene prediction. Our strategy for gene prediction was to conduct de novo predictions on the repeat-masked genome and then integrate them with spliced alignments of proteins and transcripts to genome sequences using GLEAN (ref. 60). Cucumber genome sequences were masked by identified repeat sequences with length >500 bp, except for miniature inverted-repeat transposable elements, which are usually found near genes or inside introns. The EST and full-length cDNA sequences of cucumber were processed by PASA (ref. 61) to train the gene prediction software BGF (ref. 62), GlimmerHMM (ref. 63) and SNAP (ref. 64). Augustus (ref. 65) and Genscan (ref. 66) used gene model parameters trained for Arabidopsis. We aligned the protein sequences of five sequenced plants (Arabidopsis, papaya, poplar, grapevine and rice) onto the cucumber genome using TBLASTN, at an E-value cutoff of 1 × 10^−5, and the homologous genome sequences were aligned against the matching proteins using GeneWise (ref. 67) for accurate spliced alignments.
The cDNA and EST sequences of cucumber and melon were aligned against the cucumber genome using BLAT (identity ≥ 0.95, coverage ≥ 0.90) to generate spliced alignments. We also aligned TIGR unigenes (ref. 68) from Cucurbitales, Fabales and Fagales to the cucumber genome by ATT_gap2 (ref. 69). All of these resources were combined by GLEAN (ref. 60) to produce the consensus gene sets.

Identification of noncoding RNA genes in the cucumber genome. The tRNA genes were identified by tRNAscan-SE (ref. 70) with default parameters. The C/D-box small nucleolar RNAs were identified by Snoscan (ref. 71) using yeast rRNA and yeast methylation sites. Other noncoding RNAs, including miRNA, small nuclear RNA and H/ACA-box small nucleolar RNA, were identified using INFERNAL software by searching against the Rfam (ref. 72) database with default parameters.

Construction of gene families. We adapted the TreeFam (ref. 73) method to construct gene families for the genes in cucumber, Arabidopsis, papaya, poplar, grapevine and rice (outgroup).

Construction of syntenic blocks. We identified syntenic blocks between two species (A and B) by an automatic clustering algorithm on a dot plot graph, which included five steps. First, markers (gene pairs) were generated between A and B. All protein sequences of A were aligned to all proteins of B using BLASTP (E value < 1 × 10^−10 and identity > 20%). The fragmental alignments were conjoined for each gene pair. Those gene pairs with aligned regions covering <50% were filtered. The remaining gene pairs were plotted on the dot graph as markers (points). Second, the Euclidean distance was calculated for each pair. Distances were calculated based on the gene order in each chromosome rather than the genomic position. Third, hierarchical clustering was performed on all of the points. If the distance between two points was less than the distance cutoff, a link was assigned. The distance cutoff was adapted in accordance with the selected species.
Fourth, the quality was estimated for each cluster by calculating the point number (N), average point distance (D) and correlation coefficient (R). A score (S) was calculated to show the overall quality, defined as S = N × sqrt(2)/D × R. Finally, problematic clusters were filtered. Clusters with N < 8 or |R| < 0.5 were filtered out. The clusters caused by tandem duplication were further filtered by requiring the slope (L) of the regression line to fall within the range 0.1 < |L| < 10. This algorithm can also be used to study intraspecies synteny.

4DTv calculation. After the identification of syntenic blocks, the pairwise protein alignments for each gene pair were first constructed with MUSCLE (ref. 74). The nucleotide alignment was then created according to the protein alignment. 4DTv was then calculated on concatenated nucleotide alignments with HKY substitution models (ref. 75).

Comparative analysis between cucumber and melon. Cucumber genome sequences were aligned with melon BAC sequences using NUCmer, a program in the MUMmer package (ref. 76). The delta-filter program was then run with the -1 option to remove complex alignments. Orthologous gene pairs were identified by the reciprocal best method. The Bayesian relaxed molecular clock approach was used to estimate divergence time using the program MULTIDIVTIME, which was implemented using the Thornian Time Traveler (T3) package. The calibration time (fossil record time) interval (54–90 million years ago) of Capparales was obtained from previous results (refs 77, 78).

URLs.
Arabidopsis thaliana (TIGR Release 5.0), ftp://ftp.tigr.org/pub/data/a_thaliana/ath1
Carica papaya (assembly v1.0, EVidence Modeler genes), http://www.life.uiuc.edu/ming
Populus trichocarpa (assembly release v1.0, annotation v1.1), http://genome.jgi-psf.org/Poptr1_1/Poptr1_1.download.ftp.html
Vitis vinifera (published assembly, annotation v1), http://www.genoscope.cns.fr/externe/GenomeBrowser/Vitis/
Oryza sativa (assembly International Rice Genome Sequencing Project build 3), http://rgp.dna.affrc.go.jp/IRGSP/download.html
Oryza sativa (GLEAN genes annotated by Beijing Genomics Institute), ftp.genomics.org.cn/pub/ricedb/rice_update_data/GLEAN_genes/IRGSP_japonica/
Physcomitrella patens (assembly release v1.0, annotation v1.1), http://genome.jgi-psf.org/Phypa1_1/Phypa1_1.home.html
Sorghum bicolor (assembly release v1.0, annotation v1.4), http://www.phytozome.net/sorghum
UniGene sequences of Cucurbitales, Fabales and Fagales, http://plantta.jcvi.org/
Cucumber marker sequences, http://cucumber.genomics.org.cn
UniProt (Swiss-Prot/TrEMBL) release 14.1, http://www.uniprot.org/downloads
InterPro v18.0, http://www.ebi.ac.uk/interpro/
KEGG release 47, ftp://ftp.genome.jp/pub/kegg/pathway/
Repbase release 13.07, http://www.girinst.org/repbase/index.html
Plant Repeat Databases (TIGR), http://plantrepeats.plantbiology.msu.edu/index.html
Rfam release 9.0, http://rfam.sanger.ac.uk/
Thornian Time Traveler (T3) package, http://abacus.gene.ucl.ac.uk/software.html
RepeatMasker, http://www.repeatmasker.org

55. Wang, J. et al. RePS: a sequence assembler that masks exact repeats identified from the shotgun data. Genome Res. 12, 824–831 (2002).
56. Li, R. et al. ReAS: Recovery of ancestral sequences for transposable elements from the unassembled reads of a whole genome shotgun. PLOS Comput. Biol. 1, e43 (2005).
57. Edgar, R.C. & Myers, E.W. PILER: identification and classification of genomic repeats. Bioinformatics 21 Suppl 1, i152–i158 (2005).
58. Price, A.L., Jones, N.C. & Pevzner, P.A. De novo identification of repeat families in large genomes. Bioinformatics 21 Suppl 1, i351–i358 (2005).
59. Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265–W268 (2007).
60. Elsik, C.G. et al. Creating a honey bee consensus gene set. Genome Biol. 8, R13 (2007).
61. Campbell, M.A., Haas, B.J., Hamilton, J.P., Mount, S.M. & Buell, C.R. Comprehensive analysis of alternative splicing in rice and comparative analyses with Arabidopsis. BMC Genomics 7, 327 (2006).
62. Li, H. et al. Test data sets and evaluation of gene prediction programs on the rice genome. J Comp Sci Tech 20, 446–453 (2005).
63. Majoros, W.H., Pertea, M. & Salzberg, S.L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 (2004).
64. Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
65. Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19 Suppl 2, ii215–ii225 (2003).
66. Salamov, A.A. & Solovyev, V.V. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516–522 (2000).
67. Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res. 14, 988–995 (2004).
68. Childs, K.L. et al. The TIGR Plant Transcript Assemblies database. Nucleic Acids Res. 35, D846–D851 (2007).
69. Huang, X., Adams, M.D., Zhou, H. & Kerlavage, A.R. A tool for analyzing and annotating genomic sequences. Genomics 46, 37–45 (1997).
70. Lowe, T.M. & Eddy, S.R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).
71. Lowe, T.M. & Eddy, S.R. A computational screen for methylation guide snoRNAs in yeast. Science 283, 1168–1171 (1999).
72. Griffiths-Jones, S. et al. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 33, D121–D124 (2005).
73. Li, H. et al. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. 34, D572–D580 (2006).
74. Edgar, R.C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
75. Hasegawa, M., Kishino, H. & Yano, T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22, 160–174 (1985).
76. Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004).
77. Crepet, W.L., Nixon, K.C. & Gandolfo, M.A. Fossil evidence and phylogeny: the age of major angiosperm clades based on mesofossil and macrofossil evidence from Cretaceous deposits. Am. J. Botany 91, 1666–1682 (2004).
78. Wikström, N., Savolainen, V. & Chase, M.W. Evolution of the angiosperms: calibrating the family tree. Proc. Biol. Sci. 268, 2211–2220 (2001).

Nature Genetics, doi:10.1038/ng.475. © 2009 Nature America, Inc. All rights reserved.

Vol 463 | 14 January 2010 | doi:10.1038/nature08670
ARTICLES

Genome sequence of the palaeopolyploid soybean

Jeremy Schmutz^1,2, Steven B. Cannon^3, Jessica Schlueter^4,5, Jianxin Ma^5, Therese Mitros^6, William Nelson^7, David L. Hyten^8, Qijian Song^8,9, Jay J. Thelen^10, Jianlin Cheng^11, Dong Xu^11, Uffe Hellsten^2, Gregory D. May^12, Yeisoo Yu^13, Tetsuya Sakurai^14, Taishi Umezawa^14, Madan K. Bhattacharyya^15, Devinder Sandhu^16, Babu Valliyodan^17, Erika Lindquist^2, Myron Peto^3, David Grant^3, Shengqiang Shu^2, David Goodstein^2, Kerrie Barry^2, Montona Futrell-Griggs^5, Brian Abernathy^5, Jianchang Du^5, Zhixi Tian^5, Liucun Zhu^5, Navdeep Gill^5, Trupti Joshi^11, Marc Libault^17, Anand Sethuraman^1, Xue-Cheng Zhang^17, Kazuo Shinozaki^14, Henry T. Nguyen^17, Rod A. Wing^13, Perry Cregan^8, James Specht^18, Jane Grimwood^1,2, Dan Rokhsar^2, Gary Stacey^10,17, Randy C. Shoemaker^3 & Scott A. Jackson^5

Soybean (Glycine max) is one of the most important crop plants for seed protein and oil content, and for its capacity to fix atmospheric nitrogen through symbioses with soil-borne microorganisms. We sequenced the 1.1-gigabase genome by a whole-genome shotgun approach and integrated it with physical and high-density genetic maps to create a chromosome-scale draft sequence assembly. We predict 46,430 protein-coding genes, 70% more than Arabidopsis and similar to the poplar genome which, like soybean, is an ancient polyploid (palaeopolyploid).
About 78% of the predicted genes occur in chromosome ends, which comprise less than one-half of the genome but account for nearly all of the genetic recombination. Genome duplications occurred at approximately 59 and 13 million years ago, resulting in a highly duplicated genome with nearly 75% of the genes present in multiple copies. The two duplication events were followed by gene diversification and loss, and numerous chromosome rearrangements. An accurate soybean genome sequence will facilitate the identification of the genetic basis of many soybean traits, and accelerate the creation of improved soybean varieties.

Legumes are an important part of world agriculture as they fix atmospheric nitrogen by intimate symbioses with microorganisms. The soybean in particular is important worldwide as a predominant plant source of both animal feed protein and cooking oil. We report here a soybean whole-genome shotgun sequence of Glycine max var. Williams 82, comprising 950 megabases (Mb) of assembled and anchored sequence (Fig. 1), representing about 85% of the predicted 1,115-Mb genome (ref. 1) (Supplementary Table 3.1). Most of the genome sequence (Fig. 1) is assembled into 20 chromosome-level pseudomolecules containing 397 sequence scaffolds with ordered positions within the 20 soybean linkage groups. An additional 17.7 Mb is present in 1,148 unanchored sequence scaffolds that are mostly repetitive and contain fewer than 450 predicted genes. Scaffold placements were determined with extensive genetic maps, including 4,991 single nucleotide polymorphisms (SNPs) and 874 simple sequence repeats (SSRs) (refs 2–5). All but 20 of the 397 sequence scaffolds are unambiguously oriented on the chromosomes. Unoriented scaffolds are in repetitive regions where there is a paucity of recombination and genetic markers (see Supplementary Information for assembly details).
The soybean genome is the largest whole-genome shotgun-sequenced plant genome so far and compares favourably to all other high-quality draft whole-genome shotgun-sequenced plant genomes (Supplementary Table 4). A total of 8 of the 20 chromosomes have telomeric repeats (TTTAGGG or CCCTAAA) on both of the distal scaffolds and 11 other chromosomes have telomeric repeats on a single arm, for a total of 27 out of 40 chromosome ends captured in sequence scaffolds. Also, internal scaffolds in 19 of 20 chromosomes contain a large block of characteristic 91- or 92-base-pair (bp) centromeric repeats (refs 6, 7) (Fig. 1). Four chromosome assemblies contain several 91/92-bp blocks; this may be the correct physical placement of these sequences, or may reflect the difficulty in assembling these highly repetitive regions.

Gene composition and repetitive DNA

A striking feature of the soybean genome is that 57% of the genomic sequence occurs in repeat-rich, low-recombination heterochromatic regions surrounding the centromeres. The average ratio of genetic-to-physical distance is 1 cM per 197 kb in euchromatic regions, and 1 cM per 3.5 Mb in heterochromatic regions (see Supplementary Information section 1.8). For reference, these proportions are similar to those in Sorghum, in which 62% of the sequence is heterochromatic, and different from those in rice, with 15% in heterochromatin (ref. 8). In general, these boundaries, determined on the basis of suppressed recombination, correlate with transitions in gene density and transposon density. Ninety-three per cent of the recombination occurs in the repeat-poor, gene-rich euchromatic genomic region that only accounts for 43% of the genome. Nevertheless, 21.6% of the high-confidence genes are found in the repeat- and transposon-rich regions in the chromosome centres.

[Figure 1 image omitted. Rows: Chr1-D1a, Chr2-D1b, Chr3-N, Chr4-C1, Chr5-A1, Chr6-C2, Chr7-M, Chr8-A2, Chr9-K, Chr10-O, Chr11-B1, Chr12-H, Chr13-F, Chr14-B2, Chr15-E, Chr16-J, Chr17-D2, Chr18-G, Chr19-L, Chr20-I. Horizontal axis: position, 0 to 60 Mb.]
Figure 1 | Genomic landscape of the 20 assembled soybean chromosomes. Major DNA components are categorized into genes (blue), DNA transposons (green), Copia-like retrotransposons (yellow), Gypsy-like retrotransposons (cyan) and Cent91/92 (a soybean-specific centromeric repeat (pink)), with respective DNA contents of 18%, 17%, 13%, 30% and 1% of the genome sequence. Unclassified DNA content is coloured grey. Categories were determined for 0.5-Mb windows with a 0.1-Mb shift.

Author affiliations:
1 HudsonAlpha Genome Sequencing Center, 601 Genome Way, Huntsville, Alabama 35806, USA.
2 Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, California 94598, USA.
3 USDA-ARS Corn Insects and Crop Genetics Research Unit, Ames, Iowa 50011, USA.
4 Department of Bioinformatics and Genomics, 9201 University City Blvd, University of North Carolina at Charlotte, Charlotte, North Carolina 28223, USA.
5 Department of Agronomy, Purdue University, 915 W. State Street, West Lafayette, Indiana 47906, USA.
6 Center for Integrative Genomics, University of California, Berkeley, California 94720, USA.
7 Arizona Genomics Computational Laboratory, BIO5 Institute, 1657 E. Helen Street, The University of Arizona, Tucson, Arizona 85721, USA.
8 USDA, ARS, Soybean Genomics and Improvement Laboratory, B006, BARC-West, Beltsville, Maryland 20705, USA.
9 Department Plant Science and Landscape Architecture, University of Maryland, College Park, Maryland 20742, USA.
10 Division of Biochemistry & Interdisciplinary Plant Group, 109 Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, Missouri 65211, USA.
11 Department of Computer Science, University of Missouri, Columbia, Missouri 65211, USA.
12 The National Center for Genome Resources, 2935 Rodeo Park Drive East, Santa Fe, New Mexico 87505, USA.
13 Arizona Genomics Institute, School of Plant Sciences, University of Arizona, Tucson, Arizona 85721, USA.
14 RIKEN Plant Science Center, Yokohama 230-0045, Japan.
15 Department of Agronomy, Iowa State University, Ames, Iowa 50011, USA.
16 Department of Biology, University of Wisconsin-Stevens Point, Stevens Point, Wisconsin 54481, USA.
17 National Center for Soybean Biotechnology, Division of Plant Sciences, University of Missouri, Columbia, Missouri 65211, USA.
18 Department of Agronomy and Horticulture, University of Nebraska, Lincoln, Nebraska 68583, USA.
©2010 Macmillan Publishers Limited. All rights reserved

We identified 46,430 high-confidence protein-coding loci in the soybean genome, using a combination of full-length complementary DNAs (ref. 9), expressed sequence tags, homology and ab initio methods (Supplementary Information section 2). Another ~20,000 loci were predicted with lower confidence; this set is enriched for hypothetical, partial and/or transposon-related sequences, and possesses shorter coding sequences and fewer introns than the high-confidence set. The exon–intron structure of genes shows high conservation among soybean, poplar and grapevine, consistent with a high degree of position and phase conservation found more broadly across angiosperms (ref. 10). Introns in soybean gene pairs retained in duplicate have a strong tendency to persist. Of 19,775 introns shared by poplar and grapevine (diverged more than 90 million years (Myr) ago (ref. 11)), and hence by the last common ancestor of soybean and grapevine, 19,666 (99.45%) were preserved in both copies in soybean. Of the remaining 0.55%, 78% are absent in both recent soybean copies (that is, lost before the ~13-Myr-ago duplication) and 22% are found only in one paralogue (that is, other copy lost).
We find a slower intron loss rate in poplar (0.4%) than in soybean (0.6%) since the last common rosid ancestor, which is consistent with the slower rate of sequence evolution in the poplar lineage thought to be associated with its perennial, clonal habit, global distribution and wind pollination (ref. 12). Intron size is also highly conserved in recent soybean paralogues, indicating that few insertions and deletions have accumulated within introns over the past 13 Myr.

Of the 46,430 high-confidence loci, 34,073 (73%) are clearly orthologous with one or more sequences in other angiosperms, and can be assigned to 12,253 gene families (Supplementary Table 5). Among pan-angiosperm or pan-rosid gene families that also have members outside the legumes, soybean is particularly enriched (using a Fisher’s exact test relative to Arabidopsis) in genes containing NB-ARC (nucleotide-binding-site-APAF1-R-Ced) and LRR (leucine-rich-repeat) domains. These genes are associated with the plant immune system, and are known to be dynamic (ref. 13). Tandem gene family expansions are common in soybean and include NBS-LRR, F-box, auxin-responsive protein, and other domains commonly found in large gene families in plants. The ages of genes in these tandem families, inferred from intrafamily sequence divergence, indicate that they originated at various times in the evolutionary history of soybean, rather than in a discrete burst.

From protein families in the sequenced angiosperms (http://www.phytozome.net) (Supplementary Table 4), we identified 283 putative legume-specific gene families containing 448 high-confidence soybean genes (Supplementary Information section 2). These gene families include soybean and Medicago representatives, but no representatives from grapevine, poplar, Arabidopsis, papaya, or grass (Sorghum, rice, maize, Brachypodium). The top domains in this set are the AP2 domain, protein kinase domain, cytochrome P450, and PPR repeat.
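The domain-enrichment comparison mentioned earlier in this section (a Fisher's exact test relative to Arabidopsis) can be illustrated with a small one-sided test written with the standard library only. This is a sketch under stated assumptions, not the authors' procedure: the function name is hypothetical, and the domain counts in the usage example are invented placeholders, not values from the paper. Only the genome-wide totals (46,430 soybean and 32,825 Arabidopsis loci) come from the text.

```python
from math import comb


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact test, P(X >= a), for the 2x2 table [[a, b], [c, d]].

    a: genome 1 genes with the domain    b: genome 1 genes without it
    c: genome 2 genes with the domain    d: genome 2 genes without it
    """
    row1 = a + b          # all genes in genome 1
    col1 = a + c          # all domain-containing genes in both genomes
    total = a + b + c + d
    upper = min(row1, col1)
    # Sum the hypergeometric probabilities of tables at least as extreme as observed.
    tail = sum(comb(col1, i) * comb(total - col1, row1 - i) for i in range(a, upper + 1))
    return tail / comb(total, row1)


if __name__ == "__main__":
    # Placeholder domain counts (400 and 150) are hypothetical, for illustration only.
    soybean_with, arabidopsis_with = 400, 150
    p = fisher_enrichment_p(
        soybean_with, 46_430 - soybean_with,
        arabidopsis_with, 32_825 - arabidopsis_with,
    )
    print(f"one-sided P = {p:.3g}")
```

A small P value would indicate that the domain is over-represented in the first genome relative to the second, given the sizes of the two gene sets.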
An additional 741 putatively soybean-specific gene families (each consisting of two or more high-confidence soybean genes) may also include legume-specific genes that have not yet been sequenced in the ongoing Medicago sequencing project, or may represent bona fide soybean-specific genes. The top domains in this list include protein kinase and protein tyrosine kinase, AP2, LRR, MYB-like DNA binding domain, cytochrome P450 (the same domains most common in the entire soybean proteome) as well as GDSL-like lipase/acylhydrolase and stress-upregulated Nod19.

A combination of structure-based analyses and homology-based comparisons resulted in identification of 38,581 repetitive elements, covering most types of plant transposable elements. These elements, together with numerous truncated elements and other fragments, make up ~59% of the soybean genome (Supplementary Table 6). Long terminal repeat (LTR) retrotransposons are the most abundant class of transposable elements. The soybean genome contains ~42% LTR retrotransposons, a smaller fraction than in Sorghum (ref. 8) and maize (ref. 14), but a larger one than in rice (ref. 15). The intact element sizes range from 1 kb to 21 kb, with an average size of 8.7 kb (Supplementary Fig. 2). Of the 510 families containing 14,106 intact elements, 69% are Gypsy-like and the remainder Copia-like. However, most (~78%) of these families are present at low copy numbers, typically fewer than 10 copies. The genome also contains an estimated 18,264 solo LTRs, probably caused by homologous recombination between LTRs from a single element. Nested retrotransposons are common, with 4,552 nested insertion events identified. The copy numbers within each block range from one to six. The genome consists of ~17% DNA transposons (class II transposable elements), divided into Tc1/Mariner, hAT, Mutator, PIF/Harbinger, Pong, CACTA superfamilies and Helitrons.
Of these superfamilies, those containing more than 65 complete copies, Tc1/Mariner and Pong, comprise ~0.1% of the genome sequence and seem not to have undergone recent amplification, indicating that they may be inactive and relatively old. Conversely, other families seem to have amplified recently and may still be active, as indicated by the high similarity (>98%) of multiple elements.

Multiple whole-genome duplication events

Timing and phylogenetic position. A striking feature of the soybean genome is the extent to which blocks of duplicated genes have been retained. On the basis of previous studies that examined pairwise synonymous distance (Ks values) of paralogues (refs 16, 17), and targeted sequencing of duplicated regions within the soybean genome (ref. 18), we expected that large homologous regions would be identified in the genome. Using a pattern-matching search, gene families of sizes from two to six were identified, and Ks values were calculated for these genes, here displayed as a histogram plot (Fig. 2), which shows two distinct peaks. Similarly, nucleotide diversity for the fourfold synonymous third-codon transversion position, 4dTv, was calculated. Both metrics give a measure of divergence between two genes, but the 4dTv uses a subset of the sites (transitions/transversion) used in the computation of Ks. 31,264 high-confidence soybean genes have recent paralogues with Ks ≈ 0.13 synonymous substitutions per site and 4dTv ≈ 0.0566 synonymous transversions per site (Fig. 3), corresponding to a soybean-lineage-specific palaeotetraploidization. This was probably an allotetraploidy event based on chromosomal evidence (ref. 19). Of the 46,430 high-confidence genes, 31,264 exist as paralogues and 15,166 have reverted to singletons. We infer that the pre-duplication proto-soybean genome possessed ~30,000 genes: half of (2 × 15,166 + 2 × 15,632) = 30,798. This number is comparable to the modern Arabidopsis gene complement.

[Figure 2 image omitted. Top panels a to c: circular chromosome diagrams. Bottom panel: histogram of pairs (%) against synonymous distance.]
Figure 2 | Homologous relationships between the 20 soybean chromosomes. The bottom histogram plot shows pairwise Ks values for gene family sizes 2 to 6. Top panels show the 20 chromosomes in a circle with lines connecting homologous genes. Gene-rich regions (euchromatin) of each chromosome are coded a different colour around the circle. Grey represents Ks values of 0.06–0.39, 13-Myr genome duplication; black represents Ks values of 0.40–0.80, 59-Myr genome duplication. These correspond to the grey and black bars in the histogram. a, Chromosomes 1, 11, 17, 7, 10 and 3, which contain centromeric repeat Sb91. b, Chromosomes 19 and 6, which contain both Sb91 and Sb92 centromeric repeats. c, Chromosomes 18, 5, 8, 2, 14, 12, 13, 9, 15, 16, 20 and 4, which contain Sb92.

A second paralogue peak at Ks ≈ 0.59 (4dTv ≈ 0.26) corresponds to the early-legume duplication, which several lines of evidence suggest occurred near the origins of the papilionoid lineage (ref. 20). The papilionoid origin has been dated to approximately 59 Myr ago (ref. 21). A third highly diffuse peak is seen when the plot is expanded past a Ks value of 1.5 (data not shown) and most probably corresponds to the ‘gamma’ event (ref. 22), shown to be a triplication in Vitis (ref. 23) and in other angiosperms (ref. 24). Owing to the existence of macrofossils in the legumes and allies, the timing of clade origins in the legumes is better established than in other plant families.
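The proto-soybean gene estimate quoted in this section can be reproduced with a few lines of arithmetic. The sketch below follows the reasoning in the text (each singleton stands for a pair that lost one member, so the immediate post-duplication genome is taken to hold two copies of every singleton and two of every retained pair); the function name is hypothetical.

```python
def pre_duplication_gene_count(total_genes, genes_in_pairs):
    """Return (singletons, retained pairs, inferred pre-duplication gene count)."""
    singletons = total_genes - genes_in_pairs   # genes whose duplicate was lost
    pairs = genes_in_pairs // 2                 # retained duplicate pairs
    post_duplication = 2 * singletons + 2 * pairs
    return singletons, pairs, post_duplication // 2


singletons, pairs, proto = pre_duplication_gene_count(46_430, 31_264)
print(singletons, pairs, proto)
```

With the counts given in the text, this yields 15,166 singletons, 15,632 retained pairs and an inferred proto-soybean complement of 30,798 genes.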
A fossil-calibrated molecular clock for the legumes places the origin of the legume stem clade and the oldest papilionoid crown clade at 58 to 60 Myr ago (ref. 21). If the early-legume whole-genome duplication (WGD) occurred outside the papilionoid lineage, as suggested by map evidence from Arachis (an early-diverging genus in the papilionoid clade) (ref. 20), then the duplication occurred within the narrow window of time between the origin of the legumes and the papilionoid radiation. If the older duplication is assumed to have occurred around 58 Myr ago, then the calculated rate of silent mutations extending back to the duplication would be 5.17 × 10^−3, similar to previous estimates of 5.2 × 10^−3 (ref. 21). The Glycine-specific duplication is estimated to have occurred ~13 Myr ago, an age consistent with previous estimates (refs 16, 17).

[Figure 3 image omitted. Vertical axis: gene pairs on syntenic segments (0 to 10,000). Horizontal axis: 4dTv distance, corrected for multiple substitutions (0 to 1.4). Series: Poplar–soybean, Grapevine–soybean, Arabidopsis–soybean, Rice–soybean, Medicago–soybean, Soybean–soybean.]
Figure 3 | Distribution of 4dTv distance between syntenically orthologous genes. Segments were found by locating blocks of BLAST hits with significance 1 × 10^−18 or better with fewer than 10 intervening genes between such hits. The 4dTv distance between orthologous genes on these segments is reported.

Structural organization. We identified homologous blocks within the genome using i-ADHoRe (ref. 25). Using relatively stringent parameters, 442 multiplicons (that is, duplicated segments) were identified within the soybean genome and visualized using Circos (ref. 26) (Fig. 2). Owing to the multiple rounds of duplication and diploidization in the genome, as well as chromosomal rearrangements, multiplicons (or blocks) between chromosomes can involve more than just two chromosomes. On average, 61.4% of the homologous genes are found in blocks involving only two chromosomes, only 5.63% spanning three chromosomes, and 21.53% traversing four chromosomes.
Two notable exceptions to this pattern are chromosome 14, which has 11.8% of its genes retained across three chromosomes, and chromosome 20 with 7.08% of the homologues (gene pairs resulting from genome duplication) retained across four chromosomes. Chromosome 14 seems to be a highly fragmented chromosome with block matches to 14 other chromosomes, the highest number of all chromosomes. Conversely, chromosome 20 is highly homologous to the long arm of chromosome 10, with few matches elsewhere in the genome.

Retention of homologues across the genome is exceptionally high; blocks retained in two or more chromosomes can be clearly observed (Fig. 2 and Supplementary Figs 5 and 6). The number of homologues (gene pairs) within a block averages 31, although any given block may contain from 6 to 736 homologues. Given that not all genes within a block are retained as homologues (owing to loss of duplicated genes over time (fractionation)), the average number of genes in a block is ~75 genes and ranges from 8 to 1,377 genes.

Repeated duplications in the soybean genome make it possible to determine rates of gene loss following each round of polyploidy. In homologous segments from the 13-Myr-old Glycine duplication, 43.4% of genes have matches in the corresponding region, in contrast to 25.9% in blocks from the early legume duplication. Combining these gene-loss rates with WGD dates of 13 Myr ago and 59 Myr ago, the rate of gene loss has been 4.36% of genes per Myr following the Glycine WGD and 1.28% of genes per Myr following the early-legume WGD. This differential in gene-loss rates indicates an exponential decay pattern of rapid gene loss after duplication, slowing over time.
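The average gene-loss rates above can be approximated by dividing the fraction of genes without a match by the age of each duplication. The sketch below assumes a simple linear average (percentage lost divided by elapsed time); the function name is hypothetical, and the exact inputs and rounding behind the published 4.36% and 1.28% figures are not given in the text, so the results are expected to be close to those values rather than identical.

```python
def mean_loss_rate(retained_pct, age_myr):
    """Average percentage of genes lost per Myr since a duplication.

    retained_pct: percentage of genes that still have a match in the
    corresponding duplicated segment; age_myr: age of the duplication in Myr.
    """
    return (100.0 - retained_pct) / age_myr


glycine_rate = mean_loss_rate(43.4, 13)       # Glycine WGD, ~13 Myr ago
early_legume_rate = mean_loss_rate(25.9, 59)  # early-legume WGD, ~59 Myr ago
print(f"{glycine_rate:.2f}% per Myr, {early_legume_rate:.2f}% per Myr")
```

The younger duplication shows a markedly higher average rate, which is the pattern the text interprets as rapid loss soon after duplication followed by a slowdown.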
Nodulation and oil biosynthesis genes

A unique feature of legumes is their ability to establish nitrogen-fixing symbioses with soil bacteria of the family Rhizobiaceae. Therefore, information on the nodulation functions of the soybean genome is of particular interest. Sequence comparisons with previously identified nodulation genes identified 28 nodulin genes and 24 key regulatory genes, which probably represent true orthologues of known nodulation genes in other legume species (Supplementary section 3 and Supplementary Table 8). Among this list of 52 genes, 32 have at least one highly conserved homologue gene. We hypothesize that these are homologous gene pairs arising from the Glycine WGD (that is, ~13 Myr ago). Further analysis shows that seven soybean nodulin genes produce transcript variants. The exceptional example is nodulin-24 (Glyma14g05690), which seems to produce ten transcript variants (Supplementary Table 8). In total, 25% of the examined nodulin genes produce transcript variants, which is slightly higher than the incidence of alternative splicing in Arabidopsis (~21.8%) and rice (~21.2%)27. However, none of the soybean regulatory nodulation genes produces transcript variants (Supplementary Table 8).

Mining the soybean genome for genes governing metabolic steps in triacylglycerol biosynthesis could prove beneficial in efforts to modify soybean oil composition or content. Genomic analysis of acyl lipid biosynthesis in Arabidopsis revealed 614 genes involved in pathways beginning with plastid acetyl-CoA production for de novo fatty acid synthesis through cuticular wax deposition28. Comparison of these sequences to the soybean genome identified 1,127 putative orthologous and paralogous genes in soybean. This is probably a low estimate owing to the high stringency conditions used for gene mining. The distribution of these genes according to various functional classes of acyl lipid biosynthesis is shown in Table 1. Comparing Arabidopsis to soybean, the number of genes involved in storage lipid synthesis, fatty acid elongation and wax/cutin production was similar. For all other subclasses, the soybean genome contained substantially higher numbers of genes. Interestingly, the number of genes involved in lipid signalling, degradation of storage lipids, and membrane lipid synthesis was two- to threefold higher in soybean than Arabidopsis, indicating that these areas of acyl lipid synthesis are more complex in soybean. The number of genes involved in plastid de novo fatty acid synthesis was 63% higher in soybean compared to Arabidopsis. Many single-gene activities in Arabidopsis are encoded by multigene families in soybean, including ketoacyl-ACP synthase II (12 copies in soybean), malonyl-CoA:ACP malonyltransferase (2 copies), enoyl-ACP reductase (5 copies), acyl-ACP thioesterase FatB (6 copies) and plastid homomeric acetyl-CoA carboxylase (3 copies). Long-chain acyl-CoA synthetases, ER acyltransferases, mitochondrial glycerol-phosphate acyltransferases, and lipoxygenases are all unusually large gene families in soybean, containing as many as 24, 21, 20 and 52 members, respectively. The multigenic nature of these and many other activities involved in acyl lipid metabolism suggests the potential for more complex transcriptional control in soybean compared to Arabidopsis.

Table 1 | Putative acyl lipid genes in Arabidopsis and soybean
Function category of acyl lipid genes | Number in Arabidopsis | Number in soybean
Synthesis of fatty acids in plastids | 46 | 75
Synthesis of membrane lipids in plastids | 20 | 33
Synthesis of membrane lipids in endomembrane system | 56 | 117
Metabolism of acyl lipids in mitochondria | 29 | 69
Synthesis and storage of oil | 19 | 22
Degradation of storage lipids and straight fatty acids | 43 | 155
Lipid signalling | 153 | 312
Fatty acid elongation and wax and cutin metabolism | 73 | 70
Miscellaneous | 175 | 274
Total | 614 | 1,127

Transcription factor diversity

We identified soybean transcription factor genes by sequence comparison to known transcription factor gene families, as well as by searching for known DNA-binding domains. In total, 5,671 putative soybean transcription factor genes, distributed in 63 families, were identified (Fig. 4a and Supplementary Table 9). This number represents 12.2% of the 46,430 predicted soybean protein-coding loci. A similar analysis performed on the Arabidopsis genome identified 2,315 putative Arabidopsis transcription factor genes, representing 7.1% of the 32,825 predicted Arabidopsis protein-coding loci (Fig. 4b).

Figure 4 | Distribution of soybean (a) and Arabidopsis (b) transcription factor genes in different transcription factor families. Only the distribution of the most representative transcription families is detailed here. AUX-IAA-ARF, indole-3-acetic acid-auxin response factor; BTB/POZ, bric-à-brac tramtrack broad complex/pox viruses and zinc fingers; BZIP, basic leucine zipper; GRAS, (GAI, RGA, SCR); NAC, (NAM, ATAF1/2, CUC2); PHD, plant homeodomain-finger transcription factor; TCP, (TB1, CYC, PCF); TFs, transcription factors; TPR, tetratricopeptide repeat; WRKY, conserved amino acid sequence WRKYGQK at its N-terminal end.

Transcription factor genes are homogeneously distributed across the chromosomes in both soybean and Arabidopsis, with an average relative abundance of 8–10% transcription factor genes on each chromosome. On rare occasions, regions were identified in both genomes that had a relatively low (<5%) or high density (>12%) of transcription factor genes. Among the transcription factor genes identified, 9.5% of soybean genes (538 transcription factor genes) and 8.2% of Arabidopsis genes (190 Arabidopsis transcription factor genes) are tandemly duplicated. By way of example, only one region in Arabidopsis has more than five duplicated transcription factor genes in tandem (seven ABI3/VP1 genes (At4G31610 to At4G31660)), whereas in soybean several such regions are present (for example, 13 C3H-type 1 (Zn) (Glyma15g19120 to Glyma15g19240); six MYB/HD-like (Glyma06g45520 to Glyma06g45570); and five MADS (Glyma20g27320 to Glyma20g27360); Supplementary Table 8).

The overall distribution of transcription factor genes among the various known protein families is very similar between Arabidopsis and soybean (Supplementary Fig. 10a, b). However, some families are relatively sparser or more abundant in soybean, perhaps reflecting differences in biological function. For example, members of the ABI3/VP1 family are 2.2-times more abundant in Arabidopsis, whereas members of the TCP family are 4.4-times more abundant in soybean. In addition, those gene families with fewer members are differentially represented between soybean and Arabidopsis. FHA, HD-Zip (homeodomain/leucine zipper), PLATZ, SRS and TUB transcription factor genes are more abundant in soybean (2.7, 2.9, 4.1, 3, and 4.9 times, respectively) and HTH-ARAC (helix–turn–helix araC/xylS-type) genes were identified exclusively in soybean.
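The relative-abundance comparisons above can be reproduced with a short calculation: each family count is first expressed as a share of that species' transcription factor genes, and the two shares are then compared. The sketch below is illustrative only. The totals (5,671 and 2,315 transcription factor genes; 46,430 and 32,825 protein-coding loci) are from the text, whereas the family counts (ABI3/VP1: 78 and 71; TCP: 65 and 6) are read from the Figure 4 labels, and assigning the first count of each pair to soybean is an assumption that happens to match the quoted ratios. The function names are hypothetical.

```python
# Totals stated in the text.
TOTAL_TF_GENES = {"soybean": 5671, "arabidopsis": 2315}
TOTAL_LOCI = {"soybean": 46430, "arabidopsis": 32825}

# Family counts from the Figure 4 labels. Assumption: the assignment of each
# count to a species is inferred, not stated explicitly in the text.
FAMILY_COUNTS = {
    "ABI3/VP1": {"soybean": 78, "arabidopsis": 71},
    "TCP": {"soybean": 65, "arabidopsis": 6},
}


def family_share(family: str, species: str) -> float:
    """Fraction of a species' transcription factor genes in one family."""
    return FAMILY_COUNTS[family][species] / TOTAL_TF_GENES[species]


def fold_difference(family: str, richer: str, poorer: str) -> float:
    """How many times larger the family's share is in `richer` than in `poorer`."""
    return family_share(family, richer) / family_share(family, poorer)


abi3_fold = fold_difference("ABI3/VP1", richer="arabidopsis", poorer="soybean")
tcp_fold = fold_difference("TCP", richer="soybean", poorer="arabidopsis")

# Share of all predicted protein-coding loci that are transcription factor genes.
tf_percent = {
    species: 100.0 * TOTAL_TF_GENES[species] / TOTAL_LOCI[species]
    for species in TOTAL_TF_GENES
}

print(f"ABI3/VP1 enrichment in Arabidopsis: {abi3_fold:.1f}x")
print(f"TCP enrichment in soybean: {tcp_fold:.1f}x")
print({species: round(pct, 1) for species, pct in tf_percent.items()})
```

Comparing shares rather than raw counts matters here: soybean has more ABI3/VP1 genes in absolute terms under this assignment, yet the family makes up a larger share of the Arabidopsis transcription factor set.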
In contrast, HSF, HTH-FIS (helix–turn–helix-factor for inversion stimulation), TAZ and U1-type (Zn) genes are present in relatively larger numbers in Arabidopsis (5.4, 4.9, 24.5 and 2.9 times, respectively). Notably, ABI3/VP1, TCP, SRS and Tubby transcription factor genes were all shown to have critical roles in plant development (for example, ABI3/VP1 during seed development; TCP, SRS and Tubby affect overall plant development29–33). The differences seen in relative transcription factor gene abundance indicate that regulatory pathways in soybean may differ from those described in Arabidopsis.

Impact on agriculture

Hundreds of qualitatively inherited (single gene) traits have been characterized in soybean and many genetically mapped. However, most important crop production traits and those important to seed quality for human health, animal nutrition and biofuel production are quantitatively inherited. The regions of the genome containing DNA sequence affecting these traits are called quantitative trait loci (QTL). QTL mapping studies have been ongoing for more than 90 distinct traits of soybean including plant developmental and reproductive characters, disease resistance, seed quality and nutritional traits. In most cases, the causal functional gene or transcription factor underlying the QTL is unknown. However, the integration of the whole genome sequence with the dense genetic marker map that now exists in soybean2–5 (http://www.Soybase.org) will allow the association of mapped phenotypic effectors with the causal DNA sequence. There are already examples where the availability of the soybean genomic sequence has accelerated these discovery efforts. Having access to the sequence allowed cloning and identification of the rsm1 (raffinose synthase) mutation that can be used to select for low-stachyose-containing soybean lines that will improve the ability of animals and humans to digest soybeans34.
Using a comparative genomics approach between soybean and maize, a single-base mutation was found that causes a reduction in phytate production in soybean35. Phytate reduction could result in a reduction of a major environmental runoff contaminant from swine and poultry waste. Perhaps most exciting for the soybean community, the first resistance gene for the devastating disease Asian soybean rust (ASR) has been cloned with the aid of the soybean genomic sequence and confirmed with viral-induced gene silencing36. In countries where ASR is well established, soybean yield losses due to the disease can range from 10% to 80% (ref. 36) and the development of soybean strains resistant to ASR will greatly benefit world soybean production.

Soybean, one of the most important global sources of protein and oil, is now the first legume species with a complete genome sequence. It is, therefore, a key reference for the more than 20,000 legume species, and for the remarkable evolutionary innovation of nitrogen-fixing symbiosis. This genome, with a common ancestor only 20 million years removed from many other domesticated bean species, will allow us to knit together knowledge about traits observed and mapped in all of the beans and relatives. The genome sequence is also an essential framework for vast new experimental information such as tissue-specific expression and whole-genome association data. With knowledge of this genome’s billion-plus nucleotides, we approach an understanding of the plant’s capacity to turn carbon dioxide, water, sunlight and elemental nitrogen and minerals into concentrated energy, protein and nutrients for human and animal use. The genome sequence opens the door to crop improvements that are needed for sustainable human and animal food production, energy production and environmental balance in agriculture worldwide.

METHODS SUMMARY

Seeds from cultivar Williams 82 were grown in a growth chamber for 2 weeks and etiolated for 5 days before harvest.
A standard phenol/chloroform leaf extraction was performed. DNA was treated with RNase A and proteinase K and precipitated with ethanol. All sequencing reads were collected with Sanger sequencing protocols on ABI 3730XL capillary sequencing machines, a majority at the Joint Genome Institute in Walnut Creek, California. A total of 15,332,163 sequence reads were assembled using Arachne v.20071016 (ref. 37) to form 3,363 scaffolds covering 969.6 Mb of the soybean genome. The resulting assembly was integrated with the genetic and physical maps previously built for soybean and a newly constructed genetic map to produce 20 chromosome-scale scaffolds covering 937.3 Mb and an additional 1,148 unmapped scaffolds that cover 17.7 Mb of the genome. Genes were annotated using Fgenesh+ (ref. 38) and GenomeScan39 informed by EST alignments and peptide matches to genomes from Arabidopsis, rice and grapevine. Models were reconciled with EST alignments and UTRs added using PASA40. Models were filtered for high confidence by penalizing genes that were transposable-element-related, had low sequence entropy, short introns, an incomplete start or stop, a low C-score or no UniGene hit at 1 × 10^-5, or that were less than 30% of the length of their best hit. LTR retrotransposons were identified by the program LTR_STRUC41, manually inspected to check structure features and classified into distinct families based on the similarities to LTR sequences. DNA transposons were identified using conserved protein domains as queries in TBLASTN42 searches of the genome. Identified elements were used as a custom library for RepeatMasker (current version: open 3.2.8; http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker) to detect missed intact elements, truncated elements and fragments. Virtual suffix trees with six-frame translation were generated using Vmatch43 and then clustered into families. Pairwise alignments between gene family members were performed using ClustalW44.
Identification of homologous blocks was performed using i-ADHoRe v2.1 (ref. 25). Visualization of blocks was performed with Circos26. Received 19 August; accepted 12 November 2009. 1. Arumuganathan, K. & Earle, E. D. Nuclear DNA content of some important plant species. Plant Mol. Biol. Rep. 9, 208–218 (1991). 2. Choi, I. Y. et al. A soybean transcript map: gene distribution, haplotype and singlenucleotide polymorphism analysis. Genetics 176, 685–696 (2007). 3. Hyten, D. L. et al. High-throughput SNP discovery through deep resequencing of a reduced representation library to anchor and orient scaffolds in the soybean whole genome sequence. BMC Genomics (in the press). 4. Hyten, D. L. et al. A high density integrated genetic linkage map of soybean and the development of a 1,536 Universal Soy Linkage Panel for QTL mapping. Crop Sci. (in the press). 5. Song, Q. J. et al. A new integrated genetic linkage map of the soybean. Theor. Appl. Genet. 109, 122–128 (2004). 6. Lin, J. Y. et al. Pericentromeric regions of soybean (Glycine max L. Merr.) chromosomes consist of retroelements and tandemly repeated DNA and are structurally and evolutionarily labile. Genetics 170, 1221–1230 (2005). 7. Vahedian, M. et al. Genomic organization and evolution of the soybean SB92 satellite sequence. Plant Mol. Biol. 29, 857–862 (1995). 8. Paterson, A. H. et al. The Sorghum bicolor genome and the diversification of grasses. Nature 457, 551–556 (2009). 9. Umezawa, T. et al. Sequencing and analysis of approximately 40,000 soybean cDNA clones from a full-length-enriched cDNA library. DNA Res. 15, 333–346 (2008). 10. Roy, S. W. & Penny, D. Patterns of intron loss and gain in plants: intron lossdominated evolution and genome-wide comparison of O. sativa and A. thaliana. Mol. Biol. Evol. 24, 171–181 (2007). 11. Wang, H. et al. Rosid radiation and the rapid rise of angiosperm-dominated forests. Proc. Natl Acad. Sci. USA 106, 3853–3858 (2009). 182 ©2010 Macmillan Publishers Limited. 
All rights reserved ARTICLES NATURE | Vol 463 | 14 January 2010 12. Tuskan, G. A. et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313, 1596–1604 (2006). 13. Michelmore, R. & Meyers, B. C. Clusters of resistance genes in plants evolve by divergent selection and a birth-and-death process. Genome Res. 8, 1113–1130 (1998). 14. Bruggmann, R. et al. Uneven chromosome contraction and expansion in the maize genome. Genome Res. 16, 1241–1251 (2006). 15. Ma, J., Devos, K. M. & Bennetzen, J. L. Analyses of LTR-retrotransposon structures reveal recent and rapid genomic DNA loss in rice. Genome Res. 14, 860–869 (2004). 16. Pfeil, B. E., Schlueter, J. A., Shoemaker, R. C. & Doyle, J. J. Placing paleopolyploidy in relation to taxon divergence: a phylogenetic analysis in legumes using 39 gene families. Syst. Biol. 54, 441–454 (2005). 17. Schlueter, J. A. et al. Mining EST databases to resolve evolutionary events in major crop species. Genome 47, 868–876 (2004). 18. Schlueter, J. A., Scheffler, B. E., Jackson, S. & Shoemaker, R. C. Fractionation of synteny in a genomic region containing tandemly duplicated genes across Glycine max, Medicago truncatula, and Arabidopsis thaliana. J. Hered. 99, 390–395 (2008). 19. Gill, N. et al. Molecular and chromosomal evidence for allopolyploidy in soybean, Glycine max (L.) Merr. Plant Physiol. 151, 1167–1174 (2009). 20. Bertioli, D. J. et al. An analysis of synteny of Arachis with Lotus and Medicago sheds new light on the structure, stability and evolution of legume genomes. BMC Genomics 10, 45 (2009). 21. Lavin, M., Herendeen, P. S. & Wojciechowski, M. F. Evolutionary rates analysis of Leguminosae implicates a rapid diversification of lineages during the tertiary. Syst. Biol. 54, 575–594 (2005). 22. Bowers, J. E., Chapman, B. A., Rong, J. & Paterson, A. H. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422, 433–438 (2003). 23. Jaillon, O. et al. 
The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463–467 (2007). 24. Tang, H. et al. Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps. Genome Res. 18, 1944–1954 (2008). 25. Simillion, C., Janssens, K., Sterck, L. & Van de Peer, Y. i-ADHoRe 2.0: an improved tool to detect degenerated genomic homology using genomic profiles. Bioinformatics 24, 127–128 (2008). 26. Krzywinski, M. et al. Circos: An information aesthetic for comparative genomics. Genome Res 19, 1639–1645 (2009). 27. Wang, B. B. & Brendel, V. Genomewide comparative analysis of alternative splicing in plants. Proc. Natl Acad. Sci. USA 103, 7175–7180 (2006). 28. Beisson, F. et al. Arabidopsis genes involved in acyl lipid metabolism. A 2003 census of the candidates, a study of the distribution of expressed sequence tags in organs, and a web-based database. Plant Physiol. 132, 681–697 (2003). 29. Fridborg, I., Kuusk, S., Moritz, T. & Sundberg, E. The Arabidopsis dwarf mutant shi exhibits reduced gibberellin responses conferred by overexpression of a new putative zinc finger protein. Plant Cell 11, 1019–1032 (1999). 30. Barkoulas, M., Galinha, C., Grigg, S. P. & Tsiantis, M. From genes to shape: regulatory interactions in leaf development. Curr. Opin. Plant. Biol. 10, 660–666 (2007). 31. Lai, C. P. et al. Molecular analyses of the Arabidopsis TUBBY-like protein gene family. Plant Physiol. 134, 1586–1597 (2004). 32. Herve, C. et al. In vivo interference with AtTCP20 function induces severe plant growth alterations and deregulates the expression of many genes important for development. Plant Physiol. 149, 1462–1477 (2009). 33. Stone, S. L. et al. LEAFY COTYLEDON2 encodes a B3 domain transcription factor that induces embryo development. Proc. Natl Acad. Sci. USA 98, 11806–11811 (2001). 34. Skoneczka, J., Saghai Maroof, M. A., Shang, C. & Buss, G. R. 
Identification of candidate gene mutation associated with low stachyose phenotype in soybean line PI 200508. Crop Sci. 49, 247–255 (2009). 35. Saghai Maroof, M. A., Glover, N. M., Biyashev, R. M., Buss, G. R. & Grabau, E. A. Genetic basis of the low-phytate trait in the soybean line CX1834. Crop Sci. 49, 69–76 (2009). 36. Meyer, J. D. F. et al. Identification and analyses of candidate genes for Rpp4mediated resistance to Asian soybean rust in soybean (Glycine max (L.) Merr.). Plant Physiol. 150, 295–307 (2009). 37. Jaffe, D. B. et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13, 91–96 (2003). 38. Salamov, A. A. & Solovyev, V. V. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516–522 (2000). 39. Yeh, R. F., Lim, L. P. & Burge, C. B. Computational inference of homologous gene structures in the human genome. Genome Res. 11, 803–816 (2001). 40. Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003). 41. McCarthy, E. M. & McDonald, J. F. LTR_STRUC: a novel search and identification program for LTR retrotransposons. Bioinformatics 19, 362–367 (2003). 42. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997). 43. Beckstette, M., Homann, R., Giegerich, R. & Kurtz, S. Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics 7, 389 (2006). 44. Thompson, J. D., Higgins, D. G. & Gibson, T. J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680 (1994). Supplementary Information is linked to the online version of the paper at www.nature.com/nature. Acknowledgements We thank N. Weeks for informatics support and C. 
Gunter for critical reading of the manuscript. We acknowledge funding from the National Science Foundation (DBI-0421620 to G.S.; DBI-0501877 and 082225 to S.A.J.) and the United Soybean Board. Author Contributions Sequencing, assembly and integration: J. Schmutz, S.B.C., J. Schlueter, W.N., U.H., E.L., M.P., D. Grant, S.S., D. Goodstein, K.B., A.S., J.G. and D.R. Annotation: J.M., T.M., J.J.T., J.C., D.X., J.D., Z.T., L.Z., N.G., T.J., M.L., X.-C.Z. and G.S. EST sequencing: G.D.M., T.S., T.U., M.B., D.S., B.V., K.S. and H.T.N. Physical mapping: Y.Y., M.F.G., R.A.W. and R.C.S. Genetic mapping: D.H., J. Specht, Q.S. and P.C. Writing/coordination: S.A.J. Author Information This whole-genome shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession ACUP00000000. The version described here is the first version, ACUP01000000. Full annotation is available at http://www.phytozome.net. Reprints and permissions information is available at www.nature.com/reprints. The authors declare no competing financial interests. This paper is distributed under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike licence, and is freely available to all readers at www.nature.com/nature. Correspondence and requests for materials should be addressed to S.A.J. ([email protected]). 183 ©2010 Macmillan Publishers Limited. All rights reserved Vol 463 | 11 February 2010 | doi:10.1038/nature08747 ARTICLES Genome sequencing and analysis of the model grass Brachypodium distachyon The International Brachypodium Initiative* Three subfamilies of grasses, the Ehrhartoideae, Panicoideae and Pooideae, provide the bulk of human nutrition and are poised to become major sources of renewable energy. Here we describe the genome sequence of the wild grass Brachypodium distachyon (Brachypodium), which is, to our knowledge, the first member of the Pooideae subfamily to be sequenced. 
Comparison of the Brachypodium, rice and sorghum genomes shows a precise history of genome evolution across a broad diversity of the grasses, and establishes a template for analysis of the large genomes of economically important pooid grasses such as wheat. The high-quality genome sequence, coupled with ease of cultivation and transformation, small size and rapid life cycle, will help Brachypodium reach its potential as an important model system for developing new energy and food crops. Grasses provide the bulk of human nutrition, and highly productive grasses are promising sources of sustainable energy1. The grass family (Poaceae) comprises over 600 genera and more than 10,000 species that dominate many ecological and agricultural systems2,3. So far, genomic efforts have largely focused on two economically important grass subfamilies, the Ehrhartoideae (rice) and the Panicoideae (maize, sorghum, sugarcane and millets). The rice4 and sorghum5 genome sequences and a detailed physical map of maize6 showed extensive conservation of gene order5,7 and both ancient and relatively recent polyploidization. Most cool season cereal, forage and turf grasses belong to the Pooideae subfamily, which is also the largest grass subfamily. The genomes of many pooids are characterized by daunting size and complexity. For example, the bread wheat genome is approximately 17,000 megabases (Mb) and contains three independent genomes8. This has prohibited genome-scale comparisons spanning the three most economically important grass subfamilies. Brachypodium, a member of the Pooideae subfamily, is a wild annual grass endemic to the Mediterranean and Middle East9 that has promise as a model system. 
This has led to the development of highly efficient transformation10,11, germplasm collections12–14, genetic markers14, a genetic linkage map15, bacterial artificial chromosome (BAC) libraries16,17, physical maps18 (M.F., unpublished observations), mutant collections (http://brachypodium.pw.usda.gov, http://www.brachytag.org), microarrays and databases (http://www.brachybase.org, http://www.phytozome.net, http://www.modelcrop.org, http://mips.helmholtz-muenchen.de/plant/index.jsp) that are facilitating the use of Brachypodium by the research community. The genome sequence described here will allow Brachypodium to act as a powerful functional genomics resource for the grasses. It is also an important advance in grass structural genomics, permitting, for the first time, whole-genome comparisons between members of the three most economically important grass subfamilies.

Genome sequence assembly and annotation

The diploid inbred line Bd21 (ref. 19) was sequenced using whole-genome shotgun sequencing (Supplementary Table 1). The ten largest scaffolds contained 99.6% of all sequenced nucleotides (Supplementary Table 2). Comparison of these ten scaffolds with a genetic map (Supplementary Fig. 1) detected two false joins and created a further seven joins to produce five pseudomolecules that spanned 272 Mb (Supplementary Table 3), within the range measured by flow cytometry20,21. The assembly was confirmed by cytogenetic analysis (Supplementary Fig. 2) and alignment with two physical maps and sequenced BACs (Supplementary Data). More than 98% of expressed sequence tags (ESTs) mapped to the sequence assembly, consistent with a near-complete genome (Supplementary Table 4 and Supplementary Fig. 3). Compared to other grasses, the Brachypodium genome is very compact, with retrotransposons concentrated at the centromeres and syntenic breakpoints (Fig. 1). DNA transposons and derivatives are broadly distributed and primarily associated with gene-rich regions.
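The scaffold statistic quoted above (the ten largest scaffolds holding 99.6% of all sequenced nucleotides) is a cumulative-length fraction. The sketch below shows how such a figure could be computed from a list of scaffold lengths. The function name `top_n_fraction` and the example lengths are hypothetical and invented for illustration; they are not the Brachypodium assembly values.

```python
from typing import Sequence


def top_n_fraction(scaffold_lengths: Sequence[int], n: int = 10) -> float:
    """Fraction of all assembled bases contained in the n longest scaffolds."""
    if not scaffold_lengths:
        raise ValueError("no scaffold lengths given")
    ordered = sorted(scaffold_lengths, reverse=True)
    return sum(ordered[:n]) / sum(ordered)


# Hypothetical scaffold lengths in base pairs, for illustration only.
example_lengths = [75_000_000, 59_000_000, 48_000_000, 1_000_000, 500_000]
fraction_top3 = top_n_fraction(example_lengths, n=3)

print(f"{fraction_top3:.1%} of bases lie in the 3 longest scaffolds")
```

A value near 100% for a small `n` indicates that the assembly is dominated by a few very long scaffolds, which is what makes joining them into chromosome-scale pseudomolecules practical.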
We analysed small RNA populations from inflorescence tissues with deep Illumina sequencing, and mapped them onto the genome sequence (Fig. 2a, Supplementary Fig. 4 and Supplementary Table 5). Small RNA reads were most dense in regions of high repeat density, similar to the distribution reported in Arabidopsis22. We identified 413 21-nucleotide and 198 24-nucleotide phased short interfering RNA (siRNA) loci. Using the same algorithm, the only phased loci identified in Arabidopsis were five of the eight trans-acting siRNA loci, and none was 24-nucleotide phased. The biological functions of these clusters of Brachypodium phased siRNAs, which account for a significant number of small RNAs that map outside repeat regions, are not known at present.

A total of 25,532 protein-coding gene loci was predicted in the v1.0 annotation (Supplementary Information and Supplementary Table 6). This is in the same range as rice (RAP2, 28,236)23 and sorghum (v1.4, 27,640)5, suggesting similar gene numbers across a broad diversity of grasses. Gene models were evaluated using ~10.2 gigabases (Gb) of Illumina RNA-seq data (Supplementary Fig. 5)24. Overall, 92.7% of predicted coding sequences (CDS) were supported by Illumina data (Fig. 2b), demonstrating the high accuracy of the Brachypodium gene predictions. These gene models are available from several databases (such as http://www.brachybase.org, http://www.phytozome.net, http://www.modelcrop.org and http://mips.org).

Between 77 and 84% of gene families (defined according to Supplementary Fig. 6) are shared among the three grass subfamilies represented by Brachypodium, rice and sorghum, reflecting a relatively recent common origin (Fig. 2c). Grass-specific genes include transmembrane receptor protein kinases, glycosyltransferases, peroxidases and P450 proteins (Supplementary Table 7B). The Pooideae-specific gene set contains only 265 gene families (Supplementary Table 7C) comprising 811 genes (1,400 including singletons). Genes enriched in grasses were significantly more likely to be contained in tandem arrays than random genes, demonstrating a prominent role for tandem gene expansion in the evolution of grass-specific genes (Supplementary Fig. 7 and Supplementary Table 8).

*A list of participants and their affiliations appears at the end of the paper.

Figure 1 | Chromosomal distribution of the main Brachypodium genome features. The abundance and distribution of the following genome elements are shown: complete LTR retroelements (cLTRs); solo-LTRs (sLTRs); potentially autonomous DNA transposons that are not miniature inverted-repeat transposable elements (MITEs) (DNA-TEs); MITEs; gene exons (CDS); gene introns and satellite tandem arrays (STA). Graphs are from 0 to 100 per cent base-pair (%bp) coverage of the respective window. The heat map tracks have different ranges and different maximum (max) pseudocolour levels: STA (0–55, scaled to max 10) %bp; cLTRs (0–36, scaled to max 20) %bp; sLTRs (0–4) %bp; DNA-TEs (0–20) %bp; MITEs (0–22) %bp; CDS (exons) (0–22.3) %bp. The triangles identify syntenic breakpoints.

To validate and improve the v1.0 gene models, we manually annotated 2,755 gene models from 97 diverse gene families (Supplementary Tables 9–11) relevant to bioenergy and food crop improvement. We annotated 866 genes involved in cell wall biosynthesis/modification and 948 transcription factors from 16 families25. Only 13% of the gene models required modification and very few pseudogenes were identified, demonstrating the accuracy of the v1.0 annotation. Phylogenetic trees for 62 gene families were constructed using genes from rice, Arabidopsis, sorghum and poplar. In nearly all cases, Brachypodium genes had a similar distribution to rice and sorghum, demonstrating that Brachypodium is suitably generic for grass functional genomics research (Supplementary Figs 8 and 9). Analysis of the predicted secretome identified substantial differences in the distribution of cell wall metabolism genes between dicots and grasses (Supplementary Tables 12, 13 and Supplementary Fig. 10), consistent with their different cell walls26. Signal peptide probability curves also suggested that start codons were accurately predicted (Supplementary Fig. 11).

Figure 2 | Transcript and gene identification and distribution among three grass subfamilies. a, Genome-wide distribution of small RNA loci and transcripts in the Brachypodium genome. Brachypodium chromosomes (1–5) are shown at the top. Total small RNA reads (black lines) and total small RNA loci (red lines) are shown on the top panel. Histograms plot 21-nucleotide (nt) (blue) or 24-nucleotide (red) small RNA reads normalized for repeated matches to the genome. The phased loci histograms plot the position and phase-score of 21-nucleotide (blue) and 24-nucleotide (red) phased small RNA loci. Repeat-normalized RNA-seq read histograms plot the abundance of reads matching RNA transcripts (green), normalized for ambiguous matches to the genome. b, Transcript coverage over gene features. Perfect match 32-base oligonucleotide Illumina reads were mapped to the Brachypodium v1.0 annotation features using HashMatch (http://mocklerlab-tools.cgrb.oregonstate.edu/). Plots of Illumina coverage were calculated as the percentage of bases along the length of the sequence feature supported by Illumina reads for the indicated gene model features. The bottom and top of the box represent the 25th and 75th percentiles, respectively. The white line is the median and the red diamonds denote the mean. SJS, splice junction site. c, Venn diagram showing the distribution of shared gene families between representatives of Ehrhartoideae (rice RAP2), Panicoideae (sorghum v1.4) and Pooideae (Brachypodium v1.0, and Triticum aestivum and Hordeum vulgare TCs (transcript consensus)/EST sequences). Paralogous gene families were collapsed in these data sets. Panel c labels: rice (Ehrhartoideae), 16,235 families, 20,559 genes in families; sorghum (Panicoideae), 17,608 families, 25,816 genes in families; Brachypodium and wheat/barley (Pooideae), 16,215 families, 20,562 genes in families.

Maintaining a small grass genome size

Exhaustive analysis of transposable elements (Supplementary Information and Supplementary Table 14) showed retrotransposon sequences comprise 21.4% of the genome, compared to 26% in rice, 54% in sorghum, and more than 80% in wheat27. Thirteen retroelement sets were younger than 20,000 years, showing a recent activation compared to rice28 (Supplementary Fig. 12), and a further 53 retroelement sets were less than 0.1 million years (Myr) old. A minimum of 17.4 Mb has been lost by long terminal repeat (LTR)–LTR recombination, demonstrating that retroelement expansion is countered by removal through recombination. In contrast, retroelements persist for very long periods of time in the closely related Triticeae28.
DNA transposons comprise 4.77% of the Brachypodium genome, within the range found in other grass genomes5,29. Transcriptome data and structural analysis suggest that many non-autonomous Mariner DTT and Harbinger elements recruit transposases from other families. Two CACTA DTC families (M and N) carried five non-element genes, and the Harbinger U family has amplified an NBS-LRR gene family (Supplementary Figs 13 and 14), adding it to the group of transposable elements implicated in gene mobility30,31. Centromeric regions were characterized by low gene density, characteristic repeats and retroelement clusters (Supplementary Fig. 15). Other repeat classes are described in Supplementary Table 15. Conserved non-coding sequences are described in Supplementary Fig. 16.

[Figure 3, panels a–e. Panel a labels: wheat, Brachypodium, rice and sorghum, with divergence times 32–39, 40–54 and 45–60 Myr ago and WGD 56–73 Myr ago. Panel b: Bd1–Bd5. Panels c–e: rice, sorghum, barley, Aegilops tauschii and wheat chromosomes.]

Whole-genome comparison of three diverse grass genomes

The evolutionary relationships between Brachypodium, sorghum, rice and wheat were assessed by measuring the mean synonymous substitution rates (Ks) of orthologous gene pairs (Supplementary Information, Supplementary Fig. 17 and Supplementary Table 16), from which divergence times of Brachypodium from wheat 32–39 Myr ago, rice 40–53 Myr ago, and sorghum 45–60 Myr ago (Fig. 3a) were estimated. The Ks of orthologous gene pairs in the intragenomic Brachypodium duplications (Fig. 3b) suggests duplication 56–72 Myr ago, before the diversification of the grasses. This is consistent with previous evolutionary histories inferred from a small number of genes3,32–34. Paralogous relationships among Brachypodium chromosomes showed six major chromosomal duplications covering 92.1% of the genome (Fig. 3b), representing ancestral whole-genome duplication35.
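How a mean Ks value translates into a divergence time can be illustrated with a minimal sketch. Under a strict molecular clock, two lineages that separated T years ago accumulate synonymous substitutions independently, so Ks is approximately 2rT, where r is the substitution rate per synonymous site per year. The rate used here (6.5 × 10^-9) is an assumption chosen for illustration and is not taken from this paper; the rate and Ks estimators actually used by the authors are given in their Supplementary Information. The function names are hypothetical.

```python
# Illustrative sketch only: converts a mean Ks value into a divergence time
# under a strict molecular clock. The rate is an assumed value for
# illustration, not one taken from this paper.

ASSUMED_RATE = 6.5e-9  # synonymous substitutions per site per year (assumption)


def divergence_time_myr(ks: float, rate: float = ASSUMED_RATE) -> float:
    """Return the divergence time in Myr from T = Ks / (2 * rate)."""
    if ks < 0:
        raise ValueError("Ks must be non-negative")
    if rate <= 0:
        raise ValueError("rate must be positive")
    years = ks / (2.0 * rate)  # factor 2: both lineages accumulate changes
    return years / 1e6


def ks_for_time_myr(time_myr: float, rate: float = ASSUMED_RATE) -> float:
    """Inverse relation: expected Ks for a divergence time given in Myr."""
    if time_myr < 0:
        raise ValueError("time must be non-negative")
    if rate <= 0:
        raise ValueError("rate must be positive")
    return 2.0 * rate * time_myr * 1e6


if __name__ == "__main__":
    # Hypothetical Ks values, not measurements from the paper.
    for ks in (0.45, 0.52, 0.70):
        print(f"Ks = {ks:.2f} -> {divergence_time_myr(ks):.1f} Myr")
```

Under the assumed rate, a Ks of 0.52 corresponds to 40 Myr. The reported ranges (for example 40–53 Myr ago for rice) would then reflect the spread between Ks estimation methods rather than a single point estimate, although that interpretation depends on details given only in the Supplementary Information.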
Using the rice and sorghum genome sequences, genetic maps of barley36 and Aegilops tauschii (the D genome donor of hexaploid wheat)37, and bin-mapped wheat ESTs38,39, 21,045 orthologous relationships between Brachypodium, rice, sorghum and Triticeae were identified (Supplementary Information). These identified 59 blocks of collinear genes covering 99.2% of the Brachypodium genome (Fig. 3c–e). The orthologous relationships are consistent with an evolutionary model that shaped five Brachypodium chromosomes from a five-chromosome ancestral genome via a 12-chromosome intermediate involving seven major chromosome fusions39 (Supplementary Fig. 18). These collinear blocks of orthologous genes provide a robust and precise sequence framework for understanding grass genome evolution and aiding the assembly of sequences from other pooid grasses.

We identified 14 major syntenic disruptions between Brachypodium and rice/sorghum that can be explained by nested insertions of entire chromosomes into centromeric regions (Fig. 4a, b)2,37,40. Similar nested insertions in sorghum37 and barley (Fig. 4c, d) were also identified. Centromeric repeats and peaks in retroelements at the junctions of chromosome insertions are footprints of these insertion events (Supplementary Fig. 15C and Fig. 1), as is higher gene density at the former distal regions of the inserted chromosomes (Fig. 1). Notably, the reduction in chromosome number in Brachypodium and wheat occurred independently because none of the chromosome fusions are shared by Brachypodium and the Triticeae37 (Supplementary Fig. 18).

Figure 3 | Brachypodium genome evolution and synteny between grass subfamilies. a, The distribution maxima of mean synonymous substitution rates (Ks) of Brachypodium, rice, sorghum and wheat orthologous gene pairs (Supplementary Table 16) were used to define the divergence times of these species and the age of interchromosomal duplications in Brachypodium. WGD, whole-genome duplication. The numbers refer to the predicted divergence times measured as Myr ago by the NG or ML methods. b, Diagram showing the six major interchromosomal Brachypodium duplications, defined by 723 paralogous relationships, as coloured bands linking the five chromosomes. c, Identification of chromosome relationships between the Brachypodium, rice and sorghum genomes. Orthologous relationships between the 25,532 protein-coding Brachypodium genes, 7,216 sorghum orthologues (12 syntenic blocks), and 8,533 rice orthologues (12 syntenic blocks) were defined. Sets of collinear orthologous relationships are represented by a coloured band according to each Brachypodium chromosome (blue, chromosome (chr.) 1; yellow, chr. 2; violet, chr. 3; red, chr. 4; green, chr. 5). The white region in each Brachypodium chromosome represents the centromeric region. d, Orthologous gene relationships between Brachypodium and barley and Ae. tauschii were identified using genetically mapped ESTs. 2,516 orthologous relationships defined 12 syntenic blocks. These are shown as coloured bands. e, Orthologous gene relationships between Brachypodium and hexaploid bread wheat defined by 5,003 ESTs mapped to wheat deletion bins. Each set of orthologous relationships is represented by a band that is evenly spread across each deletion interval on the wheat chromosomes.

Figure 4 | A recurring pattern of nested chromosome fusions in grasses. a, The five Brachypodium chromosomes are coloured according to homology with rice chromosomes (Os1–Os12). Chromosomes descended from an ancestral chromosome (A4–A11) through whole-genome duplication are shown in shades of the same colour. Gene density is indicated as a red line above the chromosome maps. Major discontinuities in gene density identify syntenic breakpoints, which are marked by a diamond. White diamonds identify fusion points containing remnant centromeric repeats. b, A pattern of nested insertions of whole chromosomes into centromeric regions explains the observed syntenic breakpoints. Bd5 has not undergone chromosome fusion. c, Examples of nested chromosome insertions in sorghum (Sb) chromosomes 1 and 2. d, Examples of nested chromosome insertions in barley (H chromosomes) inferred from genetic maps. Nested insertions were not identified in other chromosomes, possibly owing to the low resolution of genetic maps.

Comparisons of evolutionary rates between Brachypodium, sorghum, rice and Ae. tauschii demonstrated a substantially higher rate of genome change in Ae. tauschii (Supplementary Table 17). This may be due to retroelement activity that increases syntenic disruptions, as proposed for chromosome 5S later41. Among seven relatively large gene families, four were highly syntenic and two (NBS-LRR and F-box) were almost never found in syntenic order when compared to rice and sorghum (Supplementary Table 18), consistent with the rapid diversification of the NBS-LRR and F-box gene families42.

The short arm of chromosome 5 (Bd5S) has a gene density roughly half that of the rest of the genome, high LTR retrotransposon density, the youngest intact Gypsy elements and the lowest solo LTR density. Thus, unlike the rest of the Brachypodium genome, Bd5S is gaining retrotransposons by replication and losing fewer by recombination. Syntenic regions of rice (Os4S) and sorghum (Sb6S) demonstrate maintenance of this high repeat content for ~50–70 Myr (Supplementary Fig. 19)43. Bd5S, Os4S and Sb6S also have the lowest proportion of collinear genes (Fig. 4a and Supplementary Fig. 19).
We propose that the chromosome ancestral to Bd5S reached a tipping point in which high retrotransposon density had deleterious effects on genes.

Discussion

As the first genome sequence of a pooid grass, the Brachypodium genome aids genome analysis and gene identification in the large and complex genomes of wheat and barley, two other pooid grasses that are among the world’s most important crops. The very high quality of the Brachypodium genome sequence, in combination with those from two other grass subfamilies, enabled reconstruction of chromosome evolution across a broad diversity of grasses. This analysis contributes to our understanding of grass diversification by explaining how the varying chromosome numbers found in the major grass subfamilies derive from an ancestral set of five chromosomes by nested insertions of whole chromosomes into centromeres. The relatively small genome of Brachypodium contains many active retroelement families, but recombination between these keeps genome expansion in check. The short arm of chromosome 5 deviates from the rest of the genome by exhibiting a trend towards genome expansion through increased retroelement numbers and disruption of gene order more typical of the larger genomes of closely related grasses.

Grass crop improvement for sustainable fuel44 and food45 production requires a substantial increase in research in species such as Miscanthus, switchgrass, wheat and cool season forage grasses. These considerations have led to the rapid adoption of Brachypodium as an experimental system for grass research. The similarities in gene content and gene family structure between Brachypodium, rice and sorghum support the value of Brachypodium as a functional genomics model for all grasses. The Brachypodium genome sequence analysis reported here is therefore an important advance towards securing sustainable supplies of food, feed and fuel from new generations of grass crops.

METHODS SUMMARY

Genome sequencing and assembly.
Sanger sequencing was used to generate paired-end reads from 3 kb, 8 kb, fosmid (35 kb) and BAC (100 kb) clones, giving 9.4× coverage (Supplementary Table 1). The final assembly of 83 scaffolds covers 271.9 Mb (Supplementary Table 3). Sequence scaffolds were aligned to a genetic map to create pseudomolecules covering each chromosome (Supplementary Figs 1 and 2).

Protein-coding gene annotation. Gene models were derived from weighted consensus prediction from several ab initio gene finders, optimal spliced alignments of ESTs and transcript assemblies, and protein homology. Illumina transcriptome sequence was aligned to predicted genome features to validate exons, splice sites and alternatively spliced transcripts.

Repeats analysis. The MIPS ANGELA pipeline was used to integrate analyses from expert groups. LTR-STRUCT and LTR-HARVEST46 were used for de novo retroelement searches.

Received 29 August; accepted 9 December 2009.

1. Somerville, C. The billion-ton biofuels vision. Science 312, 1277 (2006).
2. Kellogg, E. A. Evolutionary history of the grasses. Plant Physiol. 125, 1198–1205 (2001).
3. Gaut, B. S. Evolutionary dynamics of grass genomes. New Phytol. 154, 15–28 (2002).
4. International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature 436, 793–800 (2005).
5. Paterson, A. H. et al. The Sorghum bicolor genome and the diversification of grasses. Nature 457, 551–556 (2009).
6. Wei, F. et al. Physical and genetic structure of the maize genome reflects its complex evolutionary history. PLoS Genet. 3, e123 (2007).
7. Moore, G., Devos, K. M., Wang, Z. & Gale, M. D. Cereal genome evolution. Grasses, line up and form a circle. Curr. Biol. 5, 737–739 (1995).
8. Salamini, F., Ozkan, H., Brandolini, A., Schafer-Pregl, R. & Martin, W. Genetics and geography of wild cereal domestication in the near east. Nature Rev. Genet. 3, 429–441 (2002).
9. Draper, J. et al. Brachypodium distachyon. A new model system for functional genomics in grasses. Plant Physiol. 127, 1539–1555 (2001).
10. Vain, P. et al. Agrobacterium-mediated transformation of the temperate grass Brachypodium distachyon (genotype Bd21) for T-DNA insertional mutagenesis. Plant Biotechnol. J. 6, 236–245 (2008).
11. Vogel, J. & Hill, T. High-efficiency Agrobacterium-mediated transformation of Brachypodium distachyon inbred line Bd21–3. Plant Cell Rep. 27, 471–478 (2008).
12. Vogel, J. P., Garvin, D. F., Leong, O. M. & Hayden, D. M. Agrobacterium-mediated transformation and inbred line development in the model grass Brachypodium distachyon. Plant Cell Tissue Organ Cult. 84, 100179–100191 (2006).
13. Filiz, E. et al. Molecular, morphological and cytological analysis of diverse Brachypodium distachyon inbred lines. Genome 52, 876–890 (2009).
14. Vogel, J. P. et al. Development of SSR markers and analysis of diversity in Turkish populations of Brachypodium distachyon. BMC Plant Biol. 9, 88 (2009).
15. Garvin, D. F. et al. An SSR-based genetic linkage map of the model grass Brachypodium distachyon. Genome 53, 1–13 (2009).
16. Huo, N. et al. Construction and characterization of two BAC libraries from Brachypodium distachyon, a new model for grass genomics. Genome 49, 1099–1108 (2006).
17. Huo, N. et al. The nuclear genome of Brachypodium distachyon: analysis of BAC end sequences. Funct. Integr. Genomics 8, 135–147 (2008).
18. Gu, Y. Q. et al. A BAC-based physical map of Brachypodium distachyon and its comparative analysis with rice and wheat. BMC Genomics 10, 496 (2009).
19. Garvin, D. F. et al. Development of genetic and genomic research resources for Brachypodium distachyon, a new model system for grass crop research. Crop Sci. 48, S-69–S-84 (2008).
20. Bennett, M. D. & Leitch, I. J. Nuclear DNA amounts in angiosperms: progress, problems and prospects. Ann. Bot. (Lond.) 95, 45–90 (2005).
21. Vogel, J. P. et al. EST sequencing and phylogenetic analysis of the model grass Brachypodium distachyon. Theor. Appl. Genet. 113, 186–195 (2006).
22. Rajagopalan, R., Vaucheret, H., Trejo, J. & Bartel, D. P. A diverse and evolutionarily fluid set of microRNAs in Arabidopsis thaliana. Genes Dev. 20, 3407–3425 (2006).
23. Tanaka, T. et al. The rice annotation project database (RAP-DB): 2008 update. Nucleic Acids Res. 36, D1028–D1033 (2008).
24. Fox, S., Filichkin, S. & Mockler, T. Applications of ultra-high-throughput sequencing. Methods Mol. Biol. 553, 79–108 (2009).
25. Gray, J. et al. A recommendation for naming transcription factor proteins in the grasses. Plant Physiol. 149, 4–6 (2009).
26. Vogel, J. Unique aspects of the grass cell wall. Curr. Opin. Plant Biol. 11, 301–307 (2008).
27. Bennetzen, J. L. & Kellogg, E. A. Do plants have a one-way ticket to genomic obesity? Plant Cell 9, 1509–1514 (1997).
28. Wicker, T. & Keller, B. Genome-wide comparative analysis of copia retrotransposons in Triticeae, rice, and Arabidopsis reveals conserved ancient evolutionary lineages and distinct dynamics of individual copia families. Genome Res. 17, 1072–1081 (2007).
29. Wicker, T. et al. Analysis of intraspecies diversity in wheat and barley genomes identifies breakpoints of ancient haplotypes and provides insight into the structure of diploid and hexaploid triticeae gene pools. Plant Physiol. 149, 258–270 (2009).
30. Jiang, N., Bao, Z., Zhang, X., Eddy, S. R. & Wessler, S. R. Pack-MULE transposable elements mediate gene evolution in plants. Nature 431, 569–573 (2004).
31. Morgante, M. et al. Gene duplication and exon shuffling by helitron-like transposons generate intraspecies diversity in maize. Nature Genet. 37, 997–1002 (2005).
32. Grass Phylogeny Working Group. Phylogeny and subfamilial classification of the grasses (Poaceae). Ann. Mo. Bot. Gard. 88, 373–457 (2001).
33. Bossolini, E., Wicker, T., Knobel, P. A. & Keller, B. Comparison of orthologous loci from small grass genomes Brachypodium and rice: implications for wheat genomics and grass genome annotation. Plant J. 49, 704–717 (2007).
34. Charles, M. et al. Sixty million years in evolution of soft grain trait in grasses: emergence of the softness locus in the common ancestor of Pooideae and Ehrhartoideae, after their divergence from Panicoideae. Mol. Biol. Evol. 26, 1651–1661 (2009).
35. Paterson, A. H., Bowers, J. E. & Chapman, B. A. Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc. Natl Acad. Sci. USA 101, 9903–9908 (2004).
36. Stein, N. et al. A 1,000-loci transcript map of the barley genome: new anchoring points for integrative grass genomics. Theor. Appl. Genet. 114, 823–839 (2007).
37. Luo, M. C. et al. Genome comparisons reveal a dominant mechanism of chromosome number reduction in grasses and accelerated genome evolution in Triticeae. Proc. Natl Acad. Sci. USA 106, 15780–15785 (2009).
38. Qi, L. L. et al. A chromosome bin map of 16,000 expressed sequence tag loci and distribution of genes among the three genomes of polyploid wheat. Genetics 168, 701–712 (2004).
39. Salse, J. et al. Identification and characterization of shared duplications between rice and wheat provide new insight into grass genome evolution. Plant Cell 20, 11–24 (2008).
40. Srinivasachary, Dida M. M., Gale, M. D. & Devos, K. M. Comparative analyses reveal high levels of conserved colinearity between the finger millet and rice genomes. Theor. Appl. Genet. 115, 489–499 (2007).
41. Vicient, C. M., Kalendar, R. & Schulman, A. H. Variability, recombination, and mosaic evolution of the barley BARE-1 retrotransposon. J. Mol. Evol. 61, 275–291 (2005).
42. Meyers, B. C., Kozik, A., Griego, A., Kuang, H. & Michelmore, R. W. Genome-wide analysis of NBS-LRR-encoding genes in Arabidopsis. Plant Cell 15, 809–834 (2003).
43. Ma, J. & Bennetzen, J. L. Rapid recent growth and divergence of rice nuclear genomes. Proc. Natl Acad. Sci. USA 101, 12404–12410 (2004).
44. U.S. Department of Energy Office of Science. Breaking the Biological Barriers to Cellulosic Ethanol: A Joint Research Agenda ⟨http://genomicscience.energy.gov/biofuels/b2bworkshop.shtml⟩ (2006).
45. Food and Agriculture Organization of the United Nations. World Agriculture: Towards 2030/2050 Interim Report. ⟨http://www.fao.org/ES/esd/AT2050web.pdf⟩ (2006).
46. McCarthy, E. M. & McDonald, J. F. LTR_STRUC: a novel search and identification program for LTR retrotransposons. Bioinformatics 19, 362–367 (2003).

Supplementary Information is linked to the online version of the paper at www.nature.com/nature.

Acknowledgements We acknowledge the contributions of the late M. Gale, who identified the importance of conserved gene order in grass genomes. This work was mainly supported by the US Department of Energy Joint Genome Institute Community Sequencing Program project with J.P.V., D.F.G., T.C.M. and M.W.B., a BBSRC grant to M.W.B., an EU Contract Agronomics grant to M.W.B. and K.F.X.M., and GABI Barlex grant to K.F.X.M. Illumina transcriptome sequencing was supported by a DOE Plant Feedstock Genomics for Bioenergy grant and an Oregon State Agricultural Research Foundation grant to T.C.M.; small RNA research was supported by the DOE Plant Feedstock Genomics for Bioenergy grants to P.J.G. and T.C.M.; annotation was supported by a DOE Plant Feedstocks for Genomics Bioenergy grant to J.P.V. A full list of support and acknowledgements is in the Supplementary Information.

Author Information The whole-genome shotgun sequence of Brachypodium distachyon has been deposited at DDBJ/EMBL/GenBank under the accession ADDN00000000. (The version described in this manuscript is the first version, accession ADDN01000000). EST sequences have been deposited with dbEST (accessions 67946317–68053959) and GenBank (accessions GT758162–GT865804).
The short read archive accession for RNA-seq data is SRA010177. Reprints and permissions information is available at www.nature.com/reprints. This paper is distributed under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike licence, and is freely available to all readers at www.nature.com/nature. The authors declare no competing financial interests. Correspondence and requests for materials should be addressed to J.P.V. ([email protected]) or D.F.G. ([email protected]) or T.C.M. ([email protected]) or M.W.B. ([email protected]).

Author Contributions See list of consortium authors below.

The International Brachypodium Initiative

Principal investigators John P. Vogel1, David F. Garvin2, Todd C. Mockler3, Jeremy Schmutz4, Dan Rokhsar5,6, Michael W. Bevan7;

DNA sequencing and assembly Kerrie Barry5, Susan Lucas5, Miranda Harmon-Smith5, Kathleen Lail5, Hope Tice5, Jeremy Schmutz4 (Leader), Jane Grimwood4, Neil McKenzie7, Michael W. Bevan7;

Pseudomolecule assembly and BAC end sequencing Naxin Huo1, Yong Q. Gu1, Gerard R. Lazo1, Olin D. Anderson1, John P. Vogel1 (Leader), Frank M. You8, Ming-Cheng Luo8, Jan Dvorak8, Jonathan Wright7, Melanie Febrer7, Michael W. Bevan7, Dominika Idziak9, Robert Hasterok9, David F. Garvin2;

Transcriptome sequencing and analysis Erika Lindquist5, Mei Wang5, Samuel E. Fox3, Henry D. Priest3, Sergei A. Filichkin3, Scott A. Givan3, Douglas W. Bryant3, Jeff H. Chang3, Todd C. Mockler3 (Leader), Haiyan Wu10,24, Wei Wu10, An-Ping Hsia10, Patrick S. Schnable10,24, Anantharaman Kalyanaraman11, Brad Barbazuk12, Todd P. Michael13, Samuel P. Hazen14, Jennifer N. Bragg1, Debbie Laudencia-Chingcuanco1, John P. Vogel1, David F. Garvin2, Yiqun Weng15, Neil McKenzie7, Michael W. Bevan7;

Gene analysis and annotation Georg Haberer16, Manuel Spannagl16, Klaus Mayer16 (Leader), Thomas Rattei17, Therese Mitros6, Dan Rokhsar6, Sang-Jik Lee18, Jocelyn K. C. Rose18, Lukas A. Mueller19, Thomas L. York19;

Repeats analysis Thomas Wicker20 (Leader), Jan P. Buchmann20, Jaakko Tanskanen21, Alan H. Schulman21 (Leader), Heidrun Gundlach16, Jonathan Wright7, Michael Bevan7, Antonio Costa de Oliveira22, Luciano da C. Maia22, William Belknap1, Yong Q. Gu1, Ning Jiang23, Jinsheng Lai24, Liucun Zhu25, Jianxin Ma25, Cheng Sun26, Ellen Pritham26;

Comparative genomics Jerome Salse27 (Leader), Florent Murat27, Michael Abrouk27, Georg Haberer16, Manuel Spannagl16, Klaus Mayer16, Remy Bruggmann13, Joachim Messing13, Frank M. You8, Ming-Cheng Luo8, Jan Dvorak8;

Small RNA analysis Noah Fahlgren3, Samuel E. Fox3, Christopher M. Sullivan3, Todd C. Mockler3, James C. Carrington3, Elisabeth J. Chapman3,28, Greg D. May29, Jixian Zhai30, Matthias Ganssmann30, Sai Guna Ranjan Gurazada30, Marcelo German30, Blake C. Meyers30, Pamela J. Green30 (Leader);

Manual annotation and gene family analysis Jennifer N. Bragg1, Ludmila Tyler1,6, Jiajie Wu1,8, Yong Q. Gu1, Gerard R. Lazo1, Debbie Laudencia-Chingcuanco1, James Thomson1, John P. Vogel1 (Leader), Samuel P. Hazen14, Shan Chen14, Henrik V. Scheller31, Jesper Harholt32, Peter Ulvskov32, Samuel E. Fox3, Sergei A. Filichkin3, Noah Fahlgren3, Jeffrey A. Kimbrel3, Jeff H. Chang3, Christopher M. Sullivan3, Elisabeth J. Chapman3,27, James C. Carrington3, Todd C. Mockler3, Laura E. Bartley8,31, Peijian Cao8,31, Ki-Hong Jung8,31†, Manoj K Sharma8,31, Miguel Vega-Sanchez8,31, Pamela Ronald8,31, Christopher D. Dardick33, Stefanie De Bodt34, Wim Verelst34, Dirk Inzé34, Maren Heese35, Arp Schnittger35, Xiaohan Yang36, Udaya C. Kalluri36, Gerald A. Tuskan36, Zhihua Hua37, Richard D. Vierstra37, David F. Garvin3, Yu Cui24, Shuhong Ouyang24, Qixin Sun24, Zhiyong Liu24, Alper Yilmaz38, Erich Grotewold38, Richard Sibout39, Kian Hematy39, Gregory Mouille39, Herman Höfte39, Todd Michael13, Jérome Pelloux40, Devin O’Connor41, James Schnable41, Scott Rowe41, Frank Harmon41, Cynthia L. Cass42, John C. Sedbrook42, Mary E. Byrne7, Sean Walsh7, Janet Higgins7, Michael Bevan7, Pinghua Li19, Thomas Brutnell19, Turgay Unver43, Hikmet Budak43, Harry Belcram44, Mathieu Charles44, Boulos Chalhoub44, Ivan Baxter45

1USDA-ARS Western Regional Research Center, Albany, California 94710, USA.
2USDA-ARS Plant Science Research Unit and University of Minnesota, St Paul, Minnesota 55108, USA.
3Oregon State University, Corvallis, Oregon 97331-4501, USA.
4HudsonAlpha Institute, Huntsville, Alabama 35806, USA.
5US DOE Joint Genome Institute, Walnut Creek, California 94598, USA.
6University of California Berkeley, Berkeley, California 94720, USA.
7John Innes Centre, Norwich NR4 7UJ, UK.
8University of California Davis, Davis, California 95616, USA.
9University of Silesia, 40-032 Katowice, Poland.
10Iowa State University, Ames, Iowa 50011, USA.
11Washington State University, Pullman, Washington 99163, USA.
12University of Florida, Gainesville, Florida 32611, USA.
13Rutgers University, Piscataway, New Jersey 08855-0759, USA.
14University of Massachusetts, Amherst, Massachusetts 01003-9292, USA.
15USDA-ARS Vegetable Crops Research Unit, Horticulture Department, University of Wisconsin, Madison, Wisconsin 53706, USA.
16Helmholtz Zentrum München, D-85764 Neuherberg, Germany.
17Technical University München, 80333 München, Germany.
18Cornell University, Ithaca, New York 14853, USA.
19Boyce Thompson Institute for Plant Research, Ithaca, New York 14853-1801, USA.
20University of Zurich, 8008 Zurich, Switzerland.
21MTT Agrifood Research and University of Helsinki, FIN-00014 Helsinki, Finland.
22Federal University of Pelotas, Pelotas, 96001-970, RS, Brazil.
23Michigan State University, East Lansing, Michigan 48824, USA.
24China Agricultural University, Beijing 10094, China.
25Purdue University, West Lafayette, Indiana 47907, USA.
26The University of Texas, Arlington, Arlington, Texas 76019, USA.
27Institut National de la Recherche Agronomique UMR 1095, 63100 Clermont-Ferrand, France.
28University of California San Diego, La Jolla, California 92093, USA.
29National Centre for Genome Resources, Santa Fe, New Mexico 87505, USA.
30University of Delaware, Newark, Delaware 19716, USA.
31Joint Bioenergy Institute, Emeryville, California 94720, USA.
32University of Copenhagen, Frederiksberg DK-1871, Denmark.
33USDA-ARS Appalachian Fruit Research Station, Kearneysville, West Virginia 25430, USA.
34VIB Department of Plant Systems Biology, VIB and Department of Plant Biotechnology and Genetics, Ghent University, Technologiepark 927, 9052 Gent, Belgium.
35Institut de Biologie Moléculaire des Plantes du CNRS, Strasbourg 67084, France.
36BioEnergy Science Center and Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831-6422, USA.
37University of Wisconsin-Madison, Madison, Wisconsin 53706, USA.
38The Ohio State University, Columbus, Ohio 43210, USA.
39Institut Jean-Pierre Bourgin, UMR1318, Institut National de la Recherche Agronomique, 78026 Versailles cedex, France.
40Université de Picardie, Amiens 80039, France.
41Plant Gene Expression Center, University of California Berkeley, Albany, California 94710, USA.
42Illinois State University and DOE Great Lakes Bioenergy Research Center, Normal, Illinois 61790, USA.
43Sabanci University, Istanbul 34956, Turkey.
44Unité de Recherche en Génomique Végétale: URGV (INRA-CNRS-UEVE), Evry 91057, France.
45USDA-ARS/Donald Danforth Plant Science Center, St Louis, Missouri 63130, USA.
†Present address: The School of Plant Molecular Systems Biotechnology, Kyung Hee University, Yongin 446-701, Korea.

NATURE | Vol 463 | 11 February 2010
©2010 Macmillan Publishers Limited. All rights reserved