* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Genome assemblies
Vectors in gene therapy wikipedia , lookup
Gene therapy wikipedia , lookup
Gene nomenclature wikipedia , lookup
Ridge (biology) wikipedia , lookup
Genetic engineering wikipedia , lookup
Segmental Duplication on the Human Y Chromosome wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Oncogenomics wikipedia , lookup
Transposable element wikipedia , lookup
No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup
X-inactivation wikipedia , lookup
Copy-number variation wikipedia , lookup
Gene expression profiling wikipedia , lookup
Genomic imprinting wikipedia , lookup
Gene desert wikipedia , lookup
Non-coding DNA wikipedia , lookup
Gene expression programming wikipedia , lookup
Public health genomics wikipedia , lookup
Microevolution wikipedia , lookup
History of genetic engineering wikipedia , lookup
Metagenomics wikipedia , lookup
Designer baby wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Human genome wikipedia , lookup
Helitron (biology) wikipedia , lookup
Genome (book) wikipedia , lookup
Whole genome sequencing wikipedia , lookup
Pathogenomics wikipedia , lookup
Minimal genome wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Genome editing wikipedia , lookup
Genomic library wikipedia , lookup
Updated March 2017 Genome assemblies a) Introduction to the wheat genome Wheat is an allopolyploid of which there are two major types: hexaploid common wheat (Triticum aestivum ssp. aestivum; 17 Gb genome size; AABBDD genomes), which is mainly used for bread and biscuit products tetraploid durum wheat (Triticum turgidum ssp. durum; 12 Gb genome size; AABB genomes), used mainly for pasta. Hexaploid wheat arose from a polyploidisation event ~ 10,000 years ago, whereas tetraploid wheat arose ~ 400,000 years ago (Figure 1). Figure 1: The evolutionary history of allopolyploid wheat. From Borrill et al., 2015. DOI: 10.1111/nph.13533 Hexaploid wheat contains three closely related genomes (A, B and D) which contain homoeologous genes in a conserved order. Wheat homoeologues share over 95 % sequence identity within coding regions and most wheat genes are expected to be present as three copies in the A, B and D genome. Due to the high sequence conservation between homoeologues, genes may be functionally redundant or act in a dose dependent manner. This means that often all three copies must be knocked out to cause a strong phenotype. However in other cases the homoeologous genes have developed specialised functions or become pseudogenized over time due to reduced selection pressure on duplicated genes. b) Multiple genome assemblies are available The large size of the wheat genome has made it very challenging to produce a reference genome. Different strategies have been used to create draft genome assemblies. Currently several genome assemblies need to be used in a complementary fashion because no single reference is the best across all regions. A brief explanation about each genome assembly is given below alongside any caveats. Genome assemblies www.wheat-training.com 1 Updated March 2017 c) Introduction to each assembly Chinese Spring Survey Sequence (CSS) The International Wheat Genome Sequencing Consortium (IWGSC) has used flow-sorting to separate out individual chromosome arms. The landrace used (Chinese Spring) had aneuploid genetic stocks available, which have only one arm of each chromosome e.g. the short arm of chromosome 1A is deleted so chromosome arm 1AL can separated from the other chromosomes. This enabled the IWGSC to separate each individual chromosome arm by flow sorting, before sequencing each chromosome arm separately. Individual chromosome arms were sequenced to 30240x coverage using Illumina NGS, generating a 10.2 Gb assembly of Chinese Spring (termed Chinese Spring survey sequence (CSS)). Gene models were created by mapping RNA-seq data and using gene models from related species. A total of 99,386 protein-coding genes were predicted, with 193,667 transcripts and splice variants. The gene models presented rely on the scaffolds assembled and in some cases gene models are incomplete because the underlying genomic scaffolds are not full length assemblies. For example in some cases genes are split between two different genomic scaffolds (shown below), therefore one gene is given two different identifiers. Figure 2: CSS gene models are affected by truncated scaffolds. Chromosome 3B was assembled using a BAC by BAC approach and is currently considered the “gold standard” for the reference genome assembly. 3B gene models were created separately using RNA-seq data. CSS reference: IWGSC 2014, DOI: 10.1126/science.1251788 CSS data access: http://archive.plants.ensembl.org/Triticum_aestivum/Info/Index W7984 A whole genome sequencing approach was undertaken in the synthetic hexaploid wheat “Synthetic W7984”. This approached used large-insert sequencing libraries and enabled separate assemblies of the three homoeologues genomes, to a total assembly size of 9.1 Gb. In some regions the W7984 scaffolds are more continuous than the CSS, but for other regions the CSS is more continuous than the W7984. Using both references will give the most complete picture of genomic regions of interest. W7984 does not have any gene models associated with it. W7984 reference: Chapman et al., 2015 DOI: 10.1186/s13059-015-0582-8 W7984 data access: http://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/blast_WGS.php Genome assemblies www.wheat-training.com 2 Updated March 2017 TGAC v1 A whole genome shot gun sequence assembly of Chinese Spring was carried out using nested long mate-pair libraries alongside a modified version of the DISCOVAR algorithm for assembly. This method created an assembly of total length 13.4 Gb, with approximately 10x longer N50 than the CSS and W7984 assemblies. Gene models from IWGSC were projected onto the TGAC assembly, with 99 % of the total genes located on the TGAC assembly. De novo gene prediction has been carried out for the TGAC assembly resulting in a total of 273,739 transcripts (including non-coding and transcript variants). Of these 104,305 are high confidence protein coding genes. In general these gene models are more complete than the CSS gene models due to the longer contig length within the TGAC assembly. These gene models are available from http://plants.ensembl.org TGAC reference: preprint available at bioRxiv – Clavijo et al., 2016 DOI: 10.1101/080796 TGAC data access: http://plants.ensembl.org/Triticum_aestivum/Info/Index?db=core https://wheatis.tgac.ac.uk/grassroots-portal/blast WGA v0.4 A whole genome assembly has been carried out by the IWGSC in collaboration with the company NRGene. Using a proprietary algorithm DeNovoMAGIC with Illumina sequencing data a 14.5 Gb assembly was produced. This assembly has much larger contigs (N50 7.1 Mb) and represents a draft genome more similar in quality to rice and other model species. Sequences have been ordered using POPSEQ data and Hi-C (chromosome conformation capture) to generate 21 pseudomolecules representing the majority of the wheat genome. WGA v0.4 reference: unpublished WGA v0.4 data access: available after registration with URGI. All data downloadable at https://wheat-urgi.versailles.inra.fr/Seq-Repository/Assemblies RefSeq v1.0 The IWGSC RefSeq v1.0 is an integration of the IWGSC WGA v0.4 with the IWGSC chromosomebased strategy and additional resources including, but not limited to, physical maps of all chromosomes, sequenced BACs, BioNano optical maps, alignment to RH maps (radiation hybridbased maps) and GBS (genotyping by sequencing maps) of the SynOp RIL population (a well characterised mapping population). The RefSeqv1.0 is an improvement on the WGA v0.4: the assembly size remains similar (14.6 Gb) but the chromosomal scaffold N50 increased to 22.8 Mb. Gene models are currently being determined in a collaborative effort using multiple pipelines and will be available in the near future. Resources including the TGACv1 gene models and SNP arrays (90k iSelect, 35k and 820k axiom, see Variation Data for more information) have been aligned to the RefSeq v1.0 to obtain physical chromosomal coordinates. These files are available for download on the useful links page under ‘Genomic Data’. RefSeq v0.1 reference: unpublished Genome assemblies www.wheat-training.com 3 Updated March 2017 RefSeq v0.1 data access: available after registration with URGI. All data downloadable at https://wheat-urgi.versailles.inra.fr/Seq-Repository/Assemblies d) Varieties other than Chinese Spring All the above mentioned assemblies are of the variety Chinese Spring. Assemblies for other varieties are now becoming available as we move towards a wheat ‘pangenome’. For example, the method used for the TGACv1 assembly of Chinese Spring has also been used to construct assemblies for a number of other varieties. These varieties include the tetraploid variety Kronos in addition to 4 hexaploid UK varieties (Cadenza, Paragon, Claire, and Robigus). These assemblies are available to BLAST against at https://wheatis.tgac.ac.uk/grassroots-portal/blast. Additionally, 18 Australian cultivars have been sequenced to create a draft wheat pangenome, reported by Montenegro et al. (Plant Journal 2017, DOI: 10.1111/tpj.13515). Additional varieties from across the world are currently being sequenced and assembled; this document will be updated as they become available. e) Comparison between assemblies Currently all 4 reference genomes have their merits due to the differences in gene annotation, incorporation into other resources (e.g. expression browsers, SNP markers and TILLING mutants) and the variety which was sequenced. Table 1. Comparison between different genome assemblies. Common Variety Assembly Gene Related name size models resources CSS Chinese Spring (6×) 10.2 Gb total N50contig = 4.2 kb TILLING mutants Expression browsers SNP markers Description References Flow-sorted chromosome arms sequenced and assembled individually, allowing efficient separation of homoeologous sequences IWGSC (2014) W7984 Synthetic W7984 (6×) 9.1 Gb total N50contig = 6.7 kb WGS assembly combining pairedend and mate-pair libraries. Chapman et al. (2015) TGAC Chinese Spring (6x) 13.4 Gb total N50contig = 88 kb WGS assembly combining pairedend and mate-pair libraries with a modified version of the DISCOVAR assembler Preprint at bioRxiv Clavijo et al. (2016) Genome assemblies www.wheat-training.com 4 Updated March 2017 WGA v0.4 Chinese Spring (6x) 14.5 Gb total N50scaff = 7.1 Mb RefSeq v1.0 Chinese Spring (6x) 14.6 Gb total N50scaff = 22.8 Mb Genome assemblies www.wheat-training.com Coming soon WGS assembly incorporating Hi-C and optical sequencing information. The propriety DeNovoMAGIC assembler was used. unpublished Integration of WGA v0.4 with additional physical information unpublished 5