Download Genome assemblies

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Vectors in gene therapy wikipedia , lookup

Gene therapy wikipedia , lookup

Gene nomenclature wikipedia , lookup

Ridge (biology) wikipedia , lookup

Genetic engineering wikipedia , lookup

Segmental Duplication on the Human Y Chromosome wikipedia , lookup

Epigenetics of human development wikipedia , lookup

NUMT wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Oncogenomics wikipedia , lookup

Transposable element wikipedia , lookup

No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup

X-inactivation wikipedia , lookup

Copy-number variation wikipedia , lookup

Gene expression profiling wikipedia , lookup

Genomic imprinting wikipedia , lookup

Gene desert wikipedia , lookup

Non-coding DNA wikipedia , lookup

Gene wikipedia , lookup

Gene expression programming wikipedia , lookup

ENCODE wikipedia , lookup

Public health genomics wikipedia , lookup

Microevolution wikipedia , lookup

History of genetic engineering wikipedia , lookup

Polyploid wikipedia , lookup

Metagenomics wikipedia , lookup

Designer baby wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Human genome wikipedia , lookup

Helitron (biology) wikipedia , lookup

Genome (book) wikipedia , lookup

Whole genome sequencing wikipedia , lookup

Pathogenomics wikipedia , lookup

RNA-Seq wikipedia , lookup

Minimal genome wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Genome editing wikipedia , lookup

Genomic library wikipedia , lookup

Human Genome Project wikipedia , lookup

Genomics wikipedia , lookup

Genome evolution wikipedia , lookup

Transcript
Updated March 2017
Genome assemblies
a) Introduction to the wheat genome
Wheat is an allopolyploid of which there are two major types:

hexaploid common wheat (Triticum aestivum ssp. aestivum; 17 Gb genome size; AABBDD
genomes), which is mainly used for bread and biscuit products

tetraploid durum wheat (Triticum turgidum ssp. durum; 12 Gb genome size; AABB genomes),
used mainly for pasta.
Hexaploid wheat arose from a polyploidisation event ~ 10,000 years ago, whereas tetraploid wheat
arose ~ 400,000 years ago (Figure 1).
Figure 1: The evolutionary history of allopolyploid wheat. From Borrill et al., 2015. DOI:
10.1111/nph.13533
Hexaploid wheat contains three closely related genomes (A, B and D) which contain homoeologous
genes in a conserved order. Wheat homoeologues share over 95 % sequence identity within coding
regions and most wheat genes are expected to be present as three copies in the A, B and D genome.
Due to the high sequence conservation between homoeologues, genes may be functionally
redundant or act in a dose dependent manner. This means that often all three copies must be
knocked out to cause a strong phenotype. However in other cases the homoeologous genes have
developed specialised functions or become pseudogenized over time due to reduced selection
pressure on duplicated genes.
b) Multiple genome assemblies are available
The large size of the wheat genome has made it very challenging to produce a reference genome.
Different strategies have been used to create draft genome assemblies. Currently several genome
assemblies need to be used in a complementary fashion because no single reference is the best
across all regions. A brief explanation about each genome assembly is given below alongside any
caveats.
Genome assemblies
www.wheat-training.com
1
Updated March 2017
c) Introduction to each assembly
Chinese Spring Survey Sequence (CSS)
The International Wheat Genome Sequencing Consortium (IWGSC) has used flow-sorting to
separate out individual chromosome arms. The landrace used (Chinese Spring) had aneuploid
genetic stocks available, which have only one arm of each chromosome e.g. the short arm of
chromosome 1A is deleted so chromosome arm 1AL can separated from the other chromosomes.
This enabled the IWGSC to separate each individual chromosome arm by flow sorting, before
sequencing each chromosome arm separately. Individual chromosome arms were sequenced to 30240x coverage using Illumina NGS, generating a 10.2 Gb assembly of Chinese Spring (termed
Chinese Spring survey sequence (CSS)).
Gene models were created by mapping RNA-seq data and using gene models from related species.
A total of 99,386 protein-coding genes were predicted, with 193,667 transcripts and splice variants.
The gene models presented rely on the scaffolds assembled and in some cases gene models are
incomplete because the underlying genomic scaffolds are not full length assemblies. For example in
some cases genes are split between two different genomic scaffolds (shown below), therefore one
gene is given two different identifiers.
Figure 2: CSS gene models are affected by truncated scaffolds.
Chromosome 3B was assembled using a BAC by BAC approach and is currently considered the
“gold standard” for the reference genome assembly. 3B gene models were created separately using
RNA-seq data.
CSS reference: IWGSC 2014, DOI: 10.1126/science.1251788
CSS data access: http://archive.plants.ensembl.org/Triticum_aestivum/Info/Index
W7984
A whole genome sequencing approach was undertaken in the synthetic hexaploid wheat “Synthetic
W7984”. This approached used large-insert sequencing libraries and enabled separate assemblies
of the three homoeologues genomes, to a total assembly size of 9.1 Gb. In some regions the W7984
scaffolds are more continuous than the CSS, but for other regions the CSS is more continuous than
the W7984. Using both references will give the most complete picture of genomic regions of interest.
W7984 does not have any gene models associated with it.
W7984 reference: Chapman et al., 2015 DOI: 10.1186/s13059-015-0582-8
W7984 data access: http://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/blast_WGS.php
Genome assemblies
www.wheat-training.com
2
Updated March 2017
TGAC v1
A whole genome shot gun sequence assembly of Chinese Spring was carried out using nested long
mate-pair libraries alongside a modified version of the DISCOVAR algorithm for assembly. This
method created an assembly of total length 13.4 Gb, with approximately 10x longer N50 than the
CSS and W7984 assemblies.
Gene models from IWGSC were projected onto the TGAC assembly, with 99 % of the total genes
located on the TGAC assembly. De novo gene prediction has been carried out for the TGAC
assembly resulting in a total of 273,739 transcripts (including non-coding and transcript variants). Of
these 104,305 are high confidence protein coding genes. In general these gene models are more
complete than the CSS gene models due to the longer contig length within the TGAC assembly.
These gene models are available from http://plants.ensembl.org
TGAC reference: preprint available at bioRxiv – Clavijo et al., 2016 DOI: 10.1101/080796
TGAC data access: http://plants.ensembl.org/Triticum_aestivum/Info/Index?db=core
https://wheatis.tgac.ac.uk/grassroots-portal/blast
WGA v0.4
A whole genome assembly has been carried out by the IWGSC in collaboration with the company
NRGene. Using a proprietary algorithm DeNovoMAGIC with Illumina sequencing data a 14.5 Gb
assembly was produced. This assembly has much larger contigs (N50 7.1 Mb) and represents a
draft genome more similar in quality to rice and other model species. Sequences have been ordered
using POPSEQ data and Hi-C (chromosome conformation capture) to generate 21 pseudomolecules
representing the majority of the wheat genome.
WGA v0.4 reference: unpublished
WGA v0.4 data access: available after registration with URGI.
All data downloadable at https://wheat-urgi.versailles.inra.fr/Seq-Repository/Assemblies
RefSeq v1.0
The IWGSC RefSeq v1.0 is an integration of the IWGSC WGA v0.4 with the IWGSC chromosomebased strategy and additional resources including, but not limited to, physical maps of all
chromosomes, sequenced BACs, BioNano optical maps, alignment to RH maps (radiation hybridbased maps) and GBS (genotyping by sequencing maps) of the SynOp RIL population (a well
characterised mapping population). The RefSeqv1.0 is an improvement on the WGA v0.4: the
assembly size remains similar (14.6 Gb) but the chromosomal scaffold N50 increased to 22.8 Mb.
Gene models are currently being determined in a collaborative effort using multiple pipelines and will
be available in the near future.
Resources including the TGACv1 gene models and SNP arrays (90k iSelect, 35k and 820k axiom,
see Variation Data for more information) have been aligned to the RefSeq v1.0 to obtain physical
chromosomal coordinates. These files are available for download on the useful links page under
‘Genomic Data’.
RefSeq v0.1 reference: unpublished
Genome assemblies
www.wheat-training.com
3
Updated March 2017
RefSeq v0.1 data access: available after registration with URGI.
All data downloadable at https://wheat-urgi.versailles.inra.fr/Seq-Repository/Assemblies
d) Varieties other than Chinese Spring
All the above mentioned assemblies are of the variety Chinese Spring. Assemblies for other varieties
are now becoming available as we move towards a wheat ‘pangenome’. For example, the method
used for the TGACv1 assembly of Chinese Spring has also been used to construct assemblies for a
number of other varieties. These varieties include the tetraploid variety Kronos in addition to 4
hexaploid UK varieties (Cadenza, Paragon, Claire, and Robigus). These assemblies are available
to BLAST against at https://wheatis.tgac.ac.uk/grassroots-portal/blast. Additionally, 18 Australian
cultivars have been sequenced to create a draft wheat pangenome, reported by Montenegro et al.
(Plant Journal 2017, DOI: 10.1111/tpj.13515). Additional varieties from across the world are currently
being sequenced and assembled; this document will be updated as they become available.
e) Comparison between assemblies
Currently all 4 reference genomes have their merits due to the differences in gene annotation,
incorporation into other resources (e.g. expression browsers, SNP markers and TILLING mutants)
and the variety which was sequenced.
Table 1. Comparison between different genome assemblies.
Common Variety
Assembly
Gene
Related
name
size
models resources
CSS
Chinese
Spring
(6×)
10.2 Gb
total
N50contig =
4.2 kb
TILLING
mutants
Expression
browsers
SNP
markers
Description
References
Flow-sorted
chromosome arms
sequenced and
assembled
individually,
allowing efficient
separation of
homoeologous
sequences
IWGSC
(2014)
W7984
Synthetic
W7984
(6×)
9.1 Gb total
N50contig =
6.7 kb
WGS assembly
combining pairedend and mate-pair
libraries.
Chapman
et al. (2015)
TGAC
Chinese
Spring
(6x)
13.4 Gb
total
N50contig =
88 kb
WGS assembly
combining pairedend and mate-pair
libraries with a
modified version of
the DISCOVAR
assembler
Preprint at
bioRxiv
Clavijo et
al. (2016)
Genome assemblies
www.wheat-training.com
4
Updated March 2017
WGA
v0.4
Chinese
Spring
(6x)
14.5 Gb
total
N50scaff =
7.1 Mb
RefSeq
v1.0
Chinese
Spring
(6x)
14.6 Gb
total
N50scaff =
22.8 Mb
Genome assemblies
www.wheat-training.com
Coming
soon
WGS assembly
incorporating Hi-C
and optical
sequencing
information. The
propriety
DeNovoMAGIC
assembler was
used.
unpublished
Integration of WGA
v0.4 with additional
physical
information
unpublished
5