Gene models for wheat
A good genome assembly (as discussed in the Genome assemblies section) is an essential
prerequisite for obtaining high-quality gene models: the models can only be as good as the assembly
they are based on. Transcriptome data and gene models from related species are often used to help
define correct gene models in a species. As with genome assemblies, it is vital for researchers using
gene models to understand how they were created, so that they are aware of shortcomings and
potential pitfalls. This section looks at the gene models currently available for wheat. Please note
that the wheat genome sequencing efforts are rapidly updating the “state of the art” resources, so
gene models and genome sequences may change.
a) PGSB version 2.2 gene models
When the first complete draft genome assembly of wheat (CSS) was published in 2014, a set of gene
models was created based on this assembly, transcriptome data and gene models from related
species (http://www.ncbi.nlm.nih.gov/pubmed/25035500).
These gene models are available on the archive version of the EnsemblPlants website
(http://archive.plants.ensembl.org/Triticum_aestivum/Info/Index) (also see EnsemblPlants primer)
as well as in the in silico TILLING database (www.wheat-tilling.com) for both the tetraploid Kronos
and hexaploid Cadenza populations (see TILLING mutant resource section).
This set of gene models (PGSB version 2.2) consists of 193,667 transcripts and splice variants. The
longest transcript of each locus is defined as the main transcript, which results in a core set of
99,386 protein-coding genes (Figure 1). The name of each gene model consists of three parts
separated by underscores (e.g. Traes_1BL_447468BDE):
1. Species designation; Traes is short for Triticum aestivum
2. Chromosome location; 1BS means “short arm of chromosome 1B”
3. Nine alphanumeric characters; these have no special meaning but are simply a way of
naming each gene model uniquely
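The three-part naming scheme above can be split programmatically. The following is a minimal sketch in Python, not part of the PGSB resources. The function name parse_pgsb_name and the regular expression are assumptions made for illustration. In particular, the pattern assumes chromosome groups 1 to 7, genomes A, B or D, and an optional arm letter (S or L).

```python
import re
from typing import NamedTuple


class PgsbName(NamedTuple):
    species: str     # e.g. "Traes" (Triticum aestivum)
    chromosome: str  # e.g. "1B"
    arm: str         # "S" (short), "L" (long), or "" if no arm is given
    unique_id: str   # nine alphanumeric characters, no special meaning


# Assumed pattern (illustrative only): species, chromosome + optional arm, unique ID.
_PGSB_RE = re.compile(r"^(Traes)_([1-7][ABD])([SL]?)_([0-9A-Za-z]{9})$")


def parse_pgsb_name(name: str) -> PgsbName:
    """Split a PGSB v2.2 style gene model name into its parts."""
    match = _PGSB_RE.match(name)
    if match is None:
        raise ValueError(f"not a PGSB v2.2 style name: {name!r}")
    return PgsbName(*match.groups())


print(parse_pgsb_name("Traes_1BL_447468BDE"))
```

Names that do not follow the scheme raise a ValueError, which makes it easy to spot identifiers from a different annotation set mixed into a gene list.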
A lot of work has gone into annotating these gene models correctly but, as mentioned above, the
accuracy of a prediction is largely limited by the genome assembly. The first genome draft of wheat
(see Genome assemblies) is highly fragmented, consisting of more than 10 million scaffolds. As a
result, a number of the gene models have been annotated incorrectly.
The A genome homoeologue of Traes_1BL_447468BDE (Figure 1), Traes_1AL_729BF3204, is a
perfect example of this (Figure 2). The gene model for Traes_1AL_729BF3204 is the longest of
the five transcripts predicted. However, the four alternative transcripts are located on a different
scaffold. The gene model for Traes_1AL_729BF3204 has one exon, whereas the homoeologous
gene models (Traes_1BL_447468BDE and Traes_1DL_729BF3204; the latter not shown) consist
of five exons, indicating that the A-genome model is wrong.
In such a case researchers have to create their own version of the gene model by superimposing
the “correct” gene models from the homoeologous genomes onto the genome of the missing gene
model (here the A-genome). Comparing the gene model with homologous proteins in other closely
related species, such as rice and barley, can also assist with this.
Gene models
www.wheat-training.com
Figure 1: Example of PGSB version 2.2 gene models
This screenshot from the Archive EnsemblPlants wheat browser shows the large number of
transcripts and splice variants that can be predicted for a single locus. By definition, the longest
transcript (here transcript number two) has been selected as the default transcript. Filled boxes are
exons; hollow boxes are untranslated regions; lines connecting boxes are introns.
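The "longest transcript per locus" rule can be illustrated with a short Python sketch. The function name pick_main_transcripts, the locus and transcript identifiers, and the lengths are all hypothetical; the original annotation pipeline may break ties or measure length differently.

```python
from typing import Dict, Iterable, Tuple


def pick_main_transcripts(
    transcripts: Iterable[Tuple[str, str, int]]
) -> Dict[str, str]:
    """Map each locus to its longest transcript.

    Each input item is (locus, transcript_id, length_in_bp).
    """
    best: Dict[str, Tuple[str, int]] = {}
    for locus, transcript_id, length in transcripts:
        # On a tie the first transcript seen is kept (an arbitrary choice).
        if locus not in best or length > best[locus][1]:
            best[locus] = (transcript_id, length)
    return {locus: tid for locus, (tid, _) in best.items()}


# Hypothetical loci, transcript IDs and lengths, for illustration only.
example = [
    ("locusA", "locusA.1", 1200),
    ("locusA", "locusA.2", 2100),
    ("locusA", "locusA.3", 900),
    ("locusB", "locusB.1", 750),
]
print(pick_main_transcripts(example))
```

Note that the longest transcript is a convention, not a guarantee of biological relevance; a shorter splice variant may be the dominant form in a given tissue.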
This was just one example, but unfortunately cases like this occur frequently. Hence, before
starting any work using gene models, it is highly recommended to check their integrity. As the
example above illustrates, tell-tale signs of incorrect gene predictions are:
1. The gene model ends at the very start or end of the scaffold
2. The gene model has alternative transcripts on other scaffolds
3. The homoeologues of the gene model differ in structure
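The three signs above lend themselves to a simple automated screen. The sketch below is one possible way to encode them in Python and is not an existing tool. The GeneModel record, the warning_signs function, the 100 bp edge margin, and all scaffold names and coordinates are assumptions for illustration; only the gene names and exon counts (one versus five) come from the example discussed in this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneModel:
    name: str
    scaffold: str
    start: int                       # 1-based start on the scaffold
    end: int                         # 1-based inclusive end on the scaffold
    scaffold_length: int
    exon_count: int
    transcript_scaffolds: List[str]  # scaffold of each predicted transcript


def warning_signs(model: GeneModel, homoeologues: List[GeneModel],
                  edge_margin: int = 100) -> List[str]:
    """Return the tell-tale signs that apply to a gene model."""
    signs = []
    # Sign 1: the model starts or ends within edge_margin bp of a scaffold end.
    if model.start <= edge_margin or model.end > model.scaffold_length - edge_margin:
        signs.append("ends near a scaffold boundary")
    # Sign 2: at least one transcript was predicted on a different scaffold.
    if any(s != model.scaffold for s in model.transcript_scaffolds):
        signs.append("alternative transcripts on other scaffolds")
    # Sign 3: exon count is used here as a crude proxy for gene structure.
    if any(h.exon_count != model.exon_count for h in homoeologues):
        signs.append("homoeologues differ in exon count")
    return signs


# Scaffold names and coordinates below are hypothetical.
a_model = GeneModel("Traes_1AL_729BF3204", "scaffold_A1", 4200, 5000, 5000, 1,
                    ["scaffold_A1"] + ["scaffold_A2"] * 4)
b_model = GeneModel("Traes_1BL_447468BDE", "scaffold_B1", 12000, 18000, 60000, 5,
                    ["scaffold_B1"])
print(warning_signs(a_model, [b_model]))
```

A flagged model is only a candidate for manual inspection. The third check in particular reports a disagreement between homoeologues without saying which copy is wrong, so the browser view and the other two signs are still needed to decide.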
Figure 2: Example of wrong gene prediction
This screenshot from the EnsemblPlants wheat browser is a good example of an incorrectly
predicted gene model. The gene model terminates at the end of the scaffold, while at the same time
four alternative transcripts are predicted on the adjacent scaffold. The fragmented genome assembly
prevented this gene model from being correctly annotated.
b) TGAC gene models
As discussed in the Genome assemblies section, a new assembly of wheat cultivar Chinese Spring
was released in December 2015 (TGAC). The gene models for this new and improved assembly
have been created and are the default gene models on Ensembl Plants
(http://plants.ensembl.org/Triticum_aestivum/Info/Index). The gene names consist of six sections
separated by underscores (e.g. TRIAE_CS42_7DL_TGACv1_731555_AA2173560):
1. Species designation; TRIAE is short for Triticum aestivum
2. Accession sequenced; CS42 stands for Chinese Spring (the variety of wheat sequenced) version 42
3. Chromosome location; 7DL means “long arm of chromosome 7D”
4. Assembly version; TGAC version 1
5. Six numeric characters; refer to the unique scaffold number
6. Nine alphanumeric characters; these have no special meaning but are simply a way of
naming each gene model uniquely
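As with the PGSB names, the six-part TGAC names can be split on underscores. The following Python sketch assumes exactly the layout described in the list above; the names TgacName and parse_tgac_name are hypothetical, and identifiers that deviate from this layout are rejected rather than guessed at.

```python
from typing import NamedTuple


class TgacName(NamedTuple):
    species: str         # "TRIAE" (Triticum aestivum)
    accession: str       # "CS42" (Chinese Spring)
    chromosome_arm: str  # e.g. "7DL"
    assembly: str        # e.g. "TGACv1"
    scaffold: str        # numeric scaffold number
    unique_id: str       # nine alphanumeric characters


def parse_tgac_name(name: str) -> TgacName:
    """Split a TGAC gene name into its six underscore-separated sections."""
    parts = name.split("_")
    if len(parts) != 6:
        raise ValueError(f"expected 6 sections, got {len(parts)}: {name!r}")
    parsed = TgacName(*parts)
    # Light validation based on the description above (an assumption, not a spec).
    if not parsed.scaffold.isdigit():
        raise ValueError(f"scaffold number is not numeric: {parsed.scaffold!r}")
    if len(parsed.unique_id) != 9 or not parsed.unique_id.isalnum():
        raise ValueError(f"unexpected unique ID: {parsed.unique_id!r}")
    return parsed


gene = parse_tgac_name("TRIAE_CS42_7DL_TGACv1_731555_AA2173560")
print(gene.chromosome_arm, gene.scaffold)
```

Because the scaffold number is embedded in the name, genes can be grouped by scaffold without consulting a separate annotation file.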
Given the improvement in the scaffold length of the genome assembly, the new TGAC gene models
in general contain fewer inaccurate predictions than the PGSB v2.2 (CSS) set of gene models. The
TGAC annotation contains 273,739 predicted transcripts, of which 154,798 come from the 104,091
coding genes. This number of coding genes is similar to the 99,386 in the PGSB v2.2 (CSS) set of
gene models.
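The counts quoted in this section can be compared with a few lines of arithmetic. The Python sketch below uses only the figures given in the text; the variable names are arbitrary, and the transcripts-per-gene ratios are comparable only under the assumption that the 193,667 PGSB transcripts all belong to the 99,386 core genes.

```python
# Counts as quoted in this section.
pgsb_transcripts = 193_667
pgsb_genes = 99_386
tgac_coding_transcripts = 154_798
tgac_genes = 104_091

# Average number of transcripts per coding gene in each set.
pgsb_per_gene = pgsb_transcripts / pgsb_genes
tgac_per_gene = tgac_coding_transcripts / tgac_genes

# Absolute and relative difference in the number of coding genes.
extra_genes = tgac_genes - pgsb_genes
relative_increase = extra_genes / pgsb_genes

print(f"PGSB v2.2: {pgsb_per_gene:.2f} transcripts per gene")
print(f"TGAC:      {tgac_per_gene:.2f} transcripts per gene")
print(f"TGAC has {extra_genes} more coding genes ({relative_increase:.1%})")
```

The gene counts differ by just under 5%, which supports the statement that the two sets are similar in size, while the TGAC set predicts fewer transcripts per coding gene.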