Download Gene models - Wheat Training

Gene models for wheat (www.wheat-training.com)

A good genome assembly (as discussed in the Genome assemblies section) is an essential prerequisite for obtaining high-quality gene models: the models can only be as good as the assembly they are based on. Transcriptome data and gene models from related species are often used to help define correct gene models in a species. As with genome assemblies, it is vital for researchers using gene models to understand how they were created in order to be aware of shortcomings and potential pitfalls. This section looks at the current gene models available for wheat. Please note that the wheat genome sequencing efforts are rapidly updating the “state of the art” resources, so gene models and genome sequences may change.

a) PGSB version 2.2 gene models

When the first complete draft genome assembly of wheat (CSS) was published in 2014, a set of gene models was created based on this assembly, transcriptome data and gene models from related species (http://www.ncbi.nlm.nih.gov/pubmed/25035500). These gene models are available on the archive version of the EnsemblPlants website (http://archive.plants.ensembl.org/Triticum_aestivum/Info/Index) (also see EnsemblPlants primer) as well as in the in silico TILLING database (www.wheat-tilling.com) for both the tetraploid Kronos and hexaploid Cadenza populations (see TILLING mutant resource section).

This set of gene models (PGSB version 2.2) consists of 193,667 transcripts and splice variants. The longest transcript of each locus is defined as the main transcript, which results in a core set of 99,386 protein coding genes (Figure 1).

The name of each gene model consists of three parts separated by underscores (e.g. Traes_1BL_447468BDE):

1. Species designation; Traes is short for Triticum aestivum
2. Chromosome location; 1BS means “short arm of chromosome 1B”
3. Nine alphanumeric characters; these have no special meaning but are simply a way of naming each gene model uniquely

A lot of work has gone into annotating these gene models correctly but, as mentioned above, the accuracy of a prediction is largely limited by the genome assembly. The first genome draft of wheat (see Genome assemblies) is highly fragmented into more than 10 million scaffolds. As a result, a number of the gene models have been incorrectly annotated.

The A genome homoeologue of Traes_1BL_447468BDE (Figure 1), Traes_1AL_729BF3204, is a perfect example of this (Figure 2). The gene model for Traes_1AL_729BF3204 is the longest of the five transcripts predicted. However, the four alternative transcripts are located on a different scaffold. The gene model for Traes_1AL_729BF3204 has one exon, whereas the homoeologous gene models (Traes_1BL_447468BDE and Traes_1DL_729BF3204; the latter not shown) consist of five exons, indicating that the A-genome model is wrong. In such a case researchers have to create their own version of the gene model by superimposing the “correct” gene models from the homoeologous genomes onto the genome of the missing gene model (here the A-genome). Comparing the gene model with homologous proteins in other closely related species, such as rice and barley, can also assist with this.

Figure 1: Example of PGSB version 2.2 gene models. This screenshot from the Archive EnsemblPlants wheat browser shows the large number of transcripts and splice variants that can be predicted for a single locus. By definition, the longest transcript (here transcript number two) has been selected as the default transcript. Filled boxes are exons; hollow boxes are untranslated regions; lines connecting boxes are introns.

This was just one example, but unfortunately cases like this can occur frequently.
Hence, before starting any work using gene models it is highly recommended to check their integrity. As mentioned above, tell-tale signs of incorrect gene predictions are:

1. The gene model ends at the very start or end of the scaffold
2. The gene model has alternative transcripts on other scaffolds
3. The homoeologues of the gene model differ in structure

Figure 2: Example of wrong gene prediction. This screenshot from the EnsemblPlants wheat browser is a good example of an incorrectly predicted gene model. The gene model terminates at the end of the scaffold, while at the same time four alternative transcripts are predicted on the adjacent scaffold. The fragmented genome assembly prevented this gene model from being correctly annotated.

b) TGAC gene models

As discussed in the Genome assemblies section, a new assembly of wheat cultivar Chinese Spring was released in December 2015 (TGAC). The gene models for this new and improved assembly have been created and are the default gene models on EnsemblPlants (http://plants.ensembl.org/Triticum_aestivum/Info/Index).

The gene names consist of six sections separated by underscores (e.g. TRIAE_CS42_7DL_TGACv1_731555_AA2173560):

1. Species designation; TRIAE is short for Triticum aestivum
2. Accession sequenced; CS42 stands for Chinese Spring (the variety of wheat sequenced) version 42
3. Chromosome location; 7DL means “long arm of chromosome 7D”
4. Assembly version; TGAC version 1
5. Six numeric characters; these refer to the unique scaffold number
6. Nine alphanumeric characters; these have no special meaning but are simply a way of naming each gene model uniquely

Given the improvement in the scaffold length of the genome assembly, the new TGAC gene models in general contain fewer inaccurate predictions than the PGSB v2.2 (CSS) set of gene models. The TGAC annotation contains 273,739 predicted transcripts, of which 154,798 come from the 104,091 coding genes. The number of coding genes is similar to the number in the PGSB v2.2 (CSS) set of gene models.
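The two naming schemes described above (three underscore-separated parts for PGSB v2.2, six for TGAC) can be split into their components programmatically. The following is a minimal sketch in Python, not an official tool: the function name parse_gene_name and the field names are hypothetical, and the regular expressions encode only the rules stated in this section. Treating the chromosome arm as optional is an assumption; names that do not fit either pattern simply return None.

```python
import re
from typing import Dict, Optional

# PGSB v2.2: species, chromosome location, nine alphanumeric characters.
# Assumption: the arm letter (S or L) is treated as optional.
PGSB_PATTERN = re.compile(
    r"(?P<species>Traes)_"
    r"(?P<chromosome>[1-7][ABD])(?P<arm>[SL]?)_"
    r"(?P<uid>[0-9A-Z]{9})"
)

# TGAC: species, accession, chromosome location, assembly version,
# six-digit scaffold number, nine alphanumeric characters.
TGAC_PATTERN = re.compile(
    r"(?P<species>TRIAE)_"
    r"(?P<accession>CS42)_"
    r"(?P<chromosome>[1-7][ABD])(?P<arm>[SL]?)_"
    r"(?P<assembly>TGACv1)_"
    r"(?P<scaffold>[0-9]{6})_"
    r"(?P<uid>[0-9A-Z]{9})"
)


def parse_gene_name(name: str) -> Optional[Dict[str, str]]:
    """Split a gene name into its parts, or return None if it fits neither scheme."""
    for scheme, pattern in (("PGSB v2.2", PGSB_PATTERN), ("TGAC", TGAC_PATTERN)):
        match = pattern.fullmatch(name)
        if match:
            parts = {"scheme": scheme}
            parts.update(match.groupdict())
            return parts
    return None


# Names taken from the examples in this section.
for gene in ("Traes_1BL_447468BDE", "TRIAE_CS42_7DL_TGACv1_731555_AA2173560"):
    print(gene, parse_gene_name(gene))
```

Because the TGAC name carries the scaffold number, a parser like this can be used to group gene models by scaffold. PGSB v2.2 names contain no scaffold information, so that has to be looked up separately.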
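The three tell-tale signs of an incorrect gene prediction listed in the PGSB section can also be written as a simple checklist. The sketch below is illustrative only: the GeneModel record, its field names, the warning_signs function and the edge_margin parameter are all hypothetical, coordinates are assumed to be 1-based positions on the scaffold, and exon number is used as a crude stand-in for "structure". The scaffold names and coordinates in the example are invented placeholders; only the exon counts and the number of alternative transcripts follow the Traes_1AL_729BF3204 case described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GeneModel:
    """Hypothetical record holding what is needed for the three checks."""
    gene_id: str
    scaffold: str
    start: int               # 1-based start of the gene model on its scaffold
    end: int                 # 1-based end of the gene model on its scaffold
    scaffold_length: int
    exon_count: int
    # Scaffold of every transcript predicted for this gene.
    transcript_scaffolds: List[str] = field(default_factory=list)
    # Exon counts of the homoeologous gene models.
    homoeologue_exon_counts: List[int] = field(default_factory=list)


def warning_signs(model: GeneModel, edge_margin: int = 0) -> List[str]:
    """Return the tell-tale signs that apply to this gene model."""
    signs = []
    # Sign 1: the model touches the start or end of the scaffold.
    at_start = model.start <= 1 + edge_margin
    at_end = model.end >= model.scaffold_length - edge_margin
    if at_start or at_end:
        signs.append("gene model reaches the start or end of its scaffold")
    # Sign 2: some transcripts are predicted on a different scaffold.
    if any(s != model.scaffold for s in model.transcript_scaffolds):
        signs.append("alternative transcripts lie on other scaffolds")
    # Sign 3: homoeologues differ in structure (here: exon number only).
    if any(n != model.exon_count for n in model.homoeologue_exon_counts):
        signs.append("homoeologues differ in exon number")
    return signs


# Placeholder scaffold names and coordinates; exon counts follow the text.
example = GeneModel(
    gene_id="Traes_1AL_729BF3204",
    scaffold="scaffold_1",
    start=3200,
    end=5000,
    scaffold_length=5000,
    exon_count=1,
    transcript_scaffolds=["scaffold_1"] + ["scaffold_2"] * 4,
    homoeologue_exon_counts=[5, 5],
)
for sign in warning_signs(example):
    print(sign)
```

A model that triggers any of these checks is a candidate for manual inspection in the genome browser. The checks do not show that a model is wrong, and a model that passes them can still be incorrect.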
