Gene models for wheat

A good genome assembly (as discussed in the Genome assemblies section) is an essential prerequisite for obtaining high-quality gene models: the models can only be as good as the assembly they are based on. Transcriptome data and gene models from related species are often used to help define correct gene models in a species. As with genome assemblies, it is vital for researchers using gene models to understand how they were created, so that they are aware of shortcomings and potential pitfalls. This section looks at the gene models currently available for wheat. Please note that the wheat genome sequencing efforts are rapidly updating the “state of the art” resources, so gene models and genome sequences may change.

a) PGSB version 2.2 gene models

When the first complete draft genome assembly of wheat (CSS) was published in 2014, a set of gene models was created based on this assembly, transcriptome data and gene models from related species (http://www.ncbi.nlm.nih.gov/pubmed/25035500). These gene models are available on the archive version of the EnsemblPlants website (http://archive.plants.ensembl.org/Triticum_aestivum/Info/Index) (also see EnsemblPlants primer) as well as in the in silico TILLING database (www.wheat-tilling.com) for both the tetraploid Kronos and hexaploid Cadenza populations (see TILLING mutant resource section).

This set of gene models (PGSB version 2.2) consists of 193,667 transcripts and splice variants. The longest transcript of each locus is defined as the main transcript, which results in a core set of 99,386 protein coding genes (Figure 1).

The name of each gene model consists of three parts separated by underscores (e.g. Traes_1BL_447468BDE):

1. Species designation; Traes is short for Triticum aestivum
2. Chromosome location; 1BS means “short arm of chromosome 1B”
3. Nine alphanumeric characters; these have no special meaning but are simply a way of naming each gene model uniquely

A lot of work has gone into annotating these gene models correctly but, as mentioned above, the accuracy of a prediction is largely limited by the genome assembly. The first genome draft of wheat (see Genome assemblies) is highly fragmented into more than 10 million scaffolds, and as a result a number of the gene models have been incorrectly annotated.

The A genome homoeologue of Traes_1BL_447468BDE (Figure 1), Traes_1AL_729BF3204, is a perfect example of this (Figure 2). The gene model for Traes_1AL_729BF3204 is the longest of the five transcripts predicted, but the four alternative transcripts are located on a different scaffold. In addition, the gene model for Traes_1AL_729BF3204 has one exon, whereas the homoeologous gene models (Traes_1BL_447468BDE and Traes_1DL_729BF3204; the latter not shown) consist of five exons, indicating that the A-genome model is wrong. In such a case researchers have to create their own version of the gene model by superimposing the “correct” gene models from the homoeologous genomes onto the genome of the missing gene model (here the A-genome). Comparing the gene model with homologous proteins in other closely related species, such as rice and barley, can also assist with this.

Gene models www.wheat-training.com

Figure 1: Example of PGSB version 2.2 gene models. This screenshot from the Archive EnsemblPlants wheat browser shows the large number of transcripts and splice variants that can be predicted for a single locus. By definition, the longest transcript (here transcript number two) has been selected as the default transcript. Filled boxes are exons; hollow boxes are untranslated regions; lines connecting boxes are introns.

This was just one example, but unfortunately cases like this occur frequently.
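The three-part PGSB naming scheme described earlier in this section can be split programmatically, which is convenient when looking up the homoeologous chromosome arms of a gene such as the one in the example above. The following is a minimal sketch, not an official parser: the names PgsbName, parse_pgsb_name and homoeologous_arms are hypothetical, and the regular expression assumes chromosome groups 1 to 7, sub-genomes A, B and D, and an arm letter (S or L) that is treated as optional in case some names carry only a chromosome.

```python
import re
from typing import List, NamedTuple


class PgsbName(NamedTuple):
    species: str     # species designation, e.g. "Traes"
    chromosome: str  # chromosome group plus sub-genome, e.g. "1B"
    arm: str         # "S" (short), "L" (long) or "" if no arm is given
    identifier: str  # nine alphanumeric characters with no special meaning


# Assumed pattern; the optional arm is an assumption, not taken from the text.
PGSB_PATTERN = re.compile(r"^(Traes)_([1-7][ABD])([SL]?)_([0-9A-Za-z]{9})$")


def parse_pgsb_name(name: str) -> PgsbName:
    """Split a PGSB v2.2 gene name into its three underscore-separated parts."""
    match = PGSB_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised PGSB v2.2 gene name: {name!r}")
    return PgsbName(*match.groups())


def homoeologous_arms(parsed: PgsbName) -> List[str]:
    """Return the same chromosome arm on the other two sub-genomes."""
    group, subgenome = parsed.chromosome[0], parsed.chromosome[1]
    return [f"{group}{other}{parsed.arm}" for other in "ABD" if other != subgenome]


parsed = parse_pgsb_name("Traes_1BL_447468BDE")
print(parsed.chromosome, parsed.arm)  # prints: 1B L
print(homoeologous_arms(parsed))      # prints: ['1AL', '1DL']
```

The parser only validates the shape of the name; whether a homoeologue actually exists on the returned arms still has to be checked in the browser or database.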
Before starting any work using gene models, it is therefore highly recommended to check their integrity. As the example above shows, tell-tale signs of incorrect gene predictions are:

1. The gene model ends at the very start or end of the scaffold
2. The gene model has alternative transcripts on other scaffolds
3. The homoeologues of the gene model differ in structure

Figure 2: Example of wrong gene prediction. This screenshot from the EnsemblPlants wheat browser is a good example of an incorrectly predicted gene model. The gene model terminates at the end of the scaffold, while at the same time four alternative transcripts are predicted on the adjacent scaffold. The fragmented genome assembly prevented this gene model from being correctly annotated.

b) TGAC gene models

As discussed in the Genome assemblies section, a new assembly of wheat cultivar Chinese Spring was released in December 2015 (TGAC). Gene models have been created for this new and improved assembly and are the default gene models on Ensembl Plants (http://plants.ensembl.org/Triticum_aestivum/Info/Index).

The gene names consist of six sections separated by underscores (e.g. TRIAE_CS42_7DL_TGACv1_731555_AA2173560):

1. Species designation; TRIAE is short for Triticum aestivum
2. Accession sequenced; CS42 stands for Chinese Spring (variety of wheat sequenced) version 42
3. Chromosome location; 7DL means “long arm of chromosome 7D”
4. Assembly version; TGAC version 1
5. Six numeric characters; these refer to the unique scaffold number
6. Nine alphanumeric characters; these have no special meaning but are simply a way of naming each gene model uniquely

Given the improvement in the scaffold length of the genome assembly, the TGAC gene models in general contain fewer inaccurate predictions than the PGSB v2.2 (CSS) set of gene models. The TGAC assembly has 273,739 predicted transcripts, of which 154,798 come from the 104,091 coding genes.
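The six-part TGAC naming scheme listed above can be unpacked with a simple split on underscores. The sketch below is illustrative only: TgacName and parse_tgac_name are hypothetical names, and the length checks on the last two fields simply encode the "six numeric" and "nine alphanumeric" descriptions given in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TgacName:
    species: str         # e.g. "TRIAE" (Triticum aestivum)
    accession: str       # e.g. "CS42" (Chinese Spring)
    chromosome_arm: str  # e.g. "7DL" (long arm of chromosome 7D)
    assembly: str        # e.g. "TGACv1"
    scaffold: str        # six numeric characters, unique scaffold number
    identifier: str      # nine alphanumeric characters, no special meaning


def parse_tgac_name(name: str) -> TgacName:
    """Split a TGAC gene name into its six underscore-separated sections."""
    parts = name.split("_")
    if len(parts) != 6:
        raise ValueError(f"expected 6 sections, found {len(parts)}: {name!r}")
    parsed = TgacName(*parts)
    if not (len(parsed.scaffold) == 6 and parsed.scaffold.isdigit()):
        raise ValueError(f"scaffold number is not six digits: {parsed.scaffold!r}")
    if not (len(parsed.identifier) == 9 and parsed.identifier.isalnum()):
        raise ValueError(f"identifier is not nine alphanumerics: {parsed.identifier!r}")
    return parsed


gene = parse_tgac_name("TRIAE_CS42_7DL_TGACv1_731555_AA2173560")
print(gene.chromosome_arm, gene.scaffold)  # prints: 7DL 731555
```

Because the scaffold number is part of the name, grouping parsed names by the scaffold field is a quick way to see which gene models share a scaffold.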
At 104,091, the number of TGAC coding genes is similar to the 99,386 in the PGSB v2.2 (CSS) set of gene models.
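The three tell-tale signs of an incorrect gene prediction listed in the PGSB section can be turned into a simple screening function. What follows is a rough sketch under stated assumptions rather than an established tool: the GeneModel record, its field names, the edge_margin parameter and all coordinates in the usage example are invented for illustration, and "differs in structure" is reduced to a comparison of exon counts, which is a deliberate simplification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GeneModel:
    """Hypothetical record; in practice these values come from the annotation."""
    name: str
    scaffold: str
    start: int            # 1-based start of the model on its scaffold
    end: int              # 1-based inclusive end of the model
    scaffold_length: int
    exon_count: int
    alt_transcript_scaffolds: List[str] = field(default_factory=list)
    homoeologue_exon_counts: List[int] = field(default_factory=list)


def integrity_warnings(model: GeneModel, edge_margin: int = 0) -> List[str]:
    """Return one warning per tell-tale sign that applies to the model."""
    warnings = []
    # Sign 1: the model ends at the very start or end of the scaffold.
    if model.start <= 1 + edge_margin or model.end >= model.scaffold_length - edge_margin:
        warnings.append("model touches the scaffold edge")
    # Sign 2: alternative transcripts are predicted on other scaffolds.
    if any(s != model.scaffold for s in model.alt_transcript_scaffolds):
        warnings.append("alternative transcripts on other scaffolds")
    # Sign 3: homoeologues differ in structure (exon count used as a proxy).
    if any(n != model.exon_count for n in model.homoeologue_exon_counts):
        warnings.append("homoeologues differ in exon count")
    return warnings


# Usage with the A-genome example from the text; scaffold names and
# coordinates are made up, only the exon counts (1 versus 5) are from the text.
suspect = GeneModel(
    name="Traes_1AL_729BF3204",
    scaffold="scaffold_A",
    start=4200,
    end=5000,
    scaffold_length=5000,
    exon_count=1,
    alt_transcript_scaffolds=["scaffold_B"] * 4,
    homoeologue_exon_counts=[5, 5],
)
for warning in integrity_warnings(suspect):
    print(warning)
```

A model that raises any of these warnings is a candidate for manual inspection in the genome browser; the function does not by itself prove that a prediction is wrong.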