Tomato genome annotation pipeline in Cyrille2
Erwin Datema

Contents of the annotation pipeline
• Annotation on the BAC level
  • Gene prediction
  • Repeat identification
  • Other features
• Annotation on the gene level (work in progress)
  • blastx vs NCBI's nr (sequence similarity)
  • InterProScan (domain identification)

Ab initio gene structure prediction
Ab initio predictors included in the pipeline
• Genscan
• GlimmerHMM
• GeneId
• SNAP
• Augustus
(trained on tomato!) (has been trained on Solanaceae) (predicts alternatively spliced variants)

Alignment-based gene structure prediction (1)
Transcript alignment (blastn + Sim4)
• SGN tomato UniGenes (34,829 UniGenes)
• SGN potato UniGenes (31,072 UniGenes)
• SGN coffee UniGenes (13,171 UniGenes)
• SGN pepper UniGenes (9,554 UniGenes)
• SGN petunia UniGenes (5,135 UniGenes)
• SGN S. melongena UniGenes (1,841 UniGenes)
• NCBI full-length tomato cDNAs (678 cDNAs)
Protein alignment (tblastn + GeneWise)
• TAIR6 Arabidopsis thaliana proteome (30,690 proteins)
• TIGR4 Oryza sativa proteome (62,827 proteins)
• UniProt Plant division (17,831 proteins)

Additional feature prediction
Repeat identification
• Tandem Repeats Finder
• RepeatMasker
  • RepBase + 'default' features (low complexity, etc.)
  • TIGR Solanum lycopersicon repeat library V2
  • SGN Solanum lycopersicon UniRepeats
Feature prediction
• tRNAscan-SE
• MarScan
• GeneSplicer
Marker identification (blastn + Sim4)

Preliminary results
Annotation of chromosome 6 BACs
• Phase 1, 2 and 3
• 632 contigs
Older version of the pipeline
• GlimmerHMM only trained on Arabidopsis
• 2 UniGene sets (tomato, potato)
• 2 protein sets (Arabidopsis, UniProt plant)
• Protein alignment parameters too strict

The genomic landscape of chromosome 6
• 632 contigs have been annotated
• Contig length varies between 348 and 148,256 nt
• Average length of 9,061 nt, median length of 5,105 nt
• Total length of 5,726,791 nt
• GC content: 29.9% min, 34.1% avg, 42.2% max (sequences longer than 10,000 nt)

Ab initio gene prediction

Predictor     genes   exons   exons/gene   exon length   gene length   genes/kb
Genscan        1065    4630          4.3           249          1084       0.19
GlimmerHMM     1218    3901          3.2           272           872       0.21
GeneId         1210    4002          3.3           273           903       0.21
SNAP           1782    5059          2.8           230           653       0.31
Augustus       1888    8810          4.7           227          1061       0.33

Note: Augustus predictions include up to 3 splice variants per gene
• Estimated gene density is 1 gene per 5 kb
• ~1,200 genes in currently sequenced BACs

Transcript alignment-based gene prediction
• Tomato: 34,829 UniGenes (derived from 239,593 ESTs)
  • 574 hits to the contigs
• Potato: 31,072 UniGenes (derived from 133,657 ESTs)
  • 631 hits to the contigs

Protein alignment-based gene prediction
• UniProt Plant proteins: 17,378 protein sequences from the plant kingdom
  • 195 hits to the contigs
• Arabidopsis thaliana TAIR6 annotation: 30,690 protein sequences
  • 228 hits to the contigs

Repeat density
• TIGR Tomato Repeat Library (95 repeats)
  • 118 regions spanning 53,024 nt
  • Minimum 48 nt, average 449 nt, maximum 7,675 nt
• SGN Tomato UniRepeats (668 repeats)
  • 2,860 regions spanning 1,220,101 nt
  • Minimum 10 nt, average 427 nt, maximum 8,896 nt
• Tandem repeats
  • 1,313 regions spanning 157,921 nt
  • Minimum 24 nt, average 120 nt, maximum 2,526 nt

Additional features
• 74 markers could be aligned (alignment quality unverified)
• 39 predicted tRNA genes
• 1,301 predicted MAR/SAR elements

Generic Genome Browser (1), (2), (3) [screenshot slides]

Recent work
• GeneModelCollector
  • Tries to find 'full' open reading frames in aligned UniGenes
  • Automatic generation of gene predictor training set
  • Parameters?
• JIGSAW
  • Appears not to provide a prediction for every region which contains annotations
  • Training?
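The GeneModelCollector item under Recent work mentions looking for 'full' open reading frames in aligned UniGenes. The slides do not describe how the tool does this, so the following is only a minimal sketch of the general idea, assuming a 'full' ORF means an in-frame ATG followed by an in-frame stop codon on the forward strand. The function name `find_full_orfs` and the `min_codons` parameter are hypothetical and not taken from the pipeline.

```python
# Hypothetical sketch: scan a transcript sequence for complete ORFs
# (ATG ... stop) in the three forward reading frames.
# This is an illustration, not the GeneModelCollector implementation.

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def find_full_orfs(seq, min_codons=30):
    """Return (start, end, frame) tuples for complete forward-strand ORFs.

    start is the 0-based index of the A in ATG, end is the index just
    past the stop codon, and min_codons counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    # Toy sequence: ATG AAA TTT TAG in frame 2
    print(find_full_orfs("CCATGAAATTTTAGGG", min_codons=2))
```

An ORF that runs off the end of the sequence without a stop codon is deliberately not reported, since a truncated model would be a poor training example for a gene predictor. A real collector would probably also have to scan the reverse strand and choose among overlapping candidates, which this sketch leaves out.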
Future Work: Tomato Annotation Pipeline
• Gene prediction
  • Combining predictions into a single consensus model
  • Train individual predictors with recently curated tomato gene set
• Automated functional annotation of genes
  • "Giving a biological meaning to the nicely colored bars"
  • blastx
  • InterProScan

Future Work: Tomato Genome Browser
• Annotation of features
  • Meaningful names for features such as genes, marker alignments, blast hits
  • More detailed and more readable data when clicking on a feature
• Links to external data sources
  • NCBI GenBank
  • SGN

Acknowledgements
• Cyrille2 development: Mark Fiers, Ate van der Burgt, Joost de Groot
• Tomato BAC sequencing (chromosome 6): Greenomics
• Supervision: Willem Stiekema, Roeland van Ham