Tomato genome annotation pipeline in Cyrille2
Erwin Datema

Contents of the annotation pipeline

Annotation on the BAC level
• Gene prediction
• Repeat identification
• Other features

Annotation on the gene level (work in progress)
• blastx vs NCBI’s nr (sequence similarity)
• InterProScan (domain identification)

Ab initio gene structure prediction

Ab initio predictors included in the pipeline
• Genscan
• GlimmerHMM
• GeneId
• SNAP
• Augustus
(trained on tomato!)
(has been trained on Solanaceae)
(predicts alternatively spliced variants)

Alignment-based gene structure prediction (1)

Transcript alignment (blastn + Sim4)
• SGN tomato UniGenes (34,829 UniGenes)
• SGN potato UniGenes (31,072 UniGenes)
• SGN coffee UniGenes (13,171 UniGenes)
• SGN pepper UniGenes (9,554 UniGenes)
• SGN petunia UniGenes (5,135 UniGenes)
• SGN S. melongena UniGenes (1,841 UniGenes)
• NCBI full-length tomato cDNAs (678 cDNAs)

Protein alignment (tblastn + GeneWise)
• TAIR6 Arabidopsis thaliana proteome (30,690 proteins)
• TIGR4 Oryza sativa proteome (62,827 proteins)
• UniProt Plant division (17,831 proteins)

Additional feature prediction

Repeat identification
• Tandem Repeats Finder
• RepeatMasker
    • RepBase + ‘default’ features (low complexity, etc.)
    • TIGR Solanum lycopersicon repeat library V2
    • SGN Solanum lycopersicon UniRepeats

Feature prediction
• tRNAscan-SE
• MarScan
• GeneSplicer
• Marker identification (blastn + Sim4)

Preliminary results

Annotation of chromosome 6 BACs
• phase 1, 2 and 3
• 632 contigs
• Older version of the pipeline
    • GlimmerHMM only trained on Arabidopsis
    • 2 UniGene sets (tomato, potato)
    • 2 protein sets (Arabidopsis, UniProt plant)
    • Protein alignment parameters too strict

The genomic landscape of chromosome 6

632 contigs have been annotated
• Length of contigs varies between 348 and 148,256 nt
• Average length of 9,061 nt, median length of 5,105 nt
• Total length of 5,726,791 nt
• GC content: 29.9% min, 34.1% avg, 42.2% max (sequences longer than 10,000 nt)

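
Summary figures like the ones above can be recomputed from the contig sequences. The following is a minimal sketch, not the pipeline's own code: the function names, the `contigs` dictionary and the `gc_min_length` parameter are hypothetical, and it assumes the GC figures are per-contig fractions taken only over contigs longer than the threshold.

```python
from statistics import mean, median


def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def contig_summary(contigs, gc_min_length=10_000):
    """Summarize contig lengths and GC content.

    contigs: dict mapping a contig name to its nucleotide sequence (hypothetical input).
    gc_min_length: GC content is only reported for contigs longer than this (assumption).
    """
    lengths = [len(seq) for seq in contigs.values()]
    gc_values = [gc_fraction(seq) for seq in contigs.values()
                 if len(seq) > gc_min_length]
    summary = {
        "count": len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": mean(lengths),
        "median_length": median(lengths),
        "total_length": sum(lengths),
    }
    if gc_values:
        summary.update(gc_min=min(gc_values),
                       gc_mean=mean(gc_values),
                       gc_max=max(gc_values))
    return summary


# Toy data only; real input would be the 632 annotated contigs.
toy_contigs = {"a": "ATGC" * 3, "b": "AATT" * 2, "c": "GGCC"}
toy_summary = contig_summary(toy_contigs, gc_min_length=5)
```

Whether the reported average GC content is a mean over contigs or a length-weighted value is not stated on the slide; the sketch uses the unweighted mean.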
Ab initio gene prediction

predictor    genes   exons   exons/gene   exon length   gene length   genes/kb
Genscan       1065    4630          4.3           249          1084       0.19
GlimmerHMM    1218    3901          3.2           272           872       0.21
GeneId        1210    4002          3.3           273           903       0.21
SNAP          1782    5059          2.8           230           653       0.31
Augustus      1888    8810          4.7           227          1061       0.33

Note: Augustus predictions include up to 3 splice variants per gene

Estimated gene density is 1 gene per 5 kb
• ~1,200 genes in currently sequenced BACs

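
The derived columns of the table (exons/gene and genes/kb) follow from the gene and exon counts. The sketch below recomputes them under the assumption that the denominator is the total contig length of 5,726,791 nt reported earlier; the variable and function names are illustrative and not part of the pipeline.

```python
# Total length of the annotated contigs in nt (taken from the slides).
TOTAL_LENGTH_NT = 5_726_791

# (genes, exons) per predictor, copied from the table.
counts = {
    "Genscan": (1065, 4630),
    "GlimmerHMM": (1218, 3901),
    "GeneId": (1210, 4002),
    "SNAP": (1782, 5059),
    "Augustus": (1888, 8810),
}


def derived_columns(genes, exons, total_nt=TOTAL_LENGTH_NT):
    """Return exons per gene and genes per kb, rounded as in the table."""
    return {
        "exons_per_gene": round(exons / genes, 1),
        "genes_per_kb": round(genes / (total_nt / 1000), 2),
    }


summary = {name: derived_columns(genes, exons)
           for name, (genes, exons) in counts.items()}

# Gene count implied by the estimated density of 1 gene per 5 kb.
genes_at_one_per_5kb = TOTAL_LENGTH_NT / 5000

for name, row in summary.items():
    print(f"{name:<11}{row['exons_per_gene']:>5}{row['genes_per_kb']:>6}")
print(f"genes at 1 per 5 kb: {genes_at_one_per_5kb:.0f}")
```

Under this assumption the recomputed values agree with the table, and one gene per 5 kb over the sequenced length gives a count in line with the estimate of roughly 1,200 genes.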
Transcript alignment-based gene prediction

Tomato
• 34,829 UniGenes (derived from 239,593 ESTs)
• 574 hits to the contigs

Potato
• 31,072 UniGenes (derived from 133,657 ESTs)
• 631 hits to the contigs

Protein alignment-based gene prediction

UniProt Plant proteins
• 17,378 protein sequences from the plant kingdom
• 195 hits to the contigs

Arabidopsis thaliana TAIR6 annotation
• 30,690 protein sequences
• 228 hits to the contigs

Repeat density

TIGR Tomato Repeat Library (95 repeats)
• 118 regions spanning 53,024 nt
• Minimum 48 nt, average 449 nt, maximum 7,675 nt

SGN Tomato UniRepeats (668 repeats)
• 2,860 regions spanning 1,220,101 nt
• Minimum 10 nt, average 427 nt, maximum 8,896 nt

Tandem repeats
• 1,313 regions spanning 157,921 nt
• Minimum 24 nt, average 120 nt, maximum 2,526 nt

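
To put the repeat spans in context, they can be expressed as a percentage of the annotated sequence. This is a small illustrative calculation, assuming the total contig length of 5,726,791 nt from the earlier slide is the right denominator; the names used are hypothetical.

```python
# Total length of the annotated contigs in nt (taken from the slides).
TOTAL_LENGTH_NT = 5_726_791

# Nucleotides spanned by each repeat class, copied from the slide.
repeat_span_nt = {
    "TIGR Tomato Repeat Library": 53_024,
    "SGN Tomato UniRepeats": 1_220_101,
    "Tandem repeats": 157_921,
}


def percent_of_sequence(span_nt, total_nt=TOTAL_LENGTH_NT):
    """Percentage of the total sequence covered by a span of span_nt nucleotides."""
    return 100.0 * span_nt / total_nt


coverage = {name: percent_of_sequence(span)
            for name, span in repeat_span_nt.items()}

for name, pct in coverage.items():
    print(f"{name}: {pct:.2f}% of the annotated sequence")
```

The three classes may overlap on the contigs, so the percentages should be read separately rather than summed.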
Additional features

74 markers could be aligned
• alignment quality unverified

39 predicted tRNA genes

1,301 predicted MAR/SAR elements

Generic Genome Browser (1)
Generic Genome Browser (2)
Generic Genome Browser (3)
Recent work

GeneModelCollector
• Tries to find ‘full’ open reading frames in aligned UniGenes
• Automatic generation of gene predictor training set
• Parameters?

JIGSAW
• Appears not to provide a prediction for every region which contains annotations
• Training?

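
The idea of looking for a ‘full’ open reading frame in a transcript can be illustrated as follows. This is a generic sketch and not GeneModelCollector's actual algorithm: it assumes a ‘full’ ORF means an ATG start codon followed in frame by a stop codon, searches only the forward strand, and uses a hypothetical function name.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_full_orf(seq):
    """Return (start, end, frame) of the longest ATG...stop ORF, or None.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    Only the three forward-strand reading frames are examined.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon since the last stop in this frame
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end, frame)
                start = None  # look for the next ORF in this frame
    return best


# Toy transcript: the ORF ATGAAATTTTAG starts at position 2.
orf = longest_full_orf("CCATGAAATTTTAGGG")
```

A transcript that lacks either the start or an in-frame stop codon yields `None`, which is one simple way to exclude partial UniGenes from a training set.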
Future Work – Tomato Annotation Pipeline

Gene prediction
• Combining predictions into a single consensus model
• Train individual predictors with recently curated tomato gene set

Automated functional annotation of genes
• “Giving a biological meaning to the nicely colored bars”
• blastx
• InterProScan

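
One very simple way to combine several gene predictions into a consensus is a per-base majority vote over predicted exons. The sketch below illustrates only that idea; it is not how JIGSAW or the pipeline combines evidence, it ignores reading frame and splice sites, and all names and the `min_votes` threshold are hypothetical.

```python
def consensus_coding_mask(length, predictions, min_votes):
    """Mark each position that at least min_votes predictors call as exon.

    predictions: dict mapping a predictor name to a list of (start, end)
    exon intervals, 0-based and end-exclusive (assumed input format).
    """
    votes = [0] * length
    for exons in predictions.values():
        covered = set()
        for start, end in exons:
            covered.update(range(max(start, 0), min(end, length)))
        for pos in covered:  # each predictor votes at most once per position
            votes[pos] += 1
    return [count >= min_votes for count in votes]


def mask_to_intervals(mask):
    """Convert a boolean mask into a list of (start, end) intervals."""
    intervals = []
    start = None
    for pos, flag in enumerate(mask):
        if flag and start is None:
            start = pos
        elif not flag and start is not None:
            intervals.append((start, pos))
            start = None
    if start is not None:
        intervals.append((start, len(mask)))
    return intervals


# Toy example with three hypothetical predictors on a 20 nt region.
toy_predictions = {"A": [(0, 10)], "B": [(5, 15)], "C": [(8, 20)]}
majority = mask_to_intervals(consensus_coding_mask(20, toy_predictions, 2))
unanimous = mask_to_intervals(consensus_coding_mask(20, toy_predictions, 3))
```

A real combiner would also weigh each predictor by its accuracy and enforce valid gene structures, which is where training on a curated gene set comes in.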
Future Work – Tomato Genome Browser

Annotation of features
• Meaningful names for features such as genes, marker alignments, blast hits
• More detailed and more readable data when clicking on a feature

Links to external data sources
• NCBI GenBank
• SGN

Acknowledgements

Cyrille2 development
• Mark Fiers
• Ate van der Burgt
• Joost de Groot

Tomato BAC sequencing (chromosome 6)
• Greenomics

Supervision
• Willem Stiekema
• Roeland van Ham