Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
VectorBase genome annotation VectorBase-EBI, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton UK VectorBase SWG 2006 1 Overview of current annotation system Assembled genome Sequencing centre gene predictions VectorBase gene predictions Merge into canonical set Protein analysis Display on genome browser Release to GenBank/EMBL/DDBJ VectorBase SWG 2006 2 Merging gene sets Gene set Gene set #1 #2 Reduce to single predictions per locus Compare exon/intron structures Identical structure s Compatib Different Merge/Spli le structure t structures structure s s Add isoform predictions based on EST/Peptide data Canonical gene set VectorBase SWG 2006 Complex No Map 3 Data types used for gene prediction/validation Protein sequences ‘Self’ (i.e. species to be predicted) Taxonomic splits of UniprotKB Transcript sequences mRNAs ESTs Evidence of expression Microarray SAGE tags Ditags MPSS Proteomics data Sequence statistics Coding potential Splice site prediction VectorBase SWG 2006 4 VectorBase gene prediction pipeline Blessed predictions Manual annotations (Apollo) Community (Genewise, Exonerate, submissions Apollo) Similarity (Genewise) predictions Species-specific (Genewise) predictions ncRNA (Rfam) predictions Transcript based (Exonerate) predictions Canonic al predictio ns VectorBase SWG 2006 Protein family (Genewise) HMMs Ab initio gene (SNAP) predictions 5 VectorBase curation database pipeline for manual/community annotation Manual annotation (Harvard) Apoll o ChadoXML Community annotation (in collaboration with Harvard) Community annotation (Community representatives) Curation warehouse db Chad o ChadoXML Apoll o GFF3 Ensembl Gene build db VectorBase SWG 2006 6 Overview of current re-annotation system Full gene build Partial Gene build Blessed genes New gene build Compar e Species-specific gene prediction Curren t gene set Updat ed gene set VectorBase SWG 2006 Merge 7 Comparing new gene builds with the old one • Use of manual annotation for validation of automated gene build improvements • Simple statistics (CDS length, intron size, CDS matching TE’s) • BRC annotation metrics – Supporting evidence for a gene prediction (citation,expression,orthology) – Attachment of Standard Operating Procedures (SOPs) VectorBase SWG 2006 8 Gene build schedules Full gene build Triggers for re-annotation 4 months • Temporal • Data • New EST data for species 1 month Partial gene build • New genomes • Re-annotated genomes VectorBase SWG 2006 9 VectorBase annotation capacity with increased number of genomes Gene builds per year per genome 2 full 2 partial 1 full 3 partial 1full 2 partial 1 full 1 partial 2 genomes Yes Yes Yes Yes 3 genomes Yes Yes Yes Yes 4 genomes No Yes Yes Yes 5 genomes No Yes Yes Yes 6 genomes No No Yes Yes 7 genomes No No No Yes 8 genomes No No No No VectorBase SWG 2006 10 Re-annotation questions • Triggers for re-annotation – Strict temporal triggers • Always do a full gene build every year? – Data triggers • How much new data is enough? • Knock-on effects of related species (re)annotation? • Encouraging community submissions – How can we get more community annotation input? • Outreach at conferences (Roadshow) VectorBase SWG 2006 11