Havana manual annotation
Jen Harrow
Wellcome Trust Sanger Institute

Overview
• General intro: work/aims of Havana
• Annotation tools
• Vega view

Havana Overview
• 24 team members (2 work remotely: Manchester/Glasgow)
• 3 genomes annotated with different levels of evidence (human/mouse/zebrafish)
• Mouse: EUCOMM/KOMP/NORCOMM projects with external collaborators
• Feedback from experts (ENCODE/Biosapiens/Vega website)
• Recently funded to complete the human genome as part of the Encode scale-up

Havana Annotation
• Coding and non-coding loci (known and novel)
• Pseudogenes (processed/unprocessed/transcribed)
• polyA sites/signals
• Check and submit to nomenclature databases (MGI/HGNC/ZFIN)
• Specialise in annotation/categorisation of variants
• Freely accessible via the Vega/Ensembl browsers

Rules for selecting CDS_type
[Flowchart. Decision points, in order of appearance:]
• CDS?
• Identical* to Swissprot or Refseq NP_? (Yes: Known_CDS)
• Novel first or last coding exon?
• Has cross-species Swissprot/Trembl or gene family support?
• Shares >60% length of Known_CDS, Swissprot or Refseq?
• Pfam domain structure identical to Known_CDS?
[Outcomes: Known_CDS, Novel_CDS, Putative_CDS]

Tagging transcripts
• Known CDS (Uniprot or Refseq)
• Novel CDS
• Putative CDS
• NMD
* now protein coding transcripts
• Biosapiens looked at novel/putative variants
• Found little evidence that they had the same function as known variants

Transcript assignment
[Flowchart. Decision points, with the outcome reached on "Yes":]
• Transcript variant? (No: go to CDS flowchart)
• Literature confirmed as non-coding? (Yes: Non_coding)
• Literature confirmed as antisense? (Yes: Antisense)
• >1 possible ORF? (Yes: Ambiguous_ORF)
• Fully retains or runs >200 bases into intron? (Yes: Retained_intron)
• Full-length transcript subject to NMD? (Yes: NMD)
• EST supported 2-3 exon variant? (Yes: Putative)
• mRNA supported, non-canonical splice site? (Yes: Artefact)
• Otherwise: Transcript

Consortium links for Encode scale-up for whole human genome
• CCDS: HGNC/Uniprot, NCBI/Refseq, Ensembl, UCSC
• HSF
• Manual annotation: WTSI: Havana
• Computational:
  UCSC: Data centre/QC
  WashU: Brent/QC
  Yale: Gerstein/Pseudogene
  MIT: Kellis/Comparative analysis
  CRG: Guigo tracking/primer design
  Sanger: Ensembl QC/tracking evidence
• Experimental: Lausanne, CRG; Gingeras/Txfrag (ENCODE)
• Biosapiens

Goals for mouse annotation
• Funded EUCOMM project: 3 annotators, goal ~8000 KO (12000 genes, a mix of old and new; funded until 2009)
• Collaborate with CCDS mouse to annotate difficult cases (updates highlighted in RT; not funded)
• Collaborate with WashU KOMP project/NORCOMM (no direct funding for annotation)
• Internal collaborations with WTSI mouse faculty
• Updates usually only on request or from Vega/Uniprot (HSF) feedback

Structure of manual annotation pipeline
[Diagram. Components, in order of appearance:]
• Compute farm
• Clone; annotation submitted to EMBL database
• Ensembl pipeline database (clone-based raw analysis); incremental updates of datasets possible
• Otter/loutre annotation database (mysql); lace/spandit client
• ana_notes for clone selection and tracking
• Convert to xace
• Lace and Zmap: editing interfaces, use temporary local database
• Regular QC; dumps of GFF files etc.
• Vega database (http, DAS)
• Vega/Ensembl browsers

New annotation tool (move from Fmap to Zmap)

Annotation Interface:
• Data sets

Annotation Interface:
• Viewers and editors: Lace

Annotation Interface:
• Viewers and editors: Zmap overview
[Screenshot: clone tiling path]

Zmap: viewing homologies
[Screenshot. Columns: EST, manual annotation, predictions, repeats, RNA, protein, Refseqs, mRNA alignments, EST alignments]
• Expand hits
• "Traffic light" links
• Vertical and horizontal splitting: enables viewing homology hits of long genes

Annotation Interface:
• Viewers and editors: Zmap

Annotation Interface:
• Viewers and editors: Blixem

Annotation Interface:
• Viewers and editors: Dotter

Why is consistency important?
• Mouse projects EUCOMM/KOMP (EU Conditional Mouse Mutant/KnockOut Mouse Project (USA))
• 5 externally funded annotators
• 2 annotators based at WashU
• At least 3000 targets annotated/year
• If annotation isn't reliable/consistent then experiments can fail

Assessing scale of problem
• Test all annotators on an unannotated mouse region: >30 loci, chr10
• Time limit 2 days, no discussion between team members
• Software modified to allow multiple sessions opened on the same clones and not seen by other members
• Laurens' reference annotation to compare against

Annotation consistency mouse chr10
[Figure: Mum1 locus. Tracks: reference CDS and transcript, Ensembl, and each annotator's CDS and transcript]

Inconsistencies between annotators
[Figure: Cirpb1, RP23-6P9.7, Efna1]

Result of the test
• Guidelines needed to be updated
• Clarify variant assignment (CDS and transcript), although assignment of transcript is more troublesome
• Mostly, rate of annotation is attributed to experience

Current Vega mouse statistics

Nod region

Mouse chrX finished Nov 06

contigview: polyA sites and signals visible

Locus report
[Screenshot. Labels: CCDS, date of annotation, xref, transcript classes included (known, retained intron), evidence used to build transcript]

Merge genes in Ensembl

Knockout transcripts: EUCOMM and KOMP
• Compare KO transcript against original transcript
• KO MIG database gene lists

Summary: mouse annotation
• Change from sequencing to KO targets or CCDS
• Reannotation not automatic (feedback/request)
• Encode project will help improve overall Havana annotation (computational QC)
• Vega is the main portal to access data (updated every 2 months)
• Move to multi-genome annotation with improvements in Zmap

Acknowledgements
Havana: Jeff Almeida, Clara Amid, If Barnes, Denise Cavalhoe Silva, Sarah Donaldson, Adam Frankish, Elizabeth Hart, Rhoda Kinsella, Gavin Laird, Jane Loveland, Jonathan Mudge, Jeena Rajan, Harminder Sehra, Catherine Snow, Charles Steward, Marie-M. Suner, Mark Thomas, Laurens Wilming
Informatics: James Gilbert, Chao-Kung Chen, Leo Gordon, Mustapha Larbaoui
Ensembl: Steve Searle, Val Curwen
Acedb/Zmap: Ed Griffiths, Roy Storey
Vega: Stephen Trevanion
Project Coordinators: Tim Hubbard
Gencode collaborators: Roderic Guigo, France Denoued, Alex Reymond

Taf1, TAF1 RNA polymerase II, TATA box binding protein (TBP)-associated factor
[Figure: Phastcons track showing conservation]

Annotation, vector design
• NB: exon/exons deleted, not whole gene
• 96 BACs
• 3 rounds of recombineering in 96-well boxes
[Diagram: 1. FRT bgal FRT; 2. loxP; 3. loxP]
• 96 targeting constructs

Annotation system:
[Diagram. Components:]
• ANALYSIS PIPELINE: mRNA, EST, protein BLAST; Genscan, Fgenesh gene predictions; RepeatMasker; tandem repeats; CpG islands; RefSeq; Ensembl; .....
• MYSQL DB: analysis data
• LACE: transcript editing interface
• FMAP/ZMAP: viewer
• MYSQL DB: annotation
• VEGA, ENSEMBL

Manual annotation:
• manual annotation of finished genomic sequence
• every exon of every transcript supported by homology (mRNA/EST/protein)
• splice variants
• pseudogenes
• nomenclature
• gene clusters
• interpretation of problematic evidence
• examination of literature

NMD: highly sophisticated pathway
Neu-Yilik et al, Genome Biol 2004 4:1218

Consistent annotation of all coding variants essential
[Figure labels: critical exon; exon missing in variant]
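
The closing point, that a knockout design depends on every coding variant of a locus being annotated consistently, can be illustrated with a small sketch. It is not the Havana, EUCOMM or KOMP procedure: the exon coordinates, the variant names and both helper functions are invented for illustration. It also assumes, as a simplification, that a candidate critical exon is one found with identical boundaries in every coding variant. The slides do not state the actual selection criteria.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: an exon is a (start, end) pair of genomic coordinates.
Exon = Tuple[int, int]


def shared_coding_exons(variants: Dict[str, List[Exon]]) -> List[Exon]:
    """Return the exons present, with identical boundaries, in every coding variant."""
    if not variants:
        return []
    exon_sets = [set(exons) for exons in variants.values()]
    return sorted(set.intersection(*exon_sets))


def variants_missing_exon(variants: Dict[str, List[Exon]], exon: Exon) -> List[str]:
    """Return the names of the variants that do not contain the given exon."""
    return sorted(name for name, exons in variants.items() if exon not in exons)


# Invented example locus with three annotated coding variants.
locus = {
    "variant_1": [(100, 200), (300, 400), (500, 600)],
    "variant_2": [(100, 200), (500, 600)],              # skips exon (300, 400)
    "variant_3": [(100, 200), (300, 400), (500, 650)],  # alternative last exon end
}

print(shared_coding_exons(locus))                # [(100, 200)]
print(variants_missing_exon(locus, (300, 400)))  # ['variant_2']
```

If variant_2 had not been annotated, exon (300, 400) would wrongly appear to be shared by all coding variants, which is the failure mode the slide warns about.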