VectorBase annotation metrics
Daniel Lawson
VectorBase-EBI, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton UK
VectorBase BRC4 2006

Topics
• Annotation metrics
  – Numbers (Gene numbers & xrefs)
  – Data types (Availability & Integration)
• Annotation SOPs
  – Genome specific
  – Gene specific
  – Gene build profile & prediction confidence

| Category | Metric | AaegL1.1 | AgamP3.3 | Yeast | Worm | Fly | Human |
|---|---|---|---|---|---|---|---|
| Gene | Gene count | 16,691 | 13,765 | 7,098 | 21,105 | 14,752 | 31,206 |
| Gene | Protein-coding | 15,419 (92.4 %) | 13,277 (96.5 %) | 6,680 | 20,060 | 14,086 | 23,245 |
| Gene | other | 1,272 (7.6 %) | 488 (3.5 %) | 418 | 1,045 | 666 | 7,961 |
| Transcript | Transcript count | 18,061 | 14,127 | - | - | - | - |
| Transcript | Protein-coding | 16,789 (93.0 %) | 13,639 (96.5 %) | - | - | - | - |
| Transcript | other | 1,272 (7.0 %) | 488 (3.5 %) | - | - | - | - |
| Manual effort | Manually reviewed | 0 (0.0 %) | 261 (1.9 %) | 6,680 | 20,060 | 14,086 | 6,995 |
| Manual effort | Community input | 0 (0.0 %) | 667 (4.9 %) | 4,684 | 7,228 | 9,945 | 16,887 |
| Orthologs | Combined | 11,487 (74.5 %) | 9,782 (73.7 %) | - | - | - | - |
| Orthologs | A.aegypti | n/a | 8,907 (67.1 %) | 2,202 | 4,416 | 7,991 | 6,590 |
| Orthologs | A.gambiae | 9,923 (54.9 %) | n/a | 2,228 | 4,444 | 7,702 | 6,612 |
| Orthologs | C.elegans | 4,923 (29.5 %) | 4,442 (33.4 %) | 2,185 | n/a | 4,598 | 6,121 |
| Orthologs | D.melanogaster | 9,078 (50.3 %) | 7,649 (57.6 %) | 2,228 | 4,543 | n/a | 6,654 |
| Orthologs | H.sapiens | 5,510 (33.0 %) | 5,046 (38.0 %) | 2,326 | 4,473 | 5,109 | n/a |
| Orthologs | S.cerevisiae | 2,520 (15.1 %) | 2,350 (17.7 %) | n/a | 2,349 | 2,470 | 3,265 |
| Functional annotation | GO terms | 9,335 (51.7 %) | 7,601 (55.7 %) | 4,176 | 11,334 | 10,226 | 17,000 |
| Functional annotation | EC numbers | 2,950 (16.3 %) | 2,230 (16.4 %) | 4,103 * | 5,240 * | 4,009 * | 13,245 * |
| Functional annotation | InterPro | 11,536 (74.8 %) | 9,869 (72.4 %) | 4,611 | 14,730 | 10,475 | 18,199 |
| Expression evidence | Combined | 12,350 (80.0 %) | 7,557 (55.4 %) | - | - | - | - |
| Expression evidence | cDNA/EST | 9,270 (60.1 %) | 7,557 (55.4 %) | - | - | - | - |
| Expression evidence | microarray | 9,143 (59.2 %)† | 0 (0.0 %)‡ | - | - | - | - |
| Expression evidence | MPSS | 3,984 (25.8 %)† | n/a | - | - | - | - |

Considerations
• Importance of calculating all metrics using similar methodology from the same data set
• Metrics calculated from Ensembl using BioMart & raw SQL queries.
• GO terms: many ways of calculating (InterPro2GO, projection from Drosophila orthologs)
• No VectorBase capability to automatically assign EC numbers

| Data type | AaegL1.1 available | AaegL1.1 integration | AgamP3.3 available | AgamP3.3 integration |
|---|---|---|---|---|
| Sequence | Yes | Download, search, visualization | Yes | Download, search, visualization |
| Polymorphisms | No | n/a | Yes | Search, visualization |
| Genetic maps | Yes | Not integrated | Yes | Visualization |
| Syntenic alignment | Yes | Visualization | Yes | Visualization |
| cDNAs & ESTs | Yes | Download, search, visualization | Yes | Download, search, visualization |
| SAGE tags | No | n/a | No | n/a |
| Microarrays | Yes | Visualization | Yes | Visualization |
| MPSS | Yes | Not integrated | No | n/a |
| Proteomics | No | n/a | No | n/a |
| Structures | No | n/a | No | n/a |
| Interactome data | No | n/a | No | n/a |
| Pathways | No | n/a | No | n/a |
| Orthology profiles | Yes | Visualization | Yes | Visualization |
| Essentiality data | No | n/a | No | n/a |

VectorBase gene prediction pipeline (SOP) VB:SOP010
• Blessed predictions: manual annotations and community submissions (VB:SOP007)
• Species-specific predictions (VB:SOP001)
• Similarity predictions (VB:SOP002 & SOP003)
• Transcript based predictions (VB:SOP004)
• Ab initio gene predictions (VB:SOP005)
• ncRNA predictions (VB:SOP008)
• Protein family HMMs (VB:SOP009)
• Result: Canonical Gene set

Assignment of SOPs to VectorBase genes: AgamP3.3

| SOP | Description | No. genes |
|---|---|---|
| VB:SOP001 | Confirmed | 674 |
| VB:SOP002 | Protein-based with transcript support | 3765 |
| VB:SOP003 | Protein-based | 4830 |
| VB:SOP004 | Transcript-based | 2857 |
| VB:SOP005 | Supported ab initio | 585 |
| VB:SOP006 | ab initio | 0 |
| VB:SOP007 | Manual annotation | 928 |

Display of Metrics & SOPs
• Metrics
  – VectorBase wiki
  – Species-page containing the three tables available from the VectorBase species homepage
  – Expansion of documents relating to genomic resources (citations, links to primary data where possible)
  – Single collated table for BRC as separate download
• SOPs
  – VectorBase wiki
  – ‘Documents’ section of main site

Manual annotation progress

| Species | Assembly | Protein-coding gene No. | Row | VectorBase manual | Community submission |
|---|---|---|---|---|---|
| Anopheles gambiae | AgamP3.3 | 13,277 | current | 261 (2.0 %) | 667 (5.0 %) |
| | | | (row label not recovered) | 2474 (18.6 %) | 667* (5.0 %) |
| Aedes aegypti | AaegL1.1 | 15,419 | current | 0 (0.0 %) | 0 (0.0 %) |
| | | | (row label not recovered) | 0 (0.0 %) | 341 (2.2 %) |

Merging gene sets
• Inputs: Gene set #1, Gene set #2
• Reduce to single predictions per locus
• Compare exon/intron structures
  – Identical structures
  – Compatible structures
  – Different structures
  – Merge/Split structures
  – Complex
  – No Map
• Add isoform predictions based on EST/Peptide data
• Output: Canonical gene set
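
The comparison step in the gene set merge can be sketched in Python. This is only an illustration under stated assumptions, not the VectorBase implementation: the slide names the six outcome categories but does not define them, so the rules below are hypothetical (identical = same exons; compatible = same introns but different terminal exon ends; different = overlapping one-to-one with conflicting introns; merge/split = one model overlapping several models in the other set; complex = many-to-many; no map = no overlapping partner). All function names and the example coordinates are invented for the sketch, each model is assumed to be already reduced to a single prediction per locus on one strand, and the isoform step is not modelled.

```python
from typing import Dict, List, Tuple

# A gene model is a sorted list of (start, end) exons, inclusive coordinates.
Exons = List[Tuple[int, int]]


def span(exons: Exons) -> Tuple[int, int]:
    """Genomic extent of a model: first exon start to last exon end."""
    return exons[0][0], exons[-1][1]


def overlaps(a: Exons, b: Exons) -> bool:
    """True if the genomic extents of two models overlap."""
    (a_start, a_end), (b_start, b_end) = span(a), span(b)
    return a_start <= b_end and b_start <= a_end


def introns(exons: Exons) -> List[Tuple[int, int]]:
    """Introns implied by consecutive exons."""
    return [(left[1] + 1, right[0] - 1) for left, right in zip(exons, exons[1:])]


def compare_structures(a: Exons, b: Exons) -> str:
    """Assumed rules for a one-to-one pair of overlapping models."""
    if a == b:
        return "identical"
    if introns(a) == introns(b):
        return "compatible"  # only the outer exon boundaries differ
    return "different"


def classify_loci(set1: Dict[str, Exons], set2: Dict[str, Exons]) -> Dict[str, str]:
    """Classify every model; IDs are assumed unique across both sets."""
    hits1 = {n: [m for m, e2 in set2.items() if overlaps(e1, e2)]
             for n, e1 in set1.items()}
    hits2 = {m: [n for n, e1 in set1.items() if overlaps(e1, e2)]
             for m, e2 in set2.items()}
    result: Dict[str, str] = {}
    for name, partners in hits1.items():
        if not partners:
            result[name] = "no map"
        elif len(partners) == 1 and len(hits2[partners[0]]) == 1:
            result[name] = compare_structures(set1[name], set2[partners[0]])
        elif len(partners) > 1 and all(len(hits2[p]) == 1 for p in partners):
            result[name] = "merge/split"  # one model here, several in set 2
        elif len(partners) == 1 and all(len(hits1[q]) == 1 for q in hits2[partners[0]]):
            result[name] = "merge/split"  # several models here, one in set 2
        else:
            result[name] = "complex"
    for name, partners in hits2.items():
        if not partners:
            result[name] = "no map"  # set 2 model with no partner in set 1
    return result


# Invented example coordinates, one locus per category.
gene_set_1 = {
    "A1": [(100, 200), (300, 400)],
    "A2": [(1000, 1100), (1200, 1300)],
    "A3": [(2000, 2100), (2200, 2300)],
    "A4": [(3000, 3500)],
    "A5": [(5000, 5100)],
}
gene_set_2 = {
    "B1": [(100, 200), (300, 400)],
    "B2": [(990, 1100), (1200, 1350)],
    "B3": [(2000, 2100), (2150, 2300)],
    "B4": [(3000, 3200)],
    "B5": [(3300, 3500)],
    "B6": [(7000, 7100)],
}

result = classify_loci(gene_set_1, gene_set_2)
for model, category in sorted(result.items()):
    print(model, category)
```

In a real merge the overlap test would also need to respect strand and sequence region, and the thresholds for "compatible" would likely be tuned per genome; both are left out here to keep the category logic visible.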