VectorBase annotation metrics
Daniel Lawson
VectorBase-EBI, European Bioinformatics Institute,
Wellcome Trust Genome Campus, Hinxton, UK
VectorBase BRC4 2006
Topics
• Annotation metrics
– Numbers (Gene numbers & xrefs)
– Data types (Availability & Integration)
• Annotation SOPs
– Genome specific
– Gene specific
– Gene build profile & prediction confidence
                         AaegL1.1          AgamP3.3          Yeast     Worm      Fly       Human
Gene
  Gene count             16,691            13,765            7,098     21,105    14,752    31,206
  Protein-coding         15,419 (92.4 %)   13,277 (96.5 %)   6,680     20,060    14,086    23,245
  other                  1,272 (7.6 %)     488 (3.5 %)       418       1,045     666       7,961
Transcript
  Transcript count       18,061            14,127            -         -         -         -
  Protein-coding         16,789 (93.0 %)   13,639 (96.5 %)   -         -         -         -
  other                  1,272 (7.0 %)     488 (3.5 %)       -         -         -         -
Manual effort
  Manually reviewed      0 (0.0 %)         261 (1.9 %)       6,680     20,060    14,086    6,995
  Community input        0 (0.0 %)         667 (4.9 %)       4,684     7,228     9,945     16,887
Orthologs
  Combined               11,487 (74.5 %)   9,782 (73.7 %)    -         -         -         -
  A.aegypti              n/a               8,907 (67.1 %)    2,202     4,416     7,991     6,590
  A.gambiae              9,923 (54.9 %)    n/a               2,228     4,444     7,702     6,612
  C.elegans              4,923 (29.5 %)    4,442 (33.4 %)    2,185     n/a       4,598     6,121
  D.melanogaster         9,078 (50.3 %)    7,649 (57.6 %)    2,228     4,543     n/a       6,654
  H.sapiens              5,510 (33.0 %)    5,046 (38.0 %)    2,326     4,473     5,109     n/a
  S.cerevisiae           2,520 (15.1 %)    2,350 (17.7 %)    n/a       2,349     2,470     3,265
Functional annotation
  GO terms               9,335 (51.7 %)    7,601 (55.7 %)    4,176     11,334    10,226    17,000
  EC numbers             2,950 (16.3 %)    2,230 (16.4 %)    4,103 *   5,240 *   4,009 *   13,245 *
  InterPro               11,536 (74.8 %)   9,869 (72.4 %)    4,611     14,730    10,475    18,199
Expression evidence
  Combined               12,350 (80.0 %)   7,557 (55.4 %)    -         -         -         -
  cDNA/EST               9,270 (60.1 %)    7,557 (55.4 %)    -         -         -         -
  microarray             9,143 (59.2 %)†   0 (0.0 %)‡        -         -         -         -
  MPSS                   3,984 (25.8 %)†   n/a               -         -         -         -
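The percentages in the gene rows can be reproduced from the raw counts. The following is a minimal sketch, assuming each percentage is the category count divided by the total gene count for that gene set; the dictionary layout and the helper name `share` are illustrative and not part of the original slides.

```python
# Gene counts copied from the AaegL1.1 and AgamP3.3 columns of the metrics table.
gene_counts = {
    "AaegL1.1": {"total": 16691, "protein_coding": 15419, "other": 1272},
    "AgamP3.3": {"total": 13765, "protein_coding": 13277, "other": 488},
}


def share(part: int, total: int) -> float:
    """Return part as a percentage of total, rounded to one decimal place."""
    return round(100.0 * part / total, 1)


for name, counts in gene_counts.items():
    # Protein-coding and other genes should add up to the total gene count.
    assert counts["protein_coding"] + counts["other"] == counts["total"]
    print(
        name,
        share(counts["protein_coding"], counts["total"]),
        share(counts["other"], counts["total"]),
    )
```

Running the same helper over every gene set is one simple way to keep the percentages comparable across columns.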
Considerations
• Importance of calculating all metrics using similar
methodology from the same data set
• Metrics calculated from Ensembl using BioMart & raw
SQL queries.
• GO terms - many ways of calculating (InterPro2GO,
projection from Drosophila orthologs)
• No VectorBase capability to automatically assign EC
numbers
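As an illustration of computing a metric with one query applied identically to every data set, here is a minimal sketch using an in-memory SQLite database. The `gene` table, its `biotype` column, the toy rows, and the function `biotype_counts` are assumptions made for this example; the slides do not show the actual BioMart or SQL queries that were used.

```python
import sqlite3

# Toy stand-in for a gene table. The table and column names are assumptions
# for illustration, not a copy of the queries used for the slides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, biotype TEXT)")
conn.executemany(
    "INSERT INTO gene (biotype) VALUES (?)",
    [("protein_coding",)] * 8 + [("tRNA",)] * 2,
)


def biotype_counts(connection):
    """Count genes per biotype and return {biotype: (count, percent)}."""
    rows = connection.execute(
        "SELECT biotype, COUNT(*) FROM gene GROUP BY biotype"
    ).fetchall()
    total = sum(n for _, n in rows)
    return {biotype: (n, round(100.0 * n / total, 1)) for biotype, n in rows}


counts = biotype_counts(conn)
print(counts)
```

Pointing the same function at each species database, rather than hand-counting per species, keeps the methodology consistent across the compared gene sets.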
                     AaegL1.1                                     AgamP3.3
Data type            Available  Integration                       Available  Integration
Sequence             Yes        Download, search, visualization   Yes        Download, search, visualization
Polymorphisms        No         n/a                               Yes        Search, visualization
Genetic maps         Yes        Not integrated                    Yes        Visualization
Syntenic alignment   Yes        Visualization                     Yes        Visualization
cDNAs & ESTs         Yes        Download, search, visualization   Yes        Download, search, visualization
SAGE tags            No         n/a                               No         n/a
Microarrays          Yes        Visualization                     Yes        Visualization
MPSS                 Yes        Not integrated                    No         n/a
Proteomics           No         n/a                               No         n/a
Structures           No         n/a                               No         n/a
Interactome data     No         n/a                               No         n/a
Pathways             No         n/a                               No         n/a
Orthology profiles   Yes        Visualization                     Yes        Visualization
Essentiality data    No         n/a                               No         n/a
VectorBase gene prediction pipeline (SOP)
VB:SOP010
• Blessed predictions (Manual annotations, Community submissions): VB:SOP007
• Similarity predictions: VB:SOP002 & SOP003
• Species-specific predictions: VB:SOP001
• ncRNA predictions: VB:SOP008
• Transcript based predictions: VB:SOP004
• Protein family HMMs: VB:SOP009
• Ab initio gene predictions: VB:SOP005
Canonical gene set
Assignment of SOPs to VectorBase genes: AgamP3.3
SOP                                                 No. genes
VB:SOP001   Confirmed                               674
VB:SOP002   Protein-based with transcript support   3765
VB:SOP003   Protein-based                           4830
VB:SOP004   Transcript-based                        2857
VB:SOP005   Supported ab initio                     585
VB:SOP006   ab initio                               0
VB:SOP007   Manual annotation                       928
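The per-SOP counts can be summed and turned into shares as a quick consistency check. This is a minimal sketch; the variable names are illustrative, and the shares are derived here rather than taken from the slides. The total works out to 13,639, which matches the protein-coding transcript count listed for AgamP3.3 in the metrics table.

```python
# Counts copied from the SOP assignment table for AgamP3.3.
sop_gene_counts = {
    "VB:SOP001": 674,   # Confirmed
    "VB:SOP002": 3765,  # Protein-based with transcript support
    "VB:SOP003": 4830,  # Protein-based
    "VB:SOP004": 2857,  # Transcript-based
    "VB:SOP005": 585,   # Supported ab initio
    "VB:SOP006": 0,     # ab initio
    "VB:SOP007": 928,   # Manual annotation
}

total = sum(sop_gene_counts.values())

# Share of the total assigned to each SOP, as a percentage to one decimal place.
shares = {sop: round(100.0 * n / total, 1) for sop, n in sop_gene_counts.items()}

print(total)
for sop, pct in shares.items():
    print(sop, sop_gene_counts[sop], pct)
```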
Display of Metrics & SOPs
• Metrics
– VectorBase wiki
– Species-page containing the three tables available from the
VectorBase species homepage
– Expansion of documents relating to genomic resources (citations,
links to primary data where possible)
– Single collated table for BRC as separate download
• SOPs
– VectorBase wiki
– ‘Documents’ section of main site
Manual annotation progress
                               Protein-coding gene No.   VectorBase manual   Community submission
Anopheles gambiae AgamP3.3     13,277
  current                                                261 (2.0 %)         667 (5.0 %)
                                                         2474 (18.6 %)       667* (5.0 %)
Aedes aegypti AaegL1.1         15,419
  current                                                0 (0.0 %)           0 (0.0 %)
                                                         0 (0.0 %)           341 (2.2 %)
Merging gene sets
• Gene set #1, Gene set #2
• Reduce to single predictions per locus
• Compare exon/intron structures
– Identical structures
– Compatible structures
– Different structures
– Merge/Split structures
– Complex
– No Map
• Add isoform predictions based on EST/Peptide data
• Canonical gene set
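The structure comparison step can be sketched for a single pair of predictions. The slides do not define the categories, so the rules below are assumptions chosen for illustration: "no map" means the predictions do not overlap, "identical" means the same exon list, "compatible" means one intron set contains the other, and anything else is "different". The Merge/Split and Complex categories involve more than two predictions per locus and are not covered. The names `Exon`, `introns`, and `compare_structures` are hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive coordinates on one strand


def introns(exons: List[Exon]) -> List[Exon]:
    """Return the introns implied by a start-sorted list of exons."""
    return [(a[1] + 1, b[0] - 1) for a, b in zip(exons, exons[1:])]


def compare_structures(a: List[Exon], b: List[Exon]) -> str:
    """Classify two single-locus predictions under the assumed rules."""
    a, b = sorted(a), sorted(b)
    # No map: the genomic spans of the two predictions do not overlap.
    if a[-1][1] < b[0][0] or b[-1][1] < a[0][0]:
        return "no map"
    # Identical: every exon boundary agrees.
    if a == b:
        return "identical"
    ia, ib = set(introns(a)), set(introns(b))
    # Assumed rule: compatible when one intron set contains the other,
    # for example when only the terminal exon ends differ.
    if ia and ib and (ia <= ib or ib <= ia):
        return "compatible"
    return "different"


print(compare_structures([(1, 100), (200, 300)], [(50, 100), (200, 350)]))
```

A real merge would also need strand handling and the one-to-many cases, which is why this sketch stops at pairwise classification.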