The Consensus CoDing Sequence
(CCDS) Database
Kim D. Pruitt
Mouse Genome Annotation Summit Meeting
March 12-13, 2008
National Center for Biotechnology Information
Why is the CCDS project needed?
The Problem:
Annotation of the genome sequence is essential
– but beware of different interpretations!
• The availability of the human and mouse genome sequence has
had a significant impact on disease and health research.
• Most scientists rely on annotation information when designing,
interpreting, and evaluating research results.
• Inconsistencies in annotation results among the main public
resources hamper use of this important data.
• Researchers may not realize that a different annotation result is
available elsewhere – possibly leading to erroneous or incomplete
interpretations.
CCDS - A collaborative project
• Initiated by the main public annotation/browser groups to address
concerns by the scientific community about inconsistencies in the
human and mouse genome annotation.
• Built by consensus among the collaborating members, which include:
European Bioinformatics Institute (EBI)
National Center for Biotechnology Information (NCBI)
University of California, Santa Cruz (UCSC)
Sanger Institute (WTSI)
What is the CCDS project?
• Project Goals
– identify a core set of protein-coding genes that are consistently
annotated and of high quality
– support convergence toward a standard set of gene annotations
• Scope:
– Human and mouse protein coding regions
• Update frequency
– Variable
– Depends on frequency of genome annotation updates
Process flow – calculating updates
[Flow diagram]
Inputs:
– NCBI annotation: RefSeq (manual) + NCBI (computational)
– Ensembl merged annotation: Havana (manual) + Ensembl (computational)
Steps: Compare CDS (annotation + sequence), then QA
Outcomes:
                Identical      Similar        Novel
Existing CCDS   Retain         Retain         Lost
New match       Out of scope   Out of scope   New CCDS ID
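The comparison step can be sketched as a set intersection over CDS structures. This is a minimal illustration and not the pipeline's actual implementation; the `Cds` type, the `matching_cds` function, and all coordinates below are hypothetical.

```python
from typing import NamedTuple


class Cds(NamedTuple):
    """A coding region: chromosome, strand, and ordered CDS exon intervals."""
    chrom: str
    strand: str
    exons: tuple  # tuple of (start, end) pairs


def matching_cds(ncbi, ensembl):
    """Return the CDS structures annotated identically by both sources."""
    return set(ncbi) & set(ensembl)


# Hypothetical coordinates, for illustration only.
ncbi = [
    Cds("chr5", "+", ((100, 200), (300, 450))),
    Cds("chr5", "+", ((100, 200), (310, 450))),  # differs at one splice site
]
ensembl = [
    Cds("chr5", "+", ((100, 200), (300, 450))),
    Cds("chr5", "-", ((900, 990),)),             # found by one group only
]

shared = matching_cds(ncbi, ensembl)
print(len(shared))
```

Only structures whose coordinates agree exactly survive the intersection; a single shifted splice site is enough to exclude a pair.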
Assessing Quality
CCDS status is conservatively applied:
• Annotated CDS coordinates are identical
• Annotation is of high quality and passes QA tests, or curator review
• Existing CCDS proteins can be flagged for review by the collaborating
members
• Updates and removals are by consensus agreement.
Quality assessment tests include:
– Consensus splice sites ("GY..AG" or "AT..AC")
– Valid start and stop codons with no internal stops
– NMD
– Low complexity
– Repeat-containing
– Insufficient protein homology
– Genome conservation
– Putative pseudogene
QA test results are reviewed by curators
Over-rides are set to retain supported CDSs
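Two of the listed tests are simple enough to sketch in code. The functions below are illustrative only and are not the project's QA code; the function names are hypothetical, and treating ATG as the only valid start codon is a simplifying assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_valid_orf(cds_seq: str) -> bool:
    """Check for an ATG start, a terminal stop, and no internal stops.

    Assumption: only ATG counts as a valid start codon.
    """
    seq = cds_seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


def has_consensus_splice_sites(intron_seq: str) -> bool:
    """Accept introns of the form GY..AG (GT or GC donor) or AT..AC."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        return False
    donor, acceptor = seq[:2], seq[-2:]
    if donor in {"GT", "GC"} and acceptor == "AG":
        return True
    return donor == "AT" and acceptor == "AC"


print(has_valid_orf("ATGGCCTAA"))
print(has_consensus_splice_sites("GTAAGTTTTCAG"))
```

A failed test here would only flag the CDS for curator review, in line with the over-ride behavior described above.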
CCDS Counts
Date     Build    CDS IDs   GeneIDs
Mar-05   Hs35.1   14,795    13,142
Feb-07   Hs36.2   18,290    16,008
Oct-06   Mm36.1   13,374    13,014
Nov-07   Mm37.1   17,707    16,893

Step                       Source    Genes    Proteins
Annotation                 NCBI      24765    26851
Annotation                 Ensembl   27209    39941
Matching CDS                         18185    19048
QA & curation rejections             1331     1350
Accepted rejections                  1292     1341
Final CCDS ID                        16893    17707
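The step counts are consistent with a simple relationship: final CCDS count = matching CDS count minus accepted rejections. The check below uses only the numbers from the slide, reading the first number column as genes and the second as proteins; the variable names are illustrative.

```python
# Counts taken from the slide (mouse build 37.1 step table).
matching = {"genes": 18185, "proteins": 19048}
accepted_rejections = {"genes": 1292, "proteins": 1341}
final_ccds = {"genes": 16893, "proteins": 17707}

for key in matching:
    remaining = matching[key] - accepted_rejections[key]
    print(key, remaining, remaining == final_ccds[key])
```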
Curation – how are updates curated and coordinated?
• Any member of the collaboration can flag a CCDS for review
– Update the CDS definition (alter N-terminus extent, internal splice site)
– Withdraw the CCDS ID (insufficiently supported, or non-protein coding)
• NCBI provides a collaboration web site to coordinate this review
• All collaborators must agree with a change to finalize a decision
• Withdrawal of a CCDS may happen between genome annotation updates
• An update to a CCDS is indicated by:
– Status change: a status of ‘pending update’ is reported when there is
collaborative agreement that a change is needed
– Version change: the CCDS version number is incremented once the change
is reflected in public annotation. This only occurs after a genome annotation
update and CCDS analysis have taken place.
CCDS curation is fully integrated with RefSeq curation
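The two-stage update signal (status first, version later) can be modeled as a small state change. This is a sketch under stated assumptions: the `CcdsRecord` class, its method names, the "public" status label, and the accession `CCDS00000` are all hypothetical and do not describe the actual database schema.

```python
from dataclasses import dataclass


@dataclass
class CcdsRecord:
    accession: str          # hypothetical accession, e.g. "CCDS00000"
    version: int = 1
    status: str = "public"  # hypothetical status label

    @property
    def ccds_id(self) -> str:
        return f"{self.accession}.{self.version}"

    def flag_update(self) -> None:
        """Collaborators agree a change is needed: status changes, version does not."""
        self.status = "pending update"

    def publish_update(self) -> None:
        """The change is reflected in public annotation after a genome annotation update."""
        if self.status != "pending update":
            raise ValueError("no agreed update to publish")
        self.version += 1
        self.status = "public"


record = CcdsRecord("CCDS00000")
record.flag_update()
id_while_pending = record.ccds_id   # version unchanged while pending
record.publish_update()
print(id_while_pending, record.ccds_id)
```

The point of the separation is that the ID a user sees stays stable until the changed CDS actually appears in public annotation.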
CCDS update & curation stats
Curation-based changes:
name    action     status
human   update     pending
human   update     agreed
human   withdraw   pending
human   withdraw   agreed
mouse   update     pending
mouse   update     agreed
mouse   withdraw   pending
mouse   withdraw   agreed
count (values as listed; row assignment not recoverable): 366, 923, 557, 189, 709, 519, 185, 242, 57, 16, 24, 8
Mouse: ~5200 curated CCDS genes
Annotation pipeline-based changes:
name    build   status                               count
human   35.1    Withdrawn, inconsistent annotation   133
human   36.2    Withdrawn, inconsistent annotation   29
mouse   36.1    Withdrawn, inconsistent annotation   29
mouse   37.1    Withdrawn, inconsistent annotation   4
Curation considerations
• Alignments
• Track low quality sequences (‘kill list’)
• Protein conservation
• Publications
• Personal communications
• QA measures
Access – How do I know if an annotation has a CCDS ID?
• Genome browser displays
– NCBI
– UCSC
• Gene reports
– Ensembl
– NCBI
– UCSC
– Vega
• Other:
– RefSeq annotation (NCBI)
– CCDS web site
– FTP
http://www.ncbi.nlm.nih.gov/CCDS/
NCBI Map Viewer (chr.5)
Link to
CCDS Browser
UCSC Browser
chr5:30270000-30650000
UCSC Browser – Tyms gene
CCDS Browser
Access of CCDS data at NCBI
• CCDS Database & Browser interface
• Project Description
• Query support
• Reports attributes of the CCDS
• Location data
• Sequence members
• Status
• FTP reports
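A tab-delimited report can be indexed by gene to answer "does this gene have a CCDS ID?". The sketch below parses an in-memory sample and does not fetch anything; the three-column layout, the column names, and the "Public" status value are assumptions made for illustration, so check the actual FTP report header before relying on them. The two IDs come from the Chkb/Cpt1b example in this talk.

```python
import csv
import io

# Hypothetical, simplified report layout (assumed column names).
SAMPLE_REPORT = (
    "gene\tccds_id\tccds_status\n"
    "Chkb\tCCDS27750.1\tPublic\n"
    "Cpt1b\tCCDS27749.1\tPublic\n"
)


def ccds_ids_by_gene(report_text):
    """Map each gene symbol to the list of CCDS IDs reported for it."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    index = {}
    for row in reader:
        index.setdefault(row["gene"], []).append(row["ccds_id"])
    return index


index = ccds_ids_by_gene(SAMPLE_REPORT)
print(index.get("Chkb", []))
print(index.get("Chkb-cpt1b", []))  # non-coding read-through locus: no CCDS ID
```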
CCDS Browser
History
Find all CCDSs for the Gene
Entrez Gene
View CCDS Details
CCDS Browser
• Mouse-over highlights codon
• Click to highlight codon and corresponding amino acid
Biology is complex – some CCDS curation examples
• 1 vs 2 vs ‘n’ genes
• translation start site
1 vs. 2 vs. ‘n’ genes
Curation Considerations:
– Nomenclature
– History (scientific use, publications, etc.)
– Different (but similar) products vs. distinct products
– Shared promoters
carnitine palmitoyltransferase 1b,
choline kinase beta
1 vs. 2 vs. ‘n’ genes
Current RefSeq representation of the region
- two protein coding loci
- one non-coding locus for the non-coding
transcript product (a read-through transcript)
Chkb (CCDS27750.1)
Cpt1b (CCDS27749.1)
Chkb-cpt1b (PMID:12761301)
Translation start site
• Curation Considerations
– Publication reports (CDS begins at ‘n’)
– Other cDNA sequencing reveals the ORF can be extended
further upstream
– Evaluate:
• Genome conservation
• Literature reports for the protein
• Putative Kozak signals
• Presence of in-frame upstream stop codon
• INSDC submissions from an experimental lab source that do
have the longer ORF extent annotated.
• Consult with an expert
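One of the checks above, the presence of an in-frame upstream stop codon, is easy to sketch. If a stop is found in frame upstream of the annotated start, the ORF cannot be extended past it in that transcript; if none is found, the frame stays open to the 5' end of the sequence and the other evidence has to decide. The function name and the toy sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_inframe_stop(mrna: str, start: int):
    """Return the 0-based index of the nearest in-frame stop codon upstream
    of the start codon at `start`, or None if the frame is open to the 5' end."""
    seq = mrna.upper()
    pos = start - 3
    while pos >= 0:
        if seq[pos:pos + 3] in STOP_CODONS:
            return pos
        pos -= 3
    return None


# Toy sequences, for illustration only.
closed = "CCTAGGCCATGAAA"   # ATG at index 8, in-frame TAG at index 2
opened = "GCCGCCATGAAA"     # ATG at index 6, no in-frame upstream stop

print(upstream_inframe_stop(closed, closed.index("ATG")))
print(upstream_inframe_stop(opened, opened.index("ATG")))
```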
Internal CCDS browser (restricted access)
Jmjd2d jumonji domain containing 2D (chr 19)
Update is agreed on by all parties,
resulting in a 258 aa N-terminal extension
Examples – no CCDS ID
EBI+WTSI and NCBI transcript annotation
may differ even though the gene includes
annotations with CCDS IDs
Examples – no CCDS ID
[Figure panels comparing EBI/WTSI and NCBI annotation tracks]
Reasons:
• not found by one group
• different CDS length
• different splice sites
• different internal exon
• curation removal
Acknowledgements
Donna Maglott
Josh Cherry
Keith Oxenride
Craig Wallin
Andrei Shkeda
Collaborators at Ensembl, UCSC, Vega
Jen Ashurst & Vega curator group
Rachel Harte
Mark Diekhans
Steve Searle
RefSeq Curators
NCBI Genome Annotation Group
NCBI Map Viewer Group
Ensembl – Tyms gene
Vega browser Tyms gene (chromosome 5 30388989-30404404)