Download Slides - Celebrating the 20th anniversary of Swiss-Prot

The Yoyo Has Stopped
Reviewing the Evidence for a Low Basal Human Protein Number

In Silico Analysis of Proteins: Celebrating the 20th Anniversary of Swiss-Prot
Fortaleza, Brazil, August 2006

Christopher Southan
Molecular Pharmacology, AstraZeneca R&D, Mölndal

Presentation Outline
• The importance of gene number
• Gene definition and detection
• Genome inflation arguments
• Post-completion changes in model eukaryotes
• Ensembl pipeline numbers
• The smORF question
• Completed chromosomes
• International Protein Index
• Novel gene skimming
• Updates
• Conclusions

So Who Cares About Human Protein Coding Gene Number?
• Central to evolutionary questions of gene number expansion vs. protein diversity from alternative splicing and post-translational modifications
• Mammalian gene totals are expected to be similar, but clade-specific genes may be important for speciation
• Accurate ORF delineation is essential for genetic association studies and transcript profiling
• MS-based proteomics needs a complete ORFome for the peptide and protein identification search space
• For Pharma and Biotech the numbers set finite limits for potential drug targets and therapeutic proteins
• The Swiss-Prot Human Proteomics Initiative (HPI) team

Definitions
• The basal (unspliced) protein-coding gene number: “transcriptional units that translate to one or more proteins that share overlapping sequence identity and are products of the same unique genomic locus and strand orientation”
• However, the Guidelines for Human Gene Nomenclature define a gene as: "a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterised by sequence, transcription or homology"
• The increasing complexity of the transcriptome makes the wider definition of “gene” more difficult, e.g.
micro and antisense RNA

Identifying Protein Coding Genes

In silico
• Detection of protein identity in genomic DNA
• Gene prediction with protein similarity support
• Matches with ESTs that include ORFs and/or splice sites
• Cross-species comparisons for orthologous exon detection
• Presence of gene anatomy features, e.g. CpG islands, promoters, transcription start sites, polyadenylation signals
• Absence of pseudogene disablements or repeat elements

In vitro
• Cloning of predicted genes
• Detection of active transcription by Northern blot, RT-PCR or microarray hybridisation
• Loss-of-function approaches
• High-throughput transcript sampling by EST, MPSS or SAGE tags
• Heterologous expression of cDNAs
• Direct verification of protein sequence by Edman sequencing, mass mapping and/or MS/MS sequencing

Historical Arguments and Estimates for High Gene Numbers
• Initial eukaryote (yeast/worm/fly) numbers assumed to be underestimates
• Gene prediction programs have a significant false-negative rate
• The Ensembl gene annotation pipeline is conservative
• Mammalian protein and transcript coverage is incomplete
• Chromosome annotation teams find more genes than automated pipelines
• Selective transcript skimming experiments have revealed new genes
• Extensive mammalian genomic sequence conservation outside known exons
• Postulated large numbers of undetected small proteins (“smORFs” or “dark matter”)
• EST clustering and commercial “gene inflation” claims

Genesweep 2000
Literature estimates

Source                                  Date     Estimate
Liang et al. (ESTs)                     Jun-00   120 000
Wright et al. (transcripts)             Jun-01    70 000
Das et al. (RT-PCR)                     Sep-01    43 000
Xuan et al. (mouse comparison)          Dec-02    40 000
Venter et al. (Celera, upper limit)     Feb-01    38 000
Ewing et al. (ESTs)                     Jun-00    35 000
Collins et al. (chromosome 22)          Dec-02    32 500
Lander et al. (IHGSC, upper limit)      Feb-01    32 000
Roest Crollius et al. (Exofish)         Jun-00    31 000

Model Eukaryotes: No Significant Post-Completion Gene Increases

[Bar chart: gene number by organism and date]

Organism          Date     Gene number
S. pombe          Feb-02     4 824
S. pombe          Nov-05     4 990
S. cerevisiae     May-97     6 257
S. cerevisiae     Nov-05     5 784
D. melanogaster   Mar-00    13 601
D. melanogaster   Nov-05    13 689
C. elegans        Dec-98    19 099
C. elegans        Nov-05    20 059

• S. pombe: 3% increase since 2002
• S. cerevisiae: 8% decrease since 1997
• C. elegans: 5% increase since 1998
• D. melanogaster: 0.2% increase since 2001
Little increase in spite of global functional genomics focus

Human Transcripts: Post-genomic mRNA Growth in UniGene

[Line chart: number of mRNAs (0 to 120 000) against UniGene release (133 to 160); series: mRNA and nr-mRNA]

• Rapid growth in redundant mRNA
• But slow growth in the clustered set: ~9,000 over 2 years, with a plateau at ~28,000
• Includes splice variants and some spurious ORFs

Ensembl Human Gene Number

[Chart: Ensembl human gene number (0 to 30 000) by release date, March 2001 to November 2004]

• Only 22,218 genes, a decrease of 1,826 over 4 years
• Knowns: from 90% to 95%
• Novel genes: from 12,398 to 2,263
• Exons per gene: from 6.5 to 9.6
• Alternative splicing: from 3,669 to 8,078

Addressing the smORF Question: Protein Size Distributions in Human SPTr

[Three histograms: number of proteins per length bin (1-100, 100-200, ..., 900-1000, >1000 aa)]
Pre Oct-01: 6.3% < 100 aa
Post Oct-01: 5.5% < 100 aa
“Novel” in title: 3.4% < 100 aa

Summarising the smORF Question
• The “triple postulate”, i.e.
a combination of gene prediction failure, no homology and absence of transcription data, seems unlikely
• No database evidence for increased smORF discovery in mammals
• The observation that only ~1% of mouse genes have no detectable human homology contradicts the idea of large order-specific gene expansion in mammals
• Although small proteins evolve more rapidly, there is no precedent for complete loss of ortholog similarity signal
• Those much shorter than 100 residues will fall below the threshold necessary to fold into the domain structures necessary for biological function
• No evidence for de novo gene “invention” in higher eukaryotes

Release History of the International Protein Index: Only Slow Increases in the Non-redundant Protein Sets

[Chart: IPI entries by release; labelled value: 56537 entries]

Experimental Transcript Skimming as Evidence for High Protein Numbers
• Exon arrays (Dunham et al. 1999)
• Gene arrays (Penn et al. 2000)
• RT-PCR (Das et al. 2001)
• SAGE tags (Saha et al. 2002, Chen et al. 2002)
• Oligo tiling from chromosomes 21 and 22 (Kapranov et al. 2002, Kampa et al. 2004)
• It is necessary to submit a full-length ORF with the features of gene anatomy to the public databases before the discovery of novel proteins can be claimed; none of these publications submitted any
• There is increasing evidence for significant amounts of non-ORF transcription in human and mouse

Gene Numbers for Individual Completed Chromosomes
• Averaging the completed chromosomes exceeds Ensembl genes by ~12%
• Extrapolates to ~25,000 genes without “novel transcripts” or “putatives”
• Extrapolates to ~28,000 genes without “putatives”
• Extrapolates to ~31,000 genes with “putatives”
• The chromosome reports were made at different times using different assemblies and different grades of gene definition and evidence support (e.g. different results for chromosome 7)
• Difficult to explicitly cross-map VEGA vs.
Ensembl chromosome gene numbers
• Future status of novel transcripts and putative genes unclear; most will be non-coding

Disappearing Novelty

Name                             Accession and date     Ens 31           NCBI 31    EST  Earlier sequence
Testicular OR-4 associated       AY101377 01-MAR-2003   _                XP_072027  √    _
4.10 gene                        AY1377 03-MAR-2003     ENSG00000172159  XP_171208  √    BC041376 24-DEC-2002
Zygote arrest-1                  AY191416 22-JAN-2003   _                _          √    _
SBF2                             AY234241 20-MAR-2003   ENSG00000133812  XP_049218  √    AB051553 07-FEB-2001
CAGE-1                           AF414185 27-FEB-2003   ENSG00000164304  NP_786887  √    BC026194 09-APR-2002
Diabetes related ankyrin repeat  AF492401 18-JAN-2003   ENSG00000163126  _          √    AK092564 15-JUL-2002
Ligand-gated channel subunit     AF512521 12-JAN-2003   _                _          _    _
Taxilin                          AF516206 17-FEB-2003   ENSG00000084652  NP_787048  √    L15344 25-MAY-1995
HGAL-IL4 inducible               AF521911 14-JAN-2003   ENSG00000174500  NP_689998  √    BC030506 20-MAY-2002
Zinc finger ZZaPK                AY184389 28-JAN-2003   ENSG00000075407  XP_166119  √    AF063599 02-JAN-2001
Dymeclin                         BK000950 26-FEB-2003   ENSG00000141627  NP_060123  √    AK091256 15-JUL-2002

• EMBL hum cds March 2003 = 1491
• Plus “novel” = 159
• Plus PubMed 2003 = 120
• Novel in title = 11
• Previous cds = 8
• Novel genes = 2
• Now both in RefSeq and Ens 18.34

Human Proteome Sampling by MS/MS Identification: A Paucity of Novel Genes
• 3778 from plasma (Muthusamy et al. 2005)
• 2486 from liver cells (Yan et al. 2006)
• 615 from the human heart mitochondria (Taylor et al. 2003)
• 500 from breast cancer cell membranes (Adams et al. 2003)
• 491 from microsomal fractions (Han et al. 2001)
• 311 from the spliceosome (Rappsilber et al. 2002)
• No verifiable data on gene prediction confirmation
• One novel gene was reported from a genome-only peptide match by Kuster et al. in 2001, but this appeared from a high-throughput project later in the same year (Tr Q96DA0)
• While there is no evidence of novel protein discovery, there is a caveat on the available search space

Conclusions
• The model eukaryotes have shown no significant post-genomic rises in gene number
• The Ensembl gene number has been essentially flat since 2001
• There is a set of ~2,000 predicted genes still eluding experimental verification, or perhaps not real?
• Putative genes from curated chromosomes could raise protein numbers, but the status of this class of transcripts is in doubt
• Early over-estimates explicable by non-ORF transcription
• Post-genomic transcript coverage is predominantly re-sampling known genes
• Database submissions of novel human genes have slowed to a trickle
• No evidence for large numbers of cryptic smORFs
• Proteomics has not revealed new proteins
Maybe Swiss-Prot can pop the champagne corks when HPI hits 20,000?

Updates
• October 2004 Nature paper on the finished human genome: “20–25,000 protein-coding genes”
• December 2005 Nature paper: “The dog gene count (19,300) is substantially lower than the 22,000-gene models in the current human gene catalogue (EnsEMBL build 26). For many predicted human genes, we find no convincing evidence of a corresponding dog gene. Much of the excess in the human gene count is attributable to spurious gene predictions in the human genome (M. Clamp, personal communication).”
• March 2006 Ensembl: 23,701
• June 2006 Swiss-Prot HPI: 14,445

Acknowledgments and Reference
• Paul Kersey of the EBI for IPI figures
• Lucas Wagner of the NCBI for the retrospective UniGene data
• Numerous other people at NCBI, EBI, Swiss-Prot and Sanger Centre who graciously answered queries on their data collections
• The Oxford Glycosciences Proteome Discovery Team

Southan C. Has the Yo-yo stopped? An assessment of human protein-coding gene number (2004) Proteomics (6):1712-26. PMID: 15174140
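
The percentage changes quoted on the model eukaryote slide can be rechecked from the gene counts printed on its chart. The sketch below is illustrative Python and not part of the talk. The pairing of each count with an organism and date is read off the chart labels and is an assumption, and the names `GENE_COUNTS`, `percent_change` and `summarise` are hypothetical helpers introduced only for this example.

```python
# Gene counts as read from the model eukaryote chart: (earlier count, Nov-05 count).
# ASSUMPTION: the value-to-organism pairing is inferred from the chart labels.
GENE_COUNTS = {
    "S. pombe (Feb-02)": (4824, 4990),
    "S. cerevisiae (May-97)": (6257, 5784),
    "D. melanogaster (Mar-00)": (13601, 13689),
    "C. elegans (Dec-98)": (19099, 20059),
}


def percent_change(old: int, new: int) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


def summarise(counts: dict) -> dict:
    """Map each organism label to its percentage change, rounded to one decimal."""
    return {
        name: round(percent_change(old, new), 1)
        for name, (old, new) in counts.items()
    }


if __name__ == "__main__":
    for name, change in summarise(GENE_COUNTS).items():
        print(f"{name}: {change:+.1f}% to Nov-05")
```

Under these assumptions the yeast and worm figures round to the 3%, 8% and 5% quoted on the slide. The D. melanogaster value computed here uses the Mar-00 chart count, whereas the slide quotes its 0.2% against a 2001 baseline that the chart does not show, so the two numbers are not directly comparable.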
