Download Slides - Celebrating the 20th anniversary of Swiss-Prot

The Yoyo Has Stopped
Reviewing the Evidence for a Low Basal Human Protein Number
In Silico Analysis of Proteins: Celebrating the 20th Anniversary of Swiss-Prot
Fortaleza, Brazil, August 2006
Christopher Southan
Molecular Pharmacology, AstraZeneca R&D, Mölndal
Presentation Outline
• The importance of gene number
• Gene definition and detection
• Genome inflation arguments
• Post-completion changes in model eukaryotes
• Ensembl pipeline numbers
• The smORF question
• Completed chromosomes
• International Protein Index
• Novel gene skimming
• Updates
• Conclusions
So Who Cares About Human Protein Coding Gene Number?
• Central to evolutionary questions of gene number expansion vs.
protein diversity from alternative splicing and post-translational
modifications
• Mammalian gene totals expected to be similar but clade-specific
genes may be important for speciation
• Accurate ORF delineation essential for genetic association studies
and transcript profiling
• MS-based proteomics needs a complete ORFome for the peptide
and protein identification search space
• For Pharma and Biotech the numbers set finite limits for potential
drug targets and therapeutic proteins
• The Swiss-Prot Human Proteomics Initiative (HPI) team
Definitions
• The basal (unspliced) protein-coding gene number:
“transcriptional units that translate to one or more proteins that
share overlapping sequence identity and are products of the
same unique genomic locus and strand orientation”
• However, the Guidelines for Human Gene Nomenclature define
a gene as: "a DNA segment that contributes to
phenotype/function. In the absence of demonstrated function a
gene may be characterised by sequence, transcription or
homology"
• The increasing complexity of the transcriptome makes the wider
definition of “gene” more difficult, e.g. micro and antisense RNA
Identifying Protein Coding Genes
In silico
• Detection of protein identity in genomic DNA
• Gene prediction with protein similarity support
• Matches with ESTs that include ORFs and/or splice sites
• Cross-species comparisons for orthologous exon detection
• Presence of gene anatomy features e.g. CpG islands, promoters, transcription start sites, polyadenylation signals
• Absence of pseudogene disablements or repeat elements
In vitro
• Cloning of predicted genes
• Detection of active transcription by Northern blot, RT-PCR or microarray hybridisation
• Loss-of-function approaches
• High-throughput transcript sampling by EST, MPSS or SAGE tags
• Heterologous expression of cDNAs
• Direct verification of protein sequence by Edman sequencing, mass mapping and/or MS/MS sequencing
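The in silico criteria above all depend on being able to delineate an open reading frame. As a rough illustration only, the sketch below scans the three forward frames of a DNA string for ATG-to-stop ORFs above a minimum length. It is not the Ensembl pipeline or any method described in the talk: the function name `find_orfs`, the `min_codons` parameter and the toy sequence are assumptions made for this example, and real gene prediction also has to handle introns, the reverse strand and supporting evidence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) for forward-strand ORFs.

    An ORF here runs from the first ATG in a frame to the next in-frame
    stop codon; min_codons counts codons before the stop (ATG included).
    Coordinates are 0-based, end-exclusive and include the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i  # open a candidate ORF
            elif codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close it and look for the next ATG
    return orfs

# Toy sequence (hypothetical): ATG + 4 codons + stop, offset into frame 2.
toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
print(find_orfs(toy, min_codons=5))
```

With the default `min_codons=100`, anything shorter than 100 residues is discarded, which is the kind of length cut-off that makes small ORFs hard to detect by prediction alone.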
Historical Arguments and Estimates for High Gene Numbers
• Initial eukaryote (yeast/worm/fly) numbers assumed to be underestimates
• Gene prediction programs have a significant false-negative rate
• The Ensembl gene annotation pipeline is conservative
• Mammalian protein and transcript coverage is incomplete
• Chromosome annotation teams find more genes than automated pipelines
• Selective transcript skimming experiments have revealed new genes
• Extensive mammalian genomic sequence conservation outside known exons
• Postulated large numbers of undetected small proteins (“smORFs” or “dark matter”)
• EST clustering and commercial “gene inflation” claims
Genesweep 2000
Literature estimates of human gene number:
• Liang et al. (ESTs), Jun-00: 120 000
• Wright et al. (transcripts), Jun-01: 70 000
• Das et al. (RT-PCR), Sep-01: 43 000
• Xuan et al. (mouse comparison), Dec-02: 40 000
• Venter et al. (Celera, upper limit), Feb-01: 38 000
• Ewing et al. (ESTs), Jun-00: 35 000
• Collins et al. (chromosome 22), Dec-02: 32 500
• Lander et al. (IHGSC, upper limit), Feb-01: 32 000
• Roest Crollius et al. (Exofish), Jun-00: 31 000
Model Eukaryotes:
No Significant Post-Completion Gene Increases
[Figure: bar chart of gene number (axis 0 to 25 000) by organism, at genome completion and at Nov-05. S.pombe: 4824 (Feb-02), 4990 (Nov-05). S.cerevisiae: 6257 (May-97), 5784 (Nov-05). D.melanogaster: 13 601 (Mar-00), 13 689 (Nov-05). C.elegans: 19 099 (Dec-98), 20 059 (Nov-05).]
• S.pombe: 3% increase since 2002
• S.cerevisiae: 8% decrease since 1997
• C.elegans: 5% increase since 1998
• D.melanogaster: 0.2% increase since 2001
Little increase in spite of global functional genomics focus
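The percentages above can be roughly re-derived from the bar labels on this slide. The sketch below is an assumption-laden check rather than the author's calculation: the pairing of each count with an organism and date is reconstructed from the chart labels, and `GENE_COUNTS` and `pct_change` are names made up for the example.

```python
# Gene counts read from the chart labels (first release, Nov-05 release).
# The pairing of values to organisms is an assumption of this sketch.
GENE_COUNTS = {
    "S.pombe": (4824, 4990),
    "S.cerevisiae": (6257, 5784),
    "C.elegans": (19099, 20059),
    "D.melanogaster": (13601, 13689),
}

def pct_change(first, latest):
    """Percentage change from the first count to the latest count."""
    return 100.0 * (latest - first) / first

for organism, (first, latest) in GENE_COUNTS.items():
    print(f"{organism}: {pct_change(first, latest):+.1f}%")
```

Under these pairings the yeast and worm figures agree with the slide after rounding. The D.melanogaster labels give about +0.6% from the Mar-00 count, while the slide quotes 0.2% since 2001, which presumably uses a later baseline release that is not shown on the chart.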
Human Transcripts:
Post-genomic mRNA Growth in UniGene
[Figure: number of mRNAs (axis 0 to 120 000) by UniGene release, releases 133 to 160; two series, mRNA and nr-mRNA]
• Rapid growth in redundant mRNA
• But slow growth in the clustered set: ~9,000 over 2 years, with a plateau at ~28,000
• Includes splice variants and some spurious ORFs
Ensembl Human Gene Number
[Figure: Ensembl human gene number (axis 0 to 30 000) by release date, Mar-01 to Nov-04]
• Only 22,218 genes, a decrease of 1,826 over 4 years
• Knowns: from 90% to 95%
• Novel genes: from 12,398 to 2,263
• Exons per gene: from 6.5 to 9.6
• Alternative splicing: from 3,669 to 8,078
Addressing the smORF Question:
Protein Size Distributions in Human SPTr
[Figure: three histograms of protein number by length in 100-residue bins (1-100, 100-200, ... 900-1000, >1000)]
• Pre Oct-01: 6.3% < 100 aa
• Post Oct-01: 5.5% < 100 aa
• “Novel” in title: 3.4% < 100 aa
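A size distribution of this kind can be sketched as follows. This is a minimal illustration and not the analysis behind the slide: the function names and the `sample_lengths` list are hypothetical, and a real run would take protein lengths parsed from a Swiss-Prot/TrEMBL release.

```python
from collections import Counter

def length_bins(lengths, width=100, cap=1000):
    """Count proteins per length bin: 1-100, 101-200, ..., then >cap."""
    counts = Counter()
    for n in lengths:
        if n > cap:
            counts[f">{cap}"] += 1
        else:
            lo = ((n - 1) // width) * width
            counts[f"{lo + 1}-{lo + width}"] += 1
    return counts

def percent_below(lengths, threshold=100):
    """Percentage of proteins strictly shorter than threshold residues."""
    return 100.0 * sum(1 for n in lengths if n < threshold) / len(lengths)

# Hypothetical lengths in residues, for illustration only.
sample_lengths = [50, 99, 100, 150, 450, 1200, 999, 1000, 85, 310]
print(length_bins(sample_lengths))
print(percent_below(sample_lengths))
```

Comparing `percent_below` for entries dated before and after a cut-off is one way to test whether newer submissions are enriched in small proteins, which is the comparison the three panels make.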
Summarising the smORF Question
• The “triple postulate”, i.e. a combination of gene prediction failure, no homology and absence of transcription data, seems unlikely
• No database evidence for increased smORF discovery in mammals
• The observation that only ~1% of mouse genes have no detectable human homology contradicts the idea of large order-specific gene expansion in mammals
• Although small proteins evolve more rapidly, there is no precedent for complete loss of ortholog similarity signal
• Those much shorter than 100 residues will fall below the threshold needed to fold into the domain structures necessary for biological function
• No evidence for de novo gene “invention” in higher eukaryotes
Release History of the International Protein Index:
Only Slow Increases in the Non-redundant Protein Sets
56537 Entries
Experimental Transcript Skimming as Evidence for High
Protein Numbers
• Exon arrays (Dunham et al. 1999)
• Gene arrays (Penn et al. 2000)
• RT-PCR (Das et al. 2001)
• SAGE tags (Saha et al. 2002, Chen et al. 2002)
• Oligo tiling of chromosomes 21 and 22 (Kapranov et al. 2002, Kampa et al. 2004)
• A full-length ORF with the features of gene anatomy must be submitted to the public databases before the discovery of novel proteins can be claimed; none of these publications submitted any
• There is increasing evidence for significant amounts of non-ORF transcription in human and mouse
Gene Numbers for Individual Completed Chromosomes
• The average over the completed chromosomes exceeds the Ensembl gene count by ~12%
• Extrapolates to ~25,000 genes without “novel transcripts” or “putatives”
• Extrapolates to ~28,000 genes without “putatives”
• Extrapolates to ~31,000 genes with “putatives”
• The chromosome reports were made at different times using different assemblies and different grades of gene definition and evidence support (e.g. different results for chromosome 7)
• Difficult to explicitly cross-map VEGA vs. Ensembl chromosome gene numbers
• Future status of novel transcripts and putative genes unclear; most will be non-coding
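The first extrapolation can be checked with simple scaling. The sketch below assumes that the baseline is the Ensembl total of 22,218 genes quoted on the earlier Ensembl slide, and that the extrapolation is a plain multiplication by the average excess. Neither assumption is stated on this slide, and the function names are made up for the example.

```python
ENSEMBL_GENES = 22_218  # assumed baseline, from the Ensembl slide

def extrapolate(base, excess):
    """Scale a baseline gene count by a fractional excess (0.12 = 12%)."""
    return base * (1 + excess)

def implied_excess(base, target):
    """Fractional excess over the baseline needed to reach a target count."""
    return target / base - 1

# ~12% excess over Ensembl gives a figure close to the quoted ~25,000.
print(round(extrapolate(ENSEMBL_GENES, 0.12)))

# Excess implied by the two higher extrapolations under the same baseline.
for target in (28_000, 31_000):
    print(target, f"{implied_excess(ENSEMBL_GENES, target):.0%}")
```

Under these assumptions the 12% figure lands just under 25,000, and the ~28,000 and ~31,000 totals would correspond to roughly 26% and 40% above the Ensembl count.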
Disappearing Novelty
Name | Accession and date | Ens 31 | NCBI 31 | EST | Earlier sequence
Testicular OR-4 associated | AY101377, 01-MAR-2003 | _ | XP_072027 | √ | _
4.10 gene | AY1377, 03-MAR-2003 | ENSG00000172159 | XP_171208 | √ | BC041376, 24-DEC-2002
Zygote arrest-1 | AY191416, 22-JAN-2003 | _ | _ | √ | _
SBF2 | AY234241, 20-MAR-2003 | ENSG00000133812 | XP_049218 | √ | AB051553, 07-FEB-2001
CAGE-1 | AF414185, 27-FEB-2003 | ENSG00000164304 | NP_786887 | √ | BC026194, 09-APR-2002
Diabetes related ankyrin repeat | AF492401, 18-JAN-2003 | ENSG00000163126 | _ | √ | AK092564, 15-JUL-2002
Ligand-gated channel subunit | AF512521, 12-JAN-2003 | _ | _ | _ | _
Taxilin | AF516206, 17-FEB-2003 | ENSG00000084652 | NP_787048 | √ | L15344, 25-MAY-1995
HGAL-IL4 inducible | AF521911, 14-JAN-2003 | ENSG00000174500 | NP_689998 | √ | BC030506, 20-MAY-2002
Zinc finger ZZaPK | AY184389, 28-JAN-2003 | ENSG00000075407 | XP_166119 | √ | AF063599, 02-JAN-2001
Dymeclin | BK000950, 26-FEB-2003 | ENSG00000141627 | NP_060123 | √ | AK091256, 15-JUL-2002
• EMBL hum cds March 2003 = 1491
• Plus “novel” = 159
• Plus PubMed 2003 = 120
• Novel in title = 11
• Previous cds = 8
• Novel genes = 2
• Now both in RefSeq and Ens 18.34
Human Proteome Sampling by MS/MS Identification:
A Paucity of Novel Genes
• 3778 from plasma (Muthusamy et al. 2005)
• 2486 from liver cells (Yan et al. 2006)
• 615 from the human heart mitochondria (Taylor et al. 2003)
• 500 from breast cancer cell membranes (Adams et al. 2003)
• 491 from microsomal fractions (Han et al. 2001)
• 311 from the spliceosome (Rappsilber et al. 2002)
• No verifiable data on gene prediction confirmation
• One novel gene reported from a genome-only peptide match by Kuster et al. in 2001, but this appeared from a high-throughput project later in the same year (Tr Q96DA0)
• While there is no evidence of novel protein discovery, there is a caveat on the available search space
Conclusions
• The model eukaryotes have shown no significant post-genomic rises in gene number
• The Ensembl gene number has been essentially flat since 2001
• There is a set of ~2,000 predicted genes that still elude experimental verification, or may not be real
• Putative genes from curated chromosomes could raise protein numbers but the status of this class of transcripts is in doubt
• Early over-estimates explicable by non-ORF transcription
• Post-genomic transcript coverage is predominantly re-sampling known genes
• Database submissions of novel human genes have slowed to a trickle
• No evidence for large numbers of cryptic smORFs
• Proteomics has not revealed new proteins
Maybe Swiss-Prot can pop the champagne corks when HPI hits 20,000?
Updates
• October 2004 Nature paper on finished human genome “20-25,000 protein-coding genes”
• December 2005 Nature paper “The dog gene count (19,300) is
substantially lower than the 22,000-gene models in the current
human gene catalogue (EnsEMBL build 26). For many predicted
human genes, we find no convincing evidence of a
corresponding dog gene. Much of the excess in the human gene
count is attributable to spurious gene predictions in the human
genome (M. Clamp, personal communication).”
• March 2006 Ensembl 23,701
• June 2006 Swiss-Prot HPI 14,445
Acknowledgments and Reference
• Paul Kersey of the EBI for IPI figures
• Lucas Wagner of the NCBI for the retrospective UniGene data
• Numerous other people at NCBI, EBI, Swiss-Prot and Sanger
Centre who graciously answered queries on their data collections
• The Oxford Glycosciences Proteome Discovery Team
Southan C. Has the Yo-yo stopped? An assessment of human protein-coding gene number (2004) Proteomics (6):1712-26.
PMID: 15174140