
Considerations for multi-omics data integration
Michael Tress, CNIO
GENCODE genome annotation
Predictions:
• Ensembl/GENCODE automatic pipelines,
• HBM and GENCODE RNA-seq data,
• individual large-scale studies.
• Coding potential is determined from similarity to known proteins,
conservation and the presence of Pfam functional domains.
• Some transcripts are annotated as coding or non-coding based
on the balance of probabilities. Good proteomics evidence could
help here.
• A few years ago the human reference genome was missing a number
of coding genes, in part due to gaps in the reference build used for
Ensembl and RefSeq. Now the set of coding genes is probably almost
complete.
We collected peptides from a number of large-scale proteomics resources:
NIST, Kim, Muñoz, Wilhelm, PeptideAtlas, Geiger, Nagaraj, Ezkurdia
We wanted to make sure that we had reliably identified peptides.
The older the ancestral gene, the higher the chance of detecting peptides.
Gene family ages are based on Ensembl Compara.
Genes that appeared since primates are practically not detected!
Ezkurdia, Juan et al, Hum Mol Gen, 2014
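The per-bin detection rate behind a comparison like this can be sketched as follows. This is only an illustration: the function name, the bin labels and the toy gene records are invented here and are not data from the study.

```python
from collections import defaultdict

def detection_rate_by_bin(genes):
    """Return {age_bin: percent of genes in that bin with detected peptides}.

    genes: iterable of (age_bin, has_peptides) pairs.
    """
    totals = defaultdict(int)
    detected = defaultdict(int)
    for age_bin, has_peptides in genes:
        totals[age_bin] += 1
        if has_peptides:
            detected[age_bin] += 1
    return {b: 100.0 * detected[b] / totals[b] for b in totals}

# Hypothetical toy input: (gene family age bin, peptides detected?)
toy_genes = [
    ("Bilateria", True), ("Bilateria", True),
    ("Bilateria", False), ("Bilateria", True),
    ("Vertebrates", True), ("Vertebrates", False),
    ("Primates", False), ("Primates", False),
]

rates = detection_rate_by_bin(toy_genes)
for age_bin, pct in rates.items():
    print(f"{age_bin}: {pct:.1f}% of genes detected")
```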
Genes with no protein features at all (structure, function, etc.) were not detected.
Y-axis: % of genes in each bin detected in proteomics experiments.
We found evidence for just 282 splice events; many were of ancient origin.
Abascal et al, PLoS Comp Biol, 2015
Paralogues | Ancestor
ACSL1, ACSL6 | Jawed vertebrates
ACTN1, ACTN2, ACTN4 | One AS in fruitfly, one in vertebrates.
ATP2B1, ATP2B2, ATP2B3, ATP2B4 | Bilateria
DNM1, DNM2 | Vertebrates
GNAL, GNAS | Jawed vertebrates
ITGA3, ITGA6 | Vertebrates
PDLIM3, LDB3 | Chordates
TPM1, TPM2, TPM3, TPM4 | Vertebrates
• All 60 homologous exons were conserved in jawed vertebrates, e.g. fugu and
zebrafish, which implies that they evolved at least 460 million years ago.
• By comparison, mouse and human conserve fewer than 20% of AS exons.
Most detected alternative isoforms would not break Pfam domains
ISE = isoforms detected with peptide evidence; GENCODE20 is the whole-genome background;
AI genes are all isoforms annotated for the 246 genes with detected alternative isoforms.
Multi-omics considerations
• What does that mean for proteogenomics analyses?
• Most (but not all!) detected novel coding genes/isoforms are likely to
have little evolutionary history and few protein features.
• We find that standard proteomics experiments are less likely to detect
peptides for these regions.
• If many novel regions are identified in the study, quality control is
needed because many will have been identified by less reliable
peptides (semi-tryptic peptides, low-scoring PSMs, poor spectra).
XXX ORFs – no protein features
A recent paper identified many peptides for these new ORFs.
These candidates are short and have no protein features.
Results: More than 200 previously uncharacterized coding regions.
Problem: Peptides were cleaved by trypsin in the experiment, yet more
than 80% of the peptides are semi-tryptic or non-tryptic.
Caveat: that is not to say that these novel regions do not code for proteins,
just that they are not found in standard proteomics experiments.
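A quality-control check of this kind can be sketched by counting how many peptide termini are consistent with trypsin cleavage. The sketch assumes the simplest rule (cleavage after K or R, ignoring the proline exception and missed cleavages); the function names and the toy protein sequence are hypothetical.

```python
def count_tryptic_termini(protein, start, end):
    """Count termini of protein[start:end] that are consistent with trypsin.

    Assumed rule: trypsin cleaves after K or R. The protein's own N- and
    C-termini also count as valid peptide ends.
    """
    # N-terminal side: peptide starts the protein, or follows a K/R.
    n_ok = start == 0 or protein[start - 1] in "KR"
    # C-terminal side: peptide ends the protein, or itself ends in K/R.
    c_ok = end == len(protein) or protein[end - 1] in "KR"
    return int(n_ok) + int(c_ok)

def classify_peptide(protein, peptide):
    """Label a peptide as tryptic, semi-tryptic or non-tryptic."""
    start = protein.find(peptide)  # first occurrence only
    if start < 0:
        raise ValueError("peptide not found in protein")
    n_termini = count_tryptic_termini(protein, start, start + len(peptide))
    return {2: "tryptic", 1: "semi-tryptic", 0: "non-tryptic"}[n_termini]

# Hypothetical toy protein, not a real sequence.
toy_protein = "MAGKTLEERSPDLVKAAGHW"

for pep in ("TLEER", "LEER", "LEE"):
    print(pep, classify_peptide(toy_protein, pep))
```

If most peptides supporting a novel region fall into the last two classes even though the sample was digested with trypsin, that is a signal to inspect the spectra before accepting the region.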
Proteogenomics strategy
Nesvizhskii AI. Proteogenomics: concepts, applications and computational
strategies. Nat Methods. 2014
• Novel peptides identified using proteogenomics should be held to a higher
standard of evidence than known peptides (spectra!).
• It is important to use a multi-stage data analysis strategy.
If you search with a combined database and few modifications you will find
that many pseudogenes appear to express peptides.
Initial searches should first be carried out against known coding genes (with
a range of possible modifications) and possibly known SAVs.
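A minimal sketch of a multi-stage assignment, assuming each search stage has already produced one best hit per spectrum: a spectrum is assigned at the earliest stage where it passes that stage's threshold, and only unexplained spectra move on to the next stage. The stage names, scores, thresholds and peptide strings are all hypothetical, and a real pipeline would use proper FDR control rather than fixed cut-offs.

```python
def staged_assignment(spectrum_ids, stage_results, min_scores):
    """Assign each spectrum at the earliest stage where it passes.

    stage_results: list of (stage_name, {spectrum_id: (peptide, score)}),
                   in search order (known coding genes first).
    min_scores:    {stage_name: minimum accepted score}; later (novel)
                   stages can be given a stricter threshold.
    Returns (assigned, unassigned).
    """
    assigned = {}
    remaining = set(spectrum_ids)
    for stage_name, hits in stage_results:
        for sid in sorted(remaining):
            hit = hits.get(sid)
            if hit is not None and hit[1] >= min_scores[stage_name]:
                assigned[sid] = (stage_name, hit[0])
        # Spectra explained at this stage never reach later stages.
        remaining -= set(assigned)
    return assigned, remaining

# Hypothetical toy search results.
known_hits = {"s1": ("GLSDVTAR", 62), "s2": ("TTPLEVK", 20)}
novel_hits = {"s1": ("GLSDVIAR", 70), "s2": ("AVNDLSR", 45), "s3": ("SSPLTDR", 55)}

assigned, unassigned = staged_assignment(
    ["s1", "s2", "s3", "s4"],
    [("known", known_hits), ("novel", novel_hits)],
    {"known": 30, "novel": 50},  # stricter cut-off for novel peptides
)
print(assigned)
print(sorted(unassigned))
```

In this toy run, spectrum s1 stays with its known peptide even though the novel database offers a higher-scoring match, which is the behaviour the staged strategy is meant to enforce.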
Pseudogene detection - PeptideAtlas
Spectrum matched (incorrectly) to peptide EITALAPSIMK from the putative POTEPK gene. The match is nearly perfect.
The same spectrum matched (probably correctly) to actin peptide EITALAPSTMK with a lysine dimethylation. This peptide is identified 63,000 times in PeptideAtlas.
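To see how two such assignments can be hard to separate by mass, the sketch below compares the single differing residue (I versus T) with the mass added by a lysine dimethylation. The slide does not say which modifications accompanied the POTEPK assignment; the methionine oxidation used here is purely an assumption for illustration. The residue and modification masses are standard monoisotopic values.

```python
# Standard monoisotopic masses in Da.
RESIDUE_MASS = {"I": 113.08406, "T": 101.04768}
DIMETHYL = 28.03130   # +C2H4, e.g. on lysine
OXIDATION = 15.99491  # +O on methionine (ASSUMED here, not stated on the slide)

def residue_delta(pep_a, pep_b):
    """Mass of pep_a minus pep_b from residue substitutions alone.

    Only handles substitutions between residues listed in RESIDUE_MASS.
    """
    if len(pep_a) != len(pep_b):
        raise ValueError("peptides must have equal length")
    return sum(RESIDUE_MASS[a] - RESIDUE_MASS[b]
               for a, b in zip(pep_a, pep_b) if a != b)

# Isoleucine-containing peptide minus threonine-containing peptide.
delta = residue_delta("EITALAPSIMK", "EITALAPSTMK")

# Mass gap between (I peptide + assumed oxidation) and (T peptide + dimethyl).
residual = abs((delta + OXIDATION) - DIMETHYL)

print(f"I vs T substitution: {delta:.5f} Da")
print(f"gap after modifications: {residual:.5f} Da")
```

Under that assumption the two candidates would differ by far less than typical precursor mass tolerances, so only the fragment spectrum and the choice of modifications in the search could separate them.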
Dominant isoforms
We found evidence of AS in just over 1% of human genes, so 98% of protein-coding genes have evidence for just a single isoform.
Can we predict this isoform?
Five methods for selecting a reference isoform
LONGEST: Standard reference isoform in all databases/large-scale experiments
RNASEQ: 5-fold dominant transcripts from HBM data (Gonzalez-Porta et al, Gen. Biol. 2013)
APPRIS: Principal isoforms based on structure, function and conservation (Rodriguez et al, NAR, 2012)
HCI: Highest connected isoforms trained on RNAseq data in Li et al, JPR, 2015
CCDS: Unique CCDS. CCDS variants are consensus between RefSeq and Ensembl/GENCODE
Five means of selecting reference isoforms
We calculated the % agreement between the main proteomics isoform we found
and the five reference methods: the longest sequence, APPRIS principal
isoforms, unique CCDS variants, the dominant RNAseq transcripts and the
Highest Connected Isoforms.
Figure values: 77.7%, 98.6%, 97.8%, 77.2%, 78%
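The agreement calculation can be sketched as below, assuming each method is represented as a mapping from gene to its chosen isoform and that only genes covered by both mappings are compared. The gene and transcript identifiers are invented, and the exact comparison rules used in the study are not given on the slide.

```python
def percent_agreement(main_isoforms, reference_isoforms):
    """Percent of shared genes where both sources pick the same isoform.

    Both arguments map gene ID -> isoform ID. Genes missing from either
    mapping are left out of the comparison.
    """
    shared = [g for g in main_isoforms if g in reference_isoforms]
    if not shared:
        raise ValueError("no genes in common")
    agree = sum(main_isoforms[g] == reference_isoforms[g] for g in shared)
    return 100.0 * agree / len(shared)

# Hypothetical toy data.
main_proteomics = {"G1": "T1", "G2": "T1", "G3": "T2", "G4": "T1"}
reference = {"G1": "T1", "G2": "T1", "G3": "T2", "G4": "T3", "G5": "T1"}

agreement = percent_agreement(main_proteomics, reference)
print(f"{agreement:.1f}% agreement")
```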
For those 3,000+ genes with a main experimental isoform, an
APPRIS principal isoform and a unique CCDS variant, all three
isoforms agreed for over 99% of the genes.
The clear agreement between three orthogonal sources (and
the large number of tissues sampled) suggests that the main
proteomics isoform is the dominant protein isoform in the cell.
Ezkurdia et al, J. Proteome Res, 2015
Indeed, alternative isoforms (non-APPRIS principal isoforms)
“are significantly enriched in amino acid-changing variants,
particularly those that have a strong impact on protein
function”
Liu et al, Molecular BioSystems, 2015