Considerations for multi-omics data integration
Michael Tress CNIO, GENCODE genome annotation

Predictions:
• Ensembl/GENCODE automatic pipelines,
• HBM and GENCODE RNA-seq data,
• individual large-scale studies.

• Coding potential is determined from similarity to known proteins, conservation, and the presence of Pfam functional domains.
• Some transcripts are annotated as coding or non-coding based on the balance of probabilities. Good proteomics evidence could help here.
• A few years ago the human reference genome was missing a number of coding genes, in part due to gaps in the reference build used for Ensembl and RefSeq. Now the set of coding genes is probably almost complete.

We collected peptides from a number of large-scale proteomics resources:
NIST, Kim, Muñoz, Wilhelm, PeptideAtlas, Geiger, Nagaraj, Ezkurdia
We wanted to make sure that we had reliably identified peptides.

The older the ancestral gene, the higher the chance of detecting peptides.
Gene family ages are based on ENSEMBL Compara.
Genes that appeared since primates are practically not detected!
Ezkurdia, Juan et al, Hum Mol Gen, 2014

Genes with no protein features at all (structure, function, etc.) were not detected.
Y-axis: % of genes in each bin detected in proteomics experiments.

We found evidence for just 282 splice events; many were of ancient origin.
Abascal et al, PLoS Comp Biol, 2015

Paralogues                        Ancestor
ACSL1, ACSL6                      Jawed vertebrates
ACTN1, ACTN2, ACTN4               One AS in fruitfly, one in vertebrates
ATP2B1, ATP2B2, ATP2B3, ATP2B4    Bilateria
DNM1, DNM2                        Vertebrates
GNAL, GNAS                        Jawed vertebrates
ITGA3, ITGA6                      Vertebrates
PDLIM3, LDB3                      Chordates
TPM1, TPM2, TPM3, TPM4            Vertebrates

• All 60 homologous exons were conserved in jawed vertebrates, e.g. fugu and zebrafish, which implies that they evolved at least 460 million years ago.
• As a comparison, mouse and human conserve fewer than 20% of AS exons.
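
The detection-by-age plot described above (percentage of genes in each age bin with peptide evidence) can be sketched as a simple tally. This is a minimal illustration, not the analysis used in the cited studies: the function name, the record layout and the toy gene records are all hypothetical.

```python
from collections import defaultdict


def detection_rate_by_age(genes):
    """Return {age_bin: percent of genes detected} for (gene_id, age_bin, detected) records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for _gene_id, age_bin, detected in genes:
        totals[age_bin] += 1
        if detected:
            hits[age_bin] += 1
    # Every bin in `totals` has at least one gene, so the division is safe.
    return {age_bin: 100.0 * hits[age_bin] / totals[age_bin] for age_bin in totals}


# Hypothetical toy records: (gene id, gene family age bin, peptide detected?)
toy_genes = [
    ("geneA", "Bilateria", True),
    ("geneB", "Bilateria", True),
    ("geneC", "Vertebrates", True),
    ("geneD", "Vertebrates", False),
    ("geneE", "Primates", False),
    ("geneF", "Primates", False),
]

rates = detection_rate_by_age(toy_genes)
for age_bin, rate in rates.items():
    print(f"{age_bin}: {rate:.1f}% of genes detected")
```

In a real analysis the age bins would come from a gene-family dating resource such as ENSEMBL Compara and the detection flags from the filtered peptide set; the toy values here only show the shape of the calculation.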
Most detected alternative isoforms would not break Pfam domains
ISE = isoforms detected with peptide evidence. GENCODE20 is the background of the whole genome. AI genes are all isoforms annotated for the 246 genes with detected alternative isoforms.

Multi-omics considerations
• What does that mean for proteogenomics analyses?
• Most (but not all!) detected novel coding genes/isoforms are likely to have little evolutionary history and few protein features.
• We find that standard proteomics experiments are less likely to detect peptides for these regions.
• If many novel regions are identified in a study, quality control is needed because many will have been identified by less reliable peptides (semi-tryptic peptides, low-scoring PSMs, poor spectra).

XXX ORFs: no protein features
A recent paper identified many peptides for these new ORFs. These candidates are short and have no protein features.
Results: More than 200 previously uncharacterized coding regions.
Problem: Peptides were cleaved by trypsin in the experiment, yet more than 80% of the peptides are semi-tryptic or non-tryptic.
Caveat: That is not to say that these novel regions do not code for proteins, just that they are not found in standard proteomics experiments.

Proteogenomics strategy
Nesvizhskii AI. Proteogenomics: concepts, applications and computational strategies. Nat Methods. 2014
• Novel peptides identified using proteogenomics should be held to a higher standard of evidence than known peptides (spectra!).
• It is important to use a multi-stage data analysis strategy.
If you search with a combined database and few modifications you will find that many pseudogenes express peptides.
Initial searches should first be carried out against known coding genes (with a range of possible modifications) and possibly known SAV.

Pseudogene detection - PeptideAtlas
Spectrum matched (incorrectly) to peptide EITALAPSIMK from putative POTEPK gene. The match is nearly perfect.
The same spectrum matched (probably correctly) to actin peptide EITALAPSTMK with a lysine dimethylation. This peptide is identified 63,000 times in PeptideAtlas.

Dominant isoforms
We found evidence of AS in just over 1% of human genes, so more than 98% of protein-coding genes have evidence for just a single isoform.
Can we predict this isoform?

Five methods for selecting a reference isoform
LONGEST: Standard reference isoform in all databases/large-scale experiments
RNASEQ: 5-fold dominant transcripts from HBM data (Gonzalez-Porta et al, Gen. Biol. 2013)
APPRIS: Principal isoforms based on structure, function and conservation (Rodriguez et al, NAR, 2012)
HCI: Highest connected isoforms trained on RNAseq data in Li et al, JPR, 2015
CCDS: Unique CCDS. CCDS variants are consensus between RefSeq and Ensembl/GENCODE

Five means of selecting reference isoforms
We calculated % agreement between the main proteomics isoform we found and the five reference methods: the longest sequence, APPRIS principal isoforms, unique CCDS variants, the dominant RNAseq transcripts and the Highest Connected Isoforms.
[Figure: % agreement per method; values shown: 77.7%, 98.6%, 97.8%, 77.2%, 78%]

For those 3,000+ genes with a main experimental isoform, an APPRIS principal isoform and a unique CCDS variant, all three isoforms agreed for over 99% of the genes.
The clear agreement between three orthogonal sources (and the large number of tissues sampled) suggests that the main proteomics isoform is the dominant protein isoform in the cell.
Ezkurdia et al, J. Proteome Res, 2015

Indeed, alternative isoforms (non-APPRIS principal isoforms) “are significantly enriched in amino acid-changing variants, particularly those that have a strong impact on protein function” (Liu et al, Molecular BioSystems, 2015).
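
The % agreement between the main proteomics isoform and a reference isoform set, as described above, could be computed along these lines. This is a sketch under stated assumptions rather than the method of the cited paper: the function name, the dictionary layout (gene to isoform identifier) and the toy gene and transcript identifiers are hypothetical, and agreement is counted only over genes present in both sets.

```python
def percent_agreement(main_isoform, reference):
    """Percent of shared genes where the reference picks the same isoform as the main proteomics isoform.

    Both arguments map gene id -> isoform id. Returns None if no gene is shared.
    """
    shared = [gene for gene in main_isoform if gene in reference]
    if not shared:
        return None
    agree = sum(1 for gene in shared if main_isoform[gene] == reference[gene])
    return 100.0 * agree / len(shared)


# Hypothetical toy data: identifiers are placeholders, not real annotations.
main = {"g1": "t1", "g2": "t2", "g3": "t3", "g4": "t4"}
references = {
    "LONGEST": {"g1": "t1", "g2": "t9", "g3": "t3", "g4": "t8"},
    "APPRIS": {"g1": "t1", "g2": "t2", "g3": "t3", "g5": "t5"},
}

agreement = {name: percent_agreement(main, ref) for name, ref in references.items()}
for name, value in agreement.items():
    print(f"{name}: {value:.1f}% agreement")
```

Restricting the comparison to shared genes mirrors the statement that the three-way comparison was made only for genes with a main experimental isoform, an APPRIS principal isoform and a unique CCDS variant.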