* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Correlating mRNA and protein Abundance
Endogenous retrovirus wikipedia , lookup
Transcriptional regulation wikipedia , lookup
Signal transduction wikipedia , lookup
Polyadenylation wikipedia , lookup
G protein–coupled receptor wikipedia , lookup
Metalloprotein wikipedia , lookup
Biochemical cascade wikipedia , lookup
Point mutation wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Paracrine signalling wikipedia , lookup
Secreted frizzled-related protein 1 wikipedia , lookup
Ancestral sequence reconstruction wikipedia , lookup
Gene regulatory network wikipedia , lookup
Silencer (genetics) wikipedia , lookup
Magnesium transporter wikipedia , lookup
Protein structure prediction wikipedia , lookup
Gene expression profiling wikipedia , lookup
Bimolecular fluorescence complementation wikipedia , lookup
Western blot wikipedia , lookup
Protein purification wikipedia , lookup
Proteolysis wikipedia , lookup
Interactome wikipedia , lookup
Expression vector wikipedia , lookup
Gene expression wikipedia , lookup
Messenger RNA wikipedia , lookup
Protein–protein interaction wikipedia , lookup
Correlating mRNA and protein abundance via genomic and proteomic characteristics Dov Greenbaum Gerstein Lab Thesis Seminar April 21, 2004 outline Why analyze mRNA and protein correlations Background Disparate Data Sources Correlating mRNA and Protein Results Other analyses Formalism – comparing genome, transcriptome and proteome in terms of broad categories New Data Sets Analysis via Broad Categories Analysis of factors affecting correlations Another reason to expect correlations Expression and Protein Interactions Why Correlate mRNA & Protein? Experiments 5000 4500 4000 3500 3000 2500 2000 1500 1000 500 0 mRNA Protein Both mRNA and Protein Levels are necessary for complete analysis Shown mathematically in Hatzimanikatis et al Biotechnology 1999 Combinations of RNA and protein detection approaches have recently aided in the identification of biomarkers in cancer Hegde et al Current Opinion in Biotech 2003 Relationship between mRNA and Protein levels dPi = k mRNA - k P s;i * i d;i i dt where ks,i and kd,i are the protein synthesis and degradation rate constants, respectively, ks;i * mRNAi At steady state: Pi = kdi Methods for determining mRNA expression Each have Strengths and Weaknesses Methods for determining protein abundance 2DE Gel Electrophoresis – • • • (Klose, 1975; O’Farrell, 1975) Multiple staining options Small dynamic range limited in what it can detect Methods for determining protein abundance ICAT – ICAT reagent-- relative levels – VB dynamic range – Cannot detect posttranslational modifications – it require proteins to contain cysteine residues, & these residues must be in the region of a peptide that is produced during proteolytic cleavage MudPit Really only HT that can detect PT modifications Other Methods for determining protein abundance DIGE – e.g. Cy3 vs cy5 labeling – Very big dynamic range 2D-electrophoresis Tap Tagging Weissman & O’Shea (Oct 2003) Other Methods for determining protein abundance 4000 3000 2000 1000 80000 60000 40000 20000 Max Max Prot Affy Tap MP ICAT DIG 2DE 0 TAP MP ICAT DIG 2DE 0 Same mRNA levels yet protein data varied > 20X N ~100, r = 0.9 Protein Quantification via measurement of radioactivity Gygi et al Molecular and Cellular Biology,1999. Same mRNA levels yet protein data varied > 20X Do some ORFs bias the results? 73 proteins (69%) R = 0.356 mRNA vs Protein r = 0.74 Protein Quantification via image analysis Futcher et al Molecular and Cellular Biology, 1999 Jury is out… Gygi et al: “This study revealed that transcript levels provide little predictive value with respect to the extent of protein expression.” Futcher et al: “there is a good correlation between protein abundance and mRNA abundance for the proteins that we have studied”. mRNA vs Protein Greenbaum et al Bioinformatics 2001 r =0.67 3 Genes in Lung Adenocarcinomas Op18, Annexin IV, and GAPD r = 0.025 Chen et al Molecular & Cellular Proteomics, 2002. murine hematopoietic precursor MPRO change in expression 0 - 72 hr murine hematopoietic precursor MPRO change in expression 0 - 72 hr R = 0.58 ~ 80% of the genes are located in the first and third quadrants Ratios of wt+gal to wt gal ICAT vs microarray N ~ 290, r = 0.6 Ideker et al Science, 2001 Yeast growth under two different media r = 0.45 but almost 1.0 for same loci in same pathway Washburn et al PNAS 2003 Integrating multiple sources of Information The challenge for computational biology is to provide methodologies for transforming high-throughput heterogeneous data sets into biological insights about the underlying mechanisms. Although highthroughput assays provide a global picture, the details are often noisy, hence conclusions should be supported by several types of observations. Integration of data from assays that examine cellular systems from different viewpoints (for instance, gene expression and protein-protein interactions) can lead to a more coherent reconstruction and reduce the effects of noise. Nir Friedman Science 2004 Sources of Data Data set mRNA expression Protein abundance Description Size [ORFs] Reference Young Gene chip profiles yeast cells with mutations that affect transcription 5455 Holstege et al. (1998) Church Gene chip profiles of yeast cells under four different conditions 6263 Roth et al. (1998) Samson Comparing gene chip profiles for yeast cells subjected to alkylating agent 6090 Jelinsky et al. (1998) SAGE Yeast cells during vegetative growth 3778 Velculescu et al. (1997) Reference expression Scaling and integrating the mRNA expression set into one data source 6249 - 2-DE #1 Measurement of yeast protein abundance by two-dimensional (2D) gel electrophoresis and mass spectrometry 156 2-DE #2 Similar to 2-DE set #1 Transposon Large-scale fusions of yeast genes with lacZ by transposon insertion Reference abundance Scaling and integrating the 2-DE data sets into one data source Annotated Localization Subcellular localizations of yeast proteins 2133 (6280) Drawid et al. (2000) Transmem-brane segments Predicted transmembrane and soluble proteins in yeast 2710 (6280) Gerstein (1998) MIPS functions Functional categories for yeast ORFs 3519 (6194) Mewes et al. (2000) GOR secondary structure Predicted secondary structure for yeast ORFs 71 1410 181 Gygi et al. (1999) Futcher et al. (1999) Ross-Macdonald et al. (1999) - Annotation 6280 Gerstein (1998) Reference mRNA Sets Young Church Samson SAGE Fitting Protein Data Original Set mRNA vs Protein Greenbaum et al Bioinformatics 2001 r =0.67 Outliers below trendline above trendline (2STDEV from the mean) ORF YBR118W YER065C YMR303C YOL086C YJR009C YGR192C YJR104C YML054C YJL052W YKR059W YML008C YFL022C YJL008C YPL160W YOR361C YCL030C YNL209W FUNCTION translation elongation factor eEF1 alpha-A chain Isocitrate Lyase Alcohol dehydrogenase II Alcohol dehydrogenase I Glyceraldehyde-3-phosphate dehydrogenase 2 Glyceraldehyde-3-phosphate dehydrogenase 3 Copper-zinc superoxide dismutase lactate dehydrogenase cytochrome b2 glyceraldehyde-3-phosphate dehydrogenase 1 Translation initiation factor S-adenosyl-methionine delta-24-sterol-c-methyltransferase Phenylalanine-- tRNA Ligase beta chain Component of chaperonin-containing T-complex leucine--tRNA ligase translation initiation factor eIF3 subunit phosphoribosyl-AMP cyclohydrolase heat shock protein of HSP70 family High Protein Metabolism (1) Energy(2) MIPS 5,30 1,2, 30 1, 2, 30 1, 2, 30 1, 2, 30 1, 2, 30 11,30 1,2,30 1,2,30 5,30 1,30 5,30 6,30 5,30 3,5,30 1 5,30 Low Protein Prot. Syn. (5) Prot. Fate (6) Later larger datasets concurred with these results in that Generally… 10000000 1000000 protein 100000 10000 1000 100 10 1 0.1 1 10 100 mRNA AA metabolism & Energy are 2X as likely to have high protein vs mRNA than the general population 1000 Protein synthesis (~35% of all protein synthesis genes) and Protein fate (folding, modification, destination) are more likely to have low protein vs mRNA than the general population Non-Outliers Generally… Tight Regulation by the cell Only 3% of transcription associated genes (n = 441) have significantly uncorrelated mRNA and protein levels (2STDEV from trendline) Transcription Assoc. genes are 25% of the essential genes in yeast. Essential Genes as a group have higher correlations than the general yeast population 7% of Cell Cycle associated genes (n = 432) have significant non-correlation Quick Summary • Why correlate mRNA and protein levels? • Merged Disparate Data Sets – Distinct but complimentary • Global Correlations • Outliers are interesting: – Metabolism & Energy Relatively high protein levels – Protein Synthesis & Protein Fate low protein levels Data Set Size ~170 ORFs 2 DE-gel datasets ~6,000 ORFs 5 Affymetrix GeneChips + SAGE data ~6,000 ORFs Enrichments (Feature, [v,S], [w,G]) = (F,[v,S]) -(F,[w,G]) (F,[w,G]) V & W are weights (expression level) of Sets S & G Visual Formalism ~170 ORFs ~6,000 ORFs Depletion of Random Coil Secondary Structure STABILITY Concurrence with data from Perczel et al Chemistry 2003 Regarding stability of specific secondary structures Enrichment of Amino Acids STABILITY Alanine’s, Glycines, Valines result in more compact structures More compact = more stable (i.e. thermophilic enzymes tend to be very compact) Enrichment of Amino Acids Simple story: translatome is enriched in same way as transcriptome Enrichment of Molecular Weights/Biomass Abundant proteins are smaller = reduces cost Effect of transcription yeast cell favors the expression of shorter ORFs over longer ones (as opposed to long lightweight ORFs – see MW of aa) This selection is happening, for the most part at the transcriptome level -------------------------------------------------------------------------------------------------Neg Correlation between ORF length and mRNA expression Jansen & Gerstein 2000 (And to a lesser degree with Protein Abundance) Enrichment of Molecular Weights/Biomass Abundant proteins are smaller = reduces cost Effect of transcription CONCURS with experimental results from Akashi, Genetics 2003 See also: Akashi,Genetics 1996 & Moriyama and Powell, NAR 1998 hypothesize that this trend exists in S. cerevisiae, D. melanogaster and E. coli. (although probably not in C. elegans) Enrichment of Functional Categories 10000000 1000000 protein 100000 10000 1000 100 10 1 0.1 1 10 mRNA 100 1000 Depletion Functional Categories Transcription & Cell Growth Molecular switches Require only minimal expression Enrichment of localization - BIAS? (Drawid & Gerstein. 2000), Review Formalism Different gene sets b/c of limited data Enrichments concur with experimental results Fitting Protein Data Newer Set Mudpit fit first into mRNA space then inverse fit back into protein space then each of the data sets is fit via least squares onto the Aebersold data set Aebersold Aebersold Futcher Reference Yates Gygi mRNA 125 Futcher Reference Yates Gygi mRNA 29 113 102 116 125 73 61 56 64 69 150 143 128 150 1436 785 1346 1504 1480 6250 Fitting Protein Data Newer Set Mudpit fit first into mRNA space then inverse fit back into protein space then each of the data sets is fit via least squares onto the Aebersold data set Aebersold Aebersold Futcher Reference Yates Gygi mRNA 125 Futcher Reference Yates Gygi mRNA 29 113 102 116 125 73 61 56 64 69 150 143 128 150 1436 785 1346 1504 1480 6250 Fitting Protein Data Newer Set Mudpit fit first into mRNA space then inverse fit back into protein space then each of the data sets is fit via least squares onto the Aebersold data set Aebersold Aebersold Futcher Reference Yates Gygi mRNA 125 Futcher Reference Yates Gygi mRNA 29 113 102 116 125 73 61 56 64 69 150 143 128 150 1436 785 1346 1504 1480 6250 Fitting Protein Data Newer Set Mudpit fit first into mRNA space then inverse fit back into protein space then each of the data sets is fit via least squares onto the Aebersold data set Aebersold Aebersold Futcher Reference Yates Gygi mRNA 125 Futcher Reference Yates Gygi mRNA 29 113 102 116 125 73 61 56 64 69 150 143 128 150 1436 785 1346 1504 1480 6250 Fitting Protein Data Newer Set Mudpit fit first into mRNA space then inverse fit back into protein space then each of the data sets is fit via least squares onto the Aebersold data set Aebersold Aebersold Futcher Reference Yates Gygi mRNA 125 Futcher Reference Yates Gygi mRNA 29 113 102 116 125 73 61 56 64 69 150 143 128 150 1436 785 1346 1504 1480 6250 Global Correlation mRNA Set 6249 ORFs Protein Set # 2 2 2DE sets & 2 Mudpit ~2000 ORFs Protein Abundance 1000 100 10 MudPit (1) MudPit (2) 2DE (1) 2DE (2) R = 0.66 1 0.1 0.1 1 10 mRNA Expression 100 1000 Functional Categories Cell Cycle (R=0.71) 1000 Reference Data (R=0.66) Protein Abundance Cell Rescue (R=0.45) 100 Co-regulated proteins 10 1 0.1 0.1 1 10 mRNA Expression 100 Subcellular Localization Mudpit does not have the 2DE biases Subcellular Localization Protein Abundance 100 Lack of correlation in mitochondria Concurs with experimental results from Ohlmeier S et al. JBC 2004 10 Nucleolus (R=0.8) 1 Cell Periphery (R=0.74) Reference Data (R=0.66) Mitochondria (R=0.42) 0.1 0.1 1 10 mRNA Expression 100 Expression as a function of localization is well correlated with protein levels (latest data) Membrane r = 0.73 Bud r =0.76 r global = 0.46 Nucleus r = 0.49 P ER r = 0.61 M Cytoplasm r = 0.50 Cell Wall r =0.52 Mitochondria r = 0.50 Extracellular r = 0.33 Golgi r = 0.28 Endosome r = 0.87 Why would we not find strong correlations? Post translational modifications Protein degradation Error and Bias Ribosomal Occupancy Arava et al. (2003) Proc. Natl. Acad. Sci. USA 1 0.9 0.8 Occupancy CAI Coefficient of Variation Top Top Correlation 0.7 0.6 Top 0.5 0.4 0.3 Bottom Our results concurred with Bottom experimental findings by Brown and Herschlag’s groups: 0.2 0.1 Bottom 0 Ribosomal Occupancy Top Frac. Bot. Frac. 0.78 0.30 Moreover: mRNAs not associated with any polysomes have even less of a correlation r = 0.2 v. strong translational control 1 0.9 Coefficient of Variation Top 0.8 Correlation 0.7 0.6 0.5 0.4 0.3 0.2 Bottom 0.1 0 mRNA Expression Variability Top Frac. Bot. Frac. 0.89 0.20 mRNA expression Variability of mRNA expression time 1 0.9 Coefficient of Variation Top 0.8 Correlation 0.7 0.6 0.5 0.4 0.3 0.2 Bottom 0.1 0 mRNA Expression Variability Top Frac. Bot. Frac. 0.89 0.20 mRNA expression Variability of mRNA expression time Codon Adaptation Index 0.6 CAI 0.5 Top Correlation 0.4 0.3 0.2 0.1 Bottom 0 Codon Usage Top Frac. Bot. Frac. 0.48 0.02 Concurs with experimental data: CAI does not Predict mRNA and protein the same way shown to be the result of different levels of degredation Another summary Newer, larger data set Looking at Broad Catagories I Post translational modifications? where we expect PT control --> low r. Where we don’t expect --> high r Occupancy Variability II Protein Degradation? CAI III Experimental Error? next section Expression and interactions Types of protein-protein interactions – Protein complexes • For example: proteasome, ribosome – Aggregated interactions • Yeast two-hybrid (Y2H) • Genetic/physical interactions from MIPS Relationship of P-P-interactions to abs. expression level Dij Ei E j Ei E similar protein results Protein-Protein Interactions & Expression Correlations Cell Cycle CDC28 expt. (Davis) Sets of interactions (all pairs, control) Pairwise interactions (from MIPS) (Uetz et al.) between selected expression timecourses (strong interactions in permanent complexes, clearly diff.) Protein-Protein Interactions & Expression Correlations Cell Cycle CDC28 expt. (Davis) Sets of interactions (all pairs, control) Pairwise interactions (from MIPS) (Uetz et al.) between selected expression timecourses (strong interactions in permanent complexes, clearly diff.) Permanent vs. Transient Complexes 1 0.8 Rosetta 0.6 transient Permanent . L Ribosome 0.4 S Ribosome 0.2 SAGA 0 -0.2 -0.2 0 0.2 0.4 0.6 CC 0.8 1 1.2 correlation ORC2 ORC6 ORC5 ORC4 ORC3 ORC1 DPB3 CDC45 DPB2 CDC2 CDC7 POL2 HYS2 POL32 DBF4 MCM3 MCM6 CDC47 MCM2 CDC46 CDC54 Representing Expression Correlations within a Large Complex in a Matrix MCM3 MCM6 CDC47 MCM2 CDC46 CDC54 DPB3 CDC45 DPB2 CDC2 CDC7 POL2 HYS2 POL32 DBF4 ORC2 ORC6 ORC5 ORC4 ORC3 ORC1 Permanent? Transient? correlation L7/L12 Cell degrades all excess riboosmal proteins, except L7 & L12 correlation Expression Correlations Segment Large Replication Complex into Component Parts MCM3 MCM6 CDC47 MCM2 CDC46 CDC54 Temporally transient MCMs prots. Polym. d&e ORC DPB3 CDC45 DPB2 CDC2 CDC7 POL2 HYS2 POL32 DBF4 ORC2 ORC6 ORC5 ORC4 ORC3 ORC1 Proteasome No distinction visible between components Proteasome Overall .43 20S .50 19S .51 indicative of the possibility that the two components are really one? Division is an artifact of their discovery—M Hochstrasser %ORFs in complexes with significant correlation Complex (> 2 ORFS, P < 0.001) n alpha Cdc15 Cdc28 Rosetta Alpha, al-treh. anchor (50) 4 Cacinerum B (100) 3 67% 67% Chaperone containing T-complex TRiC (130) 8 50% 25% Pho85p (133.20) 6 Glycine decarboxylase (200) 3 67% ATPase (210) 4 100% TRAPP (260.60) 10 Vps4p ATPase (260.70) 3 Nucleosome protein (320). 8 Cytochrome bc1 complex (420.30) 9 Cytochrome c oxidase (420.40) 8 F0/F1 ATP synthase (complex V)(420.5) 75% 33% 50% 40% 67% 100% 50% 87% 37% 75% 44% 78% 78% 38% 88% 50% 15 60% Ribonucleoside reductase (430) 4 Nuclear processing (440.10.10) 5 RNA polymerase I (510.10) 8 38% RNA polymerase II (510.40.10) 9 44% Tornow & Mewes NAR 2003 75% 50% 40% 38% 50% Average Expression of all subnunits in a complex mRNA expression (x103 ) 1 10 100 10000000 1.0635 y = 3028.4x 2 R = 0.6076 1000000 100000 10000 1000 100 10 1 protein abundance 0.1 PP INT Summary Complexes broad catagories minimize noise – Permanent complexes show strong co-expression Posttranscriptional regulation functions at a whole complex level (Washburn et al PNAS 2003) – Transient complexes have weaker co-expression Aggregated BINARY interactions (Y2H, physical, genetic) Weak co-expression similar to transient complexes --noisy data? ERROR ? minimized in larger groups Global Summary mRNA expression is related to protein abundance Broad categories minimize noise that prevents us from seeing this correlation Integrating various genomic data is integral to an analysis Biologically relevant results can be seen when looking at mRNA and protein populations Future Research Further indepth analysis into protein degredation Integrate new Tap Tagging data into protein abundance ref set More intensive modeling of the relationship between mRNA and protein Relationship between mRNA and Protein levels dPi = k mRNA - k P s;i * i d;i i dt where ks,i and kd,i are the protein synthesis and degradation rate constants, respectively, and is the growth rate ks;i * mRNAi At steady state: Pi = kdi N end rule PEST? N End Rule in Yeast Met Pro Val Gly Thr Ser Ala Cys AA Glu Ile Tyr Gln Asp His Asn Fast Decay Slow Decay Trp Leu Phe Lys Arg 1 10 100 In Vivo Hallf Life (Min) 1000 10000 Results of protein degredation Significantly higher correlation for fast decaying proteins Not for slow decay high decay rate is indicative of greater cellular control over level e.g. proteins with half lives of days – cell can’t tightly control Results are same for mRNA degredation -half lives have been quantified Acknowledgments Gerstein Lab Weissman Lab Zheng Lian This work Ronald Jansen (MSKCC) Yuval Kluger (NYU) Keck (HHMI Biopolymer Laboratory and W. M. Keck Other Projects Haiyuan Yu Hedi Hegyi Jimmy Lin Rajdeep Das Jiang Qian Nick Luscombe Entire Gerstein Lab Foundation Biotechnology Resource Laboratory) Christopher Colangelo Ken Williams Thesis Committee Mark Gerstein Sherman Weissman Kevin White Genetics Department SABRINA Liana