Comparison and analysis of gene expression and protein abundance
Comparative analysis of different label-free mass spectrometry based protein abundance estimates and their correlation with RNA-Seq gene expression data

Ning Kang (宁康), [email protected]
Computational Biology Research Group
Qingdao Institute of Bioenergy and Bioprocess Technology, Chinese Academy of Sciences (QIBEBT-CAS)
http://www.qibebt.ac.cn/
http://www.bioenergychina.org/
http://www.computationalbioenergy.org/
11/15/2012

Outline
• Background
• General analysis scheme
• Transcriptome analysis
• Proteome analysis
• Associated analysis
• Explanation of the correlations
• Technical issues
• Biological issues
• The best techniques…

The important biological questions
Everything goes high-throughput…
What is the underlying process in transcription and translation?
Transcriptome: gene expression. Proteome: protein abundance.
But the correlation between the two is not very high…

The techniques for these questions: on the proteomic side
LC–MS/MS (shotgun proteomics) is the method of choice for large-scale protein identification.
Quantification is either labeled or label-free.

The techniques for these questions: on the proteomic side
Label-free: MS-1 based "peak intensity" or MS-2 based "spectrum counting"
[Figure: spectrum counting vs. peak intensity]
Source: J. Proteome Res, 2012, 11(4), 2261-2271

The techniques for these questions: on the transcriptomic side
Next-generation sequencing has recently emerged as a promising alternative to established microarray-based methods.
[Figure: microarray vs. RNA-Seq]

The main objectives
• Comparative analysis of different label-free protein quantification methods using several software tools on the proteomic side
• Correlation analysis of gene expression data derived using microarray and RNA-Seq methods on the genomic side
• Better understanding of the correlation between gene and protein expression

The overall scheme
[Figure: overall analysis scheme]
Source: J. Proteome Res, 2012, 11(4), 2261-2271

The datasets
Mouse mitochondrial genes and proteins, comprehensively analyzed in various mouse tissues
• MitoCarta database: http://www.broadinstitute.org/pubs/MitoCarta/
• GNF1M tissue atlas
• RNA-Seq profiling: http://woldlab.caltech.edu/rnaseq/

The analysis procedure
[Flowchart components: mzXML; X!Tandem; PeptideProphet; ProteinProphet; spectral count (NSAF); msInspect; msBID; SpectrumMill; RPKM values]

The number of comparable genes
Mitochondrial (all) genes that could be compared at the proteomic level

method          brainstem     liver
msInspect       611 (1457)    596 (1102)
msBID           566 (1197)    586 (996)
Spectral count  650 (1693)    641 (1267)
SpectrumMill    679           700

Source: J. Proteome Res, 2012, 11(4), 2261-2271

Different techniques for expression measurements
• At gene expression level
• At protein abundance level

Correlation between Gene Expression and Protein Abundances
Pairwise correlations (the matrix is symmetric, so only the lower triangle is shown; parenthesised values as on the slide)

             SpectrumMill  msInspect    msBID        NSAF         RPKM
msInspect    0.91 (0.92)
msBID        0.91 (0.91)   0.89 (0.91)
NSAF         0.90 (0.90)   0.87 (0.88)  0.84 (0.89)
RPKM         0.49 (0.51)   0.51 (0.53)  0.54 (0.54)  0.51 (0.53)
Microarray   0.36 (0.40)   0.40 (0.44)  0.41 (0.42)  0.42 (0.44)  0.62 (0.61)

Correlation between Gene Expression and Protein Abundances
[Figure panels: MS-1 based "peak intensity"; MS-2 based "spectrum count"]

Changes of expressions in different tissues
mRNA vs. protein: direction of changes
For brainstem against liver, the majority of genes showed the same direction of change in gene expression (mRNA-Seq) and in protein abundance (msInspect).
[Figure axes: gene expression, brainstem vs. liver; protein abundance, brainstem vs. liver]

Technical Factors Affecting the Correlation: the lengths of genes
Gene length affects both gene expression and protein abundance values.

Technical Factors Affecting the Correlation: the low-abundance genes
Including lower-intensity genes and proteins does not significantly affect the overall correlation.

Technical Factors Affecting the Correlation: does number matter?
As the set size decreased, the standard deviation of the correlation coefficients gradually increased, with a noticeable shift in the correlation coefficients toward lower values…
(Slide annotation: Increasing R)

set size    RPKM-NSAF     RPKM-msBID    NSAF-msBID
all (527)   0.54          0.54          0.89
200         0.53 (0.03)   0.54 (0.03)   0.89 (0.01)
100         0.55 (0.06)   0.56 (0.06)   0.88 (0.02)
50          0.57 (0.10)   0.58 (0.11)   0.89 (0.03)
20          0.50 (0.12)   0.50 (0.11)   0.88 (0.05)

Technical Factors Affecting the Correlation: "coding region dominant" genes
Does gene structure matter?
Restricting the analysis to these genes only (termed "coding region dominant" genes) improved the correlation slightly.

Biological Factors Affecting the Correlation: the effect of functional annotations
Correlation between gene and protein abundances for selected GO categories

term                                                  count  RPKM-NSAF    RPKM-msBID   mean RPKM  mean NSAF  mean msBID
CC:organelle inner membrane                           189    0.67 (183)   0.64 (165)   0.44       0.52       0.49
CC:mitochondrial lumen                                120    0.51 (113)   0.52 (102)   -0.21      0.01       -0.05
BP:generation of precursor metabolites and energy     94     0.60 (90)    0.62 (84)    0.92       1.06       1.02
BP:cofactor metabolic process                         52     0.67 (48)    0.70 (43)    0.07       0.12       0.25
CC:ribosome                                           68     0.00 (65)    0.22 (51)    -0.07      -0.4       -0.61
CC:mitochondrial outer membrane                       33     0.32 (27)    0.35 (25)    0.16       0.15       0.17
CC:respiratory chain                                  45     0.13 (42)    0.27 (42)    1.22       1.21       1.14
MF:hydrogen ion transmembrane transporter activity    29     0.57 (26)    0.61 (21)    1.41       1.36       0.96
BP:organic acid catabolic process                     22     0.48 (18)    0.64 (15)    -0.41      -0.2       0.1
MF:iron ion binding                                   34     0.58 (26)    0.45 (23)    0.53       0.39       0.51
MF:nucleotide binding                                 136    0.57 (125)   0.57 (108)   -0.1       -0.08      0
BP:nitrogen compound biosynthetic process             38     0.74 (35)    0.73 (23)    0.43       0.33       0.48

Biological Factors Affecting the Correlation: the sub-location annotation issue
The correlation based on the inner membrane genes is better than the correlation based on all mouse mitochondrial genes.
• Among the top 5 most read articles in the journal in April 2012 (publication month)

Biological Factors Affecting the Correlation: the RNA/protein stability issue
mRNA and protein half-lives in the mouse
Protein and mRNA stability are among the most significant factors governing the correlation between gene and protein abundances.
Quantitative model of gene expression in growing cells (Chen, et al., Nature, 2011)

Next step: from analysis to prediction
[Diagram: mRNA expression, translation rate, and degradation rate feed a mathematical model that predicts protein expression]
Issues:
1. The translation and protein degradation rates are difficult to measure.
2. The model assumes steady state in the cell.
3. ……
Divide-and-conquer based on biclustering?

Quantification and explanations
Bi-clustering of gene expression / protein abundance

Bi-clustering of expressions…

Cluster  mRNA half-life  Protein half-life  Description
1        short           short              unstable mRNAs and unstable proteins
2        short           long               unstable mRNAs and stable proteins
3        long            long               stable mRNAs and stable proteins
4        long            short              stable mRNAs and unstable proteins

Factors for protein degradation
• Enzyme activities (slide order, from high weight to low weight): enzyme (energy metabolism) [high weight]; ribosomal protein; dehydrogenase [medium]; regulatory factor; carrier [low weight]
• Other factors:
• Amino acids W, C, T, F, Y, V are enriched in labile proteins, whereas E, D, K, N, R, Q are enriched in stable proteins.
• Short half-life proteins are enriched for membrane proteins and signal transduction proteins, whereas long half-life proteins are enriched for cytoskeleton proteins and nuclear proteins with housekeeping functions.

Preliminary results: hierarchical clustering
[Figure: hierarchical clustering, mouse liver tissue]

Preliminary results: bi-clustering
[Figure: two linear fits]
y = 0.7847x + 5.8634, R² = 0.7497
y = 0.8396x + 5.9941, R² = 0.8004

Preliminary results: clusters of interest
[Figure]

Preliminary results: bi-clustering result analysis
1. Stable mRNA and stable protein
(1) Enzymes (citric acid cycle, energy metabolism)
(2) Reductases
We reason that many housekeeping genes tend to have stable mRNAs and proteins.
2. Stable mRNA and labile protein
(1) Products of regulated genes
(2) Dehydrogenases
(3) Oxidases

Preliminary results: mathematical modeling
Use an SVM (support vector machine) to combine multiple features?
[Diagram: Cluster1 to Cluster4 and Protein1 to Protein4 feed SVM modeling; the effect of a single factor: enzyme activity; plus: 3D structure, enzyme activity, etc.]

Summary
• Spectral counts are a good basis for a more comprehensive strategy of evaluating protein abundance trends.
• Protein abundance estimated from the top 3 normalized peptide area intensities (MS1) correlated best with gene expression data collected through RNA-Seq.
• Both technical and biological factors affect the correlation between gene expression and protein abundance.
• A divide-and-conquer method can be used to design a robust computational model for extracting gene and label-free protein abundance information.
http://www.computationalbioenergy.org/

[Figure: genotype, phenotype, enterotype; big data (genomics, proteomics, Raman profiling, etc.)]
[Figure: community (metagenomic method), pure strain (genomic method), single cell (single-cell method); bioenergy, agriculture, medicine, cell biology, healthcare, environmental monitoring, synthetic biology, ecology, food screening, microbial community, bionics, fermentation, bioresources, molecular biology, biomaterial, biodefense, ……]
Metagenomic technology
Single-cell technology
Single-cell data analysis platform
Single-cell manipulation / sorting
Automatic phenotyping

Acknowledgements
Members:
Staff: Q Zhou, XQ Su, LH Ren, JY Wang, AH Wang, XZ Chang, YH Qiao
Students: RR Huang, XJ Wang, BX Song, W Fang, JQ Hu, M Gabriel (visiting), XW Cheng, J Wang
Collaborators:
• JIANG Tao (UC Riverside, USA; ACM Fellow) (on metagenomics)
• WONG Limsoon (NUS, Singapore) (on network)
• CUI Xingping (UC Riverside, USA) (on SNP detection and metagenomics)
• Yiu SM, Li SC (Hong Kong) (on network)
• Jan Baumbach (MPI, Germany) (on network)
• WEI Chaochun (SJTU, China) (on metagenomics)
• Alexey Nesvizhskii (U of Michigan, USA) (on proteomics)
• Ansgar Poetsch (RUB, Germany) (on proteomics)

Thank you!
http://ComputationalBioenergy.org
Research areas; Released software; Example software; Hardware platform
Qingdao / Tsingdao
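
Supplementary sketch: the quantities used in the talk
The talk relies on NSAF (normalized spectral counts), RPKM, correlation over random gene subsets of decreasing size, and a steady-state link between mRNA and protein levels. The Python sketch below illustrates these under stated assumptions and is not the pipeline used in the study. NSAF and RPKM follow their commonly used definitions. The steady-state function assumes simple first-order kinetics (dP/dt = k_tl * mRNA - k_deg * P), which is one possible reading of the "mathematical model" slide, not the model from the cited paper. The synthetic data, all function names, and the choice of Pearson correlation are hypothetical and chosen only for illustration.

```python
import math
import random
import statistics


def nsaf(spectral_counts, protein_lengths):
    """NSAF: each protein's SpC/length divided by the sum of SpC/length over all proteins."""
    saf = [c / l for c, l in zip(spectral_counts, protein_lengths)]
    total = sum(saf)
    return [s / total for s in saf]


def rpkm(read_counts, transcript_lengths_bp, total_mapped_reads):
    """RPKM: reads per kilobase of transcript per million mapped reads."""
    return [c * 1e9 / (l * total_mapped_reads)
            for c, l in zip(read_counts, transcript_lengths_bp)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def subset_correlation(x, y, size, repeats=200, seed=0):
    """Mean and standard deviation of Pearson r over random gene subsets of one size."""
    rng = random.Random(seed)
    indices = range(len(x))
    rs = []
    for _ in range(repeats):
        pick = rng.sample(indices, size)
        rs.append(pearson([x[i] for i in pick], [y[i] for i in pick]))
    return statistics.fmean(rs), statistics.stdev(rs)


def steady_state_protein(mrna, translation_rate, protein_half_life):
    """Assumed model: steady state of dP/dt = k_tl * mRNA - k_deg * P,
    with k_deg = ln(2) / half-life, so P = k_tl * mRNA / k_deg."""
    k_deg = math.log(2) / protein_half_life
    return translation_rate * mrna / k_deg


def synthetic_log_abundances(n=527, noise=1.0, seed=1):
    """Hypothetical data (not the talk's data): log mRNA and noisy log protein levels."""
    rng = random.Random(seed)
    log_mrna = [rng.gauss(0.0, 1.0) for _ in range(n)]
    log_protein = [m + rng.gauss(0.0, noise) for m in log_mrna]
    return log_mrna, log_protein


if __name__ == "__main__":
    log_mrna, log_protein = synthetic_log_abundances()
    print("all genes: r = %.2f" % pearson(log_mrna, log_protein))
    for size in (200, 100, 50, 20):
        mean_r, sd_r = subset_correlation(log_mrna, log_protein, size)
        print("subset size %3d: mean r = %.2f, sd = %.2f" % (size, mean_r, sd_r))
```

On the synthetic data, the spread of the subset correlations grows as the subset shrinks, which is the qualitative pattern reported in the "does number matter?" table. Under the assumed steady-state model, two genes with equal mRNA levels and translation rates differ in protein level in proportion to their protein half-lives, which is one way to see why the half-life clusters are a natural unit for a divide-and-conquer model.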