A simple statistical model for deciphering the cdc15-synchronized yeast cell cycle-regulated gene expression data

Ker-Chau Li, Robert Yuan (Statistics, UCLA)
Ming Yan (Biochemistry, UCLA)

The goal of this study is to demonstrate how simple statistical models can help organize and explain complex gene expression patterns.

Outline
• Introduction: microarray and cell cycle
• Data: cdc15 experiment
• A statistical model
• Phase determination
• Comparison with Spellman et al. (1998)
• Regularly oscillated genes
• Further discussion

MicroArray
• Allows measuring the mRNA level of thousands of genes in one experiment: a system-level response
• The data generation can be fully automated by robots
• Common experimental themes:
  – Time course
  – Mutation/knockout response

Time course: [Figure: expression level (0 to 1) against time, with a change of condition marked]

Or (mutation/knockout response):

       A      B      C      D      E
A      --     2.1    0.8    1.3    0.5
B      0.2    --    -0.5    2.3    0.22
…            -1.2    --     0.3   -1.1
…..

MicroArray Technique
• Probe side: synthesize gene-specific DNA oligos; attach oligos to a solid support
• Sample side: tissue or cell; extract mRNA; amplification and labeling
• Hybridize
• Scan and quantitate

Yeast Cell Cycle
[Figure: yeast cell cycle, adapted from Molecular Cell Biology, Darnell et al.]

Getting a homogeneous population of cells
• Start with cells at various stages of the cell cycle
• Synchronization conditions:
  – Temperature shift to 37 C for the cdc15 yeast ts-strain
  – Add pheromone
  – Elutriation
• Release back into the cell cycle
• Take samples as the cells progress through the cycle simultaneously

Data
The data set is available at http://cellcycle-www.stanford.edu
We focus on one experiment in which a strain of yeast (cdc15-2) was incubated at a high temperature (35 degrees C) for a long time, causing cdc15 arrest. Cells were then shifted back to a low temperature (23 degrees C), and gene expression was monitored every 10 min for 300 min.
Data from some chips are not available. We concentrate on the 19 consecutive time points from 70 mins to 250 mins.

24 time points (mins): 10 30 50 70 80 ..... 240 250 270 290
(the points from 70 to 250 are 10 mins apart)

Use of the full data will be discussed later.
Genes with missing values are also deleted. There are 4530 genes remaining.
The data can be represented by a 4530 by 19 matrix.

Example of the time curve
[Figure: Histone genes (HTT2), ORF: YNL031C]
[Figure: time courses of YKL164C and YNL082W]

Preliminary study with two-way ANOVA
This investigates the constancy of the average expression level over time for each gene, and the constancy of the average expression level over all genes at each time point.

Data set: cdc15

Factor      df       SS           MS           F
gene        4529     5.2408E+2    1.1572E-1    6.4169E-1
time        18       2.9745E+2    1.6525E+1    9.1638E+1
residual    81522    1.4701E+4    1.8033E-1
total       86069    1.5522E+4

Gene: insignificant.
Time: appears statistically significant; but ... (next slide)

Column mean (Time) from the ANOVA result
[Figure: column means by time point]
The values are small.

The expression level is log_2 of the ratio red/green:
• Red = light intensity of the red channel - "noise"
• Green = light intensity of the green channel - "noise"
• Red channel = mRNA from cells at one time point
• Green channel = mRNA from unsynchronized cells
A .5-fold increase corresponds to log_2 1.5 = .585; a log_2 ratio of .15 corresponds to 2^.15 = 1.11, a .11-fold increase.

A statistical model
• Motivation: modeling each curve with simple functions such as linear, quadratic, sine, or cosine appears reasonable but inflexible
• Parsimony and accuracy can be gained if the basis curves are chosen by the data themselves
• The model: each gene expression curve = c0 + c1 V1 + c2 V2 + c3 V3 + error
  V1: 1st basis curve
  V2: 2nd basis curve
  V3: 3rd basis curve

The model (continued)
The errors have mean zero, are uncorrelated, and have the same variance across time; but the variance may depend on the gene (this is important).
It turns out that we can find the basis functions from an application of PCA.
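The idea of taking the basis curves from the data can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the matrix `X` is synthetic (not the cdc15 data), the 110-minute period is hypothetical, each gene's curve is assumed to be centred over time before PCA, and the function names `pca_basis` and `fit_curves` are made up for this sketch.

```python
import numpy as np

def pca_basis(X, k=3):
    """Top-k eigenvalues and orthonormal basis curves (k x T) of gene curves X."""
    # Assumption: each gene's curve is centred over time before PCA.
    Xc = X - X.mean(axis=1, keepdims=True)
    S = Xc.T @ Xc                          # T x T cross-product matrix
    evals, evecs = np.linalg.eigh(S)       # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1][:k]    # indices of the k largest eigenvalues
    return evals[order], evecs[:, order].T

def fit_curves(X, V):
    """Least-squares fit of every row of X on a constant plus the rows of V."""
    c0 = X.mean(axis=1)
    Xc = X - c0[:, None]
    # Rows of V are orthonormal and orthogonal to the constant curve,
    # so the least-squares coefficients are plain inner products.
    coef = Xc @ V.T
    fitted = c0[:, None] + coef @ V
    rss = ((X - fitted) ** 2).sum(axis=1)  # residual sum of squares per gene
    tss = (Xc ** 2).sum(axis=1)            # total sum of squares per gene
    return c0, coef, rss, 1.0 - rss / tss

# Synthetic stand-in for the 4530 x 19 matrix (NOT the real data).
rng = np.random.default_rng(0)
t = np.arange(70, 260, 10)                 # 19 time points, 70..250 mins
period = 110.0                             # hypothetical cycle length in mins
amp = rng.normal(size=(200, 2))
X = (amp[:, :1] * np.sin(2 * np.pi * t / period)
     + amp[:, 1:] * np.cos(2 * np.pi * t / period)
     + 0.1 * rng.normal(size=(200, t.size)))

evals, V = pca_basis(X, k=3)
c0, coef, rss, r2 = fit_curves(X, V)
print(V.shape, coef.shape, float(r2.mean()))
```

With 19 time points and four fitted coefficients per gene, 15 residual degrees of freedom remain, which matches the "Degrees of freedom: 15" reported with the least-squares fits in these slides.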
(see pdf file for pca)

Enhanced PCA for curve fitting
• Choose the number of basis curves by eigenvalues
• Assess the goodness of each curve fit by R-squared and by the residual sum of squares
• Identify genes that comply well with the model
• Interactive plotting helps with resetting user-specified parameters

PCA
For a list of vectors, PCA can be used to find a common basis based on the scaling matrix.
Covariance matrix: $\Sigma = (X - \mu)'(X - \mu)$
The directions found have the highest variance along them.
Find the directions by eigenvalue decomposition: $\Sigma V_j = \lambda_j V_j$
Model the curves by the PCA directions: $X_i = a_{1i} V_1 + a_{2i} V_2 + \dots + a_{ki} V_k$
Here, we chose the first three PCA directions as our basis.

[Figures: 1st PCA direction; 2nd PCA direction; 3rd PCA direction; eigenvalues]

1. Compliance check
   H0: the three-bases model holds.
   Reject if $R_i^2 < 0.56$ and $RSS_i > 7.25$
   (correlation coefficient between fit and observed < .75, and error s.d. bigger than .70, which is equivalent to a .5-fold increase).

2. Cycle component check
   H0: $a_{2i} = a_{3i} = 0$
   Reject if $\dfrac{(a_{2i}^2 + a_{3i}^2)/2}{RSS_i/15} > F_{2,15}(0.95) = 3.68$

3. Smoothness check
   H0: $a_{1i} = 0$
   Reject if $\dfrac{|a_{1i}|}{\sqrt{RSS_i/15}} > t_{15}(0.975) = 2.131$

Classification of the genes
6178 genes
  missing values: 1648
  complete: 4530
    non-compliance: 41
    compliance: 4489
      insignificant cycle components: 2824
      significant cycle components: 1665
        smooth: 714
        non-smooth: 951

For the non-compliance group, each curve pattern is examined visually. *** of these 41 have visible cycle patterns.

Non-compliance genes (41)
• High overall expression levels
• May or may not show cycle patterns
• Recommendation: inspect each gene separately

Phase determination
• The second and the third basis curves show clear cycle patterns. The third basis appears to be a 40-min-delayed version of the second basis, with an R-squared value of .78
• Linear combinations of these two basis curves show a variety of expression patterns.
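The three checks (compliance, cycle component, smoothness) can be chained into a small decision routine. The sketch below is an illustration, not the authors' software: the function name `classify` is made up, and it assumes that "Variable 0", "Variable 1" and "Variable 2" in the least-squares tables of these slides are the loadings on the 1st, 2nd and 3rd basis curves, and that RSS can be recovered as 15 times the squared "Sigma hat". The critical values are the ones quoted on the slide.

```python
import math

F_CRIT = 3.68    # F_{2,15}(0.95), as quoted on the slide
T_CRIT = 2.131   # t_15(0.975), as quoted on the slide
DF = 15          # 19 time points minus 4 fitted coefficients

def classify(a1, a2, a3, rss, r2):
    """Apply the compliance, cycle-component and smoothness checks in order."""
    # 1. Compliance: reject the three-bases model if the fit is poor on both counts.
    if r2 < 0.56 and rss > 7.25:
        return "non-compliance"
    sigma2 = rss / DF  # error variance estimate (assumed positive)
    # 2. Cycle components: F test of a2 = a3 = 0.
    f_stat = ((a2 ** 2 + a3 ** 2) / 2) / sigma2
    if f_stat <= F_CRIT:
        return "insignificant cycle components"
    # 3. Smoothness: t test of a1 = 0; a significant a1 means non-smooth.
    t_stat = abs(a1) / math.sqrt(sigma2)
    return "non-smooth" if t_stat > T_CRIT else "smooth"

def rss_from_sigma(sigma_hat):
    """Recover RSS from the reported 'Sigma hat' (assumption: RSS = 15 * sigma^2)."""
    return DF * sigma_hat ** 2

# Coefficients, RSS and R-squared as reported elsewhere in these slides.
examples = {
    "YJL159W": (1.28464, -2.04016, -1.49779, 14.15322, 0.36273),
    "YPR035W": (-2.47649, 3.958405e-2, 1.01860, rss_from_sigma(0.207573), 0.917337),
    "YGR231C": (-0.156478, -1.59995, -0.623201, rss_from_sigma(0.180082), 0.859375),
}
for gene, args in examples.items():
    print(gene, classify(*args))
```

Under these assumptions the routine reproduces the groups that the slides give for these three genes: YJL159W is a non-compliance gene, GLN1 (YPR035W) falls in the non-smooth group, and YGR231C in the smooth group.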
Construction of a compass plot
• Use of known cycle-regulated genes
• Compliance checking with the RSS/R^2 plot
• Cycle-exhibition checking with projection angles
• Coherent pattern checking by ANOVA
• (A list of 104 known genes in 6 groups)

Phases of genes
Identify the phases of genes.
Prior knowledge: there were 104 known genes whose phases were determined by traditional experimental methods.

Known genes: there are 6 groups of genes.
• SCB (G1 phase)
• MCB (G1 phase)
• Histone (S phase)
• S/G2 phase
• G2/M phase
• M/G1 phase

Non-compliance genes and genes without significant cycle components are excluded.
The SCB group is also excluded because of the inconsistent patterns within its expression vectors.

82 known-phase genes without missing values.
Remove genes with insignificant cycle components.
Points are obtained by normalizing the loading coefficients for the 2nd and 3rd bases to unit length.

[Figure: late G1, SCB-regulated genes]
[Figure: compass plot for phase assignment, with histone genes marked and sectors labeled S, G1, S/G2, M/G1, G2/M]

Phase assignment (number of genes)

Phase     Smooth    Non-smooth
G1        103       108
S         27        31
S/G2      255       352
G2/M      239       295
M/G1      90        165
Total     714       951

Comparison
• We re-classified the 800 cell cycle-regulated genes classified by Spellman et al. with our method. If a gene does not comply with our model, or does not have significant second or third regression coefficients, we do not assign a phase.
• Contingency tables of mismatched and unclassified cases.

800 genes
  missing values
  complete: 654
    non-compliance: 9
    compliance: 645
      insignificant cycle components: 130
      significant cycle components: 515
        smooth: 293
        non-smooth: 222

The 130 genes with insignificant cycle components appear quite bumpy.
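The normalization step behind the compass plot can be illustrated as follows. This is a sketch only: the function name `compass_point` is made up, and the angle convention (counter-clockwise from the 2nd-basis axis, via `atan2`) is an assumption that need not match the "Angle" values reported on the slides. The phase sector boundaries are not given in the slides, so no phase is assigned here.

```python
import math

def compass_point(a2, a3):
    """Scale the loadings (a2, a3) to unit length; return the point and its angle."""
    norm = math.hypot(a2, a3)
    if norm == 0.0:
        raise ValueError("both cycle loadings are zero; no direction is defined")
    x, y = a2 / norm, a3 / norm
    # Assumption: angle in radians, counter-clockwise from the 2nd-basis axis.
    return (x, y), math.atan2(y, x)

# Loadings on the 2nd and 3rd bases for GLN1 (YPR035W), as reported in the slides.
point, angle = compass_point(0.03958405, 1.01860)
print(point, angle)
```

Genes whose normalized points fall close together on the unit circle have similar mixtures of the two cyclic basis curves, which is what allows the known genes to mark out phase sectors.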
A non-compliance gene: YJL159W
Spellman et al.'s score: 10.86 (M/G1)
R2: 0.36273
RSS: 14.15322
Angle: -2.43803

Least squares estimates (standard errors in parentheses):
  Constant     -4.794002E-16   (0.222846)
  Variable 0    1.28464        (0.971364)
  Variable 1   -2.04016        (0.971364)
  Variable 2   -1.49779        (0.971364)

[Figure: black = data curve; red = fitted curve (full model); blue = fitted curve (cyclic model)]

Locus_info:
  Other_name: PIR2, YJL159W, CCW7, ORE1
  Gene_class: HSP
  Gene_Info: HSP150
  Gene_product: Heat shock protein, secretory glycoprotein
  Function: cell wall structural protein
  Cellular_Component: cell wall
  Process: cell wall organization and biogenesis
  Phenotype: Null mutant is viable
  Locus_notes (14): HSP150 has also been called gp400
Position_info:
  Chromosome: X
  ORF_name: YJL159W

An example of our non-compliance genes: YDR055W
Spellman et al.'s score: 7.266 (M/G1)
R2: 0.30136
RSS: 7.94018
Angle: -2.81396 (Insig. Coef.)

Least squares estimates (standard errors in parentheses):
  Constant     -5.428720E-16   (0.166914)
  Variable 0    1.47329        (0.727561)
  Variable 1   -1.07451        (0.727561)
  Variable 2   -0.316032       (0.727561)

[Figure: black = data curve; red = fitted curve (full model); blue = fitted curve (cyclic model)]

Locus_info:
  Other_name: YDR055W
  Gene_class: PST
  Gene_Info: PST1
  Description: Protoplasts-secreted
  Gene_product: The gene product has been detected among the proteins secreted by regenerating protoplasts
  Phenotype: Viable
Position_info:
  Chromosome: IV
  ORF_name: YDR055W

An example of gene non-compliance: YNL082W
Spellman et al.'s score: 4.843 (G1)
R2: 0.229191
RSS: 18.247480537500003

Least squares estimates (standard errors in parentheses):
  Constant     -6.087129E-16   (0.253035)
  Variable 0    1.51725        (1.10295)
  Variable 1   -1.74757        (1.10295)
  Variable 2    0.263945       (1.10295)

[Figure: black = data curve; red = fitted curve (full model); blue = fitted curve (cyclic model)]

Top 10 scores and gene names from the insignificant cycle component group

Score    Gene
3.69     YOR263C
3.85     YOR320C
3.874    YGR035C
4.022    YCR042C
4.048    YPR019W
4.13     YJL194W
4.41     YJR010W
5.047    YEL068C
6.28     YGR124W
6.716    YKL172W

78 genes score higher than 6.716; 188 genes score higher than 4.022; 213 genes score higher than 3.69.
Yet these genes appear very bumpy; see next slide.

An example of an insignificant cycle component gene: YGR124W
Spellman et al.'s score: 6.28 (S/G2)
R2: 0.364945 (small)
RSS: 0.812496 (small)
Angle: 3.13118

Locus_info:
  Other_name: YGR124W
  Gene_class: ASN
  Gene_Info: ASN2
  Description: Asn1p and Asn2p are isozymes
  Gene_product: asparagine synthetase
  Phenotype: Null mutant is viable; asparagine auxotrophy occurs upon mutation of both ASN1 and ASN2
Position_info:
  Chromosome: VII
  ORF_name: YGR124W

[Figure: cdc15 time courses, 70 mins to 250 mins, for EBP2 (YKL172W), TSM1 (YCR042C), and YOR263C]

Non-smooth group from the 800 genes

Our\their   G1    S     S/G2   G2/M   M/G1   Total
G1          59    4     1      0      18     82
S           6     3     7      0      0      16
S/G2        0     0     31     3      0      34
G2/M        0     0     17     47     4      68
M/G1        0     0     0      1      21     22
Total       65    7     56     51     43     222

Smooth group from the 800 genes (low overall expression level)

Our\their   G1    S     S/G2   G2/M   M/G1   Total
G1          74    7     5      0      43     129
S           8     10    11     0      0      29
S/G2        0     1     43     1      0      45
G2/M        0     0     17     39     3      59
M/G1        1     0     1      1      28     31
Total       83    18    77     41     74     293

[Figure: CLN2 (YPL256C) (G1); HTA1 (YDR225W) (S); CLB4 (YLR210W) (S/G2); YJL091C (Phase ??)]
[Figure: CLN2 (YPL256C) (G1); HTA1 (YDR225W) (S); CLB4 (YLR210W) (S/G2); FKS1 (YLR342W) (Phase ??). Slide notes: "From 5 cell"; "From 1, total SS small"]

YOR264W
Least squares estimates (standard errors in parentheses):
  Constant     -5.706461E-16   (4.704328E-2)
  Variable 0   -0.170979       (0.205057)
  Variable 1    0.479678       (0.205057)
  Variable 2    0.762583       (0.205057)
R squared: 0.571396
Sigma hat: 0.205057
Number of cases: 19
Degrees of freedom: 15

Oscillated genes
• The first basis curve oscillates in an extremely regular way
• There are over 200 genes with such regular oscillating patterns
• Role unknown: systematic error? Common upstream promoter region?
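Returning to the two contingency tables above (our phase in rows, Spellman et al.'s phase in columns), the share of genes on the diagonal gives a quick summary of how often the two classifications agree. The sketch below only restates the tabulated counts; the helper name `agreement` is made up for illustration.

```python
PHASES = ["G1", "S", "S/G2", "G2/M", "M/G1"]

# Rows: our phase; columns: Spellman et al.'s phase (counts copied from the slides).
NON_SMOOTH = [
    [59, 4, 1, 0, 18],
    [6, 3, 7, 0, 0],
    [0, 0, 31, 3, 0],
    [0, 0, 17, 47, 4],
    [0, 0, 0, 1, 21],
]
SMOOTH = [
    [74, 7, 5, 0, 43],
    [8, 10, 11, 0, 0],
    [0, 1, 43, 1, 0],
    [0, 0, 17, 39, 3],
    [1, 0, 1, 1, 28],
]

def agreement(table):
    """Return (genes on the diagonal, all genes, diagonal share)."""
    total = sum(sum(row) for row in table)
    same = sum(table[i][i] for i in range(len(table)))
    return same, total, same / total

for name, table in (("non-smooth", NON_SMOOTH), ("smooth", SMOOTH)):
    same, total, share = agreement(table)
    print(f"{name}: {same} of {total} genes share a phase ({share:.1%})")
```

By this count the two methods give the same phase for 161 of the 222 non-smooth genes (about 73%) and 194 of the 293 smooth genes (about 66%); most disagreements sit in neighbouring phases, with G1 versus M/G1 the largest off-diagonal cell in both tables.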
DIM1 (YPL266W)
Locus_info:
  Other_name: YPL266W
  Gene_class: DIM
  Gene_Info: DIM1
  Description: Dimethyladenosine transferase (rRNA (adenine-N6,N6-)-dimethyltransferase), responsible for m6[2]Am6[2]A dimethylation in the 3'-terminal loop of 18S rRNA
  Gene_product: dimethyladenosine transferase
  Function: rRNA (adenine-N6,N6-)-dimethyltransferase
  Cellular_Component: nucleolus
  Process: 35S primary transcript processing; rRNA modification
  Phenotype: Null mutant is inviable
Position_info:
  Chromosome: XVI
  ORF_name: YPL266W

RPS1A (YLR441C)
Locus_info:
  Other_name: YLR441C, RP10A
  Gene_class: RPS
  Gene_Info: RPS1A
  Description: Homologous to rat S3A
  Gene_product: Ribosomal protein S1A (rp10A)
  Function: structural protein of ribosome
  Cellular_Component: cytosolic small ribosomal (40S)-subunit
  Process: protein biosynthesis (0006416)
  Locus_notes (13): RP10A (RPS1A) and RP10B (RPS1B) are nearly identical; this gene has also been called PLC1, but should not be confused with PLC1 on chromosome XVI, which encodes a phosphoinositide-specific phospholipase
Position_info:
  Chromosome: XII
  ORF_name: YLR441C

GLN1: YPR035W
One gene from the non-smooth group. Not in Spellman et al.'s list.
Least squares estimates (standard errors in parentheses):
  Constant     -6.276471E-16   (4.762055E-2)
  Variable 0   -2.47649        (0.207573)
  Variable 1    3.958405E-2    (0.207573)
  Variable 2    1.01860        (0.207573)
R squared: 0.917337
Sigma hat: 0.207573

Further discussion
• Others who use PCA
• Clustering
• Other data sets
• Use of SIR/PHD
• Without a time scale? B-cell lymphoma data
• Pathway study
Could genes with overall small expression levels have been removed from the beginning?

YGR231C
One gene from the smooth group. Not in Spellman et al.'s list.
Least squares estimates (standard errors in parentheses):
  Constant     -5.803153E-16   (4.131369E-2)
  Variable 0   -0.156478       (0.180082)
  Variable 1   -1.59995        (0.180082)
  Variable 2   -0.623201       (0.180082)
R squared: 0.859375
Sigma hat: 0.180082
The total sum of squares equals 3.4591, which is at about the 71.6th percentile among all genes. The median of the total sum of squares is 2.27735.
THE END

YBL002W, YER124C, YDR224C, YJL159W, YKL163W, YKL164C, YKL185W, YMR003W, YMR011W, YNL160W, YDR055W