Download Comparison of Statistical Models for Affymetrix GeneChip

Theoretical and experimental comparisons of gene expression indexes for oligonucleotide microarrays William J. Lemon, Jeffrey J.T. Palatini, Ralf Krahe, Fred A. Wright Department of Biostatistics, University pf North Carolina, Chapel Hill Division of Human Cancer Genetics Ohio State University Measuring gene expression with the Affymetrix GeneChip polyA Coding portion of gene X Perfect Match (PM) Mismatch (MM) ... PM - 25 bases complementary to region of gene MM - Middle base is different •cRNA from sample mRNA is put on the chip •intensity of binding reflects gene expression Reproducibility of Probe Sensitivities Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001. The Li-Wong Model Li-Wong Full (LWF) PM ij   j   j i   j i  e MM ij   j   j i  e, e ~ N (0, 2 ) Identifiability constraint Li-Wong Reduced (LWR)  j  J 2 j yij  PM ij  MM ij   j i   ,  ~ N (0, 2 ),  2  2 2 Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001. The Li-Wong Model Li-Wong Full (LWF) ith array jth probe pair PM ij   j   j i   j i  e MM ij   j   j i  e, e ~ N (0, 2 ) Identifiability constraint Li-Wong Reduced (LWR)  j  J 2 Total no. probe pairs j yij  PM ij  MM ij   j i   ,  ~ N (0, 2 ),  2  2 2 Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001. The Li-Wong Model expression Li-Wong Full (LWF) ith array jth probe pair PM ij   j   j i   j i  e MM ij   j   j i  e, sensitivities e ~ N (0, 2 ) Identifiability constraint Li-Wong Reduced (LWR)  j  J 2 Total no. probe pairs j yij  PM ij  MM ij   j i   ,  ~ N (0, 2 ),  2  2 2 Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001. How to compare gene expression indexes? •We get maximum likelihood estimates for  using either full data (LWF) or reduced data (LWR) •The Affymetrix software computes: Average Difference (AD) ˆ   y j J . j Log-Average (LA) 10  log( PM j / MM j ) / J j  •The log-average might perform particularly poorly. Note that if  terms are small and error variance is small, ( PM j / MM j )  ( j   j ) /( j )  ( j   j ) /( j ) •We gain insight by assuming Li-Wong model is true. Then what are the consequences? •For large sample sizes, the ’s and ’s will be wellestimated Compare LW estimators directly: 2 2     (    )   j j j var(ˆreduced ) j   2.0 RE( full, reduced)   2 j   J var(ˆ full )   Comparing to AD is tricky, but with a correction factor AD is also an unbiased estimate of  : ˆˆ   J   j  ˆ  j  var(ˆˆ ) 1 RE(reduced, AD)    1.0 ˆ var( reduced ) 1  var( ) •This also gives insight into “perfect match only” analyses: RE(full, PM-only)=  j 2 var(ˆPM ) j  1 2 var(ˆ full )  ( j   j ) j and 1  RE  2 Furthermore, PM-only is always at least twice as efficient as LWR Empirical Comparisons •We propose that an expression index is “good” if it has a high correlation with the underlying true expression (which is usually unknown). •this correlation can be estimated using a specially designed mixing experiment •if r is the correlation coefficient between the measured index and true expression, the “relative efficiency” of two indexes  and  can be estimated as r /(1  r ) 2 2 r /(1  r ) 2 2 Suppose the true underlying gene expression for a given gene is . Consider two indices of gene expression ˆ   0   1  e , e ~ N (0,   2 ) ˆ   0   1  e , e ~ N (0,   2 ). ˆˆ   (ˆ   0 ) / 1 And we have is an unbiased estimate of  ˆˆ var( )   2  / 1 2 ˆˆ )   /  1 var(  RE (ˆ,ˆ )   2 ˆˆ  /  2  1 var( ) 2 2 Can we estimate this relative efficiency? •Suppose we could do a regression of ˆ on . •the ratio of explained to residual variance in the model can be shown to be 1 var( ) /   2 2 2 r   2 1  r and similarly for ̂ , so r /(1  r ) 2 2 r /(1  r ) 2 2  RE (ˆ,ˆ ) Can we estimate r without ever knowing true expressions ? •Yes, with a specially designed mixing experiment •we seek two contrasting conditions in which many genes will be differentially expressed Experimental Design (6 replicates for each condition) Human Fibroblasts (GM 08330) 20% FBS Cell culture 5 passages 20% Serum starvation Serum stimulation 0.1% FBS 48h Harvest total RNA 0.1% 20% FBS 24h Harvest total RNA RNA extraction Produce 50:50 group Produce duplicates each day for 3d Add Bacterial Control Genes Starved 50:50 Dap, Thr Dap, Thr, Lys, Phe Stimulated Lys, Phe Synthesize cDNA, cRNA; fragment Add Hybridization Control Genes Hybridize Data Reduction BioB, BioC, BioD, Cre HuGeneFL Gene Expression Indexes BIN1 expression ˆ full Stim 50:50 Starved True expression = average of Stim, Starved BIN1 expression ˆ full Stim 1 50:50 2 Starved 3 Note that r  rˆ ,  rˆ  ,X    or  r  ˆ , X Where X=1, 2, 3 (say) for Stim, 50:50 Starved, respectively Overall intensity higher in Stimulated Mean probe intensity per array Stim 50:50 Starved Coefficients of variation for assay (individual probes) and gene expression indexes 0 .1 2 1 # g e n s 0 20 40 60 80 # g e n s 0 50 10 150 20 250 # P r o b e s 0 20 60 10 As s aL y W Sti F A m St ffy 0 .2 9 0 .1 4 9 0 .0 0 .5 1 .0 1 .5 2 .0 00 .0 1 .5 1 .0 2 .5 .0 00 .0 1 .5 1 .0 2 .5 CV CV CV Correlation matrix of 18 arrays as a colorized image for each expression index. LWR LWF Starved 50:50 Stim LA AD Starved 50:50 Stim Stim 50:50 Starved Stim 50:50 Starved Full Model Affymetrix Ave Diff Strv 2 Strv 3 Strv 1 Strv 6 Strv 5 Strv 4 Stim 2 Stim 4 50:50 1 Stim 1 Stim 6 Stim 3 Stim 5 50:50 3 50:50 5 50:50 4 50:50 2 50:50 6 Stim 2 Strv 1 Strv 3 Strv 2 Strv 6 Strv 5 Strv 4 Stim 1 Stim 6 Stim 3 Stim 5 Stim 4 50:50 5 50:50 4 50:50 3 50:50 2 50:50 1 50:50 6 Strv 3 Strv 4 Strv 6 Strv 5 Strv 2 Strv 1 Stim 2 Stim 1 Stim 4 Stim 5 Stim 6 Stim 3 50:50 5 50:50 4 50:50 2 50:50 1 50:50 6 50:50 3 Strv 1 Strv 4 Strv 2 Strv 5 Strv 3 Strv 6 50:50 3 50:50 5 50:50 4 50:50 2 50:50 1 50:50 6 Stim 4 Stim 6 Stim 5 Stim 3 Stim 1 Stim 2 Comparing Models Cluster Analysis Reduced Model Affymetrix Log Ave LA AD Unscaled LWR LWF LA AD LWR LWF 0. 0.5 1.0 1.5 Median(r2/(1-r2)) Relative Efficiency Scaled Correlation of duplicate measurements of 149 genes LWF median r=.74 LWR median r=.43 AD median r=.08 LA median r=.17 Number of unexpressed genes •Only 0.2% of the LW estimates are negative •50:50 group has fewest negative estimates •could this indicate very few unexpressed genes? Stim 50:50 Starved A conservative approach to estimating number of unexpressed genes •Let U denote number of unexpressed genes •genes are ranked according to expression index U  2  (median rank of U genes among all genes) •This is useful if we can get a random sample of unexpressed genes Unexpressed population Gene expression index •We use the spiked-out bacterial control genes as a sample of “unexpressed” genes •the 4 genes are are represented 3 times each (different portions of mRNA), for a total of 12 probe sets •Based on this reasoning, we estimate that greater than 88% of the genes are expressed, even in the Starved samples Rank of expression index variance across the 6 Stimulated arrays versus rank of index mean AD R20 ank(vr) 40 60 R a n k ( v r ) 0 20 40 60 LWF Truly absent in stim group D ap Thr P he L ys 0 2 4 60 00 0 0 2 00 4 00 60 00 000 000 Ra n k( Ra nk(me an Very low estimated expression for truly absent genes when using LWF Present/absent calls •We use the statistic ˆ z SE (ˆ) to declare genes present/absent (absolute call) •we find the vast majority of genes on the array appear to be present •for the spiked in/out genes, we find vastly improved present/absent calling using LW estimates 10. -FalseN0.g2tivRae (Sensitv0y.4) 0.6 0.8 1.0 ROC curve - spiked in/out genes LWF-Z LWR-Z Untrimmed AD LA AD Untrimmed LA Absolute Call 0 0 .0 0 .2 0 .4 0 .6 1 .8 .0 F alse (1 S Variability in estimates Reduced Model Full Model log(variance) Stim 50:50 Starved log(mean) Conclusions • • • • • • Model-based estimators are superior to simple averaging Full model superior to reduced this does not necessarily mean that the mismatch probes are a good idea - but if they are present we should use them we have demonstrated this using both analytic considerations and experimental data a carefully designed experiment can be used to address many issues Many more genes may be expressed than previously thought Other issues/ future work •Spiking genes might be used to calibrate and normalize arrays •relationship between variance and mean of expression indexes may be useful in planning experiments •our data may be useful for future work, especially in producing indexes that are resistant to probe saturation •all primary data, this Powerpoint presentation and a preprint are available at http://thinker.med.ohio-state.edu

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Comparison of Statistical Models for Affymetrix GeneChip