Normalization and Analysis of DNA Microarray Data by Self-Consistency and Local Regression
Tom Kepler
Santa Fe Institute
[email protected]

[Figure: two panels labeled "Rat mesothelioma cells, control" and "Rat mesothelioma cells treated with KBrO2".]

Normalization: method to be improved
1. Assume that some genes will not change under the treatment under investigation.
2. Identify these core genes in advance of the experiment.
3. Normalize all genes against these genes, assuming they do not change.

Normalization: new method
1. Assume that some genes will not change under the treatment under investigation.
2. Choose these core genes arbitrarily.
3. Normalize (provisionally) all genes against these genes, assuming they do not change.
4. Determine which genes do not change under this normalization.
5. Make this set the new core. If this core differs from the previous core, go to 3. Otherwise, done.

Error Model
Without error:
$I = c\,[\mathrm{mRNA}]$
With multiplicative error:
$I = c\,[\mathrm{mRNA}]\,\eta$
With indices:
$I_{ijk} = c_{ij}\,[\mathrm{mRNA}]_{ik}\,\eta_{ijk}$
where
$I$ = spot intensity
$[\mathrm{mRNA}]$ = concentration of specific mRNA
$c$ = normalization constant
$\eta$ = lognormal multiplicative error
index 1, $i$: treatment group
index 2, $j$: replicate within treatment
index 3, $k$: spot (gene)

Taking logarithms:
$Y_{ijk} = \log(I_{ijk}) = \log(c_{ij}) + \log([\mathrm{mRNA}]_{ik}) + \log(\eta_{ijk})$
$Y_{ijk} = \alpha_{ij} + (\xi_k + \delta_{ik}) + \varepsilon_{ijk}$
where
$Y$ = log spot intensity
$\xi$ = mean log concentration of specific mRNA
$\delta$ = treatment effect (conc. specific mRNA)
$\alpha$ = normalization constant
$\varepsilon$ = normal additive error
index 1, $i$: treatment group
index 2, $j$: replicate within treatment
index 3, $k$: spot (gene)

Model:
$Y_{ijk} = \alpha_{ij} + (\xi_k + \delta_{ik}) + \varepsilon_{ijk}$
Identifiability constraints:
$\sum_k \xi_k = 0$
$\sum_i n_i \delta_{ik} = 0$
Estimate by ordinary least squares:
$x_k = \bar{Y}_{\cdot\cdot k} - \bar{Y}_{\cdot\cdot\cdot}$
$a_{ij} = \bar{Y}_{ij\cdot}$
$d_{ik} = \bar{Y}_{i\cdot k} - \bar{Y}_{\cdot\cdot k} - \bar{Y}_{i\cdot\cdot} + \bar{Y}_{\cdot\cdot\cdot}$

But note: the model cannot distinguish between $\alpha$ and $\delta$.
Self-consistency:
$\sum_k w_k(\delta)\,\delta_{ik} = 0$
The weight $w_k(\delta)$ is small if the $k$th gene is judged to be changed, and close to one if it is judged to be unchanged. The procedure is iterative.

Failure of Model
[Figure, shown on two slides: scatter plot of log intensity, array 2, against log intensity, array 1.]

Generalized Model
$Y_{ijk} = \alpha_{ij}(\xi_k) + (\xi_k + \delta_{ik}) + \sigma_{ij}(\xi_k)\,\varepsilon_{ijk}$
The normalization $\alpha_{ij}(\xi_k)$ and the heteroscedasticity function $\sigma_{ij}(\xi_k)$ are slowly varying functions of the intensity, $\xi$. Estimate by local regression.

Local Regression
1. Data.
2. Predict the value at x = 50: weight the data, then fit a linear regression.
3. Predict the whole function similarly.
4. Compare to the known true function.

Simulation-based Validation
1. Reproduce observed bias.
2. Reproduce observed heteroscedasticity.

Test based on z statistic:
$z_k = \dfrac{d_{2k} - d_{1k}}{s_k \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
Choice of significance level $\alpha$: the expected number of false positives is
$E(\text{false positives}) = \alpha N$
But the minimum detectable difference increases as $\alpha$ gets smaller:

alpha     E(fp)    min diff (log scale)    min ratio
0.05      250      0.916                   2.5
0.01      50       1.09                    3
0.001     5        1.29                    3.6
0.0001    0.5      1.61                    5

Validation of method against simulated data
3. Hypothesis testing: simulated from the stated model.
"rate false pos." = mean observed / expected
[Figure: panels labeled bias, "-fold change", and proportion of changed spots.]

Validation of method against simulated data
4. Hypothesis testing: simulated from the "wrong" (mis-specified) model: additive + multiplicative noise.
[Figure: panels labeled bias, "-fold change", and proportion of changed spots.]
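The iterative core-gene procedure listed under the new normalization method can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the talk: it assumes two treatment groups, uses a hard in-or-out core (weights of exactly 0 or 1) in place of smooth weights, and the function name `self_consistent_normalize`, the `threshold` rule, and the choice of all genes as the starting core are hypothetical.

```python
import numpy as np

def self_consistent_normalize(Y, groups, threshold=1.0, max_iter=50):
    """Iterate normalization and core selection until the core is stable.

    Y        : array (n_arrays, n_genes) of log spot intensities
    groups   : array (n_arrays,) of treatment labels, 0 or 1 (assumed design)
    threshold: a gene counts as unchanged when its absolute between-group
               difference is below this value (hypothetical rule)
    Returns (offsets, core, diff).
    """
    Y = np.asarray(Y, dtype=float)
    groups = np.asarray(groups)
    # Step 2: arbitrary starting core; here, every gene.
    core = np.ones(Y.shape[1], dtype=bool)
    for _ in range(max_iter):
        # Step 3: provisional normalization. Each array's offset is its mean
        # deviation from the per-gene mean, taken over core genes only.
        offsets = (Y - Y.mean(axis=0))[:, core].mean(axis=1)
        Yn = Y - offsets[:, None]
        # Step 4: genes whose between-group difference stays small.
        diff = Yn[groups == 1].mean(axis=0) - Yn[groups == 0].mean(axis=0)
        new_core = np.abs(diff) < threshold
        if not new_core.any():
            raise ValueError("no gene judged unchanged; raise the threshold")
        # Step 5: stop once the core no longer changes.
        if np.array_equal(new_core, core):
            break
        core = new_core
    return offsets, core, diff

# Small simulated example (all numbers are made up for illustration).
rng = np.random.default_rng(1)
demo_groups = np.array([0, 0, 1, 1])
demo_Y = rng.normal(0.0, 2.0, 200)[None, :] + rng.normal(0.0, 0.1, (4, 200))
demo_Y += np.array([0.0, 0.4, -0.3, 0.2])[:, None]   # array-specific offsets
demo_Y[2:, :20] += 3.0                                # 20 genes truly changed
demo_offsets, demo_core, demo_diff = self_consistent_normalize(demo_Y, demo_groups)
print("core size:", int(demo_core.sum()))
print("estimated offsets (centered):", np.round(demo_offsets, 2))
```

Because the truly changed genes all move in the same direction in this example, normalizing against every gene biases the offsets; dropping the changed genes from the core on the second pass removes that bias. The offsets are only determined up to an additive constant, so they come out centered on zero.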
Acknowledgments
Lynn Crosby, North Carolina State University
Kevin Morgan, Strategic Toxicological Sciences, GlaxoWellcome

Santa Fe Institute
www.santafe.edu
Postdoctoral fellowships available (apply before the end of the year)
[email protected]