(2) Ratio statistics of gene expression levels and applications to microarray data analysis
Bioinformatics, Vol. 18, no. 9, 2002
Yidong Chen, Vishnu Kamat, Edward R. Dougherty, Michael L. Bittner, Paul S. Meltzer, and Jeffrey M. Trent

Outline
Introduction
Ratio Statistics
Quality Metric for Ratio Statistics
Conclusion

Introduction: Motivation
Expression-based analysis for large families of genes has recently become possible owing to the development of cDNA microarrays, which allow simultaneous measurement of transcript levels for thousands of genes.
For each spot on a microarray, signals in two channels must be extracted from their backgrounds. This requires algorithms to extract signals arising from tagged mRNA hybridized to arrayed cDNA locations and algorithms to determine the significance of signal ratios.

Introduction: Results
1. Estimation of signal ratios from the two channels, and of the significance of those ratios.
2. A refined hypothesis test is considered in which the measured intensities forming the ratio are assumed to be combinations of signal and background. The new method involves a signal-to-noise ratio, and for a high signal-to-noise ratio the new test reduces (with close approximation) to the original test. The effect of low signal-to-noise ratio on the ratio statistics constitutes the main theme of the paper.
3. A quality metric is formulated for spots.

Ratio Statistics

Ratio Statistics assuming a constant coefficient of variation
Consider a microarray having $n$ genes, with red and green fluorescent expression values labeled $R_1, R_2, \ldots, R_n$ and $G_1, G_2, \ldots, G_n$, respectively.

Hypothesis test:
$$H_0: \mu_{R_k} = \mu_{G_k} \qquad H_1: \mu_{R_k} \neq \mu_{G_k}$$

Assumption:
$$\sigma_{R_k} = c\,\mu_{R_k}, \qquad \sigma_{G_k} = c\,\mu_{G_k}, \qquad \text{so that under } H_0,\ \sigma_{R_k} = \sigma_{G_k}.$$
Ratio test statistic: $T_k = R_k / G_k$

Assuming $R_k$ and $G_k$ to be normally and identically distributed, $T_k$ has the density function
$$f_{T_k}(t; c) = \frac{(1+t)\sqrt{1+t^2}}{c\,(1+t^2)^2\sqrt{2\pi}} \exp\left[-\frac{(t-1)^2}{2c^2(1+t^2)}\right],$$
and $c$ is estimated by
$$\hat{c} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\frac{(t_i-1)^2}{t_i^2+1}}.$$

Self-self experiment

Duplicate: $T = t/t'$,
$$\log T_k = \log t_k - \log t_k' = (\log R_k - \log R_k') - (\log G_k - \log G_k'),$$
$$c \approx \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(\log R_k - \mu_{\log R_k}\right)^2},$$
where $\mu_{\log R_k}$ is the mean of the duplicate values $\log R_k$ and $\log R_k'$.

Therefore,
$$\sigma_{\log T}^2 = \sigma_{\log R}^2 + \sigma_{\log R'}^2 + \sigma_{\log G}^2 + \sigma_{\log G'}^2 \approx 4c^2.$$

Confidence interval
1. The confidence interval is obtained by integrating the ratio density function.
2. The confidence interval is determined by the parameter $c$; one can use either the value derived from pre-selected housekeeping genes or the value derived from a set of duplicate genes.

Ratio Statistics for low signal-to-noise ratio
The actual expression intensity measurement is of the form
$$R_k = (SR_k + BR_k) - \mu_{BR_k},$$
where $SR_k$ is the expression intensity measurement of gene $k$, $BR_k$ is the fluorescent background level, and $\mu_{BR_k}$ is the mean background level.

Null hypothesis of interest:
$$\mu_{R_k} = E[R_k] = E[(SR_k + BR_k) - \mu_{BR_k}] = \mu_{SR_k},$$
$$H_0: \mu_{SR_k} = \mu_{SG_k} \iff H_0: \mu_{R_k} = \mu_{G_k}.$$
Test statistic: $T_k = R_k / G_k$

Major differences:
1. The assumption of a constant cv applies to $SR_k$ and $SG_k$, not to $R_k$ and $G_k$.
2. The density of $T_k$ derived for the constant-cv case is not applicable.

SNR (signal-to-noise ratio)
Assuming that $SR_k$ and $BR_k$ are independent,
$$\sigma_{R_k}^2 = \sigma_{SR_k}^2 + \sigma_{BR_k}^2 = (c\,\mu_{SR_k})^2 + \sigma_{BR_k}^2,$$
$$\mathrm{SNR}_{R_k} = \frac{\mu_{R_k}}{\sigma_{BR_k}} = \frac{E[SR_k] + E[BR_k - \mu_{BR_k}]}{\sigma_{BR_k}} = \frac{\mu_{SR_k}}{\sigma_{BR_k}},$$
$$c_{R_k}^2 = \frac{\sigma_{R_k}^2}{\mu_{R_k}^2} = \frac{(c\,\mu_{SR_k})^2 + \sigma_{BR_k}^2}{\mu_{SR_k}^2} = c^2 + \frac{\sigma_{BR_k}^2}{\mu_{SR_k}^2} = c^2 + \frac{1}{\mathrm{SNR}_{R_k}^2}.$$

[Figure: the expression intensity scatter plot]

Confidence interval for the test statistics
Assumption: $SR_k$, $SG_k$, $BR_k$, $BG_k$ are normally distributed and independent.
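Under this normal, independent model, the relation $c_{R_k}^2 = c^2 + 1/\mathrm{SNR}_{R_k}^2$ can be checked numerically. The following is a minimal Python sketch, not the authors' code. The function names and all parameter values (signal mean 500, background mean 300, background standard deviation 100, c = 0.2) are illustrative assumptions.

```python
import random


def simulate_channel(mu_s, c, sigma_b, mu_b=300.0, n=100_000, seed=0):
    """Draw background-subtracted intensities R = (SR + BR) - mu_BR.

    mu_b, n and seed are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        sr = rng.gauss(mu_s, c * mu_s)   # signal with constant cv c
        br = rng.gauss(mu_b, sigma_b)    # local background, independent of the signal
        values.append(sr + br - mu_b)    # subtract the mean background level
    return values


def effective_cv_squared(c, snr):
    """Squared cv of the measured intensity: c^2 + 1 / SNR^2."""
    return c ** 2 + 1.0 / snr ** 2


mu_s, c, sigma_b = 500.0, 0.2, 100.0
r = simulate_channel(mu_s, c, sigma_b)

mean_r = sum(r) / len(r)
var_r = sum((x - mean_r) ** 2 for x in r) / len(r)

empirical = var_r / mean_r ** 2                       # sample cv squared
predicted = effective_cv_squared(c, mu_s / sigma_b)   # SNR = mu_SR / sigma_BR

print(f"empirical cv^2 = {empirical:.4f}, predicted cv^2 = {predicted:.4f}")
```

With these values the SNR is 5, so the background term contributes as much to the squared cv as the signal term does. This is the low-SNR regime in which the constant-cv density no longer applies.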
$$T_k = \frac{R_k}{G_k} = \frac{(SR_k + BR_k) - \mu_{BR_k}}{(SG_k + BG_k) - \mu_{BG_k}}$$
$$T \sim \frac{N(\mu_p, \sigma_p) + N(\mu_{BR}, \sigma_{BR}) - \mu_{BR}}{N(\mu_p, \sigma_p) + N(\mu_{BG}, \sigma_{BG}) - \mu_{BG}}$$
where, under $H_0$, $\mu_p = \mu_{SR_k} = \mu_{SG_k}$ $(= \mu_{R_k} = \mu_{G_k})$.

Under the assumption of constant cv for the signal (without the background), $\sigma_p = c\,\mu_p$. Define
$\sigma_B = \max\{\sigma_{BR}, \sigma_{BG}\}$ (variance parameter)
$s = \mu_p / \sigma_B$ (signal-to-noise ratio)
$\beta = \sigma_{BR} / \sigma_{BG}$ (background standard deviation ratio)
so that
$$T \sim \frac{N(s\sigma_B, cs\sigma_B) + N(0, \beta\sigma_{BG})}{N(s\sigma_B, cs\sigma_B) + N(0, \sigma_{BG})}.$$

[Figure: the 99% confidence interval for the ratio statistic, $c = 0.2$; (a) $\sigma_{BR} = \sigma_{BG} = 100$ (or $\beta = 1$); (b) $\beta$ […] $1$]

Correction of background estimation
Owing to interaction between the fluorescent signal and background, local-background estimation is often biased. To estimate the bias difference, we find the relationship between the red and green intensities under the null hypothesis by assuming a linear relation, $G = aR + b$.
$$T \sim \frac{N(\mu_p, \sigma_p) + N(0, \beta\sigma_{BG})}{N(\mu_p, \sigma_p) + N(0, \sigma_{BG})}$$

Simulation
1. Generate 10,000 data points from an exponential distribution with mean $\mu = 2{,}000$ to simulate 10,000 gene expression levels.
2. The intensity measurement for each channel is further simulated by using a normal distribution with mean intensity taken from the exponential distribution and a constant cv of 0.2.
3. Simulate the background level by a normal distribution:
(1) no bias: background level ~ $N(0, 100)$
(2) some bias: background level ~ $N(b, 100)$

[Figure: scatter plot of simulated expression data; (a) 10,000 data points with no bias from background estimation; (b) 10,000 data points with background estimation bias of 500, showing the dog-leg effect]
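The three simulation steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' procedure. The value 100 in $N(0, 100)$ is read as a standard deviation. The bias $b$ is applied to the green channel only. The background terms are treated as residuals left after local-background subtraction. The name `simulate_array` and its parameters are hypothetical.

```python
import random


def simulate_array(n_genes=10_000, mean_expr=2_000.0, cv=0.2,
                   bg_sd=100.0, bg_bias=0.0, seed=0):
    """Simulate red/green intensities for one array under the null hypothesis.

    Assumptions (not stated in the source): bg_sd is a standard deviation,
    and the background-estimation bias affects the green channel only.
    """
    rng = random.Random(seed)
    red, green = [], []
    for _ in range(n_genes):
        # Step 1: true expression level from an exponential distribution.
        mu = rng.expovariate(1.0 / mean_expr)
        # Step 2: per-channel measurement, normal with constant cv.
        r = rng.gauss(mu, cv * mu)
        g = rng.gauss(mu, cv * mu)
        # Step 3: background residual, unbiased in red, biased by bg_bias in green.
        r += rng.gauss(0.0, bg_sd)
        g += rng.gauss(bg_bias, bg_sd)
        red.append(r)
        green.append(g)
    return red, green


def mean(values):
    return sum(values) / len(values)


red0, green0 = simulate_array(bg_bias=0.0)     # case (1): no bias
red1, green1 = simulate_array(bg_bias=500.0)   # case (2): bias of 500

print("mean(G) - mean(R), no bias:  ", round(mean(green0) - mean(red0), 1))
print("mean(G) - mean(R), bias 500: ", round(mean(green1) - mean(red1), 1))
```

Plotting `green1` against `red1` on log-log axes should show the low-intensity points bending away from the diagonal. That bend is the dog-leg effect that the linear relation $G = aR + b$ is meant to capture.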
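Under the linear relation $G = aR + b$, we employ a chi-square fitting method that minimizes
$$\chi^2(a, b) = \sum_{k=1}^{N} \frac{\left(G_k - (aR_k + b)\right)^2}{\sigma_{R_k}^2 + \sigma_{G_k}^2},$$
with the estimate of $b$ given by
$$\hat{b} = \frac{\sum_{k=1}^{N}\left(c^2(R_k^2 + G_k^2) + 2\hat{\sigma}_{BR}^2 + 2\hat{\sigma}_{BG}^2\right)^{-1}(G_k - R_k)}{\sum_{k=1}^{N}\left(c^2(R_k^2 + G_k^2) + 2\hat{\sigma}_{BR}^2 + 2\hat{\sigma}_{BG}^2\right)^{-1}}.$$

Quality Metric for Ratio Statistics
For a given cDNA target, the following factors affect ratio measurement quality:
(1) Weak fluorescent intensities
(2) A smaller than normal detected target area
(3) A very high local background level
(4) A high standard deviation of target intensity

(1) Fluorescent intensity measurement quality
Under the null hypothesis, the signal means are equal, so that
$$\min\{\mathrm{SNR}_R, \mathrm{SNR}_G\} = \frac{\mu_R}{\max\{\sigma_{BR}, \sigma_{BG}\}} = \frac{\mu_R}{\sigma_B}.$$
We replace $\mu_R$ and $\sigma_B$ by their null-hypothesis estimators, $(R + G)/2$ and $\hat{\sigma}_B$, to obtain
$$w_I = \begin{cases} 0, & \dfrac{R+G}{2\hat{\sigma}_B} < 3 \\[2mm] \dfrac{R+G}{6\hat{\sigma}_B} - 1, & 3 \le \dfrac{R+G}{2\hat{\sigma}_B} \le 6 \\[2mm] 1, & \text{otherwise} \end{cases}$$

(2) Target area measurement quality
Let $A_M$ be the area of the mask of the cDNA target for a particular print-tip, and let $A_{T_k}$ be the area of the two largest connected components of target $k$. The proportional area of each target is $a_k = A_{T_k} / A_M$. We define the area measurement quality by
$$w_a = \begin{cases} 0, & a < s_{\min} = \max\{10/A_M,\ 0.05\} \\[2mm] \dfrac{a - s_{\min}}{s_b - s_{\min}}, & s_{\min} \le a \le s_b = 0.20 \\[2mm] 1, & \text{otherwise} \end{cases}$$

(3) Background flatness quality
Define background flatness $w_b = \min\{w_{BR}, w_{BG}\}$, where
$$w_{BR} = \begin{cases} 1, & BR_k \le \mu_{BR} + 4\sigma_{BR} \\[2mm] \dfrac{(\mu_{BR} + 6\sigma_{BR}) - BR_k}{3\sigma_{BR}}, & \mu_{BR} + 4\sigma_{BR} < BR_k \le \mu_{BR} + 6\sigma_{BR} \\[2mm] 0, & BR_k > \mu_{BR} + 6\sigma_{BR} \end{cases}$$
and $w_{BG}$ is defined similarly.

(4) Signal intensity consistency quality
[Figure: typical target shapes, with cv = 0.48, 0.81, 0.45, 0.98, 0.31, 0.59]

Letting $cv_{\min,k}$ denote the minimum of the intensity coefficients of variation for the red and green channels,
$$w_s = \begin{cases} 0, & cv_{\min,k} > 1.1 \\[2mm] \dfrac{1.1 - cv_{\min,k}}{0.2}, & 0.9 \le cv_{\min,k} \le 1.1 \\[2mm] 1, & \text{otherwise} \end{cases}$$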
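Three of these piecewise weights (intensity, target area, and signal consistency) can be sketched in Python as follows. This is a minimal sketch of one reading of the slide definitions, not the authors' implementation. The function and argument names are hypothetical. The thresholds (3 and 6 for the SNR estimate, 0.05 and 0.20 for the area, 0.9 and 1.1 for the cv) are the ones quoted on the slides.

```python
def intensity_weight(r, g, sigma_b):
    """w_I: ramps from 0 to 1 as (R + G) / (2 * sigma_B) goes from 3 to 6."""
    snr = (r + g) / (2.0 * sigma_b)
    if snr < 3.0:
        return 0.0
    if snr <= 6.0:
        return snr / 3.0 - 1.0   # same as (R + G) / (6 * sigma_B) - 1
    return 1.0


def area_weight(target_area, mask_area, s_b=0.20):
    """w_a: ramps from 0 to 1 as the proportional area goes from s_min to s_b."""
    a = target_area / mask_area
    s_min = max(10.0 / mask_area, 0.05)
    if a < s_min:
        return 0.0
    if a < s_b:                  # strict bound avoids dividing by zero if s_min == s_b
        return (a - s_min) / (s_b - s_min)
    return 1.0


def consistency_weight(cv_red, cv_green):
    """w_s: ramps from 1 down to 0 as the smaller channel cv goes from 0.9 to 1.1."""
    cv_min = min(cv_red, cv_green)
    if cv_min > 1.1:
        return 0.0
    if cv_min >= 0.9:
        return (1.1 - cv_min) / 0.2
    return 1.0


# Example spot: moderate intensity, small detected area, noisy target.
print(intensity_weight(450.0, 450.0, 100.0))
print(area_weight(50.0, 400.0))
print(consistency_weight(1.0, 1.3))
```

Each weight is 0 for an unusable spot, 1 for a spot with no problem on that criterion, and linear in between. A low value on any single criterion can therefore be used to flag a ratio as unreliable.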