
Practical Issues in Microarray Data Analysis
Mark Reimers
National Cancer Institute, Bethesda, Maryland

Overview
• Scales for analysis
• Systematic errors
• Sample outliers & experimental consistency
• Useful graphics
• Implications for experimental design
• Platform consistency
• Individual differences

Distribution of Signals
• Most genes are expressed at very low levels
• Even after log-transform the distribution is skewed
• NB: Signal to abundance ratio is NOT the same for different genes on the chip

Explanation of Distribution Shape
• Left-hand steep bell curve probably due to measurement noise
• Underlying real distribution probably even steeper
[Figure: abundances + noise = observed values]

Variation Between Chips
• Technical variation: differences between measures of transcript abundance in same samples
• Causes:
  • Sample preparation
  • Slide
  • Hybridization
  • Measurement
• Individual variation: variation between samples or individuals
• Healthy individuals really do have consistently different levels of gene expression!

Replicates in True Scale
• Signals vary more between replicates at high end
• Level of ‘noise’ increases with signal
[Figure: Comparison of chips (Affy), chip 1 vs chip 2]
[Figure: Std Dev as a function of signal across all chips; axes: SD vs mean signal; red line is lowess fit]

Replicates on Log Scale
• Measures fold-change identically across genes
• Noise at lower end is higher in log transform
[Figure: chip 1 vs chip 2 after log transform]
[Figure: SD vs signal after log transform]

Ratio-Intensity (R-I) plots
• Log scale makes it convenient to represent fold-changes up or down symmetrically
• R = log(Red/Green); I = (1/2) log(Red*Green)
• a.k.a. MA (minus, add) plots
[Figure: axes: (log) Ratio vs (log) Intensity]

Variance Stabilization
• Simple power transforms (Box-Cox) often nearly stabilize variance
• Durbin and Huber derived a variance-stabilizing transform from a theoretical model:
  • $y = a + m e^{h} + e$, where $a$ is background, $e^{h}$ is multiplicative error and $e$ is static error
  • $m$ is true signal; $h$ and $e$ have $N(0, s_h)$ and $N(0, s_e)$ distributions
  • Transform: $g(y) = \log\left( (y - a) + \sqrt{(y - a)^2 + s_e^2 / s_h^2} \right)$
• Could estimate a (background) and s_h/s_e empirically
• In practice often best effect on variance comes from parameters different from empirical estimates
• Huber’s transform is harder to estimate

Box-Cox Transforms
• Simple power transformations (including log as extreme case), e.g. cube root
• Often work almost as well as variance-stabilizing transform

Should you use Transforms?
• Transforms change the list of genes that are differentially regulated
• The common argument is that bright genes have higher variability
• However you aren’t comparing different genes
• Log transform expands the variability of repressed genes
• Strong transforms (e.g. log) most suitable for situations where large fold-changes occur (e.g. cancers)
• Weak transforms more suited for situations where small changes are of interest (e.g. neurobiology)

Graphical methods
• Aims:
  • Exploratory analysis, to see natural groupings, and to detect outliers
  • To identify combinations of features that usefully characterize samples or genes
  • Not really suitable for quantitative measures of confidence
• Principal Components Analysis (PCA)
  • Standard procedure of finding combinations with greatest variance
• Multi-dimensional scaling (MDS)
  • Represent distances between samples as distances in two or three dimensions
  • Easy to visualize

MDS Plots Representing Groups
[Figure: Day 1 chips; panels: cluster diagram, multi-dimensional scaling]

Different Metrics – Same Scale
• 8 tumor; 2 normal tissue samples
• Distances are similar in each tree
• Normals close
• Tree topologies appear different
• Take with a grain of salt!
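The multi-dimensional scaling idea above (placing samples in two dimensions so that plotted distances approximate the between-sample distances) can be sketched as follows. This is a minimal sketch of classical (Torgerson) MDS with NumPy, and not necessarily the variant used for the plots in these slides. The function name `classical_mds` and the toy data are hypothetical, introduced only for illustration.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Embed points in n_components dimensions from a square distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-center the squared distances to obtain an inner-product (Gram) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by rounding or non-Euclidean distances.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Toy data (hypothetical): 10 samples x 50 log-scale gene signals.
rng = np.random.default_rng(0)
samples = rng.normal(size=(10, 50))
samples[8:] += 2.0  # shift the last two samples so they form a separate group

# Euclidean distance between every pair of samples.
diff = samples[:, None, :] - samples[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

coords = classical_mds(dist, n_components=2)  # one (x, y) row per sample
```

Each row of `coords` can be drawn as a point in a scatter plot. Swapping the Euclidean distance for another metric (for example one minus correlation) changes only how `dist` is built, which is one way to compare metrics on the same samples.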
Volcano Plot
• Displays both biological importance and statistical significance
[Figure: axes: log2(p-value) or t-score vs log2(fold change)]

Quantile Plot
• Plot sample t-scores against t-scores under random hypothesis
• Statistically significant genes stand out
[Figure: axes: sample t-scores vs corresponding quantiles of t-distribution]

Systematic Variation
• Intensity-dependent dye bias due to ‘quenching’
• Stringency (specificity) of hybridization due to ionic strength of hyb solution
• How far hybridization reaction progresses due to variation in mixing efficiency
• Spatial variation in all of the above

Relevance for Experimental Designs
• Balanced designs with several replicates built in have smaller standard errors than reference design with same number of chips (Kerr & Churchill)
• Assuming error is random!
• In practice very hard to deal with systematic errors in a symmetric design
• No two slides with comparable fold-changes
[Figure: design diagram with Samples 1 to 5]

Critique of Optimal Designs
• Optimal for reduction of variance, if
  • All chips are good quality
  • No systematic errors, only random noise
• In fact systematic error is almost as great as random noise in many microarray experiments
• With loop designs single chip failures cause more loss of information than with reference designs

Individual Variation
• Numerous genes show high levels of inter-individual variation
• Level of variation depends on tissue also
• Donors, or experimental animals, may be infected, or under social stress
• Tissues are hypoxic or ischemic for variable times before freezing

Frequent False Positives
• Immunoglobulins and stress response proteins often 5-10X higher than typical in one or two samples
• Permutation p-values will be insignificant, even if t-score appears large
[Figure: frequency of gene levels, Group 1 vs Group 2]
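The point about permutation p-values can be illustrated with a small sketch. It assumes a Welch t-score and a simple label-shuffling scheme, which the slides do not specify; the function names `t_score` and `permutation_p_value` and the expression values are hypothetical. One sample in group 1 is roughly 8X higher than the rest, mimicking the outlier pattern described above.

```python
import random
from statistics import mean, variance

def t_score(group1, group2):
    """Welch t-statistic for two groups of expression values."""
    se2 = variance(group1) / len(group1) + variance(group2) / len(group2)
    return (mean(group1) - mean(group2)) / se2 ** 0.5

def permutation_p_value(group1, group2, n_perm=2000, seed=0):
    """Two-sided p-value: share of label shuffles with |t| >= the observed |t|."""
    rng = random.Random(seed)
    observed = abs(t_score(group1, group2))
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        if abs(t_score(pooled[:n1], pooled[n1:])) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical gene: a single outlying sample drives the group difference.
group1 = [100.0, 105.0, 98.0, 102.0, 800.0]
group2 = [101.0, 99.0, 103.0, 97.0, 100.0]
p_outlier = permutation_p_value(group1, group2)
```

Because the outlier produces a similar |t| whichever group it is shuffled into, many permutations match or exceed the observed score, so `p_outlier` stays well above conventional significance thresholds.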
