A Quantitative Overview to Gene Expression Profiling in Animal Genetics
Analysis of (cDNA) Microarray Data: Part I. Sources of Bias and Normalisation
Armidale Animal Breeding Summer Course, UNE, Feb. 2006

MICROARRAY ANALYSIS: My (Educated?) View
1. Data included in GEXEX
   a. Whole data stored and “securely” available
   b. GP3xCLI on each hybridisation
2. Relaxed data acquisition criteria
   a. Signal to Noise > 1.00 (more relaxed criteria exist)
   b. Mean to Median > 0.85 (Tran et al. 2002)
3. Data Normalisation
4. Mixed-Model Equations
   a. Check Residuals (plot Residuals vs Predicted)
   b. Check REML estimates of Variance Components
   c. Proportion of Total Variance due to Gene x Variety
5. Process Gene x Treatment BLUPs to identify Differentially Expressed Genes
   a. t-statistics, Z-score, P-value
   b. Mixtures of Distributions, Posterior Probabilities
6. Process Differentially Expressed genes
   a. Hierarchical clustering
   b. Gene ontology analysis

MICROARRAY ANALYSIS: BASIC PIECES FOR SIGNAL DETECTION
• Foreground RED and GREEN: Rf, Gf
• Background RED and GREEN: Rb, Gb
• Background-corrected RED and GREEN (True Signals!): R = Rf – Rb, G = Gf – Gb
• Log-transformed: Log2(R), Log2(G)
• Difference (“Minus”): M = Log2(R) – Log2(G) = Log2(R/G)
• Mean (“Average”): A = 0.5 * ( Log2(R) + Log2(G) ) = 0.5 * Log2(R*G)
• MA-Plots …to come

Data Acquisition Criteria
The Red/Green intensities can be spatially biased

Data Acquisition Criteria
The Red/Green intensities can be intensity-biased
MA-Plot: Values should scatter around zero
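The basic pieces for signal detection (background correction, log transformation, M and A) can be sketched in a few lines of Python. This is only an illustration: the function name `ma_values` and the spot readings are hypothetical, and spots with no signal above background in either channel are simply returned as `None`, which is one possible convention rather than a rule stated on the slides.

```python
import math

def ma_values(rf, rb, gf, gb):
    """Return (M, A) for one spot, or None if either channel is not above background."""
    r = rf - rb  # background-corrected RED
    g = gf - gb  # background-corrected GREEN
    if r <= 0 or g <= 0:
        return None  # log2 is undefined for non-positive signals
    log_r, log_g = math.log2(r), math.log2(g)
    m = log_r - log_g          # "Minus":   M = Log2(R/G)
    a = 0.5 * (log_r + log_g)  # "Average": A = 0.5 * Log2(R*G)
    return m, a

# Hypothetical spot readings as (Rf, Rb, Gf, Gb)
spots = [
    (1124.0, 100.0, 356.0, 100.0),  # R = 1024, G = 256
    (300.0, 100.0, 900.0, 100.0),   # R = 200,  G = 800
    (90.0, 100.0, 500.0, 100.0),    # RED foreground below background
]
ma = [ma_values(*spot) for spot in spots]
```

Plotting M against A for all valid spots gives the MA-plot; without intensity-dependent bias the M values should scatter around zero at every A.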
Data Acquisition Criteria
Background Correction: Why bother?

Data Acquisition Criteria
RED versus GREEN
Log-transformation: Why bother?

Data Acquisition Criteria
MA-Plots: All versus only valid signals

Data Acquisition Criteria
Signal to Noise Ratio:
$$S2N = \frac{Fg - Bg}{Bg}$$
Mean to Median Correlation:
$$M2M = \frac{\min(\text{Mean}, \text{Median})}{\max(\text{Mean}, \text{Median})}$$

Data Normalisation
http://genome-www5.stanford.edu/mged/normalization.html
• Normalisation is an attempt to correct for systematic bias.
• Normalisation allows you to compare data from one array to another.
• Systematic bias can be introduced into microarray experiments at all stages.
• Need to:
  – Avoid it (as much as possible)
  – Recognize it
  – Correct for it
  – Discard unrecoverable data
• In practice we do not always understand the data, so inevitably some biology will be removed too (or at least not revealed).

Data Normalisation
Source: Catherine Ball (Stanford)
[Figure: two samples (Pool of Cell Lines, Tumor) annotated with sources of bias: different amounts of starting material; differential labeling efficiency of dyes; different amounts of RNA in each channel; differential efficiency of hybridization over slide surface; differential efficiency of scanning in each channel.]

Systematic Bias: Sources …
• Different labeling efficiencies or dye effects
• Scanner malfunction
• Differences in concentration of DNA on arrays (plate effects)
• Printing or tip problems
• Uneven hybridization
• Batch bias
• Experimenter issues

…and Dealing with it
• Detect and recognize the effect: Note something odd
• Determine magnitude and effect on data: Try a few methods
• Identify source of bias: Think big!
• Eliminate or reduce contributing factors
• Correct data
• Discard uncorrectable data

Systematic Bias: Labeling Efficiencies Cause Bias
• One channel of a two-channel array has higher intensity than the other (usually GREEN).
• Most common source of recognizable bias.
• Solution: The easiest bias to address (e.g. dye-swaps, balanced loops).

Systematic Bias: Scanning (operator?) Bias
• Mis-aligned lasers can cause big problems
• In this case, the two channels are slightly out of register
• Solution: Fix the scanner and repeat

Systematic Bias: Printing (operator?) Bias
• Irregularly shaped spots are often observed (printing error)
• Slides from the same printing batch cluster together
• Solution: Probably limited to better printing technique and image analysis, rather than normalization
Systematic Bias: Probe Bias
• Different concentrations of probes might produce patterns in arrays
• The biological role of probes can produce patterns in arrays
• These patterns can create a spatial bias that is not artificial, but biological

Systematic Bias: Probe Bias
[Figure: array with regions labelled Coding regions and Intergenic regions]
• Probes arranged on the array based on biological function cause spatial bias
• Solution: Avoid arranging reporters based on function; know your experimental design

Systematic Bias: Hybridisation (operator?) Bias
• Poor technique during hybridisation can cause a spatial bias
• The operator is one of the largest sources of systematic bias
• Experiments done by the same operator often cluster together more tightly than warranted by the biology
• Solution: Consistent methods, successful techniques

Data Normalisation …and other beautifying techniques

Technique                             Choices                              Aim (Real)                       Aim (Ideal)
Transformation “To Near Normality”    Log2; Lin-Log                        Numerically tractable            Gaussian
Normalisation “Location”              Location Parameter: 1. Mean          Account for systematic effects   Gaussian
                                      2. Median 3. Regression(s) (LOWESS)
Standardisation “Scale”               Scale Parameter                      Stabilise variance               Gaussian
Data Normalisation: Transformation …to near normality
Solution: Explore the entire Box-Cox family of power transformations:

$$x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln(x), & \lambda = 0 \end{cases}$$

$$l(\lambda) = -\frac{n}{2} \ln\left[ \frac{1}{n} \sum_{j=1}^{n} \left( x_j^{(\lambda)} - \bar{x}^{(\lambda)} \right)^2 \right] + (\lambda - 1) \sum_{j=1}^{n} \ln(x_j)$$

Maximum at λ ≈ 0, hence use the log-transformation.

Data Normalisation: Transformation …to near normality
[Figure: Raw Data (…exponential-like) and Log2 Transformed (…normal-like)]

Data Normalisation: Transformation …to near normality
Lin-Log Transformation: Log2(x) for signals above a transition point and linear in x below it, where x = background corrected = Fg – Bg

Data Normalisation: Transformation …to near normality
• The Edwards’ transformation and the Lin-Log transformation are attempts to use the entire data, not only those spots for which foreground is greater than background.
• The reasoning is that errors are linear and multiplicative for small and large signals, respectively.
• The search for and choice of the transition parameter could be rather unconvincing (e.g. different for different array slides).
• Solution: Use Log2 if Foreground > Background; otherwise, use a small arbitrary value (say 0), or simply disregard.
• Alternatively: Use only Foreground and Log2 it.
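As a rough sketch of the Box-Cox search and of the simple Log2 rule, the profile log-likelihood l(λ) can be evaluated over a grid of λ values. The data here are hypothetical (exponentiated normal quantiles, so that the logarithm is normal-like by construction), and the grid, the function names and the choice to return `None` for disregarded spots are all assumptions made for illustration.

```python
import math
from statistics import NormalDist

def box_cox(x, lam):
    """Box-Cox power transformation of a positive value x."""
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def profile_loglik(xs, lam):
    """Profile log-likelihood l(lambda) of the Box-Cox family for positive data xs."""
    n = len(xs)
    ys = [box_cox(x, lam) for x in xs]
    mean_y = sum(ys) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return -0.5 * n * math.log(var_y) + (lam - 1.0) * sum(math.log(x) for x in xs)

# Hypothetical intensities: exponentiated standard-normal quantiles
n = 200
xs = [math.exp(NormalDist().inv_cdf((j + 0.5) / n)) for j in range(n)]

# Grid search over lambda in [-2, 2] in steps of 0.1
grid = [k / 10 for k in range(-20, 21)]
best_lambda = max(grid, key=lambda lam: profile_loglik(xs, lam))

def log2_signal(fg, bg):
    """Log2 of the background-corrected signal if Fg > Bg, else None (spot disregarded)."""
    return math.log2(fg - bg) if fg > bg else None
```

For data of this kind the grid search settles on λ = 0, which is the log-transformation. Real intensities will not be this tidy, so the maximising λ should be read as a guide, not as a proof that the log is optimal.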
Location Normalisation
Log2(R/G) – c = M – c, where c is the Location Parameter
GLOBAL:
  Mean: c = Mean of M’s
  Median: c = Median of M’s
  Assumption: Changes roughly symmetric around Mean or Median
  LOWESS: c = c(A) = Weighted Regression of M on A
  Assumption: Changes roughly symmetric at all intensities
LOCAL:
  LOWESS: c = c(i) = Weighted Regression of M on A within print-tip-group i
LOWESS = Locally WEighted Regression and Smoothing Scatterplots

Location Normalisation: LOWESS
Source: G Rosa 2003. Genetic analysis of complex traits using SAS, ISBN 1-59047-507-0
[Figures: LOWESS fit; SAS Code; Normalised Intensities]

Location Normalisation
Source: Yang et al 2002
[Figure: None (no normalisation)]
[Figure: After Global Median]
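The two kinds of location adjustment, a single global constant c and an intensity-dependent c(A), can be illustrated with a minimal sketch. The local regression below is a simplified stand-in for LOWESS (tricube weights and a local linear fit, without the robustness iterations of the full method); the function names, the `frac` parameter and the example data are hypothetical. A print-tip-group version would apply the same fit separately to the spots of each group.

```python
from statistics import median

def global_median_normalise(ms):
    """Subtract the global median of M from every M value."""
    c = median(ms)
    return [m - c for m in ms]

def lowess_fit(a_vals, m_vals, frac=0.5):
    """Local linear fit of M on A with tricube weights (no robustness iterations)."""
    n = len(a_vals)
    k = max(2, int(round(frac * n)))  # number of neighbours defining the bandwidth
    fitted = []
    for a0 in a_vals:
        dists = sorted(abs(a - a0) for a in a_vals)
        h = dists[k - 1] or 1e-12  # distance to the k-th nearest point
        w = [(1.0 - min(abs(a - a0) / h, 1.0) ** 3) ** 3 for a in a_vals]
        sw = sum(w)
        mean_a = sum(wi * a for wi, a in zip(w, a_vals)) / sw
        mean_m = sum(wi * m for wi, m in zip(w, m_vals)) / sw
        sxx = sum(wi * (a - mean_a) ** 2 for wi, a in zip(w, a_vals))
        sxy = sum(wi * (a - mean_a) * (m - mean_m)
                  for wi, a, m in zip(w, a_vals, m_vals))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_m + slope * (a0 - mean_a))
    return fitted

# Hypothetical slide with a purely intensity-dependent dye bias: M = 0.2 * A - 2
a_vals = [6.0 + j for j in range(11)]
m_vals = [0.2 * a - 2.0 for a in a_vals]

fitted = lowess_fit(a_vals, m_vals)
normalised = [m - c for m, c in zip(m_vals, fitted)]  # M - c(A)
centred = global_median_normalise(m_vals)             # M - c
```

In this constructed example the global median only shifts the M values, so the trend with A remains, whereas subtracting the fitted c(A) removes it entirely. That is the difference the "After Global Median" and Lowess figures are meant to show.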
Location Normalisation
Source: Yang et al 2002
[Figure: Global Lowess]
[Figure: Print-tip-Group Lowess]
[Figure: After Print-tip-Group Lowess]

Location Normalisation
Additional Assumption (other than symmetry of changes): The proportion of genes that are Differentially Expressed (DE) is minimal
Question: Which genes to use?
Answer: Only the ones (housekeeping) that we know are not DE
Comment: “Boutique” arrays become a nuisance

Scale Normalisation (Standardisation)
“Some scale adjustments may be required so that the relative expression levels from one particular experiment (slide) do not dominate the average relative expression levels across replicate experiments.” Yang et al 2002

$$\frac{\log_2(R/G) - c(i)}{a(i)}$$

Notes:
1. The scaling a(i) is such that $\mathrm{Var}(M) = a(i)^2 \sigma^2$
2. The estimation requires a (“robust”) approximation to the geometric mean:
$$\hat{a}(i) = \frac{\mathrm{MAD}_i}{\sqrt[I]{\prod_{i=1}^{I} \mathrm{MAD}_i}}$$
where MAD is the Median Absolute Deviation.
3. It doesn’t get any more heuristic (funnier?) than this

Data Normalisation …and other beautifying techniques
Notes:
1. Except Log2, everything else applies only to Ratios: M = log2(R/G)
2. Except Log2, everything else applies only within slide
3. Everything is beautified to identify DE genes straight from the MA-plot, either from a single slide or from a function of M’s across slides.
4. The uncertainty in measurements increases as intensity decreases
5. Measurements close to the detection limit are the most uncertain (cf. Sensitivity)
6. Fold-change measurements ignore these effects
7. We can calculate an intensity-dependent Z-score that measures the ratio relative to the standard deviation in the data

Data Normalisation …and other beautifying techniques
[Figure: Corrected Log10(Ratio) against Mean(Log10(Intensity)), showing the locally estimated standard deviation of positive and of negative ratios, the 2-fold lines, and contours at Z = ±1, ±2 and ±5; and the Local Log10(Ratio) Z-Score against Mean(Log10(Intensity)). Z > 2 is at the ~95% confidence level. Source: J Pevsner 2004]

Normalisation: References
Bilban M, Buehler LK, Head S, Desoye G, Quaranta V. Normalizing DNA microarray data. Curr Issues Mol Biol. 2002 Apr;4(2):57-64.
Durbin BP, Hardin JS, Hawkins DM, Rocke DM. A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics. 2002 Jul;18 Suppl 1:S105-10.
Kepler TB, Crosby L, Morgan KT. Normalization and analysis of DNA microarray data by self-consistency and local regression. Genome Biol. 2002 Jun 28;3(7):RESEARCH0037.
Schuchhardt J, Beule D, et al. Normalization strategies for cDNA microarrays. Nucleic Acids Res. 2000;28(10):e47.
Tran PH, Peiffer DA, Shin Y, Meek LM, Brody JP, Cho KW. Microarray optimizations: increasing spot accuracy and automated identification of true microarray signals. Nucleic Acids Res. 2002 Jun 15;30(12):e54.
Tseng GC, Oh MK, Rohlin L, Liao JC, Wong WH. Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Res. 2001 Jun 15;29(12):2549-57.
Tsodikov A, Szabo A, Jones D. Adjustments and measures of differential expression for microarray data. Bioinformatics. 2002 Feb;18(2):251-60.
Yang MC, Ruan QG, Yang JJ, Eckenrode S, Wu S, McIndoe RA, She JX. A statistical method for flagging weak spots improves normalization and ratio estimates in microarrays. Physiol Genomics. 2001 Oct 10;7(1):45-53.
Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002 Feb 15;30(4):e15.