Probe Level Analysis of Affymetrix™ Data
Mark Reimers, NCI

Outline
• Design of Affy probesets
• Background
• Normalization
• Non-specific hybridization
• Estimation
• Comparison of Methods

Affymetrix GeneChip® Probe Arrays
[Figure: GeneChip probe array and hybridized probe cell. Labels: single-stranded, fluorescently labeled DNA target; oligonucleotide probe; 20 µm; 1.28 cm. Inset: image of hybridized probe array]
• Each probe cell or feature contains millions of copies of a specific oligonucleotide probe
• Over 400,000 different probes complementary to genetic information of interest

Affymetrix Probe Design
[Figure: published gene sequence (5´ to 3´) covered by multiple (11-20) 25-base oligonucleotide probes, each as Perfect Match and Mismatch]
• PM is exactly complementary to published sequence
• MM is changed on 13th base

Chip Layout
• Typical chips are square: 640x640 (U95A), 712x712 (U133) or 1042x1042 (Plus2)
• Older chips placed all probes for one gene in a row
• Modern chips distribute probes according to sequence, not gene

Chip Nomenclature
• HGU133A - Human Genome: Unigene build 133, first chip
• PM - ‘perfect match’
• MM - ‘mismatch’
• Control sequence - sequence from unrelated organism
• Signal - intensity; doesn’t translate directly to abundance
• Cross-hybridization - binding of sequences other than target

Affymetrix Background Adjustment and Normalization

What’s the Issue?
• Background: some Affy chips show consistently higher values for the lowest signals (presumably absent) than others
• Background may vary over a chip
• Normalization: distribution of probe signals may differ between chips, independent of background adjustment
• PM and MM may be shifted differently

[Figure: Probe Intensities in 23 Replicates]

Approaches to Background
• Subtract common estimate of background
• Fit local background across chip and subtract - MAS 5.0
• Consider background as random variable
• Use statistical theory to derive background correction

RMA ‘Bayesian’ BG Correction
• Each S = BG + Intensity + e
• BG randomly sampled from Normal distribution
• Intensity randomly sampled from exponential distribution
• Estimate mean and SD of BG distribution by fitting values below mode of signal distribution
• Estimate Intensity, conditional on S, by integrating over possible values of BG:
  $I = \int_{0}^{K} (S - x)\, dN_{\mu,\sigma}(x)$

Approaches to Normalization
• Simple: find average of each chip; divide all values by chip average
• MAS5: trimmed mean
• Invariant set: find subset of probes in almost same rank order in each chip
• Quantile normalization: fit to average quantiles across experiment

Probes on Different Chips
[Figure: plots of two Affymetrix chips against the experiment means]

MAS 5.0
• Plot probes from each chip against common base-line chip
• Fit regression line to middle 98% of probes

Invariant Set (Li-Wong) Method
• Select baseline chip X
• For each other chip Y:
• Select probes $p_1, \dots, p_K$ (K ~ 10000), such that $p_1 < p_2 < \dots < p_K$ in both chips
• Fit running median through points $\{(x_{p_1}, y_{p_1}), \dots, (x_{p_K}, y_{p_K})\}$
• Repeat

Quantile Method (RMA)
• Distributions of probe intensities vary substantially among replicate chips
• This cannot be even approximately resolved by any linear transformation
• Drastic solution: ‘shoehorn’ all probe intensities into same distribution
• Ideal distribution is taken as average of all

Quantile Normalization
• Formula: $x_{\mathrm{norm}} = F_2^{-1}(F_1(x))$, where $F_1$ is the cumulative distribution function of the chip intensities and $F_2$ that of the reference distribution
• Assumes: gene distribution changes little
[Figure: density functions and cumulative distribution functions $F_1(x)$, $F_2(x)$ for the distribution of chip intensities and the reference distribution]

[Figures: Ratio-Intensity: Before; Ratio-Intensity: After]

Critique of RMA Normalization
• Distribution of signals looks more like exponential on log scale
• No allowance for regional biases in BG
• Quantile normalization is very strong: highly expressed genes won’t be equal
• Better to let higher end be roughly linear
• Requires much memory - could be implemented differently

Model-based Estimates for Affymetrix Raw Data

Many Probes for One Gene
[Figure: gene sequence (5´ to 3´) with multiple oligo probes, Perfect Match and Mismatch]
• How to combine signals from multiple probes into a single gene abundance estimate?

Probe Variation
• Individual probes don’t agree on fold changes
• Probes for one gene may vary by two orders of magnitude on each chip
• CG content is most important factor in signal strength
[Figure: signal from 16 probes along one gene on one chip]

Competing Models 2005
• GCOS (Affymetrix MicroArray Suite 5.0): manufacturer’s software
• dChip: Li and Wong, HSPH
• Bioconductor affy package (RMA): Bolstad, Irizarry, Speed, et al; variants such as gcRMA, vsn
• Probe-level analyses: affyPLM, logit-t, …

Probe Measure Variation
• Typical probes are two orders of magnitude different!
• CG content is most important factor
• RNA target folding also affects hybridization
[Figure: plot with axis from 0 to 3×10^4]

Principles of MAS 5 method
• First estimate background
• bg = MM (if physically possible)
• log(bg) = log(PM) - log(non-specific proportion) (if impossible)
• Non-specific proportion = max(SB, e)
• SB = Tukeybiweight(log(PM) - log(MM))
• Signal = Tukeybiweight(log(Adjusted PM))

Critique of MAS 5 principle
• Not clear what an average of different probes should mean
• Tukey bi-weight can be unstable when data cluster at either end - frequently the conditions here
• No ‘learning’ based on cross-chip performance of individual probes

Motivation for multi-chip models
[Figure: probe level data from spike-in study (log scale); note parallel trend of all probes. Courtesy of Terry Speed]

Linear Models
• Extension of linear regression
• Essential features:
• Measurement errors independent of each other (‘random noise’); needs normalization to eliminate systematic variation
• Noise levels comparable at different levels of signal
• Small number of factors give predicted levels; factors combine in linear function or simple algebraic form

Model for Probe Signal
• Each probe signal is proportional to
  i) the amount of target sample - a
  ii) the affinity of the specific probe sequence to the target - f
• NB: High affinity is not the same as specificity
• Probe can give high signal to intended target and also to other transcripts
[Figure: probes 1, 2, 3 with affinities f1, f2, f3; chip 1 and chip 2 with amounts a1, a2]

Multiplicative Model
• For each gene, a set of probes $p_1, \dots, p_k$
• Each probe $p_j$ binds the gene with efficiency $f_j$
• In each sample there is an amount $q_i$
• Probe intensity should be proportional to $f_j \times q_i$
• Always some noise!
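A minimal sketch of the multiplicative model above, assuming noise-free signals and made-up values for the amounts q_i and affinities f_j (the function names are illustrative and do not come from any package). It shows why probes with very different raw intensities can still follow parallel trends on the log scale: under this model the log ratio between two chips is the same for every probe.

```python
import math

def model_intensities(amounts, affinities):
    """Noise-free model: intensity[i][j] = q_i * f_j (chip i, probe j)."""
    return [[q * f for f in affinities] for q in amounts]

def log2_ratios(chip_a, chip_b):
    """Per-probe log2 ratio of chip_a over chip_b."""
    return [math.log2(a / b) for a, b in zip(chip_a, chip_b)]

# Hypothetical values, chosen only for illustration.
q = [100.0, 400.0]      # amount of target in chip 1 and chip 2
f = [0.05, 1.0, 5.0]    # probe affinities spanning two orders of magnitude

pm = model_intensities(q, f)

# Raw intensities differ widely between probes on the same chip ...
spread = max(pm[0]) / min(pm[0])

# ... but every probe reports (nearly) the same log2 ratio between chips.
ratios = log2_ratios(pm[1], pm[0])
print(spread, ratios)
```

With real data each probe's log ratio would scatter around the common value because of the noise term, and some probes would deviate strongly.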
Robust Statistics
• Outlier: a measure that is far beyond the typical random variation
• Outliers are common in biological measures: 10-15% in Affy probe sets
• Robust methods try to fit the majority of data points
• Issue is to identify which points to down-weight or ignore
• Median is very robust - but inefficient
• Trimmed means are almost as robust and much more efficient

Robust Linear Models
• Criterion of fit: least median squares; sum of weighted squares; least squares and throw out outliers
• Method for finding fit: high-dimensional search; iteratively re-weighted least squares; median polish

Why Robust Models for GeneChips?
• 10% - 15% of individual signals in a probe set deviate greatly from pattern
• Often outliers lie close together
• Causes: scratches; proximity to heating elements; uneven fluid flow

Li & Wong (dChip)
• Model: $PM_{ij} = q_i f_j + e_{ij}$
• Original model (dChip 1.0) used $PM_{ij} - MM_{ij} = q_i f_j + e_{ij}$, by analogy with Affy MAS 4
• Outlier removal: identify extreme residuals; remove; re-fit; iterate
[Figure: fitting probes in one set on one chip. Dark blue: PM values; red: fitted values; light blue: probe SD]

Critique of Li-Wong model
• Model assumes that noise for all probes has same magnitude
• All biological measurements exhibit intensity-dependent noise

Bolstad, Irizarry, Speed (RMA)
• For each probe set, take the log transform of $PM_{ij} = q_i f_j$:
  $\log(PM_{ij}) = \log(q_i) + \log(f_j)$
• i.e. fit the model:
  $\mathrm{nlog}(PM_{ij} - bg) = a_i + b_j + e_{ij}$
  where nlog() stands for logarithm after normalization, and $a_i$, $b_j$ play the roles of $\log(q_i)$, $\log(f_j)$
• Fit this additive model by iteratively re-weighted least-squares or median polish
• Critique: assumes probe noise is constant (homoscedastic) on log scale

Comparison of Methods
• Green: MAS5.0; Black: Li-Wong; Blue, Red: RMA
• 20 replicate arrays - variance should be small
[Figure: standard deviations of expression estimates on arrays, arranged in four groups of genes by increasing mean expression level. Courtesy of Terry Speed]

Steady Improvement
• Affymetrix improves their model
• MAS P & A calls reasonable
• MAS 5.0 estimation does a reasonable job on probe sets that are bright (abundant genes)
• PLIER is a multi-chip model
• dChip and RMA do better on genes that are less abundant (signalling proteins, transcription factors, etc.)

Expression Comparison 1 - MAS 4
[Figure: Ratio-Intensity plot comparing two chips from spike-in experiment. White dots represent unchanged genes; red numbers flag spike-in genes. Courtesy of Terry Speed]

Expression Comparison 2 - MAS 5
[Figure: t-scores, changed genes, and theoretical t-distribution]

Expression Comparison 3 - Li-Wong
[Figure. Courtesy of Terry Speed]

Expression Comparison 4 - RMA
[Figure. Courtesy of Terry Speed]

Comparison on Real Data
• These results are based on samples with 14 spike-ins - not realistic complexity
• Choe et al (Genome Biology 2005) produced a spike-in data set with realistic complexity and found that MAS5 PM correction worked well
• Comparisons of biological variation vs technical variation in replicated samples suggest RMA defaults work best

Mix and Match Methods in affy
• Background: rma, mas
• Normalization: quantile, constant, …
• PM-correction: none
• Model: median polish, mas
• Estimates <- expresso(cel.data, bgcorrect.method = "mas", normalize.method = "quantiles", …)

gcRMA: Estimating Non-specific Hybridization
• Each probe has its own characteristic cross-hybridizations (NSH)
• Mismatch is not a good estimate of NSH
• GC content may predict NSH reasonably well
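To illustrate the median polish fit mentioned for the RMA additive model (log intensity = chip effect + probe effect + error), here is a minimal sketch in plain Python. It is not the affy package's implementation: the function name, the toy log2 values and the single injected outlier are assumptions made for illustration only.

```python
from statistics import median

def median_polish(table, max_iter=10, tol=1e-6):
    """Fit table[i][j] ~ overall + row[i] + col[j] by alternating median sweeps."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(r) for r in table]
    overall, row, col = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep the median of each row (chip) out of the residuals.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            row[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        # Move the median of the column effects into the overall term.
        shift = median(col)
        overall += shift
        col = [c - shift for c in col]
        # Sweep the median of each column (probe) out of the residuals.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col[j] += col_med[j]
        resid = [[resid[i][j] - col_med[j] for j in range(n_cols)] for i in range(n_rows)]
        # Move the median of the row effects into the overall term.
        shift = median(row)
        overall += shift
        row = [r - shift for r in row]
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row, col, resid

# Toy log2 data (hypothetical): 3 chips x 5 probes, exactly additive ...
chip_level = [8, 10, 9]
probe_effect = [-2, 0, 1, 3, -1]
log_pm = [[a + b for b in probe_effect] for a in chip_level]
log_pm[1][3] += 5   # ... plus one outlying probe signal on chip 2

overall, row, col, resid = median_polish(log_pm)
expression = [overall + r for r in row]   # one summary value per chip
print(expression, resid[1][3])
```

In this toy case the outlier is left in the residuals rather than pulling the chip-level estimate, which is the behaviour a robust fit is meant to give; a least-squares fit of the same table would spread the outlier across the chip and probe effects.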