Download ppt

Identifying expression differences in cDNA microarray experiments Lecture 20, Statistics 246, April 6, 2004 1 Introduction • • • • • Many microarray experiments are carried out to find genes which are differentially expressed between two (or more) samples of cells. Examples abound: cells (from the liver, say), in a mouse with a gene knocked out, compared with liver cells in a normal mouse of the same strain cells in one region of the brain (say cerebellum), compared with cells in a different region (say the anterior cingulate region) tumor cells in some organ (say the liver), compared with normal cells from the same organ cells from an organism (say yeast) after a treatment (say by heat, or cold, or a drug) compared with cells of the same kind in the untreated state cells from some part of a developing organ or organism at one time, compare with cells of the same kind at a later time, and so on It should be clear that the number of such comparisons is limited only by the imagination of the biologist, at least at the moment, when details of so many genetic programs (drug response, development, tumorigenesis, ..) are incomplete. 2 Preliminary remarks Initially, comparative microarray experiments were done with few, if any replicates, and statistical criteria were not used for identifying differentially expressed genes. Instead, simple criteria were used such as fold-change, with 2-fold being a popular cut-off. This was sometimes done without regard to the variability present in the experiment, and, depending on the experiment, could be too liberal, naming genes that were not differentially expressed (false positives, or errors of the first kind), or too conservative, failing to identify genes that were differentially expressed (false negatives, errors of the second kind). The relative importance of false positives and false negatives, depends on the context: the aims of the experiment (e.g. were the investigators seeking broad patterns, or specific genes), and the follow-up experiments envisaged (e.g. validation of findings by a more precise technique). It did not take long for people to want to assign statistical significance to their findings concerning differentially expressed genes. Could p-values be attached, confidence statements be made, and so on? These questions raised a number of issues which were unfamiliar to the molecular biologists doing the experiments: replication, systematic versus random differences, multiplicity of tests, etc. 3 Preliminary remarks, cont. It was eventually realized on that biological replicates are important for reaching statistically sound conclusions, but even here the story is not simple, as many systematic features remain even in experiments with “independent” replicates. The fact that many thousands of comparisons were being carried out made use of traditional cut-offs (+/- 2SD, or p < 0.05) inappropriate also became clear to people. Strict control of type 1 error rates (we’ll be more precise in the next lecture) turned out to be asking too much in this microarray context, and different criteria for “controlling” errors rates have come to the fore, most notably false discovery rates (FDR). However, there still remains a need for appropriate theory. Modelling large-scale microarray experiments is not a simple task: the difference between assuming and actually having independence can be great. Where are we now? Depending on the context, some researchers can make use of traditional multiple testing procedures. Others make use of FDR notions, which are more widely applicable, but lead to weaker conclusions. Yet others have realized that is is unreasonable (given their sample sizes) to expect “statistical significance” for all their results, and instead seek 4 evidence of “biological significance”, and validation by suitable follow-up. Preliminary remarks, concluded Where are we now?, cont. Two issues have come to the fore in recent years. The first is the interpretation of the lists of genes determined (by whatever means) to be differentially expressed: What kinds genes are they, i.e. what is their function (DNA binding, protease,..) ? What cellular pathways or processes are involved (replication, cell death..)? Where in the cell are these genes operating (in ribosomes, mitochondria,..)? The question becomes: are genes of a given class over-represented in the list of differentially expressed genes? As there are many classes, there are many such questions, and so the issue of multiplicity comes up once more. The second is the determination of significance for sets of genes given a priori, attempting to answer some form of the question: is this (specific) set of genes differentially expressed? The idea here is that a particular set of genes (say those involved in oxidative phosphorylation) might have all or many changed a little, and that this pattern of change might be “significant”, although the individual changes are not. Of course we won’t begin with just one set of genes, but many, so multiplicity questions arise here too. In both of these questions, how we rank the genes will be important, and 5 the cut-offs less so. Some motherhood statements Important aspects of a statistical analysis include: • Tentatively separating systematic from random sources of variation • Removing the systematic and quantifying the random, when the system is in control • Identifying and dealing with the most relevant source of variation in subsequent analyses Only if this is done can we hope to make more or less valid probability statements. 6 The simplest cDNA microarray data analysis problem is identifying differentially expressed genes using one slide • • • • This is a common enough hope Efforts are frequently successful It is not hard to do by eye The problem is probably beyond formal statistical inference (valid pvalues, etc) for the foreseeable future….why? In the next two panels, genes found to be up- or down-regulated in an 8 treatment (Srb1 over-expression) versus 8 control comparison are indicated in red and green, respectively, on plots of the data from single hybridizations. Also depicted are “confidence lines” determined by different people, which claim to be able to delineate differentially expressed genes using just one hybridization (slide), the different lines corresponding to different methods and/or different “confidence” levels. What do we see? 7 Matt Callow’s Srb1 dataset (#5). Newton’s and Chen’s single slide method 8 Matt Callow’s Srb1 dataset (#8). 9 Newton’s, Sapir & Churchill’s and Chen’s single slide method The second simplest cDNA microarray data analysis problem is identifying differentially expressed genes using replicated hybridizations There are a number of different aspects: • First, between-slide normalization; then • What should we look at: averages, SDs, t-statistics, other summaries? • How should we look at them? • Can we make valid probability statements? 10 Apo AI experiment (Matt Callow, LBNL) Goal. To identify genes with altered expression in the livers of Apo AI knock-out mice (T) compared to inbred C57Bl/6 control mice (C). • 8 treatment mice and 8 control mice • 16 hybridizations: liver mRNA from each of the 16 mice (Ti , Ci ) is labelled with Cy5, while pooled liver mRNA from the control mice (C*) is labelled with Cy3. • Probes: ~ 6,000 cDNAs (genes), including 200 related to lipid metabolism. 11 12 Which genes have changed? When permutation testing possible 1. For each gene and each hybridization (8 ko + 8 ctl), use M=log2(R/G). 2. For each gene form the t-statistic: average of 8 ko Ms - average of 8 ctl Ms sqrt(1/8 (SD of 8 ko Ms)2 + (SD of 8 ctl Ms)2) 3. Form a histogram of 6,000 t-values. 4. Do a normal qq-plot; look for values “off the line”. 5. Permutation testing (next lecture). 6. Adjust for multiple testing (next lecture). 13 Histogram & normal qq-plot of t-statistics ApoA1 14 What is a normal qq-plot? We have a random sample, say ti, i=1, …,n, which we believe might come from a normal distribution. If it did, then for suitable  and , ((ti-)/), i=1,…n would be uniformly distributed on [0,1](why?), where  is the standard normal c.d.f.. Denoting the order statistics of the t-sample by t(1) ,t(2) ,….,t(n) we can then see that ((t(i) -)/) should be approximately i/n (why?). With this in mind, we’d expect t(i) to be about -1(i/n) +  (why?). Thus if we plot t(i) against -1((i+1/2/(n+1)), we might expect to see a straight line of slope about  with intercept about . (The 1/2 and 1 in numerator and denominator of the i/n are to avoid problems at the extremes.) This is our normal quantile-quantile plot, the i/n being a quantile of the uniform, and the -1 being that of the normal. 15 Why do a normal q-q plot? One of the things we want to do with our t-statistics is roughly speaking, to identify the extreme ones. It is natural to rank them, but how extreme is extreme? Since the sample sizes here are not too small ( two samples of 8 each gives 16 terms in the difference of the means), approximate normality is not an unreasonable expectation for the null marginal distribution. Converting ranked t’s into a normal qq-plot is a great way to see the extremes: they are the ones that are “off the line”, at one end or another. This technique is particularly helpful when we have thousands of values. Of course we can’t expect all differentially expressed genes to stand out as extremes: many will be masked by more extreme random variation, which is a big problem in this context. See next lecture for a discussion 16 of these issues. gene t index statistic 2139 -22 Apo AI 4117 -13 EST, weakly sim. to STEROL DESATURASE 5330 -12 CATECHOL O-METHYLTRANSFERASE 1731 -11 Apo CIII 538 -11 EST, highly sim. to Apo AI 1489 -9.1 EST 2526 -8.3 Highly sim. to Apo CIII precursor 4916 -7.7 similar to yeast sterol desaturase 941 -4.7 2000 +3.1 5867 -4.2 4608 +4.8 948 -4.7 5577 -4.5 Gene annotation 17 Useful plots of t-statistics 18 Which genes have changed? Permutation testing not possible Our current approach is to use averages, SDs, t-statistics and a new statistic we call B, inspired by empirical Bayes. It is equivalent to a moderated t-statistic. We hope in due course to calibrate B and use that as our main tool. We begin with the motivation, using data from a study in which each slide was replicated four times. 19 Results from 4 replicates Points to note One set (green) has a high average M but also a high variance and a low t. Another (pale blue) has an average M near zero but a very small variance, leading to a large negative t. A third (dark blue) has a modest average M and a low variance, leading to a high positive t. A fourth (purple) has a moderate average M and a moderate variance, leading to a small t. Another pair (yellow, red) have moderate average Ms and middling variances, and moderately large ts. Which do we regard most favourably? Let’s look at M and t jointly. 21 •M •t •t M Sets defined by cut-offs: from the Apo AI ko experiment 22 •M •t •t M Results from the Apo AI ko experiment 23 •M •t •t M Apo AI experiment: t vs average A. 24 What have we concluded? Using average M alone, we ignore useful information in the SD across replicates. Some large values are large because of outliers. Using t alone, we are vulnerable to being misled by small SDs. With 1000s of genes, some SDs will be very small. Some will be very large too. Formal testing can sort out these issues for us, but if we simply want to rank, what should we rank on? Neither average M nor t seems appropriate. One approach (SAM) is to inflate the SDs slightly. Another approach is based on the following empirical Bayes story. It can go further. There are a number of variants. 25 An empirical Bayes story Suppose that our M values are independently and normally distributed, and that a proportion p of genes are differentially expressed, i.e. have M’s with non-zero means. Further, suppose that the variances and means of these are chosen jointly from inverse chi-square and normal conjugate priors, respectively. Genes not differentially expressed have zero means, and variances chosen from the same inverse chi-squared distribution. The scale and d.f. parameters in the inverse chi-square are estimated from the data, as is a parameter c connecting the prior for the mean with that for the variances. We then calculate for the posterior probability that a given gene is differentially expressed, and find it is an increasing function of B over the page, where a and c are estimated parameters, and p is in the constant. 26 Empirical Bayes log posterior odds ratio (LOR)  2a 2 2   s  M  n  B  const  log 2 2a  s2  M   n 1  nc  Notice that for large n this approximately t=M./s . 27 B=LOR compared with t and M. 28 Extensions include dealing with • • • • Replicates within and between slides Several effects: use a linear model ANOVA: are the effects equal? Time series: selecting genes for trends 29 Summary (for the second simplest problem) • Microarray experiments typically have thousands of genes, but only few (1-10) replicates for each gene. • Averages can be driven by outliers. • t-statistics can be driven by tiny variances. • B = LOR will, we hope – use information from all the genes – combine the best of M. and t – avoid the problems of M. and t Ranking on B could be helpful. 30 Acknowledgments Yee Hwa Yang Sandrine Dudoit Ingrid Lönnstedt Natalie Thorne David Freedman Ngai lab, UCB Matt Callow (LBNL) 31

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download ppt