
Introduction to the Statistical Analysis of Microarray Data

Martina Bremer (1,4), Edward Himelblau (2), and Andreas Madlung (3)

1 Purdue University, Department of Statistics, West Lafayette, IN 47907
2 California Polytechnic State University, Biological Science, San Luis Obispo, CA 93407; email: [email protected]
3 University of Puget Sound, Department of Biology, Tacoma, WA 98416; email: [email protected]
4 Current address: San Jose State University, San Jose, CA 95192; email: [email protected]

Introduction: Studying Gene Expression

Knowing the transcriptional activity of a gene can give valuable insight into the function of the protein it encodes and the role it plays in an organism. Gene activity in the same individual can vary from tissue to tissue, between different developmental stages, or even from morning to night. Gene activity is influenced by the activity of other genes and the proteins they encode. Gene expression can change in response to outside factors, such as the environment or exposure of the organism to chemical substances, competitors, or pathogens.

The classical approach to measuring the activity of a gene has been to isolate messenger RNA (mRNA) and estimate the amount of mRNA of the gene of interest present at a given time in the organism. Traditionally, this has been done for one gene at a time.

Whole genome sequencing projects of many species, including humans, have provided information that allows researchers to distinguish every gene in the organism. The development of microarray technology has made it possible to survey the gene expression activity of thousands of genes at the same time by using short pieces of DNA, each uniquely representing one gene, and spotting them onto a solid support, such as a microscope glass slide. Using extremely small capillaries to apply these pieces of DNA, up to 25,000 genes can be represented on a single conventional 1.5 cm x 5 cm slide. Using these microscopic arrays of DNA spots, researchers can assess the relative amount of mRNA in a sample for all 25,000 represented genes (called the target spots) in the same time that it used to take to analyze the activity of a single gene.

Such technological advances have revolutionized the way molecular bioscience is done and have sped up the rate of new discoveries. However, they have also led to the rapid acquisition of huge amounts of data that require the use of biostatistics for analysis and validation of the collected data. In practice, gene activity is assessed by labeling mRNA that was extracted from an organism with fluorescent dyes. The labeled mRNA, known as the "probe", is applied to the glass slide and allowed to bind to its complementary spot on the array. This process is called hybridization. Subsequently, the unbound mRNA is washed off the slide. The slide is scanned, and the amount of fluorescently labeled mRNA bound to each spot is proportional to the activity of the gene it represents. In most cases, software analysis is then used to determine how much of a signal is due to biologically relevant processes and how much is due to technical "noise". In this lab activity you will learn how to analyze data obtained from a microarray experiment using statistical tests very similar to those that a commercial software package would perform.

Biological Use for Microarrays

Why would a researcher want to do microarray experiments? In essence, a microarray experiment can give useful information for any question that asks whether or not two different populations of cells express different sets of genes. For example: A researcher wants to find out which genes become active if a plant is subjected to prolonged drought stress (Figure 1). An appropriate experiment would be to have one set of plants growing in optimal conditions and a second set growing in the same conditions, except with limited water.
After a few days under these conditions, tissue is harvested from both sets (treatment: no water; control: well-watered) and mRNA is extracted. As described later in more detail, a common method used in microarray analysis is to label mRNA from the treatment group with one color dye and the control mRNA with another color. Equal amounts of mRNA are then used for the hybridization to the array. If a scanner with the capacity to detect two colors is used, relative amounts of mRNA of each gene can be compared between the control group and the treatment group. Genes up-regulated ("turned on") in response to drought stress will show a stronger signal of one color (treatment) than the other color (control). After statistical analysis of the data obtained for all of the 25,000 genes, a gene list is generated allowing the researcher to know which genes are activated by the treatment. In our example, these are genes that become active in response to drought stress.

Figure 1. Comparing gene expression using microarrays. mRNA is extracted from a plant that has undergone an experimental treatment (T, drought stress in this case) and an untreated control (C). The mRNA undergoes reverse transcription to generate cDNAs. A different fluorescent molecule is used to label each of the cDNA pools. The labeled cDNAs are then hybridized to a microarray. The microarray consists of a glass slide on which thousands of distinct DNA sequences have been affixed. Each dot (or "feature") on the slide represents the sequence of a different plant gene. After the unbound probe is washed away, a special slide scanner excites each feature on the array with a laser and measures the fluorescent signal emitted. The more cDNA is bound to a spot, the greater the signal will be. The magnified computer screen at the lower right shows the possible results for each feature. A red spot (A) represents a gene that is only expressed in the control. A green spot (B) represents a gene that is only expressed in the treated plant. A yellow spot (C) represents a gene that is expressed in both treated and control plants. A dark spot (D) indicates that the corresponding gene is not expressed in either the control or treated plants.

What is statistics?

Statistics is a collection of procedures and formulas that allow us to make decisions when faced with uncertainty. Where does the uncertainty come from? Many experiments that address the same problem or question can differ in their outcomes when conducted by different people or with different material. Statisticians call this variation. In microarray experiments, the two main sources of variation that cause the uncertainty are:

Biological variation
Different organisms have different gene expression profiles, or in other words, the activity of their genes varies. The measured expression levels hence vary from individual to individual used in the study. (Figure 2)

Technical variation
Due to human error, there can be slight variation in microarray manufacturing and hybridization of the mRNA to the slide. Even if an experiment calls for applying the same amount of mRNA from the same organism to two identical slides, the measurements may be different. (Figure 2)

Carefully conducted experiments can keep the technical variation to a minimum. The biological variation between individuals in the experimental groups, however, cannot be influenced.

Mean, median, and standard deviation

Three of the most important statistical concepts are the mean, median and standard deviation of a set of measurements. While the mean and median are used to describe the center of the measurements, the standard deviation is used to describe the spread.

Mean: the average of the n measurements, or

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Median: the middle observation. (If the number of observations n is odd, the middle number is used. If n is even, the average of the two middle observations is used.)
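The mean and median can be checked on a small set of numbers. The following is a minimal sketch using Python's standard `statistics` module; the measurement values are made up purely for illustration and do not come from any experiment. It also prints the sample standard deviation, the third quantity named in this section.

```python
import statistics

# Hypothetical measurements, chosen only for illustration.
measurements = [2.0, 4.0, 4.0, 5.0, 10.0]

mean_value = statistics.mean(measurements)      # sum of the values divided by n
median_value = statistics.median(measurements)  # middle observation (n is odd here)
sd_value = statistics.stdev(measurements)       # sample standard deviation (divides by n - 1)

# With an even number of observations the two middle values are averaged.
even_median = statistics.median([1.0, 2.0, 3.0, 4.0])

print(mean_value, median_value, sd_value, even_median)
```

Note that the single large value (10.0) pulls the mean above the median, which is one reason both are reported.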
Standard deviation: measures the average distance of the n observations x_1, …, x_n from the mean (x̄), or

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

Figure 2. Sources of variation in gene expression studies.

Overview of a Microarray Experiment

It is the goal of many microarray experiments to compare the gene expression levels of a treatment group with those of a control group. For this purpose, mRNA is extracted from the cells of several individuals in each group. The samples are labeled with red and green fluorescent dyes and allowed to bind to the DNA on the array (target spots).

The scanner divides the features on the array into pixels, and for each pixel a computer records the scanned red and green intensity. Usually, the spots are not uniform. The intensity in both the red and the green channels may vary over these pixels. The areas of very low intensity between the spots are called the background (Figure 3).

Figure 3. Spot intensity. After labeled cDNA is hybridized to a microarray, each fluorescent label is individually excited and detected by the slide scanner. The scanner divides the spots into pixels. The spots are not uniform and there can be variable intensity from pixel to pixel within each spot. Also, there are areas of low intensity, "background" fluorescence, in the regions between spots. After scanning, the red and green signals are superimposed by the computer, which generates a composite yellow spot (notice the distinct red and green pixels in the background of the superimposed image).

Microarray Data Analysis

What do the many numbers in a microarray experiment output file actually mean? When a microarray experiment is conducted, the result is usually a large data file with many columns and as many rows as there were spots on the array. It may look something like Figure 4. Usually, the output file contains much more information than shown here, but we will only use the pictured columns for our analysis.
The first three columns (labeled "Block", "Column", and "Row") tell us the position of the scanned spot on the array. The column labeled "Name" contains the name of the gene that was spotted there. The column labeled "ID" contains information pertaining to the exact part of the gene that was used to originally produce the target spot on the glass slide. The red and green intensities recorded by the scanner are reported in the next four columns. "F" stands for foreground and "B" for background. The numbers 635 and 532 represent the wavelengths of the red and green laser light, respectively. For example, the numbers in the F635 Median column are the median values of the scanned foreground pixels when excited with the red laser light.

Figure 4. Example of the output file of a microarray experiment. After scanning the slide, the intensities of the red and green signals are recorded separately (even if the spot looks yellow to the eye, it is composed of green and red labeled probe). The data are stored in a so-called .gpr file that can be copied directly into a spreadsheet, in this case a Microsoft Excel file.

In addition to measuring the spot itself, the scanner also measures so-called background intensity. This is in essence probe that bound to the silicate of the glass slide and falsely increases the signal of each spot by the same amount. This background intensity needs to be subtracted to get an accurate reading of the spot intensity based on probe-target interaction only. The columns that will be most helpful in our analysis are therefore the ones labeled "F635 Median – B635" and "F532 Median – B532". They contain the background corrected median red and green intensities for each spot on the array.

Normalization

The results of a microarray experiment are obviously influenced by technical variation. This means that the measured intensities vary on the array(s) in a systematic manner.

Often more than one slide is used to conduct the experiment. If different amounts of probe are applied to the slides, the intensities on some slides may be consistently higher than the intensities on other slides, even though they measure the same genes.

Normalization means to mathematically manipulate the data to make it uniform in a variety of ways. There are several different ways this normalization can be done.

In many microarray experiments, the results are reported as log-ratios of the two intensity measurements. If G_i represents the green intensity for gene i and R_i represents the red intensity for gene i, then two quantities commonly used are:

$$M_i = \log_2 \frac{R_i}{G_i} \qquad \text{(equation 1)}$$

$$A_i = \frac{1}{2}\left(\log_2 R_i + \log_2 G_i\right) \qquad \text{(equation 2)}$$

(If you need to re-acquaint yourself with logarithms, now would be a good time to do so. If you need a calculator to convince yourself of the values discussed below you can use one online at: http://www.rechneronline.de/logarithmus/. Notice that the base used here is 2, not 10.)

The quantity M_i (the log-ratio) describes the relationship between the two groups. If the intensity in the red and green-labeled group is the same, then M_i will be zero. If the red intensity is twice as big as the green, then M_i will be equal to 1. If, on the other hand, the green intensity is twice as big as the red, then M_i will be equal to –1. The quantity A_i describes the overall intensity observed for gene i. The quantities M_i and A_i are very useful in the normalization of microarray data.

Before conducting the analysis for a microarray experiment, researchers often take a look at what is called the "MA-plot". For every feature i on the array, the values M_i and A_i are computed and plotted in an xy-plot (Figure 5).

Figure 5. Using an MA plot to visualize normalization on one array. Highly up- or down-regulated genes are above or below the x-axis. The diagonal lines to the left are called "fishtails". Fishtails are an artifact of changing the value for spots that have higher background than foreground.

It is known that the green and red dyes used to label the samples interact differently with certain genes. The dyes have different light stability and may vary in efficiency. That means that the dye molecules are more likely to bind to certain genes than others. On average, we would like the intensities for both dyes to be about the same. That means that, on average, the log-ratios M_i should be about zero. Normalization means that one computes the average M-value for all genes spotted on the array and then makes sure that the average will be zero.

#1: Do this… to normalize microarray data.

Find a computer and open an Excel spreadsheet. Type in the values as you see them below. Note that in the example below, the six features shown all represent the same gene.

The first step is to compute the log-ratios of red (635) to green (532) signal. In Excel, you can use the command "LOG(x,2)" to compute log_2(x), or you can calculate them with a calculator or the website given earlier. Write the results into a new column and label it "M". Write the M-values on this sheet in column G.

The goal of normalization is to ensure that the average of the M-values is set to zero. Next compute the average of all six M-values.

The average is: _______________________

Now subtract this average from the M-values for all genes in the example above to "correct" them. Two values have already been entered so you can see if you are on the right track. Fill in the rest in column G below.

The point of normalization is to center the average on the zero value. Here you have calculated the average M-value and then subtracted it from all M-values, making it, on average, zero. This procedure in this example shifts all data points up a bit, as you can see on the MA plot before and after normalization.
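The spreadsheet steps above can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six (red, green) intensity pairs are hypothetical, since the handout's data table is not reproduced here, and the function names `log_ratio`, `mean_log_intensity` and `center_m_values` are made up for this sketch.

```python
import math

def log_ratio(red, green):
    """M-value from equation 1: log2 of the red/green intensity ratio."""
    return math.log2(red / green)

def mean_log_intensity(red, green):
    """A-value from equation 2: average of the two log2 intensities."""
    return 0.5 * (math.log2(red) + math.log2(green))

def center_m_values(m_values):
    """Subtract the average M-value so the corrected values average to zero."""
    average = sum(m_values) / len(m_values)
    return [m - average for m in m_values]

# Hypothetical background-corrected (red, green) intensities for six spots.
spots = [(800, 400), (300, 300), (150, 600), (1000, 250), (500, 500), (200, 100)]

m_values = [log_ratio(r, g) for r, g in spots]
a_values = [mean_log_intensity(r, g) for r, g in spots]
normalized = center_m_values(m_values)

for m, a, m_corrected in zip(m_values, a_values, normalized):
    print(f"M = {m:6.3f}   A = {a:6.3f}   corrected M = {m_corrected:6.3f}")
```

In this made-up data set the average M-value is positive, so the correction shifts every point down by the same amount; the sign and size of the shift depend entirely on the data.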
The biological reason to normalize in this case was that one dye, because of its chemical stability and not because of the expression of the genes it labels, always gives a higher value than the other dye, introducing an error equally great for all data points.

Normalization by Dye-Swap Design

A better way to deal with uneven binding of the dyes to certain genes is to carry out experiments as dye-swaps (Figure 6). In them, the probe from each group (treatment and control) is split into two portions and labeled with different dyes. The labeled samples are then hybridized crosswise (red "treatment" with green "control" and green "treatment" with red "control") onto two arrays.

To normalize the data, obtain the M-value for each feature on each slide (as demonstrated in the "#1: Do this…" section above). For consistency, we compute the M-values for one array as the red/green log-ratio and for the other array (in which the dyes are swapped) as the green/red log-ratio. Then average the M-values from the two arrays for each feature:

$$M_i = \frac{1}{2}\left(M_i^{(1)} + M_i^{(2)}\right)$$

Here, M_i^{(1)} is the log-ratio for feature i on array 1, and M_i^{(2)} is the log-ratio for the same feature (with dyes swapped) on array 2.

Figure 6. Experimental design for a dye-swap experiment. In this design, the treatment and control are first labeled with one set of dyes and hybridized to array 1. To account for different labeling efficiencies of the two dyes, the same probes are now labeled with the other dye (the dyes are swapped) and subsequently hybridized to array 2.

#2: Do this… to normalize data from a dye-swap experiment.

Suppose you have data from a dye-swap experiment. There are two (very small) arrays. Each array contains six spots for the same gene. Suppose for each spot, you have already computed the log-ratio M as in the previous example.

To conduct dye-swap normalization, average the M-values for each spot on the two arrays.
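The averaging step just described can be sketched as follows. This is a minimal illustration, not the handout's data: the function name `dye_swap_average` is made up, the first pair of values (-0.16 and -0.42) is the one used in the handout's worked example, and the remaining values are hypothetical. The sketch assumes the M-values on array 2 were already computed as green/red log-ratios, so both arrays point in the same direction.

```python
def dye_swap_average(m_array1, m_array2):
    """Average the two log-ratios of each spot across the dye-swapped arrays."""
    if len(m_array1) != len(m_array2):
        raise ValueError("both arrays must contain the same spots")
    return [0.5 * (m1 + m2) for m1, m2 in zip(m_array1, m_array2)]

# First pair taken from the worked example; the other values are hypothetical.
array1 = [-0.16, 0.50, -1.20]
array2 = [-0.42, 0.30, -1.00]

corrected = dye_swap_average(array1, array2)
print([round(m, 2) for m in corrected])
```

Because a dye bias pushes the two arrays in opposite directions once the second ratio is flipped, averaging the pair cancels a bias that is the same on both arrays.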
For example, after normalization, the corrected M-value for the spot in column 1/row 3 is (-0.16-0.42)/2 = -0.29. This way, an M-value can be obtained for every spot on the array. In the table below, fill in the M-values for each spot:

Corrected M-values: -0.29 -0.94 -0.26 0.42 -2.47 -1.34

Drawing Conclusions

It is often the goal of microarray experiments to identify genes that become either more or less expressed in response to a treatment that compares two different states (e.g., drought-stressed and well-watered plants). To carry out the experiment, the gene expression levels of plants that were subjected to drought stress (treatment group) are compared with those of plants that were watered well (control group). For every gene on the microarray we want to decide whether the expression levels in the two experimental groups are (significantly) different from each other or not. How can we make sense of the many thousands of measurements collected simultaneously in a microarray experiment?

First, we will analyze each gene separately. Our goal is to decide whether the expression level of the gene is different in the treatment and the control group. If the green and red intensities are different, their quotient R/G will not be equal to one. If the quotient is not equal to one, then the M-value for the spot, which is the logarithm of the quotient (see equation 1), will not be equal to zero. It will be positive if the red intensity is greater than the green, and negative if the green intensity is greater than the red (see Figure 5).

The researcher is trying to detect a difference between the treatment and control group. Very few spots on the array will have equal red/green intensities (M-values equal to zero). The challenge in analysis is to determine whether the differences are due to biological or technical variation, or whether they reflect true differences in gene expression between the samples. This is achieved by analysis of M-values. But how far away from zero do these M-values have to be so that we are convinced that the result is due to a real difference in the experimental groups and not just due to variation? (Figure 7)

To decide whether the M-values for a gene are far enough away from zero so that we would call the gene differentially expressed in the treatment and control group, we also have to look at the variation of these measurements. Only if the distance from zero of the average of our measurements is large compared to the variation can we assume that the gene has different expression in the two experimental groups.

Figure 7. Is the average M equal to zero? Multiple repeats for the same gene will give different results. Statistical tests provide an answer as to whether or not the mean of the repeats is significantly different from zero, or, in other words, whether the treatment resulted in differences in gene expression from the control. On the left, different means for observations with the same variation pattern are shown. The variation in the measurements is important in deciding whether the mean of a number of observations is significantly different from zero. If the variation is small, we may be more inclined to assume a non-zero mean than if the variation is large. On the right, a greater sample size may or may not (in this case not) result in greater confidence that the mean is different from zero.

Statistical Decision Making

Hypothesis tests are an important tool for statistical decision-making. They are used to answer a "Yes/No" question about a population. But instead of being able to observe the whole population, we only get to see a small sample.

For example, in a criminal trial, the defendant is considered innocent until proven guilty. During the trial both sides (prosecution and defense) present evidence, and at the end of the trial the jury members have to decide whether the evidence is enough to convict the defendant or not.

Suppose we find a bloody knife in the defendant's closet. A statistician would now ask "how likely is it for something like this to happen to an innocent person?" If the answer were "Not very likely", then the statistician would conclude that the defendant is probably guilty.

Hypothesis Testing:

A statistical hypothesis test is similar. A null hypothesis is a statement about a parameter. Like the innocence assumption in the criminal trial, it is usually of the form "there is nothing unusual happening here".

The alternative hypothesis is the opposite of the null hypothesis; this is the statement that the scientist really suspects to be true. Data is collected that will be used as evidence.

The scientist now takes on the role of prosecutor. If it is unlikely to see something as extreme as (or more extreme than) our data from variation alone if the null hypothesis were true, we can conclude that the data does not conform to this theory, and we then reject the null hypothesis. (This does not mean that we have proved that the alternative hypothesis is true.) If, on the other hand, outcomes like the one we observed happen all the time if the null hypothesis were true, then we cannot reject the null hypothesis.

Suppose a study finds that 15% of all innocent people keep bloody knives in their closets. Would you declare our defendant guilty in this case? What would you do if you knew that only 0.00001% of all innocent people kept bloody knives in their closets?

What should the probability of your observations be, so that you would be willing to accept or reject a hypothesis? Most often, the answer to this question depends on the problem. A popular value that is used in many fields is 0.05:

$$p = P(\text{observing data this extreme} \mid \text{null hypothesis is true}) \quad \begin{cases} < 0.05 & \text{Reject the null hypothesis} \\ \ge 0.05 & \text{Do not reject the null hypothesis} \end{cases}$$

The probability p is called the "p-value" of the test. The smaller the p-value is, the less likely it would be to obtain the data you have if the null hypothesis were true.

To figure out the probability of observing extreme (or unusual, atypical) data, we have to have a quantity whose statistical distribution (behavior) we know and whose value we can compute from the sample data. Such a quantity is called a "test statistic".

In a microarray experiment, we want to identify genes whose expression values are different in the treatment and control group. We will conduct a hypothesis test for each gene that is spotted on the array. You can think of the criminal trial as an analogy to what we are going to do. Fill in the table below with the steps a biologist would have to do to conduct the hypothesis test.

| | Criminal Trial | Gene Expression Experiment |
|---|---|---|
| Null hypothesis | Assumption of innocence | No difference between gene expression in control and treatment plants |
| Alternative hypothesis | Assumption that the defendant is guilty | Gene expression differs between the control and treated plants |
| Data | Evidence (such as a bloody knife in the defendant's closet) | Gene expression measured by red/green fluorescence levels |
| p-value | Probability of finding incriminating evidence on an innocent person | Probability that expression differences this large result from biological or technical variation alone, with no real difference in expression |
| Rejecting the null hypothesis | Jury finds the defendant guilty | Declare the gene differentially expressed |
| Not rejecting the null hypothesis | Jury finds the defendant not guilty | Declare that the gene is not differentially expressed |

Hypothesis Test for Log-Ratios
In microarray experiments, especially if the data has been normalized, it will be in the form of log-ratios of red and green intensities (M-values). The file will have one M-value for every spot on the array. Each gene will be spotted several times on the array, so that for each gene we have several M-values. To decide whether a gene is expressed differently in the two groups (treatment and control), we will decide whether the M-values for that gene are close to zero (on average) or not. To make this decision, we will also have to take the variance of the observations into account. A t-test will allow you to do this.

Suppose that you repeatedly measure a characteristic which has mean zero. If you have n measurements with average x̄ and standard deviation s, then the quantity

$$t = \frac{\bar{x}}{\sqrt{s^2/n}}$$

follows a t-distribution with n − 1 degrees of freedom. The degree of freedom is a number that characterizes the distribution (behavior) of the test statistic. The "normal" (or typical) values are always those close to zero. The unusual values are the ones in the tails of the distribution, either large positive or large negative numbers.

Large positive values of the test statistic mean that the log-ratio is positive, which means that the red intensity is much higher than the green (meaning that expression of the gene is higher in the sample whose mRNA was labeled with red dye than in the sample whose mRNA was labeled with green dye). Large negative values of the test statistic mean that the log-ratio is negative, which means that the green intensity is much higher than the red.

Figure 8. t-distribution for different degrees of freedom. The degrees of freedom depend on the experimental setup, certain assumptions, and the number of observations.

How large is large?

How large (or small) will a test statistic value need to be so that we can call it unusual? Most researchers work with a significance level of 5% (or 0.05). They call an observation unusual if its p-value is smaller than 5%. That means that the test statistic value falls into the outer 5% tail area of the distribution in the graph of the t-distribution above. If that occurs, one can safely argue that the two values (for red and green labeled RNA) differ from each other in a "statistically significant" manner between the treatment and the control group.

#3: Do this… conduct a t-test for a microarray experiment.

We need to make a decision for each gene (represented by several spots on the array). Is the gene expressed differently in the treatment and control group? Therefore, for each gene, we will carry out the hypothesis test separately.

Before the Test: Pick a gene. Find all the red and green intensity values for this gene in your data file. Compute all the M-values for this gene.

Step 1: We need to set up the null hypothesis and alternative hypothesis. Remember that the null hypothesis means that the treatment had no effect (on this gene), and the alternative hypothesis is what the researcher is really expecting to support. A good experiment will always provide valuable information, regardless of the outcome of the hypothesis test. In your answer sheet write down the two hypotheses for your experiment:

Null Hypothesis:
Alternative Hypothesis:

Step 2: In this example we have already collected the data. We will use the microarray measurements as estimates for the gene expression levels in the two groups.

Step 3: We want to find a p-value for the gene. That means that we have to compute a test statistic value and then decide how usual or unusual it is.

Example: Suppose the collected (and normalized) microarray data of the gene At1g01000 looks like this (note: there are six values because this gene is spotted onto the array in six locations):

We have six M-values for the gene At1g01000 in this table.
First, compute the average of the six observations: x̄ = _________ (fill in the value here) and the standard deviation s = _________ (fill in the value here). See the formula for the standard deviation given earlier, or calculate it in Excel or with your calculator. Since we have six observations, n = 6. Now we can compute the value of the test statistic as:

$$t = \frac{\bar{x}}{\sqrt{s^2/n}} = \frac{1.15}{\sqrt{1.28^2/6}} = 2.08$$

Calculate t and fill in the value here: ___________

The degree of freedom that describes the behavior of the test statistic in this case is df = 6 − 1 = 5. To find the p-value, we have to find the percentage of cases in which the t-test statistic with df = 5 would take on more extreme values than the t-value we observed. Extreme values are the ones far away from zero.

In the past, these values had to be looked up in tables. Today, Excel and other software programs have them stored in their statistics package. The p-value can be found with the Excel command "=TDIST(2.08, 5, 2)". In the example above, the exact p-value (red shaded tail area of the distribution) is 0.0921 or 9.21%.

To find a p-value using Excel, open an Excel spreadsheet, click on any empty cell and enter "=TDIST(absolute value of your test statistic, df, 2)". The "2" stands for two-sided, which means that you want the red area in both tail ends. In this example the absolute value (no minus sign) of the test statistic is 2.08 and the degree of freedom is df = 6 − 1 = 5. Now calculate the p-value using the t-value you calculated above.

Step 4: What conclusion can we draw? The p-value is the probability of observing data as extreme/unusual as what we saw if the gene expression in the two groups were the same. Our p-value of 9.21% is quite large (bigger than 5%). That means that we would get observations such as these by random chance alone, without any real difference in gene expression, almost 10% of the time. Hence, our data is nothing unusual and we do not reject the null hypothesis (equal expression in both groups) for gene At1g01000.
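For readers without Excel, Steps 3 and 4 can be sketched in Python using only the standard library. This is an illustrative sketch, not the handout's method: the function names are made up, the three M-values passed to `t_statistic` are hypothetical, and the p-value is obtained by numerically integrating the density of the t-distribution with Simpson's rule as a stand-in for the Excel TDIST command.

```python
import math
import statistics

def t_statistic(m_values):
    """t = mean / sqrt(s^2 / n), testing whether the mean M-value is zero."""
    n = len(m_values)
    mean = statistics.mean(m_values)
    s = statistics.stdev(m_values)  # sample standard deviation (n - 1 in the denominator)
    return mean / math.sqrt(s ** 2 / n)

def t_density(x, df):
    """Probability density of the t-distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p_value(t, df, steps=2000):
    """Area in both tails beyond |t|; steps must be even for Simpson's rule."""
    upper = abs(t)
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    central_area = 2 * (h / 3) * total  # area between -|t| and +|t|
    return 1 - central_area

# Test statistic for a small set of hypothetical M-values.
t_example = t_statistic([1.0, 2.0, 3.0])

# p-value for the test statistic and degrees of freedom quoted in the text.
p_value = two_sided_p_value(2.08, 5)
differentially_expressed = p_value < 0.05
print(round(p_value, 4), differentially_expressed)
```

For t = 2.08 and df = 5 the result should agree closely with the value of 0.0921 quoted above, so the gene is not declared differentially expressed at the 5% level.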
Results of Experiment: To determine the p-values for the other genes spotted on the microarray you would repeat steps 1 - 4 above. This would provide us with a p-value for each gene on the array.

| Gene name | p-value | Differentially expressed at level 5%? |
|---|---|---|
| At1g01000 | 0.0921 | |

H QUESTIONS MICROARRAYS 5 points

Do this #1: The average is: _______________________

Do this #2: The corrected M-values are:

Corrected M-values: -0.29

Do this #3: In your answer sheet write down the two hypotheses for your experiment:

Null Hypothesis:
Alternative Hypothesis:

First, compute the average of the six observations:

a) x̄ = _________ (fill in the value here)

b) s = _________ (fill in the value here)

Since we have six observations, n = 6. Now we can compute the value of the test statistic as:

$$t = \frac{\bar{x}}{\sqrt{s^2/n}} = \frac{1.15}{\sqrt{1.28^2/6}} = 2.08$$

c) Calculate t (fill in the value here): ___________

d) Now calculate the p-value using the t-value you calculated above: _________________

e) Given your calculated p-value, is the gene differentially expressed between treatment and control?
