Statistical Analysis of Arrays
Written by BIO 480 student Katie Criswell; modified by T. Rife

Before Class

Think About: What should your microarray data look like? How many yellow spots versus red or green? What does this mean your average ratio should be?

Graph It: Use Excel to graph your red versus green data. What does it look like? Is it what you expected? Be prepared to share this graph in class!

Consider: What kinds of experimental mistakes might throw your data off?

Normalization is the technique that corrects for variation in microarray results that is caused by the microarray technology rather than by actual biological differences. There are many different sources of error that may cause this variation. Variability could be caused by the manufacturing process of the probe DNA, the amount of DNA spotted on the slide, or the ability of cDNAs to bind to the array. Dye bias could also arise from the physical properties of the dye, such as decay or the ability to hybridize with cDNAs. Hybridization of the dyes to the cDNAs could be affected by humidity, dust, salts, or other molecules. Different scanning settings could cause imbalances between the red and green dyes. For example, higher scanning intensities improve the quality of the signal but increase the risk of saturation.

All of these factors can affect the measured expression intensities, so normalization must be considered when analyzing the data. By minimizing the variance that the technology introduces in each microarray, the data can be analyzed with more confidence and better used to calculate actual biological differences.

Mean and Median Normalization are used to normalize within microarrays to help account for these technical problems. Mean or Median Normalization must be done before you normalize between microarrays. We suggest that you do both mean and median normalization. Then use another normalization technique to normalize between microarrays.
After median normalization, a t-test should be done that compares the expression levels of a gene between the different microarrays.

Before any normalization is done, some procedures must be applied to each data set. Normalizing will be done using the expression ratios for each gene.

First, subtract each gene’s red and green background from its red and green intensity. Save your data sheet. Now create a new data sheet in which you remove any gene whose red or green value is 0 or negative after subtracting the background, because one can’t divide by 0 to make a meaningful ratio, and the log of a negative ratio is undefined. If a gene still has positive red and green intensity values after the subtraction, divide the mutant color by the wild type color to get an expression ratio. Transform this ratio into a log base 2 scale. Transforming into log base 2 reduces the scale of the differences in the data sets by transforming the data to a 0 to 16 numeric scale. It also lets us better visualize differences between small ratios, such as 0.1 vs. 0.5. This makes the data more suitable and easier to work with.

Try to work on some of this before class with your group and make sure you can work the following problems:

What is the log to the base 2 of:
a. 1
b. 2
c. 4
d. 0.5 or ½
e. 0.25 or ¼

In Class

We will spend some time normalizing our data, but read through this technique before class.

Mean Normalization for Expression Ratio Values

From each microarray, 2 data sets (top half and bottom half of the microarray) are available to be used for analysis. For this experiment, 2 microarrays were used, giving us 4 different data sets. Each data set was compiled into a matrix. An example of a matrix is in the figure below.

“X” denotes a gene identity.
“N” equals the number of data sets used.
“P” equals the number of genes in each microarray.

The first column (X11 … Xp1) holds the expression ratios for each gene from the data set that you choose as Data Set #1.
The second column (X12 … Xp2) holds the expression ratios for each gene from the data set that you choose as Data Set #2. A row (Xp1 … Xpn) holds all the values for the specific gene assigned to that row from all the data sets in the collection.

The log base 2 values of all the expression ratios in each data set are used to find the mean value of each microarray data set. The mean expression ratio is calculated for each data set (column). For example, when calculating the expression ratio means M1 … Mn:

M1 equals the mean expression ratio for data set #1 (Column #1)
Mn equals the mean expression ratio for the last data set (Column #n)

Mean expression ratios should be calculated for each data set. After a mean is calculated for a data set, that data set’s mean expression ratio is subtracted from each of that data set’s gene expression ratio values. For example, when subtracting the mean expression ratio from each individual gene’s expression ratio:

X11 - M1  …  X1n - Mn
    :             :
Xp1 - M1  …  Xpn - Mn

All new expression ratio values after mean normalization for one data set should be graphed for frequency in a histogram. You will have “n” histograms for “n” data sets. A boxplot of the values from each microarray data set can also be compared after mean normalization.

To do a histogram in SPSS (Burruss Lab), copy a data set’s values into a column in the SPSS spreadsheet. Under “Graphs”, choose “Interactive” and then “Histogram”. “Count” will be your independent variable and your data set’s values will be the dependent variable.

Figure 1. Histogram before Log Base 2 transformation.

Figure 1 represents the expression ratios of genes before they were transformed into log base 2 values. The graph is right skewed. This is because most genes do not have a high expression level in this microarray.

[Histogram: Frequency versus Green Intensity Values of Data Set #1 Without Normalization]

Figure 2.
Histogram for Data Set #1 after Log Base 2 transformation.

Figure 2 represents the log base 2 values before any normalization is done. Transforming the data into log base 2 brings the values closer to a normal curve.

[Histogram: Frequency versus Green Intensity Values of Data Set #1 After Mean Normalization]

Figure 3. Histogram of Data Set #1 after Mean Normalization

After mean normalization, Figure 3 shows that the graph has shifted so that most of the data lies around zero.

Boxplots can be done with each data set to see the overall change after mean normalization for each of the data sets. Boxplots must be done in SPSS (Burruss Lab). Copy each data set’s values into separate columns. Under “Graphs”, choose “Boxplots”. Next, choose “Simple” and “Summaries of Separate Variables”. Under “Boxes Represent”, drag all your data sets into the box. This should give a graph comparing boxplots of each of your data sets.

[Boxplot: DataSet1, DataSet2, DataSet3, DataSet4]

Figure 4. Boxplot of the values of each of the four Data Sets used in the matrix.

Before normalization is performed, outliers can be seen in Figure 4 and the data ranges of the data sets are not similar.

[Boxplot: DataSet1, DataSet2, DataSet3, DataSet4]

Figure 5. Boxplot of each of the 4 Data Sets used in the matrix after mean normalization.

After mean normalization, the data ranges of the data sets are more similar and easier to compare with each other.
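The preparation steps and the mean normalization described above can be sketched in Python as a rough illustration. This is not part of the original Excel/SPSS workflow: the function names and the example numbers are made up, and marking dropped genes as NaN (instead of deleting their rows) is an assumption made here so that the rows of different data sets stay lined up.

```python
import numpy as np

def log2_ratios(mutant, wild_type):
    """Log base 2 of mutant / wild type for background-subtracted intensities.

    Genes whose mutant or wild type value is 0 or negative are marked NaN
    (an assumption of this sketch; the handout removes them from the sheet).
    """
    mutant = np.asarray(mutant, dtype=float)
    wild_type = np.asarray(wild_type, dtype=float)
    keep = (mutant > 0) & (wild_type > 0)
    ratios = np.full(mutant.shape, np.nan)
    ratios[keep] = np.log2(mutant[keep] / wild_type[keep])
    return ratios

def mean_normalize(matrix):
    """Subtract each column's mean (M1 ... Mn) from every value in that column.

    Rows are genes and columns are data sets, as in the handout's matrix.
    """
    matrix = np.asarray(matrix, dtype=float)
    column_means = np.nanmean(matrix, axis=0)  # one mean per data set
    return matrix - column_means

# Hypothetical background-subtracted intensities for four genes.
example = log2_ratios([400.0, 100.0, 0.0, 200.0], [100.0, 400.0, 50.0, -10.0])
print(example)  # the last two genes are NaN because a value was 0 or negative
```

After mean normalization, each column's mean is zero, which is why the histogram in Figure 3 is centered around zero.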
Median Normalization for Expression Ratio Values

Median Normalization must be used after using either Mean Normalization or Standard Deviation Normalization. The data sets of each microarray’s normalized data must be compiled into a matrix.

From each microarray, 2 data sets (top half and bottom half) are available to be used for analysis. For this experiment, 2 microarrays were used, giving us 4 different data sets. Each data set was compiled into a matrix. An example of a matrix is in the figure below.

“X” denotes a gene.
“N” equals the number of data sets used.
“P” equals the number of genes in each microarray.

In other words, the first column (X11 … Xp1) holds all the normalized expression ratio values in data set #1, and the last row (Xp1 … Xpn) holds all the normalized expression ratio values for that gene across all the data sets.

The normalized expression ratio values are used to find the median values of each microarray data set. For example, when calculating the expression ratio medians:

M1 equals the red intensity median for genes X11 … X1n, or M1 equals the median of all of this specific gene’s expression ratio values from the compiled matrix.
Mp equals the red intensity median for genes Xp1 … Xpn, or Mp equals the median of all of this specific gene’s expression ratio values from the compiled matrix.

You will have “P” medians, one for each of the “P” genes.

A median of all the expression ratio medians was then calculated. This median is Mm.

Mm equals the median of all combined red medians M1 … Mp

Each gene’s expression ratio was then multiplied by a ratio. For example, when calculating the new expression ratio value for each gene in data set #1, the expression ratio value was multiplied by this ratio:

Ratio = (Mm / A1)

Mm equals the median of the expression ratio medians
A1 equals the median of the expression ratio values for genes X1,1 … Xp,1

(A1 is different from M1. A1 is the median of all the expression ratio values in Data Set #1.
In this experiment, we had A1-A4 because we had four different data sets.)

*this same ratio is used for each gene in data set #1*

Next, when calculating the new expression ratio values for the genes in data set #2, each gene’s expression ratio value was multiplied by this ratio:

Ratio = (Mm / A2)

Mm equals the median of all expression ratio medians
A2 equals the median of the expression ratio values for genes X1,2 … Xp,2

*this same ratio is used for each gene in data set #2*

Do this again for data set #3 and so on.

All new expression ratio values after median normalization for one data set should be graphed for frequency in a histogram. You will have “n” histograms for “n” data sets.

To do a histogram in SPSS (Burruss Lab), copy a data set’s values into a column in the SPSS spreadsheet. Under “Graphs”, choose “Interactive” and then “Histogram”. “Count” will be your independent variable and your data set’s values will be the dependent variable.

Figure 1. Histogram of Values before Log Base 2 Transformation

Figure 1 represents the expression levels of genes before they were transformed into log base 2 values, and the graph is right skewed. This is because most genes do not have a high expression level in a microarray.

[Histogram: Frequency versus Green Intensity Values of Data Set #1 Without Normalization]

Figure 2. Histogram of Values for Data Set #1 after Log Base 2 transformation.

Comparing Figure 1 and Figure 2, simply transforming the data into log base 2 brings the values closer to a normal curve.

[Histogram: Frequency versus Green Intensity Values of Data Set #1 After Median Normalization]

Figure 3. Histogram of Values for Data Set #1 after Median Normalization

In Figure 3, the graph has shifted slightly to the left after median normalization.
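As a rough illustration, the median normalization steps above (gene medians M1 … Mp, their median Mm, data set medians A1 … An, and the ratio Mm / A) could be written in Python like this. The function name and the example matrix are made up for this sketch. The check for a data set median of zero is an added assumption: the ratio cannot be computed in that case, and the handout does not say what to do then.

```python
import numpy as np

def median_normalize(matrix):
    """Scale each data set (column) by Mm / A, following the steps above.

    Rows are genes and columns are data sets. NaN entries are ignored when
    medians are computed (an assumption of this sketch).
    """
    matrix = np.asarray(matrix, dtype=float)
    gene_medians = np.nanmedian(matrix, axis=1)    # M1 ... Mp, one per gene (row)
    mm = np.nanmedian(gene_medians)                # Mm, the median of the gene medians
    set_medians = np.nanmedian(matrix, axis=0)     # A1 ... An, one per data set (column)
    if np.any(set_medians == 0):
        # Mm / A is undefined when a data set's median is exactly zero.
        raise ValueError("a data set median is zero, so Mm / A cannot be computed")
    return matrix * (mm / set_medians)             # same ratio for every gene in a column

# Hypothetical matrix: 3 genes (rows) by 2 data sets (columns).
values = np.array([[1.0, 2.0],
                   [2.0, 4.0],
                   [3.0, 6.0]])
normalized = median_normalize(values)
print(np.median(normalized, axis=0))  # every data set now has median Mm
```

One thing to watch for: if a data set was mean normalized first, its median may be very close to zero, which makes the ratio Mm / A very large. It is worth looking at A1 … An before applying the ratios.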
Boxplots can be done with each data set to see the overall change after median normalization for each of the data sets. Boxplots must be done in SPSS (Burruss Lab). Copy each data set’s values into separate columns. Under “Graphs”, choose “Boxplots”. Next, choose “Simple” and “Summaries of Separate Variables”. Under “Boxes Represent”, drag all your data sets into the box. This should give a graph comparing boxplots of each of your data sets.

[Boxplot of the four data sets]

Figure 6. Boxplot of the values of each of the 4 Data Sets used in the matrix after median normalization.

Standard Deviation Normalization for Expression Ratios

This type of normalization is used to normalize each data set. It is similar to mean normalization, except that the standard deviation is also used in the equation.

An equation is used to standardize the expression ratio values for each gene. That equation is:

Z = (V - M) / SD

Z is the new expression ratio value
V is the value you want to standardize
M is the mean of the data set
SD is the standard deviation of the data set

To do this in Excel, copy the expression ratio values (after subtraction of background and log base 2 transformation) of one data set into Column A. Find the mean of the data set and place it in Column B. Copy and Special Paste the mean value so that each gene’s B box has the mean value. Find the standard deviation of the data set and place it in Column C. Copy and Special Paste the standard deviation value so that each gene’s C box has the standard deviation value. To standardize the expression ratio, a function must be used in Column D that takes the value, mean, and standard deviation.

For gene #1, that function is =STANDARDIZE(A2,B2,C2)
For gene #2, that function is =STANDARDIZE(A3,B3,C3)
And so on for each gene…

Do this again for each of the data sets.
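For comparison with the Excel steps above, here is a minimal Python sketch of the same Z = (V - M) / SD calculation for one data set. The function name and the example values are made up. The handout does not say which Excel function is used to find the standard deviation, so this sketch assumes the sample standard deviation (n - 1 in the denominator, as Excel's STDEV does); pass ddof=0 for the population version.

```python
import numpy as np

def standardize(values, ddof=1):
    """Return Z = (V - M) / SD for every value in one data set.

    ddof=1 gives the sample standard deviation (an assumption of this
    sketch); ddof=0 gives the population standard deviation.
    """
    values = np.asarray(values, dtype=float)
    mean = values.mean()          # M, the mean of the data set
    sd = values.std(ddof=ddof)    # SD, the standard deviation of the data set
    return (values - mean) / sd

# Hypothetical log base 2 expression ratios for one data set.
z = standardize([1.0, 2.0, 3.0])
print(z)  # the standardized values have mean 0 and standard deviation 1
```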
Plot each data set’s new expression ratio values in a histogram and compare the symmetry with that from mean normalization. Choose the graph with the most symmetrical curve and use those values to start median normalization.

A type of this array analysis can be done in Magic Tool as well, although you use fewer arrays for your comparison. This feature is called Standardize. We haven’t yet determined the exact math behind how it works, but we know it does something similar.

The next type of normalization, normalizing microarrays to each other, still needs to be worked out by members of our class. Anyone interested in figuring it out as an independent project?