Statistical Analysis of Arrays
Written by BIO 480 Student Katie Criswell – Modified by T. Rife
Before Class:
Think About: What should your microarray data look like?
How many yellow spots versus red or green?
What does this mean your average ratio should be?
Graph It: Use Excel and Graph your Red versus Green Data. What does it look
like? Is it what you expected? Be prepared to share this graph in class!
Consider: What kinds of experimental mistakes might throw your data off?
Normalization is the technique that corrects for variation in microarray results
that comes from the microarray technology rather than from actual biological differences.
There are many different sources of error that may cause this variation.
Variability could be caused by the manufacturing process of the probe DNA, the amount
of DNA spotted on the slide, or the ability of cDNAs to bind to the array. Dye bias could
also arise from the physical properties of the dye, such as decay or its ability to hybridize with
cDNAs. Hybridization of the dyes to the cDNAs could be affected by humidity, dust,
salts or other molecules. Different scanning settings could cause imbalances between the
red and green dyes. For example, higher scanning intensities improve the quality of the
signal but increase the risk of saturation. All of these factors can affect the measured
intensities, so normalization must be considered when analyzing the data. By
minimizing the variance that the technology introduces on each microarray,
the data can be analyzed with more confidence and can be better used to
identify actual biological differences.
Mean and Median Normalization are used to normalize within microarrays to help
account for these technical problems. Mean or Median Normalization must be done
before you normalize between microarrays. We suggest that you do mean
normalization first, then use another normalization technique to normalize between
microarrays. After median normalization, a t-test should be done that compares the
expression levels of a gene between the different microarrays.
Before any normalization is done, some procedures must be done to each data set.
Normalizing will be done using the expression ratios for each gene. First, subtract each gene’s red
and green background from its red and green intensity. Save your data sheet.
Now create a new data sheet where you get rid of values that are 0 or negative
after subtracting the background, because one can’t divide by 0 to make an
effective ratio. If the green and red intensity values for a gene are both still positive
after the subtraction, divide the mutant color by the wild type color to get an
expression ratio. Transform this ratio into a log base 2 scale. Transforming into log base
2 reduces the scale of the differences in the data sets by transforming the data to a
0-16 numeric scale. It lets us better visualize differences between ratios such as 0.1 vs. 0.5. This
makes the data more suitable and easier to work with.
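The preparation steps above can be sketched in a few lines of Python as a cross-check on the spreadsheet work. This is only an illustration: the function name, the tuple layout, and the intensity values are made up for the example and are not taken from the class arrays.

```python
import math

def log2_ratios(spots):
    """Return {gene: log2(mutant / wild type)} for spots with usable signal.

    Each spot is a tuple (hypothetical layout, chosen for this sketch):
    (gene, mutant_intensity, mutant_background, wild_intensity, wild_background)
    """
    ratios = {}
    for gene, mut, mut_bg, wt, wt_bg in spots:
        mut_signal = mut - mut_bg   # background-subtracted mutant channel
        wt_signal = wt - wt_bg      # background-subtracted wild type channel
        if mut_signal <= 0 or wt_signal <= 0:
            continue                # drop values that are 0 or negative
        ratios[gene] = math.log2(mut_signal / wt_signal)
    return ratios

# Made-up example values, not real array data.
example_spots = [
    ("geneA", 1700, 100, 300, 100),  # 1600 / 200 = 8
    ("geneB", 150, 50, 900, 100),    # 100 / 800 = 0.125
    ("geneC", 80, 100, 500, 100),    # negative after subtraction: dropped
    ("geneD", 200, 200, 500, 100),   # zero after subtraction: dropped
]
result = log2_ratios(example_spots)
print(result)
```

In this sketch an eight-fold increase and an eight-fold decrease come out as values of the same size with opposite signs, which is the symmetry the log base 2 transformation is meant to give.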
Try to work on some of this before class with your group and make sure you can
work the following problems:
What is the log to the base 2 of:
a. 1
b. 2
c. 4
d. 0.5 or ½
e. 0.25 or ¼
In Class: We will spend some time normalizing our data, but read through
this technique before class:
Mean Normalization for Expression Ratio Values
From each microarray, 2 data sets (top half and bottom half of microarray) are available
to be used for analysis. For this experiment, 2 microarrays were used giving us 4 different
data sets. Each data set was compiled into a matrix. An example of a matrix is in the
below figure.
X11 X12 … X1n
X21 X22 … X2n
:
Xp1 Xp2 … Xpn
“X” denotes the value for a gene; the first subscript identifies the gene (row) and the
second subscript identifies the data set (column).
“N” equals the number of data sets used.
“P” equals the number of genes in each microarray.
The first column (X11 … Xp1) holds the expression ratios for each gene from the data set that
you choose as Data Set #1. The second column (X12 … Xp2) holds the expression ratios for
each gene from the data set that you choose as Data Set #2. A row (Xp1 … Xpn) holds all the
values for the specific gene assigned to that row from all the data sets in the collection.
The log base 2 values of all the expression ratios in each data set are used to find the
mean values of each microarray data set. The mean expression ratio is calculated for each
data set (column). For example, when calculating expression ratio means:
M1 … Mn
M1 equals mean expression ratio for data set #1 (Column #1)
Mn equals mean expression ratio for last data set (Column #n)
Mean expression ratios should be calculated for each data set. After a mean is calculated
for each data set, that data set’s mean expression ratio value is subtracted from each of
that data set’s gene expression ratio value. For example, when subtracting mean
expression ratio from an individual gene’s expression ratio:
X11 – M1 … X1n – Mn
:
Xp1 – M1 … Xpn – Mn
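The subtraction above can be sketched in Python as a check on the spreadsheet calculation. The function name and the small example matrix are made up for illustration; rows stand for genes and columns for data sets, and the values are assumed to be log base 2 expression ratios.

```python
from statistics import mean

def mean_normalize(matrix):
    """Subtract each column's mean from every value in that column.

    matrix is a list of rows (one per gene); each row holds one
    log base 2 expression ratio per data set.
    """
    columns = list(zip(*matrix))                # one tuple per data set
    col_means = [mean(col) for col in columns]  # M1 ... Mn
    return [[x - m for x, m in zip(row, col_means)] for row in matrix]

# Made-up example: 3 genes (rows) by 2 data sets (columns).
example = [
    [1.0, 3.0],
    [2.0, 5.0],
    [3.0, 7.0],
]
normalized = mean_normalize(example)
print(normalized)
```

After this step each column has a mean of zero, which is why the histogram of a mean-normalized data set is centered around zero.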
All new expression ratio values after mean normalization for one data set should be
graphed for frequency in a histogram. You will have “n” number of histograms for “n”
number of data sets. A boxplot of values from each microarray data set can be compared
after mean normalization also.
To do a histogram in SPSS (Burruss Lab), copy a data set’s values into a column in the
SPSS spreadsheet. Under “Graphs”, choose “Interactive” and then “Histogram”. “Count”
will be your independent variable and your data set’s values will be the dependent
variable.
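If SPSS is not at hand, the counts behind a histogram can be tallied with a short Python sketch. This is not the SPSS procedure, only a rough equivalent; the function name, the bin width, and the example values are assumptions made for the illustration.

```python
import math
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many values fall in each bin [k * width, (k + 1) * width)."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    # Key each bin by its left edge, in increasing order.
    return {k * bin_width: counts[k] for k in sorted(counts)}

# Made-up normalized log base 2 ratios.
example_values = [-1.2, -0.4, -0.1, 0.0, 0.3, 0.6, 1.7]
bins = histogram_counts(example_values, 1.0)

# Print a simple text histogram, one row per bin.
for left_edge, count in bins.items():
    print(f"{left_edge:5.1f} to {left_edge + 1.0:5.1f}: {'#' * count}")
```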
Figure 1. Histogram before Log Base 2 transformation.
Figure 1 represents the expression ratio of genes before they were transformed into log
base 2 value. The graph is right skewed. This is because most genes do not have a high
expression level in this microarray.
[Histogram: Frequency versus Green Intensity Values of Data Set #1 Without Normalization]
Figure 2. Histogram for Data Set #1 after Log Base 2 transformation.
Figure 2 represents the log base 2 values before any normalization is done. Transforming
the data into log base 2 corrects the values into a more normal curve.
[Histogram: Frequency versus Green Intensity Values of Data Set #1 After Mean Normalization]
Figure 3. Histogram of Data Set #1 after Mean Normalization
After mean normalization, Figure 3 shows that the graph has shifted so that most of the
data lies around zero.
Boxplots can be done with each data set to see the overall change after mean
normalization for each of the data sets. Boxplots must be done in SPSS (Burruss Lab).
Copy each data set’s values into their separate columns. Under “Graphs”, choose
“Boxplots”. Next, choose “Simple” and “Summaries of Separate Variables”. Under
“Boxes Represent”, drag all your data sets into the box. This should give a graph
comparing boxplots of each of your data sets.
[Boxplot: DataSet1, DataSet2, DataSet3, DataSet4]
Figure 4. Boxplot of values of each of the Four Data Sets used in the matrix.
Before normalization is performed, outliers can be seen in Figure 4 and the data ranges
between the data sets are not similar.
[Boxplot: DataSet1, DataSet2, DataSet3, DataSet4]
Figure 5. Boxplot of each of the 4 Data Sets used in the matrix after mean normalization.
After mean normalization, the data ranges between the data sets are more similar and also
better comparable with each other.
Median Normalization for Expression Ratio Values
Median Normalization must be used after using either Mean Normalization or Standard
Deviation Normalization.
Data sets for each microarray’s normalized data must be compiled into a matrix. From
each microarray, 2 data sets (top half and bottom half) are available to be used for
analysis. For this experiment, 2 microarrays were used giving us 4 different data sets.
Each data set was compiled into a matrix. An example of a matrix is in the below figure.
X11 X12 … X1n
X21 X22 … X2n
:
Xp1 Xp2 … Xpn
“X” denotes the value for a gene.
“N” equals the number of data sets used.
“P” equals the number of genes in each microarray.
In other words, the first column (X11 … Xp1) holds all the normalized expression ratio
values in data set #1, and the last row (Xp1 … Xpn) holds all the normalized expression ratio
values for that gene from all the data sets.
The normalized expression ratio values are used to find a median for each gene across
the data sets. For example, when calculating the expression ratio medians:
M1 equals the median of X11 … X1n, that is, the median of all of this specific gene’s
expression ratio values from the compiled matrix.
Mp equals the median of Xp1 … Xpn, that is, the median of all of this specific gene’s
expression ratio values from the compiled matrix.
You will have “P” medians, one for each of the “P” genes.
A median of all the expression ratio medians is then calculated. This median is Mm.
Mm equals the median of all the combined medians M1 … Mp.
Each gene’s expression ratio was then multiplied by a ratio. For example, when
calculating the new expression ratio value for each gene in data set #1, the expression
ratio value was multiplied by a ratio and that ratio is:
Ratio = (Mm / A1)
Mm equals the median for expression ratio medians
A1 equals the median for expression ratio values for genes X1,1 … Xp,1
(A1 is different from M1. A1 is the median for all the expression ratio values for the Data
Set #1. In this experiment, we had A1-A4 because we had four different data sets.)
*this same ratio is used for each gene for data set #1*
Next, when calculating the new value of expression ratios for the genes in data set #2,
each gene’s expression ratio value was multiplied by a ratio and that ratio is
Ratio = (Mm / A2)
Mm equals the median for all expression ratio medians
A2 equals the median for expression ratio values for genes X1,2 … Xp,2
*this same ratio is used for each gene for data set #2*
Do this again for data set #3 and so on.
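The median normalization steps above can be sketched in Python as follows. The function name and the example matrix are made up for illustration. The check for a data set median of 0 is an addition of this sketch, not part of the handout: the ratio Mm / A cannot be formed when A is 0, which can happen with values that have been centered around zero.

```python
from statistics import median

def median_normalize(matrix):
    """Scale each column by Mm / A, where A is that column's median.

    matrix is a list of rows (one per gene); each row holds one
    normalized expression ratio per data set.
    """
    gene_medians = [median(row) for row in matrix]       # M1 ... Mp
    mm = median(gene_medians)                            # Mm
    col_medians = [median(col) for col in zip(*matrix)]  # A1 ... An
    if any(a == 0 for a in col_medians):
        # Guard added for this sketch only.
        raise ValueError("a data set median of 0 makes Mm / A undefined")
    factors = [mm / a for a in col_medians]              # one ratio per data set
    scaled = [[x * f for x, f in zip(row, factors)] for row in matrix]
    return scaled, mm

# Made-up example: 3 genes (rows) by 3 data sets (columns).
example = [
    [1.0, 2.0, 4.0],
    [2.0, 4.0, 8.0],
    [3.0, 6.0, 12.0],
]
scaled, mm = median_normalize(example)
print(mm)
print(scaled)
```

In this example every data set ends up with the same median, Mm, which is the point of multiplying each column by its own ratio.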
All new expression ratio values after median normalization for one data set should be
graphed for frequency in a histogram. You will have “n” number of histograms for “n”
number of data sets.
To do a histogram in SPSS (Burruss Lab), copy a data set’s values into a column in the
SPSS spreadsheet. Under “Graphs”, choose “Interactive” and then “Histogram”. “Count”
will be your independent variable and your data set’s values will be the dependent
variable.
Figure 1. Histogram Values before Log Base 2 Transformation
Figure 1 represents the expression levels of genes before they were transformed into log
base 2 value, and the graph is right skewed. This is because most genes do not have a
high expression level in a microarray.
[Histogram: Frequency versus Green Intensity Values of Data Set #1 Without Normalization]
Figure 2. Histogram of Values for Data Set #1 after Log base 2 transformation.
Comparing Figure 1 and Figure 2, simply transforming the data into log base 2 corrects
the values into a more normal curve.
[Histogram: Frequency versus Green Intensity Value of Data Set #1 After Median Normalization]
Figure 3. Histogram of Values for Data Set #1 after Median Normalization
In Figure 3, the graph has shifted slightly to the left after median normalization.
Boxplots can be done with each data set to see the overall change after median
normalization for each of the data sets. Boxplots must be done in SPSS (Burruss Lab).
Copy each data set’s values into their separate columns. Under “Graphs”, choose
“Boxplots”. Next, choose “Simple” and “Summaries of Separate Variables”. Under
“Boxes Represent”, drag all your data sets into the box. This should give a graph
comparing boxplots of each of your data sets.
Figure 6. Boxplot of values of each of the 4 Data Sets used in the matrix after median
normalization.
Standard Deviation Normalization for expression ratios
This type of normalization is used to normalize each data set. It is similar to mean
normalization except standard deviation is used also in the equation.
An equation is used to standardize the expression ratio values for each gene. That
equation is:
Z = (V-M) / SD
Z is the new expression ratio value
V is the value you want to standardize
M is the mean of the data set
SD is the standard deviation of the data set
To do this in Excel, copy the expression ratio values (after subtraction of background and
log base 2 transformation) of one data set into Column A.
Find the mean of the data set and place that in Column B. Copy and Paste Special the
mean value so that each gene’s B cell has the mean value.
Find the standard deviation of the data set and place that in Column C. Copy and Paste
Special the standard deviation value so that each gene’s C cell has the standard deviation value.
To standardize the expression ratio, a function must be used in Column D using the
value, mean, and standard deviation.
For gene #1, that function is =STANDARDIZE(A2,B2,C2)
For gene #2, that function is =STANDARDIZE(A3,B3,C3)
And so on for each gene…
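The same calculation, Z = (V-M) / SD, can be sketched in Python to check the Excel column. The function name and example values are made up for illustration. The handout does not say which standard deviation to use, so this sketch assumes the sample standard deviation (dividing by n - 1).

```python
from statistics import mean, stdev

def standardize(values):
    """Return Z = (V - M) / SD for every value V in one data set."""
    m = mean(values)    # M, the mean of the data set
    sd = stdev(values)  # SD; sample standard deviation (n - 1) is an assumption
    return [(v - m) / sd for v in values]

# Made-up log base 2 expression ratios for one data set.
example = [2.0, 4.0, 6.0]
z = standardize(example)
print(z)
```

After standardizing, the data set has a mean of 0 and a standard deviation of 1, so data sets with different spreads are put on a common scale.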
Do this again for each of the data sets. Plot each data set’s new expression ratio values
into a histogram and compare the symmetry with mean normalization. Choose the graph
with the best symmetrical curve and use those values to start median normalization.
A type of this array analysis can be done in Magic Tool as well, although you use
fewer arrays for your comparison. This feature is called Standardize. We haven’t
yet determined the exact math behind how it works, but we know it does
something similar.
The next type of normalization, normalizing microarrays to each other, still
needs to be worked out by members of our class. Anyone interested in figuring
it out as an independent project?