Introduction to the Statistical Analysis of Microarray Data
Martina Bremer (1,4), Edward Himelblau (2), and Andreas Madlung (3)
1 Purdue University, Department of Statistics, West Lafayette, IN 47907; 2 California Polytechnic State University, Biological Science, San Luis Obispo, CA 93407; email: [email protected]; 3 University of Puget Sound, Department of Biology, Tacoma, WA 98416; email: [email protected]; 4 Current address: San Jose State University, San Jose, CA 95192; email: [email protected]
Introduction: Studying Gene Expression
Knowing the transcriptional activity of a gene can give valuable insight into the function of the protein it encodes and the role it plays in an organism. Gene activity in the same individual can vary from tissue to tissue, between different developmental stages, or even from morning to night time. Gene activity is influenced by the activity of other genes and the proteins they encode. Gene expression can change in response to outside factors, such as the environment or exposure of the organism to chemical substances, competitors, or pathogens.

The classical approach to measuring the activity of a gene has been to isolate messenger RNA (mRNA) and estimate the amount of mRNA of the gene of interest present at a given time in the organism. Traditionally, this has been done for one gene at a time.

Whole genome sequencing projects of many species, including humans, have provided information that allows researchers to distinguish every gene in the organism. The development of microarray technology has made it possible to survey the gene expression activity of thousands of genes at the same time by using short pieces of DNA, each uniquely representing one gene, and spotting them onto a solid support, such as a microscope glass slide. Using extremely small capillaries to apply these pieces of DNA, up to 25,000 genes can be represented on a single conventional 1.5 cm x 5 cm slide. Using these microscopic arrays of DNA spots, researchers can assess the relative amount of mRNA in a sample of all 25,000 represented genes (called the target spots) in the same time that it used to take to analyze the activity of a single gene.

Such technological advances have revolutionized the way molecular bioscience is done and have sped up the rate of new discoveries. However, they have also led to the rapid acquisition of huge amounts of data that require the use of biostatistics for analysis and validation of the collected data. In practice, gene activity is assessed by labeling mRNA that was extracted from an organism with fluorescent dyes. The labeled mRNA, known as the “probe”, is applied to the glass slide and allowed to bind to its complementary spot on the array. This process is called hybridization. Subsequently, the unbound mRNA is washed off the slide. The slide is scanned, and the amount of fluorescently labeled mRNA bound to each spot is proportional to the activity of the gene it represents. In most cases, software analysis is then used to determine how much of a signal is due to biologically relevant processes and how much is due to technical “noise”. In this lab activity you will learn how to analyze data obtained from a microarray experiment using statistical tests very similar to those that a commercial software package would use.

Biological Use for Microarrays

Why would a researcher want to do microarray experiments? In essence, a microarray experiment can give useful information for any question that asks whether or not two different populations of cells express different sets of genes. For example: A researcher wants to find out which genes become active if a plant is subjected to prolonged drought stress (Figure 1). An appropriate experiment would be to have one set of plants growing in optimal conditions and a second set growing in the same conditions, except with limited water. After a few days under these conditions, tissue is harvested from both sets (treatment: no water; control: well-watered) and mRNA is extracted. As described later in more detail, a common method used in microarray analysis is to label mRNA from the treatment group with one color dye and the control mRNA with another color. Equal amounts of mRNA are then used for the hybridization to the array. If a scanner with the capacity to detect two colors is used, relative amounts of mRNA of each gene can be compared between the control group and the treatment group. Genes up-regulated (“turned on”) in response to drought stress will show a stronger signal of one color (treatment) than the other color (control). After statistical analysis of the data obtained for all of the 25,000 genes, a gene list is generated allowing the researcher to know which genes are activated by the treatment. In our example, these are genes that become active in response to drought stress.

Figure 1. Comparing gene expression using microarrays. mRNA is extracted from a plant that has undergone an experimental treatment (T, drought stress in this case) and an untreated control (C). The mRNA undergoes reverse transcription to generate cDNAs. A different fluorescent molecule is used to label each of the cDNA pools. The labeled cDNAs are then hybridized to a microarray. The microarray consists of a glass slide on which thousands of distinct DNA sequences have been affixed. Each dot (or “feature”) on the slide represents the sequence of a different plant gene. After the unbound probe is washed away, a special slide scanner excites each feature on the array with a laser and measures the fluorescent signal emitted. The more cDNA is bound to a spot, the greater the signal will be. The magnified computer screen at the lower right shows the possible results for each feature. A red spot (A) represents a gene that is only expressed in the control. A green spot (B) represents a gene that is only expressed in the treated plant. A yellow spot (C) represents a gene that is expressed in both treated and control plants. A dark spot (D) indicates that the corresponding gene is not expressed in either the control or treated plants.

What is statistics?
Statistics is a collection of procedures and formulas that allow us to make decisions when faced with uncertainty. Where does the uncertainty come from? Many experiments that address the same problem or question can differ in their outcomes when conducted by different people or with different material. Statisticians call this variation. In microarray experiments, the two main sources of variation that cause the uncertainty are:

Biological variation
Different organisms have different gene expression profiles, or in other words, the activity of their genes varies. The measured expression levels hence vary from individual to individual used in the study. (Figure 2)

Technical variation
Due to human error, there can be slight variation in microarray manufacturing and hybridization of the mRNA to the slide. Even if an experiment calls for applying the same amount of mRNA from the same organism to two identical slides, the measurements may be different. (Figure 2)

Carefully conducted experiments can keep the technical variation to a minimum. The biological variation between individuals in the experimental groups, however, cannot be influenced.

Figure 2. Sources of variation in gene expression studies.

Mean, median, and standard deviation
Three of the most important statistical concepts are the mean, median and standard deviation of a set of measurements. While the mean and median are used to describe the center of the measurements, the standard deviation is used to describe the spread.

Mean: the average of the n measurements, or
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Median: the middle observation. (If the number of observations n is odd, the middle number is used. If n is even, the average of the two middle observations is used.)

Standard deviation: measures the average distance of the n observations $x_1, \ldots, x_n$ from the mean ($\bar{x}$), or
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

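The three summary statistics defined above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `summarize` and the sample values are made up for illustration and do not come from the handout's data.

```python
from statistics import mean, median, stdev

def summarize(values):
    """Return (mean, median, sample standard deviation) of a list of numbers.

    stdev() divides by n - 1, matching the formula for s given in the text.
    """
    return mean(values), median(values), stdev(values)

# Hypothetical measurements (illustration only)
measurements = [1.0, 2.0, 3.0, 4.0]
x_bar, mid, s = summarize(measurements)
print(f"mean = {x_bar}, median = {mid}, s = {s:.3f}")
```

With an even number of observations, as here, the median is the average of the two middle values.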
Overview of a Microarray Experiment
It is the goal of many microarray experiments to compare the gene expression levels of a treatment group with those of a control group. For this purpose, mRNA is extracted from the cells of several individuals in each group. The samples are labeled with red and green fluorescent dyes and allowed to bind to the DNA on the array (target spots).

The scanner divides the features on the array into pixels, and for each pixel a computer records the scanned red and green intensity. Usually, the spots are not uniform. The intensity in both the red and the green channels may vary over these pixels. The areas with very low intensity between the spots are called the background (Figure 3).

Figure 3. Spot intensity. After labeled cDNA is hybridized to a microarray, each fluorescent label is individually excited and detected by the slide scanner. The scanner divides the spots into pixels. The spots are not uniform and there can be variable intensity from pixel to pixel within each spot. Also, there are areas of low intensity, “background” fluorescence, in the regions between spots. After scanning, the red and green signals are superimposed by the computer, which generates a composite yellow spot (notice the distinct red and green pixels in the background of the superimposed image).

Microarray Data Analysis
What do the many numbers in a microarray experiment output file actually mean? When a microarray experiment is conducted, the result is usually a large data file with many columns and as many rows as there were spots on the array. It may look something like Figure 4. Usually, the output file contains much more information than shown here, but we will only use the pictured columns for our analysis.

The first three columns (labeled “Block”, “Column”, and “Row”) tell us the position of the scanned spot on the array. The column labeled “Name” contains the name of the gene that was spotted there. The column labeled “ID” contains information pertaining to the exact part of the gene that was used to originally produce the target spot on the glass slide.

The red and green intensities recorded by the scanner are reported in the next four columns. “F” stands for foreground and “B” for background. The numbers 635 and 532 represent the wavelengths of the red and green laser light, respectively. For example, the numbers in the F635 Median column are the median values of the scanned foreground pixels when excited with the red laser light.

In addition to measuring the spot itself, the scanner also measures so-called background intensity. This is in essence probe that bound to the silicate of the glass slide and falsely increases the signal of each spot by the same amount. This background intensity needs to be subtracted to get an accurate reading of the spot intensity based on probe-target interaction only. The columns that will be most helpful in our analysis are therefore the ones labeled “F635 Median - B635” and “F532 Median - B532”. They contain the background corrected median red and green intensities for each spot on the array.

Figure 4. Example of the output file of a microarray experiment. After scanning the slide, the intensities of the red and green signals are recorded separately (even if the spot looks yellow to the eye, it is composed of green and red labeled probe). The data are stored in a so-called .gpr file that can be copied directly into a spreadsheet, such as in this case a Microsoft Excel file.

Normalization
The results of a microarray experiment are obviously influenced by technical variation. This means that the measured intensities vary on the array(s) in a systematic manner. Often more than one slide is used to conduct the experiment. If different amounts of probe are applied to the slides, the intensities on some slides may be consistently higher than the intensities on other slides, even though they measure the same genes.

Normalization means to mathematically manipulate the data to make it uniform in a variety of ways. There are several different ways this normalization can be done.

In many microarray experiments, the results are reported as log-ratios of the two intensity measurements. If $G_i$ represents the green intensity for gene i and $R_i$ represents the red intensity for gene i, then two quantities commonly used are:

$$M_i = \log_2 \frac{R_i}{G_i} \qquad \text{(equation 1)}$$

$$A_i = \frac{1}{2}\left(\log_2 R_i + \log_2 G_i\right) \qquad \text{(equation 2)}$$

(If you need to re-acquaint yourself with logarithms, now would be a good time to do so. If you need a calculator to convince yourself of the values discussed below you can use one online at: http://www.rechneronline.de/logarithmus/. Notice that the base used here is 2, not 10.)

The quantity $M_i$ (the log ratio) describes the relationship between the two groups. If the intensity in the red and green-labeled group is the same, then $M_i$ will be zero. If the red intensity is twice as big as the green, then $M_i$ will be equal to 1. If, on the other hand, the green intensity is twice as big as the red, then $M_i$ will be equal to -1. The quantity $A_i$ describes the overall intensity observed for gene i. The quantities $M_i$ and $A_i$ are very useful in the normalization of microarray data.

Before conducting the analysis for a microarray experiment, researchers often take a look at what is called the “MA-plot”. For every feature i on the array, the values $M_i$ and $A_i$ are computed and are plotted in an xy-plot (Figure 5).

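Equations 1 and 2 can be tried out directly in Python. The sketch below is only an illustration: the function names and the spot intensities are hypothetical, and it assumes the background-corrected red and green values are both positive (the logarithm is undefined otherwise).

```python
import math

def background_corrected(foreground_median, background_median):
    """Subtract the background signal from the foreground signal of one spot."""
    return foreground_median - background_median

def log_ratio_m(red, green):
    """Equation 1: M = log2(R / G)."""
    return math.log2(red / green)

def mean_log_intensity_a(red, green):
    """Equation 2: A = (log2 R + log2 G) / 2."""
    return 0.5 * (math.log2(red) + math.log2(green))

# Hypothetical spot (illustration only):
# F635 median 1300, B635 median 100, F532 median 700, B532 median 100
red = background_corrected(1300, 100)     # 1200
green = background_corrected(700, 100)    # 600
print(log_ratio_m(red, green))            # red is twice green, prints 1.0
print(round(mean_log_intensity_a(red, green), 2))
```

Swapping the two arguments of `log_ratio_m` flips the sign of M, which mirrors the statement that a green intensity twice as big as the red gives M = -1.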
It is known that the green and red dyes used to label the samples interact differently with certain genes. The dyes have different light stability and may vary in efficiency. That means that the dye molecules are more likely to bind to certain genes than others.

On average, we would like the intensities for both dyes to be about the same. That means that, on average, the log-ratios $M_i$ should be about zero. Normalization means that one computes the average M-value for all genes spotted on the array and then makes sure that the average will be zero.

Figure 5. Using an MA plot to visualize normalization on one array. Highly up- or down-regulated genes are above or below the x-axis. The diagonal lines to the left are called “fishtails”. Fishtails are an artifact of changing the value for spots that have higher background than foreground.

#1: Do this… to normalize microarray data.
Find a computer and open an Excel spreadsheet. Type in the values as you see them below. Note that in the example below, the six features shown all represent the same gene.
The first step is to compute the log-ratios of red (635) to green (532) signal. In Excel, you can use the command “LOG(x,2)” to compute $\log_2(x)$, or you can calculate them with a calculator or the website given earlier. Write the results into a new column and label it “M”:
Write the M-values on this sheet in column G.
The goal of normalization is to ensure that the average of the M-values is zero. Next, compute the average of all six M-values. The average is:_______________________ Now subtract this average from the M-values for all genes in the example above to “correct” them. Two values have already been entered so you can see if you are on the right track. Fill in the rest in column G below.
The point of normalization is to center the average on the zero value. Here you have calculated the average M-value and then subtracted it from all M-values, making them zero on average. In this example, the procedure shifts all data points up a bit, as you can see on the MA plot before and after normalization. The biological reason to normalize in this case is that one dye, because of its chemical stability and not because of the expression of the genes it labels, always gives a higher value than the other dye, introducing an error that is equally great for all data points.

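The centering step from this exercise can be sketched in Python as follows. The M-values below are hypothetical and are not the worksheet data; the function name `center_m_values` is likewise just a label chosen for this illustration.

```python
from statistics import mean

def center_m_values(m_values):
    """Subtract the average M-value so that the corrected values average to zero."""
    average = mean(m_values)
    return [m - average for m in m_values]

# Hypothetical M-values for six spots (illustration only, not the worksheet data)
raw_m = [-0.6, -0.2, -0.9, -0.4, -0.1, -0.8]
corrected = center_m_values(raw_m)
print("average before:", round(mean(raw_m), 2))
print("corrected M-values:", [round(m, 2) for m in corrected])
```

Because the average of these made-up values is negative, subtracting it moves every point upward, the same direction of shift the exercise describes.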
Normalization by Dye-Swap Design
A better way to deal with uneven binding of the dyes to certain genes is to carry out experiments as dye-swaps (Figure 6). In them, the probe from each group (treatment and control) is split into two portions and labeled with different dyes. The labeled samples are then hybridized crosswise (red “treatment” with green “control” and green “treatment” with red “control”) onto two arrays.

To normalize the data, obtain the M-value for each feature on each slide (as demonstrated in the “#1: Do this…” section). For consistency, we will compute the M-values for one array as the red/green log-ratio and for the other array (in which the dyes are swapped) as the green/red log-ratio. Then average the M-values from the two arrays for each feature:

$$M_i = \frac{1}{2}\left(M_i^{(1)} + M_i^{(2)}\right)$$

Here, $M_i^{(1)}$ is the log-ratio for feature i on array 1, and $M_i^{(2)}$ is the log-ratio for the same feature (with dyes swapped) on array 2.

Figure 6: Experimental design for a dye-swap experiment. In this design, the treatment and control are first labeled with one set of dyes and hybridized to array 1. To account for different labeling efficiencies of the two dyes, the same probes are now labeled with the other dye (the dyes are swapped) and subsequently hybridized to array 2.

#2: Do this… to normalize data from a dye-swap experiment.
Suppose you have data from a dye-swap experiment. There are two (very small) arrays. Each array contains six spots for the same gene. Suppose that for each spot, you have already computed the log-ratio M as in the previous example:
To conduct dye-swap normalization, average the M-values for each spot on the two arrays. For example, after normalization, the corrected M-value for the spot in column 1/row 3 is (-0.16-0.42)/2 = -0.29. This way, an M-value can be obtained for every spot on the array. In the table below, fill in the corrected M-values for each spot:
Corrected M-values
-0.29
-0.94
-0.26
0.42
-2.47
-1.34

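The dye-swap average is a one-line computation. The sketch below reproduces the single worked value given in the text, (-0.16-0.42)/2 = -0.29; the function name is hypothetical, and it assumes both log-ratios have already been oriented the same way (red/green on one array, green/red on the swapped array).

```python
def dye_swap_average(m_array1, m_array2):
    """Average the two log-ratios of one feature across the dye-swapped arrays."""
    return 0.5 * (m_array1 + m_array2)

# The worked value from the text: spot in column 1 / row 3
print(round(dye_swap_average(-0.16, -0.42), 2))  # -0.29
```

A dye effect that pushes one array's log-ratio up pushes the swapped array's log-ratio down by the same amount, so it cancels in the average.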
Drawing Conclusions
It is often the goal of microarray experiments to identify genes that become either more or less expressed in response to a treatment that compares two different states (e.g., drought-stressed and well-watered plants). To carry out the experiment, the gene expression levels of plants that were subjected to drought stress (treatment group) are compared with those of plants that were watered well (control group). For every gene on the microarray we want to decide whether the expression levels in the two experimental groups are (significantly) different from each other or not.

How can we make sense of the many thousands of measurements collected simultaneously in a microarray experiment?

First, we will analyze each gene separately. Our goal is to decide whether the expression level of the gene is different in the treatment and the control group. If the green and red intensities are different, that means that their quotient R/G will not be equal to one. If the quotient is not equal to one, then the M-value for the spot, which is the logarithm of the quotient (see equation 1), will not be equal to zero. It will be positive if the red intensity is greater than the green, and negative if the green intensity is greater than the red (see Figure 5).

The researcher is trying to detect a difference between the treatment and control group. Very few spots on the array will have equal red/green intensities (M-values equal to zero). The challenge in analysis is to determine whether the differences are due to biological or technical variation, or whether they reflect true differences in gene expression between the samples. This is achieved by analysis of M-values. But how far away from zero do these M-values have to be so that we are convinced that the result is due to a real difference in the experimental groups and not just due to variation? (Figure 7)

To decide whether the M-values for a gene are far enough away from zero so that we would call the gene differentially expressed in the treatment and control group, we also have to look at the variation of these measurements. Only if the distance from zero of the average of our measurements is large compared to the variation can we assume that the gene has different expression in the two experimental groups.

Figure 7. Is the average M equal to zero? Multiple repeats for the same gene will give different results. Statistical tests provide an answer whether or not the mean of the repeats is significantly different from zero, or, in other words, if the treatment resulted in differences in gene expression from the control. On the left, different means for observations with the same variation pattern are shown. The variation in the measurements is important in deciding whether the mean of a number of observations is significantly different from zero. If the variation is small, we may be more inclined to assume a non-zero mean than if the variation is large. On the right, a greater sample size may or may not (in this case not) result in greater confidence that the mean is different from zero.

Statistical Decision Making
Hypothesis tests are an important tool for statistical decision-making. They are used to answer a “Yes/No” question about a population. But instead of being able to observe the whole population, we only get to see a small sample.

For example, in a criminal trial, the defendant is considered innocent until proven guilty. During the trial both sides (prosecution and defense) present evidence, and at the end of the trial the jury members have to decide whether the evidence is enough to convict the defendant or not.

Suppose we find a bloody knife in the defendant’s closet. A statistician would now ask “how likely is it for something like this to happen to an innocent person?” If the answer were “Not very likely”, then the statistician would conclude that the defendant is probably guilty.

Hypothesis Testing:
A statistical hypothesis test is similar. A null hypothesis is a statement about a parameter. Like the innocence assumption in the criminal trial, it is usually of the form “there is nothing unusual happening here”.

The alternative hypothesis is the opposite of the null hypothesis; this is the statement that the scientist really suspects to be true. Data is collected that will be used as evidence.

The scientist now takes on the role of prosecutor. If it is unlikely to observe what we see if the null hypothesis were true, we can conclude that the data does not conform to this theory, and we then reject the null hypothesis. (This does not mean that we have proved that the alternative hypothesis is true.)

If it is unlikely to see something as extreme (or more extreme) as our data from variation if the null hypothesis were true, we can reject the null hypothesis. If, on the other hand, outcomes like the one we observed happen all the time if the null hypothesis were true, then we cannot reject the null hypothesis.

Suppose a study finds that 15% of all innocent people keep bloody knives in their closets. Would you declare our defendant guilty in this case? What would you do if you knew that only 0.00001% of all innocent people kept bloody knives in their closets?

What should the probability of your observations be, so that you would be willing to accept or reject a hypothesis? Most often, the answer to this question depends on the problem. A popular value that is used in many fields is 0.05:

$$p = P(\text{observing data this extreme if the null hypothesis is true}) \quad \begin{cases} < 0.05 & \text{Reject the null hypothesis} \\ \geq 0.05 & \text{Do not reject the null hypothesis} \end{cases}$$

The probability p is called the “p-value” of the test. The smaller the p-value is, the less likely it would be to obtain the data you have if the null hypothesis were true.

To figure out the probability of observing extreme (or unusual, atypical) data, we have to have a quantity whose statistical distribution (behavior) we know and whose value we can compute from the sample data. Such a quantity is called a “test statistic”.

In a microarray experiment, we want to identify genes whose expression values are different in the treatment and control group. We will conduct a hypothesis test for each gene that is spotted on the array. You can think of the criminal trial as an analogy to what we are going to do. Fill in the table below with the steps a biologist would have to do to conduct the hypothesis test.

Step | Criminal Trial | Gene Expression Experiment
Null hypothesis | Assumption of innocence | No difference between gene expression in control and treatment plants
Alternative hypothesis | Assumption that the defendant is guilty | Gene expression differs between the control and treated plants
Data | Evidence (such as a bloody knife in the defendant’s closet) | Gene expression measured by red/green fluorescence levels
p-value | Probability of finding incriminating evidence on an innocent person | Probability of observing expression differences this large from biological or technical variation alone, when there is no real difference between the groups
Rejecting the null hypothesis | Jury finds the defendant guilty | Declare the gene differentially expressed
Accepting the null hypothesis | Jury finds the defendant not guilty | Declare that the gene is not differentially expressed

Hypothesis Test for Log-Ratios
In microarray experiments, especially if the data has been normalized, it will be in the form of log-ratios of red and green intensities (M-values). The file will have one M-value for every spot on the array. Each gene will be spotted several times on the array, so that for each gene we have several M-values. To decide whether a gene is expressed differently in the two groups (treatment and control), we will decide whether the M-values for that gene are close to zero (on average) or not. To make this decision, we will also have to take the variance of the observations into account.

A t-test will allow you to do this. Suppose that you repeatedly measure a characteristic which has mean zero. If you have n measurements with average $\bar{x}$ and standard deviation s, then the quantity

$$t = \frac{\bar{x}}{\sqrt{s^2/n}}$$

follows a t-distribution with n - 1 degrees of freedom. The number of degrees of freedom characterizes the distribution (behavior) of the test statistic. The “normal” (or typical) values are always those close to zero. The unusual values are the ones in the tails of the distribution, either large positive or large negative numbers.

Figure 8. t-distribution for different degrees of freedom. The degrees of freedom depend on the experimental setup, certain assumptions, and the number of observations.

Large positive values of the test statistic mean that the log-ratio is positive, which means that the red intensity is much higher than the green (meaning that the expression of the genes whose mRNA was labeled with red dye is higher than the expression of the genes whose mRNA was labeled with green dye). Large negative values of the test statistic mean that the log-ratio is negative, which means that the green intensity is much higher than the red.

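The test statistic for one gene can be computed from its M-values in a few lines. The sketch below is an illustration under stated assumptions: the M-values are hypothetical (not the worksheet data), the function name is made up, and the degrees of freedom are taken as n - 1, matching the df = 6 - 1 = 5 used in the worked example later in this handout.

```python
import math
from statistics import mean, stdev

def one_sample_t(m_values):
    """Return (t, df) with t = mean / sqrt(s^2 / n) and df = n - 1.

    Tests whether the average M-value of one gene differs from zero.
    """
    n = len(m_values)
    x_bar = mean(m_values)
    s = stdev(m_values)               # sample standard deviation (divides by n - 1)
    t = x_bar / math.sqrt(s ** 2 / n)
    return t, n - 1

# Hypothetical M-values for one gene spotted six times (illustration only)
m_values = [0.9, 1.4, 0.2, 1.1, 0.7, 1.3]
t, df = one_sample_t(m_values)
print(f"t = {t:.2f}, df = {df}")
```

The same average gives a larger t when the M-values are tightly clustered and a smaller t when they are spread out, which is how the variation enters the decision.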
How large is large?
How large (or small) will a test statistic value need to be so that we can call it unusual? Most researchers work with a significance level of 5% (or 0.05). They call an observation unusual if its p-value is smaller than 5%. That means that the test statistic value falls into the outer 5% tail area of the distribution in the graph of the t-distribution above. If that occurs, one can safely argue that the two values (for red and green labeled RNA) differ from each other in a “statistically significant” manner between the treatment and the control group.

#3: Do this…conduct a t-test for a microarray experiment.
We need to make a decision for each gene (represented by several spots on the array). Is the gene expressed differently in
the treatment and control group? Therefore, for each gene, we will carry out the hypothesis test separately.
Before the Test: Pick a gene. Find all the red and green intensity values for this gene in your data file. Compute all the M-values for this gene.
Step 1: We need to set up the null hypothesis and alternative hypothesis. Remember that the null hypothesis means that the
treatment had no effect (on this gene) and the alternative hypothesis is what the researcher is really expecting to
support. A good experiment will always provide valuable information, regardless of the outcome of the hypothesis
test.
In your answer sheet write down the two hypotheses for your experiment:
Null Hypothesis:
Alternative Hypothesis:
Step 2: In this example we have already collected the data. We will use the microarray measurements as estimates for the
gene expression levels in the two groups.
Step 3: We want to find a p-value for the gene. That means that we have to compute a test-statistic value and then decide
how usual or unusual it is.
Example: Suppose the collected (and normalized) microarray data of the gene At1g01000 looks like this:
(note…there are six values because this gene is spotted onto the array in six locations.)
We have six M-values for the gene At1g01000 in this table. First, compute the average of the six observations:
$\bar{x}$ = _________ (fill in the value here)
and the standard deviation s = _________________ (fill in value here) (see the standard deviation formula given earlier, or calculate in Excel or with your calculator).
Since we have six observations, n = 6. Now we can compute the value of the test statistic as:

$$t = \frac{\bar{x}}{\sqrt{s^2/n}} = \frac{1.15}{\sqrt{1.28^2/6}} = 2.08$$

Calculate t and fill in value here:___________
The number of degrees of freedom that describes the behavior of the test statistic in this case is df = 6 - 1 = 5.
To find the p-value, we have to find the percentage of cases in which the t-test statistic with df = 5 would take on more extreme values than the t-value we observed. Extreme values are the ones far away from zero:
In the past, these values had to be looked up in tables. Today, Excel and other software programs have them stored in their statistics package. The p-value can be found with the Excel command “=TDIST(2.08, 5, 2)”. In the example above, the exact p-value (red shaded tail area of the distribution) is 0.0921 or 9.21%.
To find a p-value using Excel, open an Excel spreadsheet, click on any empty cell and enter “=TDIST(absolute value of your test statistic, df, 2)”. The “2” stands for two-sided, which means that you want the red area in both tail ends. In this example the absolute value (no minus sign) of the test statistic is 2.08 and the number of degrees of freedom is df = 6 - 1 = 5.
Now calculate the p-value using the t-value you calculated above.
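
If Excel is not at hand, the two-sided tail area can be approximated with the Python standard library. The sketch below is not how Excel computes TDIST; it simply integrates the t-distribution density numerically with Simpson's rule, and the function names and the `steps` parameter are choices made for this illustration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p_value(t, df, steps=2000):
    """Area in both tails beyond |t|, via Simpson's rule (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_area = 2 * (h / 3) * total    # area between -|t| and +|t|
    return 1 - central_area

# Same inputs as the Excel command =TDIST(2.08, 5, 2) in the text
print(round(two_sided_p_value(2.08, 5), 4))  # 0.0921
```

The result agrees with the 9.21% quoted above, and because only the absolute value of t is used, a negative test statistic gives the same p-value.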
Step 4: What conclusion can we draw? The p-value is the probability of observing data as extreme or unusual as ours if the gene expression in the two groups were the same. Our p-value of 9.21% is quite large (bigger than 5%). That means that if there were no real difference in gene expression, random variation alone would produce observations at least this extreme almost 10% of the time. Hence, our data is nothing unusual and we cannot reject the null hypothesis (equal expression in both groups) for gene At1g01000.
Results of Experiment: To determine the p-values for the other genes spotted on the microarray you would repeat steps 1
- 4 above. This would provide us with a p-value for each gene on the array.
Gene name | p-value | Differentially expressed at level 5%?
At1g01000 | 0.0921 | ___________

QUESTIONS
MICROARRAYS
5 points
Do this #1:
The average is:_______________________
Do this #2: The corrected M-values are:
Corrected M-values
-0.29
Do this #3:
Write down the two hypotheses for your experiment:
Null Hypothesis:
Alternative Hypothesis:
First, compute the average of the six observations:
a) $\bar{x}$ = _________ (fill in the value here)
b) s = _________________ (fill in value here).
Since we have six observations, n = 6. Now we can compute the value of the test statistic as:

$$t = \frac{\bar{x}}{\sqrt{s^2/n}} = \frac{1.15}{\sqrt{1.28^2/6}} = 2.08$$

c) Calculate t (fill in value here)___________
d) Now calculate the p-value using the t-value you calculated above. _________________
e) Given your calculated p-value, is the gene differentially expressed between treatment and control?