Clarke and Zhu, Microarray data analysis
Supplementary material
Basic statistical analysis
Once the data normalization step has been accomplished, statistically relevant comparisons can
be made between arrays within an experimental data set. In experiments where there are no
true pairwise comparisons, such as a developmental or time-course series, normalizing
all arrays against a designated control sample can provide a useful frame of reference for
subsequent data analysis. Samples representing a default state, such as germinating seedlings
or time point 0, are frequently used as the designated control sample.
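As a minimal sketch of using a designated control sample as the frame of reference, each array can be expressed as a log2 ratio against the control. This assumes the intensities are positive and already normalized; the function name and the toy values are hypothetical and not taken from the paper.

```python
import math

def log2_ratios_vs_control(arrays, control_name):
    """Express every array as log2(sample / control), gene by gene.

    `arrays` maps a sample name to a list of positive, already normalized
    intensities (one value per gene, same gene order in every list).
    """
    control = arrays[control_name]
    return {
        name: [math.log2(value / ref) for value, ref in zip(values, control)]
        for name, values in arrays.items()
    }

# Hypothetical time course with time point 0 as the designated control.
time_course = {
    "t0": [100.0, 200.0, 50.0],
    "t1": [200.0, 200.0, 25.0],
    "t2": [400.0, 100.0, 50.0],
}
ratios = log2_ratios_vs_control(time_course, "t0")
print(ratios["t1"])  # [1.0, 0.0, -1.0]
```

The control itself maps to a ratio of 0 for every gene, so all other arrays are read as changes relative to the default state.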
Both parametric and non-parametric tests have been applied to identify differentially
expressed genes. The general difference between these tests is that the parametric tests rely
on actual values while the non-parametric Wilcoxon Rank Sum and Kruskal-Wallis tests rely only on value order.
The parametric t-test and ANOVA compare the mean of a probe set from one group to that
probe set’s mean from another group and return a probability value (p-value). This p-value
represents the chance of observing a difference in means as big as or bigger than the one
observed when in reality there is no difference between the group means. The non-parametric
Wilcoxon Rank Sum and Kruskal-Wallis tests rank the values from all groups together, lowest to
highest, and use the sum of the ranks for each group to return a p-value. This p-value
represents the probability of observing a difference in the sum of the ranks that is as or more
extreme than that observed if in reality there is no difference between the group medians. Note
that if there are only a small number of arrays in the study, there will be a very limited number of
possible rankings. As a result, in addition to suffering from low power, the Wilcoxon test 1)
may not discriminate between genes very well, because large numbers of genes will be assigned
the same p-value, and 2) may have a smallest possible p-value that is actually quite large. For
example, if we used a Wilcoxon Rank Sum test to compare 2 groups each with 3 replicates, the
smallest possible two-sided p-value would be 0.10 or 10%, because there are only 20 ways to
divide six ranks between the two groups.
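The limit on the smallest p-value can be checked by enumerating every possible split of the ranks. The sketch below is an exact two-sided rank sum test written for illustration only; it assumes there are no tied values, and the function name and data are hypothetical.

```python
from itertools import combinations

def exact_rank_sum_p(group_a, group_b):
    """Two-sided exact Wilcoxon rank sum p-value (assumes no tied values)."""
    pooled = sorted(group_a + group_b)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    n_a = len(group_a)
    observed = sum(rank[v] for v in group_a)
    expected = n_a * (len(pooled) + 1) / 2  # mean rank sum under the null
    # Enumerate every way the ranks could be split between the two groups.
    all_sums = [sum(c) for c in combinations(range(1, len(pooled) + 1), n_a)]
    extreme = sum(
        1 for s in all_sums if abs(s - expected) >= abs(observed - expected)
    )
    return extreme / len(all_sums)

# Perfectly separated groups give the most extreme ranking possible.
p = exact_rank_sum_p([1.2, 1.5, 1.9], [7.1, 8.4, 9.0])
print(p)  # 0.1
```

With 3 replicates per group there are only 20 possible splits, and the two most extreme ones (all low ranks or all high ranks in the first group) account for 2/20 = 0.10, no matter how large the actual difference in expression is.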
There is a great deal of literature devoted to identifying the most appropriate methods of
analyzing gene expression data (Cui and Churchill 2003, Yang et al, 2005, Wright and Simon
2003, Mansourian et al 2004) and it remains an area of active research. Even the seemingly
simple objective of identifying a list of differentially expressed candidate genes between two
conditions has no easy solution. The low levels of replication common in microarray studies
mean that, even if genuine biological replication is present, variance estimates are often
unreliable. Methods attempting to address this problem, such as SAM (Tusher et al, 2001),
which adds a constant fudge factor to the estimated standard deviation, or the Local Pooled
Error (LPE) test (Jain et al, 2003), which borrows strength across genes in order to estimate
variance, depend upon assumptions that may or may not be reasonable.
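To illustrate the idea of a constant fudge factor, the sketch below computes a SAM-style statistic in which a constant s0 is added to the gene's standard error. It is only a sketch: in SAM the constant is chosen from the data, whereas here s0 is simply passed in as an assumed value, and the function name and numbers are hypothetical.

```python
import math
from statistics import mean, variance

def moderated_d(group_a, group_b, s0):
    """SAM-style statistic: mean difference over (standard error + s0)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / (std_err + s0)

# A gene with a tiny fold change but, by chance, an even tinier variance.
low_variance_gene_a = [5.00, 5.01, 5.02]
low_variance_gene_b = [5.10, 5.11, 5.12]

d_plain = moderated_d(low_variance_gene_a, low_variance_gene_b, s0=0.0)
d_fudged = moderated_d(low_variance_gene_a, low_variance_gene_b, s0=0.1)
print(d_plain, d_fudged)
```

With s0 = 0 the statistic is an ordinary t-statistic and is inflated by the unreliably small variance estimate; adding s0 shrinks it, which is the intended effect and also the assumption that may or may not be reasonable for a given data set.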
Multiple testing correction. Multiple testing is an important issue because if we
conduct 50,000 tests at a significance level of 5% (p<0.05) on a 50,000 gene array, we
would expect about 2,500 genes to be identified as significant by chance alone, even if there
are no true differences (i.e. false positives). In general, standard statistical multiple
testing corrections such as Bonferroni are felt to be inappropriately stringent for gene expression
studies. This has led to the use of the False Discovery Rate (FDR) (Benjamini and Hochberg
1995, 2002; Tusher et al., 2001) as an alternative approach to multiple testing. The FDR is
defined as the expected proportion of false positives in the selected gene list; however, there
are many proposed methods of estimating it.
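As one concrete instance, the Benjamini and Hochberg step-up procedure can be sketched in a few lines. This is only one of the many proposed FDR methods; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return indices of tests declared significant by the BH step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its own threshold.
        if p_values[i] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

# Expected number of chance findings with no correction at all.
expected_false_positives = 50_000 * 0.05  # about 2,500

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
selected = benjamini_hochberg(p_values, fdr=0.05)
print(selected)  # [0, 1]
```

For these ten p-values, a Bonferroni cutoff of 0.05 / 10 = 0.005 would keep only the first test, while the step-up rule also keeps the second, illustrating why FDR control is regarded as less stringent.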
Unsupervised clustering methods
The most frequently used unsupervised clustering methods are discussed here.
Hierarchical clustering. Hierarchical clustering is used to create groups of clusters
based on relatedness, linked by branches that ultimately form a tree (Figure 6A). Similarity
between genes or gene groups is represented by the distance to their closest branch point.
Once a gene / gene group has been linked to its closest related gene / gene group, the pairing
is redefined as a new gene group (Eisen et al., 1998). There are three methods to determine
the value of a gene group when calculating similarity to another gene or gene group. Single-linkage
uses the closest value between gene groups, Complete-linkage uses the farthest value
and Average-linkage uses the mean of all genes in the group. Statistical studies have shown
that Single-linkage clusters are often worse than random associations while Complete-linkage
consistently generates stable clusters (Yeung et al., 2001; Gibbons and Roth, 2002). Genes
belonging to common sub-branches have the most similar expression patterns, and are therefore
said to cluster together or be co-regulated, while genes connected only at the highest level of branching
are the most dissimilar.
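The three linkage rules differ only in how the gene-to-gene distances between two groups are summarized. The sketch below assumes Euclidean distance and implements Average-linkage as the mean of all pairwise distances, which is one common definition and an assumption here; the function names and toy profiles are hypothetical.

```python
import math
from itertools import product

def group_distance(group_a, group_b, linkage):
    """Distance between two gene groups under a given linkage rule."""
    pairwise = [math.dist(p, q) for p, q in product(group_a, group_b)]
    if linkage == "single":      # closest pair of genes
        return min(pairwise)
    if linkage == "complete":    # farthest pair of genes
        return max(pairwise)
    if linkage == "average":     # mean over all pairs (assumed definition)
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")

# Two small gene groups, each gene described by a one-value profile.
group_a = [(0.0,), (1.0,)]
group_b = [(3.0,), (5.0,)]
for rule in ("single", "complete", "average"):
    print(rule, group_distance(group_a, group_b, rule))
# single 2.0
# complete 5.0
# average 3.5
```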
Hierarchical clustering establishes the structure of the data set by direct comparison and
grouping. Consequently, the directionality of gene groups can be randomly assigned, resulting in
branched clusters appearing chaotic or ‘flipped’ relative to closely related groups. This occurs
because similarity distances are determined by comparing 2 groups directly without information
from surrounding gene groups. Usually this does not matter when clustering discrete candidate
gene lists, but it becomes an issue as the size and complexity of the data set increase.
k-Means and the Self-Organizing Map (SOM) are methods of partition clustering that account for
complexity by using information from related gene groups. Both methods are iterative in that
distances and clusters are recalculated until maximum relatedness is achieved. However,
neither method is deterministic, so repeating the analysis will yield slightly different results.
k-Means. The k-means analysis requires a user-determined value (k) representing the
number of expected gene groupings or clusters. k cluster centers, or centroids, are
randomly generated on a two-dimensional representation of expression difference between two
groups of samples (Somogyi, 1999). Each iteration of the analysis assigns all gene profiles to
their closest centroid. The location of each centroid is then recalculated as the position that
minimizes the distance to all of its assigned gene profiles. Using
the recalculated centroids, gene profiles are reassigned to their new closest centroid. The
process is repeated until the assignments no longer change, so that every gene lies within the
cluster defined by its closest centroid. It is recommended that k approximate the number of conditions or variables being
tested (Figure 6B).
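The assign-and-recalculate loop can be sketched as follows. This is a plain k-means written for illustration, not the implementation used by any particular package; it assumes squared Euclidean distance and, as a simplification, starts the centroids at k randomly chosen gene profiles. The function name, the seed parameter and the toy data are hypothetical.

```python
import random

def k_means(profiles, k, seed=0, max_iter=100):
    """Plain k-means on lists of numbers; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(profiles, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each profile goes to its closest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in profiles
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Six hypothetical gene profiles forming two obvious groups.
profiles = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centroids, labels = k_means(profiles, k=2)
print(labels)
```

Changing the seed changes the starting centroids and can change which cluster receives which label, which is the non-deterministic behaviour noted above.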
SOM. The SOM analysis requires a user determined number of rows and columns to
create a grid of cluster centers, called nodes, overlaid on a two-dimensional representation of
expression difference between samples (Tamayo et al., 1999). Each iteration of the SOM
analysis samples every gene individually and moves the closest node towards that gene. All
neighboring nodes connected on the grid to the moving node are also moved towards that gene
depending on their distance from the moving node. The process continues until the coordinates
of the nodes no longer change with each iteration. The analysis provides value in that each
node is defined by its relation to neighboring nodes. Consequently, each gene cluster has a
measure of similarity to related gene clusters. The SOM can be represented as linked groups of
similar genes or can be used as a weighted value for a Hierarchical cluster (Figure 6C).
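A very small SOM can be sketched to show how the closest node and its grid neighbors are pulled towards each gene. Several details here are assumptions made for illustration: nodes start at randomly chosen gene profiles, the learning rate and neighborhood width decay linearly, the neighborhood is Gaussian, and training runs for a fixed number of passes rather than until the nodes stop moving. All names and values are hypothetical.

```python
import math
import random

def train_som(profiles, rows, cols, epochs=50, seed=0):
    """Tiny SOM sketch: returns a dict mapping grid position to node weights."""
    rng = random.Random(seed)
    nodes = {
        (r, c): list(rng.choice(profiles)) for r in range(rows) for c in range(cols)
    }
    for epoch in range(epochs):
        frac = 1 - epoch / epochs                   # decays from 1 towards 0
        rate = 0.5 * frac                           # learning rate
        radius = max(rows, cols) / 2 * frac + 0.5   # neighborhood width
        for p in rng.sample(profiles, len(profiles)):  # visit every gene once
            best = min(nodes, key=lambda n: math.dist(nodes[n], p))
            for pos, weights in nodes.items():
                # Nodes closer to the winner on the grid are pulled harder.
                grid_d = math.dist(pos, best)
                pull = rate * math.exp(-(grid_d ** 2) / (2 * radius ** 2))
                for i in range(len(weights)):
                    weights[i] += pull * (p[i] - weights[i])
    return nodes

profiles = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
nodes = train_som(profiles, rows=1, cols=2)
# Each gene is then assigned to the cluster of its closest node.
assignments = [min(nodes, key=lambda n: math.dist(nodes[n], p)) for p in profiles]
print(assignments)
```

Because neighboring nodes are moved together, clusters that sit next to each other on the grid tend to hold similar expression profiles.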
PCA. Not a true clustering method, Principal Component Analysis (PCA) is a
decomposition technique that reduces data into major themes or trends known as components
(Jolliffe, 2002). Each component, also called an eigenvector, contains those expression profile
patterns that account for a portion of the variability within the entire dataset. The 1st component
accounts for the most variation, the 2nd component for the second most variation and so forth.
For example, in a standard plant microarray experiment the 1st component might be the
expression pattern that best describes the variation between leaf and root, followed by genotype,
followed by developmental stage, followed by experimental treatment and so on. PCA can be used
to help find the expression pattern that best describes the experimental conditions being studied
or filter sources of unwanted variation (Figure 6D).
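To make the idea of a component concrete, the sketch below finds the 1st component of a small data set by power iteration on the covariance matrix and reports the share of total variance it explains. Power iteration is used here only because it needs no external library; it is an assumed stand-in for a full eigen-decomposition, and the function name and data are hypothetical.

```python
import math

def first_component(rows, iterations=200):
    """Leading principal component of `rows` (samples x variables)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    vec = [1.0] + [0.5] * (d - 1)  # arbitrary non-zero starting vector
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    # Variance along the component, as a share of the total variance.
    eigenvalue = sum(
        vec[i] * sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)
    )
    explained = eigenvalue / sum(cov[i][i] for i in range(d))
    return vec, explained

# Toy data in which the second variable is exactly twice the first.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
component, explained = first_component(data)
print(component, explained)
```

In this toy case all of the variation lies along a single direction, so the 1st component points along (1, 2) and accounts for essentially all of the variance; real expression data would spread the variance over several components.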