
Clarke and Zhu, Microarray data analysis
Supplementary material

Basic statistical analysis

Once the data normalization step has been completed, statistically relevant comparisons can be made between arrays within an experimental data set. In experiments where there are no true pairwise comparisons, such as a developmental or time-course series, normalizing all arrays against a designated control sample can provide a useful frame of reference for subsequent data analysis. Samples representing a default state, such as germinating seedlings or time point 0, are frequently used as the designated control sample.

Both parametric and non-parametric tests have been applied to identify differentially expressed genes. The general difference between these tests is that the parametric tests (the t-test and ANOVA) rely on actual values, while the non-parametric tests (the Wilcoxon Rank Sum and Kruskal-Wallis tests) rely on value order. The t-test and ANOVA compare the mean of a probe set in one group to that probe set’s mean in another group and return a probability value (p-value). This p-value represents the probability of observing a difference in means as large as or larger than the one observed when in reality there is no difference between the group means. The Wilcoxon Rank Sum and Kruskal-Wallis tests rank the values from all groups together, lowest to highest, and use the sum of the ranks for each group to return a p-value. This p-value represents the probability of observing a difference in the sum of the ranks that is as extreme as or more extreme than that observed if in reality there is no difference between the group medians. Note that if there are a small number of arrays in the study, there will be a very limited number of possible rankings.
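
To make the limited number of rankings concrete, the following minimal sketch enumerates every way the pooled ranks can be split between two groups and derives an exact two-sided rank-sum p-value. The function name and the expression values are hypothetical, and the sketch assumes there are no tied values.

```python
from itertools import combinations

def exact_rank_sum_p(group_a, group_b):
    """Exact two-sided Wilcoxon rank-sum p-value by enumeration.

    Assumes no tied values. Returns (p_value, number_of_possible_rankings).
    """
    pooled = sorted(group_a + group_b)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    n_a, n = len(group_a), len(pooled)
    observed = sum(rank[v] for v in group_a)
    expected = n_a * (n + 1) / 2  # mean rank sum when there is no difference
    # Every possible set of ranks that group A could have received
    sums = [sum(c) for c in combinations(range(1, n + 1), n_a)]
    extreme = sum(1 for s in sums
                  if abs(s - expected) >= abs(observed - expected))
    return extreme / len(sums), len(sums)

# Made-up values: the two groups are completely separated
p, n_rankings = exact_rank_sum_p([1.2, 1.5, 1.9], [3.1, 3.4, 4.0])
print(n_rankings, p)
```

With three arrays per group there are only 20 possible rankings, so even a complete separation of the groups cannot produce a two-sided p-value below 2/20.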
Having so few possible rankings means that, as well as suffering from low power, the Wilcoxon test 1) may not discriminate between genes very well, because large numbers of genes will be assigned the same p-value, and 2) may have a smallest possible p-value that is actually quite large. For example, if we used a two-sided Wilcoxon Rank Sum test to compare 2 groups each with 3 replicates, the smallest possible p-value would be 0.10 or 10%.

There is a great deal of literature devoted to identifying the most appropriate methods of analyzing gene expression data (Cui and Churchill, 2003; Yang et al., 2005; Wright and Simon, 2003; Mansourian et al., 2004) and it remains an area of active research. Even the seemingly simple objective of identifying a list of differentially expressed candidate genes between two conditions has no easy solution. The low levels of replication common in microarray studies mean that, even if genuine biological replication is present, variance estimates are often unreliable. Methods attempting to address this problem, such as SAM (Tusher et al., 2001), which adds a constant fudge factor to the estimated standard deviation, or the Local Pooled Error (LPE) test (Jain et al., 2003), which borrows strength across genes in order to estimate variance, depend upon assumptions that may or may not be reasonable.

Multiple testing correction. Multiple testing is an important issue because if we conduct 50,000 tests at a significance level of 5% (p<0.05) on a 50,000 gene array, we would expect about 2,500 genes (50,000 × 0.05) to be identified as significant by chance alone, even if there are no true differences (i.e. false positives). In general, standard statistical multiple testing corrections such as Bonferroni are felt to be inappropriately stringent for gene expression studies. This has led to the use of the False Discovery Rate (FDR) (Benjamini and Hochberg 1995, 2002; Tusher et al., 2001) as an alternative approach to multiple testing.
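
As an illustration of how an FDR-based cutoff differs from a Bonferroni cutoff, the sketch below implements the step-up procedure of Benjamini and Hochberg (1995) next to a plain Bonferroni threshold. The function name and the eight p-values are hypothetical, and this is only one of several ways of controlling or estimating the FDR.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of tests declared significant by the Benjamini-Hochberg
    step-up procedure at the requested FDR level."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its own threshold
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

# Hypothetical p-values for eight genes
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]

bh_hits = benjamini_hochberg(p_values, fdr=0.05)
bonferroni_hits = [i for i, p in enumerate(p_values)
                   if p <= 0.05 / len(p_values)]
print("BH:", bh_hits, "Bonferroni:", bonferroni_hits)
```

On these made-up values the step-up procedure retains two genes while Bonferroni retains one, which reflects the difference in stringency described in the text.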
The FDR is defined as the expected proportion of false positives in the selected gene list; however, many methods have been proposed for estimating it.

Unsupervised clustering methods

The most frequently used unsupervised clustering methods are discussed here.

Hierarchical clustering. Hierarchical clustering is used to create groups of clusters based on relatedness, linked by branches that ultimately form a tree (Figure 6A). Similarity between genes or gene groups is represented by the distance to their closest branch point. Once a gene or gene group has been linked to its most closely related gene or gene group, the pairing is redefined as a new gene group (Eisen et al., 1998). There are three methods to determine the value of a gene group when calculating similarity to another gene or gene group. Single-linkage uses the closest value between gene groups, Complete-linkage uses the farthest value and Average-linkage uses the mean of all genes in the group. Statistical studies have shown that Single-linkage clusters are often worse than random associations, while Complete-linkage consistently generates a stable cluster (Yeung et al., 2001; Gibbons and Roth, 2002). Genes belonging to common sub-branches have the most similar expression patterns, and are therefore said to cluster together or be co-regulated, while genes connected at the highest level of branching are the most dissimilar.

Hierarchical clustering establishes the structure of the data set by direct comparison and grouping. Consequently, the orientation of gene groups within the tree can be assigned arbitrarily, so that branched clusters appear chaotic or ‘flipped’ relative to closely related groups. This occurs because similarity distances are determined by comparing 2 groups directly without information from surrounding gene groups.
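
Returning to the three linkage rules described above, the following minimal sketch shows how each rule turns member-to-member distances into a single distance between two gene groups, and how the closest groups are merged repeatedly. Several choices here are assumptions made for illustration: Euclidean distance, Average-linkage taken as the mean of all pairwise distances, and the hypothetical names `group_distance`, `agglomerate` and `g1` to `g4`.

```python
from itertools import combinations
from math import dist

def group_distance(group_a, group_b, linkage):
    """Distance between two gene groups under a given linkage rule."""
    pairwise = [dist(a, b) for a in group_a for b in group_b]
    if linkage == "single":       # closest pair of members
        return min(pairwise)
    if linkage == "complete":     # farthest pair of members
        return max(pairwise)
    if linkage == "average":      # mean over all member pairs (assumed definition)
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(profiles, linkage="average"):
    """Repeatedly merge the two closest groups; return the merge order."""
    groups = {name: [profile] for name, profile in profiles.items()}
    merges = []
    while len(groups) > 1:
        a, b = min(
            combinations(groups, 2),
            key=lambda pair: group_distance(groups[pair[0]], groups[pair[1]], linkage),
        )
        merges.append((a, b))
        # The merged pair is redefined as a new gene group
        groups[f"({a},{b})"] = groups.pop(a) + groups.pop(b)
    return merges

# Made-up two-condition expression profiles
profiles = {"g1": (0.0, 0.0), "g2": (0.0, 1.0), "g3": (5.0, 0.0), "g4": (5.0, 1.5)}
merges = agglomerate(profiles, linkage="complete")
print(merges)

left = [profiles["g1"], profiles["g2"]]
right = [profiles["g3"], profiles["g4"]]
for rule in ("single", "average", "complete"):
    print(rule, round(group_distance(left, right, rule), 3))
```

The order in which the two members of each merge are listed carries no meaning, which is one way to see why the left-to-right orientation of branches in the resulting tree is arbitrary.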
The arbitrary orientation of branches usually does not matter when clustering discrete candidate gene lists, but it becomes an issue as the size and complexity of the data set increase.

The k-means and Self Organizing Map (SOM) methods are forms of partition clustering that account for complexity by using information from related gene groups. Both methods are iterative in that distances and clusters are recalculated until maximum relatedness is achieved. However, neither method is deterministic, so repeating the analysis can yield slightly different results.

k-Means. The k-means analysis requires a user-determined value (k) representing the number of expected gene groupings or clusters. A total of k cluster centers, or centroids, are randomly generated on a two-dimensional representation of expression difference between two groups of samples (Somogyi, 1999). Each iteration of the analysis assigns all gene profiles to their closest centroid. The location of each centroid is then recalculated, based on all of its assigned gene profiles, to a position that minimizes the overall distance to those profiles. Using the recalculated centroids, gene profiles are reassigned to their new closest centroid. The process is repeated until all genes are contained within the cluster defined by their closest centroid, that is, until the assignments no longer change. It is recommended that k approximate the number of conditions or variables being tested (Figure 6B).

SOM. The SOM analysis requires a user-determined number of rows and columns to create a grid of cluster centers, called nodes, overlaid on a two-dimensional representation of expression difference between samples (Tamayo et al., 1999). Each iteration of the SOM analysis samples every gene individually and moves the closest node towards that gene. All neighboring nodes connected on the grid to the moving node are also moved towards that gene, by an amount that depends on their distance from the moving node.
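
One iteration of the SOM update just described can be sketched as follows. This is a simplified illustration rather than the implementation of Tamayo et al. (1999): the Gaussian falloff with grid distance, the fixed `learning_rate` and `radius` values, and the name `som_pass` are all assumptions, and practical implementations usually shrink the learning rate and radius over successive iterations.

```python
from math import dist, exp

def som_pass(nodes, genes, learning_rate=0.5, radius=1.0):
    """One pass over all genes.

    `nodes` maps a (row, col) grid position to that node's coordinates
    in expression space; the mapping is updated in place and returned.
    """
    for gene in genes:
        # The node whose coordinates are closest to this gene
        winner = min(nodes, key=lambda pos: dist(nodes[pos], gene))
        for pos, coords in nodes.items():
            grid_gap = dist(pos, winner)  # distance measured on the grid
            # Assumed Gaussian falloff: full pull for the winner, less for neighbors
            pull = learning_rate * exp(-(grid_gap ** 2) / (2 * radius ** 2))
            nodes[pos] = tuple(c + pull * (g - c) for c, g in zip(coords, gene))
    return nodes

# A 1 x 3 grid of nodes and a single made-up gene profile
nodes = {(0, 0): (0.0, 0.0), (0, 1): (1.0, 1.0), (0, 2): (2.0, 2.0)}
som_pass(nodes, [(2.0, 3.0)])
for pos, coords in nodes.items():
    print(pos, tuple(round(c, 3) for c in coords))
```

In this example the node at grid position (0, 2) is closest to the gene and moves the most, while the nodes further away on the grid move progressively less.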
The process continues until the coordinates of the nodes no longer change with each iteration. The analysis provides value in that each node is defined by its relation to neighboring nodes. Consequently, each gene cluster has a measure of similarity to related gene clusters. The SOM can be represented as linked groups of similar genes or can be used as a weighted value for a hierarchical cluster (Figure 6C).

PCA. Not a true clustering method, Principal Component Analysis (PCA) is a decomposition technique that reduces data into major themes or trends known as components (Jolliffe, 2002). Each component, also called an eigenvector, contains those expression profile patterns that account for a portion of the variability within the entire dataset. The 1st component accounts for the most variation, the 2nd component for the second most variation and so forth. For example, in a standard plant microarray experiment the 1st component might be the expression pattern that best describes the variation between leaf and root, followed by genotype, followed by developmental stage, followed by experimental treatment and so on. PCA can be used to help find the expression pattern that best describes the experimental conditions being studied or to filter out sources of unwanted variation (Figure 6D).
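
The idea that the 1st component captures the largest share of variation can be sketched with a small power iteration on the covariance matrix. This is an illustrative approach only, not necessarily how any particular microarray package computes PCA; the function name and the data are hypothetical, and the starting direction is assumed not to be orthogonal to the leading component.

```python
from math import sqrt

def first_component(samples, iterations=200):
    """Leading principal component of `samples` (rows = observations,
    columns = variables) and the fraction of total variance it explains."""
    n, d = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in samples]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    vec = [1.0] * d  # arbitrary starting direction (an assumption)
    for _ in range(iterations):
        # Multiply by the covariance matrix, then rescale to unit length
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    variance = sum(vec[i] * cov[i][j] * vec[j]
                   for i in range(d) for j in range(d))
    total = sum(cov[i][i] for i in range(d))
    return vec, variance / total

# Made-up data in which the second variable is roughly twice the first
samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
component, explained = first_component(samples)
print([round(x, 3) for x in component], round(explained, 4))
```

Because the two variables in this example move together, a single component accounts for nearly all of the variation. Subsequent components would be found in the same way after removing the variation already explained.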
