Extracting binary signals from microarray time-course data
Debashis Sahoo1, David L. Dill2, Rob Tibshirani3 and Sylvia K. Plevritis4
1 Department of Electrical Engineering, 2 Department of Computer Science, 3 Department of Radiology and 4 Department of Health Research and Policy and Department of Statistics, Stanford University
Roli Shrivastava

Introduction
• Problem Statement
– To identify up- and down-regulated genes
– To identify the time of transition
• Experimental Technique
– Microarray (tens of thousands of distinct probes on an array, accomplishing the equivalent number of genetic tests in parallel)
• Computational Technique
– A tool called StepMiner to extract biologically meaningful results from large amounts of data

Types of Transitions
1. One Step
2. Two Step
3. Genes for which the one- or two-step patterns do not fit appreciably better than a constant mean value (the null hypothesis).

Fitting One or Two-Step Function
• F1 statistic: computes how well the one-step model fits the data
• F2 statistic: computes how well the two-step model fits the data
• F12 statistic: compares the fit of the one-step model and the two-step model on the same data
• P-value: a low P-value represents a good fit of the model to the data
Procedure:
1. Calculate the F statistic for the model and data set
2. Calculate the P-value
3. If P < Pthreshold, the model fits; if P > Pthreshold, the model does not fit (Pthreshold = 0.05)

StepMiner Algorithm
• One-step: one-step fits the data AND one-step fits better than two-step
• Two-step: two-step fits the data AND one-step does not fit it
• Neither: neither one-step nor two-step fits the data

Comparison of 4 Algorithms
[Figure: comparison of 4 algorithms, including the StepMiner algorithm]
Step height = 5σ. Number of time points = 15. A total of 2000 random, 2000 one-step and 2000 two-step data series with random step positions.
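The one-step fit and F statistic from the "Fitting One or Two-Step Function" slide can be sketched roughly as follows. This is a minimal illustration, not StepMiner's actual code: the function names are hypothetical, the fit is a plain least-squares search over step positions, and the parameter count of 3 for a one-step model (two levels plus the step position) is an assumption made here for the degrees of freedom.

```python
def sse(values):
    """Sum of squared deviations of a segment from its own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_one_step(y):
    """Try every step position and keep the least-squares best.

    Returns (step_index, sse_step); the step lies between
    y[step_index - 1] and y[step_index].
    """
    best = None
    for i in range(1, len(y)):
        err = sse(y[:i]) + sse(y[i:])
        if best is None or err < best[1]:
            best = (i, err)
    return best


def f_statistic(y, sse_model, n_params=3):
    """F statistic of a fitted step model against the constant-mean null.

    n_params = 3 is an assumption for a one-step model
    (two levels plus the step position).
    """
    n = len(y)
    if n <= n_params:
        raise ValueError("need more time points than model parameters")
    if sse_model == 0.0:
        return float("inf")  # perfect fit
    ssr = sse(y) - sse_model  # variation explained by the step
    return (ssr / (n_params - 1)) / (sse_model / (n - n_params))
```

Turning F into a P-value requires the survival function of the F distribution; if SciPy is available, `scipy.stats.f.sf(F, n_params - 1, n - n_params)` is one way to obtain it before comparing against Pthreshold = 0.05.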
Generation of Simulated Data
• Microarray data with 15 non-uniform time points
• 4000 genes with 2000 one-step and 2000 two-step patterns
• Gaussian noise was added to the above data
• P-value threshold of 0.05 was used

Results of Simulated Data - I
• σ is the standard deviation of the noise
• Step position is fixed at 5 for one-step
• Step positions are fixed at 5 and 9 for two-step
• The greater the step height, the easier the identification

Results of Simulated Data - II
• σ is the standard deviation of the noise
• Random step positions
• Small reduction in accuracy
• More matches occur if all constant segments in a curve have several time points.
• It is desirable to design experiments so that there are several points before the first interesting transition and after the last interesting transition.

Results of Simulated Data - III
• Shows sensitivity to P-value threshold and number of time points
• Random step position and step height of 5σ
• Two-step signals require more time points than one-step signals
• Matches increase as the P-value threshold is raised, but at the cost of a higher False Discovery Rate

Results of Simulated Data - IV
• Shows sensitivity to spacing between steps
• For 15 time points, the first step is fixed at position 4
• A spacing of at least 3 time points is required when step height is > 3σ
• Steps are required to be placed at least 3 time points from the end points

Diauxic Shift
• In the initial phases of a growing batch culture, yeast prefers to metabolize glucose and produce ethanol even when oxygen is abundant.
• When the glucose is exhausted, cells undergo a “diauxic shift,” in which they switch abruptly to an oxidative metabolism. This pathway allows the oxidation of the accumulated fermentation products and is highly efficient as a mechanism for generating ATP.
Brauer et al., Mol Biol Cell. 2005 May; 16(5): 2503–2517
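As a rough illustration of the setup under "Generation of Simulated Data", the sketch below builds one-step and two-step signals with added Gaussian noise. The function names are hypothetical, time points are treated as plain indices (the non-uniform spacing is ignored), and the two-step shape (rise, then return to baseline) is an assumption; the slides do not specify it.

```python
import random


def one_step_signal(n_points, step_pos, height, sigma, rng):
    """Baseline 0 before step_pos, `height` from step_pos on, plus noise."""
    return [
        (height if t >= step_pos else 0.0) + rng.gauss(0.0, sigma)
        for t in range(n_points)
    ]


def two_step_signal(n_points, first, second, height, sigma, rng):
    """Rises at `first`, returns to baseline at `second` (assumed shape)."""
    return [
        (height if first <= t < second else 0.0) + rng.gauss(0.0, sigma)
        for t in range(n_points)
    ]


# Example matching the fixed-position case: 15 time points, step height
# of 5 sigma, steps at 5 (one-step) and at 5 and 9 (two-step).
rng = random.Random(0)
sigma = 1.0
one_step = one_step_signal(15, 5, 5 * sigma, sigma, rng)
two_step = two_step_signal(15, 5, 9, 5 * sigma, sigma, rng)
```

Expressing the step height as a multiple of sigma keeps the signal-to-noise ratio explicit, which is the quantity the simulated-data results vary.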
Analysis of Experimental Data
[Figure: fitting functions for 3 genes]
• 2284 genes with diauxic shift
• 1088 were matched with a one-step transition
• 267 were matched with two-step transitions
• 929 did not match anything
• The heat map shows two transitions at 8.25 and 9.25 h

Heat Maps
[Figure: heat maps. Panels: "Same data reanalyzed using StepMiner" and "Analysis by Brauer et al."]

Comparison With Brauer et al.'s Results
• The GO annotations and FDR-corrected P-values for the clusters reported in Brauer et al. were recomputed with the latest yeast gene annotations from the Gene Ontology Consortium website
• The table shows the P-values from GO-Term Finder as well as StepMiner.

Table for Comparison
[Table: P-values from GO-Term Finder and StepMiner]

Results Of Comparison
• The annotations that had the lowest P-values in Brauer et al. had even lower P-values in the StepMiner groups.
• In most cases, the P-values in the reanalysis are lower than Brauer et al.'s, which implies that grouping by time of change is at least as effective as hierarchical clustering at identifying relevant genes.
• GO annotations are obtained fully automatically using StepMiner – it is not necessary to select interesting clusters manually.
• The clusters that have no P-values from StepMiner were “less interpretable in terms of diauxic shift”, in the words of Brauer et al.

Comparison of StepMiner to Other Tools
• Hierarchical clustering: finds clusters that transition at the same time point
– Manual search required to find transitions
• SAM: finds transitions by looking for significant differences in average expression before and after a specified time point.
– However, many of the genes selected by this method do not, in fact, have a transition at the specified time point.
• EDGE: identifies genes whose expression changes systematically over time and differs significantly from the mean expression over time.
– This method does not directly provide the direction and position of significant change.

Hierarchical vs. StepMiner
[Figure: hierarchical clustering vs. StepMiner]
• Cluster that transitions at 3 hours
• StepMiner clearly shows other transition times

Comparison of StepMiner to Other Tools - STEM
• Provides model profiles and their significance values
• But the profiles do not look like step functions and are therefore not helpful for locating transitions

Strengths and Limitations
• Strengths
– Easy to understand
– Few parameters
– Transitions can be more interesting biologically
– Very fast: < 15 s for 15 microarrays of 40000 genes
– Can deal with missing measurements
– Provides statistical parameters such as P-value, FDR, etc.
• Limitations
– Binary model
– There can be other cases, e.g., the transition is not a step
– Short and long time courses are not handled well; most appropriate for 10-30 time measurements.

Post StepMiner Analysis
• Once StepMiner is run, genes undergoing binary transitions can easily be partitioned into sets based on the number, direction, and timing of transitions.
• These sets can be merged at the user’s discretion (e.g., the set of one-step genes that rise at time 3 could be merged with the two-step genes that rise at time 3), or can be further subdivided.

BACK UP SLIDES

Replication vs. Resolution
• For accuracy, it is better to take more frequent measurements than to get replicates
• It comes at a cost of correctly identifying the kind of step
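The partitioning and merging described under "Post StepMiner Analysis" can be sketched as below. The input format is hypothetical (it is not StepMiner's output format): each gene maps to a list of `(time, direction)` pairs, one pair per fitted step, and the gene names are placeholders.

```python
from collections import defaultdict


def partition_by_transition(calls):
    """Group genes by the number, direction, and timing of their steps.

    `calls` maps gene name -> list of (time, direction) pairs
    (an assumed, illustrative format).
    """
    groups = defaultdict(list)
    for gene, steps in calls.items():
        # The tuple of steps encodes count, directions, and times at once.
        groups[tuple(steps)].append(gene)
    return {key: sorted(genes) for key, genes in groups.items()}


def genes_rising_at(calls, time):
    """Merge across step counts: every gene with an 'up' step at `time`."""
    return sorted(
        gene
        for gene, steps in calls.items()
        if any(t == time and d == "up" for t, d in steps)
    )


calls = {
    "geneA": [(3, "up")],               # one-step, rises at time 3
    "geneB": [(3, "up"), (7, "down")],  # two-step, rises at 3, falls at 7
    "geneC": [(5, "down")],             # one-step, falls at time 5
}
groups = partition_by_transition(calls)
merged = genes_rising_at(calls, 3)  # ['geneA', 'geneB']
```

Here `merged` corresponds to the example on the slide: the one-step genes that rise at time 3 combined with the two-step genes that rise at time 3.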
