Visualization and Microarray
• Complement to numerical analysis
• Offers insightful information
• Detects the structure of the dataset
• Early / late stage of data mining
• Challenges of Microarray Visualization
  – High dimensionality
  – Large data size
  – Intuitive layout
  – Low time complexity

University at Buffalo, The State University of New York

An Example: Early Stage

General Approaches
• Global Visualizations
  – Encode each dimension uniformly by the same visual cue
  – Example: parallel coordinates

General Approaches, con't
• Optimal Visualizations
  – Estimate the parameters and assess the fit of various spatial distance models for proximity data
  – Multidimensional scaling (MDS)
    • Sammon's mapping: topology preservation. Two samples that are close to each other have to stay close when projected.

Sammon's mapping
• Sammon's mapping is a classical case of MDS
• MDS optimizes a 2-D presentation to preserve distances in the original N-dimensional space
• Sammon's mapping iteratively minimizes
  $E = \frac{1}{\sum_{i<j} d^*_{ij}} \sum_{i<j} \frac{(d^*_{ij} - d_{ij})^2}{d^*_{ij}}$
  where $d^*_{ij}$ is the distance between points i and j in the N-dimensional space and $d_{ij}$ is the distance between points i and j in the visualization.

2D to 1D

A method for achieving this projection
1. D1, D2 and D3 (the interpoint distances in the higher dimensional space) are calculated.
2. P1', P2' and P3' are generated randomly in the lower dimensional space.
3. The mapping error, E, is calculated from all the interpoint distances in the lower dimensional space.
4. The gradient, which gives the direction that minimizes the error, is calculated.
5. The points in the lower dimensional space are moved in the direction given by the gradient.
6. Steps 3 to 5 are repeated until E is below a given limit.

Sammon's mapping, con't
• Some drawbacks
  – Computationally intensive, time complexity $O(n^2)$
  – How to determine the best initialization
  – No user interaction is permitted
  – Adding new data points requires rerunning the process to obtain a new minimized projection
  – Information loss

General Approaches, con't
• Projective Visualizations
  – Use projection functions to achieve a low dimensional display
  – Radial Visualizations
    • RadViz
    • Star Coordinates
    • VizStruct

Comparison of Approaches
Approach                   Advantages                                            Disadvantages
Global visualization       Displays all dimensional information, no computation  Severe overlapping, large space to display
Optimal visualization      Achieves optimal result, sound theoretical basis      Lacks user interaction, heavy computation
Projection visualization   Concise display, little computation                   Lacks rigorous proof, may not be optimal

Challenges of Microarray Visualization
• High dimensionality
• Large data size
• Intuitive layout
• Low time complexity

Density or Heat Plots
• Widely used with arrays
• Works well only for structured data
• Quantitative information is lost
• Gets easily cluttered
[Figure: heat plot; axes: Genes vs. Sample (Before IFN, After IFN); color scale from 0 to 1 (Increased)]

TreeView Visualization

Principal component analysis
PCA:
• Linear projection of data onto the major principal components defined by the eigenvectors of the covariance matrix.
• PCA is also used for reducing the dimensionality of the data.
• Criterion to be minimised: the square of the distance between the original and projected data.
This is fulfilled by the Karhunen-Loève transformation $x \to Px$, where P is composed of the eigenvectors of the covariance matrix
$C = \frac{1}{n-1} \sum_i (x_i - \mu)(x_i - \mu)^T$
and $\mu$ is the sample mean.
Example: Leukemia data sets by Golub et al.: classification of ALL and AML

Multi-linear scaling
Sammon's mapping:
• Non-linear multidimensional scaling methods such as Sammon's mapping aim to optimally conserve the distances of a higher dimensional space in a 2- or 3-dimensional space.
• Mathematically: minimisation of the error function E by the steepest descent method:
  $E = \frac{1}{\sum_{i<j}^{N} D_{ij}} \sum_{i<j}^{N} \frac{(D_{ij} - d_{ij})^2}{D_{ij}}$
  where $D_{ij}$ is the distance between points i and j in the original space and $d_{ij}$ is their distance in the projection.
Example: DLBCL prognosis, cured vs. fatal cases

Our Visualization Approach
[Diagram labels: Gene Space, Fourier Harmonic Projection, Sample Space]

Geometric Interpretation
[Figure: N-dimensional space mapped to two-dimensional space]

An Example of the Mapping
P = [a, a, …, a] -> ?

First Fourier Harmonic Projection
[Figure: N-dimensional space mapped to two-dimensional space]

Analytical Properties

Scaling and Transpose Property
[Figure panels: Original, Scaling, Shift, Transpose]

Time Shifting Property

Visual Exploration Framework
• Explorative Visualization: sample space
• Confirmative Visualization: gene space

VizStruct Architecture
[Diagram labels: Internet, Intranet, Web Browser, Web Server, Matlab Web Server, Matlab Applications, Matlab Libraries, Client]

VizStruct User Interface

VizStruct User Interface (3)
[Figure panels: Cartesian plot, Polar plot]

VizStruct User Interface (2)
[Figure panels: EM mixture, Density contour]

Sample Classification

Binary Classification
• Binary classification: two sample classes
• Evaluation: hold out and cross validation
• Leukemia-A: 72 samples with 7129 genes; 38 (27+11) training, 34 (20+14) testing; hold out evaluation
• Multiple Sclerosis: 44 samples, 4132 genes; MS_IFN (28), MS_CON (30); cross validation evaluation

Multiple Classification
• Breast Cancer: 22 samples with 3226 genes; 3 classes: BRCA1 (7), BRCA2 (8), Sporadic (7); cross validation evaluation
• SRBCT: 88 samples with 2308 genes; 4 classes: RMS, BL, NB, EWS; 63 training and 25 testing

Classification Summary

Temporal Pattern (1)
[Figure panels: Nortriptyline, 10-OH Nortriptyline]

Temporal Pattern (2)
• The rat kidney data set of Stuart et al. (2001) contains 873 genes at 7 time points during kidney development
• There are 5 patterns, or gene groups, classified by the authors
• Parallel coordinates show that the actual data comply with the profiles, but with some noise
[Figure: idealized temporal gene expression profiles; parallel coordinates for each of the gene groups]

Temporal Pattern (3)
[Figure: the first Fourier harmonic projection, annotated as follows]
• Genes are very similar except at the last time point
• Genes having a relatively steady increase in expression throughout development
• Genes are somewhat symmetric about the middle time point, i.e., they are transposes of each other
• Genes having very high relative levels of expression in early development

VizStruct vs. Sammon's Mapping
[Figure: two scatter plots of numbered sample points, panels "VizStruct" and "Sammon's Mapping"; axes: Real Part of F1(x[n]) (horizontal), Imaginary Part of F1(x[n]) (vertical)]
• The VizStruct layout is similar to that of Sammon's mapping

VizStruct - Dimension Tour
• Interactively adjust dimension parameters
• Manually or automatically
• May cause false clusters to break
• Create dynamic visualization

Visualized Results for a Time Series Data Set

Interrelated Dimensional Clustering
The approach is applied to classifying multiple-sclerosis patients and IFN-drug treated patients.
– (A) Shows the original distribution of the 28 samples. Each point represents a sample, mapped from the intensity vector of the sample's 4132 genes.
– (B) Shows the distribution of the 28 samples on 2015 genes.
– (C) Shows the distribution of the 28 samples on 312 genes.
– (D) Shows the distribution of the same 28 samples after using our approach, which reduces the 4132 genes to 96 genes.

References
• Li Zhang, Aidong Zhang, and Murali Ramanathan. VizStruct: Exploratory Visualization for Gene Expression Profiling. Bioinformatics, 20: 85-92, 2004.
• Li Zhang, Chun Tang, Yuqing Song, Aidong Zhang, and Murali Ramanathan. VizCluster and Its Application on Clustering Gene Expression Data. International Journal of Distributed and Parallel Database, 13(1): 73-97, 2003.
• Li Zhang, Aidong Zhang, and Murali Ramanathan. Enhanced Visualization of Time Series through Higher Fourier Harmonics. In Proceedings of BIOKDD 2003, Washington DC, August 2003, pp. 49-56.
• Li Zhang, Aidong Zhang, and Murali Ramanathan. Fourier Harmonic Approach for Visualizing Temporal Patterns of Gene Expression Data. In Proceedings of the IEEE Computer Society Bioinformatics Conference (CSB 2003), Stanford, CA, August 2003, pp. 131-141.
• Li Zhang, Aidong Zhang, and Murali Ramanathan. Visualized Classification of Multiple Sample Types. In Proceedings of BIOKDD 2002, Edmonton, Alberta, Canada, July 2002, pp. 55-62.
• Li Zhang, Chun Tang, Yong Shi, Yuqing Song, Aidong Zhang, and Murali Ramanathan. VizCluster: An Interactive Visualization Approach to Cluster Analysis and Its Application on Microarray Data. In Proceedings of the Second SIAM International Conference on Data Mining (SDM02), Arlington, VA, April 2002, pp 2951.
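
Supplementary sketch: Sammon's mapping

The six-step projection method on the slide "A method for achieving this projection" can be sketched in a few lines. This is a minimal illustration, not the implementation used in the talk: it uses plain gradient descent with a fixed step, and the names `sammon_map`, `sammon_stress`, `step`, `limit` and `seed` are hypothetical. It assumes all input points are distinct, so that no high-dimensional distance is zero.

```python
import math
import random


def pairwise_distances(points):
    """Return the full matrix of Euclidean distances between points."""
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))) for q in points]
            for p in points]


def sammon_stress(d_high, d_low):
    """Mapping error E: normalized, distance-weighted squared mismatch."""
    n = len(d_high)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    scale = sum(d_high[i][j] for i, j in pairs)
    return sum((d_high[i][j] - d_low[i][j]) ** 2 / d_high[i][j]
               for i, j in pairs) / scale


def sammon_map(points, out_dim=2, step=0.3, max_iter=300, limit=1e-4, seed=0):
    """Project points to out_dim dimensions; returns (layout, final stress)."""
    rng = random.Random(seed)
    n = len(points)
    d_high = pairwise_distances(points)                      # step 1
    scale = sum(d_high[i][j] for i in range(n) for j in range(i + 1, n))
    low = [[rng.uniform(-1.0, 1.0) for _ in range(out_dim)]  # step 2
           for _ in range(n)]
    for _ in range(max_iter):
        d_low = pairwise_distances(low)
        if sammon_stress(d_high, d_low) < limit:             # steps 3 and 6
            break
        gradient = []
        for i in range(n):                                   # step 4
            g = [0.0] * out_dim
            for j in range(n):
                if j == i:
                    continue
                d = max(d_low[i][j], 1e-12)  # guard against coincident points
                weight = -2.0 * (d_high[i][j] - d) / (scale * d_high[i][j] * d)
                for k in range(out_dim):
                    g[k] += weight * (low[i][k] - low[j][k])
            gradient.append(g)
        for i in range(n):                                   # step 5
            for k in range(out_dim):
                low[i][k] -= step * gradient[i][k]
    return low, sammon_stress(d_high, pairwise_distances(low))


# Five points in 3-D, projected to 2-D.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
_, initial_stress = sammon_map(points, max_iter=0)  # random layout only
layout, final_stress = sammon_map(points)
```

Every iteration visits all point pairs, which is the $O(n^2)$ cost listed among the drawbacks, and adding a new point means rerunning the whole loop.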
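
Supplementary sketch: projection onto the first principal component

The PCA slide defines the projection through the eigenvectors of the covariance matrix. The sketch below illustrates that idea on a toy 2-D data set. It is a simplified stand-in rather than a full PCA: it recovers only the leading eigenvector, and it uses power iteration in place of a full eigendecomposition. The names `covariance_matrix` and `leading_eigenvector` and the sample values are made up for illustration.

```python
import math


def covariance_matrix(samples):
    """Return (mean, C) with C = 1/(n-1) * sum (x_i - mean)(x_i - mean)^T."""
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[k] for s in samples) / n for k in range(dim)]
    centered = [[s[k] - mean[k] for k in range(dim)] for s in samples]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(dim)]
           for a in range(dim)]
    return mean, cov


def leading_eigenvector(matrix, iterations=200):
    """Power iteration; assumes the start vector is not orthogonal to the answer."""
    dim = len(matrix)
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Toy data lying close to the diagonal x = y.
samples = [[2.0, 1.9], [0.0, 0.1], [-2.0, -2.1], [1.0, 1.1], [-1.0, -1.0]]
mean, cov = covariance_matrix(samples)
pc1 = leading_eigenvector(cov)

# One-dimensional coordinates of each sample along the first principal component.
scores = [sum((s[k] - mean[k]) * pc1[k] for k in range(len(pc1))) for s in samples]
```

For microarray data with thousands of genes, a library eigen-solver or an SVD would be the practical choice. The loop above is only meant to make the covariance-eigenvector definition concrete.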
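
Supplementary sketch: first Fourier harmonic projection

The slides plot each sample by the real and imaginary parts of $F_1(x[n])$. The sketch below assumes that $F_1$ is the k = 1 coefficient of the discrete Fourier transform with no normalization, that is, $F_1(x) = \sum_{n=0}^{N-1} x[n]\, e^{-2\pi i n / N}$. The slides do not state the exact normalization, so treat this definition, the function name `first_fourier_harmonic` and the sample values as assumptions.

```python
import cmath


def first_fourier_harmonic(x):
    """Map an N-dimensional vector to a point in the complex plane (k = 1 DFT term)."""
    n_dims = len(x)
    return sum(value * cmath.exp(-2j * cmath.pi * n / n_dims)
               for n, value in enumerate(x))


# A constant vector [a, a, ..., a]: the N roots of unity sum to zero.
origin_point = first_fourier_harmonic([3.0] * 8)

profile = [1.0, 4.0, 2.0, 0.5, 3.0, 2.5, 1.5, 0.0]
p = first_fourier_harmonic(profile)

# Scaling the vector scales the projected point by the same factor.
scaled = first_fourier_harmonic([2 * v for v in profile])

# A circular shift by one position rotates the point about the origin.
shifted = profile[-1:] + profile[:-1]
q = first_fourier_harmonic(shifted)

print(abs(origin_point) < 1e-9)                         # constant vector maps to the origin
print(abs(scaled - 2 * p) < 1e-9)                       # scaling property
print(abs(q - p * cmath.exp(-2j * cmath.pi / 8)) < 1e-9)  # time shifting property
```

Under this definition the projection costs one pass over the N dimensions of each sample, which fits the "little computation" entry for projection visualizations in the comparison of approaches.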