Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Neighboring Feature Clustering Author: Z. Wang, W. Zheng, Y. Wang, J. Ford, F. Makedon, J. Pearlman Presenter: Prof. Fillia Makedon Dartmouth College 1 What is Neighboring Feature Clustering Given an m × n matrix M,where m denotes m samples and n denotes n (ordered) dimensional features, the goal is to find a intrinsic partition of the features based on their characteristics such that each cluster is a continuous piece of features. We assume there is a natural ordering of features that has relevance to the problem being solved – – – 2 E.g., in spectral datasets, such characteristics could be correlations For example, if we decide feature 1 and 10 belong to a cluster, feature 2 to 9 should also belong to that cluster. ZHIFENG: PLEASE IMPROVE THIS SLIDE, PROVIDE AN INTUITIVE DIAGRAM MR spectral features and DNA Copy Number??? 3 MR spectral features are highly redundant suggesting that the data lie in some low-dimensional space (ZHIFENG: WHAT DO YOU MEAN BY LOW DIMENSIONAL SPACE - CLARIFY) Neighboring spectral features of MR spectra are highly correlated Using NFC, we can partition the features into clusters. A cluster can be represented by a single feature, hence reducing the dimensionality. This idea can be applied to DNA copy number analysis too. Zhifeng: Yuhang said these two are not related!! Please explain how these are related. Use MDL method to solve NFC 4 Reduce NFC into a one dimensional piece-wise linear approximation problem. Given a sequence of n one dimensional points <x1,...,xn >, find the optimal step function-like line segments that can be fitted to the points Fig. 1. Piecewise linear approximation [3] [4] is usually 2D. Here we use its concept for a 1D situation. We use minimum description length (MDL) method [2] to solve this reduced problem. Zhifeng: define and explain MDL Minimum Description Length (MDL) 5 Zhifeng, please provide a slide to define this EXPLAIN HOW THE TRANSFORMATION IS DONE (AS IN [1]) TO GIVE 1D piece-wise linear approximation. Represent all the points by two line segments. Trade-off between approximation accuracy and number of line segments. A compromise can be made using MDL. ??? Zhifeng: it is all very cryptic, pieces of explanation are missing! Outline The problem – – Related work – HMM based, partial sum based, maximum likelihood based Our approach – – 6 Spectral data The abstract problem Problem reduced to 1D linear approximation MDL approach Reducing NFC to 1D Piece-Wise Linear Approximation Problem 1 7 Let correlation coefficient matrix of M be denoted as C. LetC∗ be the strictly upper triangular matrix derived from 1−|C| (entries near 0 imply high correlation between the corresponding two features). For features from i to j (1 ≤ i ≤ j ≤ n), the submatrix C∗i:j,i:j depicts pairwise correlations. We use its entries (excluding lower and diagonal entries) as the points to be explained by a line in the 1D piece-wise linear approximation problem. The objective is to find the optimal piece-wise line segments to fit those created points. Points near 0 mean high correlation. We need to force high correlations among a set. Thus the points are always approximated by 0. example 8 For example, suppose we have a set with points all around 0.3. In piece-wise linear approximation, it is better to use 0.3 as the approximation. However in NFC, we should penalize the points that stray away from 0. So we still use 0 as the approximation. Unlike usual 1D piece-wise linear approximation problem, the reduced problem has dynamic points (because they are created on the fly). Zhifeng: provide figure to illustrate above example Spectral data MR spectral data – High dimensional data points Spectral features are highly redundant (high correlation) – Find neighboring features with high correlation in a spectral dataset, – such as a MR spectral dataset. intensit frequeny Fig. 1 high dimensional data points 9 Fig. 2 correlation coefficient matrix Both axes are the features or the number of dimensions Problem Finding a low-dimensional space – – We extract an abstract problem: Neighboring Feature Clustering (NFC) – – 10 zhifeng: define low dimensional space Curse of dimensionality Features are ordered. Each cluster contains only neighboring features. Find an optimal clusters according to certain criteria Another application (with variation) Array Comparative Genomic Hybridization to detect copy number alterations. aCGH data are noisy – – 11 Smoothing Segmentation Fig. 4 aCGH data (smoothed). The X axis is log ratio Fig. 3 aCGH technology Fig.5 aCGH data(segmented). The X axis is log ratio Related work An algorithm trying to solve a similar problem – Baumgartner, et al, “Unsupervised feature dimension reduction for classification of MR spectra”, Magnetic Resonance Image, 22:251-256,2004 An extensive literature on the reduced problem – – The, et al, “On the detection of dominant points on digital curves”, IEEE PAMI, 11(8) 859-872, 1989 Statistical methods… Fig. 6 1D piece-wise approximation 12 Related work: statistical methods HMM based – Partial sum based – Lipson etc., ‘”Efficient Calculation of Interval Scores for DNA copy Number Data analysis”, RECOMB 2005 Maximum likelihood based – 13 Fridlyand, et al , “Hidden Markov models approach to the analysis of array CGH data”, J. Multivariate Anal., 90, 132153 Picard, etc., “A statistical approach for array CGH data analysis”, BMC Bioinformatics, 6:27,2005 Framework of the method proposed intensity 2. For each pair of features 1. Correlation coefficient matrix frequency 3. MDL code length (revised) 4. Code length matrix C1, intensity n-1 5. Shortest path (dynamic programming) n-1 1 frequency C2, C1, 2 C2, 3 2 3 C1,2 C3, … n-1 n-1 Cn- n C3, n C2, n 14 Fig. 7 our approach 1,n C1,1C1, 2 ...C1,n C2, 2 ...C2,n C ... Cn ,n Minimum Description Length Information Criterion – – – A model selection scheme Common information criteria are Akaike Information Criterion(AIC), Bayesian Information Criterion (BIC), and Minimum Description Length (MDL) MDL is to encode the model and the data given the model. The balance is achieved in terms of code length C ( D, M ) log p( D, M ) log p( D | M ) log p( M ) Fig. 6 1D piece-wise approximation 15 Encoding model and data given model For each pair (n*(n-1)/2 in total) of features – Encoding model – Cluster boundary, Gaussian parameter (standard deviation) Encoding data given model d 16 Fig.8 encoding the data given model for each feature pair Minimize the code length C1,n-1 Code length matrix C1,1C1, 2 ...C1,n C2, 2 ...C2,n C ... Cn ,n C2,n-1 C3,n-1 C1,2 1 2 C2,3 3 … n-1 Cn-1,n n C3,n C1,2 C2,n Fig. 9 alternative representation of matrix C Shortest path – – Recursive function Dynamic programming Fig. 10 Recursive function for the shortest path 17 Results We test on simulated data. Fig. 11 the revised correlation matrix and the computed code length matrix 18