Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Bi-Clustering COMP 790-90 Seminar Spring 2011 The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL • Let I be a subset of genes in the database. Let J be a subset of conditions. We say <I, J> forms an Order Preserving Cluster (OP-Cluster), if one of the following relationships exists for any pair of conditions. Expression Levels Definition of OP-Cluster j1 , j2 J , j1 j2 A1 A2 A3 A4 (1)i I , Dij1 Dij2 (2)i I , Dij1 Dij2 (3)i I , Dij1 Dij2 max | Dij1 Dij2 | min (| Dij1 |, | Dij2 |) j1 , j2 J 2 when The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Problem Statement • Given a gene expression matrix, our goal is to find all the statistically significant OP-Clusters. The significance is ensured by the minimal size threshold nc and nr. 3 The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Conversion to Sequence Mining Problem (1) Dij1 Dij2 j1 j2 (2) Dij1 Dij2 j1 j2 Expression Levels (3) Dij1 Dij2 CanonicalO rder ( j1 , j2 ) Sequence: A1 A4 A3 A2 A1 4 A2 A3 A4 The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Ming OP-Clusters: A naïve approach • A naïve approach root Enumerate all possible subsequences in a prefix tree. For each subsequences, collect all genes that contain the subsequences. • Challenge: The total number of distinct subsequences are i i! m 1i N 5 a b b c d c a d c … d c d b d b c c d a d … d c d b c b d c d a … A Complete Prefix Tree with 4 items {a,b,c,d} The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Mining OP-Clusters: Prefix Tree Goal: Build a compact prefix tree that includes all sub-sequences only occurring in the original database. Strategies: g1 adbc g2 abdc g3 badc 1. Depth-First Traversal Root 2. Suffix concatenation: Visit subsequences that only exist in the input sequences. 3. Apriori Property: Visit subsequences that are sufficiently supported in order to derive longer subsequences. a:1,2,3 a:1,2 d:1,2,3 d:1,3 d:1 b:1 c:1 6 c:1,2,3 c:1,3 b:3 b:2 a:3 d:2 d:3 c:2 c:3 The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL References • J. Yang, W. Wang, H. Wang, P. Yu, Delta-cluster: capturing subspace correlation in a large data set, Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp. 517-528, 2002. • H. Wang, W. Wang, J. Yang, P. Yu, Clustering by pattern similarity in large data sets, to appear in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002. • Y. Sungroh, C. Nardini, L. Benini, G. De Micheli, Enhanced pClustering and its applications to gene expression data Bioinformatics and Bioengineering, 2004. • J. Liu and W. Wang, OP-Cluster: clustering by tendency in high dimensional space, ICDM’03. 7 The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL