Association Mining via Co-clustering of Sparse Matrices
Brian Thompson*, Linda Ness†, David Shallcross†, Devasis Bassu†

Definitions
Let $M$ be an $m \times n$ matrix. A bicluster of $M$ is a subset of matrix entries formed by the intersection of a set of rows $I \subseteq [m]$ and a set of columns $J \subseteq [n]$, and is denoted by $M_{I,J}$.
[Figure: a bicluster $M_{I,J}$ within $M$]

Motivation
• Matrices can represent: binary relations, objects and attributes, terms and documents, gene expression, recommender systems, ...
• Dense biclusters indicate strong associations

Co-Clustering
Co-clustering: Given a matrix, cluster the rows and columns to form large, dense biclusters
[Figure: matrix with row clusters R1, R2, R3 and column clusters C1, C2, C3]
Challenges:
• Don't know the number or sizes of clusters a priori
• Want solution to be efficient and scalable
• Matrix may be sparse

Our Approach
We propose a two-step approach:
1. Define a quality metric $\mu$ for bicluster partitions
   We consider metrics of the form $\mu = \sum_{B \in \Pi} f(B)$
   (Motivation for this choice is in the 15-minute version of the talk...)
2. Find a co-clustering that maximizes the value of $\mu$
   We propose the CC-MACS algorithm (Co-Clustering via Maximal Anti-Chain Search)

The CC-MACS Algorithm
1. Build randomized k-d trees on rows ($T^{row}$), cols ($T^{col}$)
2. Populate $F_{x,y} = f(M_{I_x,J_y})$ for $x \in T^{row}$, $y \in T^{col}$ via DP
3. Initialize MACs $S^{row}$, $S^{col}$ and heaps $H^{row}$, $H^{col}$;
   $h_x = \sum_{y \in S^{col}} \left[ f(M_{I_x,J_y}) - f(M_{I_{x.left},J_y}) - f(M_{I_{x.right},J_y}) \right]$
4. While at least one of $H^{row}$ and $H^{col}$ is non-empty:
   • WLOG let $H^{row}.getMax > H^{col}.getMax$
   • Update data structures and variables: $H^{row}$, $S^{row}$, $\mu_{curr} \mathrel{+}= h^{row}_x$, $h^{col}_y$ for $y \in H^{col}$
   • If $x.sibling \in S^{row}$, add $x.parent$ to $H^{row}$
5. Return co-clustering formed by $S^{row} \times S^{col}$

Experiments: Synthetic Data
• Generate $m \times n$ matrix $M$ with $k$ biclusters of size $r \times s$ selected randomly from $M$; non-bicluster entries are 0, each bicluster entry is a 1 with probability $1 - p$
• Want co-clustering output to match ground truth
• Compare via $F_1$-score: $F_1 = 2 \cdot \dfrac{accuracy \times precision}{accuracy + precision}$

Experiments: Real-World Data
• Matrices from domains of finite element modeling and quantum chemistry [src: NIST Matrix Market repository]
[Table of matrix images; columns: Dataset | Original Matrix | CrossAssociation | CC-MACS ($w^2/s$) | CC-MACS ($w^3/(as)$) | CC-MACS ($w^4/(a^2 s)$)]

Concluding Thoughts
• The CC-MACS algorithm runs in $O(mn \log mn)$ time.
• Our approach compared favorably to state-of-the-art and baseline methods for a classification task on synthetic data.
• Choice of metric can affect quality and granularity of results; different metrics may be appropriate for different applications.
• The CC-MACS algorithm effectively identified large, dense biclusters in the datasets evaluated.

Acknowledgements/Disclaimer
This research was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-706. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL, or the U.S. Government. Any misinformation, mistakes, or misunderstanding resulting from this talk are solely the fault of the speaker.

Example Matrices
[Figure: example matrix with dashed lines marking row and column clusters]
Spectral methods, which try to rearrange rows and columns to form a diagonal block matrix, would not perform well on this matrix. The dashed lines suggest a good co-clustering.
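
Illustrative sketch
To make the bicluster definition, the metric form $\mu = \sum_{B \in \Pi} f(B)$, and the synthetic-data setup from the talk concrete, here is a minimal Python sketch. It is not the authors' implementation and does not implement CC-MACS itself. The function names (`make_synthetic`, `bicluster`, `score`, `partition_metric`) are hypothetical, the planted biclusters are assumed to be disjoint, and the example score $f(B) = w^2/s$ reads $w$ as the sum of entries and $s$ as the number of entries in $B$, which is an assumption: the slides do not define $w$ and $s$.

```python
import random


def make_synthetic(m, n, k, r, s, p, seed=0):
    """Plant k disjoint r x s biclusters in an m x n zero matrix.

    Entries inside a planted bicluster are 1 with probability 1 - p;
    all other entries stay 0. Disjointness is an assumption of this sketch.
    """
    if k * r > m or k * s > n:
        raise ValueError("need k*r <= m and k*s <= n for disjoint biclusters")
    rng = random.Random(seed)
    rows = rng.sample(range(m), k * r)   # distinct row indices
    cols = rng.sample(range(n), k * s)   # distinct column indices
    M = [[0] * n for _ in range(m)]
    truth = []
    for b in range(k):
        I = sorted(rows[b * r:(b + 1) * r])
        J = sorted(cols[b * s:(b + 1) * s])
        for i in I:
            for j in J:
                if rng.random() < 1 - p:
                    M[i][j] = 1
        truth.append((I, J))
    return M, truth


def bicluster(M, I, J):
    """Return M_{I,J}: entries at the intersection of rows I and columns J."""
    return [[M[i][j] for j in J] for i in I]


def score(B):
    """Hypothetical f(B) = w^2 / s, with w = sum of entries, s = entry count."""
    size = sum(len(row) for row in B)
    if size == 0:
        return 0.0
    weight = sum(sum(row) for row in B)
    return weight * weight / size


def partition_metric(M, row_clusters, col_clusters, f=score):
    """mu = sum of f(B) over all biclusters formed by row x column clusters."""
    return sum(f(bicluster(M, I, J)) for I in row_clusters for J in col_clusters)


# Noise-free example: two planted 3 x 3 blocks in an 8 x 8 matrix.
M, truth = make_synthetic(m=8, n=8, k=2, r=3, s=3, p=0.0)
planted_rows = [I for I, _ in truth]
planted_cols = [J for _, J in truth]
# Leftover rows/columns form one extra cluster each, so the clusters partition M.
rest_rows = [i for i in range(8) if all(i not in I for I in planted_rows)]
rest_cols = [j for j in range(8) if all(j not in J for J in planted_cols)]

mu_truth = partition_metric(M, planted_rows + [rest_rows], planted_cols + [rest_cols])
mu_whole = partition_metric(M, [list(range(8))], [list(range(8))])
print(mu_truth)  # 18.0: each planted block contributes 9^2 / 9
print(mu_whole)  # 5.0625: one 8 x 8 bicluster with 18 ones, 18^2 / 64
```

Under this assumed score, the partition that matches the planted blocks receives a higher $\mu$ than treating the whole matrix as a single bicluster, which is the behaviour a quality metric for co-clustering is meant to reward.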