Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Turing Clusters into Patterns: Rectangle-based Discriminative Data Description Byron J. Gao and Martin Ester IEEE ICDM 2006 Adviser: Koh Jia-Ling Speaker: Liu Yu-Jiun Date: 2006/11/8 Introduction The goal of data mining is to discover useful knowledge. Present the clusters as the sets of points. Interpret the clusters as the humancomprehensible patterns. In the past, only concern the length of patterns, and descript the cluster C directly. 2 SOR description Sum of Rectangles ( SOR ) is the canonical format for cluster descriptions. SOR for C : either SOR or SOR Black: cluster C (R1 and R2) Red: other cluster (R1’) Green: Bc SOR description: R1 + R2 SOR description: Bc – R1’ SOR SOR kSOR 3 ESOR (C ) R1 R2 Notations 4 Example R2 R3 R4 E SOR (C ) R1 R2 R3 R4 R5 E SOR (C ) Bc ( R1' R2' R3' ) R5 R2’ R1 R3’ 5 Problems Maximum Description Accuracy (MDA) Minimum Description Length (MDL) A novel description: kSOR description 6 Accuracy Formula recall E C / C precision E C / E f 2 recall precision recall precision Two additional measures: 1. Recall at fixed precision. (fix precision = 1) 2. Precision at fixed recall. (fix recall = 1) 7 Three Heuristic Algorithms Learn2Cover MDL approximating max length. Length of rectangle. DesTree MDA approximating the Pareto front. FindClans transforms the output from DesTree into the shorter final description. 8 Learn2Cover o x is the next point from Bc in the sorted order. 9 Cost of Learn2Cover l j (R ) : the length of rectangle R along dimension Dj. R’ : the expanded R in covering ox 10 DesTree DesTree takes the output from Learn2Cover, R or R -, as input. Build the tree from bottom to up. Merge the child nodes into parent nodes until a single node is left. Each node represents a rectangle. The higher in the tree we cut, the shorter the length and the lower the accuracy. 11 merge 12 FindClans FindClans takes as input a cut from DesTree, outputs a kSOR description. 13 Algorithm -- FindClans 14 Experimental Compare with CART and BP. Real datasets from the UCI repository, where data records with the same class label were treated as a cluster. 15 Comparisons with CART Concern both of MDA and MDL. 16 DesTree vs. CART accuracy length 17 Comparisons with BP BP addresses the MDL problem only. Synthetic datasets. Gaining 20%~50% length reduction. Learn2Cover without violation checking, so faster than BP. 18 Conclusions kSOR provides enhanced expressive power. MDA allows trading accuracy for interpretability. A paradigm for query-based “secondgeneration” database mining systems. 19