ISMIS, Oct 21-23, 2015, Lyon, France
Paul Amalaman and Christoph F. Eick
Department of Computer Science, University of Houston
HC-edit: A Hierarchical Clustering Approach to Data Editing
CS@UH Data Mining & Machine Learning Group

Talk Organization
1. Introduction
2. Data Editing
3. Supervised Taxonomy & The STAXAC Algorithm
4. HC-edit
5. Related Work
6. Summary and Conclusion

1. Introduction

k-NN Classifiers
k-neighborhood Nk(x): Given an example x and a training dataset O, the k-neighborhood of x, denoted Nk(x), consists of the k nearest neighbors of x in O.
k-NN rule: Given Nk(x), assign to x the majority class label among the examples in Nk(x).

Advantages: easy to implement; accuracy typically high; supports complex decision boundaries.
Disadvantages: the training set must be stored (the dataset is the model!); classification is slow.

2. Data Editing

Problem Definition: Given a dataset O,
1. remove "bad" examples from O: Oedited := O \ {"bad" examples};
2. use Oedited as the model for a k-NN classifier.
The goal of the presented paper is to improve the accuracy of k-NN classifiers by developing new data editing approaches.

Background: Some Data Editing Algorithms
Wilson Editing [Wilson 72]: relies on the idea that if an example is misclassified using the k-NN rule, it should be removed from the training set.
Multi-Edit [Devijver 80]: repeatedly applies Wilson editing to m random subsets of the original dataset until no more examples are removed.
(Representative-based) Supervised Clustering Editing [Eick 2004]: clusters the data with a representative-based supervised clustering approach and deletes all non-representative examples.

Example: Wilson Editing
Developed by Wilson in 1972; removes points that do not agree with the majority of their k nearest neighbours.
[Figures: an earlier example and an overlapping-classes dataset, each shown as the original data next to the result of Wilson editing with k=7.]

Problem with Wilson Editing: excessive example removal, especially in decision-boundary areas.
[Figure: (a) the original dataset with its natural boundary; (b) the Wilson editing result, where the natural boundary is replaced by a new one.]
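To make Wilson editing concrete, here is a minimal Python sketch (hypothetical code, not the authors' implementation; it assumes plain Euclidean distance over lists of numeric feature vectors):

```python
# Wilson editing sketch: drop every example that the k-NN rule, applied over
# the other training examples, would misclassify.
import math
from collections import Counter

def wilson_edit(X, y, k=3):
    """Return the indices of the examples kept by Wilson editing."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # k nearest neighbors of xi among all *other* training examples
        neighbors = sorted((j for j in range(len(X)) if j != i),
                           key=lambda j: math.dist(X[j], xi))[:k]
        majority, _ = Counter(y[j] for j in neighbors).most_common(1)[0]
        if majority == yi:      # xi agrees with its neighborhood: keep it
            keep.append(i)
    return keep
```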
Supervised Clustering
Goal: obtain class-uniform clusters that are dominated by instances of a single class.
[Figure: traditional clustering vs. supervised clustering of the same dataset.]

Representative-based Supervised Clustering Editing
[Figure: a two-attribute dataset with four representatives (1-4) chosen so that the clustering maximizes purity; axes: Attribute1, Attribute2.]
Objective: find a set of objects OR in the dataset O to be clustered, such that the clustering X obtained by using the objects in OR as representatives minimizes q(X), e.g. the following q(X):
q(X) := Σi (1 - purity(Ci)) * |Ci|^β, with β > 1

3. Supervised Taxonomy & The STAXAC Algorithm

Supervised Taxonomy (ST)
Supervised taxonomies are generated by considering background information about class labels in addition to distance metrics, and are capable of capturing class-uniform regions in a dataset.

Uses of Supervised Taxonomies
- Data set editing (this talk)
- Bioinformatics (e.g., trying to identify interesting subclasses of known diseases)
- Clustering
- Meta learning: create useful background knowledge from datasets by developing algorithms that operate on supervised taxonomies

The STAXAC Algorithm
Algorithm 1: STAXAC (Supervised TAXonomy Agglomerative Clustering)
Input: examples with class labels and their distance matrix D.
Output: a hierarchical clustering.
1. Start with a clustering X of one-object clusters.
2. ∀C*, C' ∈ X: merge-candidate(C*, C') ⇔ (1-NNX(C*) = C' or 1-NNX(C') = C*).
3. WHILE there are merge-candidates (C*, C') left BEGIN
   a. Merge the pair of merge-candidates (C*, C') for which the resulting clustering X' has the largest Purity(X'), obtaining a new cluster C = C* ∪ C'.
   b. Update merge-candidates: ∀C'': merge-candidate(C'', C) ⇔ (merge-candidate(C'', C*) or merge-candidate(C'', C')).
   c. Extend the dendrogram by drawing edges from C' and C* to C.
   END
4. Return the constructed dendrogram.

Properties of STAXAC
1. In contrast to other supervised clustering algorithms, STAXAC computes a hierarchical clustering that maximizes cluster purity.
2. Individual clusterings can be extracted from the hierarchy using different purity thresholds; HC-edit, introduced later, relies on this property.
3. In contrast to other hierarchical clustering algorithms, STAXAC conducts a wider search, merging clusters that are neighboring and not necessarily the two closest clusters.
4. STAXAC uses a 1-NN graph to determine which clusters are neighboring.
5. The proximity graph needs to be computed only once, at the beginning of the run.
6. STAXAC could be generalized by using more powerful proximity graphs, such as Gabriel graphs, to conduct a wider search.

Proximity Graphs
Proximity graphs provide various definitions of "neighbour":
[Figure: NNG, MST, RNG, GG, and DT built over the same point set.]
NNG = Nearest Neighbour Graph
MST = Minimum Spanning Tree
RNG = Relative Neighbourhood Graph
GG = Gabriel Graph
DT = Delaunay Triangulation (neighbours of a 1-NN classifier)

4. HC-edit

The HC-edit Approach
1. Create a supervised taxonomy ST for dataset O using STAXAC.
2. Extract a clustering from ST for a given purity threshold β.
3. Remove the minority examples from each cluster.
4. Classification: use the k-NN weighted-vote rule to classify an example (see the sketch below).
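Step 4 relies on a weighted k-NN vote; the slides do not spell out the weighting scheme, so the following Python sketch assumes the common inverse-distance weights (an assumption, not the paper's definition):

```python
# Weighted k-NN classification over the edited dataset: each of the k nearest
# neighbors votes for its label with weight 1/distance.
import math
from collections import defaultdict

def weighted_knn_predict(edited, x, k=3):
    """edited: list of (feature_vector, label) pairs kept after editing."""
    nearest = sorted(((math.dist(xe, x), lab) for xe, lab in edited),
                     key=lambda dl: dl[0])[:k]
    votes = defaultdict(float)
    for d, lab in nearest:
        votes[lab] += 1.0 / (d + 1e-12)   # epsilon guards against d == 0
    return max(votes, key=votes.get)
```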
HC-Edit Code
Inputs:
  T: tree generated by a hierarchical clustering algorithm such as STAXAC for a dataset O (each node cluster is annotated with its purity and its associated examples)
  β: cluster purity threshold
Output: Oedited, a dataset that is a subset of the original dataset O

Function EDIT_TREE(T)
  Oedited := ∅;
  EXTRACT_EXAMPLES(T);
  RETURN Oedited;
END

Function EXTRACT_EXAMPLES(T)
BEGIN
  IF T = NULL EXIT
  IF purity(T) ≥ β
    Add the majority examples of T to Oedited
  ELSE
    EXTRACT_EXAMPLES(T.right)
    EXTRACT_EXAMPLES(T.left)
END

[Bar charts: class distributions of the top clusters of size ≥ 20 extracted from T for Boston Housing, at β = 1, 0.96, 0.95, 0.948, 0.86, and 0.75; each chart shows the number of majority and minority examples per cluster, with the clusters' majority class labels (A or B) on the x-axis.]

How to choose β?
If β is set to 1, no editing occurs. As β decreases, more objects are removed in the editing process, and the obtained clusters contain more instances. We also observed that using very low values of β leads to low accuracy, as clusters get too contaminated by examples belonging to other classes.
Moreover, only a finite subset of β values needs to be considered: a subset of the node purities that occur in the supervised taxonomy. Basically, only a finite number of clusterings can be extracted from the supervised taxonomy, and for many purity values the extracted supervised clusterings are the same. Therefore, one can set a lower bound βmin, use n-fold cross-validation for the purities that occur in the tree in [βmin, 1), and choose the purity value that leads to the highest accuracy for the editing of the dataset.
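This selection procedure can be sketched in a few lines of Python (hypothetical code, not from the paper; the taxonomy is modeled as a simple binary-tree Node, and the n-fold cross-validation described above is simplified to leave-one-out 1-NN accuracy on the edited set to keep the sketch short):

```python
import math

class Node:
    """Taxonomy node: purity in [0, 1], examples as (vector, label) pairs."""
    def __init__(self, purity, examples, left=None, right=None):
        self.purity, self.examples = purity, examples
        self.left, self.right = left, right

def node_purities(t):
    if t is None:
        return []
    return [t.purity] + node_purities(t.left) + node_purities(t.right)

def hc_edit(t, beta):
    """Cut the taxonomy at purity >= beta; keep only majority examples."""
    if t is None:
        return []
    if t.purity >= beta:
        labels = [lab for _, lab in t.examples]
        majority = max(set(labels), key=labels.count)
        return [(x, lab) for x, lab in t.examples if lab == majority]
    return hc_edit(t.left, beta) + hc_edit(t.right, beta)

def loo_1nn_accuracy(data):
    """Leave-one-out accuracy of a 1-NN classifier on the edited set."""
    hits = 0
    for i, (x, lab) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        _, pred = min(((math.dist(x2, x), l2) for x2, l2 in rest),
                      key=lambda dl: dl[0])
        hits += (pred == lab)
    return hits / len(data)

def choose_beta(taxonomy, beta_min=0.7):
    """Evaluate each candidate purity threshold and return the best one."""
    candidates = sorted({p for p in node_purities(taxonomy)
                         if beta_min <= p < 1.0})
    return max(candidates,
               key=lambda b: loo_1nn_accuracy(hc_edit(taxonomy, b)))
```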
Experimental Results
(Accuracy in %. HC-edit entries read "accuracy (β)[r]", where β is the purity threshold in % and r the percentage of removed examples; several (β)[r] pairs list thresholds that yielded the same accuracy. Wilson and WilsonProb entries read "accuracy [r]".)

E. Coli
K=1: HC-edit-STAXAC 56.1 (50)[43.20]; HC-edit-UPGMA 47.56 (100)[0.00],(90)[0.00]; WilsonProb 57.32 [49.73]; Wilson 57.32 [49.73]; k-NN 51.22
K=3: HC-edit-STAXAC 56.10 (100)[0],(90)[1.06],(80)[8.13]; HC-edit-UPGMA 52.44 (100)[0.00],(90)[0.00]; WilsonProb 57.32 [44.53]; Wilson 57.32 [45.73]; k-NN 56.10
K=5: HC-edit-STAXAC 58.54 (100)[0.00],(90)[10.67]; HC-edit-UPGMA 53.66 (100)[0.00],(90)[0.00]; WilsonProb 51.22 [42.40]; Wilson 51.22 [41.86]; k-NN 54.88
K=7: HC-edit-STAXAC 63.41 (100)[0.00],(90)[10.67]; HC-edit-UPGMA 62.20 (80)[4.00]; WilsonProb 52.44 [43.20]; Wilson 54.88 [46.80]; k-NN 56.1

BEV
K=1: HC-edit-STAXAC 81.35 (90)[6.00]; HC-edit-UPGMA 82.64 (80)[14.17]; WilsonProb 83.6 [19.96]; Wilson 83.6 [19.96]; k-NN 79.1
K=3: HC-edit-STAXAC 81.03 (90)[6.00]; HC-edit-UPGMA 81.99 (80)[14.17]; WilsonProb 82.64 [16.92]; Wilson 81.79 [16.75]; k-NN 81.67
K=5: HC-edit-STAXAC 81 (100)[0.00]; HC-edit-UPGMA 83.28 (80)[14.17]; WilsonProb 81.67 [16.50]; Wilson 79.74 [18.60]; k-NN 80.39
K=7: HC-edit-STAXAC 81.35 (90)[6.00]; HC-edit-UPGMA 83.28 (80)[14.17]; WilsonProb 81.35 [16.25]; Wilson 80.06 [18.60]; k-NN 80.39

Vot
K=1: HC-edit-STAXAC 91.89 (100)[0.00]; HC-edit-UPGMA 90.09 (90)[5.80],(80)[10.9]; WilsonProb 90.95 [9.00]; Wilson 90.95 [9.00]; k-NN 89.66
K=3: HC-edit-STAXAC 91.38 (100)[0.00]; HC-edit-UPGMA 91.38 (90)[5.80]; WilsonProb 90.95 [8.00]; Wilson 90.95 [8.00]; k-NN 92.24
K=5: HC-edit-STAXAC 92.67 (100)[0.00],(90)[5.38]; HC-edit-UPGMA 92.24 (100)[0.00],(90)[5.80]; WilsonProb 90.95 [8.00]; Wilson 90.95 [8.00]; k-NN 91.38
K=7: HC-edit-STAXAC 93.10 (100)[0.00]; HC-edit-UPGMA 92.67 (100)[0.00]; WilsonProb 90.95 [8.00]; Wilson 90.95 [8.00]; k-NN 92.67

Bld
K=1: HC-edit-STAXAC 61.11 (100)[0.00]; HC-edit-UPGMA 62.09 (80)[7.15]; WilsonProb 61.44 [39.26]; Wilson 61.44 [39.26]; k-NN 60.13
K=3: HC-edit-STAXAC 59.80 (100)[0.00]; HC-edit-UPGMA 59.80 (90),(80)[7.15]; WilsonProb 59.48 [35.64]; Wilson 59.48 [35.60]; k-NN 59.48
K=5: HC-edit-STAXAC 63.07 (90)[3.21]; HC-edit-UPGMA 62.75 (100)[0.00],(90)[0.6]; WilsonProb 59.8 [32.00]; Wilson 59.8 [31.73]; k-NN 63.4
K=7: HC-edit-STAXAC 64.05 (90)[3.21]; HC-edit-UPGMA 64.38 (100)[0.00],(90)[0.6]; WilsonProb 61.44 [29.48]; Wilson 61.44 [29.37]; k-NN 64.05

AEV
K=1: HC-edit-STAXAC 80 (80)[10.50]; HC-edit-UPGMA 78.57 (80)[5.6]; WilsonProb 77.14 [25.61]; Wilson 77.14 [25.61]; k-NN 77.14
K=3: HC-edit-STAXAC 80 (90)[2.5]; HC-edit-UPGMA 78.57 (100)[0.00],(90)[0.7],(80)[5.6]; WilsonProb 72.86 [18.25]; Wilson 74.29 [16.84]; k-NN 78.57
K=5: HC-edit-STAXAC 78.57 (90)[2.50],(80)[10.50]; HC-edit-UPGMA 77.14 (100)[0.00],(90)[0.7],(80)[5.6]; WilsonProb 72.86 [18.60]; Wilson 71.43 [17.89]; k-NN 78.57
K=7: HC-edit-STAXAC 78.57 (100)[0.00],(90)[2.5],(80)[10.5]; HC-edit-UPGMA 78.57 (100)[0.00],(90)[0.7],(80)[5.6]; WilsonProb 71.43 [18.25]; Wilson 70 [18.94]; k-NN 74.29

BEE
K=1: HC-edit-STAXAC 80.83 (90)[7.31]; HC-edit-UPGMA 84.17 (80)[15.83]; WilsonProb 78.33 [17.68]; Wilson 78.69 [17.68]; k-NN 78.33
K=3: HC-edit-STAXAC 81.67 (90)[7.31]; HC-edit-UPGMA 84.17 (80)[15.83]; WilsonProb 82.5 [19.07]; Wilson 85 [20.09]; k-NN 80.83
K=5: HC-edit-STAXAC 82.50 (100)[0.00],(90)[7.31]; HC-edit-UPGMA 84.17 (80)[15.83]; WilsonProb 82.5 [17.96]; Wilson 76.15 [17.96]; k-NN 82.5
K=7: HC-edit-STAXAC 84.17 (90)[7.31]; HC-edit-UPGMA 84.17 (80)[15.83]; WilsonProb 84.17 [16.67]; Wilson 85.83 [16.57]; k-NN 83.33

Bos
K=1: HC-edit-STAXAC 69.57 (90)[37.28]; HC-edit-UPGMA 72.33 (80)[7.14]; WilsonProb 67.00 [23.24]; Wilson 65.5 [23.24]; k-NN 67
K=3: HC-edit-STAXAC 70.55 (90)[37.28]; HC-edit-UPGMA 71.94 (80)[7.14]; WilsonProb 70.36 [23.68]; Wilson 70.55 [23.48]; k-NN 67.39
K=5: HC-edit-STAXAC 70.75 (100)[0.00]; HC-edit-UPGMA 70.36 (90)[1.03]; WilsonProb 67.46 [23.61]; Wilson 66.4 [24.16]; k-NN 70.55
K=7: HC-edit-STAXAC 73.12 (100)[0.00]; HC-edit-UPGMA 73.12 (100)[0.00]; WilsonProb 69.37 [25.83]; Wilson 69.57 [24.18]; k-NN 73.12

5. Related Work
1. Devijver, P., Kittler, J.: "On the Edited Nearest Neighbor Rule". In: Proc. Int. Conf. on Pattern Recognition, vol. 1, pp. 72-80 (1980). [Multi-Edit]
2. Eick, C.F., Zeidat, N., Vilalta, R.: "Using Representative-Based Clustering for Nearest Neighbor Dataset Editing". ICDM 2004, pp. 375-378. [RBSCE]
3. Wilson, D.L.: "Asymptotic Properties of Nearest Neighbor Rules Using Edited Data". IEEE Transactions on Systems, Man, and Cybernetics 2:408-420 (1972). [Wilson Editing]
4. Saitou, N., Nei, M.: "The neighbor-joining method: a new method for reconstructing phylogenetic trees". Mol. Biol. Evol. 4(4) (1987). [UPGMA]
5. Vázquez, F., Sánchez, J., Pla, F.: "A stochastic approach to Wilson's editing algorithm". LNCS, vol. 3523, pp. 35-42. Springer, Heidelberg (2005). [Wilson Probing]
6. Summary and Conclusion
We proposed HC-edit, which identifies high-purity regions in the dataset using supervised clustering and removes the minority examples from the identified regions. HC-edit takes as input a hierarchical clustering augmented with purity information in its nodes.
In general, the supervised taxonomy can be viewed as a "useful" intermediate result that allows multiple supervised clusterings to be inspected before selecting the "best" one.
Traditional k-nearest-neighbor editing methods use the parameter k for both editing and classification. By allowing two parameters for modeling (the purity threshold β for editing and k for classification), HC-edit provides a larger landscape for model selection.
The experiments reveal that HC-edit improves accuracy on many datasets, usually removing fewer training examples than the other methods.

Future Work
- Explore more sophisticated techniques to select purity thresholds for extracting "good" supervised clusterings for editing and meta learning.
- Generalize STAXAC to use more complex proximity graphs, such as Gabriel graphs.
- Explore the usefulness of supervised taxonomies for other applications.

Multi-Edit
Multi-Edit repeatedly applies Wilson editing:
A. DIFFUSION: divide the dataset O into N ≥ 3 random subsets S1, ..., SN.
B. CLASSIFICATION: classify Si using the k-NN rule with S((i+1) mod N) as the training set (i = 1, ..., N).
C. EDITING: discard all incorrectly classified examples.
D. CONFUSION: replace O by the subset Or consisting of the union of all remaining examples in S1, ..., SN.
E. TERMINATION: if the last iteration produced no editing, terminate; otherwise go to step A.

Extracting a Clustering from a STAXAC-Generated Tree
Algorithm 3: ExtractClustering(T, β)
1. Inputs: taxonomy tree T; user-defined purity threshold β
2. Output: clustering X
3. Function ExtractClustering(T, β)
4.   IF T = NULL
5.     RETURN ∅
6.   IF T.purity ≥ β
7.     RETURN {T}
8.   ELSE
9.     RETURN ExtractClustering(T.left, β) ∪ ExtractClustering(T.right, β)
10. End Function
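For reference, Algorithm 3 translates directly into Python (a sketch; it assumes tree nodes exposing purity, left, and right attributes, as in the Node class from the earlier β-selection sketch):

```python
# ExtractClustering sketch: cut the taxonomy at purity threshold beta and
# return the maximal subtrees (clusters) whose purity meets the threshold.
def extract_clustering(t, beta):
    if t is None:
        return []
    if t.purity >= beta:
        return [t]            # t is pure enough: report it as one cluster
    return extract_clustering(t.left, beta) + extract_clustering(t.right, beta)
```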