Figure 5: Fisher iris data set vote matrix after ordering.

© 2007 The Authors. Journal Compilation © 2007 Blackwell Publishing Ltd. Expert Systems, July 2007, Vol. 24, No. 3

6.2.2. Methodology application

We did not know how many clusters the data set should be classified into, so we began by forcing the tools to cluster it into the suggested number of four clusters. We performed the same steps of the suggested methodology as we had performed with the Fisher iris data set. When we examined the final results, we noted that the classification was not satisfactory. We therefore applied the same steps with fewer clusters – three and two, respectively.

6.2.3. Methodology implementation output

The user profiles data set ordered vote matrix with four clusters (Figure 6) shows the results after applying the suggested methodology to the data set. First, we tried to observe the outcome when classifying the data set into four clusters, as suggested in the original research. The outcome was very inconsistent. A cluster comprising samples 1, 4, 8, 9, 17, 23, 25, 29, 30, 31, 32, 37, 39 and 40 could be identified, but the rest of the samples, excluding sample 26, were not clearly associated with any of the additional clusters. This certainly does not indicate that the initial assumption – that the data set could be classified into four clusters based on the eight properties – is correct.

The user profiles data set ordered vote matrix with three clusters (Figure 7) shows the results for the user profiles data set when forcibly classified into three clusters. We tried this classification after receiving unsatisfactory results from the four-cluster attempt. The results were quite similar: the outstanding cluster comprising samples 1, 4, 8, 9, 17, 23, 25, 29, 30, 31, 32, 37, 39 and 40 was clearly identified, while the rest of the samples could not be clearly divided into additional clusters.
The user profiles data set ordered vote matrix with two clusters (Figure 8) shows the user profiles data set with the suggested methodology applied to it assuming two clusters. This time, the results were quite clear and two clusters could be easily identified. It is important to note that the cluster that was identified even when trying to classify the data set into four clusters remained consistent throughout all the methodology applications.

Figure 6: User profiles data set ordered vote matrix with four clusters. [Vote matrix data omitted from this extraction.]

Applying the methodology to the other assumptions showed that the rest of the samples could not be divided into additional clusters. Hence, these two clusters were probably the best classification based on the given properties.

7. Discussion and conclusions

The suggested methodology produced a clear, visual presentation of data set classifications,
Figure 7: User profiles data set ordered vote matrix with three clusters. [Vote matrix data omitted from this extraction.]
Figure 8: User profiles data set ordered vote matrix with two clusters.

which can be used to identify samples that are clustered correctly. The user profiles data set is a good example of this: we started with the initial assumption that it should be classified into four clusters, but when performing the classification it became obvious that four clusters was incorrect. Figure 6 demonstrates this very clearly. Continuing to work according to the suggested methodology, we reached a very clear classification that is demonstrated visually in Figure 8. Looking at the Fisher iris data set, as demonstrated in Figure 5, we can see how samples that are falsely clustered, such as sample 25, or samples that are difficult to cluster, such as sample 57, stand out. Such a view is hard to reach using legacy presentations, such as two- or three-dimensional scatter charts, since in many cases there are more than three properties by which the data set is classified. However, there are no good and clear means with which to present a data set distribution in more than three dimensions. This requires us to use perspectives in which not all the properties are presented, causing an inaccurate and sometimes even misleading presentation. The suggested methodology produced a perspective that gives a clear presentation of the effectiveness of the different algorithms; such a perspective can be useful when applied to a training data set in order to decide on the most effective algorithm to use to classify future samples. This is demonstrated by the association of the clusters presented in Figures 3 and 4.
Using the heterogeneity meter, the outcomes of the different algorithms can be compared. This contrasts with legacy methods, where, for example, an outcome of six classification samples {1, 2, 3, 4, 5, 6} clustered by two different algorithms as output A {(1, 2), (3, 4), (5, 6)} and output B {(1, 4), (3, 6), (2, 5)} is quite difficult to associate and analyse.

The current study suggests a methodology for classifying data sets. Though limited regarding the size of the data sets it can analyse, it succeeds in providing a clear visual perspective of areas of interest that legacy tools fail to present satisfactorily. We demonstrated the successful application of the methodology to a well-known data set as well as to a data set that could not have been analysed correctly without using the suggested methodology. The current study proves the need for such a presentation and provides the means to produce it. In this sense, it has opened a path for further research that will allow the improvement of the suggested methodology and its implementation for data sets that are currently not covered.

7.1. Limitations and future research

The suggested methodology in its current implementation does not scale well and requires excessive computing power to achieve its views. Therefore, it is not suggested for use on large data sets, and is more applicable to small data sets and training data sets. The suggested methodology also fails to provide a clear means by which to order the samples according to different types of perspective. To a certain extent, this must be done manually.
There are still several open issues regarding the use of the suggested methodology: finding an efficient method to minimize the heterogeneity meter in order to find the correct association of the clusters according to the different algorithms; identifying which algorithms to use to cluster a specific data set so as to form the desired perspective; adapting the suggested methodology for use with large data sets; and finding a formula to normalize the heterogeneity meter with respect to the number of clusters the data set was classified into.

References

BOUDJELOUD, L. and F. POULET (2005) Visual interactive evolutionary algorithm for high dimensional data clustering and outlier detection, Lecture Notes in Artificial Intelligence, 3518, 426–431.
CLIFFORD, H.T. and W. STEVENSON (1975) An Introduction to Numerical Classification, New York: Academic Press.
DE-OLIVEIRA, M.C.F. and H. LEVKOWITZ (2003) From visual data exploration to visual data mining: a survey, IEEE Transactions on Visualization and Computer Graphics, 9 (3), 378–394.
ERLICH, Z., R. GELBARD and I. SPIEGLER (2002) Data mining by means of binary representation: a model for similarity and clustering, Information Systems Frontiers, 4, 187–197.
FISHER, R.A. (1936) The use of multiple measurements in taxonomic problems, Annals of Eugenics, 7, 179–188.
JAIN, A.K. and R.C. DUBES (1988) Algorithms for Clustering Data, Upper Saddle River, NJ: Prentice Hall.
JAIN, A.K., M.N. MURTY and P.J. FLYNN (1999) Data clustering: a review, ACM Computing Surveys, 31, 264–323.
SHAPIRA, B., P. SHOVAL and U. HANANI (1999) Experimentation with an information filtering system that combines cognitive and sociological filtering integrated with user stereotypes, Decision Support Systems, 27, 5–24.
SHARAN, R. and R. SHAMIR (2002) Algorithmic approaches to clustering gene expression data, in Current Topics in Computational Molecular Biology, T. Jiang, T. Smith, Y. Xu and M.Q. Zhang (eds), Boston, MA: MIT Press, 269–300.
SHULTZ, T., D. MARESCHAL and W. SCHMIDT (1994) Modeling cognitive development on balance scale phenomena, Machine Learning, 16, 59–88.

Essay 2

"Decision Support System Using Visualization of Multi-Algorithms Voting"

DSS Using Visualization of Multi-Algorithms Voting

Ran M. Bittmann, Graduate School of Business Administration, Bar-Ilan University, Israel
Roy M. Gelbard, Graduate School of Business Administration, Bar-Ilan University, Israel

INTRODUCTION

The problem of analyzing datasets and classifying them into clusters based on known properties is a well-known problem with implementations in fields such as finance (e.g., pricing), computer science (e.g., image processing), marketing (e.g., market segmentation), and medicine (e.g., diagnostics), among others (Cadez, Heckerman, Meek, Smyth, & White, 2003; Clifford & Stevenson, 1975; Erlich, Gelbard, & Spiegler, 2002; Jain & Dubes, 1988; Jain, Murty, & Flynn, 1999; Sharan & Shamir, 2002). Currently, researchers and business analysts alike must try out and test each diverse algorithm and parameter combination separately in order to establish their preference for the individual decision problem they face. Moreover, there is no supportive model or tool available to help them compare the different result clusters yielded by these algorithm and parameter combinations. Commercial products neither show the resulting clusters of multiple methods, nor provide the researcher with effective tools with which to analyze and compare the outcomes of the different tools. To overcome these challenges, a decision support system (DSS) has been developed. The DSS uses a matrix presentation of multiple cluster divisions based on the application of multiple algorithms. The presentation is independent of the actual algorithms used, and it is up to the researcher to choose the most appropriate algorithms based on his or her personal expertise.
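The multi-algorithm idea can be sketched in a few lines of Python. This is not the authors' implementation; it is a minimal illustration, using SciPy's hierarchical clustering on made-up data, of how several algorithms each produce their own cluster labels for the same samples.

```python
# Sketch (not the authors' tool): run several hierarchical clustering methods
# on one dataset and collect their cluster labels side by side.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated synthetic clusters of 5 samples, 4 properties each.
data = np.vstack([rng.normal(0.0, 0.3, (5, 4)),
                  rng.normal(3.0, 0.3, (5, 4))])

methods = ["average", "single", "complete", "median", "ward"]
votes = np.column_stack([
    fcluster(linkage(data, method=m), t=2, criterion="maxclust")
    for m in methods
])  # rows = samples, columns = algorithms ("votes")
print(votes)
```

Each column is one algorithm's view of the dataset; comparing the columns is exactly the problem the DSS addresses, since the numeric label an algorithm assigns to a cluster is arbitrary.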
Within this context, the current study will demonstrate the following:

• How to evaluate different algorithms with respect to an existing clustering problem.
• How to identify areas where the clustering is more effective and areas where the clustering is less effective.
• How to identify problematic samples that may indicate difficult pricing and positioning of a product.

Visualization of the dataset and its classification is virtually impossible using legacy methods when more than three properties are used, as is the case in many problems, since displaying the dataset in such a case requires giving up some of the properties or using some other method to display the dataset's distribution over four or more dimensions. This makes it very difficult to relate to the dataset samples and understand which of them are difficult to classify (even when they are classified correctly), and which samples and clusters stand out clearly (Boudjeloud & Poulet, 2005; De-Oliveira & Levkowitz, 2003; Grabmier & Rudolph, 2002; Shultz, Mareschal, & Schmidt, 1994). Even when the researcher uses multiple algorithms to classify the dataset, there are no available tools that allow him/her to use the combined outcome of the algorithms' application. In addition, the researcher has no tools with which to analyze the differences in the results. The current study demonstrates the usage of a developed decision support methodology based upon formal quantitative measures and a visual approach, enabling presentation, comparison, and evaluation of the multi-classification suggestions resulting from diverse algorithms. The suggested methodology and DSS support a cross-algorithm presentation; all resultant classifications are presented together in a "Tetris-like format" in which each column represents a specific classification algorithm and each line represents a specific sample case.
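The "Tetris-like" arrangement can be sketched as a simple row sort. This is a hedged illustration with hypothetical sample names and votes: rows where the algorithms disagree "float" to the top, while unanimous rows sink to the bottom.

```python
# Hypothetical vote matrix: each row is one sample's votes across 4 algorithms.
vote_matrix = [
    ("s1", [1, 2, 1, 3]),
    ("s2", [2, 2, 2, 2]),
    ("s3", [1, 1, 2, 1]),
    ("s4", [1, 1, 1, 1]),
]

# Sort by disagreement: more distinct votes per row means more heterogeneity.
ordered = sorted(vote_matrix, key=lambda row: len(set(row[1])), reverse=True)
for name, votes in ordered:
    print(name, votes)   # contested rows first, unanimous rows last
```

The real DSS orders rows with a heterogeneity meter rather than this naive distinct-vote count, but the visual effect is the same: agreed blocks of color accumulate at the bottom of each column.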
Formal quantitative measures are then used to analyze these "Tetris blocks," arranging them according to their best structures, that is, the most agreed-upon classification, which is probably the most agreed-upon decision.

Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

Such a supportive model and DSS impact the ultimate business utility decision significantly. Not only can it save critical time, it also pinpoints all irregular sample cases, which may require specific examination. In this way, the decision process focuses on key issues instead of wasting time on technical aspects. The DSS is demonstrated using a common clustering problem of wine categorization, based on 13 measurable properties.

THEORETICAL BACKGROUND

Cluster Analysis

In order to classify a dataset of samples with a given set of properties, researchers use algorithms that associate each sample with a suggested group-cluster, based on its properties. The association is performed using a likelihood measure that indicates the similarity between any two samples, as well as between a sample to be associated and a certain group-cluster. There are two main clustering-classification types:

• Supervised (also called categorization), in which a fixed number of clusters is predetermined, and the samples are divided-categorized into these groups.
• Unsupervised (called clustering), in which the preferred number of clusters to classify the dataset into is formed by the algorithm while processing the dataset.

There are unsupervised methods, such as hierarchical clustering methods, that provide visualization of the entire "clustering space" (a dendrogram) and at the same time enable predetermination of a fixed number of clusters. A researcher therefore takes the following steps:
1. The researcher selects the best classification algorithm based on his/her experience and knowledge of the dataset.
2. The researcher tunes the chosen classification algorithm by determining parameters, such as the likelihood measure and the number of clusters.

The current study uses hierarchical clustering methods, which are briefly described in the following section.

Hierarchical Clustering Methods

Hierarchical clustering methods refer to a set of algorithms that work in a similar manner. These algorithms take the dataset properties that need to be clustered and start out by classifying the dataset in such a way that each sample represents a cluster. Next, they merge the clusters in steps; each step merges two clusters into a single cluster until only one cluster (the whole dataset) remains. The algorithms differ in the way in which the distance between clusters is measured, mainly through two parameters: the distance or likelihood measure, for example, Euclidean, Dice, and so forth, and the cluster method, for example, between-group linkage, nearest neighbor, and so forth. In the present study, we used the following well-known hierarchical methods to classify the datasets:

• Average linkage (within groups): This method calculates the distance between two clusters by applying the likelihood measure to all the samples in the two clusters. The clusters with the best average likelihood measure are then united.
• Average linkage (between groups): This method calculates the distance between two clusters by applying the likelihood measure to all the samples of one cluster and then comparing it with all the samples of the other cluster. Once again, the two clusters with the best likelihood measure are united.
• Single linkage (nearest neighbor): This method, as in the average linkage (between groups) method, calculates the distance between two clusters by applying the likelihood measure to all the samples of one cluster and then comparing it with all the samples of the other cluster. The two clusters with the best likelihood measure from a single pair of samples are united.
• Median: This method calculates the median of each cluster. The likelihood measure is applied to the medians of the clusters, after which the clusters with the best median likelihood are united.
• Ward: This method calculates the centroid of each cluster and the square of the likelihood measure of each sample in both the cluster and the centroid. The two clusters which, when united, have the smallest (negative) effect on the sum of likelihood measures are the clusters that are united.

Likelihood-Similarity Measure

In all the algorithms, we used the squared Euclidean distance as the likelihood-similarity measure. This measure calculates the distance between two samples as the sum of the squared differences between their properties. As seen previously, the algorithms and the likelihood measures differ in their definition of the task; that is, the clusters are different and the distance of a sample from a cluster is measured differently. As a result, the dataset classifications differ without obvious dependency between the applied algorithms. The analysis becomes even more complicated if the true classification is unknown and the researcher has no means of identifying the core of the correct classification and the samples that are difficult to classify.
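The squared Euclidean measure described above is simple enough to sketch directly (the property values here are made up for illustration):

```python
# Squared Euclidean distance: the sum of squared differences across properties.
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

s1 = [13.2, 1.8, 2.4]   # hypothetical property vectors for two samples
s2 = [12.9, 2.0, 2.2]
print(squared_euclidean(s1, s2))  # ≈ 0.17
```

Note that, unlike the plain Euclidean distance, no square root is taken; the squared form preserves the same ordering of pairs while penalizing large per-property differences more heavily.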
Visualization: Dendrogram

Currently, the results can be displayed in numeric tables, in 2D and 3D graphs, and, when hierarchical classification algorithms are applied, also in a dendrogram. A dendrogram is a tree-like graph that presents the entire "clustering space," that is, the merger of clusters from the initial case, where each sample is a different cluster, to the total merger, where the whole dataset is one cluster. The vertical lines in a dendrogram represent clusters that are joined, while the horizontal lines represent the likelihood coefficient for the merger. The shorter the horizontal line, the higher the likelihood that the clusters will merge. Though the dendrogram provides the researcher with some sort of visual representation, it is limited to a subset of the algorithms used. Furthermore, the information in the dendrogram relates only to the algorithm used and does not compare or utilize additional algorithms. The information itself serves as a visual aid for joining clusters, but does not provide a clear indication of inconsistent samples, in the sense that their position in the dataset spectrum, according to the chosen properties, is misleading and they are likely to be wrongly classified. This is a common visual aid used by researchers, but it is not applicable to all algorithms. Among the tools that utilize the dendrogram visual aid is the Hierarchical Clustering Explorer. This tool tries to deal with the multidimensional presentation of datasets with multiple variables. It produces a dashboard of presentations around the dendrogram that shows the classification process of the hierarchical clustering, together with a scatter plot that is a human-readable presentation of the dataset but is limited to two variables (Seo & Shneiderman, 2002, 2005).

Visualization: Additional Methods

Discriminant Analysis and Factor Analysis

The problem of clustering may be perceived as finding functions applied on the variables that discriminate between samples and decide to which cluster they belong.
Since there are usually more than two or even three variables, it is difficult to visualize the samples in such multidimensional spaces; some methods therefore use the discriminating functions, which are a transformation of the original variables, and present them on two-dimensional plots. Discriminant function analysis is quite analogous to multiple regression. Two-group discriminant analysis is also called Fisher linear discriminant analysis after Fisher (1936). In general, in these approaches we fit a linear equation of the type:

Group = a + b1*x1 + b2*x2 + ... + bm*xm

where a is a constant and b1 through bm are regression coefficients. The variables (properties) with significant regression coefficients are the ones that contribute most to the prediction of group membership. However, these coefficients do not tell us between which of the groups the respective functions discriminate. The means of the functions across groups identify the groups' discrimination. This can be visualized by plotting the individual scores for the discriminant functions. Factor analysis is another way to determine which variables (properties) define a particular discriminant function. The former correlations can be regarded as factor loadings of the variables on each discriminant function (Abdi, 2007). It is also possible to visualize both correlations, between the variables in the model (using adjusted factor analysis) and the discriminant functions, using a tool that combines these two methods (Raveh, 2000).
SOM also known as Kohonen network is a method that is based on neural network models, with the intention to simplify the presentation of multidimensional data into the simpler more intuitive two-dimensional map (Kohonen, 1995). The process is an iterative process that tries to bring samples, in many cases a vector of properties, that are close, after applying on them the likelihood measure, next to each other in the two dimensional space. After a large number of iterations a map-like pattern is formed that groups similar data together, hence its use in clustering. Visualization: Discussion As described, these methodologies support visualization of a specific classification, based on a single set of parameters. Hence, current methodologies are usually incapable of making comparisons between different algorithms and leave the decision making, regarding which algorithm to choose, to the researcher. Furthermore, most of the visual aids, though giving a visual interpretation to the classification by the method of choice, lose some of the relevant information on the way, like in the case of discriminant analysis, where the actual relations between the dataset’s variable is being lost when projected on the two-dimensional space. This leaves the researcher with very limited visual assistance and prohibits the researcher from having a full view of the relations between the samples and a comparison between the dataset classifications based on the different available tools. DSS USING VISUALIZATION OF MULTI-ALGORITHMS VOTING This research presents the implementation of the multi-algorithm DSS. In particular, it demonstrates techniques to: • • Identify the profile of the dataset being researched Identify samples’ characteristics The Visual Analysis Model The tool presented in the current study presents the classification model from a clear, two-dimensional perspective, together with tools used for the analysis of this perspective. 
Vote Matrix The “vote matrix” concept process recognizes that each algorithm represents a different view of the dataset and its clusters, based on how the algorithm defines a cluster and measures the distance of a sample from a cluster. Therefore, each algorithm is given a “vote” as to how it perceives the dataset should be classified. The tool proposed in the current study presents the “vote matrix” generated by the “vote” of each algorithm used in the process. Each row represents a sample, while each column represents an algorithm and its vote for each sample about which cluster it should belong to, according to the algorithm’s understanding of both clusters and distances. Heterogeneity Meter The challenge in this method is to associate the different classifications, since each algorithm divides the dataset into different clusters. Although the number of clusters in each case remains the same for each algorithm, the tool is necessary in order to associate the clusters of each algorithm; for example, cluster number 2 according to algorithm A1 is the same as cluster number 3 according to algorithm A2. To achieve this correlation, we will calculate a measure called the heterogeneity meter for each row, that is, the collection of votes for a particular sample, and sum it up for all the samples. Multiple methods can be used to calculate the heterogeneity meter. These methods are described as follows: Identify the strengths and weaknesses of each clustering algorithm 46 DSS Using Visualization of Multi-Algorithms Voting Squared VE (Vote Error) This heterogeneity meter is calculated as the square sum of all the votes that did not vote for the chosen classification. 
It is calculated as follows:

H = ∑_{i=1}^{n} (N − M_i)²

Equation 1: Squared VE heterogeneity meter, where H is the heterogeneity meter, N is the number of algorithms voting for each sample, M_i is the maximum number of similar votes, according to a specific association, received for sample i, and n is the total number of samples in the dataset.

Distance From Second Best (DFSB)

This heterogeneity meter is calculated as the difference between the number of votes that the best vote, that is, the vote common to most algorithms, received and the number of votes the second-best vote received. The idea is to discover to what extent the best vote is distinguished from the rest. This meter is a reverse meter: the higher it is, the less heterogeneous the sample. It is calculated as follows:

H = ∑_{i=1}^{n} (B_i − SB_i)

Equation 2: DFSB heterogeneity meter, where H is the heterogeneity meter, B_i is the number of votes for the best cluster, that is, the cluster voted for the most times for sample i, SB_i is the number of votes for the second-best cluster for sample i, and n is the total number of samples in the dataset.

Heterogeneity Meter Implementation

In order to find the best association, the heterogeneity meter needs to be minimized, that is, we identify the association that makes the votes for each sample as homogeneous as possible. The heterogeneity meter is then used to sort the voting matrix, giving the researcher a clear, two-dimensional perspective of the clusters and indicating how well each sample is associated with its designated cluster.

Visual Pattern Characteristics

In this section, we demonstrate several typical patterns that can be recognized in the suggested DSS. In each pattern, we find the following columns:

S – Sample number
T – True clustering
A1, A2, A3, A4, A5, A6 – The six algorithms used for clustering

For each example, there are five rows representing five different samples.

Well-Classified Samples

Figure 1.
Well-classified clusters

In Figure 1, we can see that sample 68 was classified correctly by all algorithms. This is an indication that the variables used to classify the dataset work well with this sample; if this is consistent across the cluster, it shows that these variables can be used to identify it.

Figure 2. A hard-to-classify example

Figure 3. Algorithms that are effective for a certain cluster

Samples that are Hard to Classify

In Figure 2, we see that while samples 59-62 are classified correctly and identically by nearly all the chosen methods, sample 71 is classified differently. This is an indication that this sample is hard to classify and that the parameters used for classification do not clearly assign it to any particular cluster.

Algorithms that are Effective for a Certain Cluster

In Figure 3, we see that algorithm A6 is more effective for classifying the red cluster, as it is the only algorithm that succeeded in classifying it correctly. This does not mean that it is the best algorithm overall, but it does indicate that if the researcher wants to find candidates for that particular cluster, then A6 is a good choice.

Wrongly Classified Samples

In Figure 4, we see that some samples, mainly 174, 175, and 178, were classified incorrectly by all algorithms. This is evident since the cluster color of the classification by the algorithms, marked A1-A6, is different from the true classification, marked T. This is an indication that the parameters by which the dataset was classified are probably not ideal for some samples; if this is consistent within a certain cluster, we can then say that the set of variables used to classify the dataset is not effective for identifying that cluster.

Figure 4.
IMPLEMENTATION—THE CASE OF WINE RECOGNITION

The Dataset
To demonstrate the implementation of the DSS, we chose the Wine Recognition Data (Forina, Leardi, Armanino, & Lanteri, 1988; Gelbard, Goldman, & Spiegler, 2007). This is a collection of wines classified using thirteen different variables:
1. Alcohol
2. Malic acid
3. Ash
4. Alcalinity of ash
5. Magnesium
6. Total phenols
7. Flavanoids
8. Non-flavanoid phenols
9. Proanthocyanins
10. Color intensity
11. Hue
12. OD280/OD315 of diluted wines
13. Proline
The target is to cluster the wines, based on the given attributes, into three clusters representing the three different cultivars from which the wines are derived.

The Implementation
We used six hierarchical clustering methods:
1. Average linkage (between groups)
2. Average linkage (within group)
3. Complete linkage (furthest neighbor)
4. Centroid
5. Median
6. Ward
We performed the cluster association using the DFSB heterogeneity meter; the resulting vote matrix is depicted in Figures 5 and 6.

Figure 5. Wine cases: Vote matrix part 1
Figure 6. Wine cases: Vote matrix part 2

DISCUSSION
The advantage of the visual representation of clustering the wine dataset is well depicted in Figures 5 and 6, as we get a graphical representation of the dataset and its classification. Immediate results of this presentation include the following. Looking at the vote matrix, it is easy to see that two of the three clusters are well detected by the hierarchical clustering algorithms. It can also be seen that some samples, such as samples 70, 71, 74, and 75, are not easy to classify, while other samples, such as sample 44, are falsely associated. Furthermore, it can be seen that average linkage (within group) is probably not an algorithm that will work well with this dataset.
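The implementation step described above, running several hierarchical algorithms and collecting their votes into one matrix, can be sketched with SciPy's linkage routines. This is not the original tool: the data below is a synthetic stand-in for the thirteen wine variables, and "average linkage (within group)" has no direct SciPy counterpart, so only five of the six methods are shown.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Synthetic stand-in for the 13-variable wine data: three separated groups.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 13))
               for c in (0.0, 2.0, 4.0)])

# Five of the six linkage criteria named in the text map onto SciPy
# method names; "average linkage (within group)" is omitted here.
methods = ["average", "complete", "centroid", "median", "ward"]

# One vote-matrix column per algorithm, each dendrogram cut at 3 clusters.
vote_matrix = np.column_stack([
    fcluster(linkage(X, method=m), t=3, criterion="maxclust")
    for m in methods
])
print(vote_matrix.shape)  # (60, 5)
```

Each row of `vote_matrix` is then one line of the visual display; a heterogeneity meter such as DFSB is computed over these rows to choose the cluster association and the sort order.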
Figures 7 and 8, in Appendix A, rearrange the cases, that is, the lines of Figures 5 and 6, so that agreed cases are placed close to each other, in cluster order, creating a "Tetris-like" view. As noted above, each column represents a specific algorithm, each line represents a specific case, and each color represents a "vote", that is, a decision suggestion. Uni-color lines represent cases in which all algorithms vote for the same cluster (each cluster is represented by a different color). These agreed cases are "pushed down", while multi-color lines "float" above them, much as pieces settle in a Tetris game.

CONCLUSION AND FURTHER RESEARCH
The DSS presented in this article uses the results of different algorithms to present the researcher with a clear picture of the data being researched. The DSS is a tool that assists the researcher and allows him or her to apply expertise in selecting both the variables by which the data is classified and the algorithms used to classify it. In some cases, the researcher knows the expected number of clusters into which to divide the dataset, while in other cases the researcher needs assistance. The discussed DSS works well in both cases, as it can present different pictures of the dataset resulting from the different classifications. The result is a tool that can assist researchers in analyzing and presenting a dataset that is otherwise difficult to comprehend. The researcher can easily see, rather than calculate, both the trends and the classifications in the researched dataset and can clearly present them to colleagues. To activate the analysis, a tool was developed that performs the association of the different algorithms. This tool uses brute force and is thus not yet scalable to a large number of clusters and algorithms. More efficient ways to perform the association require further research.
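The "Tetris-like" rearrangement used for Figures 7 and 8 can be approximated with a simple sort. This is an illustrative sketch, not the brute-force association tool itself: rows are ordered by their level of agreement, so unanimous rows sink to the bottom, grouped by cluster, while contested rows float to the top.

```python
from collections import Counter

def tetris_sort(vote_matrix):
    """Sort vote-matrix rows so unanimous ("uni-color") rows sink to
    the bottom, grouped by their majority cluster, while contested
    ("multi-color") rows float to the top."""
    def key(votes):
        best_cluster, best_count = Counter(votes).most_common(1)[0]
        # Primary key: agreement level (low agreement sorts first, i.e.
        # floats to the top). Secondary key: the majority cluster, so
        # agreed rows form contiguous same-color blocks.
        return (best_count, best_cluster)
    return sorted(vote_matrix, key=key)

rows = [[1, 1, 1], [2, 2, 2], [1, 2, 3], [1, 1, 2]]
print(tetris_sort(rows))  # [[1, 2, 3], [1, 1, 2], [1, 1, 1], [2, 2, 2]]
```

Because `sorted` is stable, rows with equal agreement and the same majority cluster keep their relative order, which keeps the resulting display easy to compare across runs.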
There are also multiple methods for calculating the heterogeneity meter. Two of them were presented in the current study, but there is still room for other methods that associate the clusters based on different criteria, such as prioritizing associations with a clear classification in as many samples as possible versus associations with the minimum error over the whole vote matrix.

REFERENCES

Abdi, H. (2007). Discriminant correspondence analysis. In N. J. Salkind (Ed.), Encyclopedia of measurement and statistics. Sage.

Boudjeloud, L., & Poulet, F. (2005). Visual interactive evolutionary algorithm for high dimensional data clustering and outlier detection. (LNAI 3518, pp. 426-431).

Cadez, I., Heckerman, D., Meek, C., Smyth, P., & White, S. (2003). Model-based clustering and visualization of navigation patterns on a Web site. Data Mining and Knowledge Discovery, 7, 399-424.

Clifford, H. T., & Stevenson, W. (1975). An introduction to numerical classification. Academic Press.

De-Oliveira, M. C. F., & Levkowitz, H. (2003). From visual data exploration to visual data mining: A survey. IEEE Transactions on Visualization and Computer Graphics, 9(3), 378-394.

Erlich, Z., Gelbard, R., & Spiegler, I. (2002). Data mining by means of binary representation: A model for similarity and clustering. Information Systems Frontiers, 4, 187-197.

Forina, M., Leardi, R., Armanino, C., & Lanteri, S. (1988). PARVUS—An extendible package for data exploration, classification and correlation. Genova, Italy: Institute of Pharmaceutical and Food Analysis and Technologies.

Gelbard, R., Goldman, O., & Spiegler, I. (2007). Investigating diversity of clustering methods: An empirical comparison. Data & Knowledge Engineering, doi:10.1016/j.datak.2007.01.002.

Grabmier, J., & Rudolph, A. (2002). Techniques of cluster algorithms in data mining. Data Mining and Knowledge Discovery, 6, 303-360.

Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Prentice Hall.

Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31, 264-323.

Kohonen, T. (1995). Self-organizing maps. Series in Information Sciences, 30.

Raveh, A. (2000). Coplot: A graphic display method for geometrical representations of MCDM. European Journal of Operational Research, 125, 670-678.

Seo, J., & Shneiderman, B. (2002). Interactively exploring hierarchical clustering results. IEEE Computer, 35(7), 80-86.

Seo, J., & Shneiderman, B. (2005). A rank-by-feature framework for interactive exploration of multidimensional data. Information Visualization, 4(2), 99-113.

Sharan, R., & Shamir, R. (2002). Algorithmic approaches to clustering gene expression data. In T. Jiang et al. (Eds.), Current topics in computational molecular biology (pp. 269-300). Cambridge, MA: MIT Press.

Shultz, T., Mareschal, D., & Schmidt, W. (1994). Modeling cognitive development on balance scale phenomena. Machine Learning, 16, 59-88.

KEY TERMS

Decision Support System (DSS): A system used to help resolve certain problems or dilemmas.

Dendrogram: A method of presenting the classification produced by a hierarchical clustering algorithm.

Distance From Second Best (DFSB): A method of calculating the distribution of votes for a certain sample, based on the difference between the highest number of similar associations and the second-highest number of similar associations.

Heterogeneity Meter: A measure of how heterogeneous a certain association of clusters, resulting from the implementation of an algorithm, is.

Hierarchical Clustering Algorithms: Clustering methods that start with every sample representing its own cluster and gradually unite samples into clusters based on their likelihood measure.
Likelihood Measurement: The measure that allows for the classification of a dataset using hierarchical clustering algorithms; it measures the extent to which a sample and a cluster are alike.

Vote Matrix: A graphical tool used to present a dataset classification produced by multiple algorithms.

Appendix A: The Rearranged Vote Matrix
Figure 7. The rearranged vote matrix, part 1
Figure 8. The rearranged vote matrix, part 2