
Figure 5: Fisher iris data set vote matrix after ordering.
© 2007 The Authors. Journal Compilation © 2007 Blackwell Publishing Ltd.
Expert Systems, July 2007, Vol. 24, No. 3
6.2.2. Methodology application

We did not know how many clusters the data set should be classified into, so we performed the test forcing the tools to cluster it into the suggested number of four clusters. We performed the same steps for the suggested methodology as we performed with the Fisher iris data set.
When we examined the final results, we noted
that the classification was not satisfactory. We
therefore applied the same steps with fewer
clusters – three and two, respectively.
6.2.3. Methodology implementation output

The user profiles data set ordered vote matrix with four clusters (Figure 6) shows the results after applying the suggested methodology to the data set.
First, we tried to observe the outcome when
classifying the data set into four clusters as
suggested in the original research. The outcome
was very inconsistent. A cluster comprising samples 1, 4, 8, 9, 17, 23, 25, 29, 30, 31, 32, 37, 39 and
40 could be identified, but the rest of the samples,
excluding sample 26, were not clearly associated
with more than one additional cluster. This
certainly does not indicate that the initial assumption, that the data set could be classified into four
clusters based on the eight properties, is correct.
The user profiles data set ordered vote matrix
with three clusters (Figure 7) shows the results for
the user profiles data set when forcibly classified
into three clusters. We tried this classification
after receiving unsatisfactory results from the
four-cluster attempt. The results were quite similar: the outstanding cluster comprising
samples 1, 4, 8, 9, 17, 23, 25, 29, 30, 31, 32, 37,
39 and 40 was clearly identified, while the rest of
the samples could not be clearly divided into
more than one additional cluster.
The user profiles data set ordered vote matrix
with two clusters (Figure 8) shows the user profiles
data set with the suggested methodology applied
to it assuming two clusters. This time, the results
were quite clear and two clusters could be easily
identified. It is important to note that the cluster
that was identified even when trying to classify
the data set into four clusters remained consistent
throughout all the methodology applications.
Figure 6: User profiles data set ordered vote
matrix with four clusters.
Applying the methodology to the other assumptions showed that the rest of the samples could
not be divided into additional clusters. Hence,
these two clusters were probably the best classification based on the given properties.
7. Discussion and conclusions
The suggested methodology produced a clear, visual presentation of data set classifications, which can be used to identify samples that are clustered correctly. The user profiles data set is a good example of this, as we started with the initial assumption that it should be classified into four clusters, but when performing the classification it became obvious that four clusters was incorrect. Figure 6 demonstrates this very clearly. Continuing to work according to the suggested methodology, we reached a very clear classification that is demonstrated visually in Figure 8.

Figure 7: User profiles data set ordered vote matrix with three clusters.

Figure 8: User profiles data set ordered vote matrix with two clusters.
Looking at the Fisher iris data set, as demonstrated in Figure 5, we can see how samples that
are falsely clustered, such as sample 25, or samples that are difficult to cluster, such as sample 57,
stand out. Such a view is hard to reach using
legacy presentations, such as two- or three-dimensional scatter charts, since in many cases
there are more than three properties by which the
data set is classified. However, there are no good
and clear means with which to present a data set
distribution in more than three dimensions. This
requires us to use perspectives where not all the
properties are presented, causing an inaccurate
and sometimes even misleading presentation.
The suggested methodology produced a perspective that gives a clear presentation of the
effectiveness of the different algorithms; such a
perspective can be useful when applied to a
training data set in order to decide on the most
effective algorithm to use in order to classify
future samples. This is demonstrated by the
association of the clusters presented in Figures
3 and 4. Using the heterogeneity meter, the outcomes of the different algorithms can be compared. In contrast, with legacy methods it is quite difficult to associate and analyse, for example, an outcome of six classification samples {1, 2, 3, 4, 5, 6} clustered by two different algorithms as output A {(1, 2), (3, 4), (5, 6)} and output B {(1, 4), (3, 6), (2, 5)}.
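The difficulty of associating such outputs can be made concrete with a small sketch (ours, not the study's tool): a brute-force search over cluster relabelings of output B for the one that agrees most with output A.

```python
from itertools import permutations

def to_labels(groups, n_samples):
    """Convert a clustering given as groups of sample ids,
    e.g. [(1, 2), (3, 4), (5, 6)], into one label per sample."""
    labels = [None] * n_samples
    for cluster_id, group in enumerate(groups):
        for sample in group:
            labels[sample - 1] = cluster_id
    return labels

def best_association(labels_a, labels_b):
    """Brute-force the relabeling of B's clusters that agrees most with A."""
    clusters = sorted(set(labels_b))
    best_map, best_agree = None, -1
    for perm in permutations(clusters):
        mapping = dict(zip(clusters, perm))
        agree = sum(a == mapping[b] for a, b in zip(labels_a, labels_b))
        if agree > best_agree:
            best_map, best_agree = mapping, agree
    return best_map, best_agree

labels_a = to_labels([(1, 2), (3, 4), (5, 6)], 6)   # output A from the text
labels_b = to_labels([(1, 4), (3, 6), (2, 5)], 6)   # output B from the text
mapping, agreement = best_association(labels_a, labels_b)
# even the best relabeling leaves the two outputs mostly in disagreement
```

For the two example outputs above, no relabeling reconciles more than half of the six samples, which is exactly the kind of disagreement the heterogeneity meter is designed to quantify.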
The current study suggests a methodology for
classifying data sets. Though limited regarding the size of the data sets it can analyse, it succeeds in providing a clear visual perspective of areas of interest for which legacy tools fail to provide a satisfactory visual presentation.
We demonstrated the successful application of
the methodology to a well-known data set as well as
to a data set that could not have been analysed
correctly without using the suggested methodology.
The current study demonstrates the need for such a presentation and provides the means to produce
it. In this sense, it has opened a path for further
research that will allow the improvement of the
suggested methodology and its implementation
to data sets that are currently not covered.
7.1. Limitations and future research
The suggested methodology in its current application does not scale well and requires excessive computing power to produce its views. Therefore, it is not suitable for large data sets, and is more applicable to small data sets and training data sets.
The suggested methodology also fails to provide a clear means by which to order the samples
according to different types of perspective. To a
certain extent, this must be done manually.
There are still several open issues regarding
the use of the suggested methodology:
• finding an efficient method to minimize the heterogeneity meter in order to find the correct association of the clusters according to the different algorithms;
• identifying which algorithms to use to cluster a specific data set that forms the desired perspective;
• adapting the application of the suggested methodology for use with large data sets;
• finding a formula to normalize the heterogeneity meter with respect to the number of clusters the data set was classified into.
References

BOUDJELOUD, L. and F. POULET (2005) Visual interactive evolutionary algorithm for high dimensional data clustering and outlier detection, Lecture Notes in Artificial Intelligence, 3518, 426–431.
CLIFFORD, H.T. and W. STEVENSON (1975) An Introduction to Numerical Classification, New York: Academic Press.
DE-OLIVEIRA, M.C.F. and H. LEVKOWITZ (2003) From visual data exploration to visual data mining: a survey, IEEE Transactions on Visualization and Computer Graphics, 9 (3), 378–394.
ERLICH, Z., R. GELBARD and I. SPIEGLER (2002) Data mining by means of binary representation: a model for similarity and clustering, Information Systems Frontiers, 4, 187–197.
FISHER, R.A. (1936) The use of multiple measurements in taxonomic problems, Annals of Eugenics, 7, 179–188.
JAIN, A.K. and R.C. DUBES (1988) Algorithms for Clustering Data, Upper Saddle River, NJ: Prentice Hall.
JAIN, A.K., M.N. MURTY and P.J. FLYNN (1999) Data clustering: a review, ACM Computing Surveys, 31, 264–323.
SHAPIRA, B., P. SHOVAL and U. HANANI (1999) Experimentation with an information filtering system that combines cognitive and sociological filtering integrated with user stereotypes, Decision Support Systems, 27, 5–24.
SHARAN, R. and R. SHAMIR (2002) Algorithmic approaches to clustering gene expression data, in Current Topics in Computational Molecular Biology, T. Jiang, T. Smith, Y. Xu and M.Q. Zhang (eds), Boston, MA: MIT Press, 269–300.
SHULTZ, T., D. MARESCHAL and W. SCHMIDT (1994) Modeling cognitive development on balance scale phenomena, Machine Learning, 16, 59–88.
Essay 2
“Decision Support System Using Visualization of Multi-Algorithms Voting”
DSS Using Visualization of Multi-Algorithms
Voting
Ran M. Bittmann
Graduate School of Business Administration – Bar-Ilan University, Israel
Roy M. Gelbard
Graduate School of Business Administration – Bar-Ilan University, Israel
INTRODUCTION
The problem of analyzing datasets and classifying
them into clusters based on known properties is a well-known problem with implementations in fields such as
finance (e.g., pricing), computer science (e.g., image
processing), marketing (e.g., market segmentation),
and medicine (e.g., diagnostics), among others (Cadez,
Heckerman, Meek, Smyth, & White, 2003; Clifford &
Stevenson, 2005; Erlich, Gelbard, & Spiegler, 2002;
Jain & Dubes, 1988; Jain, Murty, & Flynn, 1999; Sharan
& Shamir, 2002).
Currently, researchers and business analysts alike must try and test each algorithm and parameter setting separately in order to establish their preference concerning the individual decision problem they face. Moreover, there is no supportive model or tool available to help them compare the different results-clusters yielded by these algorithm and parameter combinations. Commercial products neither show
the resulting clusters of multiple methods, nor provide
the researcher with effective tools with which to analyze
and compare the outcomes of the different tools.
To overcome these challenges, a decision support
system (DSS) has been developed. The DSS uses a
matrix presentation of multiple cluster divisions based
on the application of multiple algorithms. The presentation is independent of the actual algorithms used and it
is up to the researcher to choose the most appropriate
algorithms based on his or her personal expertise.
Within this context, the current study will demonstrate the following:

• How to evaluate different algorithms with respect to an existing clustering problem.
• How to identify areas where the clustering is more effective and areas where the clustering is less effective.
• How to identify problematic samples that may indicate difficult pricing and positioning of a product.
Visualization of the dataset and its classification is
virtually impossible using legacy methods when more
than three properties are used, as is the case in many
problems, since displaying the dataset in such a case will
require giving up some of the properties or using some
other method to display the dataset’s distribution over
four or more dimensions. This makes it very difficult
to relate to the dataset samples and understand which
of these samples are difficult to classify (even when
they are classified correctly), and which samples and
clusters stand out clearly (Boudjeloud & Poulet, 2005;
De-Oliveira & Levkowitz, 2003; Grabmier & Rudolph,
2002; Shultz, Mareschal, & Schmidt, 1994).
Even when the researcher uses multiple algorithms
in order to classify the dataset, there are no available
tools that allow him/her to use the outcome of the
algorithms’ application. In addition, the researcher
has no tools with which to analyze the difference in
the results.
The current study demonstrates the usage of a
developed decision support methodology based upon
formal quantitative measures and a visual approach,
enabling presentation, comparison, and evaluation
of the multi-classification suggestions resulting from
diverse algorithms. The suggested methodology and
DSS support a cross-algorithm presentation; all resultant
classifications are presented together in a “Tetris-like
format” in which each column represents a specific
classification algorithm and each line represents a
specific sample case. Formal quantitative measures are
then used to analyze these “Tetris blocks,” arranging
them according to their best structures, that is, the most
agreed-upon classification, which is probably the most
agreed-upon decision.
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Such a supportive model and DSS can significantly impact the ultimate business decision. Not only can they save critical time, they also pinpoint all irregular sample cases, which may require specific examination. In this way, the decision process focuses on key issues instead of wasting time on technical aspects. The DSS is demonstrated using common clustering problems of wine categorization, based on 13 measurable properties.
THEORETICAL BACKGROUND
Cluster Analysis
In order to classify a dataset of samples with a given set of properties, researchers use algorithms that associate each sample with a suggested group-cluster, based on its properties. The association is performed using a likelihood measure that indicates the similarity between any two samples, as well as between a sample to be associated and a certain group-cluster.
There are two main clustering-classification types:

• Supervised (also called categorization), in which a fixed number of clusters is predetermined, and the samples are divided-categorized into these groups.
• Unsupervised (also called clustering), in which the preferred number of clusters to classify the dataset into is formed by the algorithm while processing the dataset.
There are unsupervised methods, such as hierarchical clustering methods, that provide visualization of the entire "clustering space" (dendrogram) and at the same time enable predetermination of a fixed number of clusters.
A researcher therefore uses the following steps:

1. The researcher selects the best classification algorithm based on his/her experience and knowledge of the dataset.
2. The researcher tunes the chosen classification algorithm by determining parameters, such as the likelihood measure and the number of clusters.
The current study uses hierarchical clustering methods, which are briefly described in the following section.
Hierarchical Clustering Methods
Hierarchical clustering methods refer to a set of algorithms that work in a similar manner. These algorithms take the dataset properties that need to be clustered and start out by classifying the dataset in such a way that each sample represents a cluster. Next, they merge the clusters in steps, each step merging two clusters into a single cluster, until only one cluster (the whole dataset) remains.
The algorithms differ in the way in which distance is
measured between the clusters, mainly by using two
parameters: the distance or likelihood measure, for
example, Euclidean, Dice, and so forth, and the cluster
method, for example, between group linkage, nearest
neighbor, and so forth.
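The merging loop described above can be sketched in plain Python. This is an illustrative single-linkage (nearest-neighbor) version under a squared Euclidean distance, not the implementation used in the study:

```python
def sq_euclidean(a, b):
    """Squared Euclidean distance between two property vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def single_linkage(clusters):
    """One agglomerative pass: repeatedly merge the two clusters whose
    closest pair of samples is nearest, until one cluster remains.
    `clusters` starts as a list of singleton lists of samples."""
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(sq_euclidean(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i][:], clusters[j][:], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

samples = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
history = single_linkage([[s] for s in samples])
# the two nearby pairs merge first; the cross-group merge comes last
```

Swapping the inner `min` for a mean or a centroid distance turns this same loop into average linkage or the median method, which is exactly how the algorithms listed below differ.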
In the present study, we used the following well-known hierarchical methods to classify the datasets:

• Average linkage (within groups): This method calculates the distance between two clusters by applying the likelihood measure to all the samples in the two clusters. The clusters with the best average likelihood measure are then united.
• Average linkage (between groups): This method calculates the distance between two clusters by applying the likelihood measure to all the samples of one cluster and then comparing it with all the samples of the other cluster. Once again, the two clusters with the best likelihood measure are then united.
• Single linkage (nearest neighbor): This method, as in the average linkage (between groups) method, calculates the distance between two clusters by applying the likelihood measure to all the samples of one cluster and then comparing it with all the samples of the other cluster. The two clusters with the best likelihood measure, from a pair of samples, are united.
• Median: This method calculates the median of each cluster. The likelihood measure is applied to the medians of the clusters, after which the clusters with the best median likelihood are then united.
• Ward: This method calculates the centroid for each cluster and the square of the likelihood measure of each sample in both the cluster and the centroid. The two clusters which, when united, have the smallest (negative) effect on the sum of likelihood measures are the clusters that need to be united.
Likelihood-Similarity Measure
In all the algorithms, we used the squared Euclidean distance measure as the likelihood-similarity measure. This measure calculates the distance between two samples as the sum of the squared differences between the corresponding properties.
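As a concrete sketch (not the study's code), the measure for two property vectors can be written as:

```python
import math

def squared_euclidean(p, q):
    """Sum of squared differences between corresponding properties."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def euclidean(p, q):
    """Ordinary Euclidean distance: the square root of the above."""
    return math.sqrt(squared_euclidean(p, q))

print(squared_euclidean((1, 2, 3), (4, 6, 3)))  # 9 + 16 + 0 = 25
print(euclidean((1, 2, 3), (4, 6, 3)))          # 5.0
```

Dropping the square root leaves cluster orderings unchanged while avoiding the root computation, which is why the squared form is a common choice in clustering.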
As seen previously, the algorithms and the likelihood measures differ in their definition of the task; that is, the clusters are different and the distance of a sample from a cluster is measured differently. As a result, the dataset classifications differ, without an obvious dependency between the applied algorithms. The analysis becomes even more complicated if the true classification is unknown and the researcher has no means of identifying the core of the correct classification and the samples that are difficult to classify.
Visualization: Dendrogram
Currently, the results can be displayed in numeric tables, in 2D and 3D graphs and, when hierarchical classification algorithms are applied, also in a dendrogram: a tree-like graph that presents the entire "clustering space", that is, the merger of clusters from the initial case, where each sample is a separate cluster, to the total merger, where the whole dataset is one cluster.
The vertical lines in a dendrogram represent clusters
that are joined, while the horizontal lines represent
the likelihood coefficient for the merger. The shorter the horizontal line, the higher the likelihood that the clusters will merge. Though the dendrogram provides the researcher with some sort of visual representation, it is limited to a subset of the algorithms used. Furthermore, the information in the dendrogram relates only to the algorithm used and does not compare or utilize additional algorithms. The information itself serves as a visual aid for joining clusters, but does not provide a clear indication of inconsistent samples, that is, samples whose position in the dataset spectrum, according to the chosen properties, is misleading and likely to lead to wrong classification. This is a common visual aid used by researchers, but it is not applicable to all algorithms.
Among the tools that utilize the dendrogram visual
aid is the Hierarchical Clustering Explorer. This tool
tries to deal with the multidimensional presentation of
datasets with multiple variables. It produces a dashboard of presentations around the dendrogram that shows the classification process of the hierarchical clustering, together with a scatter plot that is a human-readable presentation of the dataset but is limited to two variables (Seo & Shneiderman, 2002, 2005).
Visualization: Additional Methods
Discriminant Analysis and Factor Analysis
The problem of clustering may be perceived as finding functions, applied on the variables, that discriminate between samples and decide to which cluster they belong. Since there are usually more than two or even three variables, it is difficult to visualize the samples in such multidimensional spaces; some methods therefore use the discriminant functions, which are transformations of the original variables, and present them on two-dimensional plots.
Discriminant function analysis is quite analogous to multiple regression. Two-group discriminant analysis is also called Fisher linear discriminant analysis after Fisher (1936). In general, in these approaches we fit a linear equation of the type:
fit a linear equation of the type:
Group = a + b1*x1 + b2*x2 + ... + bm*xm
Where a is a constant and b1 through bm are regression coefficients.
The variables (properties) with the significant regression coefficients are the ones that contribute most
to the prediction of group membership. However, these
coefficients do not tell us between which of the groups
the respective functions discriminate. The means of the
functions across groups identify the group’s discrimination. It can be visualized by plotting the individual
scores for the discriminant functions.
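A minimal worked sketch of the two-group case, assuming 2-D samples and the classic Fisher direction w = Sw⁻¹(μ1 − μ2); the data here is hypothetical, not from the study:

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def fisher_direction(group1, group2):
    """Fisher's two-group linear discriminant for 2-D samples:
    w = Sw^-1 (mu1 - mu2), with Sw the pooled within-group scatter."""
    m1, m2 = mean(group1), mean(group2)
    s = [[0.0, 0.0], [0.0, 0.0]]          # pooled scatter matrix (2x2)
    for group, m in ((group1, m1), (group2, m2)):
        for v in group:
            d = [v[0] - m[0], v[1] - m[1]]
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

g1 = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2)]   # hypothetical group 1
g2 = [(4.0, 4.5), (4.2, 4.0), (3.8, 4.4)]   # hypothetical group 2

w = fisher_direction(g1, g2)

def score(v):
    """Project a sample on the discriminant direction."""
    return w[0] * v[0] + w[1] * v[1]
# plotting the scores separates the two groups along one axis
```

The plot of individual scores mentioned in the text corresponds to plotting `score(v)` for every sample: well-separated groups occupy disjoint score ranges.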
Factor analysis is another way to determine which
variables (properties) define a particular discriminant
function. The former correlations can be regarded as
factor loadings of the variables on each discriminant
function (Abdi, 2007).
It is also possible to visualize both correlations, between the variables in the model (using adjusted factor analysis) and the discriminant functions, using a tool that combines these two methods (Raveh, 2000).
Each ray represents one variable (property). The angle
between any two rays presents correlation between
these variables (possible factors).
The model is implemented on known datasets to further demonstrate its usage in real-life research.

Self-Organization Maps (SOM)
SOM, also known as a Kohonen network, is a method based on neural network models that aims to simplify the presentation of multidimensional data into a simpler, more intuitive two-dimensional map (Kohonen, 1995).
The process is iterative: samples, in many cases vectors of properties, that are close according to the likelihood measure are brought next to each other in the two-dimensional space. After a large number of iterations, a map-like pattern is formed that groups similar data together, hence its use in clustering.
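A toy SOM illustrating the iterative update can be written in plain Python. The grid size, neighbourhood rule, and decay schedule below are our illustrative assumptions, not Kohonen's exact formulation:

```python
import random

def train_som(data, rows=2, cols=2, iters=500, lr=0.5):
    """Minimal SOM sketch: a rows x cols grid of weight vectors is pulled
    toward the samples; the best matching unit (BMU) and its grid
    neighbours move closest to each presented sample."""
    rnd = random.Random(0)
    dim = len(data[0])
    grid = [[[rnd.random() for _ in range(dim)] for _ in range(cols)]
            for _ in range(rows)]
    for t in range(iters):
        x = data[t % len(data)]
        # locate the BMU by squared Euclidean distance
        br, bc = min(((r, c) for r in range(rows) for c in range(cols)),
                     key=lambda rc: sum((grid[rc[0]][rc[1]][k] - x[k]) ** 2
                                        for k in range(dim)))
        radius = 1 if t < iters // 2 else 0   # shrink the neighbourhood
        rate = lr * (1 - t / iters)           # decay the learning rate
        for r in range(rows):
            for c in range(cols):
                if abs(r - br) + abs(c - bc) <= radius:
                    w = grid[r][c]
                    for k in range(dim):
                        w[k] += rate * (x[k] - w[k])
    return grid

def bmu(grid, x):
    """Return the grid coordinates of the best matching unit for x."""
    cells = [(r, c) for r in range(len(grid)) for c in range(len(grid[0]))]
    return min(cells, key=lambda rc: sum((grid[rc[0]][rc[1]][k] - x[k]) ** 2
                                         for k in range(len(x))))

data = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 1.0)]
som = train_som(data)
# samples from the two well-separated groups map to different grid cells
```

After training, reading off `bmu` for each sample gives the map-like grouping described above: nearby samples share (or neighbour) a grid cell.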
Visualization: Discussion
As described, these methodologies support visualization of a specific classification, based on a single set of
parameters. Hence, current methodologies are usually
incapable of making comparisons between different
algorithms and leave the decision making, regarding
which algorithm to choose, to the researcher. Furthermore, most of the visual aids, though giving a visual interpretation to the classification by the method of choice, lose some of the relevant information along the way, as in the case of discriminant analysis, where the actual relations between the dataset's variables are lost when projected onto the two-dimensional space.
This leaves the researcher with very limited visual
assistance and prohibits the researcher from having a
full view of the relations between the samples and a
comparison between the dataset classifications based
on the different available tools.
DSS USING VISUALIZATION OF MULTI-ALGORITHMS VOTING
This research presents the implementation of the
multi-algorithm DSS. In particular, it demonstrates
techniques to:
• Identify the profile of the dataset being researched
• Identify samples' characteristics
• Identify the strengths and weaknesses of each clustering algorithm
The Visual Analysis Model
The tool presented in the current study presents the
classification model from a clear, two-dimensional
perspective, together with tools used for the analysis
of this perspective.
Vote Matrix
The "vote matrix" concept recognizes that each algorithm represents a different view of the dataset and its clusters, based on how the algorithm defines a cluster and measures the distance of a sample from a cluster.
Therefore, each algorithm is given a “vote” as to how
it perceives the dataset should be classified.
The tool proposed in the current study presents the
“vote matrix” generated by the “vote” of each algorithm
used in the process. Each row represents a sample,
while each column represents an algorithm and its vote
for each sample about which cluster it should belong
to, according to the algorithm’s understanding of both
clusters and distances.
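A vote matrix of this shape is straightforward to assemble once each algorithm has produced its labels; the sketch below uses hypothetical algorithm names and votes, not output from the study's tool:

```python
def vote_matrix(clusterings):
    """Assemble a vote matrix: one row per sample, one column per
    algorithm, each cell holding that algorithm's cluster vote.
    `clusterings` maps an algorithm name to its list of labels."""
    names = list(clusterings)
    n = len(next(iter(clusterings.values())))
    rows = [[clusterings[name][i] for name in names] for i in range(n)]
    return names, rows

# votes of three hypothetical algorithms on five samples
clusterings = {
    "A1": [1, 1, 2, 2, 1],
    "A2": [1, 1, 2, 1, 2],
    "A3": [1, 1, 2, 2, 2],
}
names, rows = vote_matrix(clusterings)
for i, row in enumerate(rows, start=1):
    print(i, row)   # sample number followed by its row of votes
```

Rows where all votes agree (samples 1-3 here) are the homogeneous blocks the methodology surfaces; rows with mixed votes are the hard-to-classify samples.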
Heterogeneity Meter
The challenge in this method is to associate the different
classifications, since each algorithm divides the dataset
into different clusters. Although the number of clusters
in each case remains the same for each algorithm, a tool is necessary in order to associate the clusters of each algorithm; for example, cluster number 2 according to
algorithm A1 is the same as cluster number 3 according
to algorithm A2. To achieve this correlation, we will
calculate a measure called the heterogeneity meter for
each row, that is, the collection of votes for a particular
sample, and sum it up for all the samples.
Multiple methods can be used to calculate the heterogeneity meter. These methods are described as follows:
Squared VE (Vote Error)
This heterogeneity meter is calculated as the sum, over all samples, of the squared number of votes that did not vote for the chosen classification. It is calculated as follows:

H = Σᵢ₌₁ⁿ (N − Mᵢ)²

Equation 1: Squared VE Heterogeneity Meter

Where:
H – is the heterogeneity meter
N – is the number of algorithms voting for the sample
Mᵢ – is the maximum number of similar votes, according to a specific association, received for sample i
i – is the sample number
n – is the total number of samples in the dataset
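The Squared VE meter can be computed directly from the definition; this is an illustrative sketch, not the study's implementation:

```python
from collections import Counter

def squared_ve(vote_rows):
    """Squared VE heterogeneity meter: for each sample (row of votes),
    count the votes that disagree with the largest block of similar
    votes, square that count, and sum over all samples."""
    n_algorithms = len(vote_rows[0])
    total = 0
    for row in vote_rows:
        m_i = Counter(row).most_common(1)[0][1]   # largest agreeing block
        total += (n_algorithms - m_i) ** 2
    return total

rows = [
    [1, 1, 1],  # fully homogeneous: contributes 0
    [1, 1, 2],  # one dissenting vote: contributes (3 - 2)^2 = 1
    [1, 2, 3],  # no agreement: contributes (3 - 1)^2 = 4
]
print(squared_ve(rows))  # 5
```

Minimizing this value over cluster associations, as described below, is what aligns the columns of the vote matrix.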
Distance From Second Best (DFSB)
This heterogeneity meter is calculated as the difference
in the number of votes that the best vote, that is, the
vote common to most algorithms, received and the
number of votes the second-best vote received. The
idea is to discover to what extent the best vote is distinguished from the rest. This meter is a reverse meter: the higher it is, the less heterogeneous the sample. It is calculated as follows:
H = Σᵢ₌₁ⁿ (Bᵢ − SBᵢ)

Equation 2: DFSB Heterogeneity Meter

Where:
H – is the heterogeneity meter
Bᵢ – is the number of votes received by the best cluster, that is, the cluster voted for the most times for sample i
SBᵢ – is the number of votes received by the second-best cluster for sample i
i – is the sample number
n – is the total number of samples in the dataset

Heterogeneity Meter Implementation

In order to find the best association, the heterogeneity meter needs to be minimized, that is, identifying the association that makes the votes for each sample as homogeneous as possible.
The heterogeneity meter is then used to sort the voting matrix, giving the researcher a clear, two-dimensional perspective of the clusters and indicating how well each sample is associated with its designated cluster.

Visual Pattern Characteristics

In this section, we will demonstrate several typical patterns that can be recognized in the suggested DSS. In each pattern, we find the following columns:

S – Sample number
T – True clustering
A1, A2, A3, A4, A5, A6 – Six algorithms used for clustering

For each example, there are five rows representing five different samples.

Well-Classified Samples
Figure 1. Well-classified clusters
In Figure 1, we can see that sample 68 was classified
correctly by all algorithms. This is an indication that the
variables used to classify the dataset work well with the
sample; if this is consistent with the cluster, it shows
that these variables can be used to identify it.
Figure 2. A hard-to-classify example
Samples that are Hard to Classify

In Figure 2, we see that while samples 59-62 are classified correctly and identically by nearly all the chosen methods, sample 71 is classified differently. This is an indication that this sample is hard to classify and that the parameters used for classification do not clearly designate it to any particular cluster.

Figure 3. Algorithms that are effective for a certain cluster

Algorithms that are Effective for a Certain Cluster
In Figure 3, we see that algorithm A6 is more effective
for classifying the red cluster, as it is the only algorithm
that succeeded in classifying it correctly. This does not
mean that it is the best algorithm overall, but it does
indicate that if the researcher wants to find candidates for that particular cluster, then algorithm A6 is a good choice.
Wrongly Classified Samples
In Figure 4, we see that some samples, namely 174, 175, and 178, were classified incorrectly by all algorithms. This is evident because the cluster color of the classification by the algorithms, marked A1-A6, differs from the true classification, marked T. This indicates that the parameters by which the dataset was classified are probably not ideal for some samples; if the failure is consistent within a certain cluster, we can say that the set of variables used to classify the dataset is not effective for identifying that cluster.
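To check whether such failures concentrate in a single cluster, the wrongly classified samples can be grouped by their true cluster. A sketch, again assuming the same vote-matrix layout:

```python
from collections import Counter

def misclassified_by_all(true_labels, vote_matrix):
    """Count, per true cluster, the samples that no algorithm assigned
    correctly (the pattern in Figure 4). A count concentrated in one
    cluster suggests the variables do not identify that cluster."""
    bad = [
        t
        for t, votes in zip(true_labels, vote_matrix)
        if all(v != t for v in votes)
    ]
    return Counter(bad)

true_labels = ["a", "a", "b", "a"]
votes = [
    ["b", "c", "b"],  # all wrong
    ["a", "a", "a"],  # all right
    ["a", "a", "c"],  # all wrong
    ["b", "b", "c"],  # all wrong
]
print(misclassified_by_all(true_labels, votes))  # Counter({'a': 2, 'b': 1})
```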
Figure 4. Wrongly classified samples
IMPLEMENTATION—THE CASE OF
WINE RECOGNITION
The Dataset
To demonstrate the implementation of the DSS, we
chose the Wine Recognition Data (Forina, Leardi,
Armanino, & Lanteri, 1988; Gelbard, Goldman, &
Spiegler, 2007). This is a collection of wines classified using thirteen different variables. The variables
are as follows:
1. Alcohol
2. Malic acid
3. Ash
4. Alcalinity of ash
5. Magnesium
6. Total phenols
7. Flavanoids
8. Non-flavanoid phenols
9. Proanthocyanins
10. Color intensity
11. Hue
12. OD280/OD315 of diluted wines
13. Proline
The target is to cluster the wines based on the given
attributes into three different clusters, representing
the three different cultivars from which the wines are
derived.
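The same dataset is distributed with scikit-learn, which makes it easy to reproduce the setup described here (the original study used the PARVUS distribution of the data; loading it via scikit-learn is our shortcut, not the authors' procedure).

```python
from sklearn.datasets import load_wine

# 178 wines described by the 13 variables listed above,
# labeled with the three cultivars they derive from.
wine = load_wine()
print(wine.data.shape)    # (178, 13)
print(wine.target_names)  # three cultivars
```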
Figure 5. Wine cases: Vote matrix part 1
Figure 6. Wine cases: Vote matrix part 2
The Implementation
We used six hierarchical clustering methods:
1. Average linkage (between groups)
2. Average linkage (within group)
3. Complete linkage (furthest neighbor)
4. Centroid
5. Median
6. Ward
We performed the cluster association using the DFSB heterogeneity meter; the resulting vote matrix is depicted in Figures 5 and 6.
DISCUSSION
The advantage of the visual representation of clustering the wine dataset is well depicted in Figures 5 and 6, as we get a graphical representation of the dataset and its classification. Examples of the immediate results of this presentation are as follows:
Looking at the vote matrix, it is easy to see that two of the three clusters are well detected by the hierarchical clustering algorithms.
It can also be seen that some samples, such as samples 70, 71, 74, and 75, are not easy to classify, while other samples, such as sample 44, are falsely associated.
Furthermore, it can be seen that average linkage (within group) is probably not an algorithm that will work well with this dataset.
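A vote matrix of this kind might be produced with SciPy's hierarchical clustering routines. This is a sketch under two assumptions: random data stands in for the wine measurements, and since SciPy has no within-group average linkage, WPGMA ('weighted') stands in for that method.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
data = rng.normal(size=(30, 13))  # stand-in for the wine measurements

# Rough mapping of the six methods to scipy linkage names; 'weighted'
# (WPGMA) substitutes for within-group average linkage, which scipy lacks.
methods = ["average", "weighted", "complete", "centroid", "median", "ward"]

# Each column of the vote matrix is one algorithm's 3-cluster assignment.
vote_matrix = np.column_stack([
    fcluster(linkage(data, method=m), t=3, criterion="maxclust")
    for m in methods
])
print(vote_matrix.shape)  # (30, 6): one row per sample, one column per method
```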
Figures 7 and 8, in Appendix A, rearrange the cases, that is, the lines of Figures 5 and 6, so that agreed cases are placed close to each other, ordered by cluster, creating a "Tetris-like" view. As aforesaid, each column represents a specific algorithm, each line represents a specific case, and each color represents a "vote", that is, a decision suggestion.
Uni-color lines represent cases in which all algorithms vote for the same cluster (each cluster is represented by a different color). These agreed cases are "pushed down", while multi-color lines "float" above, as in a Tetris game.
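The rearrangement itself amounts to a sort of the vote-matrix rows. A minimal sketch of this ordering (our own illustration, not the tool's actual code):

```python
def tetris_order(vote_matrix):
    """Reorder vote-matrix rows for the 'Tetris-like' view: rows with
    mixed votes float to the top, while unanimous ('uni-color') rows
    sink below them, grouped by the cluster they agree on."""
    def key(row):
        if len(set(row)) == 1:   # unanimous row
            return (1, row[0])   # sinks, grouped by its cluster
        return (0, "")           # mixed row floats above
    return sorted(vote_matrix, key=key)

rows = [
    ["a", "a", "a"],
    ["a", "b", "a"],
    ["b", "b", "b"],
    ["a", "a", "a"],
]
for r in tetris_order(rows):
    print(r)
# ['a', 'b', 'a'] first, then the unanimous rows grouped by cluster
```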
CONCLUSION AND FURTHER
RESEARCH
The DSS presented in the current article uses different
algorithm results to present the researcher with a clear
picture of the data being researched.
The DSS is a tool that assists the researcher and
allows the researcher to demonstrate his/her expertise
in selecting the variables by which the data is classified
and the algorithms used to classify it.
In some cases, the researcher knows the expected number of clusters to divide the dataset into, while in other cases, the researcher needs assistance. The discussed DSS works well in both cases, as it can present different pictures of the dataset as a result of the different classifications.
The result is a tool that can assist researchers in
analyzing and presenting a dataset otherwise difficult
to comprehend. The researcher can easily see, rather
than calculate, both the trends and the classifications
in the researched dataset and can clearly present them to
his/her colleagues.
To activate the analysis, a tool was developed that performs the association of the different algorithms. This tool uses brute force and is therefore still not scalable to a large number of clusters and algorithms. More efficient ways to perform the association require further research.
There are also multiple methods for calculating the heterogeneity meter. Two of them were presented in the current study, but there is still room for other methods that associate the clusters based on different criteria, such as prioritizing associations that yield clear classifications in as many samples as possible versus associations that minimize errors over the entire vote matrix.
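The brute-force association mentioned above can be illustrated on a single algorithm: because each algorithm invents its own cluster labels, the labels must be permuted to best match a reference before votes can be compared. This sketch is our own illustration of that idea, not the tool's implementation.

```python
from itertools import permutations

def align_labels(reference, labels, k):
    """Brute-force label association: relabel `labels` (values 0..k-1)
    by the permutation that maximizes agreement with `reference`.
    Trying all k! permutations is exactly what keeps this approach
    from scaling to many clusters and algorithms."""
    best_perm, best_hits = None, -1
    for perm in permutations(range(k)):
        hits = sum(1 for r, l in zip(reference, labels) if perm[l] == r)
        if hits > best_hits:
            best_perm, best_hits = perm, hits
    return [best_perm[l] for l in labels]

reference = [0, 0, 1, 1, 2, 2]
labels    = [2, 2, 0, 0, 1, 1]  # same partition, different label names
print(align_labels(reference, labels, k=3))  # [0, 0, 1, 1, 2, 2]
```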
REFERENCES
Abdi, H. (2007). Discriminant correspondence analysis. In N. J. Salkind (Ed.), Encyclopedia of Measurement and Statistics. Sage.
Boudjeloud, L., & Poulet, F. (2005). Visual interactive evolutionary algorithm for high dimensional data clustering and outlier detection. (LNAI 3518, pp. 426-431).
Cadez, I., Heckerman, D., Meek, C., Smyth, P., & White, S. (2003). Model-based clustering and visualization of navigation patterns on a Web site. Data Mining and Knowledge Discovery, 7, 399-424.
Clifford, H. T., & Stevenson, W. (1975). An introduction to numerical classification. Academic Press.
De-Oliveira, M. C. F., & Levkowitz, H. (2003). From visual data exploration to visual data mining: A survey. IEEE Transactions on Visualization and Computer Graphics, 9(3), 378-394.
Erlich, Z., Gelbard, R., & Spiegler, I. (2002). Data mining by means of binary representation: A model for similarity and clustering. Information Systems Frontiers, 4, 187-197.
Forina, M., Leardi, R., Armanino, C., & Lanteri, S. (1988). PARVUS: An extendible package for data exploration, classification and correlation. Genova, Italy: Institute of Pharmaceutical and Food Analysis and Technologies.
Gelbard, R., Goldman, O., & Spiegler, I. (2007). Investigating diversity of clustering methods: An empirical comparison. Data & Knowledge Engineering, doi:10.1016/j.datak.2007.01.002.
Grabmier, J., & Rudolph, A. (2002). Techniques of cluster algorithms in data mining. Data Mining and Knowledge Discovery, 6, 303-360.
Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Prentice Hall.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31, 264-323.
Kohonen, T. (1995). Self-organizing maps. Series in Information Sciences, 30.
Raveh, A. (2000). Coplot: A graphic display method for geometrical representations of MCDM. European Journal of Operational Research, 125, 670-678.
Seo, J., & Shneiderman, B. (2002). Interactively exploring hierarchical clustering results. IEEE Computer, 35(7), 80-86.
Seo, J., & Shneiderman, B. (2005). A rank-by-feature framework for interactive exploration of multidimensional data. Information Visualization, 4(2), 99-113.
Sharan, R., & Shamir, R. (2002). Algorithmic approaches to clustering gene expression data. In T. Jiang et al. (Eds.), Current topics in computational molecular biology (pp. 269-300). Cambridge, MA: MIT Press.
Shultz, T., Mareschal, D., & Schmidt, W. (1994). Modeling cognitive development on balance scale phenomena. Machine Learning, 16, 59-88.
KEY TERMS
Decision Support System (DSS): DSS is a system
used to help resolve certain problems or dilemmas.
Dendrogram: Dendrogram is a method of presenting the classification of a hierarchical clustering
algorithm.
Distance From Second Best (DFSB): DFSB is a
method of calculating the distribution of votes for a
certain sample. This method is based on the difference
between the highest number of similar associations and
the second-highest number of similar associations.
Heterogeneity Meter: The heterogeneity meter is a measure of how heterogeneous a certain association of clusters resulting from the implementation of an algorithm is.
Hierarchical Clustering Algorithms: Hierarchical clustering algorithms are clustering methods that classify datasets starting with each sample as a separate cluster and gradually uniting samples into clusters based on their likelihood measure.
Likelihood Measurement: Likelihood measurement is the measure that allows for the classification
of a dataset using hierarchical clustering algorithms.
It measures the extent to which a sample and a cluster
are alike.
Vote Matrix: Vote matrix is a graphical tool used
to present a dataset classification using multiple algorithms.
Appendix A: The Rearranged Vote Matrix
Figure 7. The rearranged vote matrix part 1
Figure 8. The rearranged vote matrix part 2