Download Visual Techniques for the Interpretation of Data Mining Outcomes

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Visual Techniques for the Interpretation
of Data Mining Outcomes
Ioannis Kopanakis1, Nikos Pelekis2, Haralampos Karanikas3,
and Thomas Mavroudkis4
1
Technological Educational Institute of Crete, Heraklion Crete, Greece
[email protected]
2
Univ. of Piraeus, Piraeus, Greece
[email protected]
3
UMIST, Manchester, UK
[email protected]
4
National & Kapodistrian Univ. of Athens,
Knowledge Management Lab., Athens, Hellas
Abstract. The visual senses for humans have a unique status, offering a very
broadband channel for information flow. Visual approaches to analysis and
mining attempt to take advantage of our abilities to perceive pattern and structure in visual form and to make sense of, or interpret, what we see. Visual Data
Mining techniques have proven to be of high value in exploratory data analysis
and they also have a high potential for mining large databases. In this work, we
try to investigate and expand the area of visual data mining by proposing a new
3-Dimensional visual data mining technique for the representation and mining
of classification outcomes and association rules.
Keywords: Visual Data Mining, Association Rules, Classification, Visual Data
Mining Models.
Categories: I.2.4, I.2.6
Research Paper: Data Bases, Work Flow and Data mining
1 Introduction and Motivation
Classification is a primary method for machine learning and data mining [Frawley,
92]. It is either used as a stand-alone tool to get insight into the distribution of a data
set, e.g. to focus further analysis and data processing, or as a pre-processing step for
other algorithms operating on the detected clusters. The main enquiries that the
knowledge engineer usually has on his/her attempt to understand the classification
outcomes are: How well separated are the different classes? What classes are similar
or dissimilar to each other? What kind of surface separates various classes, (i.e. are
the classes linearly separable?) How coherent or well formed is a given class?
Those questions are difficult to be answered by applying the conventional statistical methods over the raw data produced by the classification algorithm. Unless the
user is supported by a visual representation that will actually be his/her navigational
P. Bozanis and E.N. Houstis (Eds.): PCI 2005, LNCS 3746, pp. 25 – 35, 2005.
© Springer-Verlag Berlin Heidelberg 2005
26
I. Kopanakis et al.
tool in the N-dimensional classified world, concluding inferences will be a tedious
task [Keim, 95]. Our main aim therefore should be to visually represent and understand the spatial relationships between various classes in order to answer questions
such as the above mentioned.
Further more, mining for association rules, as a central task of data mining, has
been studied extensively by many researchers. Much of the existing research, however, is focused on how to generate rules efficiently. Limited work has been done on
how to help the user understand and use the discovered rules. In real-life applications
though, the knowledge engineer wants first to have a good understanding over a set of
rules before trusting them and use the mining outcomes [David, 01]. Investigation and
comprehension of rules is a critical pre-requirement for their application. Those issues
become even more tightening if we consider the “large resulting rule set”, the “hard to
understand” and the “rule behaviour” problem [Zhao, 01].
In this paper, the proposed visual data mining model constructs 3D graphical representations of the classification outcomes produced by common data mining processes. Furthermore, association rules are also visualized in that representation, revealing each association rule’s “state” in their original N-dimensional world. Our attempt
is to equip the knowledge engineer with a tool that would be utilized on his/her attempt to gain insight over the mined knowledge, presenting as much information
extracted in a human perceivable way. The model proposed have distinctive advantageous characteristics, addressing the commonly tedious issues that the knowledge
engineer handles during the exploitation of the classification outcomes. Furthermore,
it brings us one step closer to make human part of the data mining process, in order to
exploit human’s unmatched abilities of perception.
In section 2 we introduce our application domain, along with the presentation of
our 3D Class-Preserving Projection Technique. In section 4 we investigate the application of this model for the visualization of association rules, which is followed by
two case studies in sections 5 and 6. Finally, the related work is presented in section 7
and we summarize our work in section8.
2 Visualizing Data Mining Classification Outcomes
On our attempt to graphically reveal the knowledge extracted by a classifier we have
mainly based our research effort on the underlying ideas of the geometric projection
techniques [Dhillon, 98]. Among the several geometric projection techniques that we
have studied, the most interesting methodology was the Class-Preserving Projection
Algorithm [Dhillon, 99], due to the robust behaviour that it has and its middle level of
computational complexity.
The main characteristic of classified data embedded in high-dimensional Euclidean
space is that proximity in Rn implies similarity. During the mapping procedures,
class-preserving projection techniques preserve the properties of the classified data in
the Rn space also to the projection plane in order to construct corresponding representations from which accurate inferences could be extracted. Our research study on
those techniques formed a new geometric projection technique that expands the existing methods in the area of visualizing classified data. That new technique named 3D
Visual Techniques for the Interpretation of Data Mining Outcomes
27
Class-Preserving Projection technique projects from the Rn to the R3 space along with
being capable of preserving the class distances (discriminating) among a larger number of classes.
3 3D Class-Preserving Projection Technique
In this section we introduce 3D class-preserving projections of multidimensional data.
The main advantage of those projections is that they maintain the high-dimensional
class structure by the utilization of linear projections, which can be displayed on a
computer screen. The challenge is in the choice of those planes and the associated
projections. Considering the problem of visualizing high-dimensional data that have
been categorized into various classes, our goal is to choose those projections that best
preserve inter-class and intra-class distances in order to extract inferences regarding
their relationships.
On our attempt to expand the existing projection techniques we worked on the
definition of a projection scheme that would result on the construction of a 3D world.
Compared to the existing 2D class-preserving projection techniques, the proposed 3D
technique results on the construction of an information rich representation due to the
freedom provided by the additional dimension in the projection world would. In order
to project onto the 3D space we should define our orthonormal projection vectors
based on four points. If we chose those four points to be the class-means of the classes
of our interest, we have managed to maximize the inter-class distances among those
four classes on our projection. Such an approach provides the flexibility of distinguishing among four classes instead of three, as long as being promoted into the 3D
projection space.
We consider the case where the data is divided into four classes. Let x1, x2, …, xn
be all the N-dimensional data points, and m1, m2, m3, m4, denote the corresponding
class-centroids. Let w1, w2 and w3 be an orthonormal basis of the candidate 3D world
of projection. The point xi gets projected to (w1Txi, w2Txi, w3Txi) and consequently, the
means mj get mapped to (w1Tmj, w2Tmj, w3Tmj) j=1,2,3,4.
One way to obtain good separation of the projected classes is to maximize the difference between the projected means. This may be achieved by choosing vectors w1,
w2, w3 ∈ Rn such that the objective function
3
{
C ( w1 , w2 , w3 ) = ∑ wiT (m2 − m1 ) + wiT (m3 − m1 ) + wiT (m4 − m1 ) +
i =1
2
2
2
2
wiT ( m3 − m2 ) + wiT ( m4 − m2 ) + wiT ( m4 − m3 )
2
2
is maximized. The above may be rewritten as
3
{ {
} }
C (w1 , w2 , w3 ) = ∑ wiT (m2 − m1 )(m2 − m1 )T + ... + (m4 − m3 )(m4 − m3 )T wi
i =1
= w1T S B w1 + w2T S B w2 + w3T S B w3 = W T S BW
Where
W = [ w1 , w2 , w3 ], wiT wi = 1, wiT w j = 0, i ≠ j , i, j = 1,2,3 and
S B = (m2 − m1 )(m2 − m1 )T + ... + (m4 − m3 )(m4 − m3 )T
28
I. Kopanakis et al.
The positive semi-definite matrix SB can be interpreted as the inter-class or betweenclass scatter matrix. Note that SB has rank ≤ 3, since (m3 − m2 ) ∈ span{(m2 − m1 ), (m3 − m1 )} ,
(m4 − m2 ) ∈ span{(m4 − m1 ), (m2 − m1 )}, (m4 − m3 )∈ span{(m4 − m1 ), (m3 − m1 )}.
It is clear that the search for the maximizing w1, w2 and w3 can be restricted to the
column (or row) space of SB. But as we noted above, this space is at most of dimension 3. Thus, in general, the optimal w1, w2 and w3 must form an orthonormal basis
spanning the space determined by the vectors (m2 – m1), (m3 – m1) and (m4 – m1).
This technique can be applied in any number of classes. In the constructed visual
representation though it will best discriminate the four selected classes.
4 Class-Preserving Projection Techniques and Association Rules
Class-preserving projection techniques could be also applied in the area of visual
mining of association rules. Even in the case of association rules, inventing new visual data mining models is actually conceiving new mapping techniques from the multidimensional space to a lower dimensional space. As each attribute participating in a
rule is actually adding an additional dimension to our data space, we try to map each
association rule existing in Rn to a lower dimensional space. Those notions conform
to the fundamental theory of the class-preserving projection techniques.
Theoretically, each rule could be perceived as an n-dimensional surface which encloses a sub-space in the high dimensional data space. The boundaries of that area are
defined by the conditions of rule’s sub-expressions, which pose the limits in each
dimension (i.e. the sub-expressions of the association rule IF ((L1<x1<U1) and
(L2<x2<U2) and … (Ln<xn<Un)) THEN (…) set the upper and lower limits for each
dimension of the n-dimensional space). The set of tuples in the data set, corresponding to points in the high-dimensional space, that have been included into the subspace are those which satisfy rule’s conditions. This is actually a different perspective
that we could perceive the definition of association rules.
Following the mapping procedures of the class-preserving projection techniques,
we are able to construct 2D or 3D representations of the classified high-dimensional
data space, which has also been partitioned by the examined rule’s sub-space. That
attempt will equip us with a model capable to represent the “state” of an association
rule in the high-dimensional world that it belongs. As in the case of visual mining the
classified data space, the enquiries posed in this case will also be regarding coherence,
discrimination, relationships etc. among the classes and the rule’s sub-space. It will be
like representing an association rule under the prism of the projection of the highdimensional world.
In sections 5 and 6 we are evaluating the behavior of this model, which suggests
the application of the class-preserving projection techniques for the visual mining of
association rules. We are presenting two case studies, in order to examine the potential of constructing 2D and 3D representations of the classified high-dimensional
world when partitioned to the sub-space defined by the association rule examined.
Visual Techniques for the Interpretation of Data Mining Outcomes
29
5 Wine Case Study
To begin with, we selected for our case study the wine data set [Blake, 98]. These
data are the results of a chemical analysis of 178 wines grown in the same region in
Italy but derived from three different cultivars. The analysis determined the quantities
of 13 constituents found in each wine of the three cultivars. We selected to visualize
the following rules:
Rule
IF ((3.82 < ColorIntensity <= 4.85)) THEN (Class=Class3)
IF ((3.82 < ColorIntensity <= 4.85)) THEN (Class=Class2)
IF ((3.82 < ColorIntensity <= 4.85)) THEN (Class=Class1)
Sup.%
15.73
15.73
15.73
Conf.%
21.43
28.57
50
This set of rules is actually providing the information regarding the categorization of
the wines with color intensity in the range of (3.82 , 4.85] among the three cultivars. It
would be interesting to visually examine this information and derive inferences if
possible.
Fig. 1. 2D Class-Preserving Projection (Wine Case Study)
In Fig. 1 the sub-set of wines that has color intensity in the range of (3.82 , 4.85] has
been yellow marked. As expected, the distribution of those points is among all three
classes. In Fig. 2 we have visualized the first rule which provides the information regarding the third cultivar (class 3). In Fig. 3 all three rules have been visualized. The green
30
I. Kopanakis et al.
Fig. 2. 2D Class -Preserving Projection (THEN part)
Fig. 3. 2D Class -Preserving Projection (Set of Rules)
marks correspond to the wines of the first cultivar (class 1, red marks) that have color
intensity in the range examined. The yellow marks correspond to the wines of the second
cultivar (class 2, blue marks) in that range. Analogously is represented the third cultivar.
Visual Techniques for the Interpretation of Data Mining Outcomes
31
Several inferences are concluded by the investigation of the alternative perspectives of the segmented high dimensional space provided by those representations. The
distribution of the points in each class is quite high. In our application domain that is
translated to the conclusion that each cultivar has made sure that it has a variety of
types of “rose” wines. That range of middle color intensity is expected to correspond
to the category of “rose” wines. Therefore, the distribution of this category to the
whole class indicates the different properties of the chemical analysis that those wines
have. In other words, there is a significant chemical variance among the wines of this
sub-category, indicating the desirable variety of rose wines in each cultivar.
Following a reverse way of thinking, evaluating the three cultivars on respect to the
variety of the wines of this type would be like comparing the distribution and the
number of wines in each cultivar with color intensity in the range of (3.82 , 4.85].
Examining the corresponding representation of Fig. 3 we could urge that the first and
second cultivars have a satisfactory variety and number of “rose” wines, in contradiction to the third cultivar which has a small production of wines of this type. The small
variance of the third cultivar’s “rose” wines in the representation indicates also the
similarity among their chemical analysis factors and as a consequence the small variety among the types of wines of this category.
6 Letter Image Recognition Case Study
Having examined the application of class-preserving projection technique for the
visual mining of association rules in 2D we are expecting that the advanced properties
of the presented model will be enhanced even more by the additional dimension provided in the 3D space. That flexibility of our projection world is expected to represent
more accurately larger volumes of information regarding the classified data and the
rules’ sub-spaces.
For our case study we have selected the letter image recognition data set [Blake,
98]. Character images of the 26 capital letters of the English alphabet based on 20
different fonts were converted into 16 primitive numerical attributes (statistical moments and edge counts) that formed our data set. We have chosen to visualize the
instances of A, B, C and D letters for the following set of association rules.
Rule
IF ((5.50 < x2ybr <= 6.50)) THEN (lettr=C)
IF ((5.50 < x2ybr <= 6.50)) THEN (lettr=B)
IF ((5.50 < x2ybr <= 6.50)) THEN (lettr=D)
Sup.%
25
25
25
Conf.%
32
32
28
In the context of the classified world constructed, according to our projection technique, the set of tuples that satisfy rules left-hand clause has been represented in Fig.
4 by the white spheres. The instances of letters A, B, C and D have been represented
by the red, green, blue and mauve spheres correspondingly. The x2ybar attribute corresponds to the x2y statistical factor, where x and y are the mean values of the position
of the “on” pixels in each character in the horizontal and vertical direction. As expected
from the set of rules examined, the tuples with x2ybar within the range of (5.5 , 6.5] are
among the classes B, C and D.
32
I. Kopanakis et al.
Fig. 4. 3D Class-Preserving Projection (Letter case study)
Fig. 5. 3D Class-Preserving Projection (Rule B)
Fig. 6. 3D Class-Preserving Projection (Rule C)
Visual Techniques for the Interpretation of Data Mining Outcomes
33
In Fig. 6 and Fig. 5 we are analogously presenting each one of the rules selected to
be mined, following the same coloring scheme. For their printed presentation we tried
to select the best viewing angle that would give us an adequate perspective of the 3D
world. The resulting image though can not be compared with the actual model developed. The ability to interact and navigate in the constructed 3D world finding visual
patters, making assumptions and trying to verify them could not be presented in a
single figure.
Our attempt to derive inferences regarding the properties of the statistical factor
examined is quite difficult, as it is not easy to assign qualitative properties to the factor x2ybar. A combined observation which is derived when we examine each rule’s
representation is that they all tend to occupy space in the area among the classes B, C
and D. That directed us to make the assumption that the space among the three classes
is the projection space of the examined range (5.5 < x2ybar <= 6.5) in the 3D world.
In other words, that hypothesis indicates that most spheres that enter in that region of
the 3D world tend to be within the range of (5.5 , 6.5]. Our mapping procedure therefore, preserved the properties of the classified data, along with the sub-space of the
rules examined and projected the x2ybar statistical factor with quite good properties.
7 Related Work
In the context of visualizing classified data, geometric projection techniques try to
find “interesting” projections of multidimensional data sets in such a way that the
structure, properties and patterns of the data set in the n-dimensional space will be
revealed [Spears, 99] [Dhillon, 98] [Dhillon, 99]. Scatter Plots generate N(N-1)/2
pair-wise parallel projections with each one providing a general impression of the
relationships among the data visualized, within the context of the pair of dimensions
selected (i.e. Scatter-Plot Matrix, HyperSlice [Van, 93]). Advantages of scatter plots
include ease of interpretation and robustness to the size of the data set. Major limitation though is that the high dimensionality results in decreasing the screen space provided for each projection.
The Prosection Views model indicates the application of the various projection
techniques to sections of the data, in the hope that various multidimensional structures
will reveal themselves in lower dimensions [Furnas, 94]. Grand Tour Technique &
Projection Pursuit model [Spears, 99] smoothly rotate the 2D plane revealing unusual
structures within the multidimensional data [Asimov, 85]. The quest for “interesting”
projections of the data is referred to as “projection pursuit” [Friedman, 87]. Parallel
Coordinates [Inselberg, 85], Radial Coordinate (RadViz) [Hoffman, 00] and GridViz
[Hoffman, 99] are also well known techniques of this category.
In the commercial field, several innovative techniques have been proposed. Cluster
Visualizer of SGI’s MineSet tool [SGI, M] for the visualization of clustering results
uses box plots arranged in rows and columns. For the visualization of association
rules, IBM Intelligent Miner - Rules Graph has been based on the graph-based techniques [IBM, IMD]. The rules graph uses nodes to represent item sets and lines with
arrows to represent rules. The 3D Scatter-Plots of IBM’s Data Explorer [IBM, DE]
have been proposed for the exploration of raw data. The SAS Enterprise Miner Scatter-Plots [SAS, EM] has utilized the scatter-plot matrix technique linked with simple
34
I. Kopanakis et al.
bar and plot charts. According to SGI’s MineSet Scatter and Splat Visualizers, [SGI,
M] data points are represented in one, two, or three dimensional scatter-plots. For the
visualization of classified data and association rules several techniques have been
proposed. None of them though combines them both.
8 Conclusions
Conclusively, the utilization of the class-preserving projection techniques for the
visual mining of association rules is expected to enhance our attempt on gaining insight into the properties of the sub-spaces defined by the examined association rules
in the context of the classified high dimensional data space. As in the case studies
presented, we expect that in general the deductive abilities of the human analytical
mind will be capable to combine the perspectives of the high dimensional space provided by those views and analogously derive combined inferences. Interesting inferences are possible to be derived and the interaction among the visualization technique
and the human is enhanced by the flexibility of the model.
That flexibility and adaptive characteristics of this visual data mining model makes
us confident that further study in this research area will derive fruitful outcomes. The
research focus should be mainly targeted to the visualization capabilities regarding
lager volumes of data and association rules as long as its behavior and capability to
reveal visual patterns in a variety of case studies and application domains.
References
[Asimov, 85]
[Blake, 98]
[David, 01]
[Dhillon, 98]
[Dhillon, 99]
[Frawley, 92]
[Friedman, 87]
[Furnas ,94]
D. Asimov: “The Grand Tour: A Tool for Viewing Multidimensional Data”,
SIAM Journal on Scientific Computing, 6, 1985, pp.128-143.
C. L. Blake & C. J. Merz: “UCI Repository of machine learning databases”
[http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University
of California, Department of Information and Computer Science, (1998).
David Law & Yuh Foong: “A Visualization-Driven Approach for Strategic
Knowledge Discovery”, In: Information Visualization in Data Mining and
Knowledge Discovery, Morgan Kaufmann 2001, pp. 182-190, (2001).
I. S. Dhillon, D. S. Modha & W. S. Spangler: “Visualizing Class Structure of
Multidimensional Data”, Proceedings of the 30th Symposium on the Interface: Computing Science and Statistics, Interface Foundation of North America, vol. 30, pages 488-493, Minneapolis, May, (1998).
I. S. Dhillon, D. S. Modha & W. S. Spangler: “Class Visualization of HighDimensional Data with Application”, IBM Almaden Research Center, San
Jose, (1999).
W. Frawley, G. Piatetsky-Shapiro & C. Matheus: “Knowledge Discovery in
Databases: An Overview”, AI Magazine, pp. 213-228, (1992).
J. H. Friedman: “Exploratory Projection Pursuit”, Journal of the American
Statistical Association, Vol. 82, 1987, pp.249-266.
G. Furnas and A. Buja: “Prosection Views: Dimensional Inference through
Sections and Projections”, In: Journal of Computational and Graphical Statistics, Vol. 3, No. 4, pp. 323-353, 1994.
Visual Techniques for the Interpretation of Data Mining Outcomes
[Hoffman, 00]
[Hoffman, 99]
[IBM, DE]
[IBM, IMD]
[Inselberg, 85]
[Keim, 95]
[SAS, EM]
[SGI, M]
[SGI, M]
[Spears, 99]
[Van, 93]
[Zhao, 01]
35
P. E. Hoffman and G. Grinstein: “Dimensional Anchors: A Graphic Primitive
for Multidimensional Multivariate Information”, Workshop of New Paradigms in Information Visualization and Manipulation, in conjunction with the
ACM conference on Information and knowledge Management CIKM ’99, to
be published in 2000.
P. E. Hoffman: “Table Visualizations: A Formal Model and its Applications”,
Doctoral Dissertation, Computer Science Department, University of Massachusetts Lowell, MA, 1999.
IBM Open Visualization Data Explorer Project: “What is IBM Open Visualization Data Explorer?”, “Documentation”, “Proceedings of 1996 Data Explorer Symposium”, http://www.research.ibm.com/dx/
IBM DB2 Intelligent Miner for Data: “Using the Intelligent Miner for Data”,
http://www-3.ibm.com/software/data/iminer/fordata/
A. Inselberg: “The plane with Parallel Coordinates”, Special Issue on Computational Geometry, The Visual Computer, Vol. 1, 1985, pp. 69-91.
D. A. Keim & H.-P. Kriegel: “Possibilities and Limits in Visualizing Large
Amounts of Multidimensional Data”, In: Perceptual Issues in Visualization,
Springer 1995, pp. 203-214, (1995).
SAS Enterprise Miner: “Data Mining and Enterprise Miner Stand-alone Tutorial”, http://www.sas.com/products/miner/
SGI MineSet ™ Enterprise Edition: “User’s Guide for the Windows”, “Tutorial for Windows”, “Reference Guide”, “Interface Guide”,
http://www.sgi.com/software/mineset.html
SGI MineSet ™ Enterprise Edition: “User’s Guide for the Windows”, “Tutorial for Windows”, “Reference Guide”, “Interface Guide”,
http://www.sgi.com/software/mineset.html
W. Spears: “An Overview of Multidimensional Visualization Techniques”,
Visualization Workshop of GECCO’99, Genetic and Evolutionary Computation Conference, Orlando, Florida, USA, July 1999.
J. J. Van Wijk and R. Van Liere: “HyperSlice”, IEEE Visualization ’93, G.
M. Nielson and R. D. Bergeron editors, Los Alamitos, CA, IEEE Computer
Society Press, 1993, pp.119-125.
K. Zhao & B. Liu: “Visual Analysis of the Behavior of Discovered Rules”,
ACM SIGKDD Int. Conf. on Knowledge Discovery & Data Mining (KDD
2001), Proc. Workshop on Visual Data Mining, San Francisco, USA. pp. 5964, (2001).