CSE5230/DMS/2001/8
Data Mining - CSE5230
Data Mining and Information Visualization
CSE5230 - Data Mining, 2001
Lecture 8.1
Lecture Outline
• Overview of information visualization
• The role of visualization in the process of data mining
• The patterns being sought: clusters and outliers
• Issues when visualizing higher-dimensional relationships
• Criteria for comparison
• A range of visualization techniques for exploratory data analysis
Lecture 8.2
Information Visualization
• A conjunction of a number of fields:
  - Data Mining
  - Cognitive Science
  - Graphic Design
  - Interactive Computer Graphics
• Information Visualization attempts to use visual approaches and dynamic controls to provide understanding and analysis of multidimensional data
• The data may have no inherent 2D or 3D semantics and may be abstract in nature. There is no underlying physical model. Much of the data in databases is of this type
Lecture 8.3
Role of Information Visualization
• Acts as an exploratory tool
• Useful for identifying subsets of the data
• Structures, trends and outliers may be identified
• Statistical tests tend to incorporate isolated instances into a broader model as they attempt to formulate global features
• There is no requirement for a hypothesis, but the techniques can also support the formulation of hypotheses if wanted
Lecture 8.4
Integrating Visualization with Data Mining
• There are four possible approaches:
  - Use the visualization technique to present the results of the data mining process
  - Use visualization techniques as complements to the data mining process. They complement and increase understanding in a passive way
  - Use visualization techniques to steer the data mining process. The visualization aids in deciding the appropriate data mining technique to use and appropriate subsets of the data to consider
  - Apply data mining techniques to the visualization rather than directly to the data. The idea is to capture the essential semantics visually, then apply the data mining tools
Lecture 8.5
The Process of Knowledge Discovery in Databases (a.k.a. Data Mining)
[Figure: The Knowledge Discovery in Databases (KDD) process [AdZ1996]. Labels in the diagram: Information Requirement; Operational data; External data; Data selection; Cleaning (domain consistency, de-duplication, disambiguation); Enrichment; Coding; Data mining (clustering, segmentation, prediction); Reporting; Action; Feedback]
Lecture 8.6
Visualization in the Context of the Data Mining Process
• Visualization tools can potentially be used at a number of steps in the DM process. But:
  - the same tools may not be appropriate at each step
  - how they will be used may be different
• In general, it is not important whether data visualization is the first step in the process or not
  - the feedback loop which moves the process forward may be commenced by either a visualization or a query
• Some visualizations (e.g. the query-dependent pixel-based techniques on slide 25) require an initial query to generate a visualization
  - this is an example of a complementary approach
    » questions generate visualizations, which may prompt further questions or generate hypotheses
Lecture 8.7
Motivations for Visualization
• The human visual system is extremely good at recognizing patterns
  - it is quicker and easier to understand visual representations than to absorb information from language or formal notations
• Exploratory visualization assists in:
  - identifying areas of interest
  - identifying questions which might usefully be asked
• i.e. a relevant or revealing visualization of either part or all of a data set may suggest useful questions and/or hypotheses to the analyst. These can then be confirmed by more rigorous approaches
  - e.g. some clustering techniques require an initial estimate of the number of clusters present in the data
    » visualization techniques can assist in this estimation
Lecture 8.8
Criteria for Comparison of Visualization Tools
• Number of dimensions that can be represented
• Number of data items that can be handled
• Ability to handle categorical and other non-numeric data types
• Ability to reveal patterns
• Ease of use
• Learning curve (to what degree is the technique intuitive)
Lecture 8.9
Examples - Scatterplot
• For each pair of features (i.e. fields of records) in a multidimensional database, every record is graphed as a point in two dimensions (2D)
  - This straightforward graphing procedure produces a simple scatterplot: a projection of the multidimensional data into 2D
• The scatterplots of all pair-wise combinations of features are arranged in a matrix
  - The figure on the following slide illustrates a scatterplot matrix of 3D data from a study of abrasion loss in tyres. The features are hardness, tensile-strength and abrasion-loss [Tie1989]
• Each “sub-graph” gives insight into the relationship between a pair of features
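The pair-wise projection behind a scatterplot matrix can be sketched in a few lines. This is a minimal illustration, not the XLISP-STAT procedure used for the figure: the function name `scatterplot_matrix_panels` is hypothetical, and the record values are made up for the example (they are not the abrasion loss data).

```python
from itertools import combinations

def scatterplot_matrix_panels(records, feature_names):
    """Project every record onto each pair of features.

    Returns a dict mapping (feature_a, feature_b) to the list of 2D points
    that would be drawn in that panel of the scatterplot matrix.
    """
    panels = {}
    for i, j in combinations(range(len(feature_names)), 2):
        panels[(feature_names[i], feature_names[j])] = [(r[i], r[j]) for r in records]
    return panels

# Made-up records for illustration only (NOT the data from [Tie1989]).
names = ("hardness", "tensile-strength", "abrasion-loss")
records = [(50, 200, 300), (60, 220, 250), (70, 180, 150)]

panels = scatterplot_matrix_panels(records, names)
for pair, points in panels.items():
    print(pair, points)
```

Each dictionary entry corresponds to one “sub-graph”; a plotting library would draw one panel per entry.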
Lecture 8.10
Scatterplot Matrix
[Figure: Scatterplot matrix of abrasion loss data [Tie1989]]
Lecture 8.11
Possible Problems with Scatterplots
• Everitt [Eve1978, p. 5] gives two reasons why scatterplots can prove unsatisfactory:
  - if the number of features is greater than ~10, the number of plots to be examined is very large
    » this is just as likely to lead to confusion as to knowledge of the structures in the data
  - structures existing in a multidimensional data set do not necessarily appear in the 2D projections of the features represented in scatterplots (see next slide)
• Despite these potential problems, variations on the scatterplot approach are the most commonly used of all the visualization techniques
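To make the first problem concrete: with n features there are n(n-1)/2 distinct pairs, so the number of panels grows quadratically. A small sketch (the function name `pairwise_plot_count` is just for illustration):

```python
from math import comb

def pairwise_plot_count(n_features):
    """Number of distinct feature pairs, i.e. panels in a scatterplot matrix."""
    return comb(n_features, 2)

for n in (3, 10, 20):
    print(n, "features ->", pairwise_plot_count(n), "plots")
# 3 features -> 3 plots
# 10 features -> 45 plots
# 20 features -> 190 plots
```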
Lecture 8.12
Scatterplots: recognizing high-dimensional structures - 1
• A structure which appears as a cluster in a 2D projection may in fact be a “pipe” in 3D
  - a pipe is a structure that looks like a rod or pipe when viewed in a 3D representation
• While the pipe is easily identifiable in a 3D display, only projections of it will appear in the 2D components of the scatterplot matrix
  - depending on the orientation of the pipe in 3D, it may not appear as an obvious cluster, if at all
• Equivalent structures can exist in higher dimensions, e.g. a cluster in 5D might be a “pipe” in 6D
  - the appearance of high-D structures in lower-D projections depends on the luck and skill of the analyst in choosing the projections, and on the alignment of the structures to the axes
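A small synthetic example may help here. The sketch below assumes a pipe aligned with the third axis (tight in features 0 and 1, spread uniformly along feature 2); the names `make_pipe` and `spread` and all parameter values are illustrative assumptions, not taken from the lecture.

```python
import random
from statistics import pstdev

def make_pipe(n_points, seed=0):
    """Synthetic 'pipe': tight in features 0 and 1, spread out along feature 2."""
    rng = random.Random(seed)
    return [
        (rng.gauss(0.5, 0.02), rng.gauss(0.5, 0.02), rng.uniform(0.0, 1.0))
        for _ in range(n_points)
    ]

def spread(points, axis):
    """Standard deviation of the points along one feature axis."""
    return pstdev([p[axis] for p in points])

pipe = make_pipe(500)
for axis in range(3):
    print("feature", axis, "spread:", round(spread(pipe, axis), 3))
```

Projected onto features (0, 1) the points have small spread in both directions and look like a compact cluster. In the projections involving feature 2 they form a long thin streak. If the pipe were rotated away from the axes, none of the three projections would need to show a compact cluster at all.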
Lecture 8.13
Scatterplots: recognizing high-dimensional structures - 2
[Figure panel labels: Random (Uniform); A cluster in 2D; May be a plane in 3D; May be a pipe in 3D (or a cluster in 3D)]
Lecture 8.14
Example Tool: Spotfire
http://www.spotfire.com/
Lecture 8.15
Example Tool: Spotfire
http://www.spotfire.com/
• The user interacts with data by choosing which features will form the horizontal and vertical axes
• Other features can be represented by color
  - this is an example of using the richness of visual representations to provide more information to the user. As well as 2D spatial position, other modes such as colour, size, shape and even sound can be used to convey information about high-dimensional data
• On the previous slide, the data set contains a 3D cluster in a 4D space (i.e. there are four features)
  - There are also some background “noise” instances
• The cluster can be seen, with its centre at around (20, 74)
  - all the points in the cluster are red, showing that it’s a 3D cluster
Lecture 8.16
Example Tool: DBMiner
http://www.dbminer.com/
Lecture 8.17
Example Tool: DBMiner
http://www.dbminer.com/
• DBMiner is an integrated data mining tool
• It employs a data visualization known as a “data cube” (see On-Line Analytical Processing - OLAP)
• After creating a data cube, the user can apply a variety of data mining techniques to analyze the data further, including:
  - association, classification, prediction and clustering
• The figure on the preceding slide shows a data cube for a data set which has a 3D cluster of data instances in a 3D space
Lecture 8.18
Examples: Parallel Coordinates - 1
• Uses the idea of mapping a point in a multidimensional feature space on to a number of parallel axes
• Each feature is mapped to one axis
  - as many axes as needed can be lined up side by side
  - there is no limit to the number of dimensions that can be represented
• A single polygonal line connects the individual coordinate mappings for each point
• The technique has been applied in air traffic control, robotics, computer vision and computational geometry
Lecture 8.19
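Examples: Parallel Coordinates - 2
[Figure: parallel axes $X_1, X_2, X_3, \ldots, X_{i-1}, \ldots, X_n$, with a polygonal line passing through the values $C_1, \ldots, C_{i-1}, C_i, \ldots, C_n$ marked on them]
• Parallel axes for $\mathbb{R}^n$. The polygonal line shown represents the point $C = (C_1, \ldots, C_{i-1}, C_i, C_{i+1}, \ldots, C_n)$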
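The mapping from a point to its polygonal line can be sketched as follows. The slides do not say how each axis is scaled, so this example assumes a linear scaling between a per-feature minimum and maximum; the function name `to_polyline` and the sample values are hypothetical.

```python
def to_polyline(point, mins, maxs):
    """Map an n-dimensional point to the vertices of its polygonal line.

    Axis i is drawn as a vertical line at horizontal position i. The value
    on that axis is scaled linearly to [0, 1] (an assumed scaling).
    """
    vertices = []
    for i, (value, lo, hi) in enumerate(zip(point, mins, maxs)):
        height = (value - lo) / (hi - lo) if hi != lo else 0.5
        vertices.append((i, height))
    return vertices

# One vertex per feature. Joining them in order gives the polygonal line.
line = to_polyline((5, 0, 10), mins=(0, 0, 0), maxs=(10, 10, 10))
print(line)  # [(0, 0.5), (1, 0.0), (2, 1.0)]
```

Adding a feature only adds one more axis and one more vertex, which is why the number of dimensions is not limited by the technique itself.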
Lecture 8.20
Examples: Parallel Coordinates - 3
• The Parallel Coordinates visualization technique is employed in the software WinViz
  http://www.computer.org/intelligent/ex1996/x5069abs.htm
• The main advantage of the technique is that it can represent unlimited numbers of dimensions
• When many points are represented using parallel coordinates, the overlap of the polygonal lines can make it difficult to identify structures in the data
• Certain structures, such as clusters, can often be identified, but others are hidden due to the overlap
Lecture 8.21
Two Clusters In WinViz
Lecture 8.22
Examples: Stick Figures
• The stick figure technique is intended to make use of the user’s low-level perceptual processes [PGL1995], such as perception of:
  - texture, color, motion, and depth
• The hope is that the user will “automatically” try to make physical sense of the pictures of the data created
• Visualizations which represent multidimensional feature spaces by using a number of subspaces of 3D or less (e.g. scatterplots) rely more on our cognitive abilities than our perceptual abilities
• Stick figures avoid this, and present all variables and data points in a single representation
Lecture 8.23
Iconographic display using stick figures: US Census Data
http://ivpr.cs.uml.edu/gallery/
Lecture 8.24
Examples: Pixel-based techniques
http://www.dbs.informatik.uni-muenchen.de/dbs/projekt/visdb/visdb.html
• Query-Dependent Pixel-based Techniques
  - based on a query, a “semantic distance” is calculated between each of the query feature values and the features of each instance in the DB
  - an overall distance between the data values for a specific instance and the data attribute values used in the predicate of the query is also calculated
  - if a feature value for a specific instance matches the query, it is assigned a colour indicating a match
    » e.g. a sequence of colours starting from yellow and ending in black could be used, where black is assigned if none of the instance features match
  - instances are arranged on the screen, with the data items with highest relevance in the centre of the display, and then proceeding outwards in a spiral
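The arrangement step can be sketched as below. This is a rough illustration under stated assumptions, not the VisDB implementation: features are taken to be numeric, the per-feature “semantic distance” is assumed to be an absolute difference, the overall distance is assumed to be their unweighted sum, and the spiral is a simple square spiral on a pixel grid. The names `spiral_positions` and `arrange` are hypothetical.

```python
def spiral_positions(count):
    """Yield `count` grid cells spiralling outwards from the centre (0, 0)."""
    x = y = 0
    dx, dy = 1, 0      # current direction of travel
    run_length = 1     # cells to emit before the next turn
    produced = 0
    while True:
        for _ in range(2):              # each run length is used for two sides
            for _ in range(run_length):
                if produced == count:
                    return
                yield (x, y)
                produced += 1
                x, y = x + dx, y + dy
            dx, dy = -dy, dx            # turn 90 degrees
        run_length += 1

def arrange(instances, query):
    """Place instances on the spiral, smallest overall distance first."""
    scored = []
    for inst in instances:
        per_feature = [abs(v - q) for v, q in zip(inst, query)]  # assumed metric
        scored.append((sum(per_feature), inst))                  # assumed combination
    scored.sort(key=lambda pair: pair[0])
    cells = spiral_positions(len(scored))
    return [(cell, inst, dist) for cell, (dist, inst) in zip(cells, scored)]

layout = arrange([(5, 5), (1, 2), (1, 1)], query=(1, 1))
for cell, inst, dist in layout:
    print(cell, inst, dist)
```

The exact match lands on the centre cell and less relevant instances move outwards. A colour ramp such as the yellow-to-black sequence described above would then be indexed by the distance stored with each cell.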
Lecture 8.25
Examples: Worlds within Worlds
http://www.cs.columbia.edu/graphics/projects/AutoVisual/AutoVisual.html
• Employs virtual reality devices to represent an nD virtual world in 3D or 4D-Hyperworlds
  - the basic approach to reducing the complexity of a multidimensional function is to hold one or more of its independent variables constant
    » equivalent to taking an infinitely thin slice of the world perpendicular to the constant variable’s axis
  - this can be repeated until there are 3 dimensions, and the resulting slice can be manipulated and displayed with conventional 3D graphics hardware
• After reducing the higher-dimensional space to 3 dimensions, the additional dimensions can be added back by adding additional 3D worlds within the first 3D world
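Holding variables constant is easy to express in code. The sketch below is only an illustration of the slicing idea, not the AutoVisual system: the helper `hold_constant` and the example function `f` are hypothetical.

```python
def hold_constant(f, fixed):
    """Return a function of the remaining variables of `f`.

    `fixed` maps argument positions to the values held constant, i.e. it
    selects one thin slice perpendicular to those variables' axes.
    """
    def sliced(*free):
        free_values = iter(free)
        n_args = len(fixed) + len(free)
        args = [fixed[i] if i in fixed else next(free_values) for i in range(n_args)]
        return f(*args)
    return sliced

# A made-up function of four independent variables.
def f(a, b, c, d):
    return a + 10 * b + 100 * c + 1000 * d

# Fix c and d: the slice depends on a and b only, so (a, b, value) is a
# surface that conventional 3D graphics can display.
g = hold_constant(f, {2: 3, 3: 4})
print(g(1, 2))  # 4321
```

Choosing a different pair of values for the fixed variables selects a different inner world, which is what positioning a 3D world within the outer world amounts to.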
Lecture 8.26
Dynamic Techniques
• Allow interaction with the visualization to explore the data more effectively. Can potentially be applied to all visualization techniques
  - Dynamic linking of the data attributes to the parameters of the visualization
  - Filtering
  - Linking and “brushing” between multiple visualizations
  - Zooming
  - Details on demand
Lecture 8.27
Other Techniques
• Keim and Kriegel’s query-independent approach
• Chernoff faces
  http://www.fas.harvard.edu/~stats/Chernoff/Hcindex.htm
• Cone trees
• Perspective walls
• Visualization Spreadsheet
• A number of techniques especially developed for web pages and their links
Lecture 8.28
References
[AdZ1996] P. Adriaans and D. Zantinge, Data Mining, Addison-Wesley, 1996.
[BeS1997] A. Berson and S. J. Smith, Data Warehousing, Data Mining and OLAP, McGraw-Hill, 1997.
[Eve1978] B. S. Everitt, Graphical Techniques for Multivariate Data, Heinemann Educational Books Ltd., London, 1978.
[Thu1999] B. Thuraisingham, Data Mining: Technologies, Techniques, Tools, and Trends, CRC Press LLC, Boca Raton, Florida, 1999.
[Tie1989] L. Tierney, XLISP-STAT: A Statistical Environment Based on the XLISP Language (Version 2.0), University of Minnesota School of Statistics, Technical Report Number 528, July 1989.
[PGL1995] R. M. Pickett, G. Grinstein, H. Levkowitz and S. Smith, Harnessing Preattentive Perceptual Processes in Visualization, pp. 9-21 in Perceptual Issues in Visualization (Eds. G. Grinstein & H. Levkowitz), Springer-Verlag, Berlin, 1995.
[WGL1996] A. Wierse, G. G. Grinstein and U. Lang (eds), Database Issues for Data Visualization, Proceedings of the IEEE Visualization '95 Workshop, Atlanta, Georgia, USA, October 28, 1995.
[LeG1993] J. P. Lee and G. G. Grinstein (eds), Database Issues for Data Visualization, Proceedings of the IEEE Visualization '93 Workshop, San Jose, California, USA, October 26, 1993.
Lecture 8.29