Information Visualization
Lecture Outline






Overview of information visualization
The role of visualization in the process of
data mining
The patterns being sought: clusters and
outliers
Issues when visualizing higher dimensional
relationships
Criteria for comparison
A range of visualization techniques for
exploratory data analysis
2
Information Visualization

A conjunction of a number of fields:




Data Mining
Cognitive Science
Graphic Design
Interactive Computer Graphics
3
Information Visualization




Information Visualization attempts to use
visual approaches and dynamic controls to
provide understanding and analysis of
multidimensional data
The data may have no inherent 2D or 3D
semantics and may be abstract in nature.
There is no underlying physical model.
Much of the data in databases is of this type
4
Role of Information
Visualization





Acts as an exploratory tool
Useful for identifying subsets of the data
Structures, trends and outliers may be
identified
Statistical tests tend to incorporate isolated
instances into a broader model as they
attempt to formulate global features
There is no requirement for a hypothesis,
but the techniques can also support the
formulation of hypotheses if wanted
5
Integrating Visualization With
Data Mining

There are four possible approaches:


Use the visualization technique to present
the results of the data mining process.
Use visualization techniques as
complements to the data mining process.

They complement and increase understanding
in a passive way.
6
Integrating Visualization With
Data Mining

Use visualization techniques to steer the
data mining process.


The visualization aids in deciding the
appropriate data mining technique to use and
appropriate subsets of the data to consider.
Apply data mining techniques to the
visualization rather than directly to the
data.

The idea is to capture the essential semantics
visually then apply the data mining tools.
7
Discovery in Databases
(a.k.a. Data Mining)
[Figure: The Knowledge Discovery in Databases (KDD) process (AdZ1996).
Stages: Data Selection; Cleaning & Enrichment (domain consistency,
de-duplication, disambiguation); Coding; Data mining (clustering,
segmentation, prediction); Reporting.
Other labels: Operational data, External data, Information Requirement,
Action, Feedback.]
8
Visualization in the Context of
the Data Mining Process

Visualization tools can potentially be
used at a number of steps in the DM
process. But:


the same tools may not be appropriate at
each step
how they will be used may be different
9
Visualization in the Context of
the Data Mining Process

In general, it is not important whether
data visualization is the first step in the
process or not

the feedback loop which moves the
process forward may be commenced by
either a visualization or a query
10
Visualization in the Context of
the Data Mining Process

some visualizations (e.g. see slide 25)
require an initial query before they can be
generated

this is an example of a complementary
approach

questions generate visualizations, which may
prompt further questions or generate
hypotheses
11
Motivations for Visualization

The human visual system is extremely
good at recognizing patterns


it is quicker and easier to understand visual
representations than to absorb information
from language or formal notations.
Exploratory visualization assists in:


identifying areas of interest
identifying questions which might usefully be
asked
12
Motivations for Visualization

i.e. a relevant or revealing visualization of
either part or all of a data set may suggest
useful questions and/or hypotheses to the
analyst. These can then be confirmed by more
rigorous approaches

e.g. some clustering techniques require an initial
estimate of the number of clusters present in the
data

visualization techniques can assist in this estimation
13
Criteria for Comparison of
Visualization Tools






Number of dimensions that can be
represented
Number of data items that can be handled
Ability to handle categorical and other non-numeric data types
Ability to reveal patterns
Ease of use
Learning Curve (to what degree is the
technique intuitive)
14
Examples - Scatterplot

For each pair of features (i.e. fields of
records) in a multidimensional database,
every record is graphed as a point in two
dimensions (2D)

This straightforward graphing procedure
produces a simple scatterplot - a projection
of the multidimensional data into 2D
15
Examples - Scatterplot

The scatterplots of all pair-wise combinations
of features are arranged in a matrix


The figure on the following slide illustrates a
scatterplot matrix of 3D data from a study of abrasion
loss in tyres. The features are hardness, tensile
strength and abrasion loss [Tie1989]
Each “sub-graph” gives insight into the
relationship between a pair of features
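
As a minimal sketch of how the panels of a scatterplot matrix can be assembled, the following Python projects every record onto each pair of features. The function name and the sample values are illustrative assumptions (the numbers are made up and are not the [Tie1989] data), and only the distinct unordered pairs are generated.

```python
from itertools import combinations

def scatterplot_matrix_panels(records, feature_names):
    """Return {(x_name, y_name): [(x, y), ...]} for every distinct pair of features."""
    panels = {}
    for i, j in combinations(range(len(feature_names)), 2):
        # Each record contributes exactly one 2D point to each panel.
        panels[(feature_names[i], feature_names[j])] = [(r[i], r[j]) for r in records]
    return panels

# Made-up illustrative values, NOT the abrasion loss data from [Tie1989].
features = ["hardness", "tensile_strength", "abrasion_loss"]
records = [(50, 150, 300), (60, 200, 220), (70, 180, 160), (80, 210, 110)]

panels = scatterplot_matrix_panels(records, features)
for pair, points in panels.items():
    print(pair, points)
```

Each entry of `panels` holds the points of one sub-graph; a plotting library of choice could then draw them in a grid.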
16
Scatterplot Matrix

Scatterplot matrix of abrasion loss data
[Tie1989]
17
Possible Problems With
Scatterplots

Everitt [Eve78, p. 5] gives two reasons why
scatter plots can prove unsatisfactory:

if the number of features is greater than ~10, the
number of plots to be examined is very large


this is just as likely to lead to confusion as to knowledge
of the structures in the data.
structures existing in a multidimensional data set do
not necessarily appear in the 2D projections of the
features represented in scatterplots (see next
slide)
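
The growth in the number of plots can be made concrete. Assuming one plot per distinct unordered pair of features, n features give n(n-1)/2 plots; the helper name below is an illustrative assumption.

```python
from math import comb

def distinct_scatterplots(n_features):
    """Number of distinct feature pairs, i.e. n * (n - 1) / 2."""
    return comb(n_features, 2)

for n in (3, 10, 20):
    print(n, "features ->", distinct_scatterplots(n), "plots")
```

For 10 features this already gives 45 plots, and for 20 features 190.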
18
Possible Problems With
Scatterplots

Despite these potential problems,
variations on the scatterplot approach
are the most commonly used of all the
visualization techniques
19
Scatterplots: Recognizing High-dimensional Structures - 1

A structure which appears as a cluster in
a 2D projection may in fact be a “pipe” in
3D

a pipe is a structure in 3D that looks like a rod
or pipe when viewed in a 3D representation
20
Scatterplots: Recognizing High-dimensional Structures - 1

While the pipe is easily identifiable in a 3D
display only projections of it will appear in
the 2D components of the scatterplot
matrix

depending on the orientation of the pipe in
3D, it may not appear as an obvious cluster, if
at all
21
Scatterplots: Recognizing High-dimensional Structures - 1

Equivalent structures can exist in higher
dimensions, e.g. a cluster in 5D might be
a “pipe” in 6D

the appearance of high-D structures in lower-D projections depends on the luck and skill of
the analyst in choosing the projections, and
on the alignment of the structures to the axes
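
The dependence on alignment can be illustrated with synthetic data. The sketch below assumes a "pipe" that is a thin rod with a square cross-section lying along the z axis; the function names, sizes and random seed are illustrative assumptions. Projected onto x-y the pipe looks like a compact cluster, while projected onto x-z it looks like a long band.

```python
import random

def make_pipe(n_points, radius=0.5, length=20.0, seed=0):
    """Synthetic 3D points in a thin rod aligned with the z axis (assumed shape)."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_points):
        x = rng.uniform(-radius, radius)   # narrow in x
        y = rng.uniform(-radius, radius)   # narrow in y
        z = rng.uniform(0.0, length)       # long in z
        points.append((x, y, z))
    return points

def extent(values):
    """Width of the range covered by the values."""
    return max(values) - min(values)

pipe = make_pipe(500)
xs, ys, zs = zip(*pipe)

# x-y projection: both extents are small, so the pipe appears as a cluster.
print("x-y projection extents:", extent(xs), extent(ys))
# x-z projection: one extent is large, so the pipe appears as a band.
print("x-z projection extents:", extent(xs), extent(zs))
```

If the rod were rotated so that it is not aligned with any axis, none of the axis-parallel projections would show it as a compact cluster.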
22
Scatterplots: Recognizing High-dimensional Structures - 2
[Figure panels: Random (Uniform); A cluster in 2D; May be a plane in 3D; May be a pipe in 3D (or a cluster in 3D)]
23
Example Tool: Spotfire
http://www.spotfire.com/
24
Example Tool: Spotfire
http://www.spotfire.com/


The user interacts with data by choosing which
features will form the horizontal and vertical
axes
Other features can be represented by color

this is an example of using the richness of visual
representations to provide more information to the
user. As well as 2D spatial position, other modes such
as colour, size, shape and even sound can be used to
convey information about high-dimensional data
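
A generic sketch of encoding a third feature as colour follows. It is not Spotfire's interface; the function name, the blue-to-red scale and the sample records are all illustrative assumptions.

```python
def value_to_rgb(value, lo, hi):
    """Linearly map a value in [lo, hi] to an RGB triple from blue (lo) to red (hi)."""
    t = 0.0 if hi == lo else (value - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))              # clamp values outside the range
    return (round(255 * t), 0, round(255 * (1 - t)))

# Hypothetical records with three features: (x, y, z).
records = [(18, 70, 0.0), (20, 74, 10.0), (22, 75, 5.0)]
z_values = [r[2] for r in records]
z_lo, z_hi = min(z_values), max(z_values)

# x and y give the 2D position of the mark; z is shown only through its colour.
marks = [(x, y, value_to_rgb(z, z_lo, z_hi)) for x, y, z in records]
for mark in marks:
    print(mark)
```

Points that are close in position and also share a colour are close in all three features, which is how a 3D cluster can be read from a 2D display.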
25
Example Tool: Spotfire
http://www.spotfire.com/


On the previous slide, the data set
contains a 3D cluster
The cluster can be seen, with its centre at
around (20, 74)

all the points in the cluster are red, showing
that it’s a 3D cluster
26
Example Tool: DBMiner
http://www.dbminer.com/
27
Example Tool: DBMiner
http://www.dbminer.com/


DBMiner is an integrated data mining
tool
It employs a data visualization known
as a “data cube” (see On-Line Analytic
Processing - OLAP)
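
The data cube idea can be sketched independently of any tool. The code below is not DBMiner's interface: it is a minimal illustration, with hypothetical dimension names and records, of counting records per cell and then aggregating ("rolling up") over some dimensions.

```python
from collections import Counter

def build_cube(records, dims):
    """Count the records falling in each cell of the cube spanned by `dims`."""
    cube = Counter()
    for rec in records:
        cube[tuple(rec[d] for d in dims)] += 1
    return cube

def roll_up(cube, keep):
    """Aggregate the cube over every dimension except those at positions `keep`."""
    rolled = Counter()
    for cell, count in cube.items():
        rolled[tuple(cell[k] for k in keep)] += count
    return rolled

# Hypothetical records; the dimension names are assumptions for illustration.
sales = [
    {"region": "north", "product": "A", "year": 1999},
    {"region": "north", "product": "A", "year": 1999},
    {"region": "south", "product": "B", "year": 1999},
    {"region": "south", "product": "A", "year": 2000},
]

cube = build_cube(sales, ["region", "product", "year"])
by_region = roll_up(cube, keep=[0])
print(dict(cube))
print(dict(by_region))
```

A cell with a high count relative to its neighbours corresponds to a dense region of the data.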
28
Example Tool: DBMiner
http://www.dbminer.com/

After creating a data cube, the user can
apply a variety of data mining
techniques to analyze the data further,
including:


association, classification, prediction and
clustering, etc.
The figure on the preceding slide shows
a data cube for a data set which has a 3D
cluster of data instances in a 3D space
29
Examples: Parallel Coordinates - 1


Uses the idea of mapping a point in a
multidimensional feature space on to a
number of parallel axes
Each feature is mapped to one axis


as many axes as needed can be lined up side
by side
there is no limit to the number of
dimensions that can be represented
30
Examples: Parallel Coordinates - 1


A single polygonal line connects the
individual coordinate mappings for each
point
The technique has been applied in air
traffic control, robotics, computer vision
and computational geometry
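
The mapping from a point to its polygonal line can be sketched as follows. Scaling each axis to the range [0, 1] is an assumption made here so that all axes share the same height; the function name and the sample data are illustrative.

```python
def polyline(point, mins, maxs, axis_spacing=1.0):
    """Vertices (x, y) of the polygonal line for one n-dimensional point."""
    vertices = []
    for i, (value, lo, hi) in enumerate(zip(point, mins, maxs)):
        # Assumed normalisation: every axis spans the same height [0, 1].
        height = 0.5 if hi == lo else (value - lo) / (hi - lo)
        # Axis i is a vertical line at x = i * axis_spacing.
        vertices.append((i * axis_spacing, height))
    return vertices

# Hypothetical 4-dimensional data points.
data = [(1.0, 10.0, 100.0, 5.0), (2.0, 30.0, 300.0, 1.0), (3.0, 20.0, 200.0, 3.0)]
mins = [min(col) for col in zip(*data)]
maxs = [max(col) for col in zip(*data)]

lines = [polyline(p, mins, maxs) for p in data]
for line in lines:
    print(line)
```

Adding a dimension only adds one more axis and one more vertex per line, which is why the number of dimensions is not limited by the technique itself.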
31
Examples: Parallel Coordinates - 2
[Figure: parallel axes X1, X2, X3, ..., Xi-1, ..., Xn, with the values C1, ..., Ci-1, Ci, ..., Cn marked on them]
Parallel axes for R^N. The polygonal line shown
represents the point C = (C1, ..., Ci-1, Ci, Ci+1,
..., Cn)
32
Examples: Parallel Coordinates - 3

The Parallel Coordinates visualization
technique is employed in the software
WinViz
http://www.computer.org/intelligent/ex1996/x5069abs.htm

The main advantage of the technique is
that it can represent unlimited numbers
of dimensions
33
Examples: Parallel Coordinates - 3


When many points are represented
using the parallel coordinates, the
overlap of the polygonal lines can make
it difficult to identify structures in the
data.
Certain structures, such as clusters,
can often be identified but others are
hidden due to the overlap.
34
Two Clusters In WinViz
35
Examples: Stick Figures

The stick figure technique is intended to
make use of the user’s low-level perceptual
processes [PGL1995], such as perception of:


texture, color, motion, and depth
The hope is that the user will “automatically”
try to make physical sense of the pictures of
the data created
36
Examples: Stick Figures


Visualizations which represent
multidimensional feature spaces by using a
number of subspaces of 3D or less (e.g.
scatterplots) rely more on our cognitive
abilities than our perceptual abilities
Stick figures avoid this, and present all
variables and data points in a single
representation.
37
Iconographic display using stick figures: US Census Data
http://ivpr.cs.uml.edu/gallery/
38
39
40
41
Examples: Pixel-based techniques
http://www.dbs.informatik.uni-muenchen.de/dbs/projekt/visdb/visdb.html

Query-Dependent Pixel-based Techniques



based on a query, a “semantic distance” is
calculated between each of the query feature
values and the features of each instance in the DB
Distance is mapped to colour for each attribute
Overall distance between the data values for a
specific instance and the data attribute values
used in the predicate of the query is also
calculated
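
A minimal sketch of the distance calculations described above. The concrete choices are assumptions made for illustration and are not taken from the VisDB system: absolute difference as the per-attribute distance, a range-scaled sum as the overall distance, and a grey level standing in for the colour scale. Attribute names and values are hypothetical.

```python
def attribute_distances(instance, query):
    """Distance per attribute between an instance and the query values (assumed: absolute difference)."""
    return {attr: abs(instance[attr] - target) for attr, target in query.items()}

def overall_distance(distances, ranges):
    """Assumed combination: sum of per-attribute distances, each scaled by the attribute's range."""
    return sum(d / ranges[attr] for attr, d in distances.items())

def distance_to_grey(d, d_max):
    """Map distance 0 to grey level 255 (bright) and d_max or more to 0 (dark)."""
    t = 1.0 if d_max == 0 else min(1.0, d / d_max)
    return round(255 * (1 - t))

# Hypothetical query and database instances.
query = {"age": 30, "income": 40000}
ranges = {"age": 50, "income": 100000}
instances = [
    {"id": 1, "age": 55, "income": 90000},
    {"id": 2, "age": 30, "income": 40000},
    {"id": 3, "age": 40, "income": 50000},
]

scored = []
for inst in instances:
    dists = attribute_distances(inst, query)
    scored.append((overall_distance(dists, ranges), inst["id"], dists))

# Most relevant (smallest overall distance) first.
scored.sort(key=lambda item: item[0])
ranked_ids = [inst_id for _, inst_id, _ in scored]
print(ranked_ids)
```

The per-attribute distances would drive the colour of each pixel, and the overall distance the ordering of the instances.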
42
Examples: Pixel-based techniques
http://www.dbs.informatik.uni-muenchen.de/dbs/projekt/visdb/visdb.html



Instances are arranged on the screen, with
the data items with highest relevance in
the centre of the display, and then
proceeding outwards in a spiral
the values for each of the attributes are
presented in separate subwindows
the arrangement inside the subwindows is
according to the overall distance
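
The spiral arrangement can be sketched on a pixel grid. The square spiral below is one possible layout and is an assumption, as are the function name and the sample distances; the most relevant instance is placed at the centre (0, 0) and the others follow outwards.

```python
def spiral_positions(n):
    """First n grid cells of a square spiral that starts at the centre (0, 0)."""
    positions = []
    x = y = 0
    dx, dy = 1, 0
    step_len = 1
    while len(positions) < n:
        for _ in range(2):                 # two legs for each step length
            for _ in range(step_len):
                if len(positions) == n:
                    return positions
                positions.append((x, y))
                x, y = x + dx, y + dy
            dx, dy = -dy, dx               # turn by 90 degrees
        step_len += 1
    return positions

# Hypothetical (instance id, overall distance) pairs.
overall = [("a", 0.9), ("b", 0.0), ("c", 0.4), ("d", 0.2), ("e", 0.7)]
by_relevance = sorted(overall, key=lambda item: item[1])

# The same positions would be reused in every attribute subwindow,
# so one instance sits at the same place in each of them.
layout = {inst_id: pos for (inst_id, _), pos in zip(by_relevance, spiral_positions(len(by_relevance)))}
print(layout)
```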
43
Query-Dependent Pixel-based
Techniques
Overall Distance

Result of a complex query [KeK1994]
44
Examples: Worlds within Worlds
http://www.cs.columbia.edu/graphics/projects/AutoVisual/AutoVisual.html

Employs virtual reality devices to represent an
nD virtual world in 3D or 4D-Hyperworlds

the basic approach to reducing the complexity of a
multidimensional function is to hold one or more
of its independent variables constant


equivalent to taking an infinitely thin slice of the world
perpendicular to the constant variable’s axis
this can be repeated until there are 3 dimensions and
the resulting slice can be manipulated and
displayed with conventional 3D graphics hardware
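
Holding variables constant can be sketched as partial application of a function. The helper `hold_constant` and the 5-variable function `f5` are hypothetical names introduced only for this illustration.

```python
def hold_constant(f, fixed):
    """Return f restricted to a slice: `fixed` maps argument positions to constant values."""
    def sliced(*free):
        free_iter = iter(free)
        n_args = len(fixed) + len(free)
        # Fill fixed positions with their constants, the rest with the free arguments in order.
        args = [fixed[i] if i in fixed else next(free_iter) for i in range(n_args)]
        return f(*args)
    return sliced

def f5(a, b, c, d, e):
    """A made-up function of five independent variables."""
    return a + 2 * b + 3 * c + 4 * d + 5 * e

# Fix the 4th and 5th variables: what remains depends on three variables only.
f3 = hold_constant(f5, {3: 1.0, 4: 2.0})
print(f3(1.0, 1.0, 1.0))

# The step can be repeated on the result to reduce the dimensionality further.
f2 = hold_constant(f3, {2: 0.0})
print(f2(1.0, 1.0))
```

Changing the constants in `fixed` corresponds to moving the slice along the axes of the variables that are held constant.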
45
Examples: Worlds within Worlds
http://www.cs.columbia.edu/graphics/projects/AutoVisual/AutoVisual.html

After reducing the higher-dimensional
space to 3 dimensions the additional
dimensions can be added back, by
adding additional 3D worlds within the
first 3D world
46
Worlds within Worlds
47
Dynamic Techniques

Allow interaction with the visualization to
explore the data more effectively. Can
potentially be applied to all visualization
techniques





Dynamic linking of the data attributes to the
parameters of the visualization.
Filtering
Linking and “brushing” between multiple
visualizations
Zooming
Details on demand
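
Linking and brushing can be sketched as sharing one selection between several views. The function names, feature names and records below are illustrative assumptions.

```python
def brush(records, feature, lo, hi):
    """Indices of the records whose `feature` value lies in the brushed range [lo, hi]."""
    return {i for i, r in enumerate(records) if lo <= r[feature] <= hi}

def view(records, x, y, selected):
    """Points of one 2D view; the third element says whether the point is highlighted."""
    return [(r[x], r[y], i in selected) for i, r in enumerate(records)]

# Hypothetical records with four features.
records = [
    {"a": 1, "b": 5, "c": 10, "d": 7},
    {"a": 2, "b": 6, "c": 40, "d": 3},
    {"a": 8, "b": 1, "c": 12, "d": 9},
    {"a": 9, "b": 2, "c": 45, "d": 2},
]

# The user brushes the range 0 <= a <= 3 in the (a, b) view ...
selected = brush(records, "a", 0, 3)
view_ab = view(records, "a", "b", selected)
# ... and the same records are highlighted in the linked (c, d) view.
view_cd = view(records, "c", "d", selected)
print(view_ab)
print(view_cd)
```

Filtering works the same way, except that the unselected records are dropped from the views instead of being drawn without highlighting.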
48
Other Techniques






Keim and Kriegel’s query independent
approach
Chernoff faces
http://www.fas.harvard.edu/~stats/Chernoff/Hcindex.htm
Cone trees
Perspective walls
Visualization Spreadsheet
A number of techniques especially developed
for web pages and their links
49
Web References


More lectures and demo software available
at:
http://www.cs.auc.dk/·DVDM/courses.html
50