Visual Data Mining: An Overview

What is Visual Data Mining?

Survey of techniques

Data Visualization

Visualizing Data Mining Results

Visual Data Mining
What Is Visual Data Mining?


Visual data mining “discovers implicit and useful
knowledge from large data sets using data
and/or knowledge visualization techniques”
Data visualization + Data mining techniques
Why Visual Data Mining?


Advantages of human visual system
 Highly parallel processor
 Sophisticated reasoning engine
 Large knowledge base
Can be used to comprehend data distributions, patterns,
clusters, and outliers
                         Actionable  Evaluation  Flexibility  User Interaction
Data Mining Algorithms       +           +            –              –
Visualization                –           –            +              +
Why Not Only Visual Data Mining?

Disadvantages of human visual system

Needs training

Not automated

Intrinsic bias


Limit of about 10^6 or 10^7 observations
(Wegman 1995)
Power of integration with analytical methods
Scope of Visual Data Mining


Visualization: Use of computer graphics to create visual
images which aid in the understanding of complex, often
massive representations of data
Visual Data Mining: The process of discovering implicit but
useful knowledge from large data sets using visualization
techniques
Figure: Computer Graphics, High Performance Computing, Multimedia Systems, Pattern Recognition, Human Computer Interfaces
Purpose of Visualization

Gain insight into an information space by mapping data
onto graphical primitives

Provide qualitative overview of large data sets

Search for patterns, trends, structure, irregularities,
relationships among data

Help find interesting regions and suitable parameters for
further quantitative analysis

Provide a visual proof of computer representations
derived
Visual Data Mining & Data Visualization


Integration of visualization and data mining
 data visualization
 data mining result visualization
 data mining process visualization
 interactive visual data mining
Data visualization
 Data in a database or data warehouse can be viewed
 at different levels of abstraction
 as different combinations of attributes or
dimensions
 Data can be presented in various visual forms
Abilities of Humans and Computers
Figure: spectrum running from the abilities of the computer to human abilities: Data Storage, Numerical Computation, Searching, Logic, Planning, Diagnosis, Prediction, Perception, Creativity, General Knowledge
Visual Mining vs. Scientific Vis. & Graphics


Scientific Visualization
 Often visualize physical model, low
dimensionality
Graphics
 More concerned with how to render (draw)
rather than what to render
Data Visualization

View data in database or data warehouse

User may control


Different levels of details

Subset of attributes
Drawn using boxplots, histograms, polylines, etc.
Historical Overview of Exploratory
Data Visualization Techniques (cf. [WB 95])



Pioneering works of Tufte [Tuf 83, Tuf 90] and Bertin [Ber
81] focus on
 Visualization of data with inherent 2D-/3D-semantics
 General rules for layout, color composition, attribute
mapping, etc.
Development of visualization techniques for different types
of data with an underlying physical model
 Geographic data, CAD data, flow data, image data,
voxel data, etc.
Development of visualization techniques for arbitrary
multidimensional data (without an underlying physical model)
 Applicable to databases and other information resources
Dimensions of Exploratory Data Visualization
Data Visualization Techniques: Geometric, Icon-based, Pixel-oriented, Hierarchical, Graph-based
Distortion Techniques: Simple, Complex
Interaction Techniques: Mapping, Projection, Filtering, Link & Brush, Zooming
Classification of Data Visualization Techniques



Geometric Techniques:

Scatterplots, Landscapes, Projection Pursuit, Prosection Views,
Hyperslice, Parallel Coordinates, ...
Icon-based Techniques:

Chernoff Faces, Stick Figures, Shape-Coding, Color Icons, TileBars,...
Pixel-oriented Techniques:




Recursive Pattern Technique, Circle Segments Technique, Spiral- & Axes-Techniques, ...
Hierarchical Techniques:

Dimensional Stacking, Worlds-within-Worlds, Treemap, Cone Trees,
InfoCube, ...
Graph-Based Techniques:

Basic Graphs (Straight-Line, Polyline, Curved-Line,...)

Specific Graphs (e.g., DAG, Symmetric, Cluster,...)

Systems (e.g., Tom Sawyer, Hy+, SeeNet, Narcissus,...)
Hybrid Techniques: arbitrary combinations from above
Distortion & Dynamic/Interaction Techniques

Distortion Techniques



Simple Distortion (e.g. Perspective Wall, Bifocal Lenses,
TableLens, Graphical Fisheye Views,...)
Complex Distortion (e.g. Hyperbolic Repr., Hyperbox, ...)
Dynamic/Interaction Techniques



Data-to-Visualization Mapping (e.g. Auto Visual, S Plus, XGobi,
IVEE,...)
Projections: (e.g. GrandTour, S Plus, XGobi,...)
Filtering (Selection, Querying) (e.g. MagicLens, Filter/Flow
Queries, InfoCrystal,...)

Linking & Brushing (e.g. Xmdv-Tool, XGobi, DataDesk,...)

Zooming (e.g. PAD++, IVEE, DataSpace,...)

Detail on Demand (e.g. IVEE, TableLens, MagicLens, VisDB,...)
Visual Survey

Data visualization techniques

Scatterplot Matrices, Landscapes, Parallel Coordinates

Icon-based, Dimensional Stacking, Treemaps
Direct Visualization
Ribbons with Twists Based on Vorticity
Geometric Techniques


Basic Idea
 Visualization of geometric transformations and
projections of the data
Methods
 Landscapes [Wis 95]
 Projection Pursuit Techniques [Hub 85] (techniques
for finding meaningful projections of
multidimensional data)
 Scatterplot-Matrices [And 72, Cle 93]
 Prosection Views [FB 94, STDS 95]
 Hyperslice [WL 93]
 Parallel Coordinates [Ins 85, ID 90]
Used by permission of M. Ward, Worcester Polytechnic Institute
Scatterplot-Matrices [Cleveland 93]
matrix of scatterplots (x-y-diagrams) of the k-dimensional data [total of
(k^2 - k)/2 distinct scatterplots]
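As a minimal sketch (not part of the original slides), the panels of a scatterplot matrix can be enumerated as unordered attribute pairs; for k attributes there are k(k-1)/2 distinct off-diagonal plots. The attribute names below are hypothetical.

```python
from itertools import combinations

def scatterplot_pairs(attributes):
    """Return every unordered pair of attributes, one per distinct scatterplot."""
    return list(combinations(attributes, 2))

# Hypothetical attribute names, for illustration only.
attrs = ["age", "income", "education", "hours"]
pairs = scatterplot_pairs(attrs)
k = len(attrs)

# The number of enumerated pairs matches the closed form k(k-1)/2.
print(len(pairs), k * (k - 1) // 2)
```

A full matrix display usually draws each pair twice (mirrored across the diagonal), so the number of panels on screen may be larger than the number of distinct pairs.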
Used by permission of B. Wright, Visible Decisions Inc.
Landscapes [Wis 95]


Figure: news articles visualized as a landscape
Visualization of the data as perspective landscape
The data needs to be transformed into a (possibly artificial) 2D spatial
representation which preserves the characteristics of the data
Parallel Coordinates [Ins 85, ID 90]



n equidistant axes which are parallel to one of the screen axes and
correspond to the attributes
the axes are scaled to the [minimum, maximum] range of the
corresponding attribute
every data item corresponds to a polygonal line which intersects
each of the axes at the point which corresponds to the value for the
attribute
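The construction above can be sketched in a few lines. This is an illustrative sketch rather than the method of any particular tool: it assumes a unit-square canvas with vertical axes, places axis j at x = j/(k-1), and min-max scales each attribute onto its axis. The function name and sample data are hypothetical.

```python
def parallel_coordinates(rows):
    """Map each data item to a polyline of (x, y) vertices in the unit square.

    Axis j sits at x = j / (k - 1); y is the value min-max scaled to [0, 1].
    """
    k = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(k)]
    highs = [max(r[j] for r in rows) for j in range(k)]
    polylines = []
    for r in rows:
        line = []
        for j in range(k):
            span = highs[j] - lows[j]
            # A constant attribute has no range; place it mid-axis.
            y = (r[j] - lows[j]) / span if span else 0.5
            x = j / (k - 1) if k > 1 else 0.0
            line.append((x, y))
        polylines.append(line)
    return polylines

# Hypothetical 3-attribute data set; the third attribute is constant.
data = [(1.0, 200.0, 3.0), (2.0, 100.0, 3.0), (3.0, 150.0, 3.0)]
lines = parallel_coordinates(data)
```

Each returned polyline can be handed to any line-drawing routine; the geometry is independent of the plotting library.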
Figure: Parallel Coordinates, with parallel axes labeled Attr. 1, Attr. 2, Attr. 3, ..., Attr. k
Icon-Based Techniques

Basic Idea


Visualization of the data values as features of icons
Overview

Chernoff-Faces [Che 73, Tuf 83]

Stick Figures [Pic 70, PG 88]

Shape Coding [Bed 90]

Color Icons [Lev 91, KK 94]

TileBars [Hea 95]
(use of small icons representing the relevance feature
vectors in document retrieval)
Stick Figures
Figure: census data showing age, income, sex, education, etc.
Hierarchical Techniques


Basic Idea: Visualization of the data using a
hierarchical partitioning into subspaces.
Overview
 Dimensional Stacking [LWW 90]
 Worlds-within-Worlds [FB 90a/b]
 Treemap [Shn 92, Joh 93]
 Cone Trees [RMC 91]

InfoCube [RG 93]
Dimensional Stacking [LWW 90]
Figure: axes labeled attribute 1, attribute 2, attribute 3, attribute 4



partitioning of the n-dimensional attribute space into 2-dimensional subspaces which are ‘stacked’ into each other
partitioning of the attribute value ranges into classes
the important attributes should be used on the outer levels
adequate especially for data with ordinal attributes of low
cardinality
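A minimal sketch of the stacking idea for four attributes, assuming each attribute has already been (or can be) partitioned into a small number of classes: the two outer attributes select a block of the grid, and the two inner attributes select a cell inside that block. The helper names and bin counts are hypothetical.

```python
def bin_index(value, low, high, n_bins):
    """Partition the value range [low, high] into n_bins equal-width classes."""
    if high == low:
        return 0
    idx = int((value - low) / (high - low) * n_bins)
    return min(max(idx, 0), n_bins - 1)  # clamp so value == high lands in the last bin

def stacked_cell(outer_x, outer_y, inner_x, inner_y, n_inner_x, n_inner_y):
    """Grid cell (column, row) for four binned attributes.

    Each outer bin is a block n_inner_x cells wide and n_inner_y cells tall;
    the inner bins pick the cell within that block.
    """
    col = outer_x * n_inner_x + inner_x
    row = outer_y * n_inner_y + inner_y
    return col, row

# Outer bins (2, 1), inner bins (0, 3), with 4 x 5 inner classes per block.
cell = stacked_cell(2, 1, 0, 3, n_inner_x=4, n_inner_y=5)
```

The same recursion extends to more attribute pairs by treating the current grid as the inner block of a further outer pair, which is also why the display grows quickly with the number of dimensions.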
Dimensional Stacking
Visualization of oil mining data with longitude and
latitude mapped to the outer x-, y-axes and ore grade
and depth mapped to the inner x-, y-axes
Used by permission of M. Ward, Worcester Polytechnic Institute
Dimensional Stacking

Disadvantages:
 Difficult to display more than nine dimensions
 Important to map dimensions appropriately
 May be difficult to understand visualizations at
first
Treemap [JS 91, Shn 92, Joh 93]
Screen-filling method which uses a hierarchical
partitioning of the screen into regions depending on the
attribute values
The x- and y-dimension of the screen are partitioned
alternately according to the attribute values (classes)
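The alternating partitioning can be sketched as a "slice-and-dice" layout. This is a simplified illustration under stated assumptions, not the code of any treemap system: a node is a hypothetical `(name, size, children)` tuple, and each level divides its rectangle among its children in proportion to their sizes, switching between the x- and y-direction at each depth.

```python
def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Lay out a tree as nested rectangles; returns a list of (name, x, y, w, h)."""
    if out is None:
        out = []
    name, size, children = node
    out.append((name, x, y, w, h))
    total = sum(child[1] for child in children)
    offset = 0.0  # fraction of the parent rectangle already used
    for child in children:
        share = child[1] / total if total else 0.0
        if depth % 2 == 0:
            # Even depth: partition along the x-dimension.
            slice_and_dice(child, x + offset * w, y, share * w, h, depth + 1, out)
        else:
            # Odd depth: partition along the y-dimension.
            slice_and_dice(child, x, y + offset * h, w, share * h, depth + 1, out)
        offset += share
    return out

# Hypothetical hierarchy: sizes could be file sizes in a file system.
tree = ("root", 8, [("a", 6, [("a1", 3, []), ("a2", 3, [])]), ("b", 2, [])])
rects = {name: (x, y, w, h) for name, x, y, w, h in slice_and_dice(tree, 0.0, 0.0, 8.0, 4.0)}
```

Because every rectangle is carved out of its parent, the leaves tile the whole screen area, which is what makes the method screen-filling.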
MSR Netscan image:
Treemap of a File System (Shneiderman)
Treemaps



The attributes used for the partitioning and their
ordering are user-defined (the most important
attributes should be used first)
The color of the regions may correspond to an
additional attribute
Suitable to get an overview over large amounts
of hierarchical data (e.g., file system) and for
data with multiple ordinal attributes (e.g., census
data)
Data Mining Result Visualization


Presentation of the results or knowledge obtained from
data mining in visual forms
Examples

Scatter plots and boxplots (obtained from descriptive
data mining)

Decision trees

Association rules

Clusters

Outliers

Generalized rules

Text mining
Boxplots from Statsoft: Multiple
Variable Combinations
Visualization of Data Mining Results in
SAS Enterprise Miner: Scatter Plots
Visualization of Association Rules
in SGI/MineSet 3.0
Visualization of Decision Tree in
SGI/MineSet 3.0
Visualization of Decision Trees
Visualization of Cluster Grouping
IBM Intelligent Miner
Association Rules (MineSet)



LHS and RHS items
are mapped to x-,
y-axis
Confidence,
support correspond
to height of the bar
or disc, respectively
Interestingness is
mapped to Color
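For readers who want to see where the plotted quantities come from, here is a small sketch of the standard support and confidence measures for a rule LHS -> RHS. It does not reproduce MineSet; the basket data and function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """support(LHS and RHS together) divided by support(LHS)."""
    s_lhs = support(transactions, lhs)
    if s_lhs == 0:
        return 0.0
    return support(transactions, set(lhs) | set(rhs)) / s_lhs

# Hypothetical market-basket data: each transaction is a set of items.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
```

In a display of the kind described above, these two numbers would drive the bar and disc heights for the cell at (LHS item, RHS item).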
MineSet: Association Rules
Association Ball Graph (DBMiner)



Items are
visualized as balls
Arrows indicate
rule implication
Size represents
support
Classification (SAS EM [SAS 01])
Tree Viewer


Color corresponds to relative frequency of a class in a
node
Branch line thickness is proportional to the square root of
the number of objects
Cluster Analysis
Cluster
(H-BLOB: Hierarchical BLOB) [SBG 00]
Form ellipsoids
Form blobs
(implicit surfaces)
H-BLOB
Text Mining (ThemeRiver [WCF+ 00])


Visualization of thematic changes in documents
Vertical distance indicates collective strength of the themes
Data Mining Process Visualization

Presentation of the various processes of data mining in
visual forms so that users can see the flow of data
cleaning, integration, preprocessing, mining

Data extraction process

Where the data is extracted

How the data is cleaned, integrated, preprocessed,
and mined

Method selected for data mining

Where the results are stored

How they may be viewed
Visualization of Data Mining Processes
by Clementine
See your solution
discovery
process clearly
Understand
variations with
visualized data
Interactive Visual Data Mining


Using visualization tools in the data mining process to
help users make smart data mining decisions
Example


Display the data distribution in a set of attributes using
colored sectors or columns (depending on whether the
whole space is represented by either a circle or a set of
columns)
Use the display to decide which sector should first be selected
for classification and where a good split point for this
sector may be
Visual data mining



Projection Pursuits
(Class) Tours [Dhillon et al. ’98]
Visual Classification [Ankerst et al. KDD ’99]
Projection Pursuits

Exploratory projection pursuit:
 Goal: reduce dimensionality
 Define “interestingness” index to each possible
projection of a data set
 Maximize this index, project linearly
 Not always possible/useful
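A rough sketch of the idea, under simplifying assumptions that are not from the slides: projections are 1-D, the "interestingness" index is the absolute excess kurtosis (which is near zero for Gaussian-looking projections), and the maximization is a plain random search over unit directions. Published projection pursuit methods use other indices and proper optimizers.

```python
import math
import random

def project(rows, direction):
    """Linear projection of each row onto a direction vector."""
    return [sum(v * d for v, d in zip(row, direction)) for row in rows]

def kurtosis_index(values):
    """Interestingness index: |excess kurtosis| of a 1-D sample."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return 0.0
    fourth = sum((v - mean) ** 4 for v in values) / n
    return abs(fourth / var ** 2 - 3.0)

def random_unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def projection_pursuit(rows, n_candidates=500, seed=0):
    """Random search for the direction that maximizes the index."""
    rng = random.Random(seed)
    dim = len(rows[0])
    best_direction, best_score = None, -1.0
    for _ in range(n_candidates):
        direction = random_unit_vector(dim, rng)
        score = kurtosis_index(project(rows, direction))
        if score > best_score:
            best_direction, best_score = direction, score
    return best_direction, best_score

# Synthetic data: one two-cluster attribute hidden among Gaussian noise.
data_rng = random.Random(1)
rows = [[data_rng.choice((-2.0, 2.0)) + data_rng.gauss(0.0, 0.3),
         data_rng.gauss(0.0, 1.0),
         data_rng.gauss(0.0, 1.0)] for _ in range(200)]
direction, score = projection_pursuit(rows)
```

The choice of index determines what counts as interesting, which is one reason the approach is "not always possible/useful": a poor index can steer the search toward projections with no meaningful structure.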
Class Tours



“Visualizing Class Structure of Multidimensional
Data” by Dhillon et al. 1998
Problem: Visualize multidimensional data
categorized into classes
Solution: Project data into 2D while preserving
distances between class means
Class-Preserving Projection:
Preserves distances between
projected means
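One way to obtain such a projection for three classes is to project orthogonally onto the plane that passes through the three class means; distances between the means are then preserved exactly. The sketch below illustrates that geometric fact and is not claimed to be the authors' implementation. It assumes the three means are not collinear, and the data and helper names are hypothetical.

```python
import math

def mean(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scale(a, s):
    return [x * s for x in a]

def class_preserving_basis(m1, m2, m3):
    """Orthonormal basis (u, v) of the plane through three class means (Gram-Schmidt)."""
    a, b = sub(m2, m1), sub(m3, m1)
    u = scale(a, 1.0 / math.sqrt(dot(a, a)))
    b_perp = sub(b, scale(u, dot(b, u)))  # remove the component along u
    v = scale(b_perp, 1.0 / math.sqrt(dot(b_perp, b_perp)))
    return u, v

def project(point, u, v):
    """2-D coordinates of a point in the (u, v) basis."""
    return (dot(point, u), dot(point, v))

# Hypothetical 4-dimensional data in three classes.
classes = {
    "a": [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0]],
    "b": [[4.0, 1.0, 0.0, 2.0], [6.0, 1.0, 0.0, 2.0]],
    "c": [[0.0, 5.0, 3.0, 1.0], [0.0, 7.0, 3.0, 1.0]],
}
means = {name: mean(rows) for name, rows in classes.items()}
u, v = class_preserving_basis(means["a"], means["b"], means["c"])
projected_means = {name: project(m, u, v) for name, m in means.items()}
```

Individual data points are projected with the same `project` call; only the distances between the class means are guaranteed to be preserved, not the distances between arbitrary points.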
Tours



Tours are animated and interpolated sequences
of 2D projections [Asimov 1985]
Class tours: sequences of class-preserving 2-dimensional projections
Captures “inter-class structure of complex, multidimensional data”
Interactive Visual Mining by
Perception-Based Classification (PBC)
Visual Classification


“Visual Classification: An
Interactive Approach to
Decision Tree
Construction” by
Ankerst et al. KDD 99
Exploit expert’s domain
knowledge and human
visual processing
Visual Classification
Visual Classification Results



Comparable classification accuracy
Can produce more understandable decision trees
Expert domain knowledge can be exploited
Audio Data Mining

Uses audio signals to indicate the patterns of data or the
features of data mining results




An interesting alternative to visual mining
The inverse of mining audio (such as music)
databases, where the task is to find patterns in audio data
Visual data mining may disclose interesting patterns
using graphical displays, but requires users to
concentrate on watching patterns
Instead, transform patterns into sound and music and
listen to pitches, rhythms, tune, and melody in order to
identify anything interesting or unusual
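A very small sketch of one possible mapping from data to sound, assuming values are scaled linearly onto MIDI note numbers and converted to frequencies with the usual equal-temperament convention (note 69 = A4 = 440 Hz). The note range and function names are illustrative choices, not taken from the slides.

```python
def values_to_notes(values, low_note=60, high_note=84):
    """Scale data values linearly onto a range of MIDI note numbers."""
    lo, hi = min(values), max(values)
    span = hi - lo
    notes = []
    for v in values:
        t = (v - lo) / span if span else 0.5  # constant data: use the middle of the range
        notes.append(round(low_note + t * (high_note - low_note)))
    return notes

def note_to_frequency(note):
    """Equal-temperament frequency in Hz for a MIDI note number."""
    return 440.0 * 2.0 ** ((note - 69) / 12)

# Hypothetical series: the unusually large value becomes the highest pitch.
series = [0.0, 5.0, 10.0]
notes = values_to_notes(series, low_note=60, high_note=72)
frequencies = [note_to_frequency(n) for n in notes]
```

Played in sequence, an outlier in the data would stand out as a note far above or below its neighbours.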
Summary



Many visualization methods available
How to evaluate and compare methods?
Need for:
 Integrated visualization/exploration systems
 Studies of interaction techniques for mining
 Practical case studies
Acknowledgments



Many slides and images from Mihael Ankerst, Boeing,
Daniel A. Keim, AT&T, Tutorial at PKDD'2001
Some pictures from Information Visualization in Data
Mining and Knowledge Discovery, edited by Usama
Fayyad, Georges Grinstein and Andreas Wierse
A good set of slides was prepared by Andrew Wu (Spring
2004)