The Role of Visualization in
Data Mining
Björn M. Gustafsson, Jonas K. Gustafsson and Ragnar E. Hammarqvist
Data Mining, TNM033, 2007, University of Linköping
Abstract
Visual data exploration allows faster exploration and generally
provides better results than automatic data mining algorithms alone.
The classification of VDM techniques is done in three dimensions:
the data type to be visualized, the visualization technique, and the
interaction and distortion technique. There exists a large number of
different visualization techniques, each suited to a particular type
of data. The two major driving forces behind visualizing data mining
models are understanding and trust; in general, good understanding
leads to trust. Exploratory data analysis is used to find systematic
relations between variables when there is little or no knowledge of
what the result may be. That is why exploratory analysis only works
as a first stage in building a predictive model.
Introduction
Data from a lot of different areas (monitoring
systems, credit cards and so on) are collected
today because people believe that the
information is useful. The problem is finding the
valuable information hidden in the data. This is
a difficult task and is where visual data
exploration in data mining comes in.
It is important to include the human in the data
exploration process in order for the data mining
to be useful and effective. The idea of visual
exploration in data mining is to represent the
raw data with visualization. The human can then
gain insight, draw conclusions and interact with
the data. The main advantages of visual data
exploration compared with automatic data
mining techniques, according to [1], are:

• Visual data exploration can easily deal with non-homogeneous and noisy data.
• Visual data exploration is intuitive and requires no understanding of complex mathematical or statistical algorithms or parameters.
As a result, visual data exploration allows faster exploration and
generally provides better results (especially in cases where
automatic algorithms fail).
Visual Data Exploration is usually done in three
steps called "the Visual Exploration Paradigm".
The steps are: Overview first, zoom and filter,
and then details-on-demand. First, the user needs an overview of the
data. Second, the user may want to focus on interesting patterns.
Finally, the user wants to examine and analyze the patterns, and
therefore needs to drill down to look at details of the data. All this can be
done visually using different techniques.
Classification of Visual Data
Mining (VDM) techniques
The classification of VDM techniques is done in
three dimensions: data type to be visualized,
visualization technique, and interaction and
distortion. The variables of each dimension are
shown in figure 1. All dimensions are orthogonal
to each other, which means that any combination
is possible. Note also that a specific system may
support several data types and may use a
combination of multiple visualization and
interaction techniques.
Figure 1: Classification of techniques
Data type to be visualized
The data in Information Visualization usually
consists of many records. The number of
attributes differs from data set to data set. The
number of variables is called the dimensionality
of the data set. There are many types of data
that need to be visualized. First, we have
dimensional data: 1D, 2D, and multidimensional.
Secondly, we have other types of data, such as
text and hypertext, hierarchies and graphs, and
algorithms and software.
1D data has one dense dimension. One example
of 1D data is temporal data, which can easily be
visualized with a time line. 2D data has two
distinct dimensions. One example of 2D data is
geographical data (longitude, latitude), as found
in maps. One good way to display 2D data is
x/y-plots. Multidimensional data refers to data
with a dimensionality of three or higher. One
example of such data is data from relational
databases, which can have many columns (up to
hundreds). There is no simple way to map this
data to a 2D screen, so other, more complex
techniques are needed. One such technique is
parallel coordinates. Parallel Coordinates (figure
2) display each multidimensional data item as a
polygonal line which intersects the horizontal
dimension axes. The intersection is at the
position corresponding to the data value for the
corresponding dimension.
Figure 2: Parallel Coordinates Visualization
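As an illustration, the following is a minimal sketch of a parallel coordinates plot in Python using the pandas plotting helper; the data frame and its column names are invented for the example.

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Illustrative multidimensional data: each row is one data item,
# each numeric column becomes one vertical axis in the plot.
df = pd.DataFrame({
    "length": [5.1, 4.9, 6.3, 5.8],
    "width":  [3.5, 3.0, 3.3, 2.7],
    "height": [1.4, 1.4, 6.0, 5.1],
    "weight": [0.2, 0.2, 2.5, 1.9],
    "group":  ["a", "a", "b", "b"],   # class column used for line color
})

# Each data item is drawn as a polygonal line that intersects every
# dimension axis at the position of its value for that dimension.
parallel_coordinates(df, class_column="group")
plt.title("Parallel coordinates")
plt.show()
```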
There is other data that cannot be described
using dimensionality. Text and hypertext are one
type of such data; they cannot easily be
described with numbers and must therefore first
be transformed into vectors before visualization
techniques can be used. One simple
transformation example is word counting, which
can be combined with multidimensional scaling.
Another group of data that cannot be described
with dimensions is hierarchies and graphs. A
graph consists of nodes (sets of objects) and
edges (connections between the nodes). There
are many specific techniques to deal with this
kind of data. Algorithms and software are
another class of data. Handling software projects
is difficult, and visualization can be used to
support software development. The techniques
for doing this are likewise many and specific.
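A minimal sketch of the word-counting transformation, using only the Python standard library (the documents and vocabulary are invented):

```python
from collections import Counter

# Toy documents; in practice these would be the texts or hypertexts
# to be visualized.
docs = ["data mining finds patterns",
        "visual data exploration shows patterns"]

# Build a shared vocabulary, then represent each document as a
# vector of word counts over that vocabulary.
vocab = sorted(set(word for d in docs for word in d.split()))
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)     # the dimensions of the vector space
print(vectors)   # one count vector per document
```

The resulting count vectors can then be handed to multidimensional scaling (or another dimensionality reduction technique) to place the documents in a 2D view.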
Visualization techniques
There exist a large number of different
visualization techniques.
Geometrically transformed displays
Geometrical transformed displays aims on
finding
interesting
transformation
of
multidimensional data sets. The purpose is that
the user easily can find interesting regions to
dig deeper into to confirm whether the finding is
interesting or not. Well known techniques can
for example be scatter plot matrices or parallel
coordinates.
Iconic displays
Iconic displays map multidimensional data to the
features of an icon. Different values change the
appearance of the mapped attributes of the icon,
depending on the value of the corresponding data
record. A simple way of modifying the icons is to
map the size, angle, color, or density to a data
record. The icons can also be built in a more
complex way, where more specific features of an
icon are mapped to values of the data record.
Examples of this are Chernoff faces (figure 3),
star icons, and stick figure icons.
Figure 3: Chernoff faces
Dense Pixel Displays
Each dimension value is mapped to a pixel's
color, and the pixels are grouped in a way such
that neighboring pixels are related. Since the
dense pixel display technique uses only one pixel
per data value, this method allows large amounts
of data to be displayed at the same time (up to
about 1,000,000 data values). The main problem
with this technique is how to group the pixels in
a way that lets the user get a grasp of the data.
Different grouping methods are used depending
on the purpose of the display. By grouping the
pixels in an appropriate way, the visualization
provides information on dependencies, hot spots,
and correlations.
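A minimal sketch of the idea, assuming a simple sort-based grouping (real systems use more elaborate grouping and space-filling arrangements):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: 10,000 values of one dimension, one pixel each.
values = np.random.randn(10_000)

# A simple grouping strategy: sort the values so that neighboring
# pixels carry related values, then wrap them into a 100x100 block.
pixels = np.sort(values).reshape(100, 100)

# Map each value to a pixel color; real systems repeat one such
# block per dimension and align the blocks for comparison.
plt.imshow(pixels, cmap="viridis", aspect="auto")
plt.colorbar(label="data value")
plt.title("Dense pixel display (one dimension)")
plt.show()
```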
Stacked Displays
Stacked displays focus on presenting data in a
hierarchical way. If the data is multidimensional,
the data dimensions to be used for partitioning
the data and building the hierarchy have to be
selected appropriately.
Interaction and Distortion
Dynamic projections
Dynamic projection is about dynamically changing
the projections in order to explore a
multidimensional data set. The number of possible
projections of a multidimensional data set grows
exponentially with the number of dimensions, so
this technique cannot be used to visualize data
sets that consist of very many dimensions.
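As a rough illustration, the sketch below steps through a few random 2D projections of an 8-dimensional data set; using random orthonormal projections is one simple way to generate such a tour and is an assumption for the example, not a method prescribed by the text.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 8))      # 200 points in 8 dimensions

for step in range(3):                 # a few frames of the "tour"
    # Draw a random 8x2 matrix and orthonormalize its columns, so
    # the projection preserves scale in the chosen plane.
    q, _ = np.linalg.qr(rng.normal(size=(8, 2)))
    xy = data @ q                     # project onto the 2D plane
    plt.scatter(xy[:, 0], xy[:, 1], s=10)
    plt.title(f"Random 2D projection {step + 1}")
    plt.show()
```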
Interactive filtering
When exploring huge data sets, it is important to
narrow down the region of interesting subsets.
This can be done either by selecting a subset of
interest (browsing) or by filtering out data that is
of no interest (querying). Browsing is hard if the
data set is large, and querying often does not
produce the desired result. There exist a number
of interactive techniques for filtering data during
exploration. They all have one thing in common:
they instantly produce results, which gives the
user the possibility to modify queries and
immediately see the result.
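A minimal sketch of querying versus browsing with pandas; the transaction columns are hypothetical, and re-running an edited query immediately yields the new subset.

```python
import pandas as pd

# Hypothetical transaction data to be explored.
df = pd.DataFrame({
    "amount":  [12.0, 950.0, 33.5, 1200.0, 8.9],
    "country": ["SE", "SE", "DE", "US", "DE"],
})

# Querying: filter out data of no interest; editing and re-running
# the query shows the updated result right away.
subset = df.query("amount > 100 and country == 'SE'")
print(subset)

# Browsing: directly select a subset of interest by label.
print(df.loc[[0, 2]])
```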
Interactive distortion
The idea behind interactive distortion is to
interactively change the level of detail in
different parts of the screen. This enables the
user to quickly drill down into a certain area
while still having an overview of the rest of the
data set.
Interactive Linking and Brushing
To overcome the fact that all visualization
techniques have some strengths and some
weaknesses, a data set is often displayed in
multiple views. Different methods are combined
to give a clearer view of the data set. To make
this easier to handle, the views are linked to
each other. A change in one view does not only
change the current view; it also changes all
views linked to that view. The ability to pick one
data attribute in one view and immediately see
where that object is in the other views helps to
find correlations and dependencies.
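A minimal, static sketch of the linking idea: one selection mask shared by two scatter plot views, so a selection made in one view is highlighted in the other as well (the data and the selection rule are invented).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x, y, z = rng.normal(size=(3, 100))

# The "brush": a selection made in view 1 (here, large x values).
selected = x > 1.0

# Both views share the same mask, so the selection is linked:
# points highlighted in one view light up in the other as well.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
for ax, (a, b), title in [(ax1, (x, y), "view 1: x vs y"),
                          (ax2, (x, z), "view 2: x vs z")]:
    ax.scatter(a[~selected], b[~selected], c="lightgray", s=12)
    ax.scatter(a[selected], b[selected], c="red", s=12)
    ax.set_title(title)
plt.show()
```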
Visualizing data mining models
The role of data mining is to extract information
from a database that the user did not already
know about. The results are models and patterns
that describe useful relationships. There are
many ways to graphically represent a model, so
the visualizations that are used should be chosen
to maximize the value for the viewer. To be able
to do this, we need to understand the user's
needs and design the visualization accordingly.
For this purpose we need orienting principles as
a template for the visualization, so that it fits
both beginners and experts.
The orienting principles can be described as
maps and landmarks. By following a chain of
landmarks found on a map, you will find your
way to the final destination. A global coordinate
system (the map) must fit together with a local
coordinate system (the landmarks) in order to
give confidence (otherwise you will get lost).
The two major driving forces behind visualizing
data mining models are understanding and
trust. The simplest way to look at a data mining
model is to see it as a black box with some
inputs and outputs. In this way the user gains
almost no understanding, since he or she does
not know what is going on; so how can the user
then trust the model? Another, often much
better, way is to get the user to understand
what is going on. There is no automatic process
for doing this. If the output or the model cannot
be understood, it cannot be trusted either. If the
user can understand what has been discovered
and how, he or she will trust it. The two most
important problems to handle in order to gain
understanding are, firstly, to visualize the data
mining output in a meaningful way and,
secondly, to allow the user to interact with the
visualization.
Trust
Trust cannot be measured with only one
quantity. It has to be described in many
dimensions, with the key factors that contribute
to trust. Visualizing the limitations of a model is
very important, since one ultimately can only
disprove a model.
The ways of assessing trust (the key factors) are
many, and they are clearer than the ways of
assessing understanding:

• Do not violate expected qualitative principles that follow from general knowledge of the domain. Example of a violation: finding a correlation between shoe size and IQ.
• Domain knowledge is also critical for outlier detection. If you know that the domain lies between the numbers 10 and 50, you cannot put numbers outside it; it simply makes no sense.
• Assessing trust is closely related to model comparison, especially comparing and measuring the sensitivity and speed of a model.
• Statistical summaries are particularly useful when comparing the relative trust between two models. Relationships differ most between two models when the analysis is focused on subsets of features.
• Drill-through and multiple scales of data enhance the summaries. This makes it, for example, easier to see global and local maximum and minimum values over the entire range.
• Measure trustworthiness in some way, such as a quantified measurement of variance.
• Check the model for internal consistency across its many transformations (standard cross validation and beyond); a minimal sketch of such a check follows this list.
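A minimal sketch of the cross-validation check from the last point, using scikit-learn on synthetic data; treating low variance across folds as a simple trust signal is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the data set under analysis.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Standard cross validation: if accuracy varies wildly between folds,
# the model is internally inconsistent and harder to trust.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold accuracies:", np.round(scores, 3))
print("mean:", round(scores.mean(), 3), "std:", round(scores.std(), 3))
```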
Understanding
As mentioned before, understanding leads to
trust. The accuracy of a model is often traded for
understandability, because understanding is
more important than accuracy in a model. There
are three components to understanding a model:
representation, interaction, and integration.
Representing the model with suitable
components that are already known to the user
improves understanding. In many cases the
model contains too much information to provide
a representation that is both complete and
understandable. For example, 3D representations
can show more information than 2D, but they
need navigation and interaction to work.
Interacting with the model in real time (answering
user queries) can be done in many different
ways depending on the model. Common forms
are: interactive classification, interactive model
building, drill-up/down, animation, searching,
filtering, and level-of-detail (LOD) manipulation.
Searching, filtering, and drill-up/down make it
easier to find hidden information in a model.
Interactive classification and interactive model
building, on the other hand, help the
understanding of the model itself.
Integration between models and views provides
user context. For a user to truly understand a
model, he or she must understand how the
model relates to the data from which it was
derived. The three techniques used for this are:
drill-through, brushing, and coordinated views.
Drill-through means accessing the original data
by selecting a piece of the model. Brushing, on
the other hand, refers to selecting pieces of the
model and having them appear in a different
representation; brushing will be treated in more
detail later. Coordinated views show multiple,
linked representations (a change appears in all
views), combined with a representation of the
original data. All these techniques help the user
understand how the model relates to the original
data, and therefore give an external context for
the model and enhance validation.
How to compare models using
visualization
You can compare models using three approaches:
input/output mapping, algorithms, and processes.
The input/output approach simply considers the
mapping from a defined input space to a defined
output space. The input/output pairs of each
model are treated as a data set. For example,
two classifiers could each be described by a set
of input/output pairs, such as (obs1, class a),
(obs2, class b), (obs3, class c), and so on.
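A minimal sketch of the input/output approach, with two arbitrary scikit-learn classifiers standing in for the models being compared (the data and the agreement measure are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Two arbitrary models standing in for the ones being compared.
m1 = LogisticRegression(max_iter=1000).fit(X, y)
m2 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Describe each model purely by its input/output pairs, e.g.
# (obs1, class a), (obs2, class b), ... and compare the pairs.
pairs1 = list(zip(range(len(X)), m1.predict(X)))
pairs2 = list(zip(range(len(X)), m2.predict(X)))
agreement = sum(a == b for (_, a), (_, b) in zip(pairs1, pairs2)) / len(X)
print(f"models assign the same output to {agreement:.0%} of inputs")
```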
In the algorithm comparison approach you
express the model as a series of algorithmic
steps. Each algorithm can then be analyzed
using standard measurements such as
complexity, stability, computation time, and
computation size. These measurements can
then be visualized with, for example, bar charts
with colors and symbols.
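A minimal sketch of such a bar chart with matplotlib; the metric names and values are invented placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented measurements for two models; in practice these come from
# analyzing each algorithm (complexity, stability, time, size, ...).
metrics = ["stability", "speed", "accuracy"]
model_a = [0.8, 0.6, 0.9]
model_b = [0.7, 0.9, 0.85]

# Grouped bars, one color per model, to compare metric by metric.
x = np.arange(len(metrics))
plt.bar(x - 0.2, model_a, width=0.4, color="steelblue", label="model A")
plt.bar(x + 0.2, model_b, width=0.4, color="darkorange", label="model B")
plt.xticks(x, metrics)
plt.legend()
plt.title("Comparing models on chosen metrics")
plt.show()
```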
The modeling process includes everything in and
around the modeling, such as the methods, the
user, the database, and the support resources. It
also includes constraints such as knowledge,
time, and analysis implementation. The fact that
there are so many factors makes this the most
imperfectly defined approach, so we need to
narrow it down and neglect everything except
analysis methods and implementation issues. We
can do this if we say that the comparison is
made for one user, on one database, over a
short time period. We then choose a set of
metrics that suits the models being compared,
and visualize the comparison as in the
algorithmic comparison, for example using bar
charts with colors and symbols to show the
results.
Exploratory Data Analysis,
EDA
EDA should not be mistaken for hypothesis
testing. While hypothesis testing is the process
of verifying a hypothesis, EDA is used to find
systematic relations between variables when
there are no expectations of what the result
might be. One of the reasons for this is the large
number of variables that are often used in
exploratory analysis.
Computational EDA
There are several different EDA methods. They
include both simple statistics and more
advanced techniques to identify patterns in
multivariate data sets.
Basic statistical exploration
In basic statistical exploration, basic methods are
used such as examining the distribution of
variables, reviewing large correlation matrices
for coefficients that meet certain thresholds, or
examining multi-way frequency tables.
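A minimal sketch of reviewing a correlation matrix for coefficients above a chosen threshold; the 0.5 cutoff and the planted relationship are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=list("abcde"))
df["f"] = df["a"] * 0.9 + rng.normal(scale=0.3, size=100)  # planted link

# Review the correlation matrix for coefficients that meet a chosen
# threshold, instead of eyeballing the whole matrix by hand.
corr = df.corr()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()               # (variable, variable) -> coefficient
print(pairs[pairs.abs() > 0.5])     # only the noteworthy coefficients
```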
As mentioned, one basic method is to examine
the distribution of variables, i.e., the frequency
of values from different ranges of the variable.
Often you are interested in how close the
distribution is to a normal distribution; this can
be viewed in a histogram. When you examine the
distribution, you might see that it has two peaks
(a bimodal distribution, figure 4), which suggests
that the sample is not homogeneous but may
come from two different populations.
Figure 4: A bimodal histogram
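A minimal sketch that reproduces a bimodal histogram like figure 4, with the two underlying populations made explicit (the mixture is synthetic):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
# Two underlying populations mixed together -> bimodal distribution.
sample = np.concatenate([rng.normal(-2, 1, 500),
                         rng.normal(3, 1, 500)])

# Examining the distribution of a variable: two peaks in the
# histogram suggest the sample is not homogeneous.
plt.hist(sample, bins=40, color="gray", edgecolor="black")
plt.title("Bimodal histogram: possibly two populations")
plt.show()
```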
Multivariate exploratory analysis
These techniques are used especially to identify
patterns in multivariate data sets. A few
examples are cluster analysis, factor analysis,
log-linear analysis, and non-linear regression.
Neural Networks
Neural networks take into account learning in the
cognitive system and the neurological functions
of the brain when making the analysis. They use
the capability of predicting new observations
from previous observations, after a learning
process on existing data.
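A minimal sketch of this learn-then-predict capability, with scikit-learn's multilayer perceptron as a stand-in network and synthetic data as the "existing" observations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, random_state=0)
X_old, X_new, y_old, y_new = train_test_split(X, y, random_state=0)

# Learning process on existing data...
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X_old, y_old)

# ...then predicting new observations from what was learned.
print("accuracy on unseen data:", round(net.score(X_new, y_new), 3))
```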
Graphical EDA techniques
Graphical EDA techniques are used to identify
relations that can only be seen in a graphical
representation and that would stay hidden or
hard to detect in unstructured data sets.
Brushing
Brushing is the most common graphical EDA
technique. It is an interactive method that allows
the user to select data points or subsets of data
and identify their common characteristics or
their effects on relations between variables. To
visualize these relations, a fitted function is
used, for example 2D lines or 3D surfaces. The
user can then interactively choose specific
subsets of the data and see the changes in the
fitted functions (figure 5). One example is
selecting all data points belonging to the medium
income level in a scatter plot matrix. We can
then see how this selection affects relations
between other variables in the same data set,
such as the correlation between assets and
debts. You might discover that the correlation
between assets and debts is higher for people
with a medium income level than for people with
a high income level.

Figure 5: Brushing
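A minimal, non-interactive sketch of the income example: brush the medium-income subset and compare the assets/debts correlation inside the subset with the correlation in the full data (the data and effect sizes are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 1000
income = rng.choice(["low", "medium", "high"], size=n)
assets = rng.normal(50, 10, n)
# Invent a stronger assets/debts link for the medium-income group.
debts = assets * np.where(income == "medium", 0.8, 0.2) \
        + rng.normal(0, 5, n)
df = pd.DataFrame({"income": income, "assets": assets, "debts": debts})

# The "brush": select only medium-income data points and see how the
# relation between the other variables changes for that subset.
brushed = df[df["income"] == "medium"]
print("corr(assets, debts), all data:   ",
      round(df["assets"].corr(df["debts"]), 2))
print("corr(assets, debts), medium only:",
      round(brushed["assets"].corr(brushed["debts"]), 2))
```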
Other techniques
Even though brushing is the most common
technique, there are several other techniques
that use function fitting and plotting, data
smoothing, and categorization of data, just to
mention a few.
Verification of results of EDA
As the name implies, exploratory data analysis is
only exploratory and is only a first stage of
analysis. In a second stage the findings need to
be confirmed, for example cross-validated using
a different subset of the data. In cases where the
exploratory stage suggests a model, the validity
of the model can be tested by applying it to a
new data set and testing how well it fits. This is
also known as making a predictive validity test
for the model.
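A minimal sketch of a predictive validity test: fit a model suggested by the exploratory stage on one subset and score it on a held-out subset (linear regression and the R² score are illustrative choices, not prescribed by the text):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=4,
                       noise=10.0, random_state=0)

# The exploratory stage suggested a (linear) model on one subset...
X_explore, X_confirm, y_explore, y_confirm = train_test_split(
    X, y, test_size=0.5, random_state=0)
model = LinearRegression().fit(X_explore, y_explore)

# ...the predictive validity test applies it to a new data set and
# checks how well it fits (here via the R^2 score).
print("R^2 on confirmation data:",
      round(model.score(X_confirm, y_confirm), 3))
```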
References
[1] Daniel Keim, "Information Visualization and Visual Data Mining", IEEE Transactions on Visualization and Computer Graphics, vol. 8, no. 1, January–March 2002.

[2] Michael Friendly, "Gallery of Data Visualization", http://www.math.yorku.ca/SCS/Gallery/, April 2007.

[3] Kurt Thearling, Barry Becker, et al., "Visualizing Data Mining Models", http://www.thearling.com/text/dmviz/modelviz.htm, April 2007.

[4] StatSoft, "Exploratory Data Analysis", http://www.statsoft.com/textbook/stdatmin.html#eda, April 2007.

[5] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison-Wesley, 2006.