Data Mining in GeoVISTA Studio:
Two Sample Applications
1: Introduction
This section will introduce an integrated geographic knowledge discovery package,
which includes a set of visualization and computational components to explore
multivariate patterns in geographic datasets in a highly interactive manner.
Specifically, this tutorial describes:
(1) Interactive feature selection components to identify interesting subsets of
variables for further analysis;
(2) Self-organizing map (SOM) components to cluster data objects with only the
variables selected above;
(3) A high-dimensional visualization component, the Parallel Coordinate Plot (PCP), to
explore and present multivariate patterns; and
(4) A geographic map component to visualize the spatial distribution of discovered
patterns.
This tutorial assumes that you have learned how to connect two beans (components) in
Studio and does not present the details in building each connection (wire), which are
covered in the QuickStart tutorial. Rather, this tutorial focuses on the overall data flow
and operational details. However, after loading the design into Studio, you can learn the
details of each connection by right-clicking on a wire and selecting "Property".
This tutorial directly introduces you to two designs. The first design is simpler because it
simply relies on the user to manually pick a subset of attributes (variables) for
subsequent analysis. The second design is more complicated and includes a suite of
beans to assist the user in feature selection.
2: A ‘Simple’ Data Mining Design
Click here to launch a full version of GeoVISTA Studio that has this design pre-loaded:
http://www.geovistastudio.psu.edu/autobuild/gvstudio-datamining1.jnlp
This design includes 6 groups of components listed as follows.
(1) Components for loading data;
(2) A simple component for selecting a subset of variables;
(3) Components for constructing an SOM (Self-Organizing Maps);
(4) A PCP component for visualizing multidimensional data and patterns;
(5) A GeoMap component for geographic mapping; and
(6) A Coordinator component to link and coordinate a set of components.
See figure 1 for corresponding components, which are grouped and labeled.
Figure 1: The Design (with a simple feature selection component).
(1) Components for Loading Data
A MtSimpleFileChooser component lets the user browse and locate a data file (an
ArcView shape file accompanied by a CSV file). A ShapeFileDataReader component
reads in the shape file. A ShapeFileToShape component then transforms the data
(spatial shapes) into Java data objects that are used in the GeoMap component. A
DataSetAppsWrapper component transforms the data (non-spatial attributes) into
Java data objects that are used in other components (i.e., those introduced below).
(2) A Simple Component for Feature Selection
An AttributeList component allows the user to manually select a subset of attributes
from the original list of attributes available in the data file. Later in this tutorial a more
complicated suite of components will replace the AttributeList component
for this feature selection task.
(3) Components For Constructing an SOM
An AssigningWeights component holds the data objects passed from the
DataSetAppsWrapper component, keeps a list of selected attributes from the
AttributeList component, and allows the user to assign a different weight to each
selected attribute (by default all weights are equal). Then the SOM component
gets the data (only for those selected attributes with specified weights) from the
AssigningWeights component and constructs a self-organizing map (SOM). The
EntriesToSOMCells component transforms SOM codebook vectors into a 2D array of
SOMCells, which are objects that can be visualized in other components (e.g., the PCP
component). An SOMColoring component assigns a color to each SOMCell according to
a 2D color scheme so that nearby SOMCells in the 2D array have similar colors. A Umat
component constructs a U-matrix, which is a 2D array of numbers that represent the
average distances between neighboring SOM codebook vectors. With a U-matrix and an
array of colored SOMCells, the SOMViewer component visualizes the SOM. The
StartTraining component is simply a button that triggers the SOM process.
For more details on SOM (Self-Organizing Map) methodology and applications, see:
Kohonen, T., 2001. Self-organizing maps, Berlin ; New York : Springer.
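As an illustration of the U-matrix idea described above (the average distance between neighboring codebook vectors), here is a minimal Python sketch. It is not the Umat component's code: the function name `u_matrix`, the rectangular grid, the 4-connected neighborhood, and the Euclidean distance are all assumptions made for this example, and the actual component may use a different grid topology or distance.

```python
import math


def u_matrix(codebook):
    """Average distance from each codebook vector to its grid neighbors.

    codebook: 2D list (rows x cols) of equal-length numeric vectors.
    Assumes a rectangular grid with 4-connected neighbors (an assumption
    of this sketch, not a documented property of the Umat component).
    """
    rows, cols = len(codebook), len(codebook[0])
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(codebook[r][c], codebook[nr][nc]))
            # A 1 x 1 grid has no neighbors; leave its value at 0.0.
            result[r][c] = sum(dists) / len(dists) if dists else 0.0
    return result
```

Large values in the result mark boundaries between groups of dissimilar codebook vectors, while small values mark regions where neighboring nodes are alike.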
(4) PCP For Visualizing Multidimensional Data And Patterns
A PCP (Parallel Coordinate Plot) component accepts a list of SOMCells and a list of names
for the attributes involved. Each SOMCell is visualized as a string in the color assigned by
the SOMColoring component.
For more details on PCPs, see: Inselberg, A., 1985. The plane with parallel coordinates.
Visual Computer 1: 69-97.
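To make the "string" in a parallel coordinate plot concrete, the following Python sketch turns one record into the vertices of a polyline, one vertex per attribute axis. The function name `pcp_polyline`, the pixel dimensions, and the per-axis min-max scaling are assumptions for illustration only; they are not taken from the PCP component.

```python
def pcp_polyline(values, mins, maxs, width=300.0, height=100.0):
    """Return (x, y) vertices for one record drawn across parallel axes.

    values, mins, maxs: one entry per attribute (axis).
    Axes are spaced evenly from x = 0 to x = width; each value is scaled
    between its axis minimum (bottom) and maximum (top).
    """
    n = len(values)
    points = []
    for i, (v, lo, hi) in enumerate(zip(values, mins, maxs)):
        x = width * i / (n - 1) if n > 1 else 0.0
        # An axis with no spread is drawn at mid-height.
        t = (v - lo) / (hi - lo) if hi > lo else 0.5
        y = height * (1.0 - t)  # screen y grows downward
        points.append((x, y))
    return points
```

Drawing each SOMCell's polyline in its assigned color is what lets similar cells appear as bundles of similarly colored strings.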
(5) GeoMap For Geographic Mapping
A GeoMap component accepts a list of shapes coming from the ShapeFileToShape
component and maps them.
(6) Coordinator For Linking Various Components
A Coordinator component is used to integrate components that have no prior
knowledge of each other, as long as they follow standard interfaces for message (event)
passing. Several structures of interest can be coordinated, e.g., color schemes,
data, selection, and indication.
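The coordination idea can be sketched in a few lines of Python. This is only a conceptual model: the class names `MapView` and `PlotView`, the `register` and `fire` methods, and the `on_<event>` handler convention are hypothetical and do not reflect the actual Studio interfaces.

```python
class Coordinator:
    """Forwards an event from one registered component to all the others."""

    def __init__(self):
        self._components = []

    def register(self, component):
        self._components.append(component)

    def fire(self, source, event_name, payload):
        # Deliver to every other component that defines a matching handler.
        handler_name = "on_" + event_name
        for component in self._components:
            if component is source:
                continue
            handler = getattr(component, handler_name, None)
            if callable(handler):
                handler(payload)


class MapView:
    """Hypothetical component that listens only to selection events."""

    def __init__(self):
        self.selection = None

    def on_selection(self, ids):
        self.selection = list(ids)


class PlotView:
    """Hypothetical component that listens to selection and color events."""

    def __init__(self):
        self.selection = None
        self.colors = None

    def on_selection(self, ids):
        self.selection = list(ids)

    def on_colors(self, colors):
        self.colors = dict(colors)
```

The components never reference each other directly. Each one only needs to follow the shared handler convention, which is the property the text attributes to the Coordinator.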
As seen in the remainder of this section, not all components mentioned above are
"visible" in terms of having a GUI. Those invisible components perform computation or
data transformation tasks without any visual output or human interaction. Among the
components mentioned above, visible components include MtSimpleFileChooser,
AttributeList, AssigningWeights, StartTraining, SOMViewer, PCP, GeoMap,
and Coordinator. So, in the following snapshots you can only see these components.
3: The Mining Process Using The Basic Design
A normal cycle within the iterative exploration process can be: loading data,
transforming the data, selecting interesting subsets of variables for subsequent analysis,
identifying multivariate clusters of the data (using selected variables), interactively
exploring and interpreting those clusters, and visualizing the clusters in a map to examine
the spatial distribution of those discovered multivariate patterns.
Loading Data
In the MtSimpleFileChooser component (top-left corner in the Studio GUIBox), click
the ‘Select’ button to locate a data file (an ArcView shape file accompanied by a .csv
file). All attributes available in the data are passed to the AttributeList component
and all shapes are passed to the GeoMap component (see figure 2). Here a cancer
dataset with over 70 variables is loaded.
Figure 2: Loading data.
Selecting a Subset of Variables (Feature Selection)
This feature selection step is necessary for two reasons. First, the more variables
involved, the harder it is to find patterns. Second, very often there are many irrelevant
variables in the dataset, which should be excluded from subsequent analysis. In the
AttributeList component, the user can manually select a subset of variables according to
her/his expertise. After picking an interesting subset of variables, the user can click the
"Subspace" button to pass on the selection for further analysis.
In figure 3, four variables are manually selected:
♦ %AAallDist: % of all cancer incidences that are diagnosed at distant stage;
♦ %AAallLocal: % of all cancer incidences that are diagnosed at local stage;
♦ %AAallMissing: % of all cancer incidences that are diagnosed at missing stage;
♦ %AAallRegion: % of all cancer incidences that are diagnosed at regional stage.
Figure 3: Selecting variables and assigning weights.
Assigning Weights for Selected Variables
Attributes selected in the AttributeList component are then passed to the
AssigningWeights component, where the user can specify a weight for each attribute
(see figure 3). The more weight one attribute gets (compared to other attributes'
weights), the more influence it will have in the subsequent analysis. Default weights are
all equal. The user can assign any positive number for a weight. Click the "OK" button
after adjusting the weights or simply accepting the default values.
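One plausible way for weights to change an attribute's influence is to rescale each attribute to a common range and then multiply it by its weight, so that heavier attributes contribute more to distances between records. The Python sketch below assumes min-max normalization and simple multiplication. Both are assumptions of this example, as is the function name `normalize_and_weight`; the tutorial does not specify the exact transformation used by Studio.

```python
def normalize_and_weight(rows, weights):
    """Min-max normalize each column to [0, 1], then scale it by its weight.

    rows: list of equal-length numeric records.
    weights: one positive number per column.
    """
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    columns = list(zip(*rows))
    if len(columns) != len(weights):
        raise ValueError("need exactly one weight per column")
    scaled_columns = []
    for column, w in zip(columns, weights):
        lo, hi = min(column), max(column)
        span = hi - lo
        # A constant column carries no information; map it to 0.
        scaled_columns.append(
            [w * ((v - lo) / span if span else 0.0) for v in column]
        )
    return [list(record) for record in zip(*scaled_columns)]
```

With equal weights every attribute spans the same range. Doubling one weight doubles that attribute's range and therefore its share of any distance computed on the result.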
SOM Clustering and Coloring
Once the "OK" button is pressed in the AssigningWeights component, the values of
those selected attributes will be extracted from the data, normalized, and adjusted
according to their weights. This transformed data is passed to the group of components
for constructing an SOM, which is visualized in the SOMViewer component (see figure 4).
In the SOMViewer, the colored circles are non-empty codebook nodes, each having at
least one data object assigned to it. The radius of a circle is proportional to the
number of data objects contained in that node. The colors are assigned according to a
2D color scheme so that nearby nodes have similar colors. Since nearby nodes contain
similar data objects, similar data objects will have similar colors.
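A 2D color scheme of this kind can be approximated by blending four corner colors across the grid, so that neighboring cells receive nearly identical colors. The following Python sketch uses bilinear interpolation. The function name `cell_color` and the corner colors are arbitrary choices for illustration and are not the colors or method used by the SOMColoring component.

```python
def cell_color(row, col, n_rows, n_cols,
               corners=((255, 0, 0), (255, 255, 0), (0, 0, 255), (0, 255, 0))):
    """Bilinearly blend four corner RGB colors for the cell at (row, col).

    corners: top-left, top-right, bottom-left, bottom-right (hypothetical
    defaults). Returns an (r, g, b) tuple of integers.
    """
    u = col / (n_cols - 1) if n_cols > 1 else 0.0  # horizontal position, 0..1
    v = row / (n_rows - 1) if n_rows > 1 else 0.0  # vertical position, 0..1
    tl, tr, bl, br = corners
    return tuple(
        round((1 - u) * (1 - v) * tl[k] + u * (1 - v) * tr[k]
              + (1 - u) * v * bl[k] + u * v * br[k])
        for k in range(3)
    )
```

Because the blend changes smoothly with row and column, the color difference between two cells grows with their distance on the grid.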
Figure 4: SOM clustering, coloring, PCP visualization, and GeoMap mapping.
Visualization and Coordination
Once the SOM components finish the clustering and coloring, those non-empty SOM
cells and their colors are passed along with an event to the Coordinator. As shown in
the design (figure 1), the SOMViewer, the GeoMap, and the PCP component are all
registered with the Coordinator, which means that the Coordinator will coordinate
these components by directing events (messages) fired by one component to others
that listen to those events. Here the Coordinator passes the SOMCells and their colors to
both the GeoMap component and the PCP component. While GeoMap only uses the
spatial dimensions (e.g., shape and locations), PCP visualizes the attribute values. In
PCP, each string is an SOMCell, which can contain one or more counties.
Since the coordinator makes sure that all registered components use the same color for
the same data object, the user can visually identify the same data object in different
components. For example, with both the GeoMap and the PCP in figure 4, we can see
that counties in red are those with a very high percentage of cases with missing stage
and a very low percentage of cases diagnosed at the local stage. This suggests that these
counties have serious problems in early detection of cancer. These counties are mostly
in eastern Kentucky.
Interactive Exploration
Figure 5: Selection made in PCP.
Figure 6: Selection made in GeoMap.
4: A Complex Data Mining Design
Click here to launch a full version of GeoVISTA Studio that has this design pre-loaded:
http://www.geovistastudio.psu.edu/autobuild/gvstudio-datamining2.jnlp
This design is different from the first design in that the AttributeList component is
replaced by a suite of components to support feature selection.
A suite of components for selecting a subset of variables (see figure 7).
For details of the feature selection approach, see:
Guo, D., 2003. Coordinating Computational and Visualization Approaches for
Interactive Feature Selection and Multivariate Clustering. Information Visualization 2(4):
232-246.
Figure 7: The second design—the AttributeList component is replaced by a suite of
components, which can assist the user in selecting interesting subsets of variables.
The components introduced below support effective feature selection through the following
steps:
♦ First, measures of mutual information for each pair of variables are calculated to
evaluate the “goodness of clustering” in a 2-D data space;
♦ Second, a matrix of these measures is constructed with each column or row
representing a variable;
♦ Third, a hierarchical clustering method is used to derive an ordering of all variables and
produce an enhanced visualization of the matrix that shows relationships among
variables;
♦ Finally, interesting multidimensional subspaces consisting of more than two
dimensions can be interactively identified.
Measures of Mutual Information
In the design shown in Figure 7, three types of measure are used:
♦ Conditional entropy;
♦ Linear correlation;
♦ Chi-square.
Each of the above measures is a component that implements a generic measure
interface, so it is easy to introduce new measures. Each measure component
(here ConditionalEntropy, ChiSquare, and LinearCorrelation) needs to
register with the FeatureSelection component.
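For readers unfamiliar with conditional entropy, the Python sketch below computes H(Y|X) for two variables after discretizing them into equal-width bins. It illustrates the general definition only. The function names, the equal-width binning, and the choice of base-2 logarithm are assumptions of this example and may differ from what the ConditionalEntropy component does.

```python
import math
from collections import Counter


def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins (indices 0..k-1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / k
    # The maximum value would fall into bin k; clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def conditional_entropy(xs, ys):
    """H(Y|X) in bits for two equal-length sequences of discrete labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    marginal_x = Counter(xs)
    h = 0.0
    for (x, _y), count in joint.items():
        # p(x, y) * log2(p(y | x)), where p(y | x) = count / marginal_x[x]
        h -= (count / n) * math.log2(count / marginal_x[x])
    return h
```

A value of 0 means that knowing the bin of X fully determines the bin of Y, which matches the tutorial's reading that low conditional entropy is a "good" value.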
Sorting Variables in the Matrix
In the design shown in Figure 7, two types of sorting methods are provided:
♦ A hierarchical clustering method;
♦ Null sorting (i.e., simply using the original order in the data).
Each of the above sorting methods is a component that implements a generic sorting
interface, so it is easy to introduce new sorting methods. Each sorting
component (here MSTBasedSorting and NullSorting) needs to register with
the FeatureSelection component.
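To give a feel for how a spanning tree can produce a variable ordering, the Python sketch below builds a minimum spanning tree over a pairwise dissimilarity matrix with Prim's algorithm and lists the variables in depth-first order. This is a simplified stand-in written for this tutorial, not the MSTBasedSorting algorithm itself (see Guo, 2003 for that); the function name `mst_order` and the input convention (smaller value means stronger association) are assumptions.

```python
def mst_order(dist):
    """Order variables by a depth-first walk of a minimum spanning tree.

    dist: symmetric n x n matrix of dissimilarities, where a smaller
    value means a stronger association between two variables.
    """
    n = len(dist)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known edge into the tree
    parent = [-1] * n
    children = [[] for _ in range(n)]
    best[0] = 0.0
    for _ in range(n):
        # Prim's step: add the cheapest vertex not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            children[parent[u]].append(u)
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    # Depth-first preorder walk, visiting the closest child first.
    order, stack = [], [0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in sorted(children[u], key=lambda v: dist[u][v], reverse=True):
            stack.append(v)
    return order
```

In an ordering like this, strongly associated variables tend to end up next to each other, which is what makes blocks of bright cells visible in the sorted matrix.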
The FeatureSelection Component
The FeatureSelection component centers on a matrix, which can show two different
types of measures at the same time and sort the variables using one measure and one
sorting method. Figure 8 shows a matrix of the same cancer dataset used earlier in this
tutorial. Each cell with a color represents a measure value between two variables. In the
snapshot, conditional entropy values of paired variables are displayed below the
diagonal and correlation values of paired variables are displayed above the diagonal. In
both cases, the brighter cells represent good values: low conditional entropy values or high
correlation values. The variables are sorted based on conditional entropy values using
the MSTBasedSorting method. After sorting with the MSTBasedSorting method,
variables that have strong associations with each other tend to be close to each other in
the ordering. Thus a block of cells with brighter colors will appear as “hot spots”.
With the mouse over a cell, the measure value of that cell pops out. The diagonal
provides access to each variable; the user can select, add to, or subtract from a subset
by simply clicking on the variable’s diagonal cell. A selected subset can be broadcast to
other components (e.g., those SOM components) for further analysis.
DataSpaceManager Component
A DataSpaceManager component can visualize the domain structure of variables.
Variables are organized into domains, e.g., census data, cancer data, etc. Then cancer
variables can again be organized into breast cancer variables, cervical cancer variables,
etc. This structural information is maintained in a file, which should be loaded after the
data file is loaded.
SubspaceList Component
A SubspaceList component simply keeps the current selection of variables, which
constitutes a subspace for subsequent analysis. Once the “Construct” button is clicked,
this list of selected variables will be passed to the AssigningWeights component, from
which point everything is the same as in the first design.
There are also two components (both instances of the same class) that
allow the user to interactively configure the coloring of the matrix cells.
5: Mining With The Complex Design
The process here is almost the same as for the first design, except that there are more
steps for feature selection. A normal cycle within the iterative exploration process can be:
♦ loading data,
♦ transforming the data,
♦ selecting interesting subsets of variables for subsequent analysis,
♦ identifying multivariate clusters of the data (using selected variables),
♦ interactively exploring and interpreting those clusters,
♦ visualizing clusters in a map to examine the spatial distribution of patterns.
Loading Data
The loading procedure is the same as introduced earlier, except that you will be prompted
to load a concept hierarchy file after locating a data file. Then the matrix will
automatically be constructed (see figure 8).
Figure 8: The matrix. Conditional entropy values of paired variables are displayed below the
diagonal and correlation values above the diagonal. The diagonal provides access to each
variable: the user can select an attribute by clicking on the variable’s diagonal cell.
Selecting Variables
The matrix is organized into nested subgroups of cells according to the hierarchical
structure imposed on the variables. The user can click on a subgroup to zoom in—those
cells will be shown in another window with their associated variable names (see figure
9). The user can select either in the main matrix by clicking diagonal cells, or in the
zoom-in window by clicking variable names. In the zoom-in window in figure 9, seven
variables are selected (shown in red):
♦ %AAallLocal: % of all cancer incidences that are diagnosed at local stage;
♦ %65+allLocal: % of all cancer incidences (age >= 65) that are diagnosed at local stage;
♦ %4064allLocal: % of all cancer incidences (40 <= age < 65) that are diagnosed at local stage;
♦ pcincome: per capita income;
♦ pctpoor: % living below federal poverty line;
♦ rent: median rent;
♦ crowded: % of families living with > 1 person per room on average.
Figure 9: Selecting variables.
Figure 10: Assigning weights. From now on, the analysis is the same as introduced earlier.
Figure 11: SOM clustering, coloring, PCP visualization, and GeoMap mapping.
6: Tell Us What You Think!
Comments, Questions, Suggestions? Please let us know by sending email to
[email protected] . We’d love to hear about anything interesting you discover using
our tools or ideas you might have for future applications.