Data Mining in GeoVISTA Studio: Two Sample Applications

1: Introduction

This section introduces an integrated geographic knowledge discovery package, which includes a set of visualization and computational components for exploring multivariate patterns in geographic datasets in a highly interactive manner. Specifically, this tutorial describes:

(1) Interactive feature selection components to identify interesting subsets of variables for further analysis;
(2) Self-organizing map (SOM) components to cluster data objects using only the variables selected above;
(3) A high-dimensional visualization component, the Parallel Coordinate Plot (PCP), to explore and present multivariate patterns; and
(4) A geographic map component to visualize the spatial distribution of discovered patterns.

This tutorial assumes that you have learned how to connect two beans (components) in Studio. It does not present the details of building each connection (wire), which are covered in the QuickStart tutorial; it focuses instead on the overall data flow and operational details. After loading a design into Studio, however, you can inspect the details of each connection by right-clicking on a wire and selecting "Property".

This tutorial introduces two designs. The first is simpler because it relies on the user to manually pick a subset of attributes (variables) for subsequent analysis. The second is more complicated and includes a suite of beans to assist the user in feature selection.

2: A ‘Simple’ Data Mining Design

Click here to launch a full version of GeoVISTA Studio that has this design pre-loaded: http://www.geovistastudio.psu.edu/autobuild/gvstudio-datamining1.jnlp

This design includes 6 groups of components, listed as follows.
(1) Components for loading data;
(2) A simple component for selecting a subset of variables;
(3) Components for constructing an SOM (Self-Organizing Map);
(4) A PCP component for visualizing multidimensional data and patterns;
(5) A GeoMap component for geographic mapping; and
(6) A Coordinator component to link and coordinate a set of components.

See figure 1 for the corresponding components, which are grouped and labeled.

Figure 1: The Design (with a simple feature selection component).

(1) Components for Loading Data

A MtSimpleFileChooser component lets the user browse to and locate a data file (an ArcView shape file accompanied by a CSV file). A ShapeFileDataReader component reads in the shape file. A ShapeFileToShape component then transforms the spatial shapes into Java data objects that are used in the GeoMap component. A DataSetAppsWrapper component transforms the non-spatial attributes into Java data objects that are used in the other components introduced below.

(2) A Simple Component for Feature Selection

An AttributeList component allows the user to manually select a subset of attributes from the original list of attributes available in the data file. Later in this tutorial, a more complicated suite of components replaces the AttributeList component for this feature selection task.

(3) Components for Constructing an SOM

An AssigningWeights component holds the data objects passed from the DataSetAppsWrapper component, keeps the list of attributes selected in the AttributeList component, and allows the user to assign a different weight to each selected attribute (by default, all weights are equal). The SOM component then gets the data (only the selected attributes, with their specified weights) from the AssigningWeights component and constructs a self-organizing map (SOM).
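To make the weighting and training steps concrete, here is a minimal SOM sketch in Python. It is an illustration only and not the Studio's Java implementation: the function names, the grid size, the decay schedules, and the choice to apply each weight inside a weighted squared Euclidean distance are all assumptions made for this example.

```python
import math
import random


def weighted_sq_dist(a, b, weights):
    """Squared Euclidean distance with one weight per attribute (assumed weighting rule)."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights))


def best_matching_unit(grid, vec, weights):
    """Grid position (row, col) of the codebook vector closest to vec."""
    cells = ((r, c) for r in range(len(grid)) for c in range(len(grid[0])))
    return min(cells, key=lambda rc: weighted_sq_dist(grid[rc[0]][rc[1]], vec, weights))


def train_som(data, weights, rows=3, cols=3, epochs=50, seed=0):
    """Train a small rectangular SOM; returns a rows x cols grid of codebook vectors."""
    rng = random.Random(seed)
    # Initialize every codebook vector as a copy of a randomly chosen data object.
    grid = [[list(rng.choice(data)) for _ in range(cols)] for _ in range(rows)]
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1.0 - frac)                               # learning rate decays toward 0
        radius = 0.5 + (max(rows, cols) / 2.0) * (1.0 - frac)   # neighborhood shrinks
        for vec in data:
            br, bc = best_matching_unit(grid, vec, weights)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    # Gaussian neighborhood: nodes near the winner move more.
                    pull = rate * math.exp(-grid_d2 / (2.0 * radius ** 2))
                    node = grid[r][c]
                    for i, x in enumerate(vec):
                        node[i] += pull * (x - node[i])
    return grid


# Hypothetical data: four objects with two attributes, already scaled to [0, 1].
data = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
codebook = train_som(data, weights=[1.0, 1.0])
```

Raising one attribute's weight makes differences in that attribute count for more when the closest codebook vector is chosen, which is one way a weight can give an attribute more influence on the resulting clusters.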
The EntriesToSOMCells component transforms the SOM codebook vectors into a 2D array of SOMCells, which are objects that can be visualized in other components (e.g., the PCP component). An SOMColoring component assigns a color to each SOMCell according to a 2D color scheme so that nearby SOMCells in the 2D array have similar colors. A Umat component constructs a U-matrix, which is a 2D array of numbers that represent the average distances between neighboring SOM codebook vectors. Given a U-matrix and an array of colored SOMCells, the SOMViewer component visualizes the SOM. The StartTraining component is simply a button that triggers the SOM process.

For more details on SOM (Self-Organizing Map) methodologies and applications, see: Kohonen, T., 2001. Self-organizing maps. Berlin; New York: Springer.

(4) PCP for Visualizing Multidimensional Data and Patterns

A PCP (Parallel Coordinate Plot) component accepts a list of SOMCells and a list of names for the attributes involved. Each SOMCell is visualized as a string in the color assigned by the SOMColoring component.

For more details on PCPs, see: Inselberg, A., 1985. The plane with parallel coordinates. Visual Computer 1: 69-97.

(5) GeoMap for Geographic Mapping

A GeoMap component accepts a list of shapes from the ShapeFileToShape component and maps them.

(6) Coordinator for Linking Various Components

A Coordinator component is used to integrate components that have no prior knowledge of each other, as long as they follow standard interfaces for message (event) passing. Several structures of interest can be coordinated, e.g., color schemes, data, selection, and indication.

As seen in the remainder of this section, not all components mentioned above are "visible" in the sense of having a GUI. The invisible components perform computation or data transformation tasks without any visual output or human interaction.
Among the components mentioned above, the visible ones are MtSimpleFileChooser, AttributeList, AssigningWeights, StartTraining, SOMViewer, PCP, GeoMap, and Coordinator, so only these components appear in the following snapshots.

3: The Mining Process Using the Basic Design

A normal cycle within the iterative exploration process can be: loading data; transforming the data; selecting interesting subsets of variables for subsequent analysis; identifying multivariate clusters in the data (using the selected variables); interactively exploring and interpreting those clusters; and visualizing the clusters in a map to examine the spatial distribution of the discovered multivariate patterns.

Loading Data

In the MtSimpleFileChooser component (top-left corner of the Studio GUIBox), click the ‘Select’ button to locate a data file (an ArcView shape file accompanied by a .csv file). All attributes available in the data are passed to the AttributeList component and all shapes are passed to the GeoMap component (see figure 2). Here a cancer dataset with over 70 variables is loaded.

Figure 2: Loading data.

Selecting a Subset of Variables (Feature Selection)

This feature selection step is necessary for two reasons. First, the more variables are involved, the harder it is to find patterns. Second, datasets very often contain many irrelevant variables, which should be excluded from subsequent analysis. In the AttributeList component, the user can manually select a subset of variables according to her/his expertise. After picking an interesting subset of variables, the user can click the "Subspace" button to pass the selection on for further analysis.
In figure 3, four variables are manually selected:

♦ %AAallDist: % of all cancer incidences that are diagnosed at distant stage;
♦ %AAallLocal: % of all cancer incidences that are diagnosed at local stage;
♦ %AAallMissing: % of all cancer incidences for which the stage at diagnosis is missing;
♦ %AAallRegion: % of all cancer incidences that are diagnosed at regional stage.

Figure 3: Selecting variables and assigning weights.

Assigning Weights for Selected Variables

Attributes selected in the AttributeList component are then passed to the AssigningWeights component, where the user can specify a weight for each attribute (see figure 3). The more weight an attribute gets relative to the other attributes, the more influence it has on the subsequent analysis. Default weights are all equal. The user can assign any positive number as a weight. Click the "OK" button after adjusting the weights, or simply to accept the default values.

SOM Clustering and Coloring

Once the "OK" button is pressed in the AssigningWeights component, the values of the selected attributes are extracted from the data, normalized, and adjusted according to their weights. The transformed data is passed to the group of components for constructing an SOM, which is visualized in the SOMViewer component (see figure 4). In the SOMViewer, the colored circles are non-empty codebook nodes, each having at least one data object assigned to it. The radius of a circle proportionally represents the number of data objects contained in that node. The colors are assigned according to a 2D color scheme so that nearby nodes have similar colors. Since nearby nodes contain similar data objects, similar data objects have similar colors.

Figure 4: SOM clustering, coloring, PCP visualization, and GeoMap mapping.

Visualization and Coordination

Once the SOM components finish the clustering and coloring, the non-empty SOM cells and their colors are passed along with an event to the Coordinator.
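This hand-off to the Coordinator is a form of publish/subscribe: components register with the Coordinator, and an event fired by one of them is forwarded to the rest. The Python sketch below illustrates that idea only. Every class, method, and event name in it is hypothetical, and it does not reproduce the Studio's Java bean interfaces.

```python
class Coordinator:
    """Forwards an event fired by one registered component to all the others."""

    def __init__(self):
        self._listeners = []

    def register(self, component):
        self._listeners.append(component)

    def fire(self, source, event_name, payload):
        for component in self._listeners:
            if component is source:
                continue  # do not echo an event back to its sender
            handler = getattr(component, "on_" + event_name, None)
            if handler is not None:  # components ignore events they do not handle
                handler(payload)


class SomViewerStandIn:
    """Hypothetical stand-in for a component that fires a coloring event."""

    def __init__(self, coordinator):
        self.coordinator = coordinator
        coordinator.register(self)

    def finish_clustering(self, colors_by_object):
        self.coordinator.fire(self, "colors_changed", colors_by_object)


class ColorListener:
    """Hypothetical stand-in for a map or plot: remembers the shared colors it receives."""

    def __init__(self, name, coordinator):
        self.name = name
        self.colors = {}
        coordinator.register(self)

    def on_colors_changed(self, colors_by_object):
        self.colors = dict(colors_by_object)


coordinator = Coordinator()
viewer = SomViewerStandIn(coordinator)
geo_map = ColorListener("GeoMap", coordinator)
pcp = ColorListener("PCP", coordinator)

# One component fires; both listeners end up with the same color per data object.
viewer.finish_clustering({"county_1": "red", "county_2": "blue"})
```

Because the listeners only share the Coordinator and an agreed event name, neither needs any knowledge of the component that fired the event.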
As shown in the design (figure 1), the SOMViewer, GeoMap, and PCP components are all registered with the Coordinator, which means that the Coordinator coordinates these components by directing events (messages) fired by one component to the others that listen for those events. Here the Coordinator passes the SOMCells and their colors to both the GeoMap component and the PCP component. While GeoMap uses only the spatial dimensions (e.g., shapes and locations), PCP visualizes the attribute values. In PCP, each string is an SOMCell, which can contain one or more counties. Since the Coordinator makes sure that all registered components use the same color for the same data object, the user can visually identify the same data object in different components. For example, with both the GeoMap and the PCP in figure 4, we can see that the counties in red are those with a very high percentage of missing stages and a very low percentage of local stages. This means that these counties have serious problems with early detection of cancer. Most of these counties are in eastern Kentucky.

Interactive Exploration

Figure 5: Selection made in PCP.

Figure 6: Selection made in GeoMap.

4: A Complex Data Mining Design

Click here to launch a full version of GeoVISTA Studio that has this design pre-loaded: http://www.geovistastudio.psu.edu/autobuild/gvstudio-datamining2.jnlp

This design differs from the first in that the AttributeList component is replaced by a suite of components that support feature selection, i.e., selecting a subset of variables (see figure 7). For details of the feature selection approach, see: Guo, D., 2003. Coordinating Computational and Visualization Approaches for Interactive Feature Selection and Multivariate Clustering. Information Visualization 2(4): 232-246.
Figure 7: The second design. The AttributeList component is replaced by a suite of components that can assist the user in selecting interesting subsets of variables.

The components introduced below support effective feature selection with the following steps:

♦ First, measures of mutual information for each pair of variables are calculated to evaluate the “goodness of clustering” in a 2-D data space;
♦ Second, a matrix of these measures is constructed, with each column or row representing a variable;
♦ Third, a hierarchical clustering method is used to derive an ordering of all variables and to produce an enhanced visualization of the matrix that shows relationships among variables;
♦ Finally, interesting multidimensional subspaces consisting of more than two dimensions can be interactively identified.

Measures of Mutual Information

In the design shown in figure 7, three types of measure are used:

♦ Conditional entropy;
♦ Linear correlation;
♦ Chi-square.

Each of the above measures is a component that implements a generic measure interface, so it is very easy to introduce new measures. Each measure component (here ConditionalEntropy, ChiSquare, and LinearCorrelation) needs to register with the FeatureSelection component.

Sorting Variables in the Matrix

In the design shown in figure 7, two types of sorting methods are provided:

♦ A hierarchical clustering method;
♦ Null sorting (i.e., simply using the original order in the data).

Each of the above sorting methods is a component that implements a generic sorting interface, so it is very easy to introduce new sorting methods. Each sorting component (here MSTBasedSorting and NullSorting) needs to register with the FeatureSelection component.

The FeatureSelection Component

The FeatureSelection component centers on a matrix, which can show two different types of measures at the same time and sort the variables using one measure and one sorting method.
Figure 8 shows a matrix for the same cancer dataset used earlier in this tutorial. Each colored cell represents a measure value between two variables. In the snapshot, conditional entropy values of paired variables are displayed below the diagonal and correlation values of paired variables are displayed above the diagonal. In both cases, brighter cells represent good values: low conditional entropy values or high correlation values. The variables are sorted based on conditional entropy values using the MSTBasedSorting method. After sorting with the MSTBasedSorting method, variables that have strong associations with each other tend to be close to each other in the ordering, so blocks of brighter cells appear as “hot spots”. When the mouse is over a cell, the measure value of that cell pops up. The diagonal provides access to each variable; the user can select, add to, or subtract from a subset by simply clicking on a variable’s diagonal cell. A selected subset can be broadcast to other components (e.g., the SOM components) for further analysis.

DataSpaceManager Component

A DataSpaceManager component can visualize the domain structure of variables. Variables are organized into domains, e.g., census data, cancer data, etc. Cancer variables can in turn be organized into breast cancer variables, cervical cancer variables, etc. This structural information is maintained in a file, which should be loaded after the data file is loaded.

SubspaceList Component

A SubspaceList component simply keeps the current selection of variables, which constitutes a subspace for subsequent analysis. Once the “Construct” button is clicked, this list of selected variables is passed to the AssigningWeights component; from there on, everything is the same as in the first design.

There are also two components (actually two instances of the same class) that allow the user to interactively configure the coloring of the matrix cells.
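To make the pairwise matrix described in this section more concrete, the Python sketch below fills the cells below the diagonal with a conditional entropy value and the cells above it with a linear correlation value. It is a simplified illustration under stated assumptions, not the method of Guo (2003) or the Studio's components: the equal-width binning, the one-directional entropy H(Y | X), and all function names are assumptions made for this example.

```python
import math
from collections import Counter


def equal_width_bins(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 (assumed discretization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def conditional_entropy(x, y, n_bins=4):
    """H(Y | X) in bits after binning; lower values mean a stronger association."""
    bx, by = equal_width_bins(x, n_bins), equal_width_bins(y, n_bins)
    n = len(x)
    joint = Counter(zip(bx, by))
    marginal_x = Counter(bx)
    # H(Y|X) = sum over (x, y) of p(x, y) * log2(p(x) / p(x, y))
    return sum(
        (count / n) * math.log2(marginal_x[xb] / count)
        for (xb, _), count in joint.items()
    )


def linear_correlation(x, y):
    """Pearson correlation coefficient; returns 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def measure_matrix(columns, lower=conditional_entropy, upper=linear_correlation):
    """columns maps a variable name to its values. Returns (names, matrix).

    Cells below the diagonal hold `lower`, cells above hold `upper`,
    and diagonal cells are left as None.
    """
    names = list(columns)
    k = len(names)
    matrix = [[None] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if i > j:
                matrix[i][j] = lower(columns[names[i]], columns[names[j]])
            elif i < j:
                matrix[i][j] = upper(columns[names[i]], columns[names[j]])
    return names, matrix


# Hypothetical variables: b is a multiple of a, while c alternates independently of a.
columns = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [2, 4, 6, 8, 10, 12, 14, 16],
    "c": [1, 8, 1, 8, 1, 8, 1, 8],
}
names, matrix = measure_matrix(columns)
```

Because both measures are passed in as plain functions, another pairwise measure could be swapped in without changing `measure_matrix`, which loosely mirrors the idea of measures sharing a generic interface.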
5: Mining With the Complex Design

The process here is almost the same as for the first design, except that there are more steps for feature selection. A normal cycle within the iterative exploration process can be:

♦ loading data,
♦ transforming the data,
♦ selecting interesting subsets of variables for subsequent analysis,
♦ identifying multivariate clusters in the data (using the selected variables),
♦ interactively exploring and interpreting those clusters,
♦ visualizing clusters in a map to examine the spatial distribution of patterns.

Loading Data

The loading procedure is the same as introduced earlier, except that you will be prompted to load a concept hierarchy file after locating a data file. The matrix is then constructed automatically (see figure 8).

Figure 8: The matrix. Conditional entropy values of paired variables are displayed below the diagonal and correlation values above the diagonal. The diagonal provides access to each variable: the user can select an attribute by clicking on the variable’s diagonal cell.

Selecting Variables

The matrix is organized into nested subgroups of cells according to the hierarchical structure imposed on the variables. The user can click on a subgroup to zoom in; those cells are then shown in another window with their associated variable names (see figure 9). The user can select variables either in the main matrix by clicking diagonal cells, or in the zoom-in window by clicking variable names. In the zoom-in window in figure 9, seven variables are selected (shown in red):

♦ %AAallLocal: % of all cancer incidences that are diagnosed at local stage;
♦ %65+allLocal: % of all cancer incidences (age >= 65) that are diagnosed at local stage;
♦ %4064allLocal: % of all cancer incidences (40 <= age < 65) that are diagnosed at local stage;
♦ pcincome: per capita income;
♦ pctpoor: % living below federal poverty line;
♦ rent: median rent;
♦ crowded: % of families living with > 1 person per room on average.

Figure 9: Selecting variables.

Figure 10: Assigning weights.
From now on, the analysis is the same as introduced earlier.

Figure 11: SOM clustering, coloring, PCP visualization, and GeoMap mapping.

6: Tell Us What You Think!

Comments, Questions, Suggestions? Please let us know by sending email to [email protected]. We’d love to hear about anything interesting you discover using our tools or ideas you might have for future applications.