The Role of Visualization in Data Mining

Björn M. Gustafsson, Jonas K. Gustafsson and Ragnar E. Hammarqvist
Data Mining, TNM033, 2007, University of Linköping

Abstract

Visual data exploration allows faster data exploration and generally provides better results than automatic data mining algorithms alone. Visual data mining (VDM) techniques are classified along three dimensions: the data type to be visualized, the visualization technique, and the interaction and distortion technique. A large number of different visualization techniques exist, each suited to a particular type of data. The two major driving forces behind visualizing data mining models are understanding and trust; in general, good understanding leads to trust. Exploratory data analysis is used to find systematic relations between variables when there is little or no knowledge of what the result may be, which is why exploratory analysis works only as a first step towards a predictive model.

1 Introduction

Data from many different areas (monitoring systems, credit cards and so on) is collected today because people believe the information is useful. The problem is finding the valuable information hidden in the data. This is a difficult task, and it is where visual data exploration in data mining comes in. It is important to include the human in the data exploration process in order for data mining to be useful and effective. The idea of visual data exploration is to represent the raw data visually; the human can then gain insight, draw conclusions and interact with the data. According to [1], the main advantages of visual data exploration compared with automatic data mining techniques are:

• Visual data exploration can easily deal with non-homogeneous and noisy data.
• Visual data exploration is intuitive and requires no understanding of complex mathematical or statistical algorithms or parameters.

As a result, visual data exploration allows faster exploration and generally provides better results, especially in cases where automatic algorithms fail.

Visual data exploration is usually done in three steps known as the Visual Exploration Paradigm: overview first, zoom and filter, then details-on-demand. The user first needs an overview of the data. Second, the user may want to focus on interesting patterns. Finally, the user wants to examine and analyze the patterns and therefore needs to drill down to look at details of the data. All of this can be done visually using different techniques.

Classification of Visual Data Mining (VDM) techniques

VDM techniques are classified along three dimensions: data type to be visualized, visualization technique, and interaction and distortion technique. The variables of each dimension are shown in figure 1. All dimensions are orthogonal to each other, which means that any combination is possible. Furthermore, a specific system may support several data types and may combine multiple visualization and interaction techniques.

Figure 1: Classification of techniques
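To make the orthogonality concrete, the design space can be enumerated as a Cartesian product of the three dimensions. The value lists in this Python sketch are abridged stand-ins for the entries of figure 1, not an exact reproduction:

```python
from itertools import product

# Abridged stand-ins for the three classification dimensions.
data_types = ["1D", "2D", "multidimensional", "text/hypertext",
              "hierarchies/graphs", "algorithms/software"]
vis_techniques = ["geometrically transformed", "iconic display",
                  "dense pixel display", "stacked display"]
interaction = ["dynamic projection", "interactive filtering",
               "interactive distortion", "linking and brushing"]

# Orthogonality means every combination is a valid design point.
combos = list(product(data_types, vis_techniques, interaction))
print(len(combos), "possible combinations, for example:")
for combo in combos[:5]:
    print(combo)
```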
Data type to be visualized

The data in information visualization usually consists of many records, and the number of attributes differs from data set to data set. The number of variables is called the dimensionality of the data set. Many types of data need to be visualized. First, there is dimensional data: 1D, 2D and multidimensional. Second, there are other types of data such as text and hypertext, hierarchies and graphs, and algorithms and software.

1D data has one dense dimension. One example of 1D data is temporal data, which can easily be visualized with a time line. 2D data has two distinct dimensions. One example of 2D data is maps, i.e. geographical data (longitude, latitude), and a good way to display 2D data is x/y-plots. Multidimensional data refers to data with a dimensionality of three or higher. One example is data from relational databases, which can have many columns (up to hundreds). There is no simple way to map such data to a 2D screen; more complex techniques are needed. One such technique is parallel coordinates. Parallel coordinates (figure 2) display each multidimensional data item as a polygonal line which intersects the horizontal dimension axes; each intersection is at the position corresponding to the data value for that dimension.

Figure 2: Parallel coordinates visualization

There is also data that cannot be described using dimensionality. Text and hypertext are one such type: they cannot easily be described with numbers and must therefore first be transformed into vectors before visualization techniques can be applied. One simple transformation is word counting, which can be combined with multidimensional scaling. Another group of data that cannot be described with dimensions is hierarchies and graphs. A graph consists of nodes (sets of objects) and edges (connections between the nodes). There are many specific techniques to deal with this kind of data. Algorithms and software are yet another class of data. Handling software projects is difficult, and visualization can be used to support software development; here too the techniques are many and specific.

Visualization techniques

There exists a large number of different visualization techniques.

Geometrically transformed displays

Geometrically transformed displays aim at finding interesting transformations of multidimensional data sets. The purpose is to let the user easily find interesting regions to dig deeper into, in order to confirm whether a finding is interesting or not. Well-known techniques include scatter plot matrices and parallel coordinates, sketched below.
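As a minimal sketch of parallel coordinates, assuming pandas and matplotlib are available; the data frame below is invented for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

# Invented example records; "group" is used only to color the lines.
df = pd.DataFrame({
    "height": [1.62, 1.75, 1.80, 1.68, 1.90],
    "weight": [55, 72, 85, 60, 95],
    "age":    [23, 35, 41, 29, 52],
    "income": [31_000, 48_000, 62_000, 39_000, 71_000],
    "group":  ["a", "b", "b", "a", "b"],
})

# Normalize each dimension to [0, 1] so no single axis dominates.
numeric = df.drop(columns="group")
scaled = (numeric - numeric.min()) / (numeric.max() - numeric.min())
scaled["group"] = df["group"]

# Each record becomes a polygonal line that intersects one axis per
# dimension at the record's (scaled) value for that dimension.
parallel_coordinates(scaled, class_column="group")
plt.show()
```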
Iconic displays

In iconic displays, multidimensional data is mapped to the features of an icon. Different values change the appearance of the mapped attributes of the icon according to the value of the corresponding data record. A simple way of modifying the icons is to map size, angle, color or density to the data record. The icons can also be built in a more complex way, where more specific features of an icon are mapped to values of the data record. Examples are Chernoff faces (figure 3), star icons and stick figure icons.

Figure 3: Chernoff faces

Dense pixel displays

In dense pixel displays, each dimension value is mapped to a pixel's color, and the pixels are grouped so that neighboring pixels belong to related dimensions. Since the technique uses only one pixel per data value, it allows large amounts of data to be displayed at the same time (up to about 1,000,000 data values). The main problem is how to group the pixels so that the user can get a grip on the data; different grouping methods are used depending on the purpose of the display. By grouping the pixels in an appropriate way, the visualization provides information on dependencies, hot spots and correlations.

Stacked displays

Stacked displays focus on presenting data in a hierarchical way. If the data is multidimensional, the data dimensions used for partitioning the data and building the hierarchy have to be selected appropriately.

Interaction and distortion

Dynamic projections

Dynamic projection means dynamically changing the projection in order to explore a multidimensional data set. The number of possible projections grows exponentially with the number of dimensions, so the technique cannot be used to visualize data sets with very many dimensions.

Interactive filtering

When exploring huge data sets, it is important to narrow them down to interesting subsets. This can be done either by selecting a subset of interest (browsing) or by filtering out data that is of no interest (querying). Browsing is hard if the data set is large, and querying often does not produce the desired result. A number of interactive filtering techniques exist; they all have in common that they produce results instantly, which gives the user the possibility to modify queries and immediately see the effect.

Interactive distortion

The idea of interactive distortion is to interactively change the level of detail in different parts of the screen. This enables the user to quickly drill down into a certain area while still keeping an overview of the rest of the data set.

Interactive linking and brushing

To overcome the fact that every visualization technique has strengths and weaknesses, a data set is often displayed in multiple views, combining different methods to give a clearer picture of the data. To make this easier to handle, the views are linked to each other: a change in one view changes all views linked to it. The ability to pick a data item in one view and immediately see where it appears in the other views helps in finding correlations and dependencies, as in the sketch below.
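A minimal sketch of linking and brushing with matplotlib; the data and attribute names are invented, and a rectangle dragged in the left view highlights the same records in both views:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.widgets import RectangleSelector

# Invented data set: 200 records with four numeric attributes.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 4))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
left = ax1.scatter(data[:, 0], data[:, 1], color="steelblue")
right = ax2.scatter(data[:, 2], data[:, 3], color="steelblue")
ax1.set(xlabel="attribute 0", ylabel="attribute 1")
ax2.set(xlabel="attribute 2", ylabel="attribute 3")

def on_select(eclick, erelease):
    # Records brushed in the left view...
    x0, x1 = sorted((eclick.xdata, erelease.xdata))
    y0, y1 = sorted((eclick.ydata, erelease.ydata))
    inside = ((data[:, 0] >= x0) & (data[:, 0] <= x1) &
              (data[:, 1] >= y0) & (data[:, 1] <= y1))
    colors = np.where(inside, "crimson", "steelblue")
    # ...are highlighted in both views (the views are linked).
    left.set_color(colors)
    right.set_color(colors)
    fig.canvas.draw_idle()

selector = RectangleSelector(ax1, on_select, useblit=True)
plt.show()
```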
Visualizing data mining models

The role of data mining is to extract information from a database that the user did not already know about. The result is models and patterns that describe useful relationships. There are many ways to graphically represent a model, so the visualization should be chosen to maximize its value for the viewer. To be able to do this, we need to understand the user's needs and design the visualization accordingly, and we need orienting principles as a template for the visualization so that it fits both beginners and experts. The orienting principles can be described as maps and landmarks: by following a chain of landmarks found on a map, you find your way to the destination. A global coordinate system (the map) and a local coordinate system (the landmarks) must fit together in order to give confidence; otherwise you get lost.

The two major driving forces behind visualizing data mining models are understanding and trust. The simplest way to look at a data mining model is as a black box with some inputs and outputs. Seen this way, the user gains almost no understanding, since he or she does not know what is going on inside; how, then, can the user trust the model? Another, often much better, way is to get the user to understand what is going on. There is no automatic process for this. If the output or the model cannot be understood, it cannot be trusted either; if the user can understand what has been discovered and how, he or she will trust it. The two most important problems to solve in order to gain understanding are, first, to visualize the data mining output in a meaningful way and, second, to allow the user to interact with the visualization.

Trust

Trust cannot be measured with a single quantity; it has to be described in several dimensions by the key factors that contribute to it. Visualizing the limitations of a model is very important, since ultimately one can only disprove a model. The ways of assessing trust (the key factors) are many, and clearer than the ways of assessing understanding:

• The model should not violate expected qualitative principles given general knowledge of the domain. An example of a violation: finding a correlation between shoe size and IQ.
• Domain knowledge is also critical for outlier detection. If you know that the domain lies between 10 and 50, values outside that range simply make no sense.
• Assessing trust is closely related to model comparison, especially comparing and measuring the sensitivity and speed of a model.
• Statistical summaries are particularly useful when comparing the relative trust of two models. Relationships differ most between two models when the analysis focuses on subsets of features.
• Drill-through and multiple scales of data enhance the summaries; for example, they make it easier to see global and local maximum and minimum values over the entire range.
• Trustworthiness should be measured in some quantified way, such as a measurement of variance.
• The model should be checked for internal consistency across its many transformations (standard cross-validation and beyond).

Understanding

As mentioned before, understanding leads to trust. The accuracy of a model is often traded for understandability, because understanding a model is frequently more important than its accuracy. There are three components of understanding a model: representation, interaction and integration.

Representing the model with components that are already known to the user improves understanding. In many cases, however, the model contains too much information to allow a representation that is both complete and understandable. For example, 3D representations can show more information than 2D, but they need navigation and interaction to work.

Interacting with the model in real time (answering user queries) can be done in many different ways depending on the model. Common forms are interactive classification, interactive model building, drill-up/down, animation, searching, filtering and level-of-detail (LOD) manipulation. Searching, filtering and drill-up/down make it easier to find hidden information in a model, whereas interactive classification and interactive model building help the understanding of the model itself.

Integration between models and views provides user context. For users to truly understand a model, they must understand how the model correlates with the data from which it was derived. Three techniques are used for this: drill-through, brushing and coordinated views. Drill-through means accessing the original data by selecting a piece of the model. Brushing refers to selecting pieces of the model and having them appear in a different representation; brushing is discussed in more detail later. Coordinated views are multiple linked representations (changes show up in all views), combined with a representation of the original data. All these techniques help the user understand how the model relates to the original data, and thereby give an external context for the model and enhance validation.

How to compare models using visualization

Models can be compared using three approaches: input/output mapping, algorithms and processes.

The input/output approach simply considers the mapping from a defined input space to a defined output space, describing the input/output behavior of each model as a data set. For example, two classifiers could each be described by a set of input/output pairs such as (obs1, class a), (obs2, class b), (obs3, class c), and so on.
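A minimal sketch of the input/output approach in Python; the two "models" are invented stand-ins for real classifiers, and the tallied pairs could then be visualized, for instance as a bar chart:

```python
from collections import Counter

# Two invented classifiers, stand-ins for real mined models.
def model_a(x):
    return "high" if x["income"] > 50_000 else "low"

def model_b(x):
    return "high" if x["income"] + 10_000 * x["dependents"] > 60_000 else "low"

# Invented observations spanning the input space.
observations = [
    {"income": 40_000, "dependents": 0},
    {"income": 55_000, "dependents": 1},
    {"income": 80_000, "dependents": 3},
    {"income": 30_000, "dependents": 2},
]

# Describe each model purely by its input/output pairs, then count
# how often the two mappings agree or disagree.
pairs = [(model_a(x), model_b(x)) for x in observations]
for (a, b), n in sorted(Counter(pairs).items()):
    print(f"model A: {a:>4}  model B: {b:>4}  count: {n}")
```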
In the algorithms comparison approach, the model is expressed as a series of algorithmic steps. Each algorithm can then be analyzed using standard measurements such as complexity, stability, computation time and computation size, and the measurements can be visualized, for example with bar charts using colors and symbols.

The modeling process includes everything in and around the modeling: the methods, the user, the database and the support resources, as well as constraints such as knowledge, time and analysis implementation. Because it covers so much, this approach is the most imperfectly defined, and it has to be narrowed down by neglecting everything except analysis methods and implementation issues. This can be done by assuming that the comparison is made for one user, on one database, over a short time period. A set of metrics suited to the models being compared is then chosen, and the comparison is visualized as in the algorithmic approach, for example with bar charts using colors and symbols.

Exploratory Data Analysis (EDA)

EDA should not be mistaken for hypothesis testing. While hypothesis testing is the process of verifying a hypothesis, EDA is used to find systematic relations between variables when there are no expectations of what the result might be. One reason for this is the large number of variables often involved in exploratory analysis.

Computational EDA

There are several different computational EDA methods, ranging from simple statistics to more advanced techniques for identifying patterns in multivariate data sets.

Basic statistical exploration

In basic statistical exploration, simple methods are used, such as examining the distribution of variables, reviewing large correlation matrices for coefficients that meet certain thresholds, or examining multi-way frequency tables. Examining the distribution of a variable means looking at the frequency of values from different ranges of the variable; often one is interested in how close the distribution is to a normal distribution, which can be judged from a histogram. Such an examination might reveal that the distribution has two peaks (is bimodal, figure 4), which suggests that the sample is not homogeneous but perhaps drawn from two different populations.

Figure 4: A bimodal histogram
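As a minimal sketch of this kind of check, assuming numpy and matplotlib; the sample is synthetic and deliberately drawn from two populations:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic sample: a mixture of two populations, so the
# histogram should show two peaks (a bimodal distribution).
rng = np.random.default_rng(42)
sample = np.concatenate([rng.normal(160, 6, 500),
                         rng.normal(180, 6, 500)])

plt.hist(sample, bins=40, color="steelblue", edgecolor="white")
plt.xlabel("value")
plt.ylabel("frequency")
plt.title("Two peaks suggest two underlying populations")
plt.show()
```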
Multivariate exploratory analysis

These techniques are designed specifically to identify patterns in multivariate data sets. A few examples are cluster analysis, factor analysis, log-linear analysis and non-linear regression.

Neural networks

Neural network analysis draws on learning in the cognitive system and the neurological functions of the brain. It uses the capability of predicting new observations from previous observations after a learning process on existing data.

Graphical EDA techniques

Graphical EDA techniques are used to identify relations that can only be seen in a graphical representation and that would remain hidden or hard to detect in unstructured data sets.

Brushing

Brushing is the most common graphical EDA technique. It is an interactive method that allows the user to select data points or subsets of data and identify their common characteristics or their effects on relations between other variables. To visualize these relations, a fitted function is used, for example 2D lines or 3D surfaces. The user can then interactively choose specific subsets of the data and see the changes in the fitted functions (figure 5). One example is selecting all data points belonging to the medium income level in a scatter plot matrix and observing how this affects relations between other variables in the same data set, such as the correlation between assets and debts. One might discover that the correlation between assets and debts is higher for people with medium incomes than for people with high incomes.

Figure 5: Brushing

Other techniques

Even though brushing is the most common technique, there are several others, using function fitting and plotting, data smoothing, and categorization of data, to mention a few.

Verification of results of EDA

As the name implies, exploratory data analysis is only exploratory and is only a first stage of analysis. In a second stage, the findings need to be confirmed, for example by cross-validating them on a different subset of the data. Where the exploratory stage suggests a model, the validity of the model can be tested by applying it to a new data set and testing how well it fits; this is also known as a predictive validity test of the model.

References

[1] Daniel Keim, "Information Visualization and Visual Data Mining", IEEE Transactions on Visualization and Computer Graphics, vol. 7, no. 1, January-March 2002.
[2] Michael Friendly, "Gallery of Data Visualization", http://www.math.yorku.ca/SCS/Gallery/, April 2007.
[3] Kurt Thearling, Barry Becker et al., "Visualizing Data Mining Models", http://www.thearling.com/text/dmviz/modelviz.htm, April 2007.
[4] StatSoft, "Exploratory Data Analysis", http://www.statsoft.com/textbook/stdatmin.html#eda, April 2007.
[5] Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining, Addison-Wesley, 2006.