Data Mining and visualization (2)
Alfredo Vellido

Plan
A brief introduction to data visualization
Visualization & history
Perception
Visual exploratory DM
The good, the bad & the ugly …

Visualization recap

Recap: PRINCIPLES: the data mining visual cycle, or Visual Exploratory Data Mining
[Diagram elements: Data gathering; Data manipulation; Hypothesis of reality; DATA; Preprocessing & transformation; MODEL; Model manipulation; Graphic engine; Data exploration; Control & navigation; visuo-spatial model; cognitive-logic model]

Recap: CRISP: Methodology phases

Recap: Data visualization vs model visualization

What type of visualization are we looking for?
Descriptive? Exploratory?

What type of visualization are we looking for?
DESCRIPTIVE

PRINCIPLES: a good visualization should...
...show data and/or results...
...at different levels of detail, from the overall landscape to the fine detail...
...in a coherent manner, even when dealing with large collections...
...avoiding, as far as possible, distortion in their representation...
...focusing attention on the most relevant features...
...minimizing the impact of uninformative and misleading data...
...integrating statistical results and linguistic descriptions (if possible and relevant).

DATA EXPLORATION: The CURSE of dimensionality
Most data available to us are stored in different kinds of databases and in numeric format, mostly organized in table structures (remember the survey!). An extension of these are the data cubes generated by OLAP processes.

What are your preferred methods for storing data for data mining?
[403 votes total]
Storage method                          Votes   Share
Text, CSV (comma-separated)               72     18%
Text, space or tab separated              55     14%
Excel                                     38      9%
SAS                                       57     14%
SPSS                                      31      8%
S-Plus/R                                  15      4%
Weka ARFF                                 23      6%
Other data mining tool format             11      3%
In a database system                      93     23%
Other - please comment                     8      2%

How to display multiple dimensions?
Cases:
Low dimensionality (1-3D)
Moderate dimensionality (4-10D)
High dimensionality (>10D)

DATA EXPLORATION: low-moderate dimensionality (<10D)
Spatial coordinates
3D requires interactivity
Further pre-cognitive visual elements allow us to “add” extra dimensions: color, movement, shape, …
Exotic solutions
Glyphs*: Chernoff faces, stick figures, whiskers...
* A glyph is a graphical representation of one or more characters, or of part of a character. A character is a textual entity, whereas a glyph is a graphical entity.

… some of those alternatives
Gantt diagrams…

… some of those alternatives
Chernoff faces
Herman Chernoff (1973). "Using faces to represent points in k-dimensional space graphically". Journal of the American Statistical Association 68 (342): 361–368.

DATA EXPLORATION: high dimensionality data
How do we visualize data of high (or even very high) dimensionality?
Some of the alternatives are rather straightforward… some others are not…
Eliminate dimensions (data variables): those which are redundant and/or uninformative (at least you manage to alleviate part of the problem…)
Divide & conquer, a classic: create multiple visualizations of low dimensionality.
Latent and projection models

DATA EXPLORATION: The Grand Tour: multiple visualization of Iris data
www.ics.uci.edu/~mlearn/MLRepository.html

TECHNIQUES: Latency and projection
Projection: dimensionality compression; similarity information coding
Clustering: finding grouping structure in data; similarity information coding
Self-Organizing Map (SOM) & Generative Topographic Mapping (GTM): they combine latent representation and clustering

TECHNIQUES: projection
Representation in <4-D, so that the distance-neighborhood relations between multi-dimensional points are faithfully preserved
It is impossible to preserve the information in full
Some scale normalization is required
Linear vs. non-linear projections

TECHNIQUES: projection: methods
Methods based on inter-point distances, in which we aim to minimize an inherent projection distortion (E), where:
$d_x$ = distance in the original space
$d_y$ = distance in the projection space
$h$ = neighborhood function

MDS, PCA:             $E = (d_x - d_y)^2$
Sammon’s projection:  $E = (d_x - d_y)^2 / d_x$
CCA:                  $E = (d_x - d_y)^2 \, e^{-d_y}$
SOM:                  $E = d_x^2 \, h(d_y)$
...

TECHNIQUES: projection: methods in a nutshell
MDS: a technique used in data visualisation for exploring similarities or dissimilarities in data. An MDS algorithm starts with a matrix of item-item similarities, then assigns a location to each item in a low-dimensional space, suitable for graphing or 3D visualisation.
Taxonomy:
Metric multidimensional scaling -- assumes the input matrix is just an item-item distance matrix. Analogous to PCA, an eigenvector problem is solved to find the locations that minimize distortions to the distance matrix. Its goal is to find a Euclidean distance approximating a given distance.
Generalized multidimensional scaling (GMDS) -- a superset of metric MDS that allows the target distances to be non-Euclidean.
Non-metric multidimensional scaling -- finds a non-parametric monotonic relationship between the dissimilarities in the item-item matrix and the Euclidean distance between items, and the location of each item in the low-dimensional space.
Biblio:
Abdi, H. (2007). Metric multidimensional scaling. In N.J. Salkind (Ed.): Encyclopedia of Measurement and Statistics. Thousand Oaks (CA): Sage.
Kruskal, J. B., and Wish, M. (1978). Multidimensional Scaling. Sage University Paper series on Quantitative Application in the Social Sciences, 07-011. Beverly Hills and London: Sage Publications.

TECHNIQUES: projection: methods in a nutshell
PCA: a linear transformation that represents the data in a new coordinate system such that the greatest variance of the data lies along the first coordinate (called the first principal component), the second greatest variance along the second coordinate, and so on. PCA can be used for dimensionality reduction by retaining only those characteristics of the dataset that contribute most to its variance.
Taxonomy:
Kernel PCA
PPCA
CCA (when unfolding a nonlinear structure, Sammon's mapping cannot reproduce all distances. One way to face this problem consists in favouring local topology: CCA tries to reproduce short distances first, while long distances remain secondary.)
FA
Some source code:
Open Computer Vision Library @ sourceforge.net/projects/opencvlibrary/
Murtagh’s page @ http://astro.u-strasbg.fr/~fmurtagh/mda-sw/

TECHNIQUES: projection: example
[Figure panels: Sammon’s projection; PCA; CCA]

TECHNIQUES: projection: discussion; pros & cons
Projection techniques encode proximity / similarity information in spatial coordinates (plus, sometimes, extra pre-cognitive elements such as colour ...)
They allow…
… finding “natural” data groupings (clusters) on the basis of some sort of similarity
… finding the “shapes” of these groupings
But ...
Projection is always limited by error and information loss
The new projection coordinates are not always readily interpretable (they are latent by definition)
The original relations between data dimensions are lost
Quite often, the computational effort has to be taken into account, as most of these methods are based on distances between multivariate points.

TECHNIQUES: multiple visualizations
How to get some of the information conveyed by the observable variables back into the projections?
One possibility: using multiple visualizations.
Parallel coordinates and pre-cognitive stimuli (colour, position...)

TECHNIQUES: SOM & GTM
Self-Organizing Feature Map (or Kohonen Map)
k-means is a special case of SOM
Discretization (in the form of network grids) and projection are performed simultaneously
Set of prototypes » model
Cooperative learning (through the neighbourhood function)
Competitive learning (winner takes most, if not all)
GTM is a probabilistic alternative to SOM (i.e., a form of statistical learning)
GTM is a generative model and, therefore, aims to reproduce data density distributions
It defines a proper error function
It is a non-linear latent model that can also be interpreted as a mixture model.
All the learning parameters can be adaptively optimized.

TECHNIQUES: SOM & GTM: training / fitting
The learning process for both models can be illustrated by the fisherman's net simile.
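
The competitive and cooperative learning steps listed for SOM can be made concrete with a small sketch. What follows is a minimal illustration in plain Python, not the implementation used in the course: the one-dimensional chain of units, the Gaussian neighbourhood function, the linear decay schedules, the toy data and every function name are assumptions chosen for brevity.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def best_matching_unit(x, prototypes):
    """Competitive step: index of the prototype closest to x."""
    return min(range(len(prototypes)), key=lambda j: sq_dist(x, prototypes[j]))


def train_som(data, n_units=4, epochs=30, lr0=0.5, sigma0=1.5, seed=0):
    """Train a 1-D SOM (a chain of units); all defaults are illustrative."""
    rng = random.Random(seed)
    # Initialise each prototype as a copy of a randomly chosen data point.
    prototypes = [list(rng.choice(data)) for _ in range(n_units)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            # Learning rate and neighbourhood width shrink linearly over time.
            decay = 1.0 - step / total_steps
            lr = lr0 * decay
            sigma = max(sigma0 * decay, 0.01)
            winner = best_matching_unit(x, prototypes)
            # Cooperative step: the winner's neighbours on the chain also
            # move towards x, weighted by a Gaussian neighbourhood function.
            for j, proto in enumerate(prototypes):
                h = math.exp(-((j - winner) ** 2) / (2.0 * sigma ** 2))
                for k in range(len(proto)):
                    proto[k] += lr * h * (x[k] - proto[k])
            step += 1
    return prototypes


# Hypothetical toy data: two small groups of 2-D points.
toy_data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
som_units = train_som(toy_data)
```

As the neighbourhood width shrinks towards zero, only the winning unit is updated and the procedure behaves like an online form of k-means, which is one way to read the remark that k-means is a special case of SOM.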
TECHNIQUES: SOM & GTM: clustering
The SOM and GTM “units” can be interpreted as micro-clusters
U-matrix (distance in local neighbourhood) or Magnification Factor (distortion levels)
Discrete or fuzzy clusters, from local density or probability maxima
Hierarchical clustering and dendrograms

TECHNIQUES: SOM & GTM: multiple visualization

TECHNIQUES: SOM & GTM: visualization of class membership

Visualization: further exotica
Exotica: Cone trees
Exotica: Mapscapes

Visualization: software
Visualizing data:
Simple and off the shelf: SS&C: Heatmaps®
… Complex and off the shelf: TheBrain Tech. Corp. “This is the knowledge crisis – An ever-increasing demand for organizational knowledge coupled with an unforgiving environment in which to produce it. Currently, we have no systems to automate and capture the knowledge processes that are critical to our success.”
Woven and off the shelf: Ixacta Web Analyzer. Neighborhood sitemap diagram: Ixsite creates this diagram to help you visualize the relationship between the files on your site.
Woven and free: http://graphics.stanford.edu/
SOM off the shelf: Visumap (www.visumap.net); Ellipse eSOM (www.ellipse.fi)
SOM fishing: REEFSOM, Applied Neuroinformatics Group, Bielefeld University, Germany

Visualization: in summary …
Which are the features of a good, successful visualization?
Show the data (exploratory element)
Focus the attention (… on the most relevant aspects)
Never forget the “human factor” in visual perception
The science of vision is the necessary framework for visualization techniques
You have to be careful with pre-cognitive elements (position, movement, colour, shape) in the visual coding of dimensions.
How to use visualization in exploratory data mining?
Visualization allows speculation and model validation.
Visualization of high-dimensional data sets can be accomplished through:
projections and clustering methods
multiple simultaneous visualizations.
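
The projection route in this summary always comes at the price of some distortion. As a rough illustration, the sketch below evaluates two of the pairwise error measures given earlier for inter-point distance methods, E = (dx - dy)^2 for MDS/PCA and E = (dx - dy)^2 / dx for Sammon's projection, summed over all pairs of points. The toy "projection" that simply drops the third coordinate is an assumption standing in for a real projection method, the function names are hypothetical, and any normalisation of the Sammon error is left out.

```python
import math
from itertools import combinations


def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def projection_errors(original, projected):
    """Return (raw_error, sammon_error), each summed over all point pairs.

    raw_error    sums (dx - dy)^2        (the MDS / PCA form)
    sammon_error sums (dx - dy)^2 / dx   (the Sammon form, unnormalised)
    """
    raw = 0.0
    sammon = 0.0
    for i, j in combinations(range(len(original)), 2):
        dx = euclid(original[i], original[j])    # distance in original space
        dy = euclid(projected[i], projected[j])  # distance in projection space
        raw += (dx - dy) ** 2
        if dx > 0:  # skip coincident points to avoid dividing by zero
            sammon += (dx - dy) ** 2 / dx
    return raw, sammon


# Hypothetical toy example: three 3-D points, projected by dropping z.
points_3d = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 1.0)]
points_2d = [p[:2] for p in points_3d]
raw_error, sammon_error = projection_errors(points_3d, points_2d)
```

Dividing by dx makes the Sammon form penalise errors on short original distances more heavily than the plain squared difference does, which is why it tends to preserve local structure better.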
The good ...
According to Michael Friendly’s Gallery of Data Visualization (Psych./York Univ.)
NY weather in 1980. NYT, Jan. 1981
2200 data pieces!!!

The good ...
According to Michael Friendly’s Gallery of Data Visualization (Psych./York Univ.)

... And the bad and ugly
According to Michael Friendly’s Gallery of Data Visualization (Psych./York Univ.)

Off-campus
FRIDAY 27 - AFTERNOON
LIGHT AND DATA: A JOURNEY THROUGH THE NEW AESTHETICS OF INFORMATION
ArtFutura is dedicating its first afternoon to the work of artists, scientists and designers who are developing new and innovative ways of visualizing information and giving it meaning.
18:00 hours / Room MAC - Mercat de las Flors
Andrew Vande Moere, Information Aesthetics (AUS)
“Forms follow Data: An introduction to the art of data visualization”
http://www.infosthetics.com/
Andrew Vande Moere is the editor of Information Aesthetics, the outstanding weblog dedicated to exploring the art and science of the dynamic representation of information. In his blog, Andrew shows and analyses artistic design and research projects based on the real-time exploration of large databases and on communicating, by means of innovative interfaces, the meaningful patterns hidden within them. Information Aesthetics offers an in-depth look into the exciting world of data landscapes, a discipline that, having seduced artists and scientists, promises to radically change our user experience in the area of information.

Off-campus
Museo de la ciencia y la técnica de Catalunya (Terrassa)
http://www.mnactec.com/eng/index.htm
Until December 17th