Visualization for Classification and Clustering Techniques
Marc René
CSE 8331 Data Mining - Project 1

Overview
- Importance of Data Visualization in the KDD Process
- Understanding and Trust
- Visualization techniques
  - Classification
  - Clustering
- Future Directions

KDD Process
- Selection: obtain data from all sources
- Preprocessing: after selecting the data, clean it to make sure it is consistent
- Transformation: after preprocessing the data, analyze the format/amount of data
- Data Mining: once the data is in a usable format/content, apply various algorithms based upon the results to be achieved
- Interpretation/Evaluation: finally, present the results of the data mining step to the user, so that the results can be used to solve the business need at hand

Importance of Data Visualization
- The final step in the KDD process:
  - Highly dependent on the data visualization technique
  - A bad/inappropriate technique may result in misunderstanding
  - Misunderstanding may cause an incorrect (or no) decision
- It is important to consider that the KDD process is useless if the results are not understandable

Current Issues w/Data Visualization
- The literature suggests a significant reliance on expert users
- General lack of data visualization support in many data mining tools [Goebel99]
- These are significant problems if KDD/DM/data visualization expand at the rates suggested
  - Data visualization tool market: $2.2 billion by 2007 [Nuttall03]

Suggested Direction
- Need to determine techniques that balance simplicity with completeness
- If this can be done for non-expert users:
  - Simplicity & Completeness -> Understanding
  - Understanding -> Trust
  - Trust -> more use of KDD/DM
- Result will be:
  - Better business value
  - Higher ROI

Common Visualization Techniques
- Visualization techniques depend upon
  - The type of data mining technique chosen
  - The underlying structure and attributes of the data

  Classification                  Clustering
  Decision Trees                  Scatter Plots
  Scatter Plots                   Dendrograms
  Axis-Parallel Decision Trees    Smoothed Data Histograms
  Circle Segments                 Self-Organizing Maps
  Decision Tables                 Proximity Matrices

Classification

Decision Tree
- Information limited to
  - Attributes
  - Splitting values
  - Terminal node class assignments

Decision Tree with Histograms
- Data mining rarely classifies 100% of the data correctly:
  - Include the success of properly classifying the data
  - Histogram added for each terminal node
  - Percentage of data that was classified correctly/incorrectly
- Assists users in determining if the classification is 'good enough'

Decision Tree - Different Format
- Vertical representation allows for easy user interaction
  - Combines the split points and classification accuracy compactly
  - Key difference: colors are matched with a specific classification

Scatter Plot with Regression Line
- Excellent way to view 2-dimensional data
- Familiar to anyone who has taken high-school algebra
- Regression lines provide descriptive techniques for classification

Axis-Parallel Decision Tree
- Combination of scatter plot and decision tree
- Areas divided into regions parallel to the axes
- Well suited for classification problems with two attribute values
- High visibility into the impact of outliers

Circle Segments
- Multi-dimensional data
- Maps a dataset with n dimensions onto a circle divided into n segments
  - Each segment is a different attribute
  - Each pixel inside a segment is a single value of the attribute
  - Values of each attribute are then sorted (independently) and assigned a different color based upon their class

Decision Table
- Interactive technique
- Maps attribute data to a 2D hierarchical matrix
- Levels can be drilled down to another set of attributes
- Height of a cell conveys the number of data entities
- Cells are color coded
  - Neutral color: no data in that intersection point
  - Color coded by class (percentage)

Decision Table
[figure]

Clustering

Scatter Plot
- Extensions include:
  - Displaying points in various sizes and colors to indicate additional attributes
  - Shading points to introduce a third dimension
  - Using different brightness levels of the same color to represent continuous values for the same attribute
  - Using various point or classification identifiers (e.g., numbers, symbols)
  - Using various glyphs to display additional attributes

Scatter Plot
- Map decision trees on top of scatter plots to describe clusters

Scatter Plot with Regression Lines
[figure]

Scatter Plot w/Min Spanning Tree
[figure]

Dendrogram
- Intuitive representation: hierarchical decomposition of data into sets of nested clusters
- From an agglomerative perspective:
  - Each leaf: a single data entity
  - Each internal node: the union of all data entities in its sub-tree
  - The root: the entire dataset
  - The height of any internal node: the similarity between its 'children'
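
The agglomerative reading of a dendrogram can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used to produce the slides' figures: it assumes one-dimensional points and single-linkage merging, and the names Node, single_link, build_dendrogram, and render are hypothetical. Height here is the merge distance (a dissimilarity), so a lower internal node means more similar children.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass
class Node:
    members: Tuple[int, ...]        # indices of the data entities under this node
    height: float = 0.0             # merge distance; leaves sit at height 0
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def single_link(a: Node, b: Node, points: List[float]) -> float:
    """Smallest pairwise distance between members of two clusters (assumed linkage)."""
    return min(abs(points[i] - points[j]) for i in a.members for j in b.members)


def build_dendrogram(points: List[float]) -> Node:
    """Start with one leaf per entity and merge the closest pair until one root remains."""
    clusters = [Node(members=(i,)) for i in range(len(points))]
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_link(pair[0], pair[1], points),
        )
        merged = Node(
            members=tuple(sorted(a.members + b.members)),  # union of both sub-trees
            height=single_link(a, b, points),
            left=a,
            right=b,
        )
        clusters = [c for c in clusters if c is not a and c is not b] + [merged]
    return clusters[0]


def render(node: Node, depth: int = 0) -> List[str]:
    """Indented text view: one line per node, children below their parent."""
    lines = [f"{'  ' * depth}{list(node.members)} height={node.height:g}"]
    for child in (node.left, node.right):
        if child is not None:
            lines.extend(render(child, depth + 1))
    return lines


points = [1.0, 2.0, 6.0, 6.5, 12.0]   # made-up sample data
root = build_dendrogram(points)
print("\n".join(render(root)))
```

In the printed tree, the root lists every index (the entire dataset), each internal node lists the union of its children, and each leaf holds a single index, mirroring the four bullets above.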

Dendrogram with Exemplars
- The "most typical member of each cluster" [Wishart99]
  - Underlined labels of the leaves
  - Done in combination with shading to identify the clustering level

Smoothed Data Histogram
- Represents data on a 'display map'
- Similar data items are located close to each other
- The more defined the clusters, the lighter the colors

Smoothed Data Histogram - Detail
[figure]

Self-Organizing Map 'Grid'
- Source of the Smoothed Data Histogram
- Numbers indicate the most 'common' cluster
- Grid values in extraction order (row/column layout not recoverable):
  1 5 2 3 2 5 6 5 2 2 2 4 5 5 5 7 1 1 1 5 7 7 8 7 7 7 10 7 7 9 7 7 11 7 10 7 8 7

Proximity Matrix
- Graphically displays the relationship between data elements
- Usually symmetric, but can be sorted by the strength of relationships

Proximity Matrix and Dendrogram
[figure]

Summary
- Data visualization techniques are extremely important for understanding the KDD process
- A balance of simplicity and completeness is important
- The techniques discussed allow average users to understand the results of the KDD process
- Understanding allows KDD results to be interpreted/trusted by non-expert users, extending the business value
- If data visualization techniques do not establish a high level of trust in the KDD process, the process will fail

Future Direction
- Significant effort will be spent on improving data visualization techniques in the next few years
  - The KDD process and data mining are becoming more widespread
  - Business will expect tools to become more 'user-friendly' and support varied levels of skill
- Trends are moving to a more interactive mode
  - Static reporting techniques (e.g., standard decision trees, standard circle segments) are being replaced by interactive techniques (e.g., smoothed data histograms, interactive circle segments and decision tables)
  - Very interactive 'virtual reality' data models are also being considered/proposed

References Part 1
Ahlberg, C., “Spotfire: An Information Exploration Environment”, ACM SIGMOD Record, Volume 25, Number 4, December 1996.
Ankerst, M., et al., “Visual Classification: An Interactive Approach to Decision Tree Construction”, KDD-99, San Diego, CA.
Ankerst, M., et al., “Towards an Effective Cooperation of the User and the Computer for Classification”, KDD’00, Boston, MA, USA.
Apte, C. and Weiss, S. M., “Data Mining with Decision Trees and Decision Rules”, Future Generation Computer Systems, November 1997.
Arkin, E., et al., “Decision Trees for Geometric Models”, ACM, 9th Annual Computational Geometry, 5/93/CA, USA.
de Hann, G., et al., “Towards Intuitive Exploration Tools for Data Visualization in VR”, VRST’02, November 11-13, 2003, Hong Kong.
Dunham, M. H., Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003.
Fekete, J. and Plaisant, C., “Excentric Labeling: Dynamic Neighborhood Labeling for Data Visualization”, Proceedings of the Conference on Human Factors in Computer Systems (CHI'99), ACM, New York.
Fredrikson, A., et al., “Temporal, Geographical and Categorical Aggregations Viewed through Coordinated Displays: A Case Study with Highway Incident Data”, NPIVM’99, Kansas City, MO, 1999.
Goebel, M. and Gruenwald, L., “A Survey of Data Mining and Knowledge Discovery Software Tools”, SIGKDD Explorations, June 1999.
Han, J. and Cersone, N., “RuleViz: A Model for Visualizing Knowledge Discovery Process”, Sixth International Conference on Knowledge Discovery and Data Mining, 2000.
Ho, T., et al., “Visualization Support for a User-Centered KDD Process”, SIGKDD’02, 2002.

References Part 2
Hsieh, H. and Shipman, F. M. III, “VITE: A Visual Interface Supporting the Direct Manipulation of Structured Data Using Two-Way Mappings”, IUI 2000, New Orleans, LA.
“Solving Business Problems with IBM DB2 Intelligent Miner”, presented by DB2 Developer Domain, http://www7b.software.ibm.com/dmdd
Jain, A. K., et al., “Data Clustering: A Review”, ACM Computing Surveys, Volume 31, Number 3, September 1999.
Keim, D. A., “Visual Techniques for Exploring Databases”, KDD’97, Newport Beach, CA, 1997.
Kohavi, R. and Sommerfield, D., “Targeting Business Users with Decision Table Classifiers”, KDD’99, New York City, 1998.
Kohavi, R., et al., “Emerging Trends in Business Analytics”, Communications of the ACM, Volume 45, Number 8, August 2002.
Liu, B., et al., “Clustering Through Decision Tree Construction”, CIKM 2000, ACM, McLean, VA, 2000.
Louie, J. Q. and Kraay, T., “Origami: A New Visualization Tool”, KDD-99, San Diego, CA.
Moret, B. M. E., “Decision Trees and Diagrams”, Computing Surveys, Volume 14, Number 4, December 1982.
Nuttall, C., “It's a Vision Thing”, Financial Times - IT Review, November 12, 2003.
Pampalk, E., et al., “Using Smoothed Data Histograms for Cluster Visualization in Self-Organizing Maps”, Proceedings of the International Conference on Artificial Neural Networks (ICANN’02), Springer Lecture Notes in Computer Science, Madrid, Spain, 2002.
Pampalk, E., et al., “Content-based Organization and Visualization of Music Archives”, Proceedings of the 10th ACM International Conference on Multimedia (MM’02), Juan-les-Pins, France, 2002.
Pampalk, E., et al., “A New Approach to Hierarchical Clustering and Structuring of Data with Self-Organizing Maps”, Intelligent Data Analysis Journal (IDA), Volume 8, Number 2, 2003.

References Part 3
Rauber, A., et al., “Empirical Evaluation of Clustering Algorithms”, Journal of Information and Organizational Sciences (JIOS), Volume 24, Number 2, 2000.
“Finding the Solution to Data Mining - Exploring the Features and Components of Enterprise Miner, Release 4.1 from SAS”, SAS White Paper, 2001.
See5 - Data Mining Tools, Release 1.9, Rulequest Research, 1997-2003.
Simoff, S. J., “VDM@ECML/PKDD2001: The International Workshop on Visual Data Mining at ECML/PKDD 2001”, SIGKDD Explorations, Volume 3, Issue 2, 2001.
Thearling, K., “Understanding Data Mining: It’s All in the Interaction”, DS Star: The On-Line Executive Journal for Data-Intensive Decision Support, Volume 1, Number 10, December 9, 1997.
Thearling, K., et al., “Visualizing Data Mining Models”, in Information Visualization in Data Mining and Knowledge Discovery, edited by Fayyad, Usama, et al., Morgan Kaufmann, 2001.
Ward, M. O., “XmdvTool: Integrating Multiple Methods for Visualizing Multivariate Data”, Proceedings of IEEE Visualization '94, Washington, DC, 1994.
Wishart, D., “Efficient Hierarchical Cluster Analysis for Data Mining and Knowledge Discovery”, Computing Science and Statistics, Volume 30, 1998.
Wishart, D., “ClustanGraphics3 - Interactive Graphics for Cluster Analysis”, in Classification in the Information Age, Gaul, W. and Locarrek-Junge, H. (Eds.), Springer, 1999.
XmdvTool Home Page, http://davis.wpi.edu/~xmdv/visualizations.html