Wang, J. (Ed.) Encyclopedia of Data Warehousing and Mining, 2nd edition, 1085-1090, Idea Group Inc., 2008.

On Interactive Data Mining

INTRODUCTION

Exploring and extracting knowledge from data is one of the fundamental problems in science. Data mining consists of important tasks, such as description, prediction and explanation of data, and applies computer technologies to nontrivial calculations. Computer systems can maintain precise operations under a heavy information load, and also can maintain steady performance. Without the aid of computer systems, it is very difficult for people to be aware of, to extract, to search and to retrieve knowledge in large and separate datasets, let alone interpreting and evaluating data and information that are constantly changing, and then making recommendations or predictions based on inconsistent and/or incomplete data.

On the other hand, the implementations and applications of computer systems reflect the requests of human users, and are affected by human judgement, preference and evaluation. Computer systems rely on human users to set goals, to select alternatives if an original approach fails, to participate in unanticipated emergencies and novel situations, and to develop innovations in order to preserve safety, avoid expensive failure, or increase product quality (Elm, et al., 2004; Hancock & Scallen, 1996; Shneiderman, 1998).

Users possess varied skills, intelligence, cognitive styles, and levels of tolerance of frustration. They come to a problem with diverse preferences, requirements and background knowledge. Given a set of data, users will see it from different angles, in different aspects, and with different views. Considering these differences, a universally applicable theory or method to serve the needs of all users does not exist. This motivates and justifies the co-existence of numerous theories and methods of data mining systems, as well as the exploration of new theories and methods.
According to the above observations, we believe that interactive systems are required for data mining tasks. Generally, interactive data mining is an integration of human factors and artificial intelligence (Maanen, Lindenberg and Neerincx, 2005); an interactive system is an integration of a human user and a computer machine, communicating and exchanging information and knowledge. Through interaction and communication, computers and users can share the tasks involved in order to achieve a good balance of automation and human control. Computers are used to retrieve and keep track of large volumes of data, and to carry out complex mathematical or logical operations. Users can then avoid routine, tedious and error-prone tasks, concentrate on critical decision making and planning, and cope with unexpected situations (Elm, et al., 2004; Shneiderman, 1998). Moreover, interactive data mining can encourage users’ learning, improve insight and understanding of the problem to be solved, and stimulate users to explore creative possibilities. Users’ feedback can be used to improve the system. The interaction is mutually beneficial, and imposes new coordination demands on both sides.

BACKGROUND

The importance of human-machine interaction has been well recognized and studied in many disciplines. One example of interactive systems is an information retrieval system or a search engine. A search engine connects users to Web resources. It navigates searches, stores and indexes resources, responds to users’ particular queries, and ranks and provides the most relevant results to each query. Most of the time, a user initiates the interaction with a query. Frequently, feedback will arouse the user’s particular interest, causing the user to refine the query, and then change or adjust further interaction.
Without this mutual connection, it would be hard, if not impossible, for the user to access these resources, no matter how important and how relevant they are. The search engine, as an interactive system, uses the combined power of the user and the resources, to ultimately generate a new kind of power.

Though human-machine interaction has been emphasized for a variety of disciplines, until recently it has not received enough attention in the domain of data mining (Ankerst, 2001; Brachmann & Anand, 1996; Zhao & Yao, 2005). In particular, the human role in the data mining processes has not received its due attention. Here, we identify two general problems in many of the existing data mining systems:

1. Overemphasizing the automation and efficiency of the system, while neglecting the adaptiveness and effectiveness of the system. Effectiveness includes human subjective understanding, interpretation and evaluation.
2. A lack of explanations and interpretations of the discovered knowledge. Human-machine interaction is always essential for constructing explanations and interpretations.

To study and implement an interactive data mining system, we need to pay more attention to the connection between human users and computers. For cognitive science, Wang and Liu (2003) suggest a relational metaphor, which assumes that relations and connections of neurons represent information and knowledge in the human brain, rather than the neurons alone. Berners-Lee (1999) explicitly states that “in an extreme view, the world can be seen as only connections, nothing else.” Based on this statement, the World Wide Web was designed and implemented. Following the same way of thinking, we believe that interactive data mining is sensitive to the capacities and needs of both humans and machines. A critical issue is not how intelligent a user is, or how efficient an algorithm is, but how well these two parts can be connected and communicated, adapted, stimulated and improved.
MAIN THRUST

The design of interactive data mining systems is highlighted by the process, forms and complexity issues of interaction.

Processes of interactive data mining

The entire knowledge discovery process includes data preparation, data selection and reduction, data pre-processing and transformation, pattern discovery, pattern explanation and evaluation, and pattern presentation (Brachmann & Anand, 1996; Fayyad, et al., 1996; Mannila, 1997; Yao, Zhao & Maguire, 2003; Yao, Zhong & Zhao, 2004). In an interactive system, these phases can be carried out as follows:

o Interactive data preparation observes raw data with a specific format. Data distribution and relationships between attributes can be easily observed.
o Interactive data selection and reduction involves the reduction of the number of attributes and/or the number of records. A user can specify the attributes of interest and/or data area, and remove data that is outside of the area of concern.
o Interactive data pre-processing and transformation determines the number of intervals, as well as cut-points for continuous datasets, and transforms the dataset into a workable dataset.
o Interactive pattern discovery interactively discovers patterns under the user’s guidance, selection, monitoring and supervision. Interactive controls include decisions made on search strategies, directions, heuristics, and the handling of abnormal situations.
o Interactive pattern explanation and evaluation explains and evaluates the discovered pattern if the user requires it. The effectiveness and usefulness of this are subject to the user’s judgement.
o Interactive pattern presentation visualizes the patterns that are perceived during the pattern discovery phase, and/or the pattern explanation and evaluation phase.
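As one illustration of the pre-processing phase listed above, the following minimal Python sketch shows how user-chosen cut-points might turn a continuous attribute into intervals, and how a user could refine the choice on a second pass. The function name `discretize`, the sample values and the cut-points are assumptions made for this example; they are not taken from any system described in this article.

```python
from bisect import bisect_right


def discretize(values, cut_points):
    """Map each continuous value to the index of the interval it falls into.

    `cut_points` are chosen by the user; k cut-points give k + 1 intervals.
    A value equal to a cut-point is placed in the interval to its right.
    """
    cuts = sorted(cut_points)
    return [bisect_right(cuts, v) for v in values]


# Hypothetical values of a continuous attribute.
ages = [23, 35, 47, 51, 62]

# First pass: the user tries a single cut-point (two intervals).
first_try = discretize(ages, [40])

# Second pass: after inspecting the result, the user refines the
# choice to two cut-points (three intervals).
second_try = discretize(ages, [30, 50])

print(first_try)
print(second_try)
```

The point of the sketch is only that the number of intervals and the cut-points are inputs supplied, and revised, by the user rather than fixed by the system.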
Practice has shown that the process is virtually a loop, which is iterated until satisfying results are obtained. Most of the existing interactive data mining systems add visual functionalities into some phases, which enable users to monitor the mining process at various stages, such as raw data visualization and/or final results visualization (Brachmann & Anand, 1996; Elm, et al., 2004). Graphical visualization makes it easy to identify and distinguish the trend and distribution. This is a necessary feature for human-machine interaction, but is not sufficient on its own. To implement a good interactive data mining system, we need to study the types of interactions users expect, and the roles and responsibilities a computer system should take.

Forms of interaction

Users expect different kinds of human-computer interactions: proposition, information/guidance acquisition, and manipulation. These interactions run through the entire data mining process described above to arrive at desirable mining results.

Users should be allowed to make propositions, and to describe decisions and selections based on their preference and judgement. For example, a user can state a class value of interest for classification tasks, express a target knowledge representation, indicate a question, infer features for explanation, describe a preference order of attributes, set up the constraints, and so on. Subjects of propositions differ among the varying views of individuals. One may initiate different propositions at different times based on different considerations at different cognitive levels. The potential value consideration enters into the choice of proposition.

Information acquisition is a basic form of interaction associated with information analysis. Information might be presented in various fashions and structures. Raw data is raw information. Mined rules are extracted knowledge. Numerous measurements show the information of an object from different aspects.
Each data mining phase contains and generates much information. An object might be changed; the information it holds might be erased, updated or manipulated by the user in question. Benchmarks, official standards and de facto standards are valuable reference knowledge, which can make it easier to learn and evaluate new applications.

In general, information acquisition can be conducted by granular computing and hierarchy theory. A granule in a higher level can be decomposed into many granules in a lower level, and conversely, some granules in a lower level can be combined into a granule in a higher level. A granule in a lower level provides a more detailed description than that of a parent granule in the higher level, and a granule in a higher level has a more abstract description than a child granule in the lower level. Users need to retrieve the information in an interactive manner, namely, “show it correctly when I want to or need to see it, and in an understandable format.”

Guidance acquisition is another form of interaction. A consultant role that an interactive system can play is to provide knowledge or skills that the user does not have in-house, for example, doing an evaluation or providing an analysis of the implications of environmental trends. To achieve this expert role, the interactive system must be able to “understand” the human proposition, and be able to make corresponding inferences. Guidance is especially useful when the domain is complex and the search space is huge. To achieve guidance, the system needs to store an extra rule base (usually serving as a standard or a reference), and be context aware. The inference function helps users to pay attention to items that are easily ignored, considered as “boundary” issues, or are important but not part of the current focus. The inference function takes the role and responsibility of a consultant.
It ensures the process develops in a more balanced manner.

Manipulation is the form of interaction that includes selecting, retrieving, combining and changing objects, and using operated objects to obtain new objects. Different data mining phases require different kinds of manipulations. Interactive manipulations obligate the computer system to provide necessary cognitive supports, such as: a systematic approach that uses an exhaustive search or a well-established, recursive search for solving a problem in a finite number of steps; a heuristic approach that selectively searches a portion of a solution space, a sub-problem of the whole problem, or a plausible solution according to the user’s special needs; and an analogy approach that uses known solutions to solve an existing problem (Chiew & Wang, 2004; Matlin, 1998; Mayer, 1992; Ormrod, 1999).

In addition, interactive systems should allow users to build their own mental buildings using the standard blocks. The blocks can be connected by functions similar to the pipe command in UNIX systems: the standard output of the command to the left of the pipe is sent as standard input to the command to the right of the pipe. A result of this interaction is that users can define their own heuristics and algorithms. The interaction should be directed to construct a reasonable and meaningful cognitive structure for each user. To a novice, the constructive operation is the psychological paradigm in which one constructs his/her own mental model of a given domain; to an expert, the constructive operation is an experienced practice containing anticipation, estimation, understanding and management of the domain.

Figure 1 illustrates the process and the forms of interactive data mining. A particular interactive data mining system can involve interactions of all four forms at six different phases.
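The pipe-like connection of standard blocks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `pipe`, the blocks `keep_if`, `select` and `count_by`, and the sample records are all hypothetical names and data introduced here, not part of any system reported in this article.

```python
from functools import reduce


def pipe(*blocks):
    """Chain blocks left to right: the output of each block
    becomes the input of the next, as with a UNIX pipe."""
    def run(data):
        return reduce(lambda acc, block: block(acc), blocks, data)
    return run


# Hypothetical standard blocks operating on a list of records (dicts).
def keep_if(predicate):
    return lambda records: [r for r in records if predicate(r)]


def select(attributes):
    return lambda records: [{a: r[a] for a in attributes} for r in records]


def count_by(attribute):
    def block(records):
        counts = {}
        for r in records:
            counts[r[attribute]] = counts.get(r[attribute], 0) + 1
        return counts
    return block


# Hypothetical data set.
records = [
    {"age": 23, "income": "low", "buys": "no"},
    {"age": 35, "income": "high", "buys": "yes"},
    {"age": 47, "income": "high", "buys": "yes"},
    {"age": 51, "income": "low", "buys": "yes"},
]

# A user-defined heuristic assembled from the standard blocks.
my_heuristic = pipe(
    keep_if(lambda r: r["age"] > 30),
    select(["income", "buys"]),
    count_by("income"),
)
result = my_heuristic(records)
print(result)
```

Because every block takes a data set and returns a data set (or a summary of it), a user can rearrange or replace blocks without touching the others, which is the sense in which users define their own heuristics.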
[Figure 1: Interactive data mining. Labels recoverable from the figure: phases (data preparation, data selection, data preprocessing, pattern discovery, pattern explanation and evaluation, pattern representation); data states (data, selected data, preprocessed data, patterns, explained and evaluated patterns, knowledge); forms of interaction (proposition, information acquisition, guidance acquisition, manipulation).]

Complexity of interactive data mining systems

Because of the special forms of interaction, complexity issues often raise concerns during implementation. Weir (1991) identified three sources of complexity in interactive applications.

Complexity of the domain: The domain can be very complex because of the size and type of data, the high dimensionality and high degree of linkage that exist in the data. Modelling the domain to a particular search space is essential. Some search spaces may embody a larger number of possible states than others. Knowledge may not be determined by a few discrete factors but by a compound of interrelated factors.

Complexity of control: The complexity of a specific control concerns how much time and memory space a chosen computer routine/algorithm may take. It is characterized by its search direction, heuristic, constraint and threshold. Different routines/algorithms have different complexities of control. Normally, a complex domain yields a complex search space, and requires a complex control for searching solutions in the search space.

Complexity of interaction: Complexity of interaction concerns the execution issues of the four interaction forms, some of which are: deciding the degree of involvement of a specific form, scheduling process, doing, undoing, iteration and rollback of a specific control, goal setting and resetting, visualization and recommendation. The greater the user demand, the more complex the overall system becomes.
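To make the bookkeeping behind doing, undoing and rollback concrete, here is a small Python sketch of a session that records each user-chosen setting as a step in a history. The class name `MiningSession` and the setting names and values (`measure`, `threshold`, `"entropy"`, `"gini"`) are assumptions chosen for illustration only; the sketch does not describe how any particular system handles these operations.

```python
class MiningSession:
    """Keeps a history of user-chosen settings so that steps can be undone."""

    def __init__(self, initial_settings):
        # Step 0 of the history holds the initial settings.
        self._history = [dict(initial_settings)]

    @property
    def settings(self):
        """A copy of the current settings."""
        return dict(self._history[-1])

    def do(self, **changes):
        """Apply changes as a new step on top of the current settings."""
        self._history.append({**self._history[-1], **changes})

    def undo(self):
        """Drop the latest step; the initial settings are never removed."""
        if len(self._history) > 1:
            self._history.pop()

    def rollback(self, step):
        """Return to the settings as they were after `step` changes."""
        if not 0 <= step < len(self._history):
            raise ValueError("no such step")
        del self._history[step + 1:]


# Hypothetical usage: the user adjusts two settings, then reconsiders.
session = MiningSession({"measure": "entropy", "threshold": 0.5})
session.do(threshold=0.7)
session.do(measure="gini")
session.undo()           # back to entropy with threshold 0.7
print(session.settings)
session.rollback(0)      # back to the initial settings
print(session.settings)
```

Even this toy version has to store every intermediate state, which hints at why richer interaction demands add to the complexity of the overall system.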
Implementation examples

We have implemented an interactive classification system using a granule network (Zhao & Yao, 2005). A granule network systematically organizes all the subsets of the universe and formulas that define the subsets. A consistent classification task can be understood as a search for the distribution of classes in a granule network defined by the descriptive attribute set. Users can freely decide to use a partition-based method, a covering-based method, or a hybrid method for facilitating the search. Classification information can be easily retrieved in the form of a treeview, a pie chart, a bar chart and/or a pivot table representation. The measurements of attributes and attribute values are listed. These help the user to judge and select one for splitting. Measures can be chosen from the pre-defined measurement set, or can be composed by the user. Users can validate the mined classification rules at any given time, continue or cease the training process according to the evaluation, split the tree node for higher accuracy, or remove one entire tree branch for simplicity.

Another implementation for interactive attribute selection is currently under construction. In order to keep the original interdependency and distribution of the attribute, the concept of reduct in rough set theory is introduced (Pawlak, 1991). Therefore, the selected attribute set is individually necessary and jointly sufficient for retaining all the information contained in the original attribute set. In this system, users can state a preference order of attributes, satisfying a weak order. Based on this order, a reduct that is most consistent, instead of a random reduct among many, can be computed and presented. Different construction strategies, such as add, add-delete and delete approaches, can be selected. Users can set their preferred attribute order once, or change the order dynamically in order to evaluate different results.
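A minimal Python sketch of a deletion-style construction guided by a user's preference order may help fix ideas. It rests on several assumptions that go beyond the text: the preference is given as a strict list (most preferred first) rather than a general weak order, "retaining all the information" is interpreted as inducing the same partition of the records as the full attribute set, and the names `partition`, `reduct_by_preference` and the toy table are hypothetical. It is not the implementation under construction mentioned above.

```python
def partition(rows, attrs):
    """Group row indices by their values on `attrs`
    (the indiscernibility classes induced by `attrs`)."""
    blocks = {}
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in attrs)
        blocks.setdefault(key, set()).add(i)
    return {frozenset(b) for b in blocks.values()}


def reduct_by_preference(rows, preference):
    """Deletion strategy: try to drop attributes, least preferred first.

    `preference` lists all attributes, most preferred first. An attribute
    is dropped only if the remaining attributes still induce the same
    partition as the full attribute set.
    """
    target = partition(rows, preference)
    kept = list(preference)
    for a in reversed(preference):
        trial = [b for b in kept if b != a]
        if partition(rows, trial) == target:
            kept = trial
    return kept


# Hypothetical table in which any two attributes determine the third,
# so {a, b}, {a, c} and {b, c} are all reducts.
rows = [
    {"a": 0, "b": 0, "c": 0},
    {"a": 0, "b": 1, "c": 1},
    {"a": 1, "b": 0, "c": 1},
    {"a": 1, "b": 1, "c": 0},
]

print(reduct_by_preference(rows, ["a", "b", "c"]))
print(reduct_by_preference(rows, ["c", "a", "b"]))
```

With the same data, changing only the preference order changes which of the three reducts is returned, which is the behaviour a user would exploit when reordering attributes to compare results.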
In this case, users are allowed to choose a target reduct that is able to preserve accuracy, cost and utility, or distribution property. When a certain reduct is too complicated or too expensive to obtain, an approximate reduct can be constructed.

An interactive explanation-oriented system is our third implementation. The subjects selected for explanation, the explanation context, the explanation construction methods, as well as the explanation evaluation methods are all highly dependent upon the preference of an individual user. Please refer to another paper (Yao, Zhao & Maguire, 2003) for further details on this topic.

FUTURE TRENDS

Interactive analysis and mining combines the power of both human users and computer systems. It relies on powerful intuition, analytical skills, insight, and creativity of humans, and fast processing speed, huge storage, and massive computational power of computers. Prototype systems will be implemented to demonstrate the usefulness of the proposed theoretical framework. The seamless integration of humans and computer systems may require the development of multilevel interactive systems, i.e., interaction applied from a low level to a high level, or from fully manual to fully automatic. From the application point of view, interactive data analysis and mining plays a supporting role for a user. This enables us to design and implement next generation systems that support effective usage of data, for example, decision support systems, business support systems, research support systems and teaching support systems. Considerable research remains to be done.

CONCLUSION

The huge volume of raw data is far beyond a user's processing capacity. One goal of data analysis and mining is to discover, summarize and present information and knowledge from data in concise and human-understandable forms.
It should be realized that, at least in the near future, insight about data, as well as its semantics, may not be achieved by a computer system alone. Users, in fact, need to interact with and utilize computer systems as research tools to browse, explore and understand data, and to search for knowledge and insight from data. Implementing interactive computer systems is an emerging trend in the field of data mining. It aims to have human involvement in the entire data mining process in order to achieve an effective result. This interaction requires adaptive, autonomous systems and adaptive, active users. The performance of these interactions depends upon the complexities of the domain, control, and the available interactive approaches.

REFERENCES

Ankerst, M. (2001) Human involvement and interactivity of the next generations' data mining tools, ACM SIGMOD Workshop on Research Issues in Data mining and Knowledge Discovery, Santa Barbara, CA.
Berners-Lee, T. (1999) Weaving the Web - The Original Design and Ultimate Destiny of the World Wide Web by its Inventor, Harper Collins Inc.
Brachmann, R. & Anand, T. (1996) The process of knowledge discovery in databases: a human-centered approach, Advances in Knowledge Discovery and Data mining, AAAI Press & MIT Press, Menlo Park, CA, 37-57.
Chiew, V. & Wang, Y. (2004) Formal description of the cognitive process of problem solving, Proceedings of International Conference of Cognitive Informatics, 74-83.
Elm, W.C., Cook, M.J., Greitzer, F.L., Hoffman, R.R., Moon, B. & Hutchins, S.G. (2004) Designing support for intelligence analysis, Proceedings of the Human Factors and Ergonomics Society, 20-24.
Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. & Uthurusamy, R. (Eds.) (1996) Advances in Knowledge Discovery and Data mining, AAAI/MIT Press.
Hancock, P.A. & Scallen, S.F. (1996) The future of function allocation, Ergonomics in Design, 4(4), 24-29.
Maanen, P., Lindenberg, J. & Neerincx, M.A. (2005) Integrating human factors and artificial intelligence in the development of human-machine cooperation, Proceedings of International Conference on Artificial Intelligence, 10-16.
Mannila, H. (1997) Methods and problems in data mining, Proceedings of International Conference on Database Theory, 41-55.
Matlin, M.V. (1998) Cognition, fourth edition, Harcourt Brace Company.
Mayer, R.E. (1992) Thinking, Problem Solving, Cognition, second edition, W.H. Freeman Company.
Ormrod, J.E. (1999) Human Learning, third edition, Prentice-Hall, Inc., Simon and Schuster/A Viacom Company.
Pawlak, Z. (1991) Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Dordrecht.
Shneiderman, B. (1998) Designing the User Interface: Strategies for Effective Human-Computer Interaction, third edition, Addison-Wesley.
Wang, Y.X. & Liu, D. (2003) On information and knowledge representation in the brain, Proceedings of International Conference of Cognitive Informatics, 26-29.
Weir, G.R. (1991) Living with complex interactive systems, in: Weir, G.R. & Alty, J.L. (Eds.) Human-Computer Interaction and Complex Systems, Academic Press Ltd.
Yao, Y.Y., Zhao, Y. & Maguire, R.B. (2003) Explanation-oriented association mining using rough set theory, Proceedings of Rough Sets, Fuzzy Sets and Granular Computing, 165-172.
Yao, Y.Y., Zhong, N. & Zhao, Y. (2004) A three-layered conceptual framework of data mining, Proceedings of ICDM Workshop of Foundation of Data mining, 215-221.
Zhao, Y. & Yao, Y.Y. (2005) Interactive user-driven classification using a granule network, Proceedings of International Conference of Cognitive Informatics, 250-259.
Zhao, Y. & Yao, Y.Y. (2005) On interactive data mining, Proceedings of Indian International Conference on Artificial Intelligence, 2444-2454.
TERMS AND DEFINITIONS

Interactive data mining: an integration of human factors and artificial intelligence. An interactive system thus is an integration of a human user with a computer machine. The study of interactive data mining and interactive systems is directly related to cognitive science.

Process of interactive data mining: interactive data preparation, interactive data selection and reduction, interactive data pre-processing and transformation, interactive pattern discovery, interactive pattern explanation and evaluation, and interactive pattern presentation.

Forms of interactive data mining: proposition, information and guidance acquisition, and manipulation.

Complexity of interactive data mining: complexity of the domain, complexity of control and complexity of interaction. The greater the user demand, the more complex the overall system becomes.