Wang, J. (Ed.) Encyclopedia of Data Warehousing and Mining, 2nd edition, 1085-1090, Idea Group Inc., 2008.
On Interactive Data Mining
INTRODUCTION
Exploring and extracting knowledge from data is one of the fundamental problems in science.
Data mining consists of important tasks, such as description, prediction and explanation of data,
and applies computer technologies to nontrivial calculations. Computer systems can maintain
precise operations under a heavy information load, and also can maintain steady performance.
Without the aid of computer systems, it is very difficult for people to be aware of, to extract, to
search and to retrieve knowledge in large and separate datasets, let alone interpreting and
evaluating data and information that are constantly changing, and then making recommendations
or predictions based on inconsistent and/or incomplete data.
On the other hand, the implementations and applications of computer systems reflect the requests
of human users, and are affected by human judgement, preference and evaluation. Computer
systems rely on human users to set goals, to select alternatives if an original approach fails, to
participate in unanticipated emergencies and novel situations, and to develop innovations in order
to preserve safety, avoid expensive failure, or increase product quality (Elm, et al., 2004;
Hancock & Scallen, 1996; Shneiderman, 1998).
Users possess varied skills, intelligence, cognitive styles, and levels of tolerance of frustration.
They come to a problem with diverse preferences, requirements and background knowledge.
Given a set of data, users will see it from different angles, in different aspects, and with different
views. Considering these differences, a universally applicable theory or method to serve the
needs of all users does not exist. This motivates and justifies the co-existence of numerous
theories and methods of data mining systems, as well as the exploration of new theories and
methods.
According to the above observations, we believe that interactive systems are required for data
mining tasks. Generally, interactive data mining is an integration of human factors and artificial
intelligence (Maanen, Lindenberg and Neerincx, 2005); an interactive system is an integration of
a human user and a computer machine, communicating and exchanging information and
knowledge. Through interaction and communication, computers and users can share the tasks
involved in order to achieve a good balance of automation and human control. Computers are
used to retrieve and keep track of large volumes of data, and to carry out complex mathematical
or logical operations. Users can then avoid routine, tedious and error-prone tasks, concentrate on
critical decision making and planning, and cope with unexpected situations (Elm, et al., 2004;
Shneiderman, 1998). Moreover, interactive data mining can encourage users’ learning, improve
insight and understanding of the problem to be solved, and stimulate users to explore creative
possibilities. Users’ feedback can be used to improve the system. The interaction is mutually
beneficial, and imposes new coordination demands on both sides.
BACKGROUND
The importance of human-machine interaction has been well recognized and studied in many
disciplines. One example of interactive systems is an information retrieval system or a search
engine. A search engine connects users to Web resources. It navigates searches, stores and
indexes resources, responds to users’ particular queries, and ranks and provides the most
relevant results to each query. Most of the time, a user initiates the interaction with a query.
Frequently, feedback will arouse the user’s particular interest, causing the user to refine the
query, and then change or adjust further interaction. Without this mutual connection, it would be
hard, if not impossible, for the user to access these resources, no matter how important and how
relevant they are. The search engine, as an interactive system, uses the combined power of the
user and the resources, to ultimately generate a new kind of power.
Though human-machine interaction has been emphasized for a variety of disciplines, until
recently it has not received enough attention in the domain of data mining (Ankerst, 2001;
Brachmann & Anand, 1996; Zhao & Yao, 2005). In particular, the human role in the data mining
processes has not received its due attention. Here, we identify two general problems in many of
the existing data mining systems:
1. Overemphasizing the automation and efficiency of the system, while neglecting the
adaptiveness and effectiveness of the system. Effectiveness includes human subjective
understanding, interpretation and evaluation.
2. A lack of explanations and interpretations of the discovered knowledge. Human-machine
interaction is always essential for constructing explanations and interpretations.
To study and implement an interactive data mining system, we need to pay more attention to the
connection between human users and computers. For cognitive science, Wang and Liu (2003)
suggest a relational metaphor, which assumes that relations and connections of neurons represent
information and knowledge in the human brain, rather than the neurons alone. Berners-Lee
(1999) explicitly states that “in an extreme view, the world can be seen as only connections,
nothing else.” Based on this statement, the World Wide Web was designed and implemented.
Following the same way of thinking, we believe that interactive data mining is sensitive to the
capacities and needs of both humans and machines. A critical issue is not how intelligent a user
is, or how efficient an algorithm is, but how well these two parts can be connected so that they
communicate with, adapt to, stimulate and improve each other.
MAIN THRUST
The design of interactive data mining systems is highlighted by the process, forms and
complexity issues of interaction.
Processes of interactive data mining
The entire knowledge discovery process includes data preparation, data selection and reduction,
data pre-processing and transformation, pattern discovery, pattern explanation and evaluation,
and pattern presentation (Brachmann & Anand, 1996; Fayyad, et al., 1996; Mannila, 1997; Yao,
Zhao & Maguire, 2003; Yao, Zhong & Zhao, 2004). In an interactive system, these phases can be
carried out as follows:
o Interactive data preparation observes raw data with a specific format. Data distribution and
relationships between attributes can be easily observed.
o Interactive data selection and reduction involves the reduction of the number of attributes
and/or the number of records. A user can specify the attributes of interest and/or data area,
and remove data that is outside of the area of concern.
o Interactive data pre-processing and transformation determines the number of intervals, as well
as cut-points for continuous datasets, and transforms the dataset into a workable dataset.
o Interactive pattern discovery interactively discovers patterns under the user’s guidance,
selection, monitoring and supervision. Interactive controls include decisions made on search
strategies, directions, heuristics, and the handling of abnormal situations.
o Interactive pattern explanation and evaluation explains and evaluates the discovered pattern if
the user requires it. The effectiveness and usefulness of this are subject to the user’s
judgement.
o Interactive pattern presentation visualizes the patterns that are perceived during the pattern
discovery phase, and/or the pattern explanation and evaluation phase.
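The interactive pre-processing phase described above can be sketched as follows. This is only an illustration, assuming the user supplies the cut-points directly; the function name `discretize` and the sample attribute values are hypothetical and do not come from any system described in this article.

```python
from bisect import bisect_right

def discretize(values, cut_points):
    """Map each continuous value to the index of the interval it falls in.

    cut_points must be sorted; k cut-points define k + 1 intervals.
    A value equal to a cut-point is assigned to the interval on its right.
    """
    return [bisect_right(cut_points, v) for v in values]

# Hypothetical continuous attribute (e.g. age in years).
ages = [18, 25, 40, 63, 71]

# The user first tries two cut-points, inspects the result, then refines.
first_try = discretize(ages, [30, 60])       # [0, 0, 1, 2, 2]
second_try = discretize(ages, [20, 40, 65])  # [0, 1, 2, 2, 3]
print(first_try)
print(second_try)
```

The point of the sketch is the loop around it rather than the function itself: the user changes the number of intervals or the cut-points, looks at the transformed data, and repeats until the dataset is workable.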
Practice has shown that the process is virtually a loop, which is iterated until satisfying results are
obtained. Most of the existing interactive data mining systems add visual functionalities into
some phases, which enable users to monitor the mining process at various stages, such as raw
data visualization and/or final results visualization (Brachmann & Anand, 1996; Elm, et al.,
2004). Graphical visualization makes it easy to identify and distinguish the trend and distribution.
This is a necessary feature for human-machine interaction, but is not sufficient on its own. To
implement a good interactive data mining system, we need to study the types of interactions users
expect, and the roles and responsibilities a computer system should take.
Forms of interaction
Users expect different kinds of human-computer interactions: proposition, information
acquisition, guidance acquisition, and manipulation. These interactions run through the entire
data mining process described above to arrive at desirable mining results.
Users should be allowed to make propositions, describe decisions and selections based on their
preference and judgement. For example, a user can state a class value of interest for
classification tasks, express a target knowledge representation, indicate a question, infer features
for explanation, describe a preference order of attributes, set up the constraints, and so on.
Subjects of propositions differ among the varying views of individuals. One may initiate different
propositions at different times based on different considerations at different cognitive levels. The
consideration of potential value enters into the choice of proposition.
Information acquisition is a basic form of interaction associated with information analysis.
Information might be presented in various fashions and structures. Raw data is raw information.
Mined rules are extracted knowledge. Numerous measurements show the information of an
object from different aspects. Each data mining phase contains and generates much information.
An object might be changed; the information it holds might be erased, updated or manipulated by
the user in question. Benchmarks, official standards and de facto standards are valuable reference
knowledge, which can make it easier to learn and evaluate new applications. In general,
information acquisition can be conducted by granular computing and hierarchy theory. A granule
in a higher level can be decomposed into many granules in a lower level, and conversely, some
granules in a lower level can be combined into a granule in a higher level. A granule in a lower
level provides a more detailed description than that of a parent granule in the higher level, and a
granule in a higher level has a more abstract description than a child granule in the lower level.
Users need to retrieve the information in an interactive manner, namely, “show it correctly when
I want to or need to see it, and in an understandable format.”
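The relation between higher-level and lower-level granules can be illustrated with a small sketch. It assumes, purely for illustration, that granules are formed by grouping records that share values on a list of attributes, so that adding an attribute moves to a lower, more detailed level. The names `granulate` and `decompose` and the sample records are hypothetical.

```python
from collections import defaultdict

def granulate(records, attributes):
    """Group record indices into granules that share values on `attributes`."""
    granules = defaultdict(list)
    for i, rec in enumerate(records):
        key = tuple(rec[a] for a in attributes)
        granules[key].append(i)
    return dict(granules)

def decompose(coarse_key, fine_granules):
    """Return the lower-level granules contained in one higher-level granule."""
    n = len(coarse_key)
    return {k: v for k, v in fine_granules.items() if k[:n] == coarse_key}

# Hypothetical records described at two levels of detail.
records = [
    {"region": "west", "city": "Regina"},
    {"region": "west", "city": "Calgary"},
    {"region": "east", "city": "Toronto"},
    {"region": "west", "city": "Regina"},
]

coarse = granulate(records, ["region"])        # higher level, more abstract
fine = granulate(records, ["region", "city"])  # lower level, more detailed

print(coarse)                      # {('west',): [0, 1, 3], ('east',): [2]}
print(decompose(("west",), fine))  # {('west', 'Regina'): [0, 3], ('west', 'Calgary'): [1]}
```

Drilling down corresponds to calling `decompose` on a granule the user selects; rolling up corresponds to reading the coarser grouping again.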
Guidance acquisition is another form of interaction. A consultant role that an interactive system
can play is to provide knowledge or skills that the user does not have in-house, for example,
doing an evaluation or providing an analysis of the implications of environmental trends. To
achieve this expert role, the interactive system must be able to “understand” the human
proposition, and be able to make corresponding inferences. Guidance is especially useful when
the domain is complex and the search space is huge. To achieve guidance, the system needs to
store an extra rule base (usually serving as a standard or a reference), and be context aware. The
inference function helps users to pay attention to items that are easily ignored, considered as
“boundary” issues, or are important but not part of the current focus. The inference function takes
the role and responsibility of a consultant. It ensures the process develops in a more balanced
manner.
Manipulation is the form of interaction that includes selecting, retrieving, combining and
changing objects, using operated objects to obtain new objects. Different data mining phases
require different kinds of manipulations. Interactive manipulations obligate the computer system
to provide necessary cognitive supports, such as: a systematic approach that uses an exhaustive
search or a well-established, recursive search for solving a problem in a finite number of steps; a
heuristic approach that selectively searches a portion of a solution space, a sub-problem of the
whole problem, or a plausible solution according to the user’s special needs; and an analogy
approach that uses known solutions to solve an existing problem (Chiew & Wang, 2004; Matlin,
1998; Mayer, 1992; Ormrod, 1999). In addition, interactive systems should allow users to build
their own mental constructions from standard blocks. The blocks can be connected by functions
similar to the pipe command in UNIX systems. What this means is that the standard output of
the command to the left of the pipe is sent as standard input of the command to the right of the
pipe. A result of this interaction is that users can define their own heuristics and algorithms.
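The pipe-like connection of standard blocks can be sketched in a few lines. This is a minimal illustration under the assumption that each block is a function taking one dataset and returning one result; `pipe`, `drop_missing`, `select`, `count_by` and the sample rows are hypothetical names, not part of any system described here.

```python
from functools import reduce

def pipe(*blocks):
    """Chain blocks left to right: the output of one is the input of the next."""
    return lambda data: reduce(lambda acc, block: block(acc), blocks, data)

def drop_missing(rows):
    """Standard block: discard rows that contain a missing (None) value."""
    return [r for r in rows if None not in r.values()]

def select(*attrs):
    """Standard block factory: keep only the named attributes."""
    return lambda rows: [{a: r[a] for a in attrs} for r in rows]

def count_by(attr):
    """Standard block factory: count rows per value of one attribute."""
    def block(rows):
        counts = {}
        for r in rows:
            counts[r[attr]] = counts.get(r[attr], 0) + 1
        return counts
    return block

# A user-defined procedure assembled from the standard blocks.
my_heuristic = pipe(drop_missing, select("outlook", "play"), count_by("play"))

rows = [
    {"outlook": "sunny", "temp": 30, "play": "no"},
    {"outlook": "rainy", "temp": None, "play": "yes"},
    {"outlook": "overcast", "temp": 21, "play": "yes"},
]
print(my_heuristic(rows))  # {'no': 1, 'yes': 1}
```

Because every block has the same shape, a user can reorder, replace or add blocks without touching the others, which is what makes user-defined heuristics practical.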
The interaction should be directed to construct a reasonable and meaningful cognitive structure for
each user. To a novice, the constructive operation is the psychological paradigm in which one
constructs his/her own mental model of a given domain; to an expert, the constructive operation
is an experienced practice containing anticipation, estimation, understanding and management of
the domain.
Figure 1 illustrates the process and the forms of interactive data mining. A particular interactive
data mining system can involve interactions of all four forms at six different phases.
Figure 1: Interactive data mining. [Diagram not reproduced. Labels recoverable from it: the phases data preparation, data selection, data preprocessing, pattern discovery, pattern explanation and evaluation, and pattern representation; the products data, selected data, preprocessed data, patterns, explained and evaluated patterns, and knowledge; and the interaction forms proposition, information acquisition, guidance acquisition, and manipulation.]
Complexity of interactive data mining systems
Because of the special forms of interaction, complexity issues often raise concerns during
implementation. Weir (1991) identified three sources of complexity in interactive applications.
Complexity of the domain: The domain can be very complex because of the size and type of data,
the high dimensionality and high degree of linkage that exist in the data. Modelling the domain to
a particular search space is essential. Some search spaces may embody a larger number of
possible states than others. Knowledge may not be determined by a few discrete factors but by a
compound of interrelated factors.
Complexity of control: The complexity of a specific control concerns how much time and memory
space a chosen computer routine/algorithm may take. It is characterized by its search direction,
heuristic, constraint and threshold. Different routines/algorithms have different complexities of
control. Normally, a complex domain yields a complex search space, and requires a complex
control for searching solutions in the search space.
Complexity of interaction: Complexity of interaction concerns the execution issues of the four
interaction forms, some of which are: deciding the degree of involvement of a specific form,
scheduling process, doing, undoing, iteration and rollback of a specific control, goal setting and
resetting, visualization and recommendation. The greater the user demand, the more complex the
overall system becomes.
Implementation examples
We have implemented an interactive classification system using a granule network (Zhao & Yao,
2005). A granule network systematically organizes all the subsets of the universe and formulas
that define the subsets. A consistent classification task can be understood as a search for the
distribution of classes in a granule network defined by the descriptive attribute set. Users can
freely decide to use a partition-based method, a covering-based method, or a hybrid method for
facilitating the search. Classification information can be easily retrieved in the form of a tree
view, a pie chart, a bar chart and/or a pivot table representation. The measurements of attribute
and attribute-values are listed. These help the user to judge and select one for splitting. Measures
can be chosen from the pre-defined measurement set, or can be composed by the user. Users can
validate the mined classification rules at any given time, continue or cease the training process
according to the evaluation, split the tree node for higher accuracy, or remove one entire tree
branch for simplicity.
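The idea of searching for the distribution of classes over granules can be illustrated with a simplified, partition-based sketch. It is not the granule network implementation of Zhao and Yao (2005); it only assumes that a granule is a set of row indices, that the user names the attribute to split on, and that a granule is finished once all its rows share one class. The function names and the small table are hypothetical.

```python
from collections import Counter

def class_distribution(table, granule, class_attr):
    """Count how many rows of the granule fall into each class."""
    return Counter(table[i][class_attr] for i in granule)

def split(table, granule, attr):
    """Partition a granule by the values of the attribute the user selects."""
    parts = {}
    for i in granule:
        parts.setdefault(table[i][attr], []).append(i)
    return parts

# Hypothetical training table; "play" is the class attribute.
table = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": True, "play": "no"},
]

root = list(range(len(table)))
# The user chooses "outlook" for the first split and inspects each part.
for value, granule in split(table, root, "outlook").items():
    dist = class_distribution(table, granule, "play")
    status = "consistent" if len(dist) == 1 else "needs further splitting"
    print(value, dict(dist), status)
```

In this toy table the "sunny" granule is already consistent, while the "rainy" granule mixes both classes, so the user would either split it again (for example on "windy") or accept the lower accuracy and stop.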
Another implementation for interactive attribute selection is currently under construction. In
order to keep the original interdependency and distribution of the attribute, the concept of reduct
in rough set theory is introduced (Pawlak, 1991). Therefore, the selected attribute set is
individually necessary and jointly sufficient for retaining all the information contained in the
original attribute set. In this system, users can state a preference order of attributes, satisfying a
weak order. Based on this order, a reduct that is most consistent, instead of a random reduct
among many, can be computed and presented. Different construction strategies, such as add,
add-delete and delete approaches, can be selected. Users can set their preferred attribute order once,
or change the order dynamically in order to evaluate different results. In this case, users are
allowed to choose a target reduct that is able to preserve accuracy, cost and utility, or distribution
property. When a certain reduct is too complicated or too expensive to obtain, an approximate
reduct can be constructed.
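A delete-style construction guided by a user preference order can be sketched as follows. The sketch makes several simplifying assumptions that are not taken from the system described above: the preference is given as a strict list from most to least preferred (a weak order would also allow ties), the decision table is consistent, and an attribute set counts as sufficient when rows that agree on it always agree on the decision attribute. All names and data are hypothetical.

```python
def preserves_decision(table, attrs, decision):
    """True if rows that agree on `attrs` always agree on `decision`."""
    seen = {}
    for row in table:
        key = tuple(row[a] for a in attrs)
        if seen.setdefault(key, row[decision]) != row[decision]:
            return False
    return True

def reduct_by_deletion(table, preference, decision):
    """Drop attributes from least to most preferred while sufficiency holds.

    `preference` lists the condition attributes from most to least preferred.
    """
    kept = list(preference)
    for attr in reversed(preference):
        trial = [a for a in kept if a != attr]
        if preserves_decision(table, trial, decision):
            kept = trial  # attr is not needed, so delete it
    return kept

# Hypothetical decision table: "b" and "c" each determine "d" on their own.
table = [
    {"a": 0, "b": 0, "c": 0, "d": "x"},
    {"a": 0, "b": 1, "c": 1, "d": "y"},
    {"a": 1, "b": 0, "c": 0, "d": "x"},
    {"a": 1, "b": 1, "c": 1, "d": "y"},
]

print(reduct_by_deletion(table, ["b", "a", "c"], "d"))  # ['b']
print(reduct_by_deletion(table, ["c", "a", "b"], "d"))  # ['c']
```

The two calls show why the order matters: both `['b']` and `['c']` are valid results for this table, and the preference order decides which one is returned. Because sufficiency can only be lost, never regained, by removing attributes, a single pass leaves a set in which every remaining attribute is necessary.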
An interactive explanation-oriented system is our third implementation. The subjects selected for
explanation, the explanation context, the explanation construction methods, as well as the
explanation evaluation methods are all highly dependent upon the preference of an individual user.
Please refer to another paper (Yao, Zhao & Maguire, 2003) for further details on this topic.
FUTURE TRENDS
Interactive analysis and mining combines the power of both human users and computer systems.
It relies on powerful intuition, analytical skills, insight, and creativity of humans, and fast
processing speed, huge storage, and massive computational power of computers. Prototype
systems will be implemented to demonstrate the usefulness of the proposed theoretical
framework. The seamless integration of humans and computer systems may require the
development of multilevel interactive systems, i.e., interaction applied from a low level to a high
level, or from fully manual to fully automatic.
From the application point of view, interactive data analysis and mining plays a supporting role
for a user. This enables us to design and implement next generation systems that support effective
usage of data, for example, decision support systems, business support systems, research support
systems and teaching support systems. Considerable research remains to be done.
CONCLUSION
The huge volume of raw data is far beyond a user's processing capacity. One goal of data analysis
and mining is to discover, summarize and present information and knowledge from data in
concise and human-understandable forms. It should be realized that, at least in the near future,
insight about data, as well as its semantics, may not be achieved by a computer system alone.
Users, in fact, need to interact with and utilize computer systems as research tools to browse,
explore and understand data, and to search for knowledge and insight from data.
Implementing interactive computer systems is an emerging trend in the field of data mining. It
aims to have human involvement in the entire data mining process in order to achieve an
effective result. This interaction requires adaptive, autonomous systems and adaptive, active
users. The performance of these interactions depends upon the complexities of the domain,
control, and the available interactive approaches.
REFERENCES
Ankerst, M. (2001) Human involvement and interactivity of the next generations' data mining
tools, ACM SIGMOD Workshop on Research Issues in Data mining and Knowledge
Discovery, Santa Barbara, CA.
Berners-Lee, T. (1999) Weaving the Web - The Original Design and Ultimate Destiny of the
World Wide Web by its Inventor, Harper Collins Inc.
Brachmann, R. & Anand, T. (1996) The process of knowledge discovery in databases: a
human-centered approach, Advances in Knowledge Discovery and Data mining, AAAI Press &
MIT Press, Menlo Park, CA, 37-57.
Chiew, V. & Wang, Y. (2004) Formal description of the cognitive process of problem solving,
Proceedings of International Conference of Cognitive Informatics, 74-83.
Elm, W.C., Cook, M.J., Greitzer, F.L., Hoffman, R.R., Moon, B. & Hutchins, S.G. (2004)
Designing support for intelligence analysis, Proceedings of the Human Factors and
Ergonomics Society, 20-24.
Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. & Uthurusamy, R. (Eds.) (1996) Advances in
Knowledge Discovery and Data mining, AAAI/MIT Press.
Hancock, P.A. and Scallen, S.F. (1996) The future of function allocation, Ergonomics in Design,
4(4), 24-29.
Maanen, P., Lindenberg, J. and Neerincx, M.A. (2005) Integrating human factors and artificial
intelligence in the development of human-machine cooperation, Proceedings of
International Conference on Artificial Intelligence, 10-16.
Mannila, H. (1997) Methods and problems in data mining, Proceedings of International
Conference on Database Theory, 41-55.
Matlin, M.V. (1998) Cognition, fourth edition, Harcourt Brace Company.
Mayer, R.E. (1992) Thinking, Problem Solving, Cognition, second edition, W.H. Freeman
Company.
Ormrod, J.E. (1999) Human Learning, third edition, Prentice-Hall, Inc., Simon and Schuster/A
Viacom Company.
Pawlak, Z. (1991) Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic
Publishers, Dordrecht.
Shneiderman, B. (1998) Designing the User Interface: Strategies for Effective Human-Computer
Interaction, third edition, Addison-Wesley.
Wang, Y.X. & Liu, D. (2003) On information and knowledge representation in the brain,
Proceedings of International Conference of Cognitive Informatics, 26-29.
Weir, G.R. (1991) Living with complex interactive systems, in: Weir, G.R. and Alty, J.L. (Eds.)
Human-Computer Interaction and Complex Systems, Academic Press Ltd.
Yao, Y.Y., Zhao, Y. & Maguire, R.B. (2003) Explanation-oriented association mining using
rough set theory, Proceedings of Rough Sets, Fuzzy Sets and Granular Computing, 165-172.
Yao, Y.Y., Zhong, N. & Zhao, Y. (2004) A three-layered conceptual framework of data mining,
Proceedings of ICDM Workshop of Foundation of Data mining, 215-221.
Zhao, Y. & Yao, Y.Y. (2005) Interactive user-driven classification using a granule network,
Proceedings of International Conference of Cognitive Informatics, 250-259.
Zhao, Y. & Yao, Y.Y. (2005) On interactive data mining, Proceedings of Indian International
Conference on Artificial Intelligence, 2444-2454.
TERMS AND DEFINITIONS
Interactive data mining: an integration of human factors and artificial intelligence. An
interactive system thus is an integration of a human user with a computer machine. The study of
interactive data mining and interactive systems is directly related to cognitive science.
Process of interactive data mining: interactive data preparation, interactive data selection and
reduction, interactive data pre-processing and transformation, interactive pattern discovery,
interactive pattern explanation and evaluation, and interactive pattern presentation.
Forms of interactive data mining: proposition, information and guidance acquisition, and
manipulation.
Complexity of interactive data mining: complexity of the domain, complexity of control and
complexity of interaction. The greater the user demand, the more complex the overall system
becomes.